AI-in-the-Loop Visual Testing: Your App, Tested by Sight

Ask OpenFactory to test a workflow and the AI drives your app by screenshot, reasons about what it sees, asserts with image recognition, logs in with one-time codes, and records a replay you can scrub step by step.

June 15, 2026

Describe a workflow in a sentence and OpenFactory tests it for you. The AI opens your app on a real machine, drives it by sight, checks each result with image recognition, signs in with a one-time code when it has to, and hands you a video of the whole run with a marker on every step.

Traditional UI tests are written against the skeleton of your app — this CSS selector, that DOM id, this element index. They are precise and they are brittle: rename a class, reorder a layout, swap a component library, and a green suite turns red even though nothing a user cares about broke. Teams respond by spending more time maintaining tests than writing features, and eventually they stop trusting the suite at all.

OpenFactory takes the other path. Instead of binding to markup, it tests your app the way a person would: it looks at the screen, decides what to do, does it, and looks again to confirm. That single shift — from reading the DOM to seeing the screen — is what makes the tests robust, and it is what lets you write a test by simply describing it.

Ask for a test, get a verdict

The headline tool is test_workflow. You give it a goal in plain language and it returns a run id right away, then works through the task in the background while you poll for the result. Under the hood it runs a tight perception → reasoning → action loop: capture the screen, reason about what is there and what to do next, take the action, and repeat until the goal is met or a step fails. Before it ever navigates, it checks that the target URL is actually reachable, so a broken deploy fails loudly instead of silently testing the wrong page.

The test loop. Every pass captures a screenshot as evidence, so the final report is a frame-by-frame account of what happened.
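The loop described above can be sketched in a few lines. This is a minimal illustration, not OpenFactory's implementation: the names `run_loop`, `StepRecord`, and `RunResult` are hypothetical, and the capture, reasoning, and action functions are passed in as plain callables so the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepRecord:
    index: int
    action: str
    screenshot: bytes  # evidence captured on this pass


@dataclass
class RunResult:
    passed: bool
    steps: List[StepRecord] = field(default_factory=list)
    reason: str = ""


def run_loop(
    is_reachable: Callable[[], bool],
    capture: Callable[[], bytes],
    decide: Callable[[bytes], Optional[str]],
    act: Callable[[str], bool],
    max_steps: int = 20,  # assumed safety budget, not a documented limit
) -> RunResult:
    # Fail loudly before navigating if the target cannot be reached.
    if not is_reachable():
        return RunResult(passed=False, reason="target URL unreachable")

    result = RunResult(passed=False)
    for index in range(max_steps):
        shot = capture()        # perception: look at the screen
        action = decide(shot)   # reasoning: None means the goal is met
        if action is None:
            result.passed = True
            return result
        result.steps.append(StepRecord(index, action, shot))
        if not act(action):     # action: stop at the first failed step
            result.reason = f"step {index} failed: {action}"
            return result
    result.reason = "step budget exhausted"
    return result


# Usage with scripted stand-ins for the real perception and reasoning.
script = iter(["open booking page", "pick first slot", "submit", None])
demo = run_loop(lambda: True, lambda: b"fake-png", lambda shot: next(script), lambda a: True)
print(demo.passed, len(demo.steps))
```

Passing the three stages in as callables keeps the sketch testable without a real screen or model behind it.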

Please test the booking flow on my staging site.

Open the booking page, pick the first available slot, fill in a test
name and email, submit, and verify the confirmation screen shows a
booking reference. Record a video and give me the verdict.
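If you drive test_workflow from your own tooling rather than from chat, the "run id now, verdict later" pattern amounts to a polling loop. The sketch below assumes a hypothetical `get_status` callable and a hypothetical response shape with a `state` field; neither is taken from OpenFactory's documentation.

```python
import time
from typing import Callable, Dict

# Assumed terminal states; the real status values may differ.
TERMINAL_STATES = {"passed", "failed"}


def wait_for_verdict(
    get_status: Callable[[str], Dict[str, str]],
    run_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Poll a run until it reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status(run_id)
        if status["state"] in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} did not finish within {timeout_s}s")
        sleep(interval_s)


# Usage with a scripted status source standing in for the real service.
responses = iter([
    {"state": "running"},
    {"state": "running"},
    {"state": "passed", "verdict": "booking reference shown"},
])
verdict = wait_for_verdict(lambda run_id: next(responses), "run-123", sleep=lambda s: None)
print(verdict["state"])
```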

Assertions that see, not just read

A test is only as good as its checks. Matching raw text gets you part of the way, but plenty of UI states have no convenient text to grep — a button that should be enabled, a chart that should have rendered, a modal that should have appeared. OpenFactory adds a visual gate: image recognition runs over each screenshot so you can assert semantic states like “the Sign In button is visible” or “the dashboard loaded.” It judges both terminal-style output and real graphical windows, and it is hardened against the awkward edge cases — blank frames, compositor quirks — that would otherwise produce false failures.

The visual gate recognizes interface elements as a person would, so assertions survive restyles and reach UI that text matching can't.
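The post does not say how blank frames are handled, so the following is only one plausible guard, with hypothetical names throughout: before judging a screenshot, check that it carries any visual information at all, and re-capture if it does not. Frames here are flat lists of grayscale values (0 to 255) to keep the example dependency-free.

```python
from typing import Callable, Optional, Sequence


def is_blank_frame(pixels: Sequence[int], tolerance: int = 2) -> bool:
    """True when the spread between the darkest and brightest pixel is at most `tolerance`."""
    if not pixels:
        return True
    return max(pixels) - min(pixels) <= tolerance


def first_usable_frame(
    capture: Callable[[], Sequence[int]],
    attempts: int = 3,
) -> Optional[Sequence[int]]:
    """Re-capture a few times instead of failing an assertion on an empty frame."""
    for _ in range(attempts):
        frame = capture()
        if not is_blank_frame(frame):
            return frame
    return None  # still blank: report a capture problem, not a UI failure


# Usage: the first capture is an all-black frame, the second has real content.
frames = iter([[0] * 16, [0] * 8 + [255] * 8])
usable = first_usable_frame(lambda: next(frames))
print(usable is not None)
```

Separating "the capture was empty" from "the UI was wrong" is what keeps a compositor hiccup from showing up as a false failure.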

Tests that get faster, not slower

Reasoning about every pixel on every run would be slow and expensive. So a saved test hardens itself. The first time it runs, it resolves each step the careful way — find the element, confirm it, remember where it was and what it looked like. On later runs it replays from that memory in milliseconds and only drops back to full reasoning for the specific steps whose UI actually changed, re-learning just those. The result is the opposite of the usual flaky-suite spiral: your tests get quicker and steadier the more you run them.

Slow once, fast forever after — except the exact steps your UI changed, which the test quietly re-learns on its own.
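A rough sketch of the replay-from-memory idea, assuming a cache keyed by step that stores where the element was and a fingerprint of what it looked like. The names `CachedTarget`, `resolve_step`, `fingerprint_at`, and `slow_resolve` are illustrative, not OpenFactory's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Position = Tuple[int, int]


@dataclass
class CachedTarget:
    position: Position
    fingerprint: str  # what the element looked like last time


def resolve_step(
    step: str,
    cache: Dict[str, CachedTarget],
    fingerprint_at: Callable[[Position], str],
    slow_resolve: Callable[[str], Position],
) -> Tuple[Position, bool]:
    """Return (position, used_cache). Fall back to full reasoning only on a mismatch."""
    cached = cache.get(step)
    if cached is not None and fingerprint_at(cached.position) == cached.fingerprint:
        return cached.position, True  # fast path: the UI is unchanged
    position = slow_resolve(step)     # careful path: find the element again
    cache[step] = CachedTarget(position, fingerprint_at(position))
    return position, False


# Usage: a toy "screen" mapping positions to what is drawn there.
screen = {(100, 200): "submit-button"}
cache: Dict[str, CachedTarget] = {}


def look_at(pos: Position) -> str:
    return screen.get(pos, "")


def locate(step: str) -> Position:
    return next(pos for pos, look in screen.items() if look == "submit-button")


first = resolve_step("click submit", cache, look_at, locate)   # slow, fills the cache
second = resolve_step("click submit", cache, look_at, locate)  # fast replay
screen.clear()
screen[(300, 400)] = "submit-button"                           # the button moved
third = resolve_step("click submit", cache, look_at, locate)   # re-learns this step
print(first, second, third)
```

Only the step whose fingerprint no longer matches pays the slow path again; every other step keeps replaying from the cache.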

A video you can scrub by step

When a test fails at 2 a.m., a red X is not enough: you need to see what happened. Every run can be recorded as video, and the recording carries a seek marker for each step. Open the report, click a step in the list, and the player jumps straight to that moment. Instead of squinting at a wall of logs to reconstruct the failure, you watch it. The report also embeds the per-step screenshots and the verdict notes, and you can make any report shareable with a link.

Each step is a marker on the timeline. Click “submit” and the replay seeks straight to the submit.
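Conceptually, a seek marker is just a step's start time expressed as an offset into the recording. A small sketch under that assumption, with hypothetical names:

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SeekMarker:
    step: str
    offset_s: float  # seconds from the start of the recording


def build_markers(
    recording_start: float,
    steps: Sequence[Tuple[str, float]],
) -> List[SeekMarker]:
    """Convert (step name, wall-clock start) pairs into offsets into the video."""
    return [
        SeekMarker(name, max(0.0, round(started - recording_start, 3)))
        for name, started in steps
    ]


def seek_offset(markers: Sequence[SeekMarker], step: str) -> float:
    """Return where the player should jump when a step is clicked."""
    for marker in markers:
        if marker.step == step:
            return marker.offset_s
    raise KeyError(step)


# Usage: the recording started at t=1000.0 on some monotonic clock.
markers = build_markers(1000.0, [("open", 1000.5), ("fill form", 1004.0), ("submit", 1012.25)])
print(seek_offset(markers, "submit"))
```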

A run report's step list, showing a passed Wait step and a failed assert-text step reading “Expected text not found,” each with its screenshot. This is step-by-step evidence from a real run report: every step keeps its own screenshot and verdict. Here the visual gate flags a missing-text assertion and fails the run, so you see exactly what the test saw.

Real apps mean real logins

Most workflows worth testing sit behind a sign-in, and modern sign-ins often mean a one-time code. A test can fetch that code from a connected mailbox at run time and type it in, so email-OTP authentication works end to end — no human babysitting the test, and no password living inside it. Any secret a test needs is supplied at run time and never stored, so a saved test holds only non-sensitive defaults like a URL or a test account's email address.
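Two pieces of this are easy to illustrate: pulling a code out of an email body, and keeping secrets out of the saved test. The sketch assumes a six-digit numeric code and uses hypothetical helper names; how OpenFactory actually reads the mailbox is not described here.

```python
import re
from typing import Dict, Optional

# Assumption: the one-time code is exactly six digits, not part of a longer number.
OTP_PATTERN = re.compile(r"(?<!\d)(\d{6})(?!\d)")


def extract_otp(email_body: str) -> Optional[str]:
    """Return the first standalone six-digit code in the message, if any."""
    match = OTP_PATTERN.search(email_body)
    return match.group(1) if match else None


def build_run_inputs(
    saved_defaults: Dict[str, str],
    runtime_secrets: Dict[str, str],
) -> Dict[str, str]:
    """Merge secrets in for one run only; the saved defaults are never modified."""
    inputs = dict(saved_defaults)
    inputs.update(runtime_secrets)
    return inputs


# Usage: the saved test holds only non-sensitive defaults.
saved = {"url": "https://staging.example.com", "email": "tester@example.com"}
body = "Your sign-in code is 482913. It expires in 10 minutes."
inputs = build_run_inputs(saved, {"otp": extract_otp(body) or ""})
print(inputs["otp"], "otp" in saved)
```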

Why seeing beats reading the DOM

It is worth dwelling on why this approach is sturdier, because it changes what a “passing test” means. A selector-based test asserts something about your implementation: that an element with a particular id or class exists in a particular place. But users never see your implementation — they see pixels. When a redesign moves a button into a new component, the user's experience is unchanged, yet a selector-based test fails because the implementation it was pinned to moved. You spend the afternoon updating selectors to re-assert a thing that never broke.

Testing by sight asserts something about the experience instead: the button a user would click is visible and clickable; the confirmation a user would read appears on screen. That is the thing you actually care about, and it is stable across the cosmetic churn — renamed classes, reordered markup, swapped component libraries — that breaks brittle suites. You get tests that fail when the product is broken and pass when it works, which is the only contract a test is supposed to honor.

Evidence by default

A test that only tells you whether it passed leaves you to reconstruct why. OpenFactory captures the why automatically. Each step records a screenshot at the moment it ran, a short note on what was attempted and what was verified, and a status. Put together, a run report reads like a flipbook of exactly what the test saw and did, in order. When a step fails, its screenshot is right there showing the state of the screen at the instant things went wrong — frequently you diagnose the bug from the report without ever re-running anything.
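The per-step record described above maps naturally onto a small data structure. The field names below are illustrative rather than OpenFactory's schema, and the sample data mirrors the kind of report described earlier: a passed wait step followed by a failed text assertion.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepEvidence:
    name: str
    status: str           # assumed values: "passed" or "failed"
    note: str             # what was attempted and what was verified
    screenshot_path: str  # the screen at the moment the step ran


def first_failure(steps: Sequence[StepEvidence]) -> Optional[StepEvidence]:
    """Return the first failed step, which is where diagnosis usually starts."""
    return next((step for step in steps if step.status == "failed"), None)


def run_verdict(steps: Sequence[StepEvidence]) -> str:
    return "failed" if first_failure(steps) is not None else "passed"


# Usage with two hypothetical steps.
report = [
    StepEvidence("Wait", "passed", "page settled", "steps/01.png"),
    StepEvidence("assert-text", "failed", "Expected text not found", "steps/02.png"),
]
failed = first_failure(report)
print(run_verdict(report), failed.screenshot_path if failed else None)
```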

Layer the video on top and you have both the frame-by-frame stills and the motion between them. Reports are private by default — screenshots, video, and notes are all gated — and you can make any single report shareable with a link when you want to hand a teammate the receipts. The point is that the evidence is a side effect of running the test, not extra work you have to remember to collect.

From a one-off check to a permanent test

There is a natural lifecycle here. You start by asking for a check — “does the booking flow still work?” — and OpenFactory drives it once and reports back. If it is something you will want to verify again, you promote that run into a saved scenario: a named, reusable test that lives under its app in the Test Panel. From then on it replays on demand, hardens itself for speed, and joins the one-click group re-runs alongside the rest of your suite.

Crucially, nothing about the test is locked to a brittle script you now have to maintain. The scenario stores the intent of each step and a cache of where things were last time; if the UI shifts, the test re-learns the changed step on its next run instead of failing and waiting for a human to patch a selector. Your suite maintains itself in the small ways, and asks for your attention only when something genuinely broke.

How to use it

Just ask. In chat, describe the workflow and ask OpenFactory to test it; it uses test_workflow and reports back with a verdict and a video.
Save it as a scenario. Turn a good run into a reusable test with create_app_scenario, then replay it any time with run_app_scenario — on its own or as part of a one-click group re-run.
Review in the console. Every run, its steps, and its video live in the Test Panel at console.openfactory.tech.

Testing stops being a maintenance tax when the tests see what your users see. Describe the workflow once, let it harden itself, and keep a video record of every run — then re-run the whole suite in one click whenever something changes.

Frequently asked questions

How is this different from a scripted test like Selenium or Playwright?

Scripted tests bind to CSS selectors and DOM structure, so a markup change breaks them even when the app still works. AI-in-the-loop testing drives the app the way a person does — it looks at the screen, decides what to click, and checks the result by sight — so it tolerates the cosmetic churn that makes scripted suites brittle.

What does the test_workflow tool actually do?

You describe a workflow in plain language; OpenFactory drives a tester machine through it in a screenshot → reason → act loop, captures evidence at every step, validates the outcome, and records the whole run as video. It returns a run id immediately and you poll for the verdict, so long workflows don't block.

How does it assert that the UI is correct?

Beyond plain text matching, a visual gate runs image recognition over each screenshot, so you can assert semantic states like “the submit button is visible” or “the dashboard rendered.” These checks survive restyles and work even on canvas or non-text UI.

Do my tests get slower every run?

No. The first run resolves each step the slow, careful way and remembers where things were; later runs replay from that cache and only fall back to full reasoning for the steps whose UI actually changed. Tests get faster as they stabilize.

How do tests handle login and 2FA?

A test can fetch a one-time login code from a connected mailbox at run time, so email-OTP sign-ins work end to end without anyone storing a password in the test. Secrets are supplied at run time and never persisted.
