# UI Test — Agentic UI Testing Skill

Test UI changes in a real browser. Your job is to **try to break things**, not confirm they work.
Three workflows:

- **Diff-driven** — analyze a git diff, test only what changed
- **Exploratory** — navigate the app, find bugs the developer didn't think about
- **Parallel** — fan out independent test groups across multiple Browserbase browsers
## How Testing Works

The main agent **coordinates** — it plans test strategy, delegates to sub-agents, and merges results. Sub-agents do the actual browser testing.
### Planning: multiple angles, then execute once

You MUST complete all three planning rounds yourself and output them before launching any sub-agents. Planning happens in your own response — it is NOT delegated to sub-agents. Do not skip ahead to execution.
- **Round 1 — Functional:** What are the core user flows? What should work? Write out each test as: action → expected result.
- **Round 2 — Adversarial:** Re-read Round 1. What did you miss? Think about: different user types/roles, error paths, empty states, race conditions, edge inputs (empty, huge, special chars, rapid clicks).
- **Round 3 — Coverage gaps:** Re-read Rounds 1–2. What about: accessibility (axe-core, keyboard-only), mobile viewports, console errors, visual consistency with the rest of the app?
- **Deduplicate:** Merge all three rounds into one numbered list of tests. Remove overlaps. Assign each test to a group (e.g. Group A, Group B).
**Then execute once** — launch one sub-agent per group. Each sub-agent receives its specific list of tests to run, nothing more. Sub-agents do not explore or plan — they execute assigned tests and report results.

Output the three rounds, the merged plan, and the group assignments in your response before calling any Agent tool.
### Principles for splitting work

- **Sub-agents run assigned tests, not open exploration.** The main agent hands each sub-agent a specific numbered list of tests. Sub-agents do not plan, explore, or decide what to test — they execute the list and stop.
- **The bottleneck is the slowest agent** — split work so no single agent has a disproportionate share. Many small agents > few large ones.
- **Size the effort to the change** — a single component fix doesn't need many agents or many steps. A full-page redesign does. Let the scope of the diff drive the plan.
- **No early stopping on failures** — find as many bugs as possible within the assigned tests.
### Giving sub-agents a step budget

The main agent MUST include an explicit browse step limit in every sub-agent prompt. Sub-agents do not self-limit — they will run until done unless told otherwise.

As a rough heuristic: ~25 steps for a few targeted checks, ~40 for a full page with functional + adversarial + a11y, ~75 for multiple pages or a broad category. Adjust based on what the assigned tests actually require — these are starting points, not rules.
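The heuristic can be written down as a simple lookup. This is only an illustrative sketch: the scope labels (`targeted`, `full-page`, `broad`) are hypothetical names invented here, and the numbers are just the starting points stated above.

```shell
# Illustrative budget lookup. Scope labels are made up for this sketch.
budget_for_scope() {
  case "$1" in
    targeted)  echo 25 ;;   # a few targeted checks
    full-page) echo 40 ;;   # one full page: functional + adversarial + a11y
    broad)     echo 75 ;;   # multiple pages or a broad category
    *)         echo 40 ;;   # default to the middle when unsure
  esac
}

budget_for_scope full-page   # prints 40
```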
Every sub-agent prompt must include:

- "You have a budget of N browse steps (each `browse` command = 1 step). Count your steps as you go. When you reach N, stop immediately and report: `STEP_PASS`/`STEP_FAIL` for every test you completed, and `STEP_SKIP|<step-id>|budget reached` for every test you didn't get to. Do not retry or continue after hitting the budget."
- "Run only these tests: [numbered list from the merged plan]"
- "Do not explore beyond the assigned tests."
- "Do NOT generate an HTML report or write any files. Return only step markers and your findings as text."
The main agent should NOT run `browse` commands itself (except to verify the dev server is up). All testing happens in sub-agents.

When a sub-agent hits its budget, the main agent accepts the partial results as-is. Do not re-run or retry the sub-agent. Include SKIPPED tests in the final report so the developer knows what wasn't covered.
### Reporting

Every sub-agent reports back with:

```
Tests: 8 | Passed: 5 | Failed: 2 | Skipped: 1 | Pages visited: 2
```

The main agent merges into a final report with:

```
Tests: 20 | Passed: 14 | Failed: 4 | Skipped: 2 | Agents: 3 | Pass rate: 70%
```

Do not report "steps used" — browse command counts are implementation plumbing, not a meaningful metric for reviewers.
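The merge itself is mechanical. A minimal sketch, assuming each sub-agent's marker output has been saved to a file (the `/tmp/ui-agents` directory and its sample contents are hypothetical — in practice sub-agent results arrive as tool output, not files):

```shell
# Hypothetical setup: one saved output file per sub-agent.
mkdir -p /tmp/ui-agents
cat > /tmp/ui-agents/group-a.txt <<'EOF'
STEP_PASS|homepage-cta|button "Sign up" present at @0-4
STEP_FAIL|double-submit|expected single submission → two toasts|shot.png
STEP_SKIP|modal-cancel|budget reached
EOF

# Tally markers across all agents and emit the merged summary line.
all=$(cat /tmp/ui-agents/*.txt)
passed=$(printf '%s\n' "$all" | grep -c '^STEP_PASS')
failed=$(printf '%s\n' "$all" | grep -c '^STEP_FAIL')
skipped=$(printf '%s\n' "$all" | grep -c '^STEP_SKIP')
total=$((passed + failed + skipped))
rate=$((100 * passed / total))
echo "Tests: $total | Passed: $passed | Failed: $failed | Skipped: $skipped | Pass rate: ${rate}%"
```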
## Testing Philosophy

- **You are an adversarial tester.** Your goal is to find bugs, not prove correctness.
- **Try to break every feature you test.** Don't just check "does the button exist?" — click it twice rapidly, submit empty forms, paste 500 characters, press Escape mid-flow.
- **Test what the developer didn't think about.** Empty states, error recovery, keyboard-only navigation, mobile overflow.
- **Every assertion must be evidence-based.** Compare before/after snapshots. Check specific elements by ref. Never report PASS without concrete evidence from the accessibility tree or a deterministic check.
- **Report failures with enough detail to reproduce.** Include the exact action, what you expected, what you got, and a suggested fix.
## Assertion Protocol

Every test step MUST produce a structured assertion. Do not write freeform "this looks good."

### Step markers

For each test step, emit exactly one marker:

```
STEP_PASS|<step-id>|<evidence>
```

or

```
STEP_FAIL|<step-id>|<expected> → <actual>|<screenshot-path>
```

- **step-id** — short identifier like `homepage-cta`, `form-validation-error`, `modal-cancel`
- **evidence** — what you observed that proves the step passed (element ref, text content, URL, eval result)
- **expected → actual** — what you expected vs what you got
- **screenshot-path** — path to the saved screenshot (failures only — see Screenshot Capture below)

### Screenshot Capture for Failures

Every STEP_FAIL MUST have an accompanying screenshot so the developer can see what went wrong visually. When a test step fails:
1. Take a screenshot immediately after observing the failure:

```shell
browse screenshot --path .context/ui-test-screenshots/<step-id>.png
```

2. If `--path` is not supported, take the screenshot and save manually:

```shell
browse screenshot
# The browse CLI will output the screenshot path — move/copy it:
cp /tmp/browse-screenshot-*.png .context/ui-test-screenshots/<step-id>.png
```

Set up the screenshot directory at the start of any test run:

```shell
mkdir -p .context/ui-test-screenshots
```

Rules:

- File name = step-id (e.g., `double-submit.png`, `axe-audit.png`, `modal-focus-trap.png`)
- Store in `.context/ui-test-screenshots/` — this directory is gitignored and accessible to the developer and other agents
- For parallel runs, include the session name: `<session>-<step-id>.png` (e.g., `signup-double-submit.png`)
- Take the screenshot at the moment of failure — capture the broken state, not after recovery
- For visual/layout bugs, also screenshot the baseline (working state) for comparison: `<step-id>-baseline.png`

### How to verify (in order of rigor)

1. **Deterministic check (strongest)** — `browse eval` returns structured data you can inspect. Examples: axe-core violation count, `document.title`, form field value, console error array, element count.
2. **Snapshot element match** — a specific element with a specific role and text exists in the accessibility tree. Check by ref: `@0-12 button "Save"`. An element either exists in the tree or it doesn't.
3. **Before/after comparison** — snapshot before action, act, snapshot after. Verify the tree changed in the expected way (element appeared, disappeared, text changed).
4. **Screenshot + visual judgment (weakest)** — only for visual-only properties (color, spacing, layout) that the accessibility tree cannot capture. Always accompany with what specifically you're evaluating.

### Before/after comparison pattern

This is the core verification loop. Use it for every interaction:
```shell
# 1. BEFORE: capture state
browse snapshot
# Record: what elements exist, their text, their refs

# 2. ACT: perform the interaction
browse click @0-12

# 3. AFTER: capture new state
browse snapshot
# Compare: what changed? What appeared? What disappeared?

# 4. ASSERT: emit marker based on comparison
# If dialog appeared: STEP_PASS|modal-open|dialog "Confirm" appeared at @0-20
# If nothing changed:
browse screenshot --path .context/ui-test-screenshots/modal-open.png
# STEP_FAIL|modal-open|expected dialog to appear → snapshot unchanged|.context/ui-test-screenshots/modal-open.png
```
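If a sub-agent scripts part of its run, tiny helpers keep the marker format consistent. A sketch only: the `pass`/`fail` function names are invented here and are not part of the skill; the helpers do pure string formatting and make no browser calls.

```shell
# Hypothetical marker-formatting helpers (no browser interaction).
pass() { echo "STEP_PASS|$1|$2"; }        # step-id, evidence
fail() { echo "STEP_FAIL|$1|$2|$3"; }     # step-id, expected → actual, screenshot path

pass modal-open 'dialog "Confirm" appeared at @0-20'
fail modal-open 'expected dialog to appear → snapshot unchanged' .context/ui-test-screenshots/modal-open.png
```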
## Setup

```shell
which browse || npm install -g @browserbasehq/browse-cli
```
### Avoid permission fatigue

This skill runs many `browse` commands (snapshots, clicks, evals). To avoid approving each one, add `browse` to your allowed commands. Add both patterns to `.claude/settings.json` (project-level) or `~/.claude/settings.json` (user-level):
```json
{
  "permissions": {
    "allow": [
      "Bash(browse:*)",
      "Bash(BROWSE_SESSION=*)"
    ]
  }
}
```
The first pattern covers plain `browse` commands. The second covers parallel sessions (`BROWSE_SESSION=signup browse open ...`). Both are needed to avoid approval prompts.
## Mode Selection

| Target | Mode | Command | Auth |
|---|---|---|---|
| `localhost` / `127.0.0.1` | Local | `browse env local` | None needed (clean isolated local browser by default) |
| Deployed/staging site | Remote | `browse env remote` | cookie-sync → `--context-id` |

Rule: If the target URL contains `localhost` or `127.0.0.1`, always use `browse env local`.
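The rule is mechanical enough to sketch. `pick_env` is a hypothetical helper invented for illustration, not a browse subcommand:

```shell
# Hypothetical helper applying the rule: localhost/127.0.0.1 → local, anything else → remote.
pick_env() {
  case "$1" in
    *localhost*|*127.0.0.1*) echo local ;;
    *)                       echo remote ;;
  esac
}

pick_env "http://localhost:3000/signup"    # prints local
pick_env "https://staging.your-app.com"    # prints remote
```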
### Local Mode (default for localhost)

```shell
browse env local
browse open http://localhost:3000
```

`browse env local` uses a clean isolated local browser by default, which is best for reproducible localhost QA runs. Use local-mode variants only when needed:

- `browse env local --auto-connect` — auto-discover existing local Chrome, fallback to isolated. Use this only when the test explicitly needs existing local login/cookies/state.
### Remote Mode (deployed/staging)

Step 1: Sync cookies from local Chrome to Browserbase:

```shell
node .claude/skills/cookie-sync/scripts/cookie-sync.mjs --domains your-app.com
# Output: Context ID: ctx_abc123
```

Step 2: Switch to remote mode:

```shell
browse env remote
browse open https://staging.your-app.com --context-id ctx_abc123 --persist
browse snapshot
# ... run tests ...
```
```shell
browse stop
```

Cookie-sync flags: `--domains`, `--context`, `--stealth`, `--proxy "City,ST,US"`

## Workflow A: Diff-Driven Testing

### Phase 1: Analyze the diff

```shell
git diff --name-only HEAD~1
# or: git diff --name-only
# or: git diff --name-only main...HEAD

git diff HEAD~1 -- <file>   # read actual changes
```
Categorize changed files:

| File pattern | UI impact | What to test |
|---|---|---|
| `.tsx`, `.jsx`, `.vue`, `.svelte` | Component | Render, interaction, state, edge cases |
| `pages/`, `app/`, `src/routes/` | Route/page | Navigation, page load, content, 404 handling |
| `.css`, `.scss`, `.module.css` | Style | Visual appearance (screenshot), responsive |
| `form`, `input`, `field` | Form | Validation, submission, empty input, long input, special chars |
| `modal`, `dialog`, `dropdown` | Interactive | Open/close, escape, focus trap, cancel vs confirm |
| `nav`, `menu`, `header*` | Navigation | Links, active states, routing, keyboard nav |
| Non-UI files only | None | Skip — report "no UI tests needed" |

### Phase 2: Map files to URLs

Detect framework:

```shell
cat package.json | grep -E '"(next|react|vue|nuxt|svelte|@sveltejs|angular|vite)"'
```

| Framework | Default port | File → URL pattern |
|---|---|---|
| Next.js App Router | 3000 | `app/dashboard/page.tsx` → `/dashboard` |
| Next.js Pages Router | 3000 | `pages/about.tsx` → `/about` |
| Vite | 5173 | Check router config |
| Nuxt | 3000 | `pages/index.vue` → `/` |
| SvelteKit | 5173 | `src/routes/+page.svelte` → `/` |
| Angular | 4200 | Check routing module |

### Phase 3: Ensure the right code is running

Before testing, verify the dev server is serving the code from the diff — not a stale branch. If testing a PR or specific branch:
```shell
# Check what branch is currently checked out
git branch --show-current

# If it's not the PR branch, switch to it
git fetch origin <branch> && git checkout <branch>

# Install deps — the lockfile may differ between branches
yarn install   # or npm install / pnpm install
```
If the dev server was already running on a different branch, restart it after checkout.

Find a running dev server:

```shell
for port in 3000 3001 5173 4200 8080 8000 5000; do
  s=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:$port" 2>/dev/null)
  if [ "$s" != "000" ]; then echo "Dev server on port $port (HTTP $s)"; fi
done
```

If nothing found: tell the user to start their dev server.

Verify it actually renders: after `browse open` + `browse snapshot`, check that the accessibility tree contains real page content (navigation, headings, interactive elements) — not just an error overlay or empty body. Next.js dev servers can return HTTP 200 while showing a full-screen build error dialog. If the snapshot is empty or dominated by an error dialog, the server is broken — fix the build before testing.

### Phase 4: Generate test plan

For each changed area, plan both happy path AND adversarial tests:

```
Test Plan (based on git diff)
=============================
Changed: src/components/SignupForm.tsx (added email validation)

1. [happy] Valid email submits successfully
   URL: http://localhost:3000/signup
   Steps: fill valid email → submit → verify success message appears
2. [adversarial] Invalid email shows error
   Steps: fill "not-an-email" → submit → verify error message appears
3. [adversarial] Empty form submission
   Steps: click submit without filling anything → verify error, no crash
4. [adversarial] XSS in email field
   Steps: fill "<script>alert(1)</script>" → submit → verify sanitized/rejected
5. [adversarial] Rapid double-submit
   Steps: click submit twice quickly → verify no duplicate submission
6. [adversarial] Keyboard-only flow
   Steps: Tab to email → type → Tab to submit → Enter → verify success
```

### Phase 5: Execute tests

```shell
browse stop 2>/dev/null
mkdir -p .context/ui-test-screenshots
```
```shell
# localhost/default QA → clean, reproducible local run
browse env local
```

For each test, follow the before/after pattern:

```shell
# Navigate
browse open http://localhost:3000/path
browse wait load

# BEFORE snapshot
browse snapshot
# Note the current state: elements, refs, text

# ACT
browse click @0-ref
# or: browse fill "selector" "value"
# or: browse type "text"
# or: browse press Enter

# AFTER snapshot
browse snapshot
# Compare against BEFORE: what changed?

# ASSERT with marker
# STEP_PASS|step-id|evidence  OR  STEP_FAIL|step-id|expected → actual
```
### Phase 6: Report results

```
UI Test Results

STEP_PASS|valid-email-submit|status "Thanks!" appeared at @0-42 after submit
  - URL: http://localhost:3000/signup
  - Before: form with email input @0-3, submit button @0-7
  - Action: filled "user@test.com", clicked @0-7
  - After: form replaced by status element with "Thanks! We'll be in touch."

STEP_FAIL|double-submit|expected single submission → form submitted twice|.context/ui-test-screenshots/double-submit.png
  - URL: http://localhost:3000/signup
  - Before: form with submit button @0-7
  - Action: clicked @0-7 twice rapidly
  - After: two success toasts appeared, suggesting duplicate submission
  - Screenshot: .context/ui-test-screenshots/double-submit.png
  - Suggestion: disable submit button after first click, or debounce the handler

Summary: 4/6 passed, 2 failed
Failed: double-submit, xss-sanitization
Screenshots saved to .context/ui-test-screenshots/ — open any failed step's screenshot to see the broken state.
```
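If the text report is saved to a file, the pipe-delimited markers make triage a one-liner. A sketch under assumptions: the `/tmp/ui-report.txt` path and its contents are hypothetical sample data, created here only so the commands have something to run against.

```shell
# Hypothetical saved report (markers normally arrive as sub-agent text output).
cat > /tmp/ui-report.txt <<'EOF'
STEP_PASS|valid-email-submit|status "Thanks!" appeared at @0-42
STEP_FAIL|double-submit|expected single submission → form submitted twice|s1.png
STEP_FAIL|xss-sanitization|expected rejection → payload rendered|s2.png
EOF

# List failed step-ids: field 2 of each pipe-delimited STEP_FAIL marker.
grep '^STEP_FAIL' /tmp/ui-report.txt | cut -d'|' -f2
```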
Always `browse stop` when done.
### Phase 7: Generate HTML report

After producing the text report, generate a standalone HTML report that a reviewer can open in a browser. The report embeds screenshots inline (base64) so it works as a single file — no external dependencies.

Why: Text reports are good for the agent conversation, but reviewers (PMs, designers, other engineers) want a visual artifact they can open, scan, and share. Screenshots inline make failures immediately obvious.
#### How to generate

1. Read the HTML template at `references/report-template.html`
2. Build the report by replacing the template placeholders with actual test data:

| Placeholder | Value |
|---|---|
| `{{TITLE}}` | Report title for `<title>` tag (e.g., "UI Test: PR #1234 — OAuth Settings") |
| `{{TITLE_HTML}}` | Report title for the visible