AI in Software Testing: How QA Agents Run Your Test Plans

By Testcollab Team •May 20, 2026 •14 min read •Posted in AI Test Management

You've got a test plan in TestCollab. 47 test cases, each with detailed steps and expected results, curated by your QA team. Now someone needs to actually execute them against the app — clicking through pages, filling forms, verifying that the right things show up in the right places.

What if that someone was an AI agent with a browser?

This is what AI in software testing actually looks like in 2026 — not theoretical, not "AI-powered" marketing copy. A QA agent reads your curated test plan, drives a real browser, and reports results back to your test management tool. Humans decide what to test; the agent does the clicking.

The TestCollab CLI makes this possible with any AI coding agent that can drive a browser. This tutorial covers three QA agents — Hermes Agent, Claude Code, and Codex — so you can pick the one that fits your stack.

What "AI in software testing" actually means here

AI software testing has fragmented into two distinct workflows that often get conflated:

AI generates test cases — an LLM reads your requirements or user stories and proposes test cases (covered in our AI test case generation guide)

AI executes test cases — a QA agent reads existing test cases and runs them against your app via a real browser

This post is about the second workflow. The model isn't deciding what to test — that's still a human's job. It's deciding how to click through the flow being tested, which is the part hand-written E2E suites do brittlely and humans do slowly. That's the part of QA where AI agents earn their keep.

How the agentic QA workflow works

The workflow has four steps, and the CLI handles the plumbing between TestCollab and the agent:

Humans curate the test plan in TestCollab — what to test, what the expected results are

tc getTestPlan exports the plan as structured JSON that agents can consume

The AI agent opens a browser, executes each test case step-by-step, and records pass/fail/skip

tc report uploads the JUnit XML results back to TestCollab

Humans stay in control of what to test. The agent handles executing it. The tc getTestPlan + tc report pattern is agent-agnostic — any agent with browser access can slot into step 3.

Common setup

Every agent needs the TestCollab CLI and a test plan. Set these up first, then pick your agent below.

How the install pattern works

All three agents below follow the same shape. You install a small TestCollab skill into the agent's user-global skills directory once. After that, you launch the agent from your own project's directory — the same place you'd cd into to run tests by hand — and the skill is auto-loaded from its description when you mention the work. You never cd into testcollab-cli to run QA.

Agent	Skill installs to	You launch from
Hermes	`~/.hermes/skills/software-development/testcollab-qa/`	your app's project dir
Claude Code	`~/.claude/skills/testcollab-qa/`	your app's project dir
Codex	`~/.codex/skills/testcollab-qa/`	your app's project dir

Install the TestCollab CLI

npm install -g @testcollab/cli
tc --version

Set your TestCollab API token (get it from Account Settings → API Tokens in TestCollab):

export TESTCOLLAB_TOKEN=your-token-here

Create a test plan in TestCollab

If you don't have one already, create a test plan in TestCollab with the test cases you want to execute. Each test case should have:

A clear title describing what's being tested
Steps with specific actions (e.g., "Navigate to /login", "Enter admin@example.com in the email field")
Expected results for each step (e.g., "Dashboard is displayed", "Success toast appears")

The more concrete your steps and expected results are, the better the agent performs. "Page loads correctly" is ambiguous. "URL changes to /dashboard and a heading with text 'Welcome' is visible" gives the agent something specific to verify.

Note your project ID and test plan ID — you'll need them when running the agent.

Option A: Hermes Agent

Hermes Agent is an open-source AI agent by Nous Research with built-in browser automation via MCP. It uses a skill system to learn workflows, and the TestCollab QA skill teaches it the full getTestPlan → browser → report cycle.

Prerequisites

Python 3.10+
An LLM API key (OpenRouter, Anthropic, or any provider Hermes supports)

Install Hermes Agent

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.bashrc
hermes setup

Hermes supports 200+ models via OpenRouter, Anthropic, OpenAI, and other providers. Models with strong instruction-following (Claude, GPT-4, Gemini) work best for QA tasks.

Install the TestCollab QA skill (one-time, user-global)

Hermes loads skills from ~/.hermes/skills/, so installing the TestCollab QA skill there makes it available from every directory on your machine. One install, then you forget it.

curl -fsSL https://raw.githubusercontent.com/TCSoftInc/testcollab-cli/main/hermes-skill/testcollab-qa/scripts/install.sh | bash

That script copies SKILL.md into ~/.hermes/skills/software-development/testcollab-qa/. Hermes organizes skills by category — software-development is the category declared in this skill's manifest. To verify:

ls ~/.hermes/skills/software-development/testcollab-qa/

Add your TestCollab token to Hermes' global env file so the skill can call the API:

echo "TESTCOLLAB_TOKEN=your-token-here" >> ~/.hermes/.env

Run from your own project

Switch to the project you actually want to test — the app's repo, not testcollab-cli:

cd ~/your-app
hermes

Then prompt it:

Execute test plan 555 in project 16 against http://localhost:3000.
Login: testuser@example.com / password123.

Hermes finds the TestCollab QA skill in your user-global skills dir, fetches the plan, drives the browser, generates JUnit XML, and calls tc report.

Non-interactive mode

For CI or scripted runs, pass the prompt inline:

hermes -z "Execute test plan 555 in project 16 against http://staging.example.com. \
  Login: testuser@example.com / password123." --accept-hooks

hermes chat -q "..." works the same way if you prefer the subcommand form.

Option B: Claude Code

Claude Code is Anthropic's CLI coding agent. It has built-in browser automation via playwright-cli — the same accessibility-tree approach that Hermes uses. No separate browser agent or skill installation needed.

Prerequisites

Claude Code installed and authenticated
playwright-cli browser installed (Claude Code will prompt you if needed)

Install the TestCollab QA skill (one-time, user-global)

Claude Code loads skills from two locations:

~/.claude/skills/<name>/SKILL.md — user-global, available in every project
<project>/.claude/skills/<name>/SKILL.md — project-local, available only in that one project

For most teams, user-global is the right choice — install once and the skill works against any app you're testing. Same one-liner shape as the Hermes install:

curl -fsSL https://raw.githubusercontent.com/TCSoftInc/testcollab-cli/main/claude-code-skill/testcollab-qa/scripts/install.sh | bash

The script drops SKILL.md into ~/.claude/skills/testcollab-qa/, checks that Claude Code and @testcollab/cli are installed, and warns if TESTCOLLAB_TOKEN is unset. To verify:

ls ~/.claude/skills/testcollab-qa/

(If you'd rather scope the skill to one repo, drop the same SKILL.md into your-app/.claude/skills/testcollab-qa/SKILL.md instead — same skill, different visibility.)

Run from your own project

Switch to the app you want to test — not testcollab-cli:

cd ~/your-app
claude

Then prompt it:

Execute test plan 555 in project 16 against http://localhost:3000.
Login: testuser@example.com / password123.

Claude Code matches the prompt to the testcollab-qa skill via its description and:

Runs tc getTestPlan to fetch the plan as JSON

Opens a Chromium browser via playwright-cli

Navigates to each page, interacts with elements, and verifies expected results

Generates a JUnit XML result file with [TC-ID] prefixes

Uploads results via tc report

Prints a summary of pass/fail/skip results

Non-interactive mode

For CI or scripted runs:

claude -p "Execute test plan 555 in project 16 against http://staging.example.com. \
  Login: testuser@example.com / password123."

Option C: Codex

Codex is OpenAI's CLI coding agent. It has full shell access and a skill system, so once you install the TestCollab QA skill it can drive a browser via playwright-cli and run the same getTestPlan → browser → report cycle as Hermes and Claude Code.

Prerequisites

Codex CLI installed and authenticated (npm install -g @openai/codex)
playwright-cli available on PATH so Codex can launch a real browser

Install the TestCollab QA skill (one-time, user-global)

Codex loads skills from $CODEX_HOME/skills/ (defaults to ~/.codex/skills/). Installing the TestCollab QA skill there makes it available from every directory on your machine — one install, then you forget it.

curl -fsSL https://raw.githubusercontent.com/TCSoftInc/testcollab-cli/main/codex-skill/testcollab-qa/scripts/install.sh | bash

That drops the skill at ~/.codex/skills/testcollab-qa/ with SKILL.md. To verify:

ls ~/.codex/skills/testcollab-qa/

Restart Codex so it picks up the new skill, and make sure TESTCOLLAB_TOKEN is set in your shell profile so the skill can call the API.

Run from your own project

Switch to the project you actually want to test — not testcollab-cli:

cd ~/your-app
codex

Then prompt it:

Execute test plan 555 in project 16 against http://localhost:3000.
Login: testuser@example.com / password123.

Codex matches the prompt to the testcollab-qa skill via its description, fetches the plan with tc getTestPlan, drives the browser via playwright-cli, generates JUnit XML, and uploads results with tc report.

Non-interactive mode

For CI or scripted runs:

codex exec "Execute test plan 555 in project 16 against http://staging.example.com. \
  Login: testuser@example.com / password123."

Sandbox note: Codex's default sandbox blocks new listening sockets, which playwright-cli needs to launch a browser. Run with --full-auto (or the equivalent escalation flag for your Codex version) so the browser process can start. Tighten the sandbox back down for everything else.

How a QA agent drives a real browser

Both Hermes and Claude Code use accessibility trees — a text-based representation of what's on the page, with reference IDs for interactive elements. This makes them resilient to CSS class changes, re-renders, and minor UI tweaks.

Here's what a typical test case execution looks like:

Test step	What the agent does
"Navigate to /settings"	Navigates the browser to the full URL
"Click the Save button"	Takes a page snapshot, finds the Save button by text, clicks it by reference ID
"Enter 'admin@example.com' in the email field"	Snapshots the page, finds the email input, fills it with the value
"Verify the success message appears"	Snapshots the page and checks the accessibility tree for the expected text
"The page title should be 'Dashboard'"	Snapshots and verifies the heading element contains "Dashboard"

Each verification step checks the accessibility tree for expected content. If the expected result doesn't match, the test case is marked as failed with a description of what was expected vs. what was found.

Result mapping

Results are mapped back to TestCollab test cases using the [TC-ID] convention in JUnit XML:

<testsuites>
  <testsuite name="Sprint 12 Regression" tests="3" failures="1">
    <testcase classname="Login" name="[TC-42] Valid login redirects to dashboard" time="4.2"/>
    <testcase classname="Login" name="[TC-43] Invalid password shows error" time="3.1">
      <failure message="Error message not found">
        Expected: Error banner visible after wrong password.
        Actual: Page reloaded without error indication.
      </failure>
    </testcase>
    <testcase classname="Settings" name="[TC-44] User can update email" time="5.8"/>
  </testsuite>
</testsuites>

The [TC-42] prefix is what lets tc report match each result to the correct test case in TestCollab. Without it, results would create new test cases instead of updating existing ones.

What results look like in TestCollab

After tc report uploads the JUnit XML, results appear in the test plan's Test Cases tab — each test case shows its execution result with a color-coded status:

Test plan results in TestCollab showing Passed, Failed, Skipped, and Unexecuted test cases

The Result column shows Passed (green), Failed (red), Skipped (yellow), and Unexecuted (grey) at a glance. Click any test case to see the full execution details, including failure messages with expected vs. actual results.

See it in action

We tested both Hermes Agent and Claude Code against the TestCollab website itself — homepage, pricing, and blog — with a 3-case smoke test plan.

Hermes Agent results

Test Plan: TestCollab Website Smoke Test
Total: 3 | Passed: 3 | Failed: 0 | Skipped: 0

TC-201 [Homepage] - PASS
  - testcollab.com loaded with title "Test Management Tool for Humans & AI Agents"
  - "TestCollab" brand name present in page content and navigation
  - "START FREE TRIAL" and "Try For Free Now" CTAs visible

TC-202 [Pricing] - PASS
  - testcollab.com/pricing loaded with title "Plans & Pricing | TestCollab"
  - Three plan tiers present: Premium, Elite, Enterprise
  - Prices visible: $29/user/month (Premium), $39/user/month (Elite)

TC-203 [Blog] - PASS
  - testcollab.com/blog loaded with title "Blog | TestCollab"
  - 12 blog posts visible on first page, each with a title
  - Every article shows both a category label and publication date

Claude Code results

Test Plan: TestCollab Website Smoke Test (Claude Code)
Total: 3 | Passed: 3 | Failed: 0 | Skipped: 0

TC-201 [Homepage] - PASS
  - Page title: "Test Management Tool for Humans & AI Agents - Test Collab"
  - "TestCollab logo" in navigation, brand name in header
  - "Start Free Trial" and "Try For Free Now" CTAs found via accessibility tree

TC-202 [Pricing] - PASS
  - Page title: "Plans & Pricing | TestCollab"
  - Three plan headings: "Premium", "Elite", "Enterprise" identified
  - "$29" and "$39" price elements visible in plan cards

TC-203 [Blog] - PASS
  - Page title: "Blog | TestCollab"
  - Multiple article elements with titles, each with category and date
  - Categories: "Announcements", "Quality Assurance", "Best Practices"

Both agents passed all 3 test cases. The same test plan JSON, the same JUnit XML format, the same tc report upload — the only difference is which agent drove the browser.

<testsuite name="TestCollab Website Smoke Test" tests="3" failures="0">
  <testcase classname="Homepage" name="[TC-201] Homepage displays brand name and signup CTA"/>
  <testcase classname="Pricing" name="[TC-202] Pricing page shows plan tiers"/>
  <testcase classname="Blog" name="[TC-203] Blog page lists recent articles"/>
</testsuite>

Tips for better results

Keep test plans small. 10–30 test cases per plan works best. Larger plans risk context drift — the agent may lose track of what it's doing around case 40. Split large suites into multiple plans.

Write concrete expected results. The agent needs something specific to check. Compare:

Vague: "The page loads correctly"
Concrete: "URL is /dashboard and a table with at least 3 rows is visible"

Provide credentials in the prompt. Don't make the agent guess. Tell it the login URL, email, and password upfront.

Use staging, not production. The agent interacts with your app like a real user — it can submit forms, create records, and modify data. Point it at a staging or local environment.

Review failures manually. Agents sometimes misclassify slow page loads as failures, or miss subtle UI states. Use the results as a first pass, then review failures with human eyes before filing bugs.

Choosing a QA agent

Feature	Hermes Agent	Claude Code	Codex
Browser automation	Built-in (MCP)	Built-in (playwright-cli)	Via `playwright-cli` (needs `--full-auto`)
Skill install location	`~/.hermes/skills/software-development/testcollab-qa/`	`~/.claude/skills/testcollab-qa/`	`~/.codex/skills/testcollab-qa/`
Invocation	Natural-language prompt	Natural-language prompt	Natural-language prompt
Where you run it from	Your app's project dir	Your app's project dir	Your app's project dir
Best for	Dedicated QA agent runs	Developer-driven QA during coding	Repo-aware QA inside an OpenAI-powered workflow
Non-interactive/CI	`hermes -z "..."`	`claude -p "..."`	`codex exec "..."`
LLM flexibility	200+ models via OpenRouter	Claude models	OpenAI models
Cost	Bring your own API key	Claude Code subscription	Codex subscription

What's next

The skills for all three agents and their install scripts are open source in the TestCollab CLI repo under hermes-skill/, claude-code-skill/, and codex-skill/. Install each one to its agent's user-global skills directory, work from your own project, and the tc getTestPlan → agent → tc report loop is the same regardless of which agent you pick.

If you're already using the TestCollab MCP Server for AI-powered test case generation, this closes the loop: AI generates the test cases, humans curate them into plans, and an AI agent executes them.

The agentic QA workflow works with any agent that has browser automation — not just the three covered here. The tc getTestPlan + tc report pattern is the contract. See the Agentic QA Guide for the framework-agnostic approach.

#ai-in-software-testing#ai-software-testing#qa-agent#agentic-qa#ai-testing-tools#hermes-agent#claude-code#codex#browser-automation#cli#integrations

What "AI in software testing" actually means here

How the agentic QA workflow works

Common setup

How the install pattern works

Install the TestCollab CLI

Create a test plan in TestCollab

Option A: Hermes Agent

Prerequisites

Install Hermes Agent

Install the TestCollab QA skill (one-time, user-global)

Run from your own project

Non-interactive mode

Option B: Claude Code

Prerequisites

Install the TestCollab QA skill (one-time, user-global)

Run from your own project

Non-interactive mode

Option C: Codex

Prerequisites

Install the TestCollab QA skill (one-time, user-global)

Run from your own project

Non-interactive mode

How a QA agent drives a real browser

Result mapping

What results look like in TestCollab

See it in action

Hermes Agent results

Claude Code results

Tips for better results

Choosing a QA agent

What's next

Related Posts

Best AI Testing Tools Compared: Katalon, Testim, Applitools & 7 More (2026)

Playwright CLI: The Token-Efficient Alternative to Playwright MCP for AI Coding Agents

10 Best AI Test Case Generation Tools (2026)

Ready to Elevate Your QA Process?