AI in Software Testing: How QA Agents Run Your Test Plans

You've got a test plan in TestCollab. 47 test cases, each with detailed steps and expected results, curated by your QA team. Now someone needs to actually execute them against the app — clicking through pages, filling forms, verifying that the right things show up in the right places.

What if that someone was an AI agent with a browser?

This is what AI in software testing actually looks like in 2026 — not theoretical, not "AI-powered" marketing copy. A QA agent reads your curated test plan, drives a real browser, and reports results back to your test management tool. Humans decide what to test; the agent does the clicking.

The TestCollab CLI makes this possible with any AI coding agent that can drive a browser. This tutorial covers three QA agents — Hermes Agent, Claude Code, and Codex — so you can pick the one that fits your stack.

What "AI in software testing" actually means here

AI software testing has fragmented into two distinct workflows that often get conflated:

  • AI generates test cases: an LLM reads your requirements or user stories and proposes test cases (covered in our AI test case generation guide)
  • AI executes test cases: a QA agent reads existing test cases and runs them against your app via a real browser

This post is about the second workflow. The model isn't deciding what to test; that's still a human's job. It's deciding how to click through the flow being tested, the part where hand-written E2E suites are brittle and humans are slow. That's where AI agents earn their keep in QA.

    How the agentic QA workflow works

    The workflow has four steps, and the CLI handles the plumbing between TestCollab and the agent:

    1. Humans curate the test plan in TestCollab: what to test and what the expected results are.
    2. tc getTestPlan exports the plan as structured JSON that agents can consume.
    3. The AI agent opens a browser, executes each test case step by step, and records pass/fail/skip.
    4. tc report uploads the JUnit XML results back to TestCollab.

    Humans stay in control of what to test; the agent handles execution. The tc getTestPlan + tc report pattern is agent-agnostic: any agent with browser access can slot into step 3.
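
The four steps can be read as a small contract between three pluggable pieces. The sketch below is only an illustration of that shape: run_plan, CaseResult, and the three callables are hypothetical names, and the test plan is modeled as a plain list of dicts because the actual JSON layout produced by tc getTestPlan is not shown in this post.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CaseResult:
    # Hypothetical result record; field names are illustrative only.
    case_id: int
    title: str
    status: str  # "pass", "fail", or "skip"
    detail: str = ""


def run_plan(
    fetch_plan: Callable[[], Iterable[dict]],
    execute_case: Callable[[dict], CaseResult],
    upload: Callable[[list], None],
) -> list:
    """Run the fetch -> execute -> report loop with injected steps."""
    # Step 2: get the curated plan. Step 3: run every case in order.
    results = [execute_case(case) for case in fetch_plan()]
    # Step 4: hand all results to the uploader in one call.
    upload(results)
    return results


if __name__ == "__main__":
    # Stand-ins for steps 2-4. A real run would shell out to the tc CLI
    # and hand each case to a browser-driving agent.
    plan = [{"id": 42, "title": "Valid login redirects to dashboard"}]
    uploaded = []
    results = run_plan(
        fetch_plan=lambda: plan,
        execute_case=lambda c: CaseResult(c["id"], c["title"], "pass"),
        upload=uploaded.extend,
    )
    print([(r.case_id, r.status) for r in results])
```

Because only execute_case touches the browser, swapping one agent for another changes a single argument and leaves the fetch and upload steps alone.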

    Common setup

    Every agent needs the TestCollab CLI and a test plan. Set these up first, then pick your agent below.

    How the install pattern works

    All three agents below follow the same shape. You install a small TestCollab skill into the agent's user-global skills directory once. After that, you launch the agent from your own project's directory, the same place you'd cd into to run tests by hand, and the agent loads the skill automatically when your prompt matches the skill's description. You never cd into testcollab-cli to run QA.

    | Agent | Skill installs to | You launch from |
    | --- | --- | --- |
    | Hermes | ~/.hermes/skills/software-development/testcollab-qa/ | your app's project dir |
    | Claude Code | ~/.claude/skills/testcollab-qa/ | your app's project dir |
    | Codex | ~/.codex/skills/testcollab-qa/ | your app's project dir |

    Install the TestCollab CLI

    npm install -g @testcollab/cli
    tc --version

    Set your TestCollab API token (get it from Account Settings → API Tokens in TestCollab):

    export TESTCOLLAB_TOKEN=your-token-here
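
Before handing control to an agent, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the TestCollab CLI: preflight is a hypothetical helper that only checks for a tc executable on PATH and a non-empty TESTCOLLAB_TOKEN.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # The CLI is installed globally via npm, so it should resolve on PATH.
    if which("tc") is None:
        problems.append("tc not found on PATH (npm install -g @testcollab/cli)")
    # An empty string counts as unset.
    if not env.get("TESTCOLLAB_TOKEN"):
        problems.append("TESTCOLLAB_TOKEN is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

The env and which parameters exist only so the check can be exercised without touching the real environment.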

    Create a test plan in TestCollab

    If you don't have one already, create a test plan in TestCollab with the test cases you want to execute. Each test case should have:

    • A clear title describing what's being tested
    • Steps with specific actions (e.g., "Navigate to /login", "Enter admin@example.com in the email field")
    • Expected results for each step (e.g., "Dashboard is displayed", "Success toast appears")

    The more concrete your steps and expected results are, the better the agent performs. "Page loads correctly" is ambiguous. "URL changes to /dashboard and a heading with text 'Welcome' is visible" gives the agent something specific to verify.
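
One way to catch ambiguous wording before an agent ever sees it is a simple phrase check over the expected results. This is a rough heuristic sketch: VAGUE_PHRASES and find_vague are hypothetical, the phrase list is an assumption to tune for your team, and it will not catch every vague sentence.

```python
# Hypothetical phrase list; extend it with wording your team tends to use.
VAGUE_PHRASES = (
    "loads correctly",
    "works correctly",
    "works as expected",
    "looks good",
)


def find_vague(expected_results):
    """Return the expected-result strings that contain a known vague phrase."""
    return [
        text
        for text in expected_results
        if any(phrase in text.lower() for phrase in VAGUE_PHRASES)
    ]


if __name__ == "__main__":
    samples = [
        "Page loads correctly",
        "URL changes to /dashboard and a heading with text 'Welcome' is visible",
    ]
    for text in find_vague(samples):
        print("Consider rewriting:", text)
```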

    Note your project ID and test plan ID — you'll need them when running the agent.


    Option A: Hermes Agent

    Hermes Agent is an open-source AI agent by Nous Research with built-in browser automation via MCP. It uses a skill system to learn workflows, and the TestCollab QA skill teaches it the full getTestPlan → browser → report cycle.

    Prerequisites

    • Python 3.10+
    • An LLM API key (OpenRouter, Anthropic, or any provider Hermes supports)

    Install Hermes Agent

    curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
    source ~/.bashrc
    hermes setup

    Hermes supports 200+ models via OpenRouter, Anthropic, OpenAI, and other providers. Models with strong instruction-following (Claude, GPT-4, Gemini) work best for QA tasks.

    Install the TestCollab QA skill (one-time, user-global)

    Hermes loads skills from ~/.hermes/skills/, so installing the TestCollab QA skill there makes it available from every directory on your machine. One install, then you forget it.

    curl -fsSL https://raw.githubusercontent.com/TCSoftInc/testcollab-cli/main/hermes-skill/testcollab-qa/scripts/install.sh | bash

    That script copies SKILL.md into ~/.hermes/skills/software-development/testcollab-qa/. Hermes organizes skills by category — software-development is the category declared in this skill's manifest. To verify:

    ls ~/.hermes/skills/software-development/testcollab-qa/

    Add your TestCollab token to Hermes' global env file so the skill can call the API:

    echo "TESTCOLLAB_TOKEN=your-token-here" >> ~/.hermes/.env

    Run from your own project

    Switch to the project you actually want to test — the app's repo, not testcollab-cli:

    cd ~/your-app
    hermes

    Then prompt it:

    Execute test plan 555 in project 16 against http://localhost:3000.
    Login: testuser@example.com / password123.

    Hermes finds the TestCollab QA skill in your user-global skills dir, fetches the plan, drives the browser, generates JUnit XML, and calls tc report.

    Non-interactive mode

    For CI or scripted runs, pass the prompt inline:

    hermes -z "Execute test plan 555 in project 16 against http://staging.example.com. \
      Login: testuser@example.com / password123." --accept-hooks

    hermes chat -q "..." works the same way if you prefer the subcommand form.


    Option B: Claude Code

    Claude Code is Anthropic's CLI coding agent. It has built-in browser automation via playwright-cli, the same accessibility-tree approach that Hermes uses, so no separate browser agent is needed. The only extra piece is the TestCollab QA skill, installed below.

    Prerequisites

    • Claude Code installed and authenticated
    • playwright-cli browser installed (Claude Code will prompt you if needed)

    Install the TestCollab QA skill (one-time, user-global)

    Claude Code loads skills from two locations:

    • ~/.claude/skills/<name>/SKILL.md: user-global, available in every project
    • <project>/.claude/skills/<name>/SKILL.md: project-local, available only in that one project

    For most teams, user-global is the right choice: install once and the skill works against any app you're testing. Same one-liner shape as the Hermes install:

    curl -fsSL https://raw.githubusercontent.com/TCSoftInc/testcollab-cli/main/claude-code-skill/testcollab-qa/scripts/install.sh | bash

    The script drops SKILL.md into ~/.claude/skills/testcollab-qa/, checks that Claude Code and @testcollab/cli are installed, and warns if TESTCOLLAB_TOKEN is unset. To verify:

    ls ~/.claude/skills/testcollab-qa/

    (If you'd rather scope the skill to one repo, drop the same SKILL.md into your-app/.claude/skills/testcollab-qa/SKILL.md instead — same skill, different visibility.)

    Run from your own project

    Switch to the app you want to test — not testcollab-cli:

    cd ~/your-app
    claude

    Then prompt it:

    Execute test plan 555 in project 16 against http://localhost:3000.
    Login: testuser@example.com / password123.

    Claude Code matches the prompt to the testcollab-qa skill via its description and:

    1. Runs tc getTestPlan to fetch the plan as JSON
    2. Opens a Chromium browser via playwright-cli
    3. Navigates to each page, interacts with elements, and verifies expected results
    4. Generates a JUnit XML result file with [TC-ID] prefixes
    5. Uploads results via tc report
    6. Prints a summary of pass/fail/skip results

    Non-interactive mode

    For CI or scripted runs:

    claude -p "Execute test plan 555 in project 16 against http://staging.example.com. \
      Login: testuser@example.com / password123."

    Option C: Codex

    Codex is OpenAI's CLI coding agent. It has full shell access and a skill system, so once you install the TestCollab QA skill it can drive a browser via playwright-cli and run the same getTestPlan → browser → report cycle as Hermes and Claude Code.

    Prerequisites

    • Codex CLI installed and authenticated (npm install -g @openai/codex)
    • playwright-cli available on PATH so Codex can launch a real browser

    Install the TestCollab QA skill (one-time, user-global)

    Codex loads skills from $CODEX_HOME/skills/ (defaults to ~/.codex/skills/). Installing the TestCollab QA skill there makes it available from every directory on your machine — one install, then you forget it.

    curl -fsSL https://raw.githubusercontent.com/TCSoftInc/testcollab-cli/main/codex-skill/testcollab-qa/scripts/install.sh | bash

    That drops the skill at ~/.codex/skills/testcollab-qa/ with SKILL.md. To verify:

    ls ~/.codex/skills/testcollab-qa/

    Restart Codex so it picks up the new skill, and make sure TESTCOLLAB_TOKEN is set in your shell profile so the skill can call the API.

    Run from your own project

    Switch to the project you actually want to test — not testcollab-cli:

    cd ~/your-app
    codex

    Then prompt it:

    Execute test plan 555 in project 16 against http://localhost:3000.
    Login: testuser@example.com / password123.

    Codex matches the prompt to the testcollab-qa skill via its description, fetches the plan with tc getTestPlan, drives the browser via playwright-cli, generates JUnit XML, and uploads results with tc report.

    Non-interactive mode

    For CI or scripted runs:

    codex exec "Execute test plan 555 in project 16 against http://staging.example.com. \
      Login: testuser@example.com / password123."

    Sandbox note: Codex's default sandbox blocks new listening sockets, which playwright-cli needs to launch a browser. Run with --full-auto (or the equivalent escalation flag for your Codex version) so the browser process can start. Tighten the sandbox back down for everything else.


    How a QA agent drives a real browser

    Both Hermes and Claude Code use accessibility trees — a text-based representation of what's on the page, with reference IDs for interactive elements. This makes them resilient to CSS class changes, re-renders, and minor UI tweaks.

    Here's what a typical test case execution looks like:

    | Test step | What the agent does |
    | --- | --- |
    | "Navigate to /settings" | Navigates the browser to the full URL |
    | "Click the Save button" | Takes a page snapshot, finds the Save button by text, clicks it by reference ID |
    | "Enter 'admin@example.com' in the email field" | Snapshots the page, finds the email input, fills it with the value |
    | "Verify the success message appears" | Snapshots the page and checks the accessibility tree for the expected text |
    | "The page title should be 'Dashboard'" | Snapshots and verifies the heading element contains "Dashboard" |

    Each verification step checks the accessibility tree for expected content. If the expected result doesn't match, the test case is marked as failed with a description of what was expected vs. what was found.
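
To make the snapshot-and-lookup idea concrete, here is a toy model of it. The nested-dict snapshot format, the ref values, and the helpers find_ref and contains_text are all assumptions made for illustration; they are not the actual output or API of playwright-cli or Hermes.

```python
def iter_nodes(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_ref(tree, role, name):
    """Return the ref of the first node matching role and accessible name."""
    for node in iter_nodes(tree):
        if node.get("role") == role and node.get("name", "").strip().lower() == name.lower():
            return node.get("ref")
    return None


def contains_text(tree, text):
    """Check whether any node's accessible name contains the given text."""
    return any(text.lower() in node.get("name", "").lower() for node in iter_nodes(tree))


# Hypothetical snapshot of a settings page.
SNAPSHOT = {
    "role": "main",
    "name": "",
    "children": [
        {"role": "heading", "name": "Settings", "ref": "e1"},
        {"role": "textbox", "name": "Email", "ref": "e2"},
        {"role": "button", "name": "Save", "ref": "e3"},
    ],
}

if __name__ == "__main__":
    print(find_ref(SNAPSHOT, "button", "Save"))
    print(contains_text(SNAPSHOT, "Settings"))
```

The lookup depends only on role and visible name, which is why a renamed CSS class or a re-rendered component does not break it the way a selector-based script would.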

    Result mapping

    Results are mapped back to TestCollab test cases using the [TC-ID] convention in JUnit XML:

    <testsuites>
      <testsuite name="Sprint 12 Regression" tests="3" failures="1">
        <testcase classname="Login" name="[TC-42] Valid login redirects to dashboard" time="4.2"/>
        <testcase classname="Login" name="[TC-43] Invalid password shows error" time="3.1">
          <failure message="Error message not found">
            Expected: Error banner visible after wrong password.
            Actual: Page reloaded without error indication.
          </failure>
        </testcase>
        <testcase classname="Settings" name="[TC-44] User can update email" time="5.8"/>
      </testsuite>
    </testsuites>

    The [TC-42] prefix is what lets tc report match each result to the correct test case in TestCollab. Without it, results would create new test cases instead of updating existing ones.
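
If you want to produce the same shape of file from your own tooling, a sketch along these lines may help. build_junit and the keys of its input dicts are hypothetical. The skipped element follows the common JUnit convention, and whether tc report reads it is an assumption here.

```python
import xml.etree.ElementTree as ET


def build_junit(suite_name, results):
    """Build JUnit XML from dicts with id, title, classname, status, and optional message/detail."""
    results = list(results)
    failures = sum(1 for r in results if r["status"] == "fail")
    root = ET.Element("testsuites")
    suite = ET.SubElement(
        root, "testsuite",
        name=suite_name, tests=str(len(results)), failures=str(failures),
    )
    for r in results:
        # The [TC-<id>] prefix is what ties a result to an existing test case.
        case = ET.SubElement(
            suite, "testcase",
            classname=r["classname"], name=f"[TC-{r['id']}] {r['title']}",
        )
        if r["status"] == "fail":
            failure = ET.SubElement(case, "failure", message=r.get("message", ""))
            failure.text = r.get("detail", "")
        elif r["status"] == "skip":
            # Common JUnit convention; support in tc report is assumed, not confirmed.
            ET.SubElement(case, "skipped")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_junit("Sprint 12 Regression", [
        {"id": 42, "title": "Valid login redirects to dashboard",
         "classname": "Login", "status": "pass"},
        {"id": 43, "title": "Invalid password shows error", "classname": "Login",
         "status": "fail", "message": "Error message not found",
         "detail": "Expected: Error banner visible after wrong password."},
    ]))
```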

    What results look like in TestCollab

    After tc report uploads the JUnit XML, results appear in the test plan's Test Cases tab — each test case shows its execution result with a color-coded status:

    [Screenshot: Test plan results in TestCollab showing Passed, Failed, Skipped, and Unexecuted test cases]

    The Result column shows Passed (green), Failed (red), Skipped (yellow), and Unexecuted (grey) at a glance. Click any test case to see the full execution details, including failure messages with expected vs. actual results.

    See it in action

    We tested both Hermes Agent and Claude Code against the TestCollab website itself — homepage, pricing, and blog — with a 3-case smoke test plan.

    Hermes Agent results

    Test Plan: TestCollab Website Smoke Test
    Total: 3 | Passed: 3 | Failed: 0 | Skipped: 0
    
    TC-201 [Homepage] - PASS
      - testcollab.com loaded with title "Test Management Tool for Humans & AI Agents"
      - "TestCollab" brand name present in page content and navigation
      - "START FREE TRIAL" and "Try For Free Now" CTAs visible
    
    TC-202 [Pricing] - PASS
      - testcollab.com/pricing loaded with title "Plans & Pricing | TestCollab"
      - Three plan tiers present: Premium, Elite, Enterprise
      - Prices visible: $29/user/month (Premium), $39/user/month (Elite)
    
    TC-203 [Blog] - PASS
      - testcollab.com/blog loaded with title "Blog | TestCollab"
      - 12 blog posts visible on first page, each with a title
      - Every article shows both a category label and publication date

    Claude Code results

    Test Plan: TestCollab Website Smoke Test (Claude Code)
    Total: 3 | Passed: 3 | Failed: 0 | Skipped: 0
    
    TC-201 [Homepage] - PASS
      - Page title: "Test Management Tool for Humans & AI Agents - Test Collab"
      - "TestCollab logo" in navigation, brand name in header
      - "Start Free Trial" and "Try For Free Now" CTAs found via accessibility tree
    
    TC-202 [Pricing] - PASS
      - Page title: "Plans & Pricing | TestCollab"
      - Three plan headings: "Premium", "Elite", "Enterprise" identified
      - "$29" and "$39" price elements visible in plan cards
    
    TC-203 [Blog] - PASS
      - Page title: "Blog | TestCollab"
      - Multiple article elements with titles, each with category and date
      - Categories: "Announcements", "Quality Assurance", "Best Practices"

    Both agents passed all 3 test cases. The same test plan JSON, the same JUnit XML format, the same tc report upload — the only difference is which agent drove the browser.

    <testsuite name="TestCollab Website Smoke Test" tests="3" failures="0">
      <testcase classname="Homepage" name="[TC-201] Homepage displays brand name and signup CTA"/>
      <testcase classname="Pricing" name="[TC-202] Pricing page shows plan tiers"/>
      <testcase classname="Blog" name="[TC-203] Blog page lists recent articles"/>
    </testsuite>

    Tips for better results

    Keep test plans small. 10–30 test cases per plan works best. Larger plans risk context drift — the agent may lose track of what it's doing around case 40. Split large suites into multiple plans.
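
Splitting a large suite can be done mechanically. The sketch below divides a list of case IDs into evenly sized groups that never exceed a limit; split_cases is a hypothetical helper, and the default of 30 simply mirrors the upper bound suggested above.

```python
import math


def split_cases(case_ids, max_per_plan=30):
    """Split case IDs into the fewest evenly sized plans of at most max_per_plan each."""
    if max_per_plan < 1:
        raise ValueError("max_per_plan must be at least 1")
    total = len(case_ids)
    if total == 0:
        return []
    plan_count = math.ceil(total / max_per_plan)
    base, extra = divmod(total, plan_count)
    plans, start = [], 0
    for index in range(plan_count):
        # The first `extra` plans take one additional case to absorb the remainder.
        size = base + (1 if index < extra else 0)
        plans.append(case_ids[start:start + size])
        start += size
    return plans


if __name__ == "__main__":
    # The 47-case plan from the introduction becomes two plans.
    print([len(plan) for plan in split_cases(list(range(1, 48)))])
```

Balancing the groups, rather than filling the first plan to 30 and leaving a short remainder, keeps each agent run at a similar length.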

    Write concrete expected results. The agent needs something specific to check. Compare:

    • Vague: "The page loads correctly"
    • Concrete: "URL is /dashboard and a table with at least 3 rows is visible"

    Provide credentials in the prompt. Don't make the agent guess. Tell it the login URL, email, and password upfront.

    Use staging, not production. The agent interacts with your app like a real user — it can submit forms, create records, and modify data. Point it at a staging or local environment.

    Review failures manually. Agents sometimes misclassify slow page loads as failures, or miss subtle UI states. Use the results as a first pass, then review failures with human eyes before filing bugs.
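
For the manual review pass, it can be convenient to pull just the failures out of the JUnit file into a short list. This is a sketch under the assumption that the file follows the [TC-ID] naming shown earlier; failures_for_review is a hypothetical helper, not a TestCollab CLI feature.

```python
import re
import xml.etree.ElementTree as ET

# Matches names such as "[TC-43] Invalid password shows error".
TC_ID = re.compile(r"^\[TC-(\d+)\]\s*(.*)$")


def failures_for_review(junit_xml):
    """Return one dict per failed test case, ready for a human to triage."""
    rows = []
    for case in ET.fromstring(junit_xml).iter("testcase"):
        failure = case.find("failure")
        if failure is None:
            continue
        name = case.get("name", "")
        match = TC_ID.match(name)
        rows.append({
            "case_id": int(match.group(1)) if match else None,
            "title": match.group(2) if match else name,
            "message": failure.get("message", ""),
            # Collapse the indentation and newlines inside the failure body.
            "detail": " ".join((failure.text or "").split()),
        })
    return rows
```

Each row carries the expected-vs-actual text the agent recorded, which is usually enough to decide whether a failure is a real bug or a slow page load.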

    Choosing a QA agent

    | Feature | Hermes Agent | Claude Code | Codex |
    | --- | --- | --- | --- |
    | Browser automation | Built-in (MCP) | Built-in (playwright-cli) | Via playwright-cli (needs --full-auto) |
    | Skill install location | ~/.hermes/skills/software-development/testcollab-qa/ | ~/.claude/skills/testcollab-qa/ | ~/.codex/skills/testcollab-qa/ |
    | Invocation | Natural-language prompt | Natural-language prompt | Natural-language prompt |
    | Where you run it from | Your app's project dir | Your app's project dir | Your app's project dir |
    | Best for | Dedicated QA agent runs | Developer-driven QA during coding | Repo-aware QA inside an OpenAI-powered workflow |
    | Non-interactive/CI | hermes -z "..." | claude -p "..." | codex exec "..." |
    | LLM flexibility | 200+ models via OpenRouter | Claude models | OpenAI models |
    | Cost | Bring your own API key | Claude Code subscription | Codex subscription |
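
Since the three non-interactive invocations differ only in their leading arguments, a CI job could build them from one place. The sketch below uses only the command forms shown in this post. build_prompt and build_command are hypothetical helpers, and the Codex sandbox flag is deliberately left out because it depends on your Codex version.

```python
# Argument shapes taken from the non-interactive examples in this post.
AGENT_ARGV = {
    "hermes": lambda prompt: ["hermes", "-z", prompt, "--accept-hooks"],
    "claude": lambda prompt: ["claude", "-p", prompt],
    "codex": lambda prompt: ["codex", "exec", prompt],
}


def build_prompt(plan_id, project_id, base_url, email, password):
    """Compose the same natural-language prompt used throughout this post."""
    return (
        f"Execute test plan {plan_id} in project {project_id} against {base_url}. "
        f"Login: {email} / {password}."
    )


def build_command(agent, prompt):
    """Return the argv list for the chosen agent, or raise for an unknown one."""
    try:
        return AGENT_ARGV[agent](prompt)
    except KeyError:
        raise ValueError(f"unknown agent: {agent}") from None


if __name__ == "__main__":
    prompt = build_prompt(555, 16, "http://staging.example.com",
                          "testuser@example.com", "password123")
    print(build_command("claude", prompt))
```

Passing the resulting list to subprocess.run(..., check=True) avoids shell quoting problems with the prompt, and reading the credentials from CI secrets instead of hard-coding them keeps them out of the repository.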

    What's next

    The skills for all three agents and their install scripts are open source in the TestCollab CLI repo under hermes-skill/, claude-code-skill/, and codex-skill/. Install each one to its agent's user-global skills directory, work from your own project, and the tc getTestPlan → agent → tc report loop is the same regardless of which agent you pick.

    If you're already using the TestCollab MCP Server for AI-powered test case generation, this closes the loop: AI generates the test cases, humans curate them into plans, and an AI agent executes them.

    The agentic QA workflow works with any agent that has browser automation — not just the three covered here. The tc getTestPlan + tc report pattern is the contract. See the Agentic QA Guide for the framework-agnostic approach.