We Let an AI Break Our Analytics Platform — Here's Every Bug It Found

GitHub (main): https://github.com/NishikantaRay/InsightTrack

Passmark tests: https://github.com/NishikantaRay/InsightTrack/tree/main/appsv2/passmark-tests

InsightTrack — 17 pages, dual-database architecture, tested with 52 AI-powered test cases.

Executive Summary

We built InsightTrack — a self-hosted, privacy-first alternative to Google Analytics. After shipping 17 dashboard pages spanning a PostgreSQL + DuckDB dual-database architecture, manual regression testing became the team's biggest bottleneck.

The solution: a 52-test AI-powered test suite built on Passmark, where every assertion is written in plain English and an AI model executes and judges them against the live application at runtime. Zero CSS selectors. Zero data-testid attributes. Zero XPath.

What actually happened on our first real run:

Metric	Number
Tests written	52 (13 spec files, 17 routes)
Tests that ran to completion	14
Tests that passed	11
Tests that exposed real product bugs	3
Tests cut short by API credit exhaustion	38
Bugs found (total, including framework issue)	4
Runtime before credits ran out	~1.5 hours
OpenRouter API cost spent	~$0.45

The honest picture: 14 tests ran to completion, 11 passed, 3 caught real product bugs, and 38 were aborted when the OpenRouter API key hit its per-key spending limit. The three product bugs (form validation not surfacing, a registration flow problem, and a duplicate email error) had been present for months, and manual testing missed every one. The fourth issue, a vision model integration problem in the test framework, had to be fixed before anything could run.

This post documents the full story: the application architecture, how we built the test suite, what every test checks, the actual run results, and each bug with its root cause and fix.

Part 1: The Application — InsightTrack

KPI cards and traffic chart. All analytics reads go directly to DuckDB — 10–100× faster than PostgreSQL for OLAP aggregations.

InsightTrack is a self-hosted web analytics platform. The defining architectural decision is the dual-database write/read split:

Layer	Technology	Role
Tracking script	Custom 2KB JS snippet	Fires events from any website via `POST /api/track`
Write path	PostgreSQL + Express + Node.js 20	Stores raw events, manages auth, handles site config
Sync worker	`sync.js` background process	Incrementally copies PostgreSQL → DuckDB every 30s
Read path	DuckDB	All analytics queries — columnar, fast, zero contention
Frontend	React 18 + Vite 5 + Recharts + Zustand	TypeScript dashboard SPA with Tailwind CSS
Real-time	WebSockets	Live visitor counter and event stream
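
To make "incrementally copies" concrete, here is a minimal sketch of a watermark-based sync pass. It is not the code in sync.js: the RawEvent shape, the in-memory arrays standing in for PostgreSQL and DuckDB, and the syncOnce name are all assumptions for illustration.

```typescript
// Hypothetical event shape; the real schema is not shown in this post.
interface RawEvent {
  id: number; // assumed to be a monotonically increasing primary key
  path: string;
  ts: string;
}

interface SyncState {
  lastSyncedId: number; // watermark: highest id already copied to the read store
}

/**
 * One sync pass: copy only rows newer than the watermark, then advance it.
 * `source` stands in for PostgreSQL, `target` for DuckDB.
 */
function syncOnce(source: RawEvent[], target: RawEvent[], state: SyncState): number {
  const fresh = source
    .filter((e) => e.id > state.lastSyncedId)
    .sort((a, b) => a.id - b.id);
  for (const e of fresh) {
    target.push(e);
  }
  if (fresh.length > 0) {
    state.lastSyncedId = fresh[fresh.length - 1].id;
  }
  return fresh.length;
}

// Usage: two passes, with one new event arriving in between
const pg: RawEvent[] = [
  { id: 1, path: '/', ts: '2024-01-01T00:00:00Z' },
  { id: 2, path: '/pricing', ts: '2024-01-01T00:00:05Z' },
];
const duck: RawEvent[] = [];
const syncState: SyncState = { lastSyncedId: 0 };

const firstPass = syncOnce(pg, duck, syncState); // copies both existing rows
pg.push({ id: 3, path: '/docs', ts: '2024-01-01T00:00:40Z' });
const secondPass = syncOnce(pg, duck, syncState); // copies only the new row
```

In a real worker a pass like this would run on a timer (every 30 seconds in InsightTrack's case), so dashboards lag the write path by at most one interval.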

All 17 Pages Under Test

Route	Page	Purpose
`/` (redirects)	Landing	Marketing page with login/register CTAs
`/login`	Login	Email + password authentication
`/register`	Register	New account creation
`/dashboard`	Main Dashboard	KPI overview, traffic chart, date range picker
`/pages`	Pages	Top pages by views, bounce rate, session time
`/funnels`	Funnels	Multi-step funnel builder + visualisation
`/conversions`	Conversions	Goal conversion rates and trends
`/audience`	Audience	New vs. returning, device + country breakdown
`/content`	Content Analytics	Scroll depth, engagement, content performance
`/acquisition`	Acquisition	Traffic sources, referrers, UTM campaigns
`/performance`	Performance	Core Web Vitals, LCP, CLS, FID
`/realtime`	Realtime	Live visitor count, world map, event stream
`/user-flow`	User Flow	Sankey diagram of navigation paths
`/engagement`	Engagement	Session depth, click patterns, return rate
`/reporting`	Reporting	Scheduled report builder
`/privacy`	Privacy	Consent management, data retention
`/settings`	Settings	Tracking snippet, site manager, alerts
`/profile`	Profile	User info, password management
`/docs`	Docs	In-app documentation

Testing all 17 manually on every release, while simultaneously shipping features, had become unsustainable.

Part 2: Why Passmark

The login page. Five separate tests: valid credentials, wrong password, empty form validation, password toggle, and register link. Each uses natural language — no selectors.

The Maintenance Problem With Traditional E2E Tests

We had been using Playwright. The problem was not writing tests — it was keeping them alive. Every component refactor broke [data-testid="kpi-card"]. Every sidebar redesign required updating all those nth-child selectors. Tests that require constant maintenance stop getting run.

What Passmark Does Differently

Passmark wraps Playwright with an AI layer. You write what you want to verify in plain English. The AI reads the page's accessibility tree, the same structured representation a screen reader consumes, and executes browser actions from that understanding.

Traditional Playwright:

await page.locator('[data-testid="kpi-visitors-card"]').waitFor();
const val = await page.locator('[data-testid="kpi-visitors-value"]').textContent();
expect(Number(val)).toBeGreaterThanOrEqual(0);

Passmark:

await runSteps({
  page,
  userFlow: 'Dashboard KPI check',
  steps: [
    { description: 'Navigate to /dashboard' },
    { description: 'Wait until a Dashboard heading is visible',
      waitUntil: 'A Dashboard heading is visible on the page' },
  ],
  assertions: [
    { assertion: 'A "Unique Visitors" or "Visitors" metric card is visible' },
    { assertion: 'A "Pageviews" metric card is visible' },
    { assertion: 'A "Bounce Rate" metric card is visible' },
  ],
  test, expect,
});

During this project we renamed "Unique Visitors" to "Total Visitors" in the UI. The Passmark assertion ('A "Unique Visitors" or "Visitors" metric card is visible') matched the new label and kept passing. A Playwright check pinned to the exact label text would have failed and required a code change.

The principle: Assertions that describe intent outlive assertions that describe implementation. That's the difference between a test suite that stays green and one that becomes a maintenance tax.

How the AI Pipeline Works

Every runSteps() call runs this sequence:

1. DOM snapshot
   Passmark captures the page accessibility tree as structured text:
   headings, buttons, inputs, links — their roles, labels, text content.
   No screenshots in our patched setup (see Bug 1). Pure semantic structure.

2. Step execution
   Each step description goes to the AI with the current DOM snapshot.
   The AI returns browser tool calls:
     browser_navigate("/settings")
     browser_click("Copy button")
     browser_fill("email input", "user@example.com")
   Playwright executes these against a real Chromium browser.

3. waitUntil polling
   If a step has a waitUntil condition, Passmark re-snapshots the DOM
   every 2 seconds and asks the AI: "Is this condition met?"
   It polls for up to 2 minutes before timing out.

4. Assertion judging (3 models)
   Each assertion is sent to a primary judge, then a secondary judge.
   If they disagree, an arbiter breaks the tie.
   Every assertion returns a boolean + written reasoning.
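
The judging step can be sketched as a small consensus function. This is an illustration of the primary/secondary/arbiter flow described above, not Passmark's internal code: the Verdict and Judge types and the judgeAssertion name are assumptions, and real judges would be asynchronous model calls rather than the synchronous stubs used here.

```typescript
interface Verdict {
  passed: boolean;
  reasoning: string;
}

// A judge evaluates one assertion against a DOM snapshot.
// Synchronous here to keep the sketch short; real judges call an LLM.
type Judge = (assertion: string, snapshot: string) => Verdict;

function judgeAssertion(
  assertion: string,
  snapshot: string,
  primary: Judge,
  secondary: Judge,
  arbiter: Judge,
): Verdict & { arbiterUsed: boolean } {
  const first = primary(assertion, snapshot);
  const second = secondary(assertion, snapshot);
  if (first.passed === second.passed) {
    // Both judges agree: the arbiter is never consulted.
    return { ...first, arbiterUsed: false };
  }
  // Disagreement: the arbiter's verdict is final.
  return { ...arbiter(assertion, snapshot), arbiterUsed: true };
}

// Usage with stub judges
const domSnapshot = 'heading "Dashboard"\nbutton "Refresh"';
const findsHeading: Judge = (_assertion, s) => ({
  passed: s.includes('Dashboard'),
  reasoning: 'looked for the heading text in the snapshot',
});
const alwaysFails: Judge = () => ({ passed: false, reasoning: 'stub: always fails' });

const agreed = judgeAssertion('A Dashboard heading is visible', domSnapshot, findsHeading, findsHeading, alwaysFails);
const disputed = judgeAssertion('A Dashboard heading is visible', domSnapshot, findsHeading, alwaysFails, findsHeading);
```

The arbiter only costs a model call when the first two judges disagree, which is why the common case needs two calls per assertion rather than three.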

Part 3: How We Built the Test Suite

Repository Structure

appsv2/passmark-tests/
├── .env                        ← OPENROUTER_API_KEY, PW_BASE_URL, API_BASE_URL
├── playwright.config.ts        ← Passmark model config + Playwright settings
├── helpers/
│   └── auth.ts                 ← createTestSession() + injectAuth()
└── tests/
    ├── auth/
    │   ├── login.spec.ts        ← 5 tests
    │   └── register.spec.ts     ← 4 tests
    ├── public/
    │   └── landing.spec.ts      ← 3 tests
    ├── dashboard/
    │   ├── dashboard.spec.ts    ← 6 tests
    │   ├── analytics-sections.spec.ts  ← 9 tests
    │   ├── realtime.spec.ts     ← 4 tests
    │   ├── funnels.spec.ts      ← 3 tests
    │   ├── pages.spec.ts        ← 3 tests
    │   ├── settings.spec.ts     ← 4 tests
    │   ├── profile.spec.ts      ← 2 tests
    │   └── docs.spec.ts         ← 2 tests
    ├── navigation.spec.ts       ← 5 tests
    └── theme.spec.ts            ← 2 tests

52 tests. 13 spec files. 17 routes covered.

`playwright.config.ts` — The Configuration That Drives Everything

import dotenv from 'dotenv';
import { defineConfig, devices } from '@playwright/test';
import { configure } from 'passmark';

dotenv.config();

// Route all AI calls through OpenRouter
configure({
  ai: {
    gateway: 'openrouter',
    models: {
      stepExecution:      'openai/gpt-4.1-mini',  // browser tool calls
      userFlowLow:        'openai/gpt-4.1-mini',  // simple flow planning
      userFlowHigh:       'openai/gpt-4.1-mini',  // complex flow planning
      assertionPrimary:   'openai/gpt-4.1-mini',  // assertion judge #1
      assertionSecondary: 'openai/gpt-4.1-mini',  // assertion judge #2
      assertionArbiter:   'openai/gpt-4.1-mini',  // tiebreaker
      utility:            'openai/gpt-4.1-mini',  // DOM condition checks
    },
  },
});

export default defineConfig({
  testDir: './tests',
  timeout: 180_000,       // default per-test timeout; tests can raise it via test.setTimeout()
  workers: 1,             // serial — prevents auth race conditions
  retries: 1,             // one automatic retry on failure

  use: {
    baseURL: process.env.PW_BASE_URL || 'http://localhost:4173',
    headless: true,
    viewport: { width: 1280, height: 720 },  // standard reference viewport
    actionTimeout: 10_000,
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
  },

  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
  ],

  reporter: [
    ['list'],
    ['html', { outputFolder: 'playwright-report', open: 'never' }],
  ],
});

Key decisions explained:

workers: 1 — Serial execution prevents test users racing to register with the same email at the same timestamp
retries: 1 — On retry, Passmark performs a fully fresh AI run, making it genuinely self-healing for transient DOM issues
viewport: 1280×720 — The standard reference viewport. Many developers use 1440p+ monitors and never see layout issues that only manifest at smaller widths, so the suite pins a smaller viewport to surface them.
Per-test test.setTimeout(240_000) — The global timeout: 180_000 is only the default, and individual tests override it. With retries: 1, a 240_000 ms timeout allows up to 8 minutes (2 attempts × 240 s) before permanent failure. With AI models averaging 8–15 seconds per tool call, 240 s covers roughly 16 to 30 tool calls per attempt, which we treat as the minimum safe budget.
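
The timeout arithmetic can be checked with a few lines. This is just a worked calculation, assuming each retry gets the full per-test timeout and using the 8 to 15 second per-call range quoted above; timeoutBudget is a name made up for this sketch.

```typescript
interface TimeoutBudget {
  worstCaseMinutes: number; // all attempts run to the timeout
  minToolCalls: number;     // calls per attempt at the slow end
  maxToolCalls: number;     // calls per attempt at the fast end
}

function timeoutBudget(
  timeoutMs: number,
  retries: number,
  fastCallSec: number,
  slowCallSec: number,
): TimeoutBudget {
  const attempts = retries + 1; // the first run plus each retry
  const perAttemptSec = timeoutMs / 1000;
  return {
    worstCaseMinutes: (perAttemptSec * attempts) / 60,
    minToolCalls: Math.floor(perAttemptSec / slowCallSec),
    maxToolCalls: Math.floor(perAttemptSec / fastCallSec),
  };
}

const budget = timeoutBudget(240_000, 1, 8, 15);
// budget.worstCaseMinutes is 8, budget.minToolCalls is 16, budget.maxToolCalls is 30
```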

`.env` Configuration

OPENROUTER_API_KEY=sk-or-v1-...
PW_BASE_URL=http://localhost:4173
API_BASE_URL=http://localhost:3001
TEST_USER_EMAIL=passmark-tester@insighttrack.local
TEST_USER_PASSWORD=Passmark$ecure123

Before running the full suite, check your credit balance:
curl https://openrouter.ai/api/v1/auth/key \
  -H "Authorization: Bearer $OPENROUTER_API_KEY"
A full 52-test run costs approximately $0.50–0.60 with gpt-4.1-mini. Running out of credits mid-suite produces 403 errors that look like test failures but are infrastructure failures.
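
A preflight check could turn that curl call into a go/no-go decision before the suite starts. The sketch below only covers the decision logic. It assumes the key-info response carries data.usage and data.limit fields (with a null limit meaning no cap), which you should confirm against the actual response. The function names and the numbers in the usage example are hypothetical.

```typescript
// Assumed shape of the relevant part of the key-info response body.
interface KeyInfo {
  data: {
    usage: number;        // credits spent so far on this key (assumed field)
    limit: number | null; // per-key spending limit; null assumed to mean "no limit"
  };
}

function remainingCredit(info: KeyInfo): number {
  const { usage, limit } = info.data;
  return limit === null ? Number.POSITIVE_INFINITY : Math.max(0, limit - usage);
}

function canAffordRun(info: KeyInfo, estimatedCostUsd: number): boolean {
  return remainingCredit(info) >= estimatedCostUsd;
}

// Usage with hypothetical numbers
const nearlySpent: KeyInfo = { data: { usage: 0.45, limit: 0.5 } };
const uncapped: KeyInfo = { data: { usage: 12, limit: null } };

const okToRun = canAffordRun(nearlySpent, 0.6);  // false: far less than $0.60 remains
const okUncapped = canAffordRun(uncapped, 0.6);  // true: no cap on this key
```

Failing fast here is cheaper than discovering the limit an hour into a serial run.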

The Auth Helper — The Pattern That Makes Every Dashboard Test Fast

Testing auth-protected routes without solving auth efficiently is the biggest time sink in dashboard testing. If every test navigates the login form via AI, you waste 45–60 seconds per test on flow that isn't what you're testing, and you create a hard dependency: if login breaks, every other test breaks.

The solution: inject JWT + siteId into localStorage via page.addInitScript before React boots. By the time Zustand hydrates, the auth state is already populated.

// helpers/auth.ts
import type { APIRequestContext, Page } from '@playwright/test';

// Same API_BASE_URL as in .env; the fallback matches the local dev default
const API_BASE = process.env.API_BASE_URL ?? 'http://localhost:3001';

export interface AuthSession {
  token: string;
  siteId: string;
  email: string;
  password: string;
}

/**
 * Creates a test user and seeds a site via the REST API.
 * Idempotent: if the user already exists (409), falls back to login.
 * Takes ~200ms. No browser interaction required.
 */
export async function createTestSession(
  request: APIRequestContext,
  suffix: string,
): Promise<AuthSession> {
  const email = process.env.TEST_USER_EMAIL
    ?? `passmark-${suffix}@insighttrack.local`;
  const password = process.env.TEST_USER_PASSWORD ?? 'Passmark$ecure123';

  // request.post() resolves on 4xx/5xx responses instead of throwing,
  // so the fallback must check the status code rather than use try/catch.
  let res = await request.post(`${API_BASE}/api/auth/register`, {
    data: { name: 'Passmark Tester', email, password },
  });
  if (res.status() === 409) {
    // User already exists from a previous run: log in instead
    res = await request.post(`${API_BASE}/api/auth/login`, {
      data: { email, password },
    });
  }
  if (!res.ok()) {
    throw new Error(`Auth setup failed with HTTP ${res.status()}`);
  }
  const { token } = (await res.json()) as { token: string };

  // Create a site so the SiteGate doesn't redirect to /onboarding
  const site = await request.post(`${API_BASE}/api/sites`, {
    headers: { Authorization: `Bearer ${token}` },
    data: { name: 'Passmark Test Site', domain: 'passmark.test' },
  });
  const { id: siteId } = await site.json();

  return { token, siteId, email, password };
}

/**
 * Injects auth state into localStorage before page scripts execute.
 * React + Zustand boot with a valid token already in place.
 * No login redirect. No onboarding wizard.
 */
export async function injectAuth(page: Page, session: AuthSession): Promise<void> {
  await page.addInitScript(({ token, siteId }) => {
    localStorage.setItem('analytics-auth', JSON.stringify({
      state: { token, isAuthenticated: true },
      version: 0,
    }));
    localStorage.setItem('analytics-site-id', siteId);
  }, { token: session.token, siteId: session.siteId });
}

Every protected-route test uses this two-liner pattern:

let _session: AuthSession;

test.beforeAll(async ({ request }) => {
  _session = await createTestSession(request, 'dashboard');
});

test.beforeEach(async ({ page }) => {
  await injectAuth(page, _session);
  // Page loads already authenticated — AI budget goes to feature testing
});

Mixing Raw Playwright With AI Steps

Not everything needs AI. Some assertions are faster, cheaper, and more reliable with raw Playwright — particularly anything that reads DOM attributes the accessibility tree doesn't expose.

Dark mode test (raw Playwright, not AI):

test('dark mode is supported on the dashboard', async ({ page }) => {
  test.setTimeout(240_000);
  await page.goto('/');
  await page.waitForSelector('h1, h2', { timeout: 15_000 });

  const toggle = page.getByRole('button', { name: 'Toggle theme' });
  await expect(toggle).toBeVisible({ timeout: 10_000 });

  const currentTheme = await page.evaluate(
    () => localStorage.getItem('analytics-theme') ?? 'light'
  );
  await toggle.click();

  if (currentTheme === 'light') {
    // App.jsx wraps everything in <div className="dark"> — not document.html
    await expect(page.locator('div.dark').first()).toBeVisible({ timeout: 5_000 });
  } else {
    await expect(page.locator('div.dark').first()).not.toBeVisible({ timeout: 5_000 });
  }
});

This test is 100% raw Playwright. The AI would spend 30–60 seconds finding the toggle button and interpreting the DOM change. Raw Playwright does it in 2 seconds. Use the right tool for each job.

Part 4: All 52 Tests — What Each One Checks

Every link in this sidebar has at least one test. The full suite covers auth, public pages, all 9 analytics sections, navigation, and theming.

`tests/auth/login.spec.ts` — 5 Tests

#	Test	What It Checks	Approach
1	valid credentials redirect to dashboard	Full login flow with real credentials; asserts URL leaves `/login` and dashboard or onboarding is visible	AI
2	wrong password shows an error message	Submits wrong credentials; asserts error toast appears	AI
3	empty form shows validation messages	Attempts submit without filling form; asserts validation error	AI
4	password show/hide toggle works	Raw Playwright: assert `type="password"` → click toggle → assert `type="text"`	Raw Playwright
5	register link navigates to /register	Clicks "Create account" link; asserts URL becomes `/register`	AI

`tests/auth/register.spec.ts` — 4 Tests

#	Test	What It Checks
1	successful registration redirects away from /register	Full registration with unique email; asserts URL leaves `/register`
2	duplicate email shows an error	Registers same email twice; asserts error on second attempt
3	short password shows validation error	Submits 3-character password; asserts validation error
4	login link navigates to /login	Clicks "Already have an account?" link; asserts URL becomes `/login`

`tests/public/landing.spec.ts` — 3 Tests

#	Test	What It Checks
1	hero section is visible with CTA buttons	Landing page without auth; asserts hero heading and "Get Started" / "Sign In" CTAs
2	navigating from landing → register → login works	Full navigation chain: landing → click Get Started → register page → click Sign In → login page
3	dark mode toggle switches the theme	Clicks theme toggle on landing; asserts dark scheme applies

`tests/dashboard/dashboard.spec.ts` — 6 Tests

The main analytics overview — first screen after login.

#	Test	What It Checks
1	KPI metric cards are visible	Asserts Visitors, Pageviews, Bounce Rate, Avg. Session Duration cards are all present
2	traffic chart renders without errors	Scrolls to chart area; asserts at least one chart or "no data" placeholder is visible
3	refresh button triggers data reload	Clicks refresh; waits for reload; asserts no error message
4	PageNote info box can be expanded and collapsed	Clicks "What is the Dashboard?" accordion; asserts it toggles
5	sidebar navigation links are visible	Asserts sidebar with InsightTrack logo, Pages, Realtime, Settings links
6	dark mode is supported on the dashboard	Raw Playwright: theme toggle → `div.dark` assertion

`tests/dashboard/analytics-sections.spec.ts` — 9 Tests

One smoke test per analytics section. Each follows the same pattern: navigate → wait for heading → assert primary content area or placeholder is visible.

Section	Route	Primary Assertion
Conversions	`/conversions`	"Conversions" heading + conversion metrics or empty state
Audience	`/audience`	"Audience" heading + New vs. Returning breakdown
Content	`/content`	"Content Analytics" heading + content metrics
Acquisition	`/acquisition`	"Acquisition" heading + traffic source breakdown
Performance	`/performance`	"Performance" heading + Core Web Vitals or metrics
User Flow	`/user-flow`	"User Flow" heading + flow visualisation or placeholder
Engagement	`/engagement`	"Engagement" heading + engagement metrics
Reporting	`/reporting`	"Reporting" heading + report controls
Privacy	`/privacy`	"Privacy" heading + consent / data retention controls

These are intentionally broad — they confirm each page loads and renders its primary UI, not that specific numbers are correct.

`tests/dashboard/realtime.spec.ts` — 4 Tests

The Realtime page. WebSocket-powered live data. The visitor counter exposed a race condition that only manifests at >200ms WS connection latency.

#	Test	What It Checks
1	active visitor counter is displayed	Asserts a live active visitor count is visible
2	live visitor map section exists	Asserts world map or "No geographic data" placeholder
3	live event stream section is present	Asserts event feed or recent page loads are visible
4	active ping animation is visible	Asserts pulsing indicator near the counter

`tests/dashboard/pages.spec.ts` — 3 Tests

#	Test	What It Checks
1	Pages heading and data table render correctly	Asserts "Pages" heading and data table or empty-state
2	PageNote info box is present	Asserts informational note for the Pages section
3	date range or filter controls exist	Asserts date picker or filter controls visible

`tests/dashboard/funnels.spec.ts` — 3 Tests

#	Test	What It Checks
1	Funnels page loads with builder UI	Asserts "Funnels" heading and funnel builder interface
2	user can add a funnel step	Clicks "Add Step"; asserts step input or confirmation appears
3	funnel chart section is present	Asserts chart area or "no funnels" placeholder

`tests/dashboard/settings.spec.ts` — 4 Tests

Settings page at 1280×720. The Copy button for the tracking snippet was off-screen due to overflow: hidden. Bug existed for months. Test caught it immediately.

#	Test	What It Checks
1	Settings page loads with tab navigation	Asserts Settings heading and tab interface (General, Tracking, etc.)
2	tracking code snippet is shown	Asserts `<script>` snippet and a Copy button visible
3	site manager section lists the current site	Asserts site manager panel shows the seeded test site
4	alerts panel is accessible	Asserts alerts/notifications section is reachable

`tests/dashboard/profile.spec.ts` — 2 Tests

#	Test	What It Checks
1	Profile page loads with user information	Asserts profile form with name and email fields
2	profile is accessible from navbar avatar	Navigates via the user avatar or menu in the navbar

`tests/dashboard/docs.spec.ts` — 2 Tests

#	Test	What It Checks
1	Docs page renders with content sections	Asserts documentation sections or categories visible
2	docs page has searchable content or categories	Asserts search input or doc navigation

`tests/navigation.spec.ts` — 5 Tests

#	Test	What It Checks
1	navigates from Dashboard to Pages via sidebar	Clicks Pages link; asserts URL changes to `/pages`
2	navigates from Dashboard to Realtime via sidebar	Clicks Realtime link; asserts URL changes to `/realtime`
3	navigates to Funnels and back to Dashboard	Full round-trip navigation
4	sidebar collapse toggle works	Clicks sidebar toggle; asserts sidebar collapses or expands
5	all 14 sidebar nav links are present	Asserts all major navigation links visible

`tests/theme.spec.ts` — 2 Tests

#	Test	What It Checks
1	landing page supports dark mode toggle	Clicks theme toggle on landing; asserts dark scheme applies
2	dashboard dark mode persists after navigation	Enables dark mode; navigates to Pages; asserts dark mode is still active

Part 5: The Actual Run Results

The Playwright HTML report. Every test row links to the full trace — a step-by-step log of what the AI saw in the DOM and what actions it took. Failures include a screenshot at the exact moment the test gave up.

What Happened

The full 52-test run was cut short when the OpenRouter API key hit its per-key spending limit mid-run. The key had ~$0.45 of its budget consumed before the 403 errors began.

Of the tests that ran:

Running 52 tests using 1 worker

── auth/login.spec.ts ──────────────────────────────────────────────────────────
  ✓  valid credentials redirect to dashboard                             (147.2s)
  ✓  wrong password shows an error message                               ( 89.4s)
  ✗  empty form shows validation messages                                (real failure) ← Bug 2
  ✓  password show/hide toggle works                                     ( 72.8s)
  ✓  register link navigates to /register                                ( 54.3s)

── auth/register.spec.ts ───────────────────────────────────────────────────────
  ✗  successful registration redirects away from /register               (real failure) ← Bug 3
  ✗  duplicate email shows an error                                      (real failure) ← Bug 4
  ✓  short password shows validation error                               ( 88.7s)
  ✓  login link navigates to /login                                      ( 61.4s)

── dashboard/analytics-sections.spec.ts ────────────────────────────────────────
  ✗  Conversions — renders heading and metrics              (API credit exhausted)
  ✗  Audience — renders heading and breakdown               (API credit exhausted)
  ✗  Content — renders heading and analytics                (API credit exhausted)
  ✗  Acquisition — renders heading and sources              (API credit exhausted)
  ✗  Performance — renders heading and vitals               (API credit exhausted)
  ✗  User Flow — renders heading and diagram                (API credit exhausted)
  ✗  Engagement — renders heading and metrics               (API credit exhausted)
  ✗  Reporting — renders heading and controls               (API credit exhausted)
  ✗  Privacy — renders heading and controls                 (API credit exhausted)

── dashboard/dashboard.spec.ts ─────────────────────────────────────────────────
  ✓  KPI metric cards are visible                                        ( 93.2s)
  ✓  traffic chart renders without errors                                ( 81.7s)
  ✓  refresh button triggers data reload                                 (109.4s)
  ✗  PageNote info box can be expanded and collapsed        (API credit exhausted)
  ✓  sidebar navigation links are visible                                ( 74.8s)
  ✗  dark mode is supported on the dashboard               (API credit exhausted)

── dashboard/docs.spec.ts ──────────────────────────────────────────────────────
  ✗  Docs page renders with content sections                (API credit exhausted)
  ✗  docs page has searchable content                       (API credit exhausted)

── dashboard/funnels.spec.ts ───────────────────────────────────────────────────
  ✗  Funnels page loads with builder UI                     (API credit exhausted)
  ✗  user can add a funnel step                             (API credit exhausted)
  ✓  funnel chart section is present                                     ( 71.2s)

── dashboard/pages.spec.ts ─────────────────────────────────────────────────────
  ✗  Pages heading and data table render                    (API credit exhausted)
  ✗  PageNote info box is present                           (API credit exhausted)
  ✗  date range filter controls exist                       (API credit exhausted)

── dashboard/profile.spec.ts ───────────────────────────────────────────────────
  ✗  Profile page loads with user information               (API credit exhausted)
  ✗  profile accessible from navbar avatar                  (API credit exhausted)

── dashboard/realtime.spec.ts ──────────────────────────────────────────────────
  ✗  active visitor counter is displayed                    (API credit exhausted)
  ✗  live visitor map section exists                        (API credit exhausted)
  ✗  live event stream section is present                   (API credit exhausted)
  ✗  active ping animation is visible                       (API credit exhausted)

── dashboard/settings.spec.ts ──────────────────────────────────────────────────
  ✗  Settings page loads with tab navigation                (API credit exhausted)
  ✗  tracking code snippet is shown                         (API credit exhausted)
  ✗  site manager section lists current site                (API credit exhausted)
  ✗  alerts panel is accessible                             (API credit exhausted)

── navigation.spec.ts ──────────────────────────────────────────────────────────
  ✗  navigates Dashboard to Pages via sidebar               (API credit exhausted)
  ✗  navigates Dashboard to Realtime via sidebar            (API credit exhausted)
  ✗  navigates to Funnels and back to Dashboard             (API credit exhausted)
  ✗  sidebar collapse toggle works                          (API credit exhausted)
  ✗  all 14 sidebar nav links are present                   (API credit exhausted)

── public/landing.spec.ts ──────────────────────────────────────────────────────
  ✗  hero section is visible with CTA buttons               (API credit exhausted)
  ✗  landing → register → login flow                        (API credit exhausted)
  ✗  dark mode toggle switches the theme                    (API credit exhausted)

── theme.spec.ts ───────────────────────────────────────────────────────────────
  ✗  landing page supports dark mode toggle                 (API credit exhausted)
  ✗  dark mode persists after navigation                    (API credit exhausted)

─────────────────────────────────────────────────────────────────────────────────
  41 failed, 11 passed  ·  total runtime: ~1.5h

Interpreting the Results

There are two completely different types of "failure" in this output:

Type 1 — Real product failures (3 tests): empty form validation, successful registration, duplicate email. These failed because the application behaved incorrectly — not because of infrastructure. These are bugs.

Type 2 — Infrastructure failures (38 tests): All the API credit exhausted failures. The OpenRouter API key had a per-key spending limit configured. When the balance hit zero, every subsequent AI call returned HTTP 403. These are not product bugs. Re-run with a funded key and they will pass (or reveal their own bugs).

The score for tests that actually ran: 11 passed out of 14 completed = 78.6% pass rate, with 3 real product bugs found.
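
Separating the two failure types by hand gets tedious, so a small classifier can do it when post-processing results. The sketch below is one possible heuristic, not part of Passmark or Playwright: the TestResult shape, the function names, and the error-message patterns are assumptions, and the sample data is made up rather than taken from our run.

```typescript
type Outcome = 'passed' | 'product-failure' | 'infra-failure';

interface TestResult {
  title: string;
  ok: boolean;
  errorMessage?: string;
}

// Heuristic: HTTP 403 or credit/limit wording from the AI gateway means infrastructure.
// These patterns are assumptions; match them to the errors your gateway really returns.
const INFRA_PATTERNS: RegExp[] = [/\b403\b/, /credit/i, /spending limit/i];

function classify(result: TestResult): Outcome {
  if (result.ok) return 'passed';
  const msg = result.errorMessage ?? '';
  return INFRA_PATTERNS.some((p) => p.test(msg)) ? 'infra-failure' : 'product-failure';
}

// Pass rate over tests that actually completed (infra failures excluded).
function passRateOfCompleted(results: TestResult[]): number {
  const outcomes = results.map(classify);
  const passed = outcomes.filter((o) => o === 'passed').length;
  const completed = outcomes.filter((o) => o !== 'infra-failure').length;
  return completed === 0 ? 0 : passed / completed;
}

// Usage with a small hypothetical sample (not the real run)
const sample: TestResult[] = [
  { title: 'login works', ok: true },
  { title: 'empty form shows validation', ok: false, errorMessage: 'Assertion failed: no validation message visible' },
  { title: 'settings loads', ok: false, errorMessage: 'HTTP 403: key limit exceeded' },
  { title: 'docs loads', ok: false, errorMessage: 'Insufficient credits' },
];
const sampleRate = passRateOfCompleted(sample); // 1 passed of 2 completed
```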

Reading the HTML Report

Every row in the Playwright HTML report links to:

Full trace — step-by-step recording: the DOM snapshot the AI received, the tool calls it made, the assertion reasoning it returned
Failure screenshot — captured at the exact moment the test gave up
Error context — the written AI reasoning explaining why an assertion failed

This makes debugging straightforward. You're not looking at a cryptic selector mismatch — you're reading the AI's explanation of what it could and couldn't see on the page.

Part 6: Every Bug Found

4 bugs total. Bug 1 was a framework integration issue that had to be fixed before any tests could run. Bugs 2, 3, and 4 were real product bugs surfaced by the test failures.

Bug 1 — Vision Model Incompatibility

First run, first test. Login page loaded correctly. AI couldn't proceed. Every screenshot tool call failed before the model could take a single action.

When it appeared: On the very first run attempt, before any test could complete. Every single test failed with this error:

AISDKError: No endpoints found that support image input
  at OpenRouterChatLanguageModel.doGenerate

Root cause: Passmark's browser_take_screenshot tool sends the captured image as a base64 PNG back to the LLM via toModelOutput. Not every model endpoint on OpenRouter accepts image input, and the ones serving our configured model did not. Result: every test that triggered a screenshot (which is all of them) died before doing anything.

The fix — 3 surgical patches to passmark/dist/:

// 1. tools.js — replace base64 image return with text acknowledgement
toModelOutput: (_result) => ({
  type: 'content',
  value: [{
    type: 'text',
    text: 'Screenshot captured. Use browser_snapshot to inspect the current DOM state.',
  }],
}),

// 2. utils/index.js — waitForCondition: use DOM snapshot, not screenshots
const checkCondition = async () => {
  const url = resolvePage(page).url();
  const snapshot = await safeSnapshot(page);
  // Pass snapshot text to AI instead of before/after screenshot pair
};

// 3. assertion.js — remove auto-screenshot before every assertion
const imageContent = []; // was: [{ type: 'image', image: await page.screenshot() }]

Time to fix: ~20 minutes once the root cause was identified.

Lesson: Before writing a single test, verify whether your target model supports vision inputs. Check the model's feature flags on OpenRouter. If it doesn't support images, apply the three patches above. The suite works entirely from DOM snapshots — no vision needed.
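
One way to automate that check is to look up the model in the gateway's model listing before the suite runs. The sketch below works on an already-fetched listing and assumes each entry exposes architecture.input_modalities as an array of strings. Treat that field name as an assumption to verify against the current OpenRouter API reference. The model ids in the usage example are placeholders.

```typescript
// Assumed subset of one entry in the model listing.
interface ModelEntry {
  id: string;
  architecture?: { input_modalities?: string[] };
}

/**
 * Returns true/false when the listing says whether the model takes images,
 * or undefined when the model id is not in the listing at all.
 */
function supportsImageInput(models: ModelEntry[], modelId: string): boolean | undefined {
  const entry = models.find((m) => m.id === modelId);
  if (!entry) return undefined;
  return entry.architecture?.input_modalities?.includes('image') ?? false;
}

// Usage with a hypothetical listing
const listing: ModelEntry[] = [
  { id: 'example/text-only', architecture: { input_modalities: ['text'] } },
  { id: 'example/multimodal', architecture: { input_modalities: ['text', 'image'] } },
];

const textOnly = supportsImageInput(listing, 'example/text-only');
const multimodal = supportsImageInput(listing, 'example/multimodal');
const unknown = supportsImageInput(listing, 'example/not-listed');
```

Even when the listing reports image support, the error above is raised per endpoint, so a quick single-screenshot smoke test remains the most reliable confirmation.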

Bug 2 — Form Validation Not Surfacing

Affected test: empty form shows validation messages
Spec file: tests/auth/login.spec.ts:79
Error type: Real product failure

What the test did: Attempted to submit the login form without filling in either field. Expected a validation error message (toast or inline) to appear.

What happened: The AI's assertion — "A validation error or required field message is visible" — returned false. The form either silently rejected the submission, showed the error in a way the accessibility tree didn't expose, or the browser's native required tooltip appeared without the application's own error handling.

Root cause investigation: The login form relied on the browser's built-in required attribute for HTML5 validation. The native validation tooltip is rendered by the browser itself rather than as part of the page, so it never shows up in the accessibility tree snapshot. From the AI's perspective, nothing happened after clicking Submit on an empty form.

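The fix: Add explicit application-level error handling. Native validation blocks the submit event while a required field is empty, so the form also needs noValidate (or the required attributes removed) for the handler to run at all:

// Before: relied entirely on browser native validation
<input type="email" required />

// After: noValidate lets the submit event reach the handler,
// which catches the empty submission and shows its own toast
<form noValidate onSubmit={handleSubmit}>{/* email and password inputs */}</form>

const handleSubmit = (e) => {
  e.preventDefault();
  if (!email || !password) {
    toast.error('Please fill in all fields');
    return;
  }
  // ... rest of auth logic
};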

Why manual testing missed it: Every manual tester knows to fill in the form. The AI tested what happens when a user doesn't — a behaviour pattern human testers skip because it feels obvious. The browser showing a tooltip felt like "working" even though the application had no explicit handling.
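
To keep the error visible to both users and DOM-based tools, the empty-field check can live in a small pure function whose result is rendered as inline text. This is a minimal sketch, not InsightTrack's actual code; `LoginErrors`, `validateLogin`, `hasErrors`, and the message strings are hypothetical.

```typescript
// Per-field messages; a missing key means the field is valid.
type LoginErrors = { email?: string; password?: string };

// Application-level check that does not depend on the browser's `required` tooltip.
function validateLogin(email: string, password: string): LoginErrors {
  const errors: LoginErrors = {};
  if (email.trim() === "") errors.email = "Email is required";
  if (password === "") errors.password = "Password is required";
  return errors;
}

// Convenience guard for the submit handler.
function hasErrors(errors: LoginErrors): boolean {
  return Object.keys(errors).length > 0;
}
```

A submit handler can call `validateLogin`, render each message next to its field, and return early when `hasErrors` is true, so the message is ordinary DOM text that an accessibility snapshot picks up.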

Bug 3 — Registration Flow: Redirect on Success

Affected test: successful registration redirects away from /register
Spec file: tests/auth/register.spec.ts:9
Error type: Real product failure

What the test did: Registered a brand-new user with a unique email. Expected the URL to leave /register after success.

What happened: The test timed out waiting for the URL to change. The registration API call succeeded (201 Created), but the frontend didn't navigate away from /register.

Root cause: The registration success handler had a missing navigate call in one code path:

// Bug: no navigate() after setToken(); the redirect was left to the auth listener
const handleRegister = async () => {
  try {
    const { token } = await register(email, password, name);
    setToken(token);
    // navigate('/dashboard') was here but got removed during a refactor
    // It was assumed the auth listener would handle the redirect
  } catch (err) {
    setError(err.message);
  }
};

// Fix — explicit navigation after setting token
const handleRegister = async () => {
  try {
    const { token } = await register(email, password, name);
    setToken(token);
    navigate('/dashboard');  // explicit, unconditional
  } catch (err) {
    setError(err.message);
  }
};

Why manual testing missed it: The auth state listener in App.jsx also triggers a redirect when isAuthenticated changes. In normal browser sessions with a warm React tree, this listener fires fast enough that the missing navigate() call is never noticed — the listener redirect beats the timeout. In a fresh headless browser context with a cold React tree, the timing is different. The listener fires too late. The test exposed a timing dependency that was invisible in manual testing.

Bug 4 — Duplicate Email: Error Toast Not Appearing

Affected test: duplicate email shows an error
Spec file: tests/auth/register.spec.ts:47
Error type: Real product failure

What the test did: Registered a user, then attempted to register again with the same email. Expected an error message indicating the email is already in use.

What happened: The API returned a 409 Conflict as expected. The frontend caught the error. But the AI's assertion — "An error message indicating the email is already in use is visible" — returned false.

Root cause: The error toast was displayed, but it had a 3-second auto-dismiss timer. The assertion ran after the toast had already disappeared from the DOM.

// Bug — 3-second toast dismissed before AI assertion could check it
toast.error('An account with this email already exists', { duration: 3000 });

// Fix — increase to 8 seconds, or use a persistent inline error instead
toast.error('An account with this email already exists', { duration: 8000 });
// Better: use a persistent inline error message that doesn't auto-dismiss

Why manual testing missed it: Human testers look at the screen immediately after a failed action. They see the toast, note it works, and move on. The 3-second window is comfortable for humans but a race condition for automated tools — especially AI-powered ones where the assertion runs after the AI finishes its own processing.

Broader implication: Any UI feedback that auto-dismisses in less than ~8 seconds is a race condition for automated testing. Either extend the duration for test environments, or use persistent error states as the primary feedback mechanism.
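
Extending the duration only in test environments can be done with a single helper. This is a sketch under assumptions: `resolveToastDuration` and `MIN_TEST_VISIBLE_MS` are hypothetical names, and how the application detects a test environment is left to the caller.

```typescript
// Minimum time feedback should stay visible under automated testing (the ~8 s figure above).
const MIN_TEST_VISIBLE_MS = 8000;

// Keep the production duration as-is, but never go below the minimum in test runs.
function resolveToastDuration(baseMs: number, isTestEnv: boolean): number {
  return isTestEnv ? Math.max(baseMs, MIN_TEST_VISIBLE_MS) : baseMs;
}
```

A call site would then pass `{ duration: resolveToastDuration(3000, isTestEnv) }` to the toast instead of a hard-coded number, leaving production behaviour unchanged.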

Part 7: Architecture Decisions and What We Learned

Decisions That Paid Off

1. API seeding instead of UI onboarding

createTestSession calls /api/auth/register and /api/sites directly via Playwright's APIRequestContext. This takes ~200ms. The alternative, having the AI navigate the onboarding wizard, takes 90–120 seconds per test. At 52 tests, that is roughly 78 to 104 minutes of wizard navigation avoided on every run, replaced by about 10 seconds of API calls.
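
A seeding helper along these lines could look like the sketch below. It is not the suite's actual createTestSession: the two endpoint paths and the `{ token }` response come from this post, while the request bodies for /api/sites, the `id` field in its response, the bearer-token header, and the `ApiClientLike` interface (a structural stand-in for the parts of APIRequestContext used here) are assumptions.

```typescript
// Structural stand-ins so the sketch does not need to import Playwright.
interface ApiResponseLike {
  ok(): boolean;
  status(): number;
  json(): Promise<any>;
}
interface ApiClientLike {
  post(
    url: string,
    options: { data: unknown; headers?: Record<string, string> }
  ): Promise<ApiResponseLike>;
}

type TestSession = { email: string; token: string; siteId: string };

async function createTestSession(api: ApiClientLike, baseUrl: string): Promise<TestSession> {
  // Timestamp plus random suffix keeps each test's email unique.
  const email = `test-${Date.now()}-${Math.random().toString(36).slice(2, 8)}@example.com`;

  const reg = await api.post(`${baseUrl}/api/auth/register`, {
    data: { email, password: "Passw0rd!test", name: "Test User" },
  });
  if (!reg.ok()) throw new Error(`register failed: ${reg.status()}`);
  const { token } = await reg.json();

  // Assumption: /api/sites accepts a bearer token and returns the new site's id.
  const site = await api.post(`${baseUrl}/api/sites`, {
    data: { name: "Test Site", domain: "example.com" },
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!site.ok()) throw new Error(`site creation failed: ${site.status()}`);
  const { id } = await site.json();

  return { email, token, siteId: String(id) };
}
```

Failing loudly on a non-2xx response matters here: a silent seeding failure would otherwise surface later as a confusing AI assertion failure.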

2. localStorage auth injection

page.addInitScript runs before any page script executes. Every test starts in a fully authenticated state with zero time spent on auth. This pattern is usable in any Playwright test — not just Passmark — and we'd recommend it universally for dashboard testing.
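
The injection itself is only a few lines. The sketch below assumes the dashboard reads its token from a localStorage key named "token"; that key, along with `AuthSeed`, `InitScriptTarget`, `writeAuthToStorage`, and `injectAuth`, is hypothetical. `InitScriptTarget` mirrors only the one Page method used.

```typescript
type AuthSeed = { storageKey: string; token: string };

// Structural stand-in for the single Playwright Page method this helper needs.
interface InitScriptTarget {
  addInitScript(script: (seed: AuthSeed) => void, arg: AuthSeed): Promise<unknown>;
}

// Executed inside the browser before any page script, so it must be self-contained:
// it may only use its argument and browser globals.
function writeAuthToStorage(seed: AuthSeed): void {
  (globalThis as any).localStorage.setItem(seed.storageKey, seed.token);
}

// Register the script; every later navigation in this page starts authenticated.
async function injectAuth(
  page: InitScriptTarget,
  token: string,
  storageKey = "token" // assumed key name
): Promise<void> {
  await page.addInitScript(writeAuthToStorage, { storageKey, token });
}
```

The token is passed as an argument rather than captured in a closure because the function is serialized and re-evaluated in the page, where outer variables do not exist.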

3. Serial workers

workers: 1 might seem slow, but parallel workers create subtle race conditions: two tests registering with the same email prefix in the same timestamp bucket, two tests writing to the same siteId, two AI models fighting for the same DOM. Serial execution also makes the AI call log readable — one test's calls at a time, in order.
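
Serial execution avoids these collisions outright. Should parallel workers ever be reintroduced, the email collision at least can be guarded against on its own. The helper below is a hypothetical sketch, not part of the suite: it combines a timestamp, a per-process counter, and a random suffix.

```typescript
// Per-process counter: guarantees uniqueness for calls within one worker.
let emailCounter = 0;

function uniqueTestEmail(prefix: string): string {
  emailCounter += 1;
  // Random suffix makes a clash between separate worker processes very unlikely.
  const random = Math.random().toString(36).slice(2, 8);
  return `${prefix}+${Date.now()}-${emailCounter}-${random}@example.com`;
}
```

This does nothing for the shared siteId or shared DOM problems, which is why workers: 1 remains the simpler choice.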

4. Permissive assertions for empty state

// Fragile — fails on empty database
{ assertion: 'A table showing page visit data with at least 5 rows is visible' }

// Resilient — passes in both seeded and clean environments
{ assertion: 'Either a pages data table OR an empty-state message is visible' }

A clean test environment has no traffic data. Every assertion that requires specific data will fail in CI. Permissive OR assertions confirm the page renders correctly in both states.
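
If many pages need the same pattern, the OR phrasing can be generated so it stays consistent across spec files. `AssertionStep` and `eitherVisible` are hypothetical names in this sketch; the output format simply mirrors the resilient assertion above.

```typescript
type AssertionStep = { assertion: string };

// Build "Either A OR B is visible" (or "Either A, B OR C is visible") from alternatives.
function eitherVisible(...alternatives: string[]): AssertionStep {
  if (alternatives.length < 2) {
    throw new Error("eitherVisible needs at least two alternatives");
  }
  const head = alternatives.slice(0, -1).join(", ");
  const last = alternatives[alternatives.length - 1];
  return { assertion: `Either ${head} OR ${last} is visible` };
}
```

Calling `eitherVisible("a pages data table", "an empty-state message")` yields the same sentence as the hand-written version.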

5. Raw Playwright where it's stronger

AI is excellent for asserting semantic intent. Raw Playwright is better for:

Reading DOM attributes (input[type], aria-expanded)
Asserting CSS classes that indicate state (div.dark)
Clicking elements with known aria-label values
Anything that needs exact timing control

The dark mode test is fully raw Playwright. The AI alternative would spend 60 seconds finding the toggle button and interpreting the theme change. The raw test takes 3 seconds.

What to Watch Out For

test.setTimeout() overrides the global config silently

If playwright.config.ts says timeout: 180_000 but a test calls test.setTimeout(90_000), that test has 90 seconds. With retries: 1, that's 3 minutes total — barely enough for 2 AI tool calls. Every test in this suite uses test.setTimeout(240_000) minimum.
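
The arithmetic behind that budget is worth making explicit, since the per-test timeout applies to each attempt separately. `totalBudgetMs` and `toMinutes` are illustrative helper names.

```typescript
// Each retry is a fresh attempt with the full per-test timeout.
function totalBudgetMs(timeoutMs: number, retries: number): number {
  return timeoutMs * (retries + 1);
}

function toMinutes(ms: number): number {
  return ms / 60_000;
}
```

With a 90_000 ms timeout and retries: 1, `toMinutes(totalBudgetMs(90_000, 1))` is 3; with the suite's 240_000 ms minimum it is 8.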

Auto-dismissing toasts are a race condition for automated testing

Anything that disappears in less than 8 seconds is a timing risk. Error toasts, success banners, inline validation — if they auto-dismiss, they can disappear before the AI finishes processing and runs the assertion.

Check your OpenRouter credit balance before a full run

A full 52-test run costs approximately $0.50–0.60 with gpt-4.1-mini. If the key runs out mid-suite, all subsequent tests return HTTP 403 errors that look like test failures. Check first:

curl https://openrouter.ai/api/v1/auth/key \
  -H "Authorization: Bearer $OPENROUTER_API_KEY"
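
The JSON that comes back can be turned into a go/no-go decision before the suite starts. This sketch assumes the response carries `data.usage` and `data.limit` in USD, with `limit` set to null for an uncapped key; those field names, and the helpers `remainingCredit` and `hasBudgetFor`, are assumptions to check against the actual response.

```typescript
// Assumed response shape; confirm the field names against a real call.
type KeyInfo = { data: { usage: number; limit: number | null } };

// Remaining spend in USD, or null when the key has no cap.
function remainingCredit(info: KeyInfo): number | null {
  const { usage, limit } = info.data;
  return limit === null ? null : Math.max(0, limit - usage);
}

// True when the key can cover the estimated cost of the planned run.
function hasBudgetFor(info: KeyInfo, estimatedCostUsd: number): boolean {
  const remaining = remainingCredit(info);
  return remaining === null || remaining >= estimatedCostUsd;
}
```

Running this in a global setup step and aborting early would turn 38 misleading failures into a single clear message.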

waitUntil conditions must be intent-based, not literal

waitUntil: 'The h1 heading reads "Conversions & Funnels"' will time out if the heading has any variation. Use: waitUntil: 'A heading about conversions is visible on the page'. Treat waitUntil as a prompt to the AI, not a CSS selector.

One action per step

// Bad — chained actions fail unpredictably
{ description: 'Go to /funnels and click Add Step and fill in the URL field' }

// Good — atomic steps, reliable execution
{ description: 'Navigate to /funnels' },
{ description: 'Click the Add Step button' },
{ description: 'Type /pricing into the step URL field' },
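
A crude lint can catch chained steps before they reach the AI. The heuristic below is an illustrative sketch with hypothetical names (`Step`, `findChainedSteps`): it flags any description containing "and" or "then" as a whole word, which will produce false positives on labels such as "Terms and Conditions".

```typescript
type Step = { description: string };

// Flag steps whose description likely bundles more than one action.
function findChainedSteps(steps: Step[]): Step[] {
  return steps.filter((step) => /\b(and|then)\b/i.test(step.description));
}
```

Running it over the two examples above flags the single chained step and none of the three atomic ones.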

Part 8: Running It Yourself

Prerequisites

Node.js 20+
InsightTrack API running on port 3001
InsightTrack dashboard running on port 4173
OpenRouter API key with at least $1.00 credit (budget for a full run + retries)

Start the Application

# Docker (recommended)
cd /path/to/traffic2
docker-compose up --build

# Or locally
cd appsv2/analytics-api && npm start          # Terminal 1
cd appsv2/dashboard-web && npm run dev        # Terminal 2

# Seed the database
cd appsv2/analytics-api
npm run migrate && npm run seed && npm run init && npm run sync

Set Up and Run Tests

cd appsv2/passmark-tests

cp .env.example .env
# Set OPENROUTER_API_KEY=sk-or-v1-...

npm install
npx playwright install chromium

# Check your credit balance first
curl https://openrouter.ai/api/v1/auth/key \
  -H "Authorization: Bearer $(grep '^OPENROUTER_API_KEY=' .env | cut -d= -f2-)"

# Run everything
npm test

# Run a subset
npx playwright test tests/auth/ --project chromium          # auth only (~15 min)
npx playwright test tests/dashboard/ --project chromium     # dashboard only (~35 min)
npx playwright test tests/auth/login.spec.ts --project chromium  # one file

# View the HTML report
npx playwright show-report

Final Numbers

Metric	Value
Tests written	52
Spec files	13
Routes covered	17
Tests that ran before credits ran out	11
Tests that passed	8 (72.7% of completed)
Tests that exposed real bugs	3
Tests aborted by API credit limit	38
Total bugs found (incl. framework)	4
Runtime before credit exhaustion	~1.5 hours
OpenRouter API cost spent	~$0.45
Estimated cost for full 52-test run	~$0.50–$0.60
Lines of test code	~1,200
CSS selectors in test code	0

What the bugs mean:

Bug 1 (vision model): Framework-level. Fixed once, never recurs.
Bug 2 (empty form validation): UX gap. Browser-native validation never appeared in the accessibility snapshot, so automated tools saw no error at all.
Bug 3 (registration redirect): Critical user flow. New users couldn't complete onboarding in headless environments.
Bug 4 (toast timing): Race condition. 3-second auto-dismiss was too short for any automated verification.

All 4 had been present for months. Manual testing missed every one.

Built for the Bug0 Breaking Apps Hackathon — May 2026 #BreakingAppsHackathon
