Website environments are verified with declarative, JSON-based checks that run through the /api/verifier HTTP endpoint.

Access

Method    Endpoint
Docker    POST /api/verifier
Web UI    “Run Verifier” button
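
As a rough sketch of the Docker access path, the request could be built with Python's standard library as shown here. Only the method and the path come from the table above; the base URL `http://localhost:3000`, the empty JSON body, and the helper names `build_verifier_request` and `run_verifier` are assumptions for illustration.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the container is reachable at this address; adjust to your setup.
BASE_URL = "http://localhost:3000"


def build_verifier_request(base_url: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request to the verifier endpoint."""
    # Assumption: the endpoint accepts a JSON body; an empty object is sent by default.
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/verifier",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_verifier(base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_verifier_request(base_url), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and method without a running container.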

Check Types

Type            Purpose                     Data Source
State Check     Verify database changes     SQLite database
Log Check       Verify user interactions    Session logs
Rubric Check    LLM-based evaluation        Agent response

State Checks

Verify that the database contains expected values after agent actions.

Components

Field         Description
table         Which database table to query
conditions    Filters to match records
assertions    Expected values to verify
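
To make the three fields concrete, here is a minimal sketch of how a state check might be evaluated against SQLite. Only the field names `table`, `conditions`, and `assertions` come from the table above. Treating conditions and assertions as column-to-value maps, the `events` schema, and the function `run_state_check` are all assumptions, not the verifier's actual implementation.

```python
import sqlite3

# Hypothetical check definition; the real JSON schema may differ.
check = {
    "table": "events",
    "conditions": {"title": "Meeting"},
    "assertions": {"location": "Room 4"},
}


def run_state_check(conn: sqlite3.Connection, check: dict) -> bool:
    """Return True if some row matching the conditions satisfies every assertion."""
    table = check["table"]
    conditions = check["conditions"]
    where = " AND ".join(f'"{col}" = ?' for col in conditions) or "1 = 1"
    # Identifiers are interpolated, so the check file must come from a trusted source.
    sql = f'SELECT * FROM "{table}" WHERE {where}'
    cur = conn.execute(sql, list(conditions.values()))
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    # With no assertions this reduces to a pure existence check.
    return any(
        all(row.get(col) == want for col, want in check["assertions"].items())
        for row in rows
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, location TEXT)")
conn.execute("INSERT INTO events VALUES ('Meeting', 'Room 4')")
print(run_state_check(conn, check))
```

The conditions narrow the search to candidate rows and the assertions are then tested on each candidate, so a check fails both when no row matches and when a matching row holds the wrong value.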

Example Use Cases

  • “events table has entry with title ‘Meeting’” (Calendr)
  • “users table has email updated to ‘new@email.com’” (Cloudfile)
  • “cart_items table is empty after checkout” (Shopora)
  • “orders table has new record with status ‘completed’” (Shopora)
  • “messages table has email with subject ‘Re: Project Update’” (Pandora’s Inbox)

Best Practices

  • Target specific tables and fields
  • Use precise conditions to isolate records
  • Include both existence and value assertions
  • Consider order-independent matching for arrays
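
Order-independent matching, the last practice above, can be sketched as a multiset comparison. The helper `same_items` is hypothetical and is not part of the verifier; it assumes the array items are JSON-serializable.

```python
import json
from collections import Counter


def same_items(expected: list, actual: list) -> bool:
    """Compare two lists as multisets: order is ignored, duplicates still count."""
    def canon(item) -> str:
        # Serialize with sorted keys so dicts compare equal regardless of key order.
        return json.dumps(item, sort_keys=True)

    return Counter(map(canon, expected)) == Counter(map(canon, actual))
```

Comparing as multisets rather than sets keeps the check strict about duplicates, which matters for cases such as two identical cart items.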

Log Checks

Verify that specific user interactions occurred during the task.

Components

Field         Description
event_type    Type of interaction (click, input, navigation)
element_id    Target element identifier
value         Expected value (for input events)
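
A log check can be thought of as a partial match against the recorded session events. The sketch below assumes each log entry is a flat record with the same three field names; the actual session log format is not specified here, and `session_log` and `log_check_passes` are illustrative names.

```python
# Hypothetical session log entries; the real log format may differ.
session_log = [
    {"event_type": "navigation", "element_id": None, "value": "/settings"},
    {"event_type": "input", "element_id": "name-field", "value": "John"},
    {"event_type": "click", "element_id": "submit-btn", "value": None},
]


def log_check_passes(log: list, check: dict) -> bool:
    """Return True if any entry matches every field that the check specifies."""
    return any(
        all(entry.get(key) == want for key, want in check.items())
        for entry in log
    )


click_check = {"event_type": "click", "element_id": "submit-btn"}
print(log_check_passes(session_log, click_check))
```

Fields omitted from the check are not compared, so a click check can leave out `value` while an input check can include it.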

Example Use Cases

  • “User clicked the ‘Submit’ button”
  • “User navigated to /settings”
  • “User entered ‘John’ in name field”
  • “User selected ‘California’ from dropdown”
  • “User opened email thread with subject ‘Q4 Budget Allocation’” (Pandora’s Inbox)

What Gets Logged

Website environments automatically capture:
  • Button clicks with element IDs
  • Form inputs with values
  • Page navigations
  • Dropdown selections
  • Checkbox/radio changes

Rubric Checks

LLM-evaluated criteria for qualitative assessment of agent responses.

Components

Field       Description
criteria    Description of what to evaluate
rubric      Scoring guidelines

Behavior

  1. Agent completes task and produces response
  2. Rubric check returns “PENDING” status
  3. External LLM evaluates response against criteria
  4. Final pass/fail determined
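
The four steps above can be sketched as a two-phase flow. Only the “PENDING” status comes from this page; the “PASSED” and “FAILED” values, the function names, and the keyword-based stand-in for the external LLM are assumptions made so the example runs without a model.

```python
from typing import Callable


def rubric_check(criteria: str, rubric: str) -> dict:
    """Step 2: the check is not graded yet, so it reports PENDING."""
    return {"type": "rubric", "status": "PENDING", "criteria": criteria, "rubric": rubric}


def resolve(check: dict, response: str, judge: Callable[[str, str, str], bool]) -> dict:
    """Steps 3-4: an external judge grades the response and the status is finalized."""
    passed = judge(response, check["criteria"], check["rubric"])
    return {**check, "status": "PASSED" if passed else "FAILED", "passed": passed}


def keyword_judge(response: str, criteria: str, rubric: str) -> bool:
    """Stand-in for the external LLM call; a real judge would be a model."""
    return "standup" in response.lower()


pending = rubric_check(
    "Response accurately summarizes the calendar events",
    "Pass if every event on the calendar is mentioned",
)
final = resolve(pending, "You have a standup at 9:00.", keyword_judge)
print(pending["status"], "->", final["status"])
```

Passing the judge in as a callable keeps the pending result independent of whichever model eventually performs the evaluation.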

Example Use Cases

  • “Response accurately summarizes the calendar events” (Calendr)
  • “Agent provided helpful and accurate information”
  • “Output follows the requested format”
  • “Response correctly identifies the date of the calendar invitation in the email” (Pandora’s Inbox)

Response Structure

{
  "passed": true,
  "checks": [
    { "type": "state", "passed": true, "message": "Event created" },
    { "type": "log", "passed": true, "message": "Submit clicked" }
  ],
  "message": "All checks passed"
}
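
A response of this shape can be consumed with a few lines of Python. The failing payload below is made up for illustration and mirrors the structure above; `failed_checks` is a hypothetical helper.

```python
import json

# Illustrative payload with one failing check; the message texts are invented.
raw = """
{
  "passed": false,
  "checks": [
    { "type": "state", "passed": true, "message": "Event created" },
    { "type": "log", "passed": false, "message": "Submit not clicked" }
  ],
  "message": "1 check failed"
}
"""


def failed_checks(response: dict) -> list:
    """Return a "type: message" string for each check that did not pass."""
    return [
        f'{c["type"]}: {c["message"]}'
        for c in response["checks"]
        if not c["passed"]
    ]


result = json.loads(raw)
print(result["passed"], failed_checks(result))
```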

Debugging Failed Checks

When checks fail, the response includes:

Field          Description
all_results    All data that was searched
expected       What was expected
actual         What was found

Use this to diagnose why expected data was not found.
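
One way to read these fields is to print the expected and actual values side by side and flag the keys that differ. This sketch assumes `expected` and `actual` are flat objects and `all_results` is a list of records; the exact shapes are not documented here, and `explain_failure` and the sample data are hypothetical.

```python
def explain_failure(check: dict) -> str:
    """Summarize a failed check from its expected, actual, and all_results fields."""
    expected, actual = check.get("expected"), check.get("actual")
    searched = check.get("all_results") or []
    lines = [f"expected: {expected!r}", f"actual:   {actual!r}"]
    if isinstance(expected, dict) and isinstance(actual, dict):
        # Point at the individual fields whose values differ.
        for key, want in expected.items():
            if actual.get(key) != want:
                lines.append(f"  mismatch on {key!r}: {actual.get(key)!r} != {want!r}")
    lines.append(f"searched {len(searched)} record(s)")
    return "\n".join(lines)


# Hypothetical failed state check for an order that never reached 'completed'.
failed = {
    "expected": {"status": "completed"},
    "actual": {"status": "pending"},
    "all_results": [{"id": 1, "status": "pending"}, {"id": 2, "status": "cancelled"}],
}
print(explain_failure(failed))
```

Scanning `all_results` is often the quickest way to tell whether the record is missing entirely or present with an unexpected value.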
