Website Verifiers - Scale Gymnasium

Website environments are verified with declarative, JSON-based checks that run through the `/api/verifier` HTTP endpoint.

Access

Method	Endpoint
Docker	`POST /api/verifier`
Web UI	“Run Verifier” button
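
As a rough illustration of the Docker route, the sketch below builds and sends a `POST /api/verifier` request using only Python's standard library. The base URL `http://localhost:3000`, the `{"checks": [...]}` body shape, and the helper names `build_verifier_request` and `run_verifier` are assumptions made for this example; the Website API Reference is the authority on the actual request schema.

```python
import json
import urllib.request

# Assumption: the container is reachable at this address; adjust to your setup.
BASE_URL = "http://localhost:3000"


def build_verifier_request(checks, base_url=BASE_URL):
    """Build (but do not send) a POST request to /api/verifier.

    The {"checks": [...]} body shape is hypothetical, not the documented schema.
    """
    body = json.dumps({"checks": checks}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/verifier",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_verifier(checks, base_url=BASE_URL, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_verifier_request(checks, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running container.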

Check Types

Type	Purpose	Data Source
State Check	Verify database changes	SQLite database
Log Check	Verify user interactions	Session logs
Rubric Check	LLM-based evaluation	Agent response

State Checks

Verify that the database contains expected values after agent actions.

Components

Field	Description
`table`	Which database table to query
`conditions`	Filters to match records
`assertions`	Expected values to verify
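
To make the three fields concrete, here is a minimal sketch of how a state check could be evaluated against a SQLite database. Only the field names `table`, `conditions`, and `assertions` come from the table above; the dictionary layout, the `location` column, and the `run_state_check` helper are hypothetical and do not describe the verifier's actual implementation.

```python
import sqlite3

# Hypothetical check definition; only the three field names are documented.
state_check = {
    "table": "events",
    "conditions": {"title": "Meeting"},
    "assertions": {"location": "Room 4"},
}


def run_state_check(conn, check):
    """Return True if some row matching `conditions` satisfies every assertion."""
    # Table and column names cannot be bound as SQL parameters, so they are
    # quoted here; only pass trusted check definitions to a helper like this.
    sql = f'SELECT * FROM "{check["table"]}"'
    where = " AND ".join(f'"{col}" = ?' for col in check["conditions"])
    if where:
        sql += f" WHERE {where}"
    cursor = conn.execute(sql, tuple(check["conditions"].values()))
    columns = [desc[0] for desc in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return any(
        all(row.get(col) == expected for col, expected in check["assertions"].items())
        for row in rows
    )


# Demo against an in-memory database standing in for the environment's SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, location TEXT)")
conn.execute("INSERT INTO events VALUES ('Meeting', 'Room 4')")
print(run_state_check(conn, state_check))  # True
```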

Example Use Cases

“events table has entry with title ‘Meeting’” (Calendr)
“users table has email updated to ‘new@email.com’” (Cloudfile)
“cart_items table is empty after checkout” (Shopora)
“orders table has new record with status ‘completed’” (Shopora)
“messages table has email with subject ‘Re: Project Update’” (Pandora’s Inbox)

Best Practices

Target specific tables and fields
Use precise conditions to isolate records
Include both existence and value assertions
Consider order-independent matching for arrays
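
The last practice can be sketched as follows, assuming an array is stored as JSON text in a database column (an assumption for this example, as is the `same_items` helper). Items are compared as a multiset, so ordering is ignored but duplicates still count.

```python
import json
from collections import Counter


def _canonical(item):
    """Serialize an item deterministically so nested objects can be counted."""
    return json.dumps(item, sort_keys=True)


def same_items(actual_json, expected):
    """Compare a JSON-encoded array with an expected list, ignoring order."""
    actual = json.loads(actual_json)
    return Counter(map(_canonical, actual)) == Counter(map(_canonical, expected))


print(same_items('["bob", "alice"]', ["alice", "bob"]))  # True
print(same_items('["alice", "alice"]', ["alice"]))       # False
```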

Log Checks

Verify that specific user interactions occurred during the task.

Components

Field	Description
`event_type`	Type of interaction (click, input, navigation)
`element_id`	Target element identifier
`value`	Expected value (for input events)
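
A log check can be thought of as a pattern matched against the recorded session events. The sketch below assumes each log entry is a flat record with the same three field names; the entry format, the element IDs, and the `log_check_passes` helper are hypothetical.

```python
# Hypothetical session log entries; the real log format may differ.
session_log = [
    {"event_type": "navigation", "element_id": None, "value": "/settings"},
    {"event_type": "input", "element_id": "name", "value": "John"},
    {"event_type": "click", "element_id": "submit-btn", "value": None},
]

log_check = {"event_type": "input", "element_id": "name", "value": "John"}


def log_check_passes(log, check):
    """Return True if any logged event matches every field present in the check."""
    return any(
        all(event.get(field) == wanted for field, wanted in check.items())
        for event in log
    )


print(log_check_passes(session_log, log_check))  # True
print(log_check_passes(session_log, {"event_type": "click", "element_id": "cancel-btn"}))  # False
```

Fields omitted from the check are simply not compared, so a click check does not need a `value`.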

Example Use Cases

“User clicked the ‘Submit’ button”
“User navigated to /settings”
“User entered ‘John’ in name field”
“User selected ‘California’ from dropdown”
“User opened email thread with subject ‘Q4 Budget Allocation’” (Pandora’s Inbox)

What Gets Logged

Website environments automatically capture:

Button clicks with element IDs
Form inputs with values
Page navigations
Dropdown selections
Checkbox/radio changes

Rubric Checks

LLM-evaluated criteria for qualitative assessment of agent responses.

Components

Field	Description
`criteria`	Description of what to evaluate
`rubric`	Scoring guidelines

Behavior

1. The agent completes the task and produces a response.
2. The rubric check returns a “PENDING” status.
3. An external LLM evaluates the response against the criteria.
4. The final pass/fail result is determined.
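
The two-phase flow above might look like the following sketch, in which a pending rubric check is resolved by a separate grading step. Only the “PENDING” status and the `criteria` and `rubric` fields come from this page; the `status` key, the `PASSED`/`FAILED` values, `resolve_rubric`, and the keyword-based `stub_grader` (standing in for a real LLM call) are assumptions for illustration.

```python
from typing import Callable

# Hypothetical shape of a rubric check result before external grading.
pending_check = {
    "type": "rubric",
    "status": "PENDING",
    "criteria": "Response accurately summarizes the calendar events",
    "rubric": "Pass only if every event for the day is mentioned.",
}


def resolve_rubric(
    check: dict, response: str, grader: Callable[[str, str, str], bool]
) -> dict:
    """Return a copy of the check with a final pass/fail from the grader."""
    if check.get("status") != "PENDING":
        return dict(check)  # already resolved; nothing to do
    passed = grader(check["criteria"], check["rubric"], response)
    return {**check, "status": "PASSED" if passed else "FAILED", "passed": passed}


def stub_grader(criteria: str, rubric: str, response: str) -> bool:
    """Stand-in for an LLM call: passes any response that mentions 'standup'."""
    return "standup" in response.lower()


result = resolve_rubric(pending_check, "You have a standup at 9am.", stub_grader)
print(result["status"])  # PASSED
```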

Example Use Cases

“Response accurately summarizes the calendar events” (Calendr)
“Agent provided helpful and accurate information”
“Output follows the requested format”
“Response correctly identifies the date of the calendar invitation in the email” (Pandora’s Inbox)

Response Structure

{
  "passed": true,
  "checks": [
    { "type": "state", "passed": true, "message": "Event created" },
    { "type": "log", "passed": true, "message": "Submit clicked" }
  ],
  "message": "All checks passed"
}
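
A client might condense the sample response above into a one-line summary, as in this sketch. The `summarize` helper and its output format are illustrative choices, not part of the verifier API.

```python
import json

# The sample response from this section, as it would arrive over HTTP.
raw = """
{
  "passed": true,
  "checks": [
    { "type": "state", "passed": true, "message": "Event created" },
    { "type": "log", "passed": true, "message": "Submit clicked" }
  ],
  "message": "All checks passed"
}
"""


def summarize(response: dict) -> str:
    """Produce a one-line summary of a verifier response."""
    checks = response.get("checks", [])
    passed = sum(1 for check in checks if check.get("passed"))
    verdict = "PASS" if response.get("passed") else "FAIL"
    return f"{verdict}: {passed}/{len(checks)} checks passed ({response.get('message', '')})"


print(summarize(json.loads(raw)))  # PASS: 2/2 checks passed (All checks passed)
```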

Debugging Failed Checks

When checks fail, the response includes:

Field	Description
`all_results`	All data that was searched
`expected`	What was expected
`actual`	What was found

Use these fields to diagnose why the expected data was not found.
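
One way to use these fields is to print a short report for each failed check. This sketch assumes the debugging fields appear on each failed entry of `checks`, which this page does not confirm; the sample failure and the `explain_failures` helper are invented for illustration.

```python
def explain_failures(response: dict) -> list[str]:
    """Describe each failed check using its expected/actual fields, if present."""
    lines = []
    for check in response.get("checks", []):
        if check.get("passed"):
            continue
        searched = check.get("all_results") or []
        lines.append(
            f"[{check.get('type', '?')}] expected {check.get('expected')!r}, "
            f"found {check.get('actual')!r} among {len(searched)} searched record(s)"
        )
    return lines


# Hypothetical failed response, invented for illustration.
failed = {
    "passed": False,
    "checks": [
        {
            "type": "state",
            "passed": False,
            "expected": {"status": "completed"},
            "actual": {"status": "pending"},
            "all_results": [{"id": 1, "status": "pending"}],
        }
    ],
}

for line in explain_failures(failed):
    print(line)
```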

Next Steps

Website API Reference

Full verifier endpoint documentation

Desktop Verifiers

File comparison for VM environments