Website tasks evaluate an agent’s ability to navigate and interact with web applications like Calendr, Cloudfile, Shopora, and Pandora’s Inbox.

Task Components

1. Prompt

The high-level, natural language instruction given to the agent. Example: “Schedule a meeting with John Doe for Thursday afternoon”

2. Subproblems

Subproblems break the task into atomic, verifiable actions, each representing an individual step.
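As an illustration, the scheduling example above might break down as follows. The structure and field names ("prompt", "subproblems") are hypothetical, not the platform's actual task schema:

```python
# Hypothetical task definition for the scheduling example.
# The keys "prompt" and "subproblems" are illustrative, not an official schema.
task = {
    "prompt": "Schedule a meeting with John Doe for Thursday afternoon",
    "subproblems": [
        "Open the Calendr app",
        "Create a new event titled 'Meeting with John Doe'",
        "Set the event time to Thursday afternoon",
        "Add John Doe as an attendee",
        "Save the event",
    ],
}

# Each subproblem is atomic: it names exactly one verifiable action.
for step in task["subproblems"]:
    print(step)
```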

3. Verifiers

JSON-based checks sent to /verifier:
• State Check: verifies database changes. Example: assert the events table has an entry with title “Team Meeting”.
• Log Check: verifies UI interactions. Example: assert a click on save-button appears in the logs.
• Rubric Check: LLM evaluation of the agent’s response. Example: assert the response is helpful and accurate.
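A sketch of how the three check types could be expressed as JSON payloads for /verifier. The field names ("type", "table", "assert", and so on) are assumptions for illustration; consult the actual verifier schema for the real format:

```python
import json

# Hypothetical verifier payloads; field names are illustrative,
# not the platform's actual /verifier schema.
checks = [
    {   # State check: inspect the app database after the run
        "type": "state",
        "table": "events",
        "assert": {"title": "Team Meeting"},
    },
    {   # Log check: inspect the recorded UI interaction log
        "type": "log",
        "assert": {"action": "click", "target": "save-button"},
    },
    {   # Rubric check: have an LLM grade the agent's final response
        "type": "rubric",
        "assert": "The response is helpful and accurate.",
    },
]

# The request body sent to /verifier would serialize these checks as JSON.
payload = json.dumps(checks, indent=2)
```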

Best Practices

• Each subproblem should map to one or more verifier checks. If you can’t verify a step, consider simplifying it.
• Ensure your initial data reflects realistic usage patterns: enough records to exercise navigation, but not so many that tasks become tedious.
• Many tasks have more than one correct approach. Design verifiers that check outcomes, not specific click sequences.
• Verify that the agent didn’t break unrelated data, e.g.:
  • “Other events remain unchanged”
  • “User profile not modified”
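The last two practices, checking outcomes rather than click sequences and guarding unrelated data, can be sketched with a toy in-memory “events table”. A real verifier would query the application database through /verifier; everything below is an assumption for illustration:

```python
# Minimal sketch of outcome-based verification against a toy in-memory
# "events table". A real check would query the app database via /verifier.
before = [
    {"id": 1, "title": "Standup", "day": "Monday"},
    {"id": 2, "title": "1:1 with Ana", "day": "Wednesday"},
]

def run_agent(events):
    # Stand-in for the agent's effect: it added the requested event.
    return events + [{"id": 3, "title": "Meeting with John Doe", "day": "Thursday"}]

after = run_agent(list(before))

# Outcome check: the requested event exists, with no assumptions
# about which clicks produced it.
assert any(e["title"] == "Meeting with John Doe" for e in after)

# Side-effect check: pre-existing events are untouched.
assert [e for e in after if e["id"] in {1, 2}] == before
```

Because both assertions inspect final state, any sequence of UI actions that produces the correct event passes, which is exactly the outcome-over-path property the guideline asks for.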

Next Steps