Website tasks evaluate an agent’s ability to navigate and interact with web applications like Calendr, Cloudfile, Shopora, and Pandora’s Inbox.

Task Components

1. Prompt

The high-level, natural language instruction given to the agent. Example: “Schedule a meeting with John Doe for Thursday afternoon”

2. Subproblems

Subproblems break the task into atomic, verifiable actions, each representing an individual step.
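As an illustration, the scheduling example above might break down as follows. The structure and field names ("prompt", "subproblems") are hypothetical, not the platform's actual task schema:

```python
# Hypothetical task definition for the scheduling example.
# The keys "prompt" and "subproblems" are illustrative, not an official schema.
task = {
    "prompt": "Schedule a meeting with John Doe for Thursday afternoon",
    "subproblems": [
        "Open the Calendr app",
        "Create a new event titled 'Meeting with John Doe'",
        "Set the event time to Thursday afternoon",
        "Add John Doe as an attendee",
        "Save the event",
    ],
}

# Each subproblem is atomic: it names exactly one verifiable action.
for step in task["subproblems"]:
    print(step)
```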

3. Verifiers

JSON-based checks sent to /verifier:
• State Check: verifies database changes. Example: assert the events table has an entry with title “Team Meeting”.
• Log Check: verifies UI interactions. Example: assert a click on save-button appears in the logs.
• Rubric Check: LLM evaluation of the agent’s response. Example: assert the response is helpful and accurate.
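A sketch of how the three check types could be expressed as JSON payloads for /verifier. The field names ("type", "table", "assert", and so on) are assumptions for illustration; consult the actual verifier schema for the real format:

```python
import json

# Hypothetical verifier payloads; field names are illustrative,
# not the platform's actual /verifier schema.
checks = [
    {   # State check: inspect the app database after the run
        "type": "state",
        "table": "events",
        "assert": {"title": "Team Meeting"},
    },
    {   # Log check: inspect the recorded UI interaction log
        "type": "log",
        "assert": {"action": "click", "target": "save-button"},
    },
    {   # Rubric check: have an LLM grade the agent's final response
        "type": "rubric",
        "assert": "The response is helpful and accurate.",
    },
]

# The request body sent to /verifier would serialize these checks as JSON.
payload = json.dumps(checks, indent=2)
```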

Best Practices

• Each subproblem should map to one or more verifier checks. If you can’t verify a step, consider simplifying it.
• Ensure your initial data reflects realistic usage patterns: enough records to exercise navigation, but not so many that tasks become tedious.
• Many tasks have more than one correct approach. Design verifiers that check outcomes, not specific click sequences.
• Verify that the agent didn’t break unrelated data, e.g.:
  • “Other events remain unchanged”
  • “User profile not modified”
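The last two practices, checking outcomes rather than click sequences and guarding unrelated data, can be sketched with a toy in-memory “events table”. A real verifier would query the application database through /verifier; everything below is an assumption for illustration:

```python
# Minimal sketch of outcome-based verification against a toy in-memory
# "events table". A real check would query the app database via /verifier.
before = [
    {"id": 1, "title": "Standup", "day": "Monday"},
    {"id": 2, "title": "1:1 with Ana", "day": "Wednesday"},
]

def run_agent(events):
    # Stand-in for the agent's effect: it added the requested event.
    return events + [{"id": 3, "title": "Meeting with John Doe", "day": "Thursday"}]

after = run_agent(list(before))

# Outcome check: the requested event exists, with no assumptions
# about which clicks produced it.
assert any(e["title"] == "Meeting with John Doe" for e in after)

# Side-effect check: pre-existing events are untouched.
assert [e for e in after if e["id"] in {1, 2}] == before
```

Because both assertions inspect final state, any sequence of UI actions that produces the correct event passes, which is exactly the outcome-over-path property the guideline asks for.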

Next Steps