Task Components
1. Prompt
The high-level, natural-language instruction given to the agent. Example: “Schedule a meeting with John Doe for Thursday afternoon”
2. Subproblems
Break the task into atomic, verifiable actions that represent individual steps.
3. Verifiers
JSON-based checks sent to /verifier:
| Type | Purpose | Example |
|---|---|---|
| State Check | Verify database changes | Assert events table has entry with title “Team Meeting” |
| Log Check | Verify UI interactions | Assert click on save-button is in logs |
| Rubric Check | LLM evaluation of response | Assert response is helpful and accurate |
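The payload schema accepted by /verifier is not shown here, so the following is only a sketch of how the three check types in the table might look as JSON. Every field name (`type`, `table`, `where`, `expect`, `event`, `target`, `criteria`) and the top-level `verifiers` key are hypothetical, chosen to mirror the table rows rather than taken from the actual API.

```python
import json

# Hypothetical payloads: the field names below are illustrative
# assumptions, not the documented /verifier schema.
state_check = {
    "type": "state",
    "table": "events",
    "where": {"title": "Team Meeting"},
    "expect": "exists",
}
log_check = {
    "type": "log",
    "event": "click",
    "target": "save-button",
}
rubric_check = {
    "type": "rubric",
    "criteria": "Response is helpful and accurate",
}

verifiers = [state_check, log_check, rubric_check]

# Serialize the checks into a single JSON document.
payload = json.dumps({"verifiers": verifiers}, indent=2)
print(payload)
```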
Best Practices
Match subproblems to verification
Each subproblem should map to one or more verifier checks. If a subproblem can’t be verified, consider simplifying it.
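One way to enforce this mapping while authoring a task is to list each subproblem alongside the verifiers that cover it and flag any that have none. This is a minimal sketch: the `Subproblem` class, the `unverified` helper, and the verifier IDs are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subproblem:
    """Hypothetical container pairing a subproblem with its verifier IDs."""
    name: str
    verifier_ids: List[str] = field(default_factory=list)


def unverified(subproblems):
    """Return the names of subproblems that have no verifier attached."""
    return [s.name for s in subproblems if not s.verifier_ids]


subproblems = [
    Subproblem("Open the calendar", ["log-open-calendar"]),
    Subproblem("Create the event", ["state-event-exists"]),
    Subproblem("Confirm with the user"),  # no verifier yet
]

missing = unverified(subproblems)
print(missing)
```

Any name that shows up in `missing` is a candidate for either a new verifier or a simpler subproblem.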
Use realistic data packs
Ensure your initial data reflects realistic usage patterns: include enough records to test navigation, but not so many that tasks become tedious.
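As a rough illustration, a small data pack can be generated with a fixed random seed so that every run starts from the same state. The record shape, the titles, the two-week window, and the count of 25 events are all assumptions made for this sketch; the right size depends on the application under test.

```python
import random
from datetime import date, timedelta


def make_events(n, start, seed=0):
    """Generate n calendar events spread over two weeks, reproducibly."""
    rng = random.Random(seed)  # fixed seed keeps the pack identical across runs
    titles = ["Standup", "1:1", "Design Review", "Lunch", "Planning"]
    events = []
    for i in range(n):
        day = start + timedelta(days=rng.randrange(14))
        events.append({
            "id": i + 1,
            "title": rng.choice(titles),
            "date": day.isoformat(),
            "hour": rng.randrange(9, 17),  # working hours, 9:00 to 16:00
        })
    return events


# 25 is an arbitrary illustrative size, not a recommended value.
pack = make_events(25, date(2024, 1, 1))
print(len(pack), pack[0])
```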
Test multiple valid paths
Many tasks have more than one correct approach. Design verifiers that check outcomes, not specific click sequences.
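The difference between the two styles of check can be sketched as follows. The function names, the log format, and the element IDs other than save-button are hypothetical; the point is only that an outcome check accepts both runs, while a click-sequence check rejects a valid alternative route.

```python
def event_was_created(events, title):
    """Outcome check: passes if the event exists, however the agent got there."""
    return any(e["title"] == title for e in events)


def followed_exact_clicks(log, expected):
    """Brittle check: passes only for one specific click sequence."""
    return log == expected


# Two hypothetical runs that reach the same final state by different routes.
final_events = [{"title": "Team Meeting"}]
runs = {
    "via calendar tab": ["calendar-tab", "new-event", "save-button"],
    "via quick add": ["quick-add", "save-button"],
}
expected = ["calendar-tab", "new-event", "save-button"]

for name, log in runs.items():
    outcome_ok = event_was_created(final_events, "Team Meeting")
    sequence_ok = followed_exact_clicks(log, expected)
    print(f"{name}: outcome={outcome_ok} sequence={sequence_ok}")
```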
Include negative checks
Verify that the agent didn’t break unrelated data:
- “Other events remain unchanged”
- “User profile not modified”
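A negative check such as “Other events remain unchanged” can be approximated by comparing snapshots taken before and after the run, ignoring only the records the task was supposed to touch. This is a sketch under assumptions: the snapshot format (a dict keyed by record ID) and the `unchanged_except` helper are hypothetical, not part of the verifier API.

```python
import copy


def unchanged_except(before, after, allowed_ids):
    """Return True if every record outside allowed_ids is identical in both snapshots."""
    ids = (set(before) | set(after)) - set(allowed_ids)
    # A record missing from one snapshot compares as None, so unexpected
    # additions and deletions are caught as well as modifications.
    return all(before.get(i) == after.get(i) for i in ids)


before = {
    1: {"title": "Standup", "day": "Mon"},
    2: {"title": "1:1", "day": "Tue"},
}

# A clean run: only the requested event (id 3) was added.
after = copy.deepcopy(before)
after[3] = {"title": "Team Meeting", "day": "Thu"}
ok = unchanged_except(before, after, allowed_ids={3})

# A faulty run: the agent also moved an unrelated event.
after[2]["day"] = "Fri"
broken = unchanged_except(before, after, allowed_ids={3})

print(ok, broken)
```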