Task Components
1. Prompt
A natural language description of the goal. Example: “Extract and analyze quarterly sales data from the business report. Open ‘Q2_2025_Business_Report.docx’ in Word and review the sales data. Then, create an analysis spreadsheet in Excel…”
2. Initialization Config
Programmatic setup that ensures a consistent starting state:
| Function | Purpose |
|---|---|
| download | Place initial files in the VM |
| open | Launch applications with target files |
| execute | Run commands or scripts |
3. Evaluators
Verifiers that determine whether the task was completed successfully, using:
- File state comparison
- Application state checking
- Screenshot analysis
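As a minimal sketch of file state comparison, assume the agent's output and the gold standard are both CSV files. The helper below (a hypothetical `csv_matches_gold`, not part of any evaluator library) compares parsed cell values rather than raw bytes, so differences in line endings or whitespace around cells do not cause a false failure.

```python
import csv
from pathlib import Path


def load_cells(path: Path) -> list[list[str]]:
    """Read a CSV file into rows of whitespace-stripped cell strings."""
    with path.open(newline="", encoding="utf-8") as handle:
        return [[cell.strip() for cell in row] for row in csv.reader(handle)]


def csv_matches_gold(output_path: Path, gold_path: Path) -> bool:
    """Return True when both files hold the same cells in the same order."""
    if not output_path.exists():
        return False  # the agent never produced the expected file
    return load_cells(output_path) == load_cells(gold_path)
```

A check like this treats a missing output file as a failure rather than an error, which keeps the evaluator's result a plain pass or fail.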
Initialization Config Structure
Initialization Functions
| Type | Description | Use Case |
|---|---|---|
| download | Fetch file from URL to VM path | Place input documents |
| open | Launch application with file | Start Word, Excel, browser |
| execute | Run shell command | Create folders, configure settings |
| launch | Start application without file | Open empty application |
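The four function types can be pictured as an ordered list of steps. The sketch below is illustrative only: the step types come from the table above, but the field names (`type`, `url`, `path`, `command`, `app`), the URL, and the paths are assumptions, not the actual config schema. It also shows a small hypothetical validator that catches malformed steps before a run starts.

```python
# Hypothetical initialization config. Only the four step types come from the
# table above; every field name, URL, and path is an illustrative assumption.
INIT_CONFIG = [
    {"type": "execute", "command": "mkdir Reports"},
    {
        "type": "download",
        "url": "https://example.com/Q2_2025_Business_Report.docx",
        "path": "Reports/Q2_2025_Business_Report.docx",
    },
    {"type": "open", "path": "Reports/Q2_2025_Business_Report.docx"},
    {"type": "launch", "app": "excel"},
]

# Fields each step type is assumed to require.
REQUIRED_FIELDS = {
    "download": {"url", "path"},
    "open": {"path"},
    "execute": {"command"},
    "launch": {"app"},
}


def validate_init_config(steps: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every step is well-formed."""
    problems = []
    for index, step in enumerate(steps):
        step_type = step.get("type")
        if step_type not in REQUIRED_FIELDS:
            problems.append(f"step {index}: unknown type {step_type!r}")
            continue
        missing = REQUIRED_FIELDS[step_type] - set(step)
        if missing:
            problems.append(f"step {index}: {step_type} is missing {sorted(missing)}")
    return problems
```

Ordering matters in a list like this: the folder is created before the download writes into it, and the file is downloaded before it is opened.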
Evaluator Configuration
Desktop evaluators compare the agent’s output against gold-standard files.
OS-Specific Considerations
| OS | Applications | Notes |
|---|---|---|
| Linux | LibreOffice, Firefox, VS Code | Use xdotool for automation |
| Windows | Microsoft Office, Chrome, VS Code | Use PowerShell for scripting |
| macOS | Microsoft Office, Safari, VS Code | Use osascript for AppleScript automation |
macOS Example (Save before evaluation)
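A minimal sketch of a save step on macOS, assuming the document is open in the named application and that the process running the script has been granted Accessibility permission to send keystrokes. The helper names and the choice of a Cmd+S keystroke through System Events are illustrative assumptions, not the framework's own postconfig format.

```python
import platform
import subprocess


def build_save_command(app_name: str) -> list[str]:
    """Build an osascript argv that focuses app_name and sends Cmd+S."""
    return [
        "osascript",
        "-e", f'tell application "{app_name}" to activate',
        "-e", "delay 1",  # give the application time to come to the front
        "-e", 'tell application "System Events" to keystroke "s" using command down',
    ]


def save_before_evaluation(app_name: str) -> None:
    """Run the save command; raises CalledProcessError if osascript fails."""
    if platform.system() != "Darwin":
        raise RuntimeError("osascript is only available on macOS")
    subprocess.run(build_save_command(app_name), check=True, timeout=30)
```

For example, `save_before_evaluation("Microsoft Excel")` would be called after the agent finishes and before any evaluator reads the output file.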
Best Practices
Use deterministic initialization
Ensure every agent run starts from the exact same state. Download fresh files rather than relying on pre-existing VM state.
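One way to approximate this, sketched under the assumption that input files are staged into a dedicated working directory: wipe the directory before each run and check every staged file against a pinned SHA-256 digest. The helper names are hypothetical, and a local copy stands in for the download step so the example stays network-free.

```python
import hashlib
import shutil
from pathlib import Path


def reset_workdir(workdir: Path) -> None:
    """Delete workdir and everything in it, then recreate it empty."""
    shutil.rmtree(workdir, ignore_errors=True)
    workdir.mkdir(parents=True)


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stage_file(source: Path, workdir: Path, expected_sha256: str) -> Path:
    """Copy source into workdir and fail loudly if its content has drifted."""
    target = workdir / source.name
    shutil.copyfile(source, target)
    if sha256_of(target) != expected_sha256:
        raise ValueError(f"{target.name} does not match the pinned digest")
    return target
```

Pinning a digest means a silently updated input file surfaces as a setup error instead of an unexplained change in agent scores.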
Design for outcome verification
Focus on verifying the end result (file content, structure) rather than the exact steps taken. Multiple valid paths may exist.
Include save steps in postconfig
Most applications don’t auto-save. Always include explicit save commands before running evaluators.
Test on target OS first
Application behavior varies by OS. Test your gold standard and evaluator on the same OS the agent will use.
Use appropriate evaluator functions
Choose specific evaluators (pivot_table, compare_docx_lines) over generic ones (exact_match) for meaningful comparison.