Desktop tasks evaluate an agent’s ability to interact with full operating system environments—manipulating files, using applications like Excel, Word, and browsers, and executing multi-step workflows.

Task Components

1. Prompt

A natural language description of the goal. Example: “Extract and analyze quarterly sales data from the business report. Open ‘Q2_2025_Business_Report.docx’ in Word and review the sales data. Then, create an analysis spreadsheet in Excel…”

2. Initialization Config

Programmatic setup that ensures consistent starting state:
| Function | Purpose |
|----------|---------|
| download | Place initial files in the VM |
| open | Launch applications with target files |
| execute | Run commands or scripts |

3. Evaluators

Verifiers that determine whether the task was completed successfully, using one or more of:
  • File state comparison (sketched below)
  • Application state checking
  • Screenshot analysis
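
At its simplest, file state comparison loads the agent’s output file and a gold-standard file and compares their contents. The snippet below is only a rough illustration of that idea in Python using openpyxl; it is not the framework’s evaluator code, and in practice evaluators are configured declaratively as shown later on this page.

# Rough illustration of a file-state comparison, not the framework's evaluator code.
# Assumes openpyxl is installed; paths mirror the example evaluator config below.
from openpyxl import load_workbook

def sheets_match(result_path: str, gold_path: str) -> bool:
    # data_only=True reads cached cell values rather than formula strings
    result_ws = load_workbook(result_path, data_only=True).active
    gold_ws = load_workbook(gold_path, data_only=True).active
    return list(result_ws.iter_rows(values_only=True)) == list(gold_ws.iter_rows(values_only=True))

sheets_match("/Users/user/Desktop/output.xlsx", "gold.xlsx")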

Initialization Config Structure

{
  "init": [
    {
      "type": "download",
      "parameters": {
        "url": "https://example.s3.amazonaws.com/Q2_Report.docx",
        "path": "/Users/user/Desktop/Q2_Report.docx"
      }
    },
    {
      "type": "open",
      "parameters": {
        "path": "/Users/user/Desktop/Q2_Report.docx"
      }
    },
    {
      "type": "execute",
      "parameters": {
        "command": ["mkdir", "-p", "/Users/user/Desktop/Output"]
      }
    }
  ]
}

Initialization Functions

| Type | Description | Use Case |
|------|-------------|----------|
| download | Fetch file from URL to VM path | Place input documents |
| open | Launch application with file | Start Word, Excel, browser |
| execute | Run shell command | Create folders, configure settings |
| launch | Start application without file | Open empty application |
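
The launch function does not appear in the configuration example above. The sketch below shows how an init entry for it might look; the "app" field name is a placeholder assumption, and the actual parameter name may differ.

{
  "init": [
    {
      "type": "launch",
      "parameters": {
        "app": "Microsoft Excel"
      }
    }
  ]
}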

Evaluator Configuration

Desktop evaluators compare the agent’s output against gold standard files:
{
  "evaluator": {
    "func": "compare_table",
    "expected": {
      "type": "cloud_file",
      "path": "https://example.s3.amazonaws.com/gold.xlsx",
      "dest": "gold.xlsx"
    },
    "result": {
      "type": "vm_file",
      "path": "/Users/user/Desktop/output.xlsx",
      "dest": "output.xlsx"
    }
  }
}
See Desktop Verifiers for full evaluator documentation.

OS-Specific Considerations

| OS | Applications | Notes |
|----|--------------|-------|
| Linux | LibreOffice, Firefox, VS Code | Use xdotool for automation |
| Windows | Microsoft Office, Chrome, VS Code | Use PowerShell for scripting |
| macOS | Microsoft Office, Safari, VS Code | Use osascript for AppleScript automation |
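
Linux Example (Save before evaluation)

On Linux, the same save-before-evaluation pattern as the macOS example that follows can be expressed with xdotool. This is an illustrative sketch only: the "LibreOffice Calc" window title and the one-second pause are assumptions that may need adjusting for your task.

{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["xdotool", "search", "--name", "LibreOffice Calc", "windowactivate", "--sync"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 1 } },
    {
      "type": "execute",
      "parameters": {
        "command": ["xdotool", "key", "ctrl+s"]
      }
    }
  ]
}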

macOS Example (Save before evaluation)

{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 1 } },
    {
      "type": "execute",
      "parameters": {
        "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
      }
    }
  ]
}
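
In this sequence, activating Excel first ensures the subsequent keystroke reaches the right window, the sleep gives the application a moment to come to the foreground, and the pyautogui call sends Command+S to save the open workbook.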

Best Practices

  • Ensure every agent run starts from the exact same state. Download fresh files rather than relying on pre-existing VM state.
  • Focus on verifying the end result (file content, structure) rather than the exact steps taken. Multiple valid paths may exist.
  • Most applications don’t auto-save. Always include explicit save commands before running evaluators.
  • Application behavior varies by OS. Test your gold standard and evaluator on the same OS the agent will use.
  • Choose specific evaluators (pivot_table, compare_docx_lines) over generic ones (exact_match) for meaningful comparison (see the sketch below).
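
For example, assuming compare_docx_lines accepts the same expected/result structure as the compare_table example above (see Desktop Verifiers for its actual parameters), an evaluator for a Word deliverable could be configured as follows; the file names and URL are placeholders.

{
  "evaluator": {
    "func": "compare_docx_lines",
    "expected": {
      "type": "cloud_file",
      "path": "https://example.s3.amazonaws.com/gold.docx",
      "dest": "gold.docx"
    },
    "result": {
      "type": "vm_file",
      "path": "/Users/user/Desktop/output.docx",
      "dest": "output.docx"
    }
  }
}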

Next Steps