Desktop tasks evaluate an agent’s ability to interact with full operating system environments—manipulating files, using applications like Excel, Word, and browsers, and executing multi-step workflows.

Task Components

1. Prompt

A natural language description of the goal. Example: “Extract and analyze quarterly sales data from the business report. Open ‘Q2_2025_Business_Report.docx’ in Word and review the sales data. Then, create an analysis spreadsheet in Excel…”

2. Initialization Config

Programmatic setup that ensures consistent starting state:
| Function | Purpose |
|----------|---------|
| download | Place initial files in the VM |
| open | Launch applications with target files |
| execute | Run commands or scripts |

3. Evaluators

Verifiers that determine whether the task was completed successfully, using one or more of:
  • File state comparison (sketched below)
  • Application state checking
  • Screenshot analysis
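
At its simplest, file state comparison loads the agent’s output file and a gold-standard file and compares their contents. The snippet below is only a rough illustration of that idea in Python using openpyxl; it is not the framework’s evaluator code, and in practice evaluators are configured declaratively as shown later on this page.

# Rough illustration of a file-state comparison, not the framework's evaluator code.
# Assumes openpyxl is installed; paths mirror the example evaluator config below.
from openpyxl import load_workbook

def sheets_match(result_path: str, gold_path: str) -> bool:
    # data_only=True reads cached cell values rather than formula strings
    result_ws = load_workbook(result_path, data_only=True).active
    gold_ws = load_workbook(gold_path, data_only=True).active
    return list(result_ws.iter_rows(values_only=True)) == list(gold_ws.iter_rows(values_only=True))

sheets_match("/Users/user/Desktop/output.xlsx", "gold.xlsx")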

Initialization Config Structure

{
  "init": [
    {
      "type": "download",
      "parameters": {
        "url": "https://example.s3.amazonaws.com/Q2_Report.docx",
        "path": "/Users/user/Desktop/Q2_Report.docx"
      }
    },
    {
      "type": "open",
      "parameters": {
        "path": "/Users/user/Desktop/Q2_Report.docx"
      }
    },
    {
      "type": "execute",
      "parameters": {
        "command": ["mkdir", "-p", "/Users/user/Desktop/Output"]
      }
    }
  ]
}

Initialization Functions

| Type | Description | Use Case |
|------|-------------|----------|
| download | Fetch file from URL to VM path | Place input documents |
| open | Launch application with file | Start Word, Excel, browser |
| execute | Run shell command | Create folders, configure settings |
| launch | Start application without file | Open empty application |
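
The launch function does not appear in the configuration example above. The sketch below shows how an init entry for it might look; the "app" field name is a placeholder assumption, and the actual parameter name may differ.

{
  "init": [
    {
      "type": "launch",
      "parameters": {
        "app": "Microsoft Excel"
      }
    }
  ]
}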

Evaluator Configuration

Desktop evaluators compare the agent’s output against gold standard files:
{
  "evaluator": {
    "func": "compare_table",
    "expected": {
      "type": "cloud_file",
      "path": "https://example.s3.amazonaws.com/gold.xlsx",
      "dest": "gold.xlsx"
    },
    "result": {
      "type": "vm_file",
      "path": "/Users/user/Desktop/output.xlsx",
      "dest": "output.xlsx"
    }
  }
}
See Desktop Verifiers for full evaluator documentation.

OS-Specific Considerations

| OS | Applications | Notes |
|----|--------------|-------|
| Linux | LibreOffice, Firefox, VS Code | Use xdotool for automation |
| Windows | Microsoft Office, Chrome, VS Code | Use PowerShell for scripting |
| macOS | Microsoft Office, Safari, VS Code | Use osascript for AppleScript automation |
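
Linux Example (Save before evaluation)

On Linux, the same save-before-evaluation pattern as the macOS example that follows can be expressed with xdotool. This is an illustrative sketch only: the "LibreOffice Calc" window title and the one-second pause are assumptions that may need adjusting for your task.

{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["xdotool", "search", "--name", "LibreOffice Calc", "windowactivate", "--sync"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 1 } },
    {
      "type": "execute",
      "parameters": {
        "command": ["xdotool", "key", "ctrl+s"]
      }
    }
  ]
}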

macOS Example (Save before evaluation)

{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 1 } },
    {
      "type": "execute",
      "parameters": {
        "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
      }
    }
  ]
}
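
In this sequence, activating Excel first ensures the subsequent keystroke reaches the right window, the sleep gives the application a moment to come to the foreground, and the pyautogui call sends Command+S to save the open workbook.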

Best Practices

  • Ensure every agent run starts from the exact same state. Download fresh files rather than relying on pre-existing VM state.
  • Focus on verifying the end result (file content, structure) rather than the exact steps taken. Multiple valid paths may exist.
  • Most applications don’t auto-save. Always include explicit save commands before running evaluators.
  • Application behavior varies by OS. Test your gold standard and evaluator on the same OS the agent will use.
  • Choose specific evaluators (pivot_table, compare_docx_lines) over generic ones (exact_match) for meaningful comparison (see the sketch below).
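
For example, assuming compare_docx_lines accepts the same expected/result structure as the compare_table example above (see Desktop Verifiers for its actual parameters), an evaluator for a Word deliverable could be configured as follows; the file names and URL are placeholders.

{
  "evaluator": {
    "func": "compare_docx_lines",
    "expected": {
      "type": "cloud_file",
      "path": "https://example.s3.amazonaws.com/gold.docx",
      "dest": "gold.docx"
    },
    "result": {
      "type": "vm_file",
      "path": "/Users/user/Desktop/output.docx",
      "dest": "output.docx"
    }
  }
}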

Next Steps