Desktop Verifiers

Desktop environments use evaluator functions that compare VM state against expected outcomes (gold standards).

Explore live evaluator examples at gym.scale.com.

Access

Method	Endpoint
API	`POST /run_evaluator`
Web UI	"Run Evaluator" button
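As a rough illustration of the API route in the table above, the sketch below builds a `POST /run_evaluator` request with Python's standard library. The host name and the payload fields (`vm_id`, `task_id`) are assumptions chosen for illustration, not the documented request schema, so check the Run Evaluator API page before relying on them. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_run_evaluator_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /run_evaluator endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/run_evaluator",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and payload fields, used only for illustration.
request = build_run_evaluator_request(
    "https://gym.example.invalid",
    {"vm_id": "vm-abc123", "task_id": "task-001"},
)
print(request.get_method(), request.full_url)
# To send it for real: urllib.request.urlopen(request)
```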

Evaluator Structure

The evaluator is defined in the task_config.evaluator object:

{
  "evaluator": {
    "func": "compare_table",
    "expected": { ... },
    "result": { ... },
    "postconfig": [ ... ],
    "options": { ... }
  }
}

Core Fields

Field	Type	Description
`func`	string	Evaluation function to use
`expected`	object	Gold standard file configuration
`result`	object	Result file from the VM to compare
`postconfig`	array	Actions to run before evaluation
`options`	object	Comparison rules and settings
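A small pre-flight check can catch type mistakes in an evaluator object before it is submitted. The sketch below maps the types in the table above to Python types. The helper name `check_evaluator` is hypothetical, and treating only `func` as required is an assumption, since this page does not say which fields are mandatory.

```python
# Field types taken from the Core Fields table (string/object/array).
EXPECTED_TYPES = {
    "func": str,
    "expected": dict,
    "result": dict,
    "postconfig": list,
    "options": dict,
}


def check_evaluator(evaluator: dict, required=("func",)) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for field in required:  # assumption: only `func` is mandatory
        if field not in evaluator:
            problems.append(f"missing field: {field}")
    for field, value in evaluator.items():
        expected_type = EXPECTED_TYPES.get(field)
        if expected_type is None:
            problems.append(f"unknown field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{field} should be {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_evaluator({"func": "compare_table", "postconfig": {}}))
```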

File Types

Cloud File (Gold Standard)

Files hosted at a URL, typically your expected output:

{
  "expected": {
    "type": "cloud_file",
    "path": "https://example.s3.amazonaws.com/gold.xlsx",
    "dest": "gold.xlsx"
  }
}

VM File (Result)

Files on the VM filesystem, typically the agent’s output:

{
  "result": {
    "type": "vm_file",
    "path": "/Users/user/Desktop/output.xlsx",
    "dest": "output.xlsx"
  }
}
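When many tasks share the same shape, the two file descriptors above can be generated rather than hand-written. This is a minimal sketch: the helper names `cloud_file` and `vm_file` are hypothetical, and defaulting `dest` to the file name at the end of `path` is a convenience assumed here, not documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def cloud_file(url: str, dest: str = None) -> dict:
    """Describe a gold-standard file hosted at a URL."""
    default_name = PurePosixPath(urlparse(url).path).name
    return {"type": "cloud_file", "path": url, "dest": dest or default_name}


def vm_file(path: str, dest: str = None) -> dict:
    """Describe a result file located on the VM filesystem."""
    return {"type": "vm_file", "path": path, "dest": dest or PurePosixPath(path).name}


expected = cloud_file("https://example.s3.amazonaws.com/gold.xlsx")
result = vm_file("/Users/user/Desktop/output.xlsx")
print(expected)
print(result)
```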

Postconfig Actions

Actions executed before evaluation to prepare VM state (e.g., save files).

Execute

Run a shell command:

{
  "type": "execute",
  "parameters": {
    "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
  }
}

Sleep

Wait for a specified number of seconds:

{
  "type": "sleep",
  "parameters": { "seconds": 1 }
}

Common Postconfig Pattern

Save file before evaluation (macOS Excel example):

{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 1 } },
    {
      "type": "execute",
      "parameters": {
        "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 0.5 } }
  ]
}
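The save pattern above can also be generated programmatically, which helps when the same sequence is needed for several applications. The helper names below (`execute_action`, `sleep_action`, `save_front_document`) are hypothetical; the emitted actions mirror the activate, wait, save, wait sequence shown above and assume the target application responds to Command+S.

```python
import json


def execute_action(*command: str) -> dict:
    """Build an `execute` postconfig action from command-line arguments."""
    return {"type": "execute", "parameters": {"command": list(command)}}


def sleep_action(seconds: float) -> dict:
    """Build a `sleep` postconfig action."""
    return {"type": "sleep", "parameters": {"seconds": seconds}}


def save_front_document(app_name: str) -> list:
    """Activate a macOS application, then press Command+S, with pauses in between."""
    return [
        execute_action("osascript", "-e", f'tell application "{app_name}" to activate'),
        sleep_action(1),
        execute_action("python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"),
        sleep_action(0.5),
    ]


print(json.dumps({"postconfig": save_front_document("Microsoft Excel")}, indent=2))
```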

Comparison Rules

Rules define how files are compared. They are specified in options.rules:

{
  "options": {
    "rules": [
      { "type": "pivot_table", ... },
      { "type": "freeze", ... }
    ]
  }
}

Available Rule Types

Rule Type	Description	Use Case
`pivot_table`	Compare pivot table structure	Excel pivot tasks
`freeze`	Compare freeze pane settings	Spreadsheet formatting
`exact_match`	Byte-for-byte comparison	Exact output tasks
`structural`	Compare document structure	Document editing

Pivot Table Rule

{
  "type": "pivot_table",
  "sheet_idx0": "RNPivotData",
  "sheet_idx1": "ENPivotData",
  "pivot_props": ["col_fields", "row_fields", "data_fields"]
}

Freeze Rule

{
  "type": "freeze",
  "sheet_idx0": "RNPivotData",
  "sheet_idx1": "ENPivotData"
}
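Both rule examples above refer to sheets as RNPivotData and ENPivotData. One plausible reading, which this page does not confirm, is that the first letter selects the workbook (R for result, E for expected) and the second letter means the sheet is looked up by name, leaving PivotData as the sheet name. The sketch below encodes only that assumed reading; `split_sheet_ref` is a hypothetical helper and any other prefix is rejected rather than guessed at.

```python
# Assumed meaning of the two prefixes seen in the examples; not confirmed by this page.
PREFIXES = {"RN": "result", "EN": "expected"}


def split_sheet_ref(ref: str) -> tuple:
    """Split a reference like 'RNPivotData' into (workbook, sheet_name)."""
    source = PREFIXES.get(ref[:2])
    if source is None or len(ref) <= 2:
        raise ValueError(f"unrecognized sheet reference: {ref!r}")
    return source, ref[2:]


rule = {"type": "freeze", "sheet_idx0": "RNPivotData", "sheet_idx1": "ENPivotData"}
print(split_sheet_ref(rule["sheet_idx0"]))  # ('result', 'PivotData')
print(split_sheet_ref(rule["sheet_idx1"]))  # ('expected', 'PivotData')
```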

Evaluator Functions

The func field specifies which evaluator to use. Over 100 functions are available, organized by category:

Spreadsheets & Tables

Function	Description
`compare_table`	Excel/spreadsheet comparison with rules (pivot tables, freeze panes, etc.)
`compare_csv`	CSV file comparison

Documents (Word/DOCX)

Function	Description
`compare_docx_files`	Full document comparison
`compare_docx_lines`	Line-by-line comparison
`compare_line_spacing`	Spacing verification
`compare_font_names`	Font verification
`is_first_line_centered`	Alignment check

Presentations (PowerPoint/PPTX)

Function	Description
`compare_pptx_files`	Full presentation comparison
`check_slide_orientation_Portrait`	Orientation check
`check_image_stretch_and_center`	Image positioning

Images

Function	Description
`compare_images`	Image similarity comparison
`compare_image_list`	Multiple image comparison
`check_structure_sim`	Structural similarity

General

Function	Description
`exact_match`	Byte-for-byte comparison
`fuzzy_match`	Fuzzy string matching
`compare_text_file`	Text file comparison
`compare_pdfs`	PDF comparison
`check_json`	JSON validation
`check_file_exists`	File existence check
`infeasible`	Mark task as impossible

VS Code

Function	Description
`is_extension_installed`	Extension check
`check_json_settings`	Settings verification
`check_python_file_by_test_suite`	Run Python tests

Browser

Function	Description
`is_expected_bookmarks`	Bookmark verification
`is_expected_tabs`	Tab verification
`is_expected_installed_extensions`	Extension check

Additional functions by category:

Documents: compare_docx_tables, compare_highlighted_text, compare_insert_equation, compare_references, has_page_numbers_in_footers, check_highlighted_words, check_tabstops, contains_page_break, evaluate_alignment, evaluate_spacing
Slides: check_strikethrough, check_transition, check_slide_numbers_color, check_presenter_console_disable, evaluate_presentation_fill_to_rgb_distance
Images: check_image_size, check_image_mirror, check_brightness_decrease_and_structure_sim, check_contrast_increase_and_structure_sim, check_saturation_increase_and_structure_sim
Browser: compare_htmls, compare_archive, is_cookie_deleted, check_history_deleted, is_expected_url_pattern_match, compare_pdf_images
General: check_csv, check_list, check_include_exclude, run_sqlite3, file_contains, diff_text_file, literal_match, is_in_list
VS Code: compare_config, compare_zip_files, check_json_keybindings, check_python_file_by_gold_file
Media: compare_audios, compare_videos, is_vlc_playing, is_vlc_fullscreen
Other: compare_epub, check_mp3_meta, check_pdf_pages, check_thunderbird_prefs

Response Structure

{
  "status": "success",
  "message": "Verifier passed",
  "vm_id": "vm-abc123",
  "task_id": "task-001",
  "evaluation_score": 1
}

Score Interpretation

Score	Meaning
`1.0`	Task completed successfully
`0.0`	Task not completed or failed checks

Failed Example

{
  "status": "success",
  "message": "Rule 1 (pivot_table): Pivot tables differ between RNPivotData and ENPivotData",
  "vm_id": "vm-abc123",
  "task_id": "task-001",
  "evaluation_score": 0
}
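Note that both the passing and the failing response shown on this page carry "status": "success", so a client should decide pass or fail from `evaluation_score` rather than from `status`. The sketch below does that; the helper name `interpret` is hypothetical, and it assumes the score is always present and numeric as in the examples.

```python
def interpret(response: dict) -> tuple:
    """Return (passed, message) based on evaluation_score, not on status."""
    score = float(response["evaluation_score"])
    return score == 1.0, response.get("message", "")


passing = {
    "status": "success",
    "message": "Verifier passed",
    "evaluation_score": 1,
}
failing = {
    "status": "success",
    "message": "Rule 1 (pivot_table): Pivot tables differ between RNPivotData and ENPivotData",
    "evaluation_score": 0,
}

for response in (passing, failing):
    passed, message = interpret(response)
    print("PASS" if passed else "FAIL", "-", message)
```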

Complete Example

Excel pivot table verification:

{
  "evaluator": {
    "func": "compare_table",
    "expected": {
      "type": "cloud_file",
      "path": "https://temp-cua.s3.us-west-2.amazonaws.com/gold.xlsx",
      "dest": "historical_unemployment-gold.xlsx"
    },
    "result": {
      "type": "vm_file",
      "path": "/Users/user/Desktop/historical_unemployment.xlsx",
      "dest": "historical_unemployment.xlsx"
    },
    "postconfig": [
      {
        "type": "execute",
        "parameters": {
          "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
        }
      },
      { "type": "sleep", "parameters": { "seconds": 1 } },
      {
        "type": "execute",
        "parameters": {
          "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
        }
      },
      { "type": "sleep", "parameters": { "seconds": 0.5 } }
    ],
    "options": {
      "rules": [
        {
          "type": "pivot_table",
          "sheet_idx0": "RNPivotData",
          "sheet_idx1": "ENPivotData",
          "pivot_props": ["col_fields", "row_fields", "data_fields"]
        },
        {
          "type": "freeze",
          "sheet_idx0": "RNPivotData",
          "sheet_idx1": "ENPivotData"
        }
      ]
    }
  }
}

Best Practices

Always save files before evaluation

Many applications don’t auto-save. Use postconfig to trigger a save before evaluation.

Add delays for UI stability

Use sleep actions between commands to allow the UI to settle (typically 0.5-1 second).

Use specific rules for complex comparisons

For spreadsheets, define exactly which aspects to compare rather than using exact match.

Test gold standards manually first

Verify your expected output is correct before using it for automated evaluation.
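To see why the advice above favors specific rules over exact match for spreadsheets, recall that an .xlsx file is a ZIP archive, and archive metadata such as timestamps changes from one save to the next. The sketch below builds two tiny archives with identical contents but different timestamps using only the standard library. The single member is a stand-in for illustration, not a valid workbook.

```python
import io
import zipfile


def make_archive(timestamp: tuple) -> bytes:
    """Create an in-memory ZIP with one member stamped with the given time."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        info = zipfile.ZipInfo("xl/worksheets/sheet1.xml", date_time=timestamp)
        archive.writestr(info, "<sheetData/>")  # placeholder content
    return buffer.getvalue()


def read_members(data: bytes) -> dict:
    """Map each member name to its decompressed bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


first = make_archive((2024, 1, 1, 9, 0, 0))
second = make_archive((2024, 1, 1, 9, 5, 0))

print(first == second)                              # False: raw bytes differ
print(read_members(first) == read_members(second))  # True: contents match
```

A byte-for-byte check would fail this pair even though the contents match, which is the situation rule-based comparison is meant to avoid.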

Next Steps

Run Evaluator API

MCP Verifiers
