Desktop environments use evaluator functions that compare VM state against expected outcomes (gold standards).
Explore live evaluator examples at gym.scale.com.

Access

| Method | Endpoint |
| --- | --- |
| API | POST /run_evaluator |
| Web UI | "Run Evaluator" button |
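As a rough sketch of the API route, the snippet below builds a POST request for /run_evaluator using only the Python standard library. The host name and the payload fields (vm_id, task_id) are assumptions made for illustration; this page does not specify the request body, so check what your deployment expects before sending anything.

```python
import json
import urllib.request


def build_run_evaluator_request(base_url, payload):
    """Build (but do not send) a POST request for the /run_evaluator endpoint."""
    url = base_url.rstrip("/") + "/run_evaluator"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and payload fields; replace them with your own values.
request = build_run_evaluator_request(
    "https://gym.example.invalid",
    {"vm_id": "vm-abc123", "task_id": "task-001"},
)

# To actually send the request:
#     with urllib.request.urlopen(request) as resp:
#         result = json.load(resp)
```

Building the request separately from sending it keeps the sketch safe to run offline and makes the URL and body easy to inspect.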

Evaluator Structure

The evaluator is defined in the task_config.evaluator object:
{
  "evaluator": {
    "func": "compare_table",
    "expected": { ... },
    "result": { ... },
    "postconfig": [ ... ],
    "options": { ... }
  }
}

Core Fields

| Field | Type | Description |
| --- | --- | --- |
| func | string | Evaluation function to use |
| expected | object | Gold standard file configuration |
| result | object | Result file from the VM to compare |
| postconfig | array | Actions to run before evaluation |
| options | object | Comparison rules and settings |
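A quick type check can catch malformed configs before they reach the evaluator. The sketch below maps the field types listed above to Python types. Treating only func as mandatory is an assumption; this page does not say which fields are required.

```python
# Python type expected for each core field (string/object/array per the table).
CORE_FIELD_TYPES = {
    "func": str,
    "expected": dict,
    "result": dict,
    "postconfig": list,
    "options": dict,
}


def check_evaluator_fields(evaluator):
    """Return a list of problems; an empty list means the core fields look right."""
    problems = []
    # Assumption: "func" is the only field treated as mandatory here.
    if "func" not in evaluator:
        problems.append("missing required field: func")
    for field, expected_type in CORE_FIELD_TYPES.items():
        if field in evaluator and not isinstance(evaluator[field], expected_type):
            actual = type(evaluator[field]).__name__
            problems.append(f"{field} should be {expected_type.__name__}, got {actual}")
    return problems


print(check_evaluator_fields({"func": "compare_table", "postconfig": {}}))
```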

File Types

Cloud File (Gold Standard)

Files hosted at a URL, typically your expected output:
{
  "expected": {
    "type": "cloud_file",
    "path": "https://example.s3.amazonaws.com/gold.xlsx",
    "dest": "gold.xlsx"
  }
}

VM File (Result)

Files on the VM filesystem, the agent’s output:
{
  "result": {
    "type": "vm_file",
    "path": "/Users/user/Desktop/output.xlsx",
    "dest": "output.xlsx"
  }
}
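If you generate task configs programmatically, two small helpers keep the file specs consistent. This is a minimal sketch; the helper names are hypothetical, and the check that a cloud_file path is an http(s) URL is an assumption rather than a documented constraint.

```python
from urllib.parse import urlparse


def cloud_file(url, dest):
    """File spec for a gold-standard file hosted at a URL."""
    # Assumption: cloud files are fetched over http(s).
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"cloud_file path should be an http(s) URL: {url!r}")
    return {"type": "cloud_file", "path": url, "dest": dest}


def vm_file(path, dest):
    """File spec for a file on the VM filesystem (the agent's output)."""
    return {"type": "vm_file", "path": path, "dest": dest}


expected = cloud_file("https://example.s3.amazonaws.com/gold.xlsx", "gold.xlsx")
result = vm_file("/Users/user/Desktop/output.xlsx", "output.xlsx")
```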

Postconfig Actions

Actions executed before evaluation to prepare VM state (e.g., save files).

Execute

Run a shell command:
{
  "type": "execute",
  "parameters": {
    "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
  }
}

Sleep

Wait for the specified number of seconds:
{
  "type": "sleep",
  "parameters": { "seconds": 1 }
}

Common Postconfig Pattern

Save file before evaluation (macOS Excel example):
{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 1 } },
    {
      "type": "execute",
      "parameters": {
        "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
      }
    },
    { "type": "sleep", "parameters": { "seconds": 0.5 } }
  ]
}
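The same save sequence can be generated from Python, which avoids hand-escaping the quotes inside the AppleScript string. This is a sketch only; the helper names are hypothetical and the default delays simply mirror the values in the pattern above.

```python
import json


def execute_action(command):
    """Postconfig action that runs a command given as an argument list."""
    return {"type": "execute", "parameters": {"command": list(command)}}


def sleep_action(seconds):
    """Postconfig action that waits for the given number of seconds."""
    return {"type": "sleep", "parameters": {"seconds": seconds}}


def save_postconfig(app_name, settle=1, after_save=0.5):
    """Activate a macOS application, press Cmd+S, and wait for the save to finish."""
    activate = f'tell application "{app_name}" to activate'
    press_save = "import pyautogui; pyautogui.hotkey('command', 's')"
    return [
        execute_action(["osascript", "-e", activate]),
        sleep_action(settle),
        execute_action(["python", "-c", press_save]),
        sleep_action(after_save),
    ]


postconfig = save_postconfig("Microsoft Excel")
# json.dumps takes care of escaping the inner double quotes.
print(json.dumps({"postconfig": postconfig}, indent=2))
```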

Comparison Rules

Rules define how files are compared and are specified in options.rules:
{
  "options": {
    "rules": [
      { "type": "pivot_table", ... },
      { "type": "freeze", ... }
    ]
  }
}

Available Rule Types

| Rule Type | Description | Use Case |
| --- | --- | --- |
| pivot_table | Compare pivot table structure | Excel pivot tasks |
| freeze | Compare freeze pane settings | Spreadsheet formatting |
| exact_match | Byte-for-byte comparison | Exact output tasks |
| structural | Compare document structure | Document editing |

Pivot Table Rule

{
  "type": "pivot_table",
  "sheet_idx0": "RNPivotData",
  "sheet_idx1": "ENPivotData",
  "pivot_props": ["col_fields", "row_fields", "data_fields"]
}

Freeze Rule

{
  "type": "freeze",
  "sheet_idx0": "RNPivotData",
  "sheet_idx1": "ENPivotData"
}
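A small pre-flight check for options.rules might look like the sketch below. It assumes the four rule types listed on this page are the only valid ones, and it checks sheet_idx0 and sheet_idx1 only for the two rules whose fields are shown here; the fields needed by exact_match and structural are not documented on this page, so they are left unchecked.

```python
KNOWN_RULE_TYPES = {"pivot_table", "freeze", "exact_match", "structural"}
# Rules whose examples on this page include both sheet references.
SHEET_RULES = {"pivot_table", "freeze"}


def check_rules(rules):
    """Return a list of problems found in an options.rules array."""
    problems = []
    for number, rule in enumerate(rules, start=1):
        rule_type = rule.get("type")
        if rule_type not in KNOWN_RULE_TYPES:
            problems.append(f"rule {number}: unknown type {rule_type!r}")
            continue
        if rule_type in SHEET_RULES:
            for key in ("sheet_idx0", "sheet_idx1"):
                if key not in rule:
                    problems.append(f"rule {number} ({rule_type}): missing {key}")
    return problems


rules = [
    {"type": "freeze", "sheet_idx0": "RNPivotData"},
    {"type": "pivot"},
]
for problem in check_rules(rules):
    print(problem)
```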

Evaluator Functions

The func field specifies which evaluator to use. Over 100 functions are available, organized by category:

Spreadsheets & Tables

| Function | Description |
| --- | --- |
| compare_table | Excel/spreadsheet comparison with rules (pivot tables, freeze panes, etc.) |
| compare_csv | CSV file comparison |

Documents (Word/DOCX)

| Function | Description |
| --- | --- |
| compare_docx_files | Full document comparison |
| compare_docx_lines | Line-by-line comparison |
| compare_line_spacing | Spacing verification |
| compare_font_names | Font verification |
| is_first_line_centered | Alignment check |

Presentations (PowerPoint/PPTX)

| Function | Description |
| --- | --- |
| compare_pptx_files | Full presentation comparison |
| check_slide_orientation_Portrait | Orientation check |
| check_image_stretch_and_center | Image positioning |

Images

| Function | Description |
| --- | --- |
| compare_images | Image similarity comparison |
| compare_image_list | Multiple image comparison |
| check_structure_sim | Structural similarity |

General

| Function | Description |
| --- | --- |
| exact_match | Byte-for-byte comparison |
| fuzzy_match | Fuzzy string matching |
| compare_text_file | Text file comparison |
| compare_pdfs | PDF comparison |
| check_json | JSON validation |
| check_file_exists | File existence check |
| infeasible | Mark task as impossible |

VS Code

| Function | Description |
| --- | --- |
| is_extension_installed | Extension check |
| check_json_settings | Settings verification |
| check_python_file_by_test_suite | Run Python tests |

Browser

| Function | Description |
| --- | --- |
| is_expected_bookmarks | Bookmark verification |
| is_expected_tabs | Tab verification |
| is_expected_installed_extensions | Extension check |

Additional functions

Documents: compare_docx_tables, compare_highlighted_text, compare_insert_equation, compare_references, has_page_numbers_in_footers, check_highlighted_words, check_tabstops, contains_page_break, evaluate_alignment, evaluate_spacing
Slides: check_strikethrough, check_transition, check_slide_numbers_color, check_presenter_console_disable, evaluate_presentation_fill_to_rgb_distance
Images: check_image_size, check_image_mirror, check_brightness_decrease_and_structure_sim, check_contrast_increase_and_structure_sim, check_saturation_increase_and_structure_sim
Browser: compare_htmls, compare_archive, is_cookie_deleted, check_history_deleted, is_expected_url_pattern_match, compare_pdf_images
General: check_csv, check_list, check_include_exclude, run_sqlite3, file_contains, diff_text_file, literal_match, is_in_list
VS Code: compare_config, compare_zip_files, check_json_keybindings, check_python_file_by_gold_file
Media: compare_audios, compare_videos, is_vlc_playing, is_vlc_fullscreen
Other: compare_epub, check_mp3_meta, check_pdf_pages, check_thunderbird_prefs

Response Structure

{
  "status": "success",
  "message": "Verifier passed",
  "vm_id": "vm-abc123",
  "task_id": "task-001",
  "evaluation_score": 1
}

Score Interpretation

| Score | Meaning |
| --- | --- |
| 1.0 | Task completed successfully |
| 0.0 | Task not completed or failed checks |

Failed Example

{
  "status": "success",
  "message": "Rule 1 (pivot_table): Pivot tables differ between RNPivotData and ENPivotData",
  "vm_id": "vm-abc123",
  "task_id": "task-001",
  "evaluation_score": 0
}
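Both example responses carry "status": "success", so a client should look at evaluation_score rather than status to decide whether the task passed. A minimal sketch of that check follows; the function name is hypothetical, and it treats any score other than 1 as a failure.

```python
def interpret_response(response):
    """Return (passed, message) for a /run_evaluator response body."""
    score = float(response["evaluation_score"])
    # A score of 1.0 means the task completed successfully.
    return score == 1.0, response.get("message", "")


passed_response = {
    "status": "success",
    "message": "Verifier passed",
    "vm_id": "vm-abc123",
    "task_id": "task-001",
    "evaluation_score": 1,
}
failed_response = {
    "status": "success",
    "message": "Rule 1 (pivot_table): Pivot tables differ between RNPivotData and ENPivotData",
    "vm_id": "vm-abc123",
    "task_id": "task-001",
    "evaluation_score": 0,
}

for response in (passed_response, failed_response):
    passed, message = interpret_response(response)
    print("PASS" if passed else "FAIL", "-", message)
```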

Complete Example

Excel pivot table verification:
{
  "evaluator": {
    "func": "compare_table",
    "expected": {
      "type": "cloud_file",
      "path": "https://temp-cua.s3.us-west-2.amazonaws.com/gold.xlsx",
      "dest": "historical_unemployment-gold.xlsx"
    },
    "result": {
      "type": "vm_file",
      "path": "/Users/user/Desktop/historical_unemployment.xlsx",
      "dest": "historical_unemployment.xlsx"
    },
    "postconfig": [
      {
        "type": "execute",
        "parameters": {
          "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
        }
      },
      { "type": "sleep", "parameters": { "seconds": 1 } },
      {
        "type": "execute",
        "parameters": {
          "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
        }
      }
    ],
    "options": {
      "rules": [
        {
          "type": "pivot_table",
          "sheet_idx0": "RNPivotData",
          "sheet_idx1": "ENPivotData",
          "pivot_props": ["col_fields", "row_fields", "data_fields"]
        },
        {
          "type": "freeze",
          "sheet_idx0": "RNPivotData",
          "sheet_idx1": "ENPivotData"
        }
      ]
    }
  }
}

Best Practices

Many applications don’t auto-save. Use postconfig to trigger save before evaluation.
Use sleep actions between commands to allow the UI to settle (typically 0.5 to 1 second).
For spreadsheets, define exactly which aspects to compare rather than using exact match.
Verify your expected output is correct before using it for automated evaluation.
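The sleep recommendation can be checked mechanically. The sketch below flags every execute action in a postconfig array that is not immediately followed by a sleep. The function name is hypothetical, and "immediately followed" is a deliberately strict reading of the guideline, so treat the output as a hint rather than an error.

```python
def executes_without_sleep(postconfig):
    """Return indices of execute actions not immediately followed by a sleep."""
    flagged = []
    for index, action in enumerate(postconfig):
        if action.get("type") != "execute":
            continue
        following = postconfig[index + 1] if index + 1 < len(postconfig) else None
        if following is None or following.get("type") != "sleep":
            flagged.append(index)
    return flagged


# Postconfig from the complete example: activate, sleep, save.
complete_example = [
    {"type": "execute", "parameters": {"command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]}},
    {"type": "sleep", "parameters": {"seconds": 1}},
    {"type": "execute", "parameters": {"command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]}},
]
print(executes_without_sleep(complete_example))
```

Here the final save command is flagged because nothing waits for the save to finish before evaluation starts; appending a short sleep, as in the common postconfig pattern, clears the warning.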
