Run Evaluator - Scale Gymnasium

POST /run_evaluator

Runs the evaluator to check whether a task was completed successfully. The evaluator compares the current VM state against the expected outcome defined in the task configuration and produces a score from 0.0 to 1.0.

Explore live examples of evaluator configurations at gym.scale.com.

Request Body

Field	Type	Required	Description
`vm_id`	string	Yes	VM identifier
`task_config`	object	Yes	Task configuration with evaluator definition
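
As a rough illustration of the request body, the sketch below builds the POST request in Python using only the standard library. The function names, the `base_url` argument, and the timeout are assumptions made for this example and are not part of the API; only the `/run_evaluator` path, the JSON content type, and the `vm_id` and `task_config` fields come from this page.

```python
import json
import urllib.request


def build_run_evaluator_request(base_url, vm_id, task_config):
    """Build (but do not send) a POST request for /run_evaluator.

    Hypothetical helper: base_url stands in for http://CONTROL_PLANE_IP:PORT.
    """
    if not vm_id:
        raise ValueError("vm_id is required")
    if not isinstance(task_config, dict) or not task_config:
        raise ValueError("task_config is required and must be a non-empty object")

    body = json.dumps({"vm_id": vm_id, "task_config": task_config}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/run_evaluator",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_evaluator(base_url, vm_id, task_config, timeout=60.0):
    """Send the request and decode the JSON response (timeout is an assumption)."""
    request = build_run_evaluator_request(base_url, vm_id, task_config)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before it reaches the control plane.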

Evaluator Configuration

The evaluator object within task_config defines how to verify task completion.

Core Fields

Field	Type	Required	Description
`evaluator.func`	string	Yes	Evaluation function to use (e.g., compare_table, file_check, exact_match)
`evaluator.expected`	object	Yes	Expected (gold standard) file configuration
`evaluator.result`	object	Yes	Result file from the VM to compare
`evaluator.postconfig`	array	No	Actions to run before evaluation (e.g., save file, activate app)
`evaluator.options`	object	No	Evaluation options and rules
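
A client can catch malformed evaluator objects before sending them. The sketch below is a hypothetical client-side check based only on the field names, types, and required flags listed above; the server may apply additional validation that is not described on this page.

```python
REQUIRED_EVALUATOR_FIELDS = ("func", "expected", "result")


def check_evaluator(evaluator):
    """Return a list of problems found in an evaluator object (empty if none).

    Hypothetical helper: mirrors the Core Fields listing, nothing more.
    """
    problems = []
    for name in REQUIRED_EVALUATOR_FIELDS:
        if name not in evaluator:
            problems.append(f"missing required field: {name}")

    # Type checks follow the documented types: string, object, object, array, object.
    if "func" in evaluator and not isinstance(evaluator["func"], str):
        problems.append("func must be a string")
    for name in ("expected", "result", "options"):
        if name in evaluator and not isinstance(evaluator[name], dict):
            problems.append(f"{name} must be an object")
    if "postconfig" in evaluator and not isinstance(evaluator["postconfig"], list):
        problems.append("postconfig must be an array")
    return problems
```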

Expected/Result File Types

Type	Description
`cloud_file`	File hosted at a URL (for gold standard)
`vm_file`	File on the VM filesystem (for result)
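
The two file types share the same three keys in the example request: `type`, `path`, and `dest`. The helpers below are a minimal sketch for building them. The function names are made up for illustration, and the reading of `dest` as the filename the file is stored under for comparison is an assumption inferred from the example, not something this page states.

```python
def cloud_file(url, dest):
    """Describe a gold-standard file hosted at a URL."""
    return {"type": "cloud_file", "path": url, "dest": dest}


def vm_file(path, dest):
    """Describe a result file located on the VM filesystem."""
    return {"type": "vm_file", "path": path, "dest": dest}


# Values copied from the example request on this page.
expected = cloud_file(
    "https://temp-cua.s3.us-west-2.amazonaws.com/gold.xlsx",
    "historical_unemployment-gold.xlsx",
)
result = vm_file(
    "/Users/user/Desktop/historical_unemployment.xlsx",
    "historical_unemployment.xlsx",
)
```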

Postconfig Actions

Actions to prepare the VM state before evaluation:

Type	Description
`execute`	Run a shell command
`sleep`	Wait for specified seconds
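
Postconfig lists can be assembled programmatically. The sketch below uses two hypothetical builder functions to reproduce the activate-then-save sequence from the example request; the `parameters.command` and `parameters.seconds` shapes are taken from that example, and the input checks are assumptions added for safety.

```python
def execute(command):
    """Build an execute action; the example passes the command as an argument list."""
    if isinstance(command, str):
        raise TypeError("command should be a list of arguments, not a single string")
    return {"type": "execute", "parameters": {"command": list(command)}}


def sleep(seconds):
    """Build a sleep action that waits for the given number of seconds."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return {"type": "sleep", "parameters": {"seconds": seconds}}


# Bring Excel to the front, wait, press Cmd+S, then wait for the save to finish.
save_in_excel = [
    execute(["osascript", "-e", 'tell application "Microsoft Excel" to activate']),
    sleep(1),
    execute(["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]),
    sleep(0.5),
]
```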

Evaluation Rules

Rules define how files are compared:

Rule Type	Description
`pivot_table`	Compare pivot table structure
`freeze`	Compare freeze pane settings
`exact_match`	Byte-for-byte comparison
`structural`	Compare document structure
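
The `pivot_table` and `freeze` rules appear on this page both with and without sheet identifiers, so the sketch below treats `sheet_idx0` and `sheet_idx1` as optional. The helper names are hypothetical, and whether the server accepts rules without sheet identifiers in every case is an assumption based on the shorter examples on this page.

```python
def pivot_table_rule(pivot_props, sheet_idx0=None, sheet_idx1=None):
    """Build a pivot_table rule comparing the listed pivot properties."""
    rule = {"type": "pivot_table", "pivot_props": list(pivot_props)}
    if sheet_idx0 is not None:
        rule["sheet_idx0"] = sheet_idx0
    if sheet_idx1 is not None:
        rule["sheet_idx1"] = sheet_idx1
    return rule


def freeze_rule(sheet_idx0=None, sheet_idx1=None):
    """Build a freeze rule comparing freeze pane settings."""
    rule = {"type": "freeze"}
    if sheet_idx0 is not None:
        rule["sheet_idx0"] = sheet_idx0
    if sheet_idx1 is not None:
        rule["sheet_idx1"] = sheet_idx1
    return rule


# Sheet identifiers copied from the example request on this page.
options = {
    "rules": [
        pivot_table_rule(
            ["col_fields", "row_fields", "data_fields"],
            sheet_idx0="RNPivotData",
            sheet_idx1="ENPivotData",
        ),
        freeze_rule(sheet_idx0="RNPivotData", sheet_idx1="ENPivotData"),
    ]
}
```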

Response

Field	Type	Description
`status`	string	Result status (success)
`message`	string	Evaluation result message (describes pass/fail reason)
`vm_id`	string	VM identifier
`task_id`	string	Task identifier
`evaluation_score`	number	Score from 0.0 to 1.0 (1.0 = fully completed)

Response structure may be simplified to {"score": number, "message": string} in future versions.
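
Because the response shape may change, client code can read the score defensively. The sketch below accepts both the current `evaluation_score` field and the possible future `score` field. The function names and the default pass threshold of 1.0 are assumptions for this example.

```python
def extract_score(response):
    """Return (score, message) from either the current or the simplified shape."""
    if "evaluation_score" in response:
        raw_score = response["evaluation_score"]
    elif "score" in response:
        raw_score = response["score"]
    else:
        raise KeyError("response contains neither evaluation_score nor score")

    score = float(raw_score)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of documented range: {score}")
    return score, response.get("message", "")


def passed(response, threshold=1.0):
    """Treat a task as passed when its score reaches the (assumed) threshold."""
    score, _ = extract_score(response)
    return score >= threshold
```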

Example: Excel Pivot Table Verification

This example verifies that a pivot table was correctly created in Excel:

curl -X POST "http://CONTROL_PLANE_IP:PORT/run_evaluator" \
  -H "Content-Type: application/json" \
  -d '{
    "vm_id": "vm-abc123",
    "task_config": {
      "id": "task-001",
      "instruction": "Create a pivot table from the unemployment data",
      "evaluator": {
        "func": "compare_table",
        "expected": {
          "type": "cloud_file",
          "path": "https://temp-cua.s3.us-west-2.amazonaws.com/gold.xlsx",
          "dest": "historical_unemployment-gold.xlsx"
        },
        "result": {
          "type": "vm_file",
          "path": "/Users/user/Desktop/historical_unemployment.xlsx",
          "dest": "historical_unemployment.xlsx"
        },
        "postconfig": [
          {
            "type": "execute",
            "parameters": {
              "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
            }
          },
          {
            "type": "sleep",
            "parameters": { "seconds": 1 }
          },
          {
            "type": "execute",
            "parameters": {
              "command": ["python", "-c", "import pyautogui; pyautogui.hotkey(\"command\", \"s\")"]
            }
          },
          {
            "type": "sleep",
            "parameters": { "seconds": 0.5 }
          }
        ],
        "options": {
          "rules": [
            {
              "type": "pivot_table",
              "sheet_idx0": "RNPivotData",
              "sheet_idx1": "ENPivotData",
              "pivot_props": ["col_fields", "row_fields", "data_fields"]
            },
            {
              "type": "freeze",
              "sheet_idx0": "RNPivotData",
              "sheet_idx1": "ENPivotData"
            }
          ]
        }
      }
    }
  }'

{
  "status": "success",
  "message": "Verifier passed",
  "vm_id": "vm-abc123",
  "task_id": "task-001",
  "evaluation_score": 1
}

Evaluator Functions

More than 100 evaluator functions are available. The most common are:

Function	Use Case
`compare_table`	Excel/spreadsheet with rules
`compare_docx_files`	Word documents
`compare_pptx_files`	PowerPoint presentations
`compare_pdfs`	PDF files
`compare_images`	Image similarity
`compare_text_file`	Text files
`exact_match`	Byte-for-byte comparison
`fuzzy_match`	Fuzzy string matching
`check_json`	JSON validation
`check_file_exists`	File existence
`is_extension_installed`	VS Code extensions
`infeasible`	Task cannot be completed

See Desktop Verifiers for the complete list.
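
One way to pick a function is to start from the result file's extension. The lookup below is a hypothetical convenience built from the use cases in the table above; the extension-to-function mapping is an assumption of this example, and the right function always depends on what the task needs to verify.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extension to a function named in the table above.
FUNC_BY_EXTENSION = {
    ".xlsx": "compare_table",
    ".docx": "compare_docx_files",
    ".pptx": "compare_pptx_files",
    ".pdf": "compare_pdfs",
    ".png": "compare_images",
    ".jpg": "compare_images",
    ".txt": "compare_text_file",
}


def suggest_evaluator_func(result_path):
    """Return a candidate evaluator function name, or None if the type is unknown."""
    suffix = PurePosixPath(result_path).suffix.lower()
    return FUNC_BY_EXTENSION.get(suffix)
```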

Best Practices

Use postconfig to save files

Many applications don’t auto-save. Use postconfig to trigger a save before evaluation:

{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
      }
    }
  ]
}

Add delays for UI stability

Use sleep actions between commands to allow the UI to settle:

{
  "type": "sleep",
  "parameters": { "seconds": 1 }
}
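
When a postconfig list is generated in code, the delays can be added automatically. The sketch below inserts a sleep action after every execute action that is not already followed by one. The function name and the default one-second delay are assumptions for this example.

```python
def with_settle_delays(actions, seconds=1.0):
    """Return a new action list with a sleep after each execute that lacks one."""
    padded = []
    for index, action in enumerate(actions):
        padded.append(action)
        next_action = actions[index + 1] if index + 1 < len(actions) else None
        followed_by_sleep = next_action is not None and next_action.get("type") == "sleep"
        if action.get("type") == "execute" and not followed_by_sleep:
            padded.append({"type": "sleep", "parameters": {"seconds": seconds}})
    return padded
```

Existing sleep actions are left as they are, so hand-tuned delays such as a shorter wait after a save are preserved.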

Use specific rules for complex comparisons

For spreadsheets, define which aspects to compare:

{
  "options": {
    "rules": [
      { "type": "pivot_table", "pivot_props": ["col_fields", "row_fields"] },
      { "type": "freeze" }
    ]
  }
}
