POST /run_evaluator
Runs the evaluator to check if a task was completed successfully. Compares the current VM state against the expected outcome defined in the task configuration and produces a score from 0.0 to 1.0.
Explore live examples of evaluator configurations at gym.scale.com.

Request Body

vm_id (string, required): VM identifier
task_config (object, required): Task configuration with evaluator definition
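
As a rough illustration of the request body described above, the following Python sketch assembles the POST request with the standard library only. The helper name build_run_evaluator_request and its base_url argument are hypothetical, not part of the API; the request is built but not sent.

```python
import json
import urllib.request


def build_run_evaluator_request(base_url, vm_id, task_config):
    """Build (but do not send) a POST request for /run_evaluator.

    base_url is an assumption standing in for http://CONTROL_PLANE_IP:PORT.
    """
    if not vm_id:
        raise ValueError("vm_id is required")
    if not isinstance(task_config, dict):
        raise ValueError("task_config is required and must be an object")
    body = json.dumps({"vm_id": vm_id, "task_config": task_config}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/run_evaluator",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen when a control plane is reachable.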

Evaluator Configuration

The evaluator object within task_config defines how to verify task completion.

Core Fields

evaluator.func (string, required): Evaluation function to use (e.g., compare_table, file_check, exact_match)
evaluator.expected (object, required): Expected/gold standard file configuration
evaluator.result (object, required): Result file from the VM to compare
evaluator.postconfig (array): Actions to run before evaluation (e.g., save file, activate app)
evaluator.options (object): Evaluation options and rules
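
A client-side check of the required core fields might look like the sketch below. It only reflects the required markers listed above; the function name is hypothetical, and the server may apply further validation that is not documented here.

```python
# Required fields as marked in the Core Fields list.
REQUIRED_EVALUATOR_FIELDS = ("func", "expected", "result")


def missing_evaluator_fields(evaluator):
    """Return the required evaluator fields that are absent, in listed order."""
    return [name for name in REQUIRED_EVALUATOR_FIELDS if name not in evaluator]
```

An empty list means all three required fields are present; postconfig and options are optional and are not checked.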

Expected/Result File Types

Type         Description
cloud_file   File hosted at a URL (for gold standard)
vm_file      File on the VM filesystem (for result)

Postconfig Actions

Actions to prepare the VM state before evaluation:
Type      Description
execute   Run a shell command
sleep     Wait for specified seconds
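
The two action types can be generated programmatically. The sketch below uses two hypothetical helpers, execute and sleep, that produce the JSON shapes shown in this page's examples, and rebuilds the activate-then-save sequence from the Excel example.

```python
def execute(command):
    """Build an execute action; command is a list of argv strings."""
    return {"type": "execute", "parameters": {"command": list(command)}}


def sleep(seconds):
    """Build a sleep action that waits for the given number of seconds."""
    return {"type": "sleep", "parameters": {"seconds": seconds}}


# Same sequence as the Excel example: bring the app forward, then press Cmd+S.
save_sequence = [
    execute(["osascript", "-e", 'tell application "Microsoft Excel" to activate']),
    sleep(1),
    execute(["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]),
    sleep(0.5),
]
```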

Evaluation Rules

Rules define how files are compared:
Rule Type     Description
pivot_table   Compare pivot table structure
freeze        Compare freeze pane settings
exact_match   Byte-for-byte comparison
structural    Compare document structure
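
To catch typos in rule names before sending a request, a small check like the following may help. It assumes the four rule types in the table above are the ones in use; the service may accept others that are not listed here, so treat a hit as a warning rather than an error. The function name is hypothetical.

```python
# Rule types listed in this section; possibly not exhaustive.
KNOWN_RULE_TYPES = {"pivot_table", "freeze", "exact_match", "structural"}


def unknown_rule_types(options):
    """Return the type of every rule in options["rules"] not listed above."""
    rules = options.get("rules", [])
    return [rule.get("type") for rule in rules if rule.get("type") not in KNOWN_RULE_TYPES]
```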

Response

status (string): Result status (success)
message (string): Evaluation result message (describes pass/fail reason)
vm_id (string): VM identifier
task_id (string): Task identifier
evaluation_score (number): Score from 0.0 to 1.0 (1.0 = fully completed)
Response structure may be simplified to {"score": number, "message": string} in future versions.
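
Because the response shape may change, client code can read the score defensively. This sketch accepts both the current evaluation_score field and the possible future score field; the function name is hypothetical.

```python
def extract_score(response):
    """Return (score, message) from either the current or the simplified shape."""
    if "evaluation_score" in response:
        score = response["evaluation_score"]
    elif "score" in response:
        score = response["score"]
    else:
        raise KeyError("response has no evaluation_score or score field")
    # Normalize to float so an integer 1 and 1.0 compare the same downstream.
    return float(score), response.get("message", "")
```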

Example: Excel Pivot Table Verification

This example verifies that a pivot table was correctly created in Excel:
curl -X POST "http://CONTROL_PLANE_IP:PORT/run_evaluator" \
  -H "Content-Type: application/json" \
  -d '{
    "vm_id": "vm-abc123",
    "task_config": {
      "id": "task-001",
      "instruction": "Create a pivot table from the unemployment data",
      "evaluator": {
        "func": "compare_table",
        "expected": {
          "type": "cloud_file",
          "path": "https://temp-cua.s3.us-west-2.amazonaws.com/gold.xlsx",
          "dest": "historical_unemployment-gold.xlsx"
        },
        "result": {
          "type": "vm_file",
          "path": "/Users/user/Desktop/historical_unemployment.xlsx",
          "dest": "historical_unemployment.xlsx"
        },
        "postconfig": [
          {
            "type": "execute",
            "parameters": {
              "command": ["osascript", "-e", "tell application \"Microsoft Excel\" to activate"]
            }
          },
          {
            "type": "sleep",
            "parameters": { "seconds": 1 }
          },
          {
            "type": "execute",
            "parameters": {
              "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
            }
          },
          {
            "type": "sleep",
            "parameters": { "seconds": 0.5 }
          }
        ],
        "options": {
          "rules": [
            {
              "type": "pivot_table",
              "sheet_idx0": "RNPivotData",
              "sheet_idx1": "ENPivotData",
              "pivot_props": ["col_fields", "row_fields", "data_fields"]
            },
            {
              "type": "freeze",
              "sheet_idx0": "RNPivotData",
              "sheet_idx1": "ENPivotData"
            }
          ]
        }
      }
    }
  }'
Example response:
{
  "status": "success",
  "message": "Verifier passed",
  "vm_id": "vm-abc123",
  "task_id": "task-001",
  "evaluation_score": 1
}

Evaluator Functions

More than 100 evaluator functions are available. The most common are:
Function                 Use Case
compare_table            Excel/spreadsheet with rules
compare_docx_files       Word documents
compare_pptx_files       PowerPoint presentations
compare_pdfs             PDF files
compare_images           Image similarity
compare_text_file        Text files
exact_match              Byte-for-byte comparison
fuzzy_match              Fuzzy string matching
check_json               JSON validation
check_file_exists        File existence
is_extension_installed   VS Code extensions
infeasible               Task cannot be completed
See Desktop Verifiers for the complete list.
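
When generating task configurations in bulk, the function can be picked from the result file's extension. The mapping below is only one reading of the Use Case column above, not an official rule, and both the helper name and the exact_match fallback are assumptions.

```python
import os

# Illustrative mapping inferred from the Use Case column; adjust as needed.
FUNC_BY_EXTENSION = {
    ".xlsx": "compare_table",
    ".docx": "compare_docx_files",
    ".pptx": "compare_pptx_files",
    ".pdf": "compare_pdfs",
    ".txt": "compare_text_file",
}


def suggest_evaluator_func(path, default="exact_match"):
    """Suggest an evaluator function name based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    return FUNC_BY_EXTENSION.get(ext, default)
```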

Best Practices

Many applications don’t auto-save. Use postconfig to trigger save before evaluation:
{
  "postconfig": [
    {
      "type": "execute",
      "parameters": {
        "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('command', 's')"]
      }
    }
  ]
}
Use sleep actions between commands to allow the UI to settle:
{
  "type": "sleep",
  "parameters": { "seconds": 1 }
}
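One way to apply this consistently is to pad a postconfig list automatically. The sketch below, with a hypothetical helper name and an assumed default of 1 second, adds a sleep after every non-sleep action unless one already follows it.

```python
def with_settle_delays(actions, seconds=1):
    """Return a new postconfig list with a sleep after each non-sleep action.

    An action already followed by a sleep is left as is.
    """
    padded = []
    for i, action in enumerate(actions):
        padded.append(action)
        next_is_sleep = i + 1 < len(actions) and actions[i + 1].get("type") == "sleep"
        if action.get("type") != "sleep" and not next_is_sleep:
            padded.append({"type": "sleep", "parameters": {"seconds": seconds}})
    return padded
```

The input list is not modified, so the same base actions can be reused with different delays.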
For spreadsheets, define which aspects to compare:
{
  "options": {
    "rules": [
      { "type": "pivot_table", "pivot_props": ["col_fields", "row_fields"] },
      { "type": "freeze" }
    ]
  }
}