Explore live evaluator examples at gym.scale.com.
Access
| Method | Endpoint |
|---|---|
| API | POST /run_evaluator |
| Web UI | “Run Evaluator” button |
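As a rough illustration of the API route, the sketch below builds (but does not send) a POST request for /run_evaluator using Python's standard library. The base URL, the JSON body, and the task_id key are placeholders assumed for illustration; the real host, authentication, and payload shape are not specified here.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the host of your own deployment.
BASE_URL = "https://example.invalid"


def build_run_evaluator_request(payload: dict) -> urllib.request.Request:
    """Build a POST request for /run_evaluator without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/run_evaluator",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The payload key below is an assumption, not a documented field.
request = build_run_evaluator_request({"task_id": "example-task"})
print(request.get_method(), request.full_url)
```

Once the real endpoint details are known, the request could be sent by passing it to urllib.request.urlopen.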
Evaluator Structure
The evaluator is defined in the task_config.evaluator object.
Core Fields
| Field | Type | Description |
|---|---|---|
| func | string | Evaluation function to use |
| expected | object | Gold standard file configuration |
| result | object | Result file from the VM to compare |
| postconfig | array | Actions to run before evaluation |
| options | object | Comparison rules and settings |
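To make the table concrete, here is a minimal sketch of an evaluator object written as a Python dict, with a small check that the five core fields have the listed types. Only the top-level field names and types come from the table; the nested keys (type, path, rules), all values, and the choice to treat every field as required are illustrative assumptions.

```python
# Expected Python type for each core field, per the table above.
CORE_FIELD_TYPES = {
    "func": str,
    "expected": dict,
    "result": dict,
    "postconfig": list,
    "options": dict,
}


def check_core_fields(evaluator: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for name, expected_type in CORE_FIELD_TYPES.items():
        if name not in evaluator:
            problems.append(f"missing field: {name}")
        elif not isinstance(evaluator[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


# Nested keys and values here are assumptions made for illustration only.
evaluator = {
    "func": "compare_table",
    "expected": {"type": "cloud_file", "path": "https://example.invalid/gold.xlsx"},
    "result": {"type": "vm_file", "path": "/tmp/result.xlsx"},
    "postconfig": [],
    "options": {"rules": []},
}

print(check_core_fields(evaluator))
```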
File Types
Cloud File (Gold Standard)
Files hosted at a URL, typically your expected output.
VM File (Result)
Files on the VM filesystem, the agent’s output.
Postconfig Actions
Actions executed before evaluation to prepare VM state (e.g., save files).
Execute
Run a shell command.
Sleep
Wait for a specified number of seconds.
Common Postconfig Pattern
Save the file before evaluation (macOS Excel example).
Comparison Rules
Rules define how files are compared. They are specified in options.rules.
Available Rule Types
| Rule Type | Description | Use Case |
|---|---|---|
| pivot_table | Compare pivot table structure | Excel pivot tasks |
| freeze | Compare freeze pane settings | Spreadsheet formatting |
| exact_match | Byte-for-byte comparison | Exact output tasks |
| structural | Compare document structure | Document editing |
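A small sketch of how a rules list for options.rules might be assembled and checked against the rule types in the table. Only the four rule type names come from the table; the type key and the sheet parameter are hypothetical and stand in for whatever settings each rule actually accepts.

```python
# Rule type names taken from the table above.
KNOWN_RULE_TYPES = {"pivot_table", "freeze", "exact_match", "structural"}


def make_rule(rule_type: str, **params) -> dict:
    """Build one rule dict, rejecting rule types not listed in the table."""
    if rule_type not in KNOWN_RULE_TYPES:
        raise ValueError(f"unknown rule type: {rule_type}")
    # The "type" key name is an assumption about the rule schema.
    return {"type": rule_type, **params}


# "sheet" is a hypothetical parameter used only to show extra settings.
options = {
    "rules": [
        make_rule("pivot_table", sheet=0),
        make_rule("freeze", sheet=0),
    ]
}

print(options)
```

Listing one rule per aspect keeps the comparison focused on what the task actually asks for, rather than on every byte of the file.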
Pivot Table Rule
Freeze Rule
Evaluator Functions
The func field specifies which evaluator to use. Over 100 functions are available, organized by category:
Spreadsheets & Tables
| Function | Description |
|---|---|
| compare_table | Excel/spreadsheet comparison with rules (pivot tables, freeze panes, etc.) |
| compare_csv | CSV file comparison |
Documents (Word/DOCX)
| Function | Description |
|---|---|
| compare_docx_files | Full document comparison |
| compare_docx_lines | Line-by-line comparison |
| compare_line_spacing | Spacing verification |
| compare_font_names | Font verification |
| is_first_line_centered | Alignment check |
Presentations (PowerPoint/PPTX)
| Function | Description |
|---|---|
| compare_pptx_files | Full presentation comparison |
| check_slide_orientation_Portrait | Orientation check |
| check_image_stretch_and_center | Image positioning |
Images
| Function | Description |
|---|---|
| compare_images | Image similarity comparison |
| compare_image_list | Multiple image comparison |
| check_structure_sim | Structural similarity |
General
| Function | Description |
|---|---|
| exact_match | Byte-for-byte comparison |
| fuzzy_match | Fuzzy string matching |
| compare_text_file | Text file comparison |
| compare_pdfs | PDF comparison |
| check_json | JSON validation |
| check_file_exists | File existence check |
| infeasible | Mark task as impossible |
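To illustrate the difference between the first two functions in this table, the sketch below gives stand-in versions: a byte-for-byte file comparison and a string similarity ratio. These are not the platform's implementations. The fuzzy version uses difflib from the Python standard library as a substitute, and both function names and signatures are assumptions.

```python
import difflib
from pathlib import Path


def exact_match_sketch(result_path: str, expected_path: str) -> float:
    """Return 1.0 if the two files have identical bytes, else 0.0."""
    same = Path(result_path).read_bytes() == Path(expected_path).read_bytes()
    return 1.0 if same else 0.0


def fuzzy_match_sketch(result: str, expected: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 using difflib."""
    return difflib.SequenceMatcher(None, result, expected).ratio()


print(fuzzy_match_sketch("Quarterly report", "Quarterly reports"))
```

A real evaluator would presumably map such a ratio onto a pass or fail score through a threshold; that step is left out because the threshold is not documented here.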
VS Code
| Function | Description |
|---|---|
| is_extension_installed | Extension check |
| check_json_settings | Settings verification |
| check_python_file_by_test_suite | Run Python tests |
Browser
| Function | Description |
|---|---|
| is_expected_bookmarks | Bookmark verification |
| is_expected_tabs | Tab verification |
| is_expected_installed_extensions | Extension check |
View all 100+ functions
Documents: compare_docx_tables, compare_highlighted_text, compare_insert_equation, compare_references, has_page_numbers_in_footers, check_highlighted_words, check_tabstops, contains_page_break, evaluate_alignment, evaluate_spacing
Slides: check_strikethrough, check_transition, check_slide_numbers_color, check_presenter_console_disable, evaluate_presentation_fill_to_rgb_distance
Images: check_image_size, check_image_mirror, check_brightness_decrease_and_structure_sim, check_contrast_increase_and_structure_sim, check_saturation_increase_and_structure_sim
Browser: compare_htmls, compare_archive, is_cookie_deleted, check_history_deleted, is_expected_url_pattern_match, compare_pdf_images
General: check_csv, check_list, check_include_exclude, run_sqlite3, file_contains, diff_text_file, literal_match, is_in_list
VS Code: compare_config, compare_zip_files, check_json_keybindings, check_python_file_by_gold_file
Media: compare_audios, compare_videos, is_vlc_playing, is_vlc_fullscreen
Other: compare_epub, check_mp3_meta, check_pdf_pages, check_thunderbird_prefs
Response Structure
Score Interpretation
| Score | Meaning |
|---|---|
| 1.0 | Task completed successfully |
| 0.0 | Task not completed or failed checks |
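A minimal sketch of applying this table in client code. The response dict and its score key are assumptions, since the response structure is not reproduced here. How intermediate values should be treated is also not stated, so the sketch simply counts only 1.0 as a pass.

```python
def task_passed(response: dict) -> bool:
    """Treat only a score of exactly 1.0 as success, per the table above."""
    # "score" is an assumed key name for the evaluator's numeric result.
    return float(response.get("score", 0.0)) == 1.0


print(task_passed({"score": 1.0}))
print(task_passed({"score": 0.0}))
```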
Failed Example
Complete Example
Excel pivot table verification.
Best Practices
Always save files before evaluation
Many applications don’t auto-save. Use postconfig to trigger a save before evaluation.
Add delays for UI stability
Use sleep actions between commands to allow the UI to settle (typically 0.5 to 1 second).
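The two practices above can be combined into one postconfig list: a save command followed by a short pause. In this sketch the action schema (type, parameters, command, seconds), the helper names, and the AppleScript command are assumptions for illustration, not a confirmed format.

```python
import json


def execute_action(command: list[str]) -> dict:
    """Build a hypothetical 'execute' action that runs a shell command."""
    return {"type": "execute", "parameters": {"command": command}}


def sleep_action(seconds: float) -> dict:
    """Build a hypothetical 'sleep' action that waits before the next step."""
    return {"type": "sleep", "parameters": {"seconds": seconds}}


# Illustrative macOS command asking Excel to save the active workbook.
SAVE_EXCEL = [
    "osascript",
    "-e",
    'tell application "Microsoft Excel" to save active workbook',
]

postconfig = [
    execute_action(SAVE_EXCEL),
    sleep_action(1.0),  # give the UI time to settle after saving
]

print(json.dumps(postconfig, indent=2))
```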
Use specific rules for complex comparisons
For spreadsheets, define exactly which aspects to compare rather than using exact match.
Test gold standards manually first
Verify your expected output is correct before using it for automated evaluation.