Environment Comparison - Scale Gymnasium

Scale Gymnasium offers three distinct environment types, each designed for different agent capabilities. This page helps you choose the right environment for your evaluation needs.

At a Glance

Aspect	Website	Desktop	MCP
Agent Type	GUI agents	Computer-use agents (CUA)	Tool-use agents
Interaction	Web UI navigation	Screenshots + commands	Function calls
State	SQLite database per session	VM filesystem + apps	SQLite database per server
Verifier Access	API endpoint	Web UI only	Web UI only
Scaling	High (containers)	Medium (VMs)	High (containers)

Website Environments

Best for: GUI-based agents that navigate web interfaces

What’s Available

Application	Description
📅 Calendr	Calendar application
📁 Cloudfile	Cloud storage application
🛍️ Shopora	E-commerce platform
📧 Pandora’s Inbox	Email application

Key Characteristics

Feature	Description
Interaction Mode	Web UI navigation, form filling, clicking
State Management	SQLite database per session
Logging	Automatic capture of all UI interactions
Verification	State checks, log checks, rubric checks
Data Packs	Multiple pre-configured scenarios
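
To make the per-session state model and the idea of a state check concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not Scale Gymnasium's implementation: the `events` table, the `new_session_db` and `state_check` helpers, and the sample data are all assumptions made for illustration.

```python
import sqlite3

def new_session_db(seed_events):
    """Create an isolated in-memory database for one session and load seed data."""
    conn = sqlite3.connect(":memory:")  # each connection gets its own private database
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT, day TEXT)")
    conn.executemany("INSERT INTO events (title, day) VALUES (?, ?)", seed_events)
    conn.commit()
    return conn

def state_check(conn, title, day):
    """Return True if an event with this title exists on the given day."""
    row = conn.execute(
        "SELECT COUNT(*) FROM events WHERE title = ? AND day = ?", (title, day)
    ).fetchone()
    return row[0] > 0

# Hypothetical data pack: both sessions start from the same seed rows.
data_pack = [("Standup", "2025-01-06")]
session_a = new_session_db(data_pack)
session_b = new_session_db(data_pack)

# Simulate an agent creating an event in session A only.
session_a.execute(
    "INSERT INTO events (title, day) VALUES (?, ?)", ("Dentist", "2025-01-07")
)
session_a.commit()

print(state_check(session_a, "Dentist", "2025-01-07"))  # True
print(state_check(session_b, "Dentist", "2025-01-07"))  # False
```

A state check of this kind inspects the final database contents rather than the path the agent took, which is why it pairs naturally with log checks and rubric checks.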

Ideal For

Testing agents that interact with web applications
Evaluating form-filling and navigation capabilities
Multi-step workflow automation
E-commerce and productivity app testing

Explore Website Environments →

Desktop Environments

Best for: Computer-use agents that control full operating systems

What’s Available

Platform	Description
🪟 Windows	Windows desktop environment
🐧 Ubuntu	Linux desktop environment
🍎 macOS	Mac desktop environment

Key Characteristics

Feature	Description
Interaction Mode	Screenshots, mouse/keyboard, shell commands
State Management	Full VM filesystem and applications
Observation	Screenshot capture, accessibility tree
Verification	Gold standard comparison via Web UI
Initialization	Script-based task setup
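
The interaction and observation modes above can be pictured as a small set of typed actions and an observation record. The sketch below is purely illustrative: the class names, fields, accessibility-tree layout, and the toy `choose_action` policy are assumptions, not part of any documented Scale Gymnasium interface.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:
    x: int
    y: int

@dataclass
class TypeText:
    text: str

@dataclass
class Shell:
    command: str

# An agent step produces exactly one of these (hypothetical) actions.
Action = Union[Click, TypeText, Shell]

@dataclass
class Observation:
    screenshot_png: bytes      # raw screenshot bytes
    accessibility_tree: dict   # assumed shape: {"children": [{"role", "name", "x", "y"}, ...]}

def choose_action(obs: Observation) -> Action:
    """Toy policy: click the first button found, otherwise fall back to a shell command."""
    for node in obs.accessibility_tree.get("children", []):
        if node.get("role") == "button":
            return Click(node["x"], node["y"])
    return Shell("ls ~/Documents")

obs = Observation(
    screenshot_png=b"",
    accessibility_tree={
        "children": [{"role": "button", "name": "Save", "x": 120, "y": 48}]
    },
)
print(choose_action(obs))  # Click(x=120, y=48)
```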

Ideal For

Testing agents that use native desktop applications
Evaluating visual understanding and navigation
Cross-application workflows
Office productivity automation

Explore Desktop Environments →

MCP Environment

Best for: Tool-use agents that call functions and APIs

What’s Available

45+ MCP servers covering productivity, business, communication, e-commerce, and more
300+ tools for comprehensive capability testing

Server Categories

Category	Examples
Productivity	Calendar, Reminders, Notes
Business	CRM, QuickBooks, Invoicing
Communication	Email, Slack, SMS
E-commerce	Shopping, Orders, Inventory
Development	GitHub, Jira, Documentation

Key Characteristics

Feature	Description
Interaction Mode	Tool calls with structured parameters
State Management	SQLite database per server
Observation	Show Data endpoint for state inspection
Verification	LLM Judge evaluation via Web UI
Protocol	Model Context Protocol (MCP)
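
For readers new to MCP, a tool call with structured parameters is a JSON-RPC 2.0 request whose method is `tools/call`, carrying the tool name and an `arguments` object. The sketch below only builds that message. The `create_event` tool and its parameters are hypothetical and are not taken from the server catalog.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing JSON-RPC request ids

def build_tool_call(tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# `create_event` and its parameters are hypothetical, for illustration only.
request = build_tool_call(
    "create_event",
    {"title": "Design review", "start": "2025-01-07T10:00:00Z", "duration_minutes": 30},
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library would send this message over the server's transport and handle the response; the sketch stops at message construction.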

Ideal For

Testing agents that use function calling
Evaluating multi-tool workflows
API-based automation
Business process automation

Explore MCP Environment →

Decision Guide

By Agent Capability

If your agent…	Use…
Navigates web pages and fills forms	Website Environments
Uses screenshots and mouse/keyboard	Desktop Environments
Calls tools and functions	MCP Environment
Does more than one of the above	Test in each matching environment type
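
The capability table can be expressed as a small lookup, which is handy when choosing environments in an evaluation script. The capability keys and the `environments_for` helper below are illustrative names, not part of any Scale Gymnasium API.

```python
# Hypothetical capability keys mapped to the environment types named in the table.
CAPABILITY_TO_ENVIRONMENT = {
    "web_navigation": "Website Environments",
    "screenshots_and_input": "Desktop Environments",
    "tool_calls": "MCP Environment",
}

def environments_for(capabilities):
    """Return the environment types to test, in table order, for an agent's capabilities."""
    unknown = set(capabilities) - set(CAPABILITY_TO_ENVIRONMENT)
    if unknown:
        raise ValueError(f"Unknown capabilities: {sorted(unknown)}")
    return [env for cap, env in CAPABILITY_TO_ENVIRONMENT.items() if cap in capabilities]

print(environments_for({"tool_calls"}))
# ['MCP Environment']
print(environments_for({"web_navigation", "tool_calls"}))
# ['Website Environments', 'MCP Environment']
```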

By Task Type

If your task involves…	Use…
Web application workflows	Website Environments
Native desktop applications	Desktop Environments
API integrations	MCP Environment
Document editing	Desktop Environments
E-commerce flows	Website (Shopora) or MCP
Calendar/scheduling	Website (Calendr) or MCP

By Evaluation Needs

If you need…	Use…
High parallelism	Website or MCP (containers)
Visual testing	Desktop (screenshots)
Programmatic verification	Website (API verifier)
Real application testing	Desktop (full apps)
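
The parallelism and verification rows follow directly from the At a Glance table. As a rough sketch, the same filtering can be done in code. The `EnvironmentProfile` class and `select` function are hypothetical; only the field values come from this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvironmentProfile:
    name: str
    runtime: str          # "containers" or "VMs"
    scaling: str          # "High" or "Medium"
    verifier_access: str  # "API endpoint" or "Web UI only"

# Values copied from the At a Glance table.
PROFILES = [
    EnvironmentProfile("Website", "containers", "High", "API endpoint"),
    EnvironmentProfile("Desktop", "VMs", "Medium", "Web UI only"),
    EnvironmentProfile("MCP", "containers", "High", "Web UI only"),
]

def select(high_parallelism=False, programmatic_verification=False):
    """Return names of environment types that satisfy the requested needs."""
    result = []
    for profile in PROFILES:
        if high_parallelism and profile.scaling != "High":
            continue
        if programmatic_verification and profile.verifier_access != "API endpoint":
            continue
        result.append(profile.name)
    return result

print(select(high_parallelism=True))           # ['Website', 'MCP']
print(select(programmatic_verification=True))  # ['Website']
```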

Combining Environments

Some evaluations may benefit from multiple environment types:

Scenario	Environments
Full-stack agent testing	All three
Comparing UI vs API approaches	Website + MCP
Cross-platform testing	Multiple Desktop platforms

Common Design Principles

All environments share these architectural properties:

Principle	Description
Session Isolation	Sessions cannot access each other’s data; `sessionId` scopes all operations
Clean Slate Reset	Full reset returns to known initial state
Reproducibility	Data packs ensure consistent starting conditions
Observability	Complete audit trail of interactions
Consistent APIs	Reset, query, verify, and health patterns across all types
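
Session isolation and clean-slate reset are easiest to see in a toy model. The sketch below keeps state in memory, keyed by session ID, and mirrors the reset, query, verify, and health patterns named above. The `SessionStore` class and its method signatures are assumptions for illustration, not the actual service API.

```python
import copy

class SessionStore:
    """In-memory model of session-scoped state with clean-slate reset."""

    def __init__(self, data_pack):
        self._data_pack = data_pack  # known initial state shared by all sessions
        self._sessions = {}          # session_id -> private copy of the state

    def reset(self, session_id):
        """Return the session to the initial state defined by the data pack."""
        self._sessions[session_id] = copy.deepcopy(self._data_pack)

    def query(self, session_id):
        """Return the state for one session; other sessions are never visible."""
        return self._sessions[session_id]

    def verify(self, session_id, predicate):
        """Apply a caller-supplied check to the session's state."""
        return bool(predicate(self._sessions[session_id]))

    def health(self):
        return {"status": "ok", "sessions": len(self._sessions)}

store = SessionStore({"files": ["readme.txt"]})
store.reset("session-a")
store.reset("session-b")

# A change made in session A is not visible in session B.
store.query("session-a")["files"].append("report.pdf")
print(store.query("session-b")["files"])  # ['readme.txt']
print(store.verify("session-a", lambda s: "report.pdf" in s["files"]))  # True

# Reset restores the known initial state.
store.reset("session-a")
print(store.query("session-a")["files"])  # ['readme.txt']
```

The deep copy on reset is what keeps the data pack itself untouched, so every session starts from the same reproducible conditions.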

Next Steps

Website Environments

Explore web applications

Desktop Environments

Explore VM platforms

MCP Environment

Explore MCP servers
