Scale Gymnasium offers three distinct environment types, each designed for different agent capabilities. This page helps you choose the right environment for your evaluation needs.

At a Glance

| Aspect | Website | Desktop | MCP |
| --- | --- | --- | --- |
| Agent Type | GUI agents | Computer-use agents (CUA) | Tool-use agents |
| Interaction | Web UI navigation | Screenshots + commands | Function calls |
| State | Database | VM filesystem + apps | Server databases |
| Verifier Access | API endpoint | Web UI only | Web UI only |
| Scaling | High (containers) | Medium (VMs) | High (containers) |

Website Environments

Best for: GUI-based agents that navigate web interfaces

What’s Available

| Application | Description |
| --- | --- |
| 📅 Calendr | Calendar application |
| 📁 Cloudfile | Cloud storage application |
| 🛍️ Shopora | E-commerce platform |
| 📧 Pandora’s Inbox | Email application |

Key Characteristics

| Feature | Description |
| --- | --- |
| Interaction Mode | Web UI navigation, form filling, clicking |
| State Management | SQLite database per session |
| Logging | Automatic capture of all UI interactions |
| Verification | State checks, log checks, rubric checks |
| Data Packs | Multiple pre-configured scenarios |
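
To make the idea of a state check concrete, here is a minimal sketch that inspects a per-session SQLite database after the agent has finished. The `events` table, its columns, and the helper names are hypothetical and chosen only for illustration; this page does not describe the actual Calendr schema or the verifier API.

```python
import sqlite3


def make_demo_session() -> sqlite3.Connection:
    """Build an in-memory stand-in for one session's database.

    The `events` table is a hypothetical schema, not the real one.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (title TEXT, start_date TEXT)")
    return conn


def state_check(conn: sqlite3.Connection, title: str, start_date: str) -> bool:
    """Return True if exactly one event with this title and date exists."""
    row = conn.execute(
        "SELECT COUNT(*) FROM events WHERE title = ? AND start_date = ?",
        (title, start_date),
    ).fetchone()
    return row[0] == 1
```

A check like this passes only when the final database state matches the task goal, regardless of which sequence of clicks the agent used to get there.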

Ideal For

  • Testing agents that interact with web applications
  • Evaluating form-filling and navigation capabilities
  • Multi-step workflow automation
  • E-commerce and productivity app testing

Desktop Environments

Best for: Computer-use agents that control full operating systems

What’s Available

| Platform | Description |
| --- | --- |
| 🪟 Windows | Windows desktop environment |
| 🐧 Ubuntu | Linux desktop environment |
| 🍎 macOS | Mac desktop environment |

Key Characteristics

| Feature | Description |
| --- | --- |
| Interaction Mode | Screenshots, mouse/keyboard, shell commands |
| State Management | Full VM filesystem and applications |
| Observation | Screenshot capture, accessibility tree |
| Verification | Gold standard comparison via Web UI |
| Initialization | Script-based task setup |

Ideal For

  • Testing agents that use native desktop applications
  • Evaluating visual understanding and navigation
  • Cross-application workflows
  • Office productivity automation

MCP Environment

Best for: Tool-use agents that call functions and APIs

What’s Available

  • 45+ MCP servers covering productivity, business, communication, e-commerce, and more
  • 300+ tools for comprehensive capability testing

Server Categories

| Category | Examples |
| --- | --- |
| Productivity | Calendar, Reminders, Notes |
| Business | CRM, QuickBooks, Invoicing |
| Communication | Email, Slack, SMS |
| E-commerce | Shopping, Orders, Inventory |
| Development | GitHub, Jira, Documentation |

Key Characteristics

| Feature | Description |
| --- | --- |
| Interaction Mode | Tool calls with structured parameters |
| State Management | SQLite database per server |
| Observation | Show Data endpoint for state inspection |
| Verification | LLM Judge evaluation via Web UI |
| Protocol | Model Context Protocol (MCP) |
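
For readers unfamiliar with what "tool calls with structured parameters" looks like on the wire, the sketch below builds an MCP `tools/call` request, which the MCP specification defines as a JSON-RPC 2.0 message. The tool name `create_event` and its arguments are hypothetical and not taken from any of the servers listed here; transport and session handling are left out.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool_name: str, arguments: dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# `create_event` and its parameters are hypothetical, for illustration only.
request = build_tool_call(
    1,
    "create_event",
    {"title": "Team sync", "start": "2025-01-15T10:00:00Z", "duration_minutes": 30},
)
```

The agent never touches a UI in this mode: each action is one such message, and the server's reply carries the tool result.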

Ideal For

  • Testing agents that use function calling
  • Evaluating multi-tool workflows
  • API-based automation
  • Business process automation

Decision Guide

By Agent Capability

| If your agent… | Use… |
| --- | --- |
| Navigates web pages and fills forms | Website Environments |
| Uses screenshots and mouse/keyboard | Desktop Environments |
| Calls tools and functions | MCP Environment |
| Does more than one of the above | Each matching environment type |
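
The capability table can also be read as a simple lookup. The sketch below encodes it that way; the `Capability` enum and `recommend_environments` function are illustrative names, not part of any Scale Gymnasium API.

```python
from enum import Enum


class Capability(Enum):
    """Agent capabilities, in the same order as the table rows."""
    WEB_NAVIGATION = "navigates web pages and fills forms"
    SCREEN_CONTROL = "uses screenshots and mouse/keyboard"
    TOOL_CALLING = "calls tools and functions"


ENVIRONMENT_FOR = {
    Capability.WEB_NAVIGATION: "Website Environments",
    Capability.SCREEN_CONTROL: "Desktop Environments",
    Capability.TOOL_CALLING: "MCP Environment",
}


def recommend_environments(capabilities: set[Capability]) -> list[str]:
    """Return one environment per capability, in the table's row order."""
    return [ENVIRONMENT_FOR[c] for c in Capability if c in capabilities]
```

An agent with several capabilities gets several recommendations, which matches the last row of the table.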

By Task Type

| If your task involves… | Use… |
| --- | --- |
| Web application workflows | Website Environments |
| Native desktop applications | Desktop Environments |
| API integrations | MCP Environment |
| Document editing | Desktop Environments |
| E-commerce flows | Website (Shopora) or MCP |
| Calendar/scheduling | Website (Calendr) or MCP |

By Evaluation Needs

| If you need… | Use… |
| --- | --- |
| High parallelism | Website or MCP (containers) |
| Visual testing | Desktop (screenshots) |
| Programmatic verification | Website (API verifier) |
| Real application testing | Desktop (full apps) |

Combining Environments

Some evaluations may benefit from multiple environment types:

| Scenario | Environments |
| --- | --- |
| Full-stack agent testing | All three |
| Comparing UI vs API approaches | Website + MCP |
| Cross-platform testing | Multiple Desktop platforms |

Common Design Principles

All environments share these architectural properties:

| Principle | Description |
| --- | --- |
| Session Isolation | Sessions cannot access each other’s data; sessionId scopes all operations |
| Clean Slate Reset | Full reset returns to known initial state |
| Reproducibility | Data packs ensure consistent starting conditions |
| Observability | Complete audit trail of interactions |
| Consistent APIs | Reset, query, verify, and health patterns across all types |
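
One way to picture how these principles fit together is a small in-memory stand-in. The sketch below is not the real API: the class name, method names, and signatures are assumptions, since this page names the reset, query, verify, and health patterns without specifying them. It shows session-scoped state and a reset that restores the data pack.

```python
import copy
from typing import Any, Callable

State = dict[str, Any]


class FakeEnvironment:
    """In-memory stand-in for the reset/query/verify/health pattern.

    All method names and signatures here are hypothetical.
    """

    def __init__(self, data_pack: State) -> None:
        self._data_pack = data_pack            # known initial state
        self._sessions: dict[str, State] = {}  # state keyed by session ID

    def health(self) -> bool:
        return True

    def reset(self, session_id: str) -> None:
        """Clean-slate reset: restore this session to the data pack."""
        self._sessions[session_id] = copy.deepcopy(self._data_pack)

    def query(self, session_id: str) -> State:
        """Return a copy of the state scoped to one session."""
        return copy.deepcopy(self._sessions[session_id])

    def apply(self, session_id: str, key: str, value: Any) -> None:
        """Stand-in for an agent action that mutates session state."""
        self._sessions[session_id][key] = value

    def verify(self, session_id: str, check: Callable[[State], bool]) -> bool:
        """Run a caller-supplied check against one session's state."""
        return check(self._sessions[session_id])


env = FakeEnvironment({"inbox_count": 3})
env.reset("session-a")
env.reset("session-b")
env.apply("session-a", "inbox_count", 4)  # only session-a changes
```

Because every operation takes a session ID, a change in `session-a` is invisible from `session-b`, and calling `reset` again returns a session to the same starting conditions.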

Next Steps