At a Glance
| Aspect | Website | Desktop | MCP |
|---|---|---|---|
| Agent Type | GUI agents | Computer-use agents (CUA) | Tool-use agents |
| Interaction | Web UI navigation | Screenshots + commands | Function calls |
| State | SQLite database per session | VM filesystem + apps | SQLite database per server |
| Verifier Access | API endpoint | Web UI only | Web UI only |
| Scaling | High (containers) | Medium (VMs) | High (containers) |
Website Environments
Best for: GUI-based agents that navigate web interfaces
What’s Available
| Application | Description |
|---|---|
| 📅 Calendr | Calendar application |
| 📁 Cloudfile | Cloud storage application |
| 🛍️ Shopora | E-commerce platform |
| 📧 Pandora’s Inbox | Email application |
Key Characteristics
| Feature | Description |
|---|---|
| Interaction Mode | Web UI navigation, form filling, clicking |
| State Management | SQLite database per session |
| Logging | Automatic capture of all UI interactions |
| Verification | State checks, log checks, rubric checks |
| Data Packs | Multiple pre-configured scenarios |
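The per-session state model in the table above can be illustrated with a small sketch. This is not the platform's implementation: `SessionStore`, the `events` table, and the seed rows (standing in for a data pack) are all hypothetical names chosen for illustration. It only shows how one SQLite database per session keeps sessions isolated and makes a reset cheap.

```python
import sqlite3


class SessionStore:
    """Hypothetical store that keeps one in-memory SQLite database per session."""

    def __init__(self, seed_titles):
        # seed_titles stands in for a data pack: the known initial state.
        self._seed_titles = list(seed_titles)
        self._dbs = {}

    def _connect(self, session_id):
        # Lazily create and seed a fresh database the first time a session is seen.
        if session_id not in self._dbs:
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE events (title TEXT)")
            conn.executemany(
                "INSERT INTO events (title) VALUES (?)",
                [(title,) for title in self._seed_titles],
            )
            conn.commit()
            self._dbs[session_id] = conn
        return self._dbs[session_id]

    def add_event(self, session_id, title):
        conn = self._connect(session_id)
        conn.execute("INSERT INTO events (title) VALUES (?)", (title,))
        conn.commit()

    def titles(self, session_id):
        rows = self._connect(session_id).execute(
            "SELECT title FROM events ORDER BY rowid"
        ).fetchall()
        return [row[0] for row in rows]

    def reset(self, session_id):
        # Dropping the database means the next access re-seeds from scratch.
        conn = self._dbs.pop(session_id, None)
        if conn is not None:
            conn.close()
```

In this sketch, a write in session `"a"` is never visible from session `"b"`, and calling `reset("a")` returns that session to the seeded state.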
Ideal For
- Testing agents that interact with web applications
- Evaluating form-filling and navigation capabilities
- Multi-step workflow automation
- E-commerce and productivity app testing
Desktop Environments
Best for: Computer-use agents that control full operating systems
What’s Available
| Platform | Description |
|---|---|
| 🪟 Windows | Windows desktop environment |
| 🐧 Ubuntu | Linux desktop environment |
| 🍎 macOS | Mac desktop environment |
Key Characteristics
| Feature | Description |
|---|---|
| Interaction Mode | Screenshots, mouse/keyboard, shell commands |
| State Management | Full VM filesystem and applications |
| Observation | Screenshot capture, accessibility tree |
| Verification | Gold standard comparison via Web UI |
| Initialization | Script-based task setup |
Ideal For
- Testing agents that use native desktop applications
- Evaluating visual understanding and navigation
- Cross-application workflows
- Office productivity automation
MCP Environment
Best for: Tool-use agents that call functions and APIs
What’s Available
- 45+ MCP servers covering productivity, business, communication, e-commerce, and more
- 300+ tools for comprehensive capability testing
Server Categories
| Category | Examples |
|---|---|
| Productivity | Calendar, Reminders, Notes |
| Business | CRM, QuickBooks, Invoicing |
| Communication | Email, Slack, SMS |
| E-commerce | Shopping, Orders, Inventory |
| Development | GitHub, Jira, Documentation |
Key Characteristics
| Feature | Description |
|---|---|
| Interaction Mode | Tool calls with structured parameters |
| State Management | SQLite database per server |
| Observation | Show Data endpoint for state inspection |
| Verification | LLM Judge evaluation via Web UI |
| Protocol | Model Context Protocol (MCP) |
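As a rough illustration of "tool calls with structured parameters", the sketch below builds the JSON-RPC 2.0 request that MCP uses for its `tools/call` method. The tool name `create_event` and its arguments are hypothetical; the tools actually exposed by each server are not listed here and should be discovered from the server itself.

```python
import json
from itertools import count

# Monotonic request IDs, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_tool_call(tool_name, arguments):
    """Build a JSON-RPC 2.0 request body for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(
    "create_event",
    {"title": "Design review", "start": "2025-01-15T10:00:00Z"},
)
print(json.dumps(request, indent=2))
```

The agent's job in this environment is to choose the tool name and fill in the `arguments` object; transport and session handling are left out of the sketch.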
Ideal For
- Testing agents that use function calling
- Evaluating multi-tool workflows
- API-based automation
- Business process automation
Decision Guide
By Agent Capability
| If your agent… | Use… |
|---|---|
| Navigates web pages and fills forms | Website Environments |
| Uses screenshots and mouse/keyboard | Desktop Environments |
| Calls tools and functions | MCP Environment |
| Does more than one of the above | Test in each relevant environment type |
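The capability table above can be read as a simple lookup. The sketch below encodes it that way; the capability keys (`web_navigation`, `screenshots_mouse_keyboard`, `tool_calls`) are labels invented for this example, not identifiers from the platform.

```python
from enum import Enum


class EnvironmentType(Enum):
    WEBSITE = "Website Environments"
    DESKTOP = "Desktop Environments"
    MCP = "MCP Environment"


# Hypothetical capability labels mapped to the environment the table recommends.
CAPABILITY_TO_ENV = {
    "web_navigation": EnvironmentType.WEBSITE,
    "screenshots_mouse_keyboard": EnvironmentType.DESKTOP,
    "tool_calls": EnvironmentType.MCP,
}


def environments_for(capabilities):
    """Return the environment types to test, de-duplicated, in input order."""
    unknown = [c for c in capabilities if c not in CAPABILITY_TO_ENV]
    if unknown:
        raise ValueError(f"unknown capabilities: {unknown}")
    selected = []
    for capability in capabilities:
        env = CAPABILITY_TO_ENV[capability]
        if env not in selected:
            selected.append(env)
    return selected
```

An agent with several capabilities gets one entry per matching environment type, which mirrors the last row of the table.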
By Task Type
| If your task involves… | Use… |
|---|---|
| Web application workflows | Website Environments |
| Native desktop applications | Desktop Environments |
| API integrations | MCP Environment |
| Document editing | Desktop Environments |
| E-commerce flows | Website (Shopora) or MCP |
| Calendar/scheduling | Website (Calendr) or MCP |
By Evaluation Needs
| If you need… | Use… |
|---|---|
| High parallelism | Website or MCP (containers) |
| Visual testing | Desktop (screenshots) |
| Programmatic verification | Website (API verifier) |
| Real application testing | Desktop (full apps) |
Combining Environments
Some evaluations may benefit from multiple environment types:

| Scenario | Environments |
|---|---|
| Full-stack agent testing | All three |
| Comparing UI vs API approaches | Website + MCP |
| Cross-platform testing | Multiple Desktop platforms |
Common Design Principles
All environments share these architectural properties:

| Principle | Description |
|---|---|
| Session Isolation | Sessions cannot access each other’s data; sessionId scopes all operations |
| Clean Slate Reset | Full reset returns to known initial state |
| Reproducibility | Data packs ensure consistent starting conditions |
| Observability | Complete audit trail of interactions |
| Consistent APIs | Reset, query, verify, and health patterns across all types |
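One way to picture the shared reset, query, verify, and health pattern is as a common interface that each environment type satisfies. The sketch below is an assumption about the shape of such an interface, not the platform's actual API: `EnvironmentAPI`, `InMemoryEnvironment`, the `apply` helper, and the method signatures are hypothetical, and `session_id` plays the role of the `sessionId` mentioned in the table.

```python
import copy
from typing import Any, Protocol


class EnvironmentAPI(Protocol):
    """Hypothetical interface covering the four shared operations."""

    def reset(self, session_id: str) -> None: ...
    def query(self, session_id: str) -> dict[str, Any]: ...
    def verify(self, session_id: str, expected: dict[str, Any]) -> bool: ...
    def health(self) -> bool: ...


class InMemoryEnvironment:
    """Toy implementation: each session gets its own copy of a data pack."""

    def __init__(self, data_pack: dict[str, Any]) -> None:
        self._data_pack = copy.deepcopy(data_pack)
        self._sessions: dict[str, dict[str, Any]] = {}

    def reset(self, session_id: str) -> None:
        # Clean slate: replace the session state with a fresh copy of the pack.
        self._sessions[session_id] = copy.deepcopy(self._data_pack)

    def _state(self, session_id: str) -> dict[str, Any]:
        if session_id not in self._sessions:
            self.reset(session_id)
        return self._sessions[session_id]

    def apply(self, session_id: str, key: str, value: Any) -> None:
        # Stand-in for whatever the agent does to change state.
        self._state(session_id)[key] = value

    def query(self, session_id: str) -> dict[str, Any]:
        # Return a copy so callers cannot mutate session state directly.
        return copy.deepcopy(self._state(session_id))

    def verify(self, session_id: str, expected: dict[str, Any]) -> bool:
        state = self._state(session_id)
        return all(k in state and state[k] == v for k, v in expected.items())

    def health(self) -> bool:
        return True
```

Because every operation takes a session identifier, two sessions started from the same data pack begin in identical states and never observe each other's changes.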