Task Design Overview

A Task is a specific challenge for an agent to solve within an environment. Well-designed tasks are clear, measurable, and aligned with the capabilities you want to evaluate.

Explore live task examples at gym.scale.com.

Task Components

Every task is composed of four core components:

Component	Purpose
Prompt	Natural language instructions for the agent
Initial State	Starting environment configuration (via data packs)
Available Tools	The tools/actions the agent can use
Verifier	The mechanism for measuring success
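As a rough illustration, the four components in the table above can be grouped into a single task definition. This is only a sketch: the `Task` class, its field names, the `calendar_default` data pack name, the tool names, and the shape of the state dictionary are all assumptions made for this example, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation of environment state, for illustration only.
State = Dict[str, Any]


@dataclass
class Task:
    """Illustrative container for the four task components (not a real schema)."""
    prompt: str                         # natural language instructions
    data_pack: str                      # name of the data pack that sets the initial state
    available_tools: List[str]          # tools/actions the agent may use
    verifier: Callable[[State], bool]   # returns True when the task succeeded


def meeting_scheduled(final_state: State) -> bool:
    """Check that the final state contains the requested meeting."""
    return any(
        event.get("attendee") == "John"
        and event.get("day") == "Friday"
        and event.get("time") == "14:00"
        for event in final_state.get("events", [])
    )


task = Task(
    prompt="Schedule a meeting with John for Friday at 2pm",
    data_pack="calendar_default",                    # assumed data pack name
    available_tools=["list_events", "create_event"],  # assumed tool names
    verifier=meeting_scheduled,
)
```

Keeping the verifier as a plain function of the final state makes it easy to run against recorded states without replaying the agent.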

Task Types

Tasks are categorized by the primary action required:

Type	Description	Example
Information Retrieval	Agent gathers and reports information without modifying state	“What events are scheduled for tomorrow?”
State Modification	Agent performs actions that change the environment	“Schedule a meeting with John for Friday at 2pm”
Hybrid	Combination of retrieval and modification	“Find all overdue invoices and send reminder emails”
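One way to make the categorization above explicit is to tag each task by whether it needs retrieval, modification, or both. The sketch below is hypothetical: the `TaskType` enum and the `classify` helper are names invented for this example and are not part of any documented API.

```python
from enum import Enum


class TaskType(Enum):
    INFORMATION_RETRIEVAL = "information_retrieval"
    STATE_MODIFICATION = "state_modification"
    HYBRID = "hybrid"


def classify(needs_retrieval: bool, modifies_state: bool) -> TaskType:
    """Map the two properties of a task onto one of the three task types."""
    if needs_retrieval and modifies_state:
        return TaskType.HYBRID
    if modifies_state:
        return TaskType.STATE_MODIFICATION
    if needs_retrieval:
        return TaskType.INFORMATION_RETRIEVAL
    raise ValueError("A task must retrieve information, modify state, or both")


# "What events are scheduled for tomorrow?"
calendar_query = classify(needs_retrieval=True, modifies_state=False)

# "Find all overdue invoices and send reminder emails"
invoice_reminders = classify(needs_retrieval=True, modifies_state=True)
```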

Choose Your Environment

Website Tasks

Prompts, subproblems, and JSON verifiers for web apps

Desktop Tasks

Initialization configs and file-based evaluators for VMs

MCP Tasks

Tool constraints, trajectories, and GTFA claims

Best Practices

Write Clear Prompts

Be specific about the desired outcome
Include all necessary context
Avoid ambiguous instructions
Use natural language a human would understand

Design Measurable Outcomes

Identify the exact state changes to verify
Include both positive and negative checks
Consider partial completion scenarios
Design for automated verification
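The points above can be sketched as a small verifier that combines positive checks (things that must be true), negative checks (things that must not be true), and a partial-completion score. The `verify` function, the check names, and the `reminders` state key are assumptions for illustration, not the platform's verifier format.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Check = Tuple[str, Callable[[State], bool]]  # (name, predicate on final state)


def verify(final_state: State, positive: List[Check], negative: List[Check]) -> Dict[str, Any]:
    """Run positive and negative checks and report success plus partial credit."""
    passed = [name for name, check in positive if check(final_state)]
    violated = [name for name, check in negative if check(final_state)]
    # Partial credit: fraction of positive checks satisfied.
    score = len(passed) / len(positive) if positive else 0.0
    # Full success requires every positive check and no negative check to fire.
    success = len(passed) == len(positive) and not violated
    return {"success": success, "score": score, "passed": passed, "violated": violated}


# Hypothetical checks for "Find all overdue invoices and send reminder emails",
# assuming inv-1 and inv-2 are overdue and inv-3 is already paid.
positive = [
    ("reminder_sent_inv_1", lambda s: "inv-1" in s["reminders"]),
    ("reminder_sent_inv_2", lambda s: "inv-2" in s["reminders"]),
]
negative = [
    ("reminder_sent_to_paid_invoice", lambda s: "inv-3" in s["reminders"]),
]

partial = verify({"reminders": ["inv-1"]}, positive, negative)
```

Here `partial` reports a score of 0.5 with `success` set to False, which separates an agent that did half the work from one that did nothing.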

Control Initial State

Use data packs for consistent starting conditions
Document any manual setup required
Ensure reproducibility across runs
Consider edge cases in initial data
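A minimal sketch of the reproducibility point, assuming a data pack can be represented as a JSON-serializable dictionary: each run starts from a fresh copy, and a fingerprint of the starting state confirms that nothing leaked over from an earlier run. `DATA_PACK`, `fresh_state`, and `fingerprint` are hypothetical names for this example.

```python
import copy
import hashlib
import json
from typing import Any, Dict

State = Dict[str, Any]

# Hypothetical data pack contents; real data packs are configured in the platform.
DATA_PACK: State = {
    "events": [{"title": "Standup", "day": "Monday", "time": "09:00"}],
    "contacts": ["John"],
}


def fresh_state(data_pack: State) -> State:
    """Return an independent copy so one run cannot mutate the next run's start."""
    return copy.deepcopy(data_pack)


def fingerprint(state: State) -> str:
    """Hash a canonical JSON encoding of the state for run-to-run comparison."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The first run mutates its own copy; the second run still starts from the pack.
run_1 = fresh_state(DATA_PACK)
run_1["events"].append({"title": "Meeting with John", "day": "Friday", "time": "14:00"})
run_2 = fresh_state(DATA_PACK)
```

Comparing `fingerprint(run_2)` with `fingerprint(DATA_PACK)` before each run is a cheap way to catch state carried over between runs.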

Allow Multiple Paths

Avoid over-constraining the solution path
Allow multiple valid approaches when appropriate
Test tasks with human annotators first

Next Steps

Verifier Configuration

Set up verification checks

Data Packs

Configure initial environment state
