A Task is a specific challenge for an agent to solve within an environment. Well-designed tasks are clear, measurable, and aligned with the capabilities you want to evaluate.
Explore live task examples at gym.scale.com.

Task Components

Every task is composed of core components:
| Component | Purpose |
| --- | --- |
| Prompt | Natural language instructions for the agent |
| Initial State | Starting environment configuration (via data packs) |
| Available Tools | The tools/actions the agent can use |
| Verifier | The mechanism for measuring success |
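To make the four components concrete, here is a minimal sketch of how a task could be modeled in code. The `Task` class, its field names, the `calendar_default` data pack name, and the tool names are all hypothetical and chosen for illustration; they are not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable


# Hypothetical container for the four task components; not a real platform API.
@dataclass
class Task:
    prompt: str                                  # natural language instructions
    data_pack: str                               # name of the initial-state data pack (assumed)
    allowed_tools: list[str]                     # tools/actions the agent may use
    verifier: Callable[[dict[str, Any]], bool]   # measures success on the final state


def meeting_scheduled(final_state: dict[str, Any]) -> bool:
    """Return True if the final state contains the requested meeting."""
    return any(
        event.get("attendee") == "John"
        and event.get("day") == "Friday"
        and event.get("time") == "14:00"
        for event in final_state.get("events", [])
    )


task = Task(
    prompt="Schedule a meeting with John for Friday at 2pm",
    data_pack="calendar_default",                   # assumed data pack name
    allowed_tools=["list_events", "create_event"],  # assumed tool names
    verifier=meeting_scheduled,
)
```

Keeping the verifier as a plain function of the final state means it can be run automatically after every agent attempt, independent of how the agent got there.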

Task Types

Tasks are categorized by the primary action required:
| Type | Description | Example |
| --- | --- | --- |
| Information Retrieval | Agent gathers and reports information without modifying state | "What events are scheduled for tomorrow?" |
| State Modification | Agent performs actions that change the environment | "Schedule a meeting with John for Friday at 2pm" |
| Hybrid | Combination of retrieval and modification | "Find all overdue invoices and send reminder emails" |
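One way to read this categorization is as two yes/no questions: does the task require the agent to retrieve information, and does it require the agent to change the environment? The sketch below encodes that reading; the `TaskType` enum and `classify` helper are hypothetical names used only for illustration.

```python
from enum import Enum


# Hypothetical labels mirroring the three task types described above.
class TaskType(Enum):
    INFORMATION_RETRIEVAL = "information_retrieval"
    STATE_MODIFICATION = "state_modification"
    HYBRID = "hybrid"


def classify(retrieves_information: bool, modifies_state: bool) -> TaskType:
    """Map the two primary actions a task requires onto a task type."""
    if retrieves_information and modifies_state:
        return TaskType.HYBRID
    if modifies_state:
        return TaskType.STATE_MODIFICATION
    if retrieves_information:
        return TaskType.INFORMATION_RETRIEVAL
    raise ValueError("a task must require retrieval, modification, or both")


# "What events are scheduled for tomorrow?"
calendar_lookup = classify(retrieves_information=True, modifies_state=False)
# "Find all overdue invoices and send reminder emails"
invoice_reminders = classify(retrieves_information=True, modifies_state=True)
```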

Choose Your Environment

Website Tasks

Prompts, subproblems, and JSON verifiers for web apps

Desktop Tasks

Initialization configs and file-based evaluators for VMs

MCP Tasks

Tool constraints, trajectories, and GTFA claims

Best Practices

  • Be specific about the desired outcome
  • Include all necessary context
  • Avoid ambiguous instructions
  • Use natural language a human would understand
  • Identify the exact state changes to verify
  • Include both positive and negative checks
  • Consider partial completion scenarios
  • Design for automated verification
  • Use data packs for consistent starting conditions
  • Document any manual setup required
  • Ensure reproducibility across runs
  • Consider edge cases in initial data
  • Avoid over-constraining the solution path
  • Allow multiple valid approaches when appropriate
  • Test tasks with human annotators first
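
Several of these practices (exact state changes, positive and negative checks, partial completion, automated verification) can be sketched together in one verifier. The example below is an illustrative assumption, not the platform's verifier format: the state layout, the check names, and the `verify_meeting_task` function are all hypothetical.

```python
from typing import Any


def verify_meeting_task(initial: dict[str, Any], final: dict[str, Any]) -> dict[str, Any]:
    """Compare initial and final state for 'Schedule a meeting with John for Friday at 2pm'."""
    initial_events = initial.get("events", [])
    final_events = final.get("events", [])

    checks = {
        # Positive check: the requested meeting exists in the final state.
        "meeting_created": any(
            e.get("attendee") == "John"
            and e.get("day") == "Friday"
            and e.get("time") == "14:00"
            for e in final_events
        ),
        # Negative check: events present at the start were not removed or edited.
        "existing_events_untouched": all(e in final_events for e in initial_events),
        # Negative check: exactly one event was added, nothing extra.
        "no_extra_events": len(final_events) == len(initial_events) + 1,
    }

    passed = sum(checks.values())
    return {
        "checks": checks,
        "score": passed / len(checks),     # partial credit for partial completion
        "success": all(checks.values()),   # full success requires every check
    }


standup = {"title": "Standup", "attendee": "Team", "day": "Monday", "time": "09:00"}
meeting = {"title": "Sync", "attendee": "John", "day": "Friday", "time": "14:00"}

initial_state = {"events": [standup]}
good_result = verify_meeting_task(initial_state, {"events": [standup, meeting]})
partial_result = verify_meeting_task(initial_state, {"events": [meeting]})
```

The verifier only inspects the final state, so it does not constrain how the agent reached it; any valid sequence of tool calls that produces the same state passes.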

Next Steps

Verifier Configuration

Set up verification checks

Data Packs

Configure initial environment state