A Task is a specific challenge for an agent to solve within an environment. Well-designed tasks are clear, measurable, and aligned with the capabilities you want to evaluate.
Explore live task examples at gym.scale.com.

Task Components

Every task is composed of core components:
| Component | Purpose |
| --- | --- |
| Prompt | Natural language instructions for the agent |
| Initial State | Starting environment configuration (via data packs) |
| Available Tools | The tools/actions the agent can use |
| Verifier | The mechanism for measuring success |
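
To make the four components concrete, here is a minimal sketch of how a task could be represented in code. The `Task` class, its field names, the `calendar_basic` data pack, and the tool names are all hypothetical and chosen for illustration; they are not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical representation of environment state, for illustration only.
State = dict[str, Any]


@dataclass(frozen=True)
class Task:
    prompt: str                             # natural language instructions for the agent
    data_pack: str                          # data pack that sets the initial state (assumed name)
    tools: tuple[str, ...]                  # tools/actions the agent may use (assumed names)
    verifier: Callable[[State, str], bool]  # (final_state, agent_answer) -> success


def meeting_scheduled(final_state: State, answer: str) -> bool:
    """Succeed if a Friday 14:00 meeting with John exists in the final state."""
    return any(
        event.get("attendee") == "John"
        and event.get("day") == "Friday"
        and event.get("start") == "14:00"
        for event in final_state.get("events", [])
    )


task = Task(
    prompt="Schedule a meeting with John for Friday at 2pm",
    data_pack="calendar_basic",
    tools=("list_events", "create_event"),
    verifier=meeting_scheduled,
)
```

Keeping the verifier as a plain function of the final state makes it easy to run automatically after each agent attempt.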

Task Types

Tasks are categorized by the primary action required:
| Type | Description | Example |
| --- | --- | --- |
| Information Retrieval | Agent gathers and reports information without modifying state | "What events are scheduled for tomorrow?" |
| State Modification | Agent performs actions that change the environment | "Schedule a meeting with John for Friday at 2pm" |
| Hybrid | Combination of retrieval and modification | "Find all overdue invoices and send reminder emails" |
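
The categories above can be expressed as a small lookup on two questions: does the task require gathering information, and does it require changing the environment? The `TaskType` enum and `classify` helper below are hypothetical names used only to illustrate that mapping.

```python
from enum import Enum


class TaskType(Enum):
    INFORMATION_RETRIEVAL = "information_retrieval"
    STATE_MODIFICATION = "state_modification"
    HYBRID = "hybrid"


def classify(requires_retrieval: bool, requires_modification: bool) -> TaskType:
    """Map the primary action(s) a task requires to a task type."""
    if requires_retrieval and requires_modification:
        return TaskType.HYBRID
    if requires_retrieval:
        return TaskType.INFORMATION_RETRIEVAL
    if requires_modification:
        return TaskType.STATE_MODIFICATION
    # A task that requires neither action does not fit any category.
    raise ValueError("a task must require retrieval, modification, or both")


# "Find all overdue invoices and send reminder emails" needs both actions.
invoice_task_type = classify(requires_retrieval=True, requires_modification=True)
```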

Best Practices

Prompt:
  • Be specific about the desired outcome
  • Include all necessary context
  • Avoid ambiguous instructions
  • Use natural language a human would understand

Verifier:
  • Identify the exact state changes to verify
  • Include both positive and negative checks
  • Consider partial completion scenarios
  • Design for automated verification

Initial state:
  • Use data packs for consistent starting conditions
  • Document any manual setup required
  • Ensure reproducibility across runs
  • Consider edge cases in initial data

Flexibility and validation:
  • Avoid over-constraining the solution path
  • Allow multiple valid approaches when appropriate
  • Test tasks with human annotators first
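
The verification practices above (positive and negative checks, partial completion, automation) can be sketched as a small scoring function. This is an illustrative sketch, not the platform's verifier API: the `Check` class, the `score` function, and the shape of the state dictionary are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical representation of environment state, for illustration only.
State = dict[str, Any]


@dataclass(frozen=True)
class Check:
    name: str
    predicate: Callable[[State], bool]
    must_hold: bool = True  # True = positive check, False = negative check


def score(state: State, checks: list[Check]) -> float:
    """Return the fraction of checks satisfied, so partial completion is visible."""
    if not checks:
        raise ValueError("at least one check is required")
    passed = sum(1 for check in checks if check.predicate(state) == check.must_hold)
    return passed / len(checks)


def reminded_all_overdue(state: State) -> bool:
    """Positive check: every overdue invoice received a reminder."""
    return all(
        inv["id"] in state["reminders_sent"]
        for inv in state["invoices"]
        if inv["overdue"]
    )


def reminded_non_overdue(state: State) -> bool:
    """Negative check target: a reminder went to an invoice that is not overdue."""
    return any(
        inv["id"] in state["reminders_sent"]
        for inv in state["invoices"]
        if not inv["overdue"]
    )


checks = [
    Check("reminded_all_overdue", reminded_all_overdue),
    Check("no_reminder_for_non_overdue", reminded_non_overdue, must_hold=False),
]

invoices = [{"id": 1, "overdue": True}, {"id": 2, "overdue": False}]
full = score({"invoices": invoices, "reminders_sent": [1]}, checks)
partial = score({"invoices": invoices, "reminders_sent": [1, 2]}, checks)
```

The checks inspect only the final state, not the sequence of tool calls, so any valid approach that reaches the correct state receives full credit.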
