MCP tasks evaluate an agent’s ability to use tools effectively—calling the right functions with correct parameters to accomplish goals across calendar, email, CRM, filesystem, and other services.

Task Components

Component       Purpose
ENABLED_TOOLS   Tools available to the model (constraint space)
PROMPT          User’s query
TRAJECTORY      Intended sequence of tool calls (ground truth)
GTFA_CLAIMS     Required factual claims in final response

Component Details

ENABLED_TOOLS

Defines which tools the agent can access. Constraining tools increases task focus and reduces ambiguity.
{
  "ENABLED_TOOLS": [
    "calendar_get_events",
    "calendar_add_event",
    "email_send",
    "email_search"
  ]
}
Considerations:
  • Include the tools the task actually needs
  • Add an irrelevant tool only when you want to test the agent’s ability to discriminate between tools
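To make the second consideration concrete, the sketch below lists the enabled tools that a reference trajectory never calls, which is one rough way to see how many distractors a task contains. The helper name `find_distractors` is hypothetical and not part of any MCP tooling; only the `ENABLED_TOOLS` and `TRAJECTORY` field names come from this page.

```python
def find_distractors(task: dict) -> list[str]:
    """Return enabled tools that the reference trajectory never calls."""
    # Collect every tool name that appears in the reference trajectory.
    used = {step["tool"] for step in task.get("TRAJECTORY", [])}
    # Keep the original ENABLED_TOOLS order so the output is stable.
    return [tool for tool in task["ENABLED_TOOLS"] if tool not in used]


task = {
    "ENABLED_TOOLS": [
        "calendar_get_events",
        "calendar_add_event",
        "email_send",
        "email_search",
    ],
    "TRAJECTORY": [
        {"tool": "calendar_add_event", "args": {}},
        {"tool": "email_send", "args": {}},
    ],
}

print(find_distractors(task))  # ['calendar_get_events', 'email_search']
```

An unused tool is not necessarily a distractor, since the trajectory is only one valid path; treat the result as a prompt for review rather than a verdict.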

PROMPT

The natural language instruction given to the agent.
{
  "PROMPT": "Schedule a meeting with Sarah for next Tuesday at 2pm and send her a calendar invite via email"
}
Prompt Design Tips:
  • Be specific about expected outcomes
  • Include implicit constraints (e.g., “during business hours”)

TRAJECTORY

The expected sequence of tool calls (ground truth). Used for trajectory-based evaluation.
{
  "TRAJECTORY": [
    {
      "tool": "calendar_add_event",
      "args": {
        "title": "Meeting with Sarah",
        "datetime": "2025-01-21T14:00:00",
        "attendees": ["sarah@example.com"]
      }
    },
    {
      "tool": "email_send",
      "args": {
        "to": "sarah@example.com",
        "subject": "Meeting Invitation",
        "body": "..."
      }
    }
  ]
}
Note: Trajectories represent one valid path—agents may complete tasks via different valid sequences.
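This page does not say how trajectory-based evaluation scores a run. As one possible and deliberately lenient approach, the sketch below compares an agent's calls to the reference by tool name only, reporting missing and extra tools alongside a strict order check. The function name `compare_tool_calls` and the result keys are assumptions made for illustration.

```python
from collections import Counter


def compare_tool_calls(expected: list[dict], actual: list[dict]) -> dict:
    """Compare two trajectories by tool name only, ignoring arguments."""
    expected_names = [step["tool"] for step in expected]
    actual_names = [step["tool"] for step in actual]
    # Counter subtraction keeps only positive counts, so repeated calls
    # to the same tool are accounted for.
    missing = Counter(expected_names) - Counter(actual_names)
    extra = Counter(actual_names) - Counter(expected_names)
    return {
        "exact_order_match": expected_names == actual_names,
        "missing_tools": sorted(missing.elements()),
        "extra_tools": sorted(extra.elements()),
    }


expected = [
    {"tool": "calendar_add_event", "args": {}},
    {"tool": "email_send", "args": {}},
]
# Hypothetical agent run that checks the calendar before adding the event.
actual = [
    {"tool": "calendar_get_events", "args": {}},
    {"tool": "calendar_add_event", "args": {}},
    {"tool": "email_send", "args": {}},
]

report = compare_tool_calls(expected, actual)
print(report)
```

Here the agent's run is not an exact match, yet nothing is missing: the extra `calendar_get_events` call is a plausible alternative path rather than an error, which is why an exact-sequence check alone can be too strict.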

GTFA_CLAIMS

Ground Truth Factual Assertions that must be present in the agent’s final response.
{
  "GTFA_CLAIMS": [
    "Agent created a calendar event for Tuesday at 2:00 PM",
    "Event includes Sarah as an attendee",
    "Confirmation email was sent to sarah@example.com"
  ]
}
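How each claim is judged against the response is not described here. A minimal sketch, assuming a pluggable `judge` callable that decides whether a response supports a claim, might compute a per-claim result and an overall score as follows. The names `score_claims`, `keyword_judge`, and `KEYWORDS` are hypothetical, and the keyword matcher is only a toy stand-in for a real judge such as a human reviewer or a model-based grader.

```python
from typing import Callable


def score_claims(
    claims: list[str],
    response: str,
    judge: Callable[[str, str], bool],
) -> dict:
    """Judge each claim against the response and report the fraction satisfied."""
    results = {claim: judge(claim, response) for claim in claims}
    satisfied = sum(results.values())
    score = satisfied / len(claims) if claims else 0.0
    return {"results": results, "score": score}


claims = [
    "Event includes Sarah as an attendee",
    "Confirmation email was sent to sarah@example.com",
]

# Toy lookup: keywords that must all appear for a claim to count as supported.
KEYWORDS = {
    claims[0]: ["sarah", "attendee"],
    claims[1]: ["email", "sarah@example.com"],
}


def keyword_judge(claim: str, response: str) -> bool:
    """Toy judge: case-insensitive check that every keyword is present."""
    text = response.lower()
    return all(keyword in text for keyword in KEYWORDS[claim])


response = "I scheduled the meeting and added Sarah as an attendee."
result = score_claims(claims, response, keyword_judge)
print(result["score"])  # 0.5
```

The response above never mentions the confirmation message, so only the first claim is counted as present.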

Example Task Structure

Complete Example

{
  "ENABLED_TOOLS": [
    "calendar_get_events",
    "calendar_add_event",
    "calendar_delete_event",
    "email_send",
    "contacts_search"
  ],
  "PROMPT": "Check my calendar for Friday. If I have a meeting with the marketing team, cancel it and send an apology email to all attendees.",
  "TRAJECTORY": [
    { "tool": "calendar_get_events", "args": { "date": "Friday" } },
    { "tool": "calendar_delete_event", "args": { "event_id": "mktg-123" } },
    { "tool": "email_send", "args": { "to": ["alice@co.com", "bob@co.com"], "subject": "Meeting Cancelled" } }
  ],
  "GTFA_CLAIMS": [
    "Agent checked Friday's calendar",
    "Agent identified the marketing team meeting",
    "Agent cancelled the meeting",
    "Agent sent apology email to all attendees (Alice and Bob)"
  ]
}
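Before running a task like the one above, it can help to check that the pieces are consistent with each other. The sketch below is a hypothetical validator, not an official schema check: it assumes all four components are required and that every tool in `TRAJECTORY` must appear in `ENABLED_TOOLS`.

```python
import json

# Assumption: all four components from the table are mandatory.
REQUIRED_KEYS = ("ENABLED_TOOLS", "PROMPT", "TRAJECTORY", "GTFA_CLAIMS")


def validate_task(task: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in task]
    if problems:
        return problems  # Later checks need every key to be present.

    enabled = set(task["ENABLED_TOOLS"])
    for index, step in enumerate(task["TRAJECTORY"]):
        tool = step.get("tool")
        if tool not in enabled:
            problems.append(f"step {index}: tool {tool!r} is not in ENABLED_TOOLS")

    if not task["PROMPT"].strip():
        problems.append("PROMPT is empty")
    if not task["GTFA_CLAIMS"]:
        problems.append("GTFA_CLAIMS is empty")
    return problems


raw = """
{
  "ENABLED_TOOLS": ["calendar_get_events", "email_send"],
  "PROMPT": "Check my calendar for Friday.",
  "TRAJECTORY": [
    { "tool": "calendar_get_events", "args": { "date": "Friday" } },
    { "tool": "calendar_delete_event", "args": { "event_id": "mktg-123" } }
  ],
  "GTFA_CLAIMS": ["Agent checked Friday's calendar"]
}
"""

problems = validate_task(json.loads(raw))
for problem in problems:
    print(problem)
```

In this deliberately broken task, the second step calls `calendar_delete_event`, which is not enabled, so the validator reports one problem for step 1.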

Best Practices

GTFA claims should be specific and unambiguous:
  ❌ “Agent completed the task correctly”
  ✅ “Agent sent email to john@example.com with subject containing ‘Q3 Report’”
Test agent behavior at the edges:
  • What if no results are found?
  • What if multiple matches exist?
  • What if required information is missing?
Choose the number of enabled tools deliberately:
  • Too few tools: the task may be impossible
  • Too many tools: the agent may pick the wrong one
  • Include 1-2 “distractor” tools to test discrimination
Many tasks can be completed in different ways. Ensure GTFA claims verify outcomes, not specific tool sequences.
Use data packs that reflect real usage patterns—realistic contact names, plausible email content, believable calendar schedules.
