MCP Verifiers

MCP environments use client-side LLM Judge evaluation to verify agent responses against rubric claims.

Access

MCP verifiers are only accessible through the Scale Gymnasium Web UI at gym.scale.com, not via Docker container APIs.

Evaluation Approach

Component	Purpose
Rubric Claims	Expected actions/information the agent should produce
LLM Judge	Your model, run client-side, evaluates the agent’s response against each claim
Verdicts	Pass/fail determination per claim

GTFA Claims

Ground Truth Factual Assertions (GTFA) define what must be present in the agent’s response.

What to Include

Required information: Facts that must appear in the response
Expected actions: Tool calls the agent should have made
Factual accuracy: Correct values and calculations
Format requirements: Structure of the expected output

Example Claims

Task Type	Example Claim
Information retrieval	“Response includes the meeting time of 2:00 PM”
Calculation	“Response shows total of $1,234.56”
Tool use	“Agent called `send_email` with correct recipient”
Multi-step	“Agent created calendar event AND sent notification”

LLM Judge

The LLM Judge evaluates the agent’s response against each rubric claim.

How It Works

Agent completes task and produces response
Response is sent to LLM Judge with rubric claims
Judge evaluates each claim independently
Verdicts are aggregated into final score
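
The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `Judge` signature, `evaluate_claims`, and `stub_judge` are hypothetical names, and the stub replaces a real LLM call with a substring check so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: (task_prompt, agent_response, claim) -> verdict string.
Judge = Callable[[str, str, str], str]


def evaluate_claims(
    task_prompt: str, agent_response: str, claims: List[str], judge: Judge
) -> List[Dict[str, str]]:
    """Ask the judge about each claim independently and collect the verdicts."""
    results = []
    for claim in claims:
        verdict = judge(task_prompt, agent_response, claim)
        results.append({"claim": claim, "verdict": verdict})
    return results


def stub_judge(task_prompt: str, agent_response: str, claim: str) -> str:
    """Toy stand-in for an LLM call (assumption): pass when the text after
    'includes ' in the claim appears in the response."""
    expected = claim.split("includes ", 1)[-1]
    return "pass" if expected.lower() in agent_response.lower() else "fail"


results = evaluate_claims(
    "When is the meeting?",
    "The meeting is at 2:00 PM.",
    ["Response includes 2:00 PM", "Response includes Room 4"],
    stub_judge,
)
```

Because each claim is judged in its own call, one failing claim does not influence the verdict on another.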

Judge Prompt Structure

The judge receives:

The original task prompt
The agent’s response/trajectory
Each rubric claim to evaluate
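
One way to assemble these three inputs into a single judge prompt is sketched below. The function name, the section labels, and the final instruction line are assumptions made for illustration. The source does not specify the actual prompt wording.

```python
def build_judge_prompt(task_prompt: str, agent_response: str, claim: str) -> str:
    """Combine the three judge inputs into one prompt string (hypothetical wording)."""
    return (
        "You are grading an agent's work.\n\n"
        f"Task prompt:\n{task_prompt}\n\n"
        f"Agent response/trajectory:\n{agent_response}\n\n"
        f"Claim to evaluate:\n{claim}\n\n"
        "Answer with exactly one word: pass, fail, or partial."
    )


prompt = build_judge_prompt(
    "Schedule a meeting with John for Friday at 2pm",
    "Called calendar_add_event with title 'Meeting with John'",
    "Agent created a calendar event titled 'Meeting with John'",
)
```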

Verdict Types

Verdict	Meaning
Pass	Claim is satisfied by the response
Fail	Claim is not satisfied
Partial	Claim is partially satisfied (where applicable)
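
Since the judge is a language model, its raw output may vary in case or punctuation. The sketch below normalizes it to one of the three verdicts. The `Verdict` enum and `parse_verdict` are hypothetical names, and treating unrecognized output as a fail is an assumption, not documented behavior.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    PARTIAL = "partial"


def parse_verdict(raw: str) -> Verdict:
    """Map raw judge text to a Verdict.

    Assumption: anything that is not recognizably pass/fail/partial counts as FAIL.
    """
    cleaned = raw.strip().strip(".").lower()
    try:
        return Verdict(cleaned)
    except ValueError:
        return Verdict.FAIL
```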

Scoring

Calculation

The final score is typically computed in one of two ways:

All-or-nothing: 1.0 if all claims pass, 0.0 if any claim fails
Proportional: the number of passed claims divided by the total number of claims
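
The two scoring conventions above can be sketched as follows. The function names are illustrative, and counting a "partial" verdict as not passed is an assumption, since the source does not say how partial verdicts are weighted.

```python
from typing import List


def all_or_nothing_score(verdicts: List[str]) -> float:
    """1.0 only when every claim passes; 0.0 otherwise (including an empty list)."""
    return 1.0 if verdicts and all(v == "pass" for v in verdicts) else 0.0


def proportional_score(verdicts: List[str]) -> float:
    """Fraction of claims that passed. Assumption: 'partial' earns no credit."""
    if not verdicts:
        return 0.0
    return sum(v == "pass" for v in verdicts) / len(verdicts)
```

For example, the verdicts `["pass", "fail"]` score 0.0 under the all-or-nothing rule and 0.5 under the proportional rule.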

Response Structure

{
  "score": 1.0,
  "claims": [
    { "claim": "Agent sent email to correct recipient", "verdict": "pass" },
    { "claim": "Email subject matches requirement", "verdict": "pass" }
  ],
  "message": "All claims verified"
}
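
A consumer of this response may want to check its shape before using it. The sketch below validates the fields shown in the example. The `validate_result` helper, the allowed verdict strings, and the specific checks are assumptions inferred from that example rather than a published schema.

```python
import json
from typing import Any, Dict, List

# Assumption: verdict strings are lowercase, as in the example response.
ALLOWED_VERDICTS = {"pass", "fail", "partial"}


def validate_result(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a verifier response (empty if none)."""
    problems = []
    score = payload.get("score")
    if (
        isinstance(score, bool)
        or not isinstance(score, (int, float))
        or not 0.0 <= score <= 1.0
    ):
        problems.append("score must be a number between 0.0 and 1.0")
    claims = payload.get("claims")
    if not isinstance(claims, list) or not claims:
        problems.append("claims must be a non-empty list")
    else:
        for i, entry in enumerate(claims):
            if not isinstance(entry, dict) or not isinstance(entry.get("claim"), str):
                problems.append(f"claims[{i}] is missing a 'claim' string")
            elif entry.get("verdict") not in ALLOWED_VERDICTS:
                problems.append(f"claims[{i}] has an unknown verdict")
    if not isinstance(payload.get("message"), str):
        problems.append("message must be a string")
    return problems


example = json.loads(
    '{"score": 1.0, "claims": ['
    '{"claim": "Agent sent email to correct recipient", "verdict": "pass"},'
    '{"claim": "Email subject matches requirement", "verdict": "pass"}],'
    '"message": "All claims verified"}'
)
problems = validate_result(example)
```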

Task Component Structure

MCP tasks include these verification components:

Component	Purpose
`ENABLED_TOOLS`	Tools available to the agent
`PROMPT`	Task instructions
`TRAJECTORY`	Expected sequence of tool calls (ground truth)
`GTFA_CLAIMS`	Required factual claims for verification
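
A quick completeness check for these four components might look like the sketch below. `missing_components` is a hypothetical helper, and treating an empty value as missing is an assumption.

```python
from typing import Any, Dict, List

REQUIRED_COMPONENTS = ("ENABLED_TOOLS", "PROMPT", "TRAJECTORY", "GTFA_CLAIMS")


def missing_components(task: Dict[str, Any]) -> List[str]:
    """List required components that are absent or empty (assumption: empty counts as missing)."""
    return [name for name in REQUIRED_COMPONENTS if not task.get(name)]


draft_task = {
    "ENABLED_TOOLS": ["email_send"],
    "PROMPT": "Send John a confirmation email",
    "TRAJECTORY": [],
}
gaps = missing_components(draft_task)
```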

Best Practices

Be specific in claims

❌ “Response is correct”
✅ “Response includes the customer’s order total of $156.78”

Include both positive and negative criteria

✅ “Response mentions the 3:00 PM meeting”
✅ “Response does NOT include cancelled events”

Consider partial completion

For complex tasks, define claims at different granularity levels to capture partial success.

Test claims with edge cases

Verify your claims work correctly with:

Correct responses (should pass)
Incorrect responses (should fail)
Partially correct responses (should behave as expected)
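
The three kinds of checks above fit naturally into a small table-driven harness. The sketch below is illustrative only: `check_claim` and `toy_judge` are hypothetical, and the toy judge is a string-matching stand-in for a real LLM Judge so that the example runs without a model.

```python
from typing import Callable, List, Tuple

# (response, expected verdict) pairs for one claim.
Case = Tuple[str, str]


def check_claim(
    claim: str, cases: List[Case], judge: Callable[[str, str], str]
) -> List[Tuple[str, str, str]]:
    """Run every case through the judge; return (response, expected, actual) for mismatches."""
    mismatches = []
    for response, expected in cases:
        actual = judge(response, claim)
        if actual != expected:
            mismatches.append((response, expected, actual))
    return mismatches


def toy_judge(response: str, claim: str) -> str:
    """Stand-in judge (assumption): only understands the $156.78 order-total claim."""
    if "$156.78" in response:
        return "pass"
    if "156" in response:
        return "partial"
    return "fail"


claim = "Response includes the customer's order total of $156.78"
cases = [
    ("Your order total is $156.78.", "pass"),        # correct response
    ("Your order total is $99.00.", "fail"),         # incorrect response
    ("Your order total is about $156.", "partial"),  # partially correct response
]
mismatches = check_claim(claim, cases, toy_judge)
```

An empty `mismatches` list means the claim behaved as expected on every case. Any entry points to a claim that may need rewording.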

Example Task Structure

{
  "ENABLED_TOOLS": ["calendar_add_event", "email_send", "reminder_create"],
  "PROMPT": "Schedule a meeting with John for Friday at 2pm and send him an email confirmation",
  "TRAJECTORY": [
    { "tool": "calendar_add_event", "args": { "title": "Meeting with John", "time": "Friday 2pm" } },
    { "tool": "email_send", "args": { "to": "john@example.com", "subject": "Meeting Confirmation" } }
  ],
  "GTFA_CLAIMS": [
    "Agent created a calendar event titled 'Meeting with John'",
    "Event is scheduled for Friday at 2:00 PM",
    "Agent sent email to john@example.com",
    "Email subject contains 'Meeting' or 'Confirmation'"
  ]
}
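
A task like the one above can be sanity-checked for internal consistency, for instance by confirming that every tool in `TRAJECTORY` is listed in `ENABLED_TOOLS`. The helpers below are a hypothetical sketch, and whether the platform enforces such a rule is not stated in the source.

```python
from typing import Any, Dict, List


def unknown_tools(task: Dict[str, Any]) -> List[str]:
    """Tools used in TRAJECTORY that are not listed in ENABLED_TOOLS."""
    enabled = set(task["ENABLED_TOOLS"])
    return [step["tool"] for step in task["TRAJECTORY"] if step["tool"] not in enabled]


def unused_tools(task: Dict[str, Any]) -> List[str]:
    """Enabled tools that the ground-truth TRAJECTORY never calls."""
    used = {step["tool"] for step in task["TRAJECTORY"]}
    return [tool for tool in task["ENABLED_TOOLS"] if tool not in used]


task = {
    "ENABLED_TOOLS": ["calendar_add_event", "email_send", "reminder_create"],
    "TRAJECTORY": [
        {"tool": "calendar_add_event", "args": {"title": "Meeting with John", "time": "Friday 2pm"}},
        {"tool": "email_send", "args": {"to": "john@example.com", "subject": "Meeting Confirmation"}},
    ],
}
```

For the example task, no trajectory step uses an unlisted tool, and `reminder_create` is enabled but unused, which is harmless and may even serve as a distractor.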

Next Steps

Website Verifiers

Task Design
