MCP environments use client-side LLM Judge evaluation to verify agent responses against rubric claims.

Access

MCP verifiers are only accessible through the Scale Gymnasium Web UI at gym.scale.com, not via Docker container APIs.

Evaluation Approach

| Component | Purpose |
| --- | --- |
| Rubric Claims | Expected actions/information the agent should produce |
| LLM Judge | Your model evaluates the agent’s response |
| Verdicts | Pass/fail determination per claim |

GTFA Claims

Ground Truth Factual Assertions (GTFA) define what must be present in the agent’s response.

What to Include

  • Required information: Facts that must appear in the response
  • Expected actions: Tool calls the agent should have made
  • Factual accuracy: Correct values and calculations
  • Format requirements: Structure of the expected output

Example Claims

| Task Type | Example Claim |
| --- | --- |
| Information retrieval | “Response includes the meeting time of 2:00 PM” |
| Calculation | “Response shows total of $1,234.56” |
| Tool use | “Agent called send_email with correct recipient” |
| Multi-step | “Agent created calendar event AND sent notification” |
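
In a task definition these claims are plain strings. As a sketch, a hypothetical information-retrieval task might carry claims like the following (the second and third claims are illustrative, not taken from a real task):

GTFA_CLAIMS = [
    "Response includes the meeting time of 2:00 PM",
    "Response mentions the meeting location",
    "Response does NOT include cancelled events",
]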

LLM Judge

The LLM Judge evaluates the agent’s response against each rubric claim.

How It Works

  1. Agent completes task and produces response
  2. Response is sent to LLM Judge with rubric claims
  3. Judge evaluates each claim independently
  4. Verdicts are aggregated into final score
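
As a rough sketch of this loop, assuming a caller-supplied `call_llm` callable and a one-word verdict convention (neither is part of the MCP verifier API):

from typing import Callable, Dict, List

def judge_claims(
    task_prompt: str,
    agent_response: str,
    claims: List[str],
    call_llm: Callable[[str], str],  # assumed: takes a prompt string, returns the judge's text reply
) -> List[Dict[str, str]]:
    """Evaluate each rubric claim independently against the agent's response."""
    verdicts = []
    for claim in claims:
        judge_prompt = (
            f"Task given to the agent:\n{task_prompt}\n\n"
            f"Agent response:\n{agent_response}\n\n"
            f"Claim to verify: {claim}\n\n"
            "Answer with exactly one word: pass, fail, or partial."
        )
        reply = call_llm(judge_prompt).strip().lower()
        # Treat anything unrecognized as a failure rather than guessing.
        verdict = reply if reply in {"pass", "fail", "partial"} else "fail"
        verdicts.append({"claim": claim, "verdict": verdict})
    return verdicts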

Judge Prompt Structure

The judge receives:
  • The original task prompt
  • The agent’s response/trajectory
  • Each rubric claim to evaluate
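
A minimal template combining those three inputs might look like the following; the wording is illustrative, not the exact prompt used by the platform:

JUDGE_PROMPT_TEMPLATE = """\
Task given to the agent:
{task_prompt}

Agent response / trajectory:
{agent_response}

Claim to verify:
{claim}

Does the response satisfy the claim? Answer "pass", "fail", or "partial",
then give a one-sentence justification.
"""

# Filled in once per claim, e.g.:
# JUDGE_PROMPT_TEMPLATE.format(task_prompt=task, agent_response=reply, claim=claim)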

Verdict Types

| Verdict | Meaning |
| --- | --- |
| Pass | Claim is satisfied by the response |
| Fail | Claim is not satisfied |
| Partial | Claim is partially satisfied (where applicable) |

Scoring

Calculation

Final score is typically:
  • 1.0 if all claims pass
  • 0.0 if any claim fails
  • Or proportional based on passed/total claims
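
Both modes can be sketched as below, assuming verdicts shaped like the per-claim objects in the response structure that follows; treating “partial” as not passed is a simplification, and the strict/proportional choice depends on how the task is configured:

def aggregate_score(verdicts: list[dict], proportional: bool = False) -> float:
    """Combine per-claim verdicts into a final score."""
    total = len(verdicts)
    if total == 0:
        return 0.0
    passed = sum(1 for v in verdicts if v["verdict"] == "pass")
    if proportional:
        # Proportional mode: fraction of claims that passed.
        return passed / total
    # Strict mode: 1.0 only if every claim passed.
    return 1.0 if passed == total else 0.0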

Response Structure

{
  "score": 1.0,
  "claims": [
    { "claim": "Agent sent email to correct recipient", "verdict": "pass" },
    { "claim": "Email subject matches requirement", "verdict": "pass" }
  ],
  "message": "All claims verified"
}
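
One way to model this response in code is with typed dictionaries; the field names are taken from the example above, but the exact schema may vary by environment:

from typing import List, Literal, TypedDict

class ClaimVerdict(TypedDict):
    claim: str
    verdict: Literal["pass", "fail", "partial"]

class VerifierResponse(TypedDict):
    score: float                # 0.0-1.0, as described under Scoring
    claims: List[ClaimVerdict]  # one verdict per rubric claim
    message: str                # human-readable summary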

Task Component Structure

MCP tasks include these verification components:

| Component | Purpose |
| --- | --- |
| ENABLED_TOOLS | Tools available to the agent |
| PROMPT | Task instructions |
| TRAJECTORY | Expected sequence of tool calls (ground truth) |
| GTFA_CLAIMS | Required factual claims for verification |

Best Practices

Write specific, verifiable claims:
  • ❌ “Response is correct”
  • ✅ “Response includes the customer’s order total of $156.78”
  • ✅ “Response mentions the 3:00 PM meeting”
  • ✅ “Response does NOT include cancelled events”

For complex tasks, define claims at different granularity levels to capture partial success.

Verify your claims work correctly with:
  • Correct responses (should pass)
  • Incorrect responses (should fail)
  • Partially correct responses (should behave as expected)
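
A lightweight way to run this kind of spot check is sketched below; `verify` is a stand-in for whatever evaluation entry point you use (for example, the judging and scoring sketches above), not a platform API, and the test responses are ones you write yourself:

from typing import Callable, List

def spot_check_claims(
    verify: Callable[[str, List[str]], dict],  # assumed: (agent_response, claims) -> {"score": float, ...}
    claims: List[str],
    good_response: str,
    bad_response: str,
) -> None:
    """Sanity-check a rubric: a known-good response should pass, a known-bad one should not."""
    assert verify(good_response, claims)["score"] == 1.0, "correct response should satisfy every claim"
    assert verify(bad_response, claims)["score"] < 1.0, "incorrect response should fail at least one claim"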

Example Task Structure

{
  "ENABLED_TOOLS": ["calendar_add_event", "email_send", "reminder_create"],
  "PROMPT": "Schedule a meeting with John for Friday at 2pm and send him an email confirmation",
  "TRAJECTORY": [
    { "tool": "calendar_add_event", "args": { "title": "Meeting with John", "time": "Friday 2pm" } },
    { "tool": "email_send", "args": { "to": "john@example.com", "subject": "Meeting Confirmation" } }
  ],
  "GTFA_CLAIMS": [
    "Agent created a calendar event titled 'Meeting with John'",
    "Event is scheduled for Friday at 2:00 PM",
    "Agent sent email to john@example.com",
    "Email subject contains 'Meeting' or 'Confirmation'"
  ]
}
