Access
MCP verifiers are only accessible through the Scale Gymnasium Web UI at gym.scale.com, not via Docker container APIs.
Evaluation Approach
| Component | Purpose |
|---|---|
| Rubric Claims | Expected actions/information the agent should produce |
| LLM Judge | Your model evaluates the agent’s response |
| Verdicts | Pass/fail determination per claim |
GTFA Claims
Ground Truth Factual Assertions (GTFA) define what must be present in the agent’s response.
What to Include
- Required information: Facts that must appear in the response
- Expected actions: Tool calls the agent should have made
- Factual accuracy: Correct values and calculations
- Format requirements: Structure of the expected output
Example Claims
| Task Type | Example Claim |
|---|---|
| Information retrieval | “Response includes the meeting time of 2:00 PM” |
| Calculation | “Response shows total of $1,234.56” |
| Tool use | “Agent called send_email with correct recipient” |
| Multi-step | “Agent created calendar event AND sent notification” |
LLM Judge
The LLM Judge evaluates the agent’s response against each rubric claim.
How It Works
- Agent completes task and produces response
- Response is sent to LLM Judge with rubric claims
- Judge evaluates each claim independently
- Verdicts are aggregated into final score
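The flow above can be sketched as a loop that asks the judge about one claim at a time. This is a minimal illustration, not the platform's implementation: the `Judge` signature, `evaluate_claims`, and the toy `keyword_judge` (a substring check standing in for a real LLM call) are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: (task_prompt, agent_response, claim) -> verdict string.
Judge = Callable[[str, str, str], str]


def evaluate_claims(
    task_prompt: str, agent_response: str, claims: List[str], judge: Judge
) -> Dict[str, str]:
    """Ask the judge about each claim separately and collect the verdicts."""
    verdicts: Dict[str, str] = {}
    for claim in claims:
        # One call per claim keeps the evaluations independent of each other.
        verdicts[claim] = judge(task_prompt, agent_response, claim)
    return verdicts


def keyword_judge(task_prompt: str, agent_response: str, claim: str) -> str:
    """Toy stand-in for an LLM call: passes when the claim text appears in the response."""
    return "pass" if claim.lower() in agent_response.lower() else "fail"


verdicts = evaluate_claims(
    "When is the meeting?",
    "The meeting is at 2:00 PM in Room 4.",
    ["2:00 PM", "Room 7"],
    keyword_judge,
)
print(verdicts)  # {'2:00 PM': 'pass', 'Room 7': 'fail'}
```

Swapping `keyword_judge` for a function that calls your own model leaves the rest of the loop unchanged.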
Judge Prompt Structure
The judge receives:
- The original task prompt
- The agent’s response/trajectory
- Each rubric claim to evaluate
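As a rough illustration of how these three inputs might be combined into one judge request per claim, consider the sketch below. The actual prompt template is not documented here, so the section labels, the PASS/FAIL instruction, and the `build_judge_prompt` name are assumptions.

```python
def build_judge_prompt(task_prompt: str, trajectory: str, claim: str) -> str:
    """Assemble one judge request from the task prompt, the agent output, and a claim."""
    # The section labels and the final instruction are illustrative assumptions.
    return (
        "## Task prompt\n"
        f"{task_prompt}\n\n"
        "## Agent response / trajectory\n"
        f"{trajectory}\n\n"
        "## Claim to evaluate\n"
        f"{claim}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )


prompt = build_judge_prompt(
    "Email Dana the meeting time.",
    "send_email(to='dana@example.com', body='Meeting is at 2:00 PM')",
    "Agent called send_email with correct recipient",
)
print(prompt)
```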
Verdict Types
| Verdict | Meaning |
|---|---|
| Pass | Claim is satisfied by the response |
| Fail | Claim is not satisfied |
| Partial | Claim is partially satisfied (where applicable) |
Scoring
Calculation
Final score is typically:
- 1.0 if all claims pass
- 0.0 if any claim fails
- Or proportional based on passed/total claims
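The two scoring modes can be written out as a short sketch. The function names are hypothetical, and two choices here are assumptions rather than documented behavior: a "partial" verdict is counted as not passed, and an empty verdict list scores 0.0.

```python
from typing import List


def strict_score(verdicts: List[str]) -> float:
    """All-or-nothing: 1.0 only when every claim passes, otherwise 0.0."""
    return 1.0 if verdicts and all(v == "pass" for v in verdicts) else 0.0


def proportional_score(verdicts: List[str]) -> float:
    """Fraction of claims that passed (passed / total)."""
    if not verdicts:
        return 0.0
    return sum(v == "pass" for v in verdicts) / len(verdicts)


verdicts = ["pass", "pass", "fail", "pass"]
print(strict_score(verdicts))        # 0.0
print(proportional_score(verdicts))  # 0.75
```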
Response Structure
Task Component Structure
MCP tasks include these verification components:
| Component | Purpose |
|---|---|
| ENABLED_TOOLS | Tools available to the agent |
| PROMPT | Task instructions |
| TRAJECTORY | Expected sequence of tool calls (ground truth) |
| GTFA_CLAIMS | Required factual claims for verification |
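For illustration, the four components could be modeled as a small record. The on-disk task format is not specified here, so the `McpTask` class, its field types, the `undeclared_tools` sanity check, and the sample values (including the `list_events` and `create_event` tool names) are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class McpTask:
    enabled_tools: List[str]  # ENABLED_TOOLS: tools available to the agent
    prompt: str               # PROMPT: task instructions
    trajectory: List[str]     # TRAJECTORY: expected tool calls, in order
    gtfa_claims: List[str]    # GTFA_CLAIMS: claims the judge will check


def undeclared_tools(task: McpTask) -> List[str]:
    """Return tools the expected trajectory uses that the agent was never given."""
    return [name for name in task.trajectory if name not in task.enabled_tools]


task = McpTask(
    enabled_tools=["list_events", "create_event", "send_email"],
    prompt="Schedule a 3:00 PM sync and notify the team.",
    trajectory=["create_event", "send_email"],
    gtfa_claims=[
        "Agent created calendar event AND sent notification",
        "Response mentions the 3:00 PM meeting",
    ],
)
print(undeclared_tools(task))  # []
```

A non-empty result from `undeclared_tools` would suggest the ground-truth trajectory cannot be reproduced with the enabled tool set.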
Best Practices
Be specific in claims
- ❌ “Response is correct”
- ✅ “Response includes the customer’s order total of $156.78”
Include both positive and negative criteria
- ✅ “Response mentions the 3:00 PM meeting”
- ✅ “Response does NOT include cancelled events”
Consider partial completion
For complex tasks, define claims at different granularity levels to capture partial success.
Test claims with edge cases
Verify your claims work correctly with:
- Correct responses (should pass)
- Incorrect responses (should fail)
- Partially correct responses (should behave as expected)
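One way to organize such a check is a small table of sample responses paired with the verdict you expect for a claim. The sketch below is only an offline harness: `check_claim` is a hypothetical stub that looks for the literal amount, where in practice you would call your judge model, and the sample responses are invented.

```python
from typing import List, Tuple

CLAIM = "Response includes the customer's order total of $156.78"


def check_claim(response: str) -> str:
    """Hypothetical stub for the judge: pass only if the exact total appears."""
    return "pass" if "$156.78" in response else "fail"


# (sample response, expected verdict) pairs covering the three cases above.
CASES: List[Tuple[str, str]] = [
    ("Your order total is $156.78.", "pass"),  # correct
    ("Your order total is $165.78.", "fail"),  # incorrect value
    ("Your order has shipped.", "fail"),       # partial: right order, total missing
]


def find_mismatches(cases: List[Tuple[str, str]]) -> List[str]:
    """Return the responses whose verdict differs from the expectation."""
    return [resp for resp, expected in cases if check_claim(resp) != expected]


mismatches = find_mismatches(CASES)
print(mismatches)  # []
```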