MCP Agent Evaluation¶
This scenario evaluates tool-using agents that interact with MCP (Model Context Protocol) servers. You configure MCP tools on the answering model, set agent middleware parameters, and handle traces that include tool calls alongside natural language responses.
What you'll learn:
- Configure MCP tools for the answering model
- Set agent middleware parameters (limits, retries, summarization)
- Handle traces that contain tool calls
- Understand recursion limits and how to adjust them
- Compare adapter options for MCP evaluation
When to Use MCP¶
MCP evaluation is needed when the answering model should use external tools:
- Web search — fact-checking, current events, real-time data
- Database queries — SQL execution, knowledge base lookup
- API access — external service integration
- File operations — reading documents, code analysis
Configure MCP Tools¶
Attach MCP servers to the answering model via ModelConfig:
```python
from karenina import Benchmark
from karenina.schemas.config import ModelConfig, VerificationConfig

benchmark = Benchmark.load("benchmark.jsonld")

config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-with-search",
            model_name="claude-sonnet-4-20250514",
            model_provider="anthropic",
            interface="claude_agent_sdk",
            # MCP server configuration
            mcp_urls_dict={"brave_search": "http://localhost:3001/sse"},
            mcp_tool_filter=["brave_web_search", "brave_local_search"],
        )
    ],
    parsing_models=[
        ModelConfig(
            id="haiku-parser",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    evaluation_mode="template_only",
)

print(f"MCP servers: {config.answering_models[0].mcp_urls_dict}")
print(f"Tool filter: {config.answering_models[0].mcp_tool_filter}")
```
| Field | Description |
|---|---|
| `mcp_urls_dict` | Map of server name → SSE endpoint URL |
| `mcp_tool_filter` | Restrict which tools the model can use (empty = all) |
| `mcp_tool_description_overrides` | Override tool descriptions for better prompting |
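The filter semantics (an empty `mcp_tool_filter` exposes every tool; a non-empty one is an allowlist) can be illustrated with a small standalone sketch. This is not Karenina's implementation, and the tool names are hypothetical:

```python
def apply_tool_filter(available_tools, tool_filter):
    """Return the tools the model may use: an empty filter means all tools."""
    if not tool_filter:
        return list(available_tools)
    return [t for t in available_tools if t in tool_filter]

# Hypothetical tool names exposed by a Brave Search MCP server
available = ["brave_web_search", "brave_local_search", "brave_news_search"]

print(apply_tool_filter(available, []))                    # empty filter: every tool
print(apply_tool_filter(available, ["brave_web_search"]))  # allowlisted tool only
```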
Agent Middleware¶
AgentMiddlewareConfig controls agent execution behavior — limits, retries, and summarization:
```python
from karenina.schemas.config import ModelConfig
from karenina.schemas.config.models import AgentLimitConfig, AgentMiddlewareConfig

model = ModelConfig(
    id="claude-with-search",
    model_name="claude-sonnet-4-20250514",
    model_provider="anthropic",
    interface="claude_agent_sdk",
    mcp_urls_dict={"brave_search": "http://localhost:3001/sse"},
    # Agent middleware settings
    agent_middleware=AgentMiddlewareConfig(
        limits=AgentLimitConfig(
            model_call_limit=15,
            tool_call_limit=30,
        ),
    ),
)

print(f"Model call limit: {model.agent_middleware.limits.model_call_limit}")
print(f"Tool call limit: {model.agent_middleware.limits.tool_call_limit}")
```
| Setting | Default | Description |
|---|---|---|
| `limits.model_call_limit` | `25` | Maximum LLM calls per agent invocation |
| `limits.tool_call_limit` | `50` | Maximum tool calls per agent invocation |
| `limits.exit_behavior` | `"end"` | Behavior when limit reached: `"end"` or `"continue"` |
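The limit semantics are easy to misread, so here is a minimal standalone simulation of an agent loop that stops once either budget is exhausted (with `"end"` semantics). The loop is illustrative only, not Karenina's middleware:

```python
def run_agent_loop(steps, model_call_limit=25, tool_call_limit=50):
    """Simulate an agent trace where each step is 'model' or 'tool'.
    Returns (executed_steps, limit_hit), mimicking exit_behavior='end'."""
    model_calls = tool_calls = 0
    executed = []
    for step in steps:
        if step == "model":
            if model_calls >= model_call_limit:
                return executed, "model_call_limit"
            model_calls += 1
        else:
            if tool_calls >= tool_call_limit:
                return executed, "tool_call_limit"
            tool_calls += 1
        executed.append(step)
    return executed, None

# Three tool calls against a tool budget of 2: the loop ends early
trace = ["model", "tool", "tool", "tool", "model"]
print(run_agent_loop(trace, model_call_limit=15, tool_call_limit=2))
```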
Run MCP-Enabled Verification¶
```python
results = benchmark.run_verification(config)
print(f"Total results: {len(results)}")
```
Trace Handling¶
MCP agents produce multi-turn traces (human → AI → tool → AI → ...). By default, only the final AI message is passed to template evaluation, while rubric evaluation uses the full trace. You can configure both independently:
```python
config_full_trace = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-with-search",
            model_name="claude-sonnet-4-20250514",
            model_provider="anthropic",
            interface="claude_agent_sdk",
            mcp_urls_dict={"brave_search": "http://localhost:3001/sse"},
        )
    ],
    parsing_models=[
        ModelConfig(
            id="haiku-parser",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    # Pass full trace to evaluation (includes tool calls)
    use_full_trace_for_template=True,
    use_full_trace_for_rubric=True,
)

print(f"Full trace for template: {config_full_trace.use_full_trace_for_template}")
print(f"Full trace for rubric: {config_full_trace.use_full_trace_for_rubric}")
```
| Setting | Default | When to change |
|---|---|---|
| `use_full_trace_for_template` | `False` | Set `True` if template needs to see tool call context |
| `use_full_trace_for_rubric` | `True` | Set `False` if rubric only needs the final answer |
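The difference between the two settings can be sketched against a toy trace. The `(role, text)` message shape below is illustrative, not Karenina's internal representation:

```python
def evaluation_input(trace, use_full_trace):
    """Return what the evaluator would see: the whole trace, or only the final AI message."""
    if use_full_trace:
        return "\n".join(f"[{role}] {text}" for role, text in trace)
    # Default for templates: last AI message only
    return next(text for role, text in reversed(trace) if role == "ai")

trace = [
    ("human", "What is the tallest mountain?"),
    ("ai", "Let me search."),
    ("tool", "brave_web_search -> Mount Everest, 8,849 m"),
    ("ai", "The tallest mountain is Mount Everest."),
]

print(evaluation_input(trace, use_full_trace=False))  # final answer only
print(evaluation_input(trace, use_full_trace=True))   # full multi-turn context
```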
Inspect what was passed to evaluation:
```python
for result in results[:2]:
    print(f"Q: {result.metadata.question_text[:40]}")
    print(f"  Used full trace: {result.used_full_trace}")
    if result.evaluation_input:
        print(f"  Evaluation input: {result.evaluation_input[:60]}...")
```
Check whether any runs hit the recursion limit, and inspect the agent metrics for those that did:

```python
for result in results:
    t = result.template
    if t and t.recursion_limit_reached:
        print(f"RECURSION LIMIT: {result.metadata.question_text[:50]}")
        if t.agent_metrics:
            print(f"  Iterations: {t.agent_metrics['iterations']}")
            print(f"  Tool calls: {t.agent_metrics['tool_calls']}")
            print(f"  Tools used: {t.agent_metrics['tools_used']}")
```
Agent Metrics¶
All MCP results include agent execution metrics:
```python
for result in results[:3]:
    t = result.template
    if t and t.agent_metrics:
        m = t.agent_metrics
        print(f"Q: {result.metadata.question_text[:40]}")
        print(f"  Iterations: {m['iterations']}, Tool calls: {m['tool_calls']}")
        print(f"  Tools: {m['tools_used']}")
        if m.get('suspect_failed_tool_calls', 0) > 0:
            print(f"  Suspect failures: {m['suspect_failed_tool_calls']} ({m['suspect_failed_tools']})")
```
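Assuming each result carries an `agent_metrics` dict shaped like the one above, per-result metrics can be rolled up into benchmark-level totals. The helper and the sample data below are illustrative, not part of the Karenina API:

```python
from collections import Counter

def summarize_metrics(metrics_list):
    """Aggregate per-result agent metrics into totals and tool-usage counts."""
    totals = {"iterations": 0, "tool_calls": 0, "suspect_failed_tool_calls": 0}
    tool_usage = Counter()
    for m in metrics_list:
        totals["iterations"] += m["iterations"]
        totals["tool_calls"] += m["tool_calls"]
        totals["suspect_failed_tool_calls"] += m.get("suspect_failed_tool_calls", 0)
        tool_usage.update(m["tools_used"])
    return totals, tool_usage

# Made-up metrics for two results
sample = [
    {"iterations": 3, "tool_calls": 2, "tools_used": ["brave_web_search"]},
    {"iterations": 5, "tool_calls": 4, "suspect_failed_tool_calls": 1,
     "tools_used": ["brave_web_search", "brave_local_search"]},
]
totals, usage = summarize_metrics(sample)
print(totals)
print(usage.most_common())
```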
Adapter Comparison¶
Three adapters support MCP/agent workflows:
| Adapter | Interface | MCP Support | Best For |
|---|---|---|---|
| LangChain Agent | `langchain` | Via LangChain tools | OpenAI/Google models with tools |
| Claude Agent SDK | `claude_agent_sdk` | Native MCP | Claude models with MCP servers |
| Claude Tool | `claude_tool` | Via `tool_use` | Claude models with structured tool calling |
Choose the adapter via the `interface` field on `ModelConfig`. The Claude Agent SDK provides the most direct MCP integration.
CLI Equivalent¶
```bash
karenina verify benchmark.jsonld --preset mcp-preset.json \
    --interface claude_agent_sdk
```

The preset file should contain `mcp_urls_dict` and tool configuration. The CLI does not support inline MCP configuration, so a preset file is required.
Related Pages¶
- Basic Verification — Non-MCP verification walkthrough
- Deep Judgment — Add deep judgment to MCP results
- Adapters — Full adapter comparison
- VerificationConfig Reference — Trace handling fields