MCP Agent Evaluation¶
This scenario evaluates tool-using agents that interact with MCP (Model Context Protocol) servers. You configure MCP tools on the answering model, set agent middleware parameters, and handle traces that include tool calls alongside natural language responses.
What you'll learn:
- Configure MCP tools for the answering model
- Set agent middleware parameters (limits, retries, summarization)
- Handle traces that contain tool calls
- Understand recursion limits and how to adjust them
- Compare adapter options for MCP evaluation
When to Use MCP¶
MCP evaluation is needed when the answering model should use external tools:
- Web search — fact-checking, current events, real-time data
- Database queries — SQL execution, knowledge base lookup
- API access — external service integration
- File operations — reading documents, code analysis
Configure MCP Tools¶
Attach MCP servers to the answering model via ModelConfig:
```python
from karenina import Benchmark
from karenina.schemas.config import ModelConfig, VerificationConfig

benchmark = Benchmark.load("benchmark.jsonld")

config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-with-search",
            model_name="claude-sonnet-4-20250514",
            model_provider="anthropic",
            interface="claude_agent_sdk",
            # MCP server configuration
            mcp_urls_dict={"brave_search": "http://localhost:3001/sse"},
            mcp_tool_filter=["brave_web_search", "brave_local_search"],
        )
    ],
    parsing_models=[
        ModelConfig(
            id="haiku-parser",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    evaluation_mode="template_only",
)

print(f"MCP servers: {config.answering_models[0].mcp_urls_dict}")
print(f"Tool filter: {config.answering_models[0].mcp_tool_filter}")
```
| Field | Description |
|---|---|
| `mcp_urls_dict` | Map of server name → SSE endpoint URL |
| `mcp_tool_filter` | Restrict which tools the model can use (empty = all) |
| `mcp_tool_description_overrides` | Override tool descriptions for better prompting |
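The filter semantics (an empty `mcp_tool_filter` exposes every tool; a non-empty one is an allowlist) can be illustrated with a small standalone sketch. This is not Karenina's implementation, and the tool names are hypothetical:

```python
def apply_tool_filter(available_tools, tool_filter):
    """Return the tools the model may use: an empty filter means all tools."""
    if not tool_filter:
        return list(available_tools)
    return [t for t in available_tools if t in tool_filter]

# Hypothetical tool names exposed by a Brave Search MCP server
available = ["brave_web_search", "brave_local_search", "brave_news_search"]

print(apply_tool_filter(available, []))                    # empty filter: every tool
print(apply_tool_filter(available, ["brave_web_search"]))  # allowlisted tool only
```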
Agent Middleware¶
AgentMiddlewareConfig controls agent execution behavior — limits, retries, and summarization:
```python
from karenina.schemas.config import ModelConfig
from karenina.schemas.config.models import AgentLimitConfig, AgentMiddlewareConfig

model = ModelConfig(
    id="claude-with-search",
    model_name="claude-sonnet-4-20250514",
    model_provider="anthropic",
    interface="claude_agent_sdk",
    mcp_urls_dict={"brave_search": "http://localhost:3001/sse"},
    # Agent middleware settings
    agent_middleware=AgentMiddlewareConfig(
        limits=AgentLimitConfig(
            model_call_limit=15,
            tool_call_limit=30,
        ),
    ),
)

print(f"Model call limit: {model.agent_middleware.limits.model_call_limit}")
print(f"Tool call limit: {model.agent_middleware.limits.tool_call_limit}")
```
| Setting | Default | Description |
|---|---|---|
| `limits.model_call_limit` | `25` | Maximum LLM calls per agent invocation |
| `limits.tool_call_limit` | `50` | Maximum tool calls per agent invocation |
| `limits.exit_behavior` | `"end"` | Behavior when limit reached: `"end"` or `"continue"` |
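The limit semantics are easy to misread, so here is a minimal standalone simulation of an agent loop that stops once either budget is exhausted (with `"end"` semantics). The loop is illustrative only, not Karenina's middleware:

```python
def run_agent_loop(steps, model_call_limit=25, tool_call_limit=50):
    """Simulate an agent trace where each step is 'model' or 'tool'.
    Returns (executed_steps, limit_hit), mimicking exit_behavior='end'."""
    model_calls = tool_calls = 0
    executed = []
    for step in steps:
        if step == "model":
            if model_calls >= model_call_limit:
                return executed, "model_call_limit"
            model_calls += 1
        else:
            if tool_calls >= tool_call_limit:
                return executed, "tool_call_limit"
            tool_calls += 1
        executed.append(step)
    return executed, None

# Three tool calls against a tool budget of 2: the loop ends early
trace = ["model", "tool", "tool", "tool", "model"]
print(run_agent_loop(trace, model_call_limit=15, tool_call_limit=2))
```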
Run MCP-Enabled Verification¶
```python
results = benchmark.run_verification(config)
print(f"Total results: {len(results)}")
```
Trace Handling¶
MCP agents produce multi-turn traces (human → AI → tool → AI → ...). By default, only the final AI message is passed to template evaluation, while rubric evaluation uses the full trace. You can configure both independently:
```python
config_full_trace = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-with-search",
            model_name="claude-sonnet-4-20250514",
            model_provider="anthropic",
            interface="claude_agent_sdk",
            mcp_urls_dict={"brave_search": "http://localhost:3001/sse"},
        )
    ],
    parsing_models=[
        ModelConfig(
            id="haiku-parser",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    # Pass full trace to evaluation (includes tool calls)
    use_full_trace_for_template=True,
    use_full_trace_for_rubric=True,
)

print(f"Full trace for template: {config_full_trace.use_full_trace_for_template}")
print(f"Full trace for rubric: {config_full_trace.use_full_trace_for_rubric}")
```
| Setting | Default | When to change |
|---|---|---|
| `use_full_trace_for_template` | `False` | Set `True` if template needs to see tool call context |
| `use_full_trace_for_rubric` | `True` | Set `False` if rubric only needs the final answer |
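The difference between the two settings can be sketched against a toy trace. The `(role, text)` message shape below is illustrative, not Karenina's internal representation:

```python
def evaluation_input(trace, use_full_trace):
    """Return what the evaluator would see: the whole trace, or only the final AI message."""
    if use_full_trace:
        return "\n".join(f"[{role}] {text}" for role, text in trace)
    # Default for templates: last AI message only
    return next(text for role, text in reversed(trace) if role == "ai")

trace = [
    ("human", "What is the tallest mountain?"),
    ("ai", "Let me search."),
    ("tool", "brave_web_search -> Mount Everest, 8,849 m"),
    ("ai", "The tallest mountain is Mount Everest."),
]

print(evaluation_input(trace, use_full_trace=False))  # final answer only
print(evaluation_input(trace, use_full_trace=True))   # full multi-turn context
```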
Inspect what was passed to evaluation:
```python
for result in results[:2]:
    print(f"Q: {result.metadata.question_text[:40]}")
    print(f"  Used full trace: {result.used_full_trace}")
    if result.evaluation_input:
        print(f"  Evaluation input: {result.evaluation_input[:60]}...")
```
Check whether any runs hit the recursion limit, and inspect the agent metrics for those that did:

```python
for result in results:
    t = result.template
    if t and t.recursion_limit_reached:
        print(f"RECURSION LIMIT: {result.metadata.question_text[:50]}")
        if t.agent_metrics:
            print(f"  Iterations: {t.agent_metrics['iterations']}")
            print(f"  Tool calls: {t.agent_metrics['tool_calls']}")
            print(f"  Tools used: {t.agent_metrics['tools_used']}")
```
Agent Metrics¶
All MCP results include agent execution metrics:
```python
for result in results[:3]:
    t = result.template
    if t and t.agent_metrics:
        m = t.agent_metrics
        print(f"Q: {result.metadata.question_text[:40]}")
        print(f"  Iterations: {m['iterations']}, Tool calls: {m['tool_calls']}")
        print(f"  Tools: {m['tools_used']}")
        if m.get('suspect_failed_tool_calls', 0) > 0:
            print(f"  Suspect failures: {m['suspect_failed_tool_calls']} ({m['suspect_failed_tools']})")
```
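Assuming each result carries an `agent_metrics` dict shaped like the one above, per-result metrics can be rolled up into benchmark-level totals. The helper and the sample data below are illustrative, not part of the Karenina API:

```python
from collections import Counter

def summarize_metrics(metrics_list):
    """Aggregate per-result agent metrics into totals and tool-usage counts."""
    totals = {"iterations": 0, "tool_calls": 0, "suspect_failed_tool_calls": 0}
    tool_usage = Counter()
    for m in metrics_list:
        totals["iterations"] += m["iterations"]
        totals["tool_calls"] += m["tool_calls"]
        totals["suspect_failed_tool_calls"] += m.get("suspect_failed_tool_calls", 0)
        tool_usage.update(m["tools_used"])
    return totals, tool_usage

# Made-up metrics for two results
sample = [
    {"iterations": 3, "tool_calls": 2, "tools_used": ["brave_web_search"]},
    {"iterations": 5, "tool_calls": 4, "suspect_failed_tool_calls": 1,
     "tools_used": ["brave_web_search", "brave_local_search"]},
]
totals, usage = summarize_metrics(sample)
print(totals)
print(usage.most_common())
```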
Adapter Comparison¶
Three adapters support MCP/agent workflows:
| Adapter | Interface | MCP Support | Best For |
|---|---|---|---|
| LangChain Agent | `langchain` | Via LangChain tools | OpenAI/Google models with tools |
| Claude Agent SDK | `claude_agent_sdk` | Native MCP | Claude models with MCP servers |
| Claude Tool | `claude_tool` | Via `tool_use` | Claude models with structured tool calling |
Choose the adapter via the `interface` field on `ModelConfig`. The Claude Agent SDK provides the most direct MCP integration.
CLI Equivalent¶
```bash
karenina verify benchmark.jsonld --preset mcp-preset.json \
    --interface claude_agent_sdk
```

The preset file should contain `mcp_urls_dict` and tool configuration. The CLI does not support inline MCP configuration, so a preset file is required.
Related Pages¶
- Basic Verification — Non-MCP verification walkthrough
- Deep Judgment — Add deep judgment to MCP results
- Adapters — Full adapter comparison
- VerificationConfig Reference — Trace handling fields