# MCP Integration
MCP (Model Context Protocol) transforms the answering model from a single-shot text generator into a multi-turn agent with tool access. Instead of relying solely on training data, the model can call external tools (web search, database queries, API endpoints, file operations) to gather information before producing its final response.
MCP integration is purely an answering model concern. It changes how the response is generated, not how the response is evaluated. The verification pipeline, answer templates, and rubrics work identically regardless of whether the response came from a simple LLM call or a multi-turn agent session.
## 1. Why MCP Exists in Karenina
Standard verification sends a question to an LLM, receives a single response, and evaluates it. This works when the question can be answered from training data alone. Some evaluation scenarios, however, require the model to access external information:
- Current information: drug approval statuses, regulatory changes, recent publications
- Structured databases: genomics databases, clinical trial registries, knowledge graphs
- Custom tools: domain-specific calculators, internal APIs, file readers
- Tool-use evaluation: benchmarks that test the model's ability to choose and use tools effectively
MCP provides a standardized protocol for giving the model tool access. You configure MCP servers and Karenina handles connection, tool discovery, the agent loop, and trace capture automatically.
## 2. When to Use MCP
| Scenario | Recommendation | Why |
|---|---|---|
| Questions answerable from training data | Simple LLM (no MCP) | Faster, cheaper, more reproducible |
| Questions requiring current or real-time information | MCP with appropriate servers | Training data has a cutoff date |
| Questions needing data from specific databases | MCP with database tools | Model cannot access databases without tools |
| Benchmark tests tool-use ability itself | MCP required | The tool usage is the capability being evaluated |
| Reproducibility is the top priority | Simple LLM or manual interface | MCP results vary with external state |
| Cost and latency must be minimized | Simple LLM | MCP adds multiple LLM calls per question |
!!! tip "Litmus test"
    If the question includes phrases like "current status of," "latest data on," or "look up in [database]," MCP is likely appropriate. If the question asks about established knowledge ("What is the mechanism of action of aspirin?"), simple LLM is sufficient and more reproducible.
MCP verification costs more and takes longer than simple LLM verification. Each question triggers a multi-turn agent loop with multiple LLM calls and tool invocations. Use MCP only when the evaluation genuinely requires external information or when tool use is itself the capability being tested.
## 3. Architecture

MCP integration sits between the question and the verification pipeline, replacing the single LLM call with an agent loop:

```
┌─────────────────┐
│   MCP Servers   │
│ (external tools)│
└────────┬────────┘
         │ tool calls & results
         ▼
┌──────────┐   question   ┌─────────────┐  response + trace   ┌────────────┐
│ Karenina │ ────────────►│  AgentPort  │ ───────────────────►│ Evaluation │
│          │              │  (adapter)  │                     │  Pipeline  │
└──────────┘              └─────────────┘                     └────────────┘
                                │
                           multi-turn
                           agent loop
```
Three components work together:
| Component | Role | Configured via |
|---|---|---|
| MCP Servers | External processes that expose tools via the MCP protocol (HTTP or stdio transport) | `ModelConfig.mcp_urls_dict` |
| AgentPort adapter | Connects to servers, discovers tools, runs the multi-turn agent loop, captures the trace | `ModelConfig.interface` (selects the adapter) |
| Agent middleware | Retry logic, execution limits, conversation summarization, prompt caching | `ModelConfig.agent_middleware` |
The agent loop iterates: the model receives the question plus available tools, generates a response, optionally invokes tools, receives tool results, and continues until it produces a final answer or hits a configured limit.
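Conceptually, that loop fits in a few lines of plain Python. The sketch below is illustrative only: `run_agent_loop`, `stub_llm`, and the message shapes are hypothetical stand-ins, not Karenina's internals, but they show the control flow and where the call limits bite.

```python
def run_agent_loop(llm, tools, question, model_call_limit=25, tool_call_limit=50):
    """Hypothetical sketch of a tool-calling agent loop with execution limits.

    Returns (final_answer, message_trace, limit_reached).
    """
    messages = [{"role": "user", "content": question}]
    model_calls = tool_calls = 0
    while model_calls < model_call_limit:
        reply = llm(messages, tools)           # one model call
        model_calls += 1
        messages.append(reply)
        if not reply.get("tool_calls"):        # no tools requested: final answer
            return reply["content"], messages, False
        for call in reply["tool_calls"]:       # execute each requested tool
            if tool_calls >= tool_call_limit:
                return reply["content"], messages, True
            result = tools[call["name"]](**call["args"])
            tool_calls += 1
            messages.append({"role": "tool", "name": call["name"], "content": result})
    return messages[-1]["content"], messages, True  # model-call limit reached


# Demo with a stub model that calls one tool, then answers.
def stub_llm(messages, tools):
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "final answer", "tool_calls": []}
    return {"role": "assistant", "content": "",
            "tool_calls": [{"name": "search", "args": {"query": "demo"}}]}

answer, trace, limit_hit = run_agent_loop(
    stub_llm, {"search": lambda query: f"results for {query}"}, "demo question"
)
print(answer)     # final answer
print(limit_hit)  # False
```

Note that the trace accumulates every assistant and tool message; this is the raw material that later becomes `trace_messages` (see Section 7).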
### 3.1 The Abstraction Boundary
MCP changes the generation side of the pipeline. Everything downstream remains the same:
| Pipeline concern | With MCP | Without MCP |
|---|---|---|
| Response generation | Multi-turn agent loop with tool calls | Single LLM call |
| Trace capture | Full conversation (messages + tool calls + results) | Single response |
| Template parsing | Identical: Judge LLM parses response into schema | Same |
| Template verification | Identical: `verify()` checks parsed values | Same |
| Rubric evaluation | Identical: traits assess observable response properties | Same |
The only pipeline-level difference is that two VerificationConfig fields control whether the full agent trace or only the final response is sent to the parsing and evaluation models (see Section 7).
## 4. Configuration

MCP is configured on `ModelConfig`, not on `VerificationConfig`. Each answering model can have its own MCP server configuration:
```python
from karenina.schemas.config.models import ModelConfig

model_config = ModelConfig(
    id="agent-claude",
    model_name="claude-sonnet-4-5",
    model_provider="anthropic",
    interface="langchain",
    mcp_urls_dict={
        "biocontext": "https://mcp.biocontext.ai/mcp/",
        "web_search": "https://search-server.example.com/mcp/",
    },
)

print(f"Agent mode: {'enabled' if model_config.mcp_urls_dict else 'disabled'}")
print(f"MCP servers: {list(model_config.mcp_urls_dict.keys())}")
print(f"Tool filter: {model_config.mcp_tool_filter}")
print(f"Agent timeout: {model_config.agent_timeout}")
```

```
Agent mode: enabled
MCP servers: ['biocontext', 'web_search']
Tool filter: None
Agent timeout: None
```
Setting `mcp_urls_dict` is the trigger that switches from simple LLM invocation to agent-based execution. All other MCP-related fields are optional refinements.
### 4.1 MCP Fields on ModelConfig
| Field | Type | Default | Purpose |
|---|---|---|---|
| `mcp_urls_dict` | `dict[str, str] \| None` | `None` | Map of server names to URLs. Setting this enables MCP agent mode. |
| `mcp_tool_filter` | `list[str] \| None` | `None` | Restrict which discovered tools the agent can use (by tool name). `None` means all tools are available. |
| `mcp_tool_description_overrides` | `dict[str, str] \| None` | `None` | Replace tool descriptions sent to the LLM. Useful for improving tool selection accuracy. |
| `max_context_tokens` | `int \| None` | `None` | Token threshold for triggering conversation summarization. When omitted, the adapter auto-detects from the model's context window. |
| `agent_middleware` | `AgentMiddlewareConfig \| None` | `None` | Controls retry behavior, execution limits, summarization, and prompt caching. Only used when `mcp_urls_dict` is set. |
| `agent_timeout` | `int \| None` | `None` | Timeout in seconds for agent execution. Overrides the default (180 s). Set higher for complex questions with many tool calls. |
## 5. Agent Middleware

`AgentMiddlewareConfig` controls agent execution behavior and only applies when `mcp_urls_dict` is set. It groups five sub-configurations, each with sensible defaults:
```python
from karenina.schemas.config.models import (
    AgentLimitConfig,
    AgentMiddlewareConfig,
    ModelConfig,
    SummarizationConfig,
)

middleware = AgentMiddlewareConfig(
    limits=AgentLimitConfig(
        model_call_limit=40,  # default: 25
        tool_call_limit=80,   # default: 50
    ),
    summarization=SummarizationConfig(
        trigger_fraction=0.7,  # default: 0.8
        keep_messages=30,      # default: 20
    ),
)

model_config = ModelConfig(
    id="agent-claude",
    model_name="claude-sonnet-4-5",
    model_provider="anthropic",
    interface="langchain",
    mcp_urls_dict={"biocontext": "https://mcp.biocontext.ai/mcp/"},
    agent_middleware=middleware,
)

print(f"Model call limit: {middleware.limits.model_call_limit}")
print(f"Tool call limit: {middleware.limits.tool_call_limit}")
print(f"Exit behavior: {middleware.limits.exit_behavior}")
print(f"Summarization: trigger at {middleware.summarization.trigger_fraction:.0%}, keep {middleware.summarization.keep_messages} messages")
print(f"Model retry: {middleware.model_retry.max_retries} retries, on_failure='{middleware.model_retry.on_failure}'")
print(f"Tool retry: {middleware.tool_retry.max_retries} retries, on_failure='{middleware.tool_retry.on_failure}'")
print(f"Prompt caching: enabled={middleware.prompt_caching.enabled}, ttl='{middleware.prompt_caching.ttl}'")
```

```
Model call limit: 40
Tool call limit: 80
Exit behavior: end
Summarization: trigger at 70%, keep 30 messages
Model retry: 2 retries, on_failure='continue'
Tool retry: 3 retries, on_failure='return_message'
Prompt caching: enabled=True, ttl='5m'
```
### 5.1 Sub-Configurations
| Component | Config Class | Key Fields | Defaults |
|---|---|---|---|
| Execution limits | `AgentLimitConfig` | `model_call_limit`, `tool_call_limit`, `exit_behavior` | 25 model calls, 50 tool calls, `"end"` |
| Model retry | `ModelRetryConfig` | `max_retries`, `backoff_factor`, `initial_delay`, `on_failure` | 2 retries, 2.0x backoff, 2.0 s initial delay, `"continue"` |
| Tool retry | `ToolRetryConfig` | `max_retries`, `backoff_factor`, `initial_delay`, `on_failure` | 3 retries, 2.0x backoff, 1.0 s initial delay, `"return_message"` |
| Summarization | `SummarizationConfig` | `enabled`, `trigger_fraction`, `keep_messages` | `True`, 0.8, 20 messages |
| Prompt caching | `PromptCachingConfig` | `enabled`, `ttl`, `unsupported_model_behavior` | `True`, `"5m"`, `"warn"` |
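The retry defaults imply an exponential backoff schedule. Assuming the common convention that the wait before retry attempt *i* is `initial_delay * backoff_factor**i` (an assumption about the semantics, not confirmed from Karenina's source), the default schedules work out as follows:

```python
def backoff_schedule(max_retries, initial_delay, backoff_factor):
    # Delay (in seconds) before each retry attempt, growing geometrically.
    return [initial_delay * backoff_factor**i for i in range(max_retries)]

print(backoff_schedule(3, 1.0, 2.0))  # [1.0, 2.0, 4.0]  tool retry defaults
print(backoff_schedule(2, 2.0, 2.0))  # [2.0, 4.0]       model retry defaults
```

The tool defaults retry more often with shorter initial waits, which fits transient MCP server hiccups; model retries start slower to respect provider rate limits.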
Execution limits prevent runaway agents. When `model_call_limit` or `tool_call_limit` is reached, the `exit_behavior` field determines whether the agent returns a partial response (`"end"`) or blocks further calls but continues (`"continue"`). The pipeline's `RecursionLimitAutoFail` stage (Stage 3) auto-fails verification if a limit was hit.
Summarization condenses conversation history when the token count approaches the context window, preventing failures on long multi-turn interactions. Only the `langchain` adapter exposes the full `SummarizationConfig`; the `claude_agent_sdk` and `claude_tool` adapters use built-in summarization.
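The trigger condition reduces to a one-line check. The sketch below assumes summarization fires once the conversation reaches `trigger_fraction` of `max_context_tokens` (threshold semantics inferred from the field names, not confirmed from source):

```python
def should_summarize(conversation_tokens, max_context_tokens, trigger_fraction=0.8):
    # Fire once the conversation approaches the context window.
    return conversation_tokens >= trigger_fraction * max_context_tokens

# With a 200k-token window and the 0.8 default, summarization
# would kick in at 160k conversation tokens.
print(should_summarize(150_000, 200_000))  # False
print(should_summarize(165_000, 200_000))  # True
```

Once triggered, older messages are condensed and only the most recent `keep_messages` messages are kept verbatim, so the agent retains recent tool results while earlier turns survive only in summary form.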
Prompt caching reduces cost and latency for Anthropic models by caching static content (system prompts, tool definitions) on Anthropic's servers. It applies only to Anthropic models with the `langchain` interface. The `ttl` field accepts `"5m"` or `"1h"`.
## 6. Adapter-Specific MCP Behavior

Each adapter implements MCP differently. The adapter is selected by the `interface` field on `ModelConfig`.
| Feature | `langchain` | `claude_agent_sdk` | `claude_tool` |
|---|---|---|---|
| Transport | HTTP/SSE | HTTP/SSE + stdio | HTTP/SSE |
| MCP library | `langchain-mcp-adapters` | Claude Agent SDK | `mcp` Python SDK |
| Tool filtering | Yes | Yes | Yes |
| Description overrides | Yes | Yes | Yes |
| Configurable middleware | Full (`AgentMiddlewareConfig`) | Limits only | Limits only |
| Prompt caching | Middleware-based (Anthropic only) | No | Native (`cache_control`) |
| Summarization | Configurable | Built-in (SDK-managed) | Built-in |
| Stdio servers | No | Yes | No |
### 6.1 Choosing an Adapter for MCP
| Scenario | Adapter | Why |
|---|---|---|
| General-purpose MCP with configurable middleware | `langchain` | Full control over retry, summarization, caching behavior |
| Need local stdio-based MCP servers | `claude_agent_sdk` | Only adapter supporting stdio transport |
| Direct Anthropic API without framework overhead | `claude_tool` | Uses Anthropic SDK directly with native prompt caching |
| Non-Anthropic models (OpenAI, Google) with MCP | `langchain` | Multi-provider support via LangChain |
| Pre-recorded traces, no live tools | `manual` | Does not support MCP; raises `ValueError` if `mcp_urls_dict` is set |
The `openrouter` and `openai_endpoint` interfaces delegate to the `langchain` adapter internally, so they inherit the same MCP behavior.
For implementation details on how each adapter connects, discovers tools, runs the agent loop, and captures traces, see the MCP Integration Deep Dive.
## 7. Trace Handling
After agent execution, the full conversation trace (assistant responses, tool calls, tool results) is captured in two formats:
- `trace_messages`: structured list of `Message` objects with typed content blocks (text, tool use, tool result, thinking)
- `raw_trace`: legacy string serialization for backward compatibility and debugging

Two `VerificationConfig` fields control what the downstream evaluation models receive:
```python
from karenina.schemas.config.models import ModelConfig
from karenina.schemas.verification.config import VerificationConfig

# Minimal config to demonstrate trace handling fields
judge = ModelConfig(
    id="judge",
    model_name="claude-haiku-4-5",
    model_provider="anthropic",
    interface="langchain",
)

config = VerificationConfig(
    answering_models=[model_config],  # MCP-enabled model from Section 5
    parsing_models=[judge],
    use_full_trace_for_template=False,  # default: only final AI message for parsing
    use_full_trace_for_rubric=True,     # default: full trace for rubric evaluation
)

print(f"Template sees full trace: {config.use_full_trace_for_template}")
print(f"Rubric sees full trace: {config.use_full_trace_for_rubric}")
```

```
Template sees full trace: False
Rubric sees full trace: True
```
| Field | Type | Default | Effect |
|---|---|---|---|
| `use_full_trace_for_template` | `bool` | `False` | `True`: template parsing sees the complete trace. `False`: only the final AI message is passed to the Judge LLM. |
| `use_full_trace_for_rubric` | `bool` | `True` | `True`: rubric evaluation sees the complete trace. `False`: only the final AI message is evaluated. |
!!! note
    The full trace is **always captured and stored** regardless of these settings. These flags only control what input the parsing and evaluation models receive. If set to `False` and the trace does not end with an AI message, the corresponding verification stage will fail.
The defaults reflect typical evaluation patterns: template parsing usually needs only the final answer (the model's conclusion after tool use), while rubric evaluation benefits from the full trace (to assess qualities like tool selection, reasoning process, or citation of retrieved information).
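In effect, each flag chooses between the whole trace and its trailing AI message. A simplified sketch of that selection (the message shapes and `select_evaluation_input` are illustrative, not Karenina's code):

```python
def select_evaluation_input(trace, use_full_trace):
    """Return what a downstream model sees: full trace, or final AI message only."""
    if use_full_trace:
        return trace
    if not trace or trace[-1]["role"] != "ai":
        # Mirrors the documented failure mode: trace must end with an AI message.
        raise ValueError("trace does not end with an AI message")
    return [trace[-1]]

# Hypothetical agent trace for an MCP-answered question.
trace = [
    {"role": "human", "content": "What is the approval status of drug X?"},
    {"role": "ai", "content": "", "tool_calls": [{"name": "search"}]},
    {"role": "tool", "content": "Approved 2024."},
    {"role": "ai", "content": "Drug X was approved in 2024."},
]

print(len(select_evaluation_input(trace, use_full_trace=True)))          # 4
print(select_evaluation_input(trace, use_full_trace=False)[0]["content"])
```

With the defaults, the rubric evaluator would see all four messages (tool choice included) while the template parser sees only the final sentence.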
## 8. MCP Execution Lifecycle
When verification runs with MCP enabled, the following lifecycle executes per question:
```
1. Connect      Adapter connects to MCP servers listed in mcp_urls_dict
      │
2. Discover     Available tools are fetched from each server
      │
3. Filter       mcp_tool_filter restricts the tool set;
      │         description overrides are applied
      │
4. Agent loop   LLM receives question + tools → generates response →
      │         calls tools → receives results → continues until
      │         final answer or limit reached
      │
5. Middleware   Retry handles transient failures; summarization
      │         condenses long conversations; limits cap execution
      │
6. Capture      All messages captured as trace_messages and raw_trace
      │
7. Return       AgentResult with final_response, traces, usage metadata,
                and limit_reached flag
```
The `AgentResult` feeds into the standard verification pipeline at the `GenerateAnswer` stage (Stage 2). If `limit_reached` is `True`, the `RecursionLimitAutoFail` stage (Stage 3) marks verification as auto-failed while preserving the captured trace for inspection.
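The hand-off can be pictured with a small sketch. `AgentResultSketch` and `route_result` are illustrative stand-ins; only the field names (`final_response`, `trace_messages`, `limit_reached`) come from the description above:

```python
from dataclasses import dataclass, field

@dataclass
class AgentResultSketch:
    # Illustrative stand-in for the AgentResult described above;
    # the class itself is hypothetical.
    final_response: str
    trace_messages: list = field(default_factory=list)
    limit_reached: bool = False

def route_result(result):
    # Mirrors the pipeline behavior: a limit hit auto-fails verification
    # but the captured trace is preserved for inspection.
    if result.limit_reached:
        return ("auto_fail", result.trace_messages)
    return ("evaluate", result.final_response)

ok = AgentResultSketch(final_response="Done.", trace_messages=["..."])
hit = AgentResultSketch(final_response="partial", trace_messages=["..."], limit_reached=True)
print(route_result(ok)[0])   # evaluate
print(route_result(hit)[0])  # auto_fail
```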
## 9. Next Steps
- MCP-enabled verification workflow: step-by-step configuration and execution
- MCP Integration Deep Dive: adapter internals, connection lifecycle, trace capture, message conversion
- Adapters: adapter comparison and port/adapter architecture
- Evaluation modes: how MCP interacts with template and rubric evaluation
- Manual interface: alternative for reproducible testing without live tools