Manual Interface¶
The manual interface is an adapter that replays pre-recorded LLM responses through the verification pipeline instead of calling a live model. The pipeline evaluates these traces identically to live responses: parsing, template verification, and rubric evaluation all run the same way. The only stage that changes behavior is answer generation, which reads from a local trace store instead of making an API call.
The key idea is that the manual interface decouples answer generation from evaluation. You can generate answers once, then re-evaluate them many times under different templates, rubrics, parsing models, or configurations, without repeating the expensive generation step.
**Manual interface vs TaskEval**
Both let you evaluate pre-recorded text, but they address different situations. The manual interface operates inside the Benchmark pipeline: it replaces the answering model with a trace lookup, so all 13 pipeline stages still run. TaskEval operates outside the Benchmark pipeline: you feed free text directly into the evaluation engine without needing a benchmark, questions, or checkpoints. See Section 6 for detailed guidance on choosing between them.
1. What the Manual Interface Is¶
The manual interface is one of several adapter backends registered in the adapter system. When ModelConfig.interface is set to "manual", the adapter factory returns a ManualAgentAdapter that reads from a thread-safe trace store (ManualTraceManager) instead of calling an LLM provider. The pipeline stage GenerateAnswer uses this adapter through the same AgentPort protocol it uses for every other backend; it does not contain manual-specific branching.
1.1. The Abstraction Boundary¶
What the manual interface handles:
- Looking up a pre-recorded trace by question hash (MD5 of the question text)
- Returning that trace in the same `AgentResult` format as live adapters
- Extracting agent metrics (tool calls, iterations, failures) from message-list traces
- Converting port `Message` lists and LangChain message lists to harmonized string traces
What the manual interface does not handle:
- Parsing the trace into a template schema (a separate parsing model is always required)
- Rubric evaluation (uses the parsing model, not the answering model)
- Prompt construction, stage sequencing, or result storage (handled by other pipeline components)
1.2. Three Adapters, One Functional¶
The manual adapter registers three port implementations, but only one does real work:
| Adapter | Port | Behavior | Purpose |
|---|---|---|---|
| `ManualAgentAdapter` | `AgentPort` | Looks up trace from `ManualTraceManager`, returns `AgentResult` | Answer generation |
| `ManualLLMAdapter` | `LLMPort` | Raises `ManualInterfaceError` if invoked | Safety net: ensures no accidental LLM calls |
| `ManualParserAdapter` | `ParserPort` | Raises `ManualInterfaceError` if invoked | Safety net: parsing uses the separate parsing model |
The LLM and parser adapters exist because the factory always returns an adapter for each port (never None). They act as guardrails: if pipeline code accidentally tries to call the answering model's LLM or parser port, it gets an immediate error rather than a silent failure.
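The guardrail pattern itself is easy to illustrate. The sketch below is not Karenina's implementation; the error and class names mirror the adapters described above, but the bodies are minimal stand-ins for the "fail loudly on any invocation" behavior:

```python
class ManualInterfaceError(RuntimeError):
    """Raised when a port that must never run under the manual interface is invoked."""


class GuardrailLLMAdapter:
    """Illustrative stand-in for ManualLLMAdapter: every call fails immediately."""

    def invoke(self, prompt: str) -> str:
        raise ManualInterfaceError(
            "The manual interface never calls a live LLM; "
            "configure the parsing model separately."
        )


adapter = GuardrailLLMAdapter()
try:
    adapter.invoke("What is 2+2?")
except ManualInterfaceError as exc:
    print(f"blocked: {exc}")
```

The point of the pattern: a misrouted call surfaces as an explicit exception at the call site, rather than silently producing a live API call or a `None` result.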
2. Why It Exists¶
Answer generation is the slowest and most expensive step in the verification pipeline. When you are iterating on templates, adjusting rubric traits, tuning parsing models, or comparing how different judges handle the same responses, regenerating answers every time wastes time and money.
The manual interface solves this by making the trace store a first-class adapter backend. Because it plugs into the same port/adapter architecture as live backends, every pipeline stage after GenerateAnswer runs identically. Results from manual runs are directly comparable to live runs.
| Scenario | How the manual interface helps |
|---|---|
| Template iteration | Fix a verify() method or add a field, then re-evaluate the same answers instantly |
| Rubric refinement | Adjust trait descriptions or add new traits without regenerating responses |
| Parsing model comparison | Run the same traces through different judge models to compare extraction quality |
| Cost control | Generate answers once with an expensive model, then iterate on evaluation cheaply |
| External outputs | Evaluate responses from systems outside Karenina (other frameworks, human-written answers, production logs) |
| Controlled testing | Test templates and rubrics with known answers before running full benchmarks |
3. How It Works¶
3.1. Pipeline Flow¶
```
Normal flow: Question ──► Answering LLM ──► Trace ──► Parsing ──► Verify ──► Rubric

Manual flow: Question ──► Trace Store   ──► Trace ──► Parsing ──► Verify ──► Rubric
                          (ManualTraceManager)
```
During GenerateAnswer, the pipeline computes the MD5 hash of the question text and passes it to the ManualAgentAdapter via AgentConfig.question_hash. The adapter looks up the corresponding trace in the global ManualTraceManager singleton and returns it as an AgentResult with zero token usage and actual_model="manual".
Every subsequent stage (parsing, template verification, embedding check, rubric evaluation, finalization) runs without modification. The parsing model (a separate, live LLM) receives the pre-recorded trace and extracts structured data from it exactly as it would from a live response.
3.2. Trace Lookup by Question Hash¶
Traces are indexed by question hash: the MD5 hash of the question text, which matches the deterministic ID used throughout Karenina's question system. The pipeline computes this hash automatically during verification; you do not need to compute it manually unless you are building a trace file by hand.
```python
import hashlib

question_hash = hashlib.md5(b"What is 2+2?").hexdigest()
print(question_hash)
```
If no trace is found for a question hash, the pipeline raises ManualTraceNotFoundError with the hash and the count of loaded traces, making mismatches easy to diagnose.
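At its core the lookup is a keyed dictionary read. A minimal illustration of the failure mode (not the library's code; the function is a stand-in, only the exception name comes from the docs above):

```python
class ManualTraceNotFoundError(KeyError):
    """Illustrative: no pre-recorded trace exists for the requested hash."""


def lookup_trace(traces: dict[str, str], question_hash: str) -> str:
    """Return the trace for a question hash, or fail with diagnostic context."""
    try:
        return traces[question_hash]
    except KeyError:
        raise ManualTraceNotFoundError(
            f"No trace for hash {question_hash!r}; {len(traces)} trace(s) loaded"
        ) from None


traces = {"936dbc8755f623c951d96ea2b03e13bc": "The answer is 4."}
print(lookup_trace(traces, "936dbc8755f623c951d96ea2b03e13bc"))
```

Including the loaded-trace count in the error message is what makes a hash mismatch (wrong file, stale checkpoint) quick to diagnose.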
3.3. Session Management¶
The ManualTraceManager is a global singleton with session-based cleanup. Traces are automatically cleared after one hour of inactivity (configurable via session_timeout_seconds). Individual traces also expire after the session timeout. This prevents memory leaks in long-running server processes. For short-lived scripts, the timeout is effectively irrelevant.
4. Trace Formats¶
ManualTraces.register_trace() accepts three input formats. All are stored internally as plain strings.
| Format | Input Type | When to Use |
|---|---|---|
| String | `str` | Simple text answers with no tool call history |
| Port message list | `list[Message]` | Traces captured using Karenina's native message format (`karenina.ports.messages.Message`) |
| LangChain message list | `list[AIMessage \| ToolMessage \| ...]` | Traces captured from LangChain agent runs (requires `langchain-core`) |
For message lists, ManualTraces automatically:
- Detects the format (port or LangChain)
- Converts LangChain messages to port `Message` objects if needed
- Extracts agent metrics (iterations, tool call counts, failure counts)
- Harmonizes the message list to a string trace, filtering out system messages and the initial user question
- Stores the trace string and metrics separately in the `ManualTraceManager`
The extracted agent metrics are preserved in the AgentResult and flow through to the verification result, so manual runs of agent traces still report tool usage statistics.
4.1. JSON File Format¶
The CLI and load_manual_traces_from_file() expect a JSON file mapping question hashes to trace strings:
```json
{
  "936dbc8755f623c951d96ea2b03e13bc": "The answer is 4.",
  "8f2e2b1e4d5c6a7b8c9d0e1f2a3b4c5d": "The answer is 6."
}
```
Validation rules:
- The file must contain a JSON object (not an array or scalar)
- Keys must be valid 32-character hexadecimal MD5 hashes
- Values must be non-empty strings
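These rules can be pre-checked before handing a file to the CLI. The following is a stand-alone sketch using only the standard library, not part of Karenina; the function name is hypothetical:

```python
import json
import re

# 32 hex characters, as produced by hashlib.md5(...).hexdigest()
HASH_RE = re.compile(r"[0-9a-fA-F]{32}")


def validate_trace_file(path: str) -> dict[str, str]:
    """Load a trace file and enforce the documented validation rules."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise ValueError("trace file must contain a JSON object")
    for key, value in data.items():
        if not HASH_RE.fullmatch(key):
            raise ValueError(f"key {key!r} is not a 32-character hex MD5 hash")
        if not isinstance(value, str) or not value:
            raise ValueError(f"value for {key!r} must be a non-empty string")
    return data
```

Running this before a long verification job turns a malformed file into an immediate, localized error instead of a mid-run failure.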
5. Usage¶
5.1. Python API¶
```python
from karenina.benchmark import Benchmark

# Load benchmark
benchmark = Benchmark.load("checkpoint.jsonld")

# Create ManualTraces linked to the benchmark
manual_traces = ManualTraces(benchmark)

# Register by question text (map_to_id=True converts text to hash)
manual_traces.register_trace(
    "What is 2+2?",
    "The answer is 4. I computed this by adding 2 and 2.",
    map_to_id=True,
)

# Or register by MD5 hash directly
manual_traces.register_trace(
    "936dbc8755f623c951d96ea2b03e13bc",
    "The answer is 4.",
)

# Or batch register
manual_traces.register_traces({
    "What is 2+2?": "The answer is 4.",
    "What is 3+3?": "The answer is 6.",
}, map_to_id=True)

print(f"Registered traces for {len(benchmark.get_question_ids())} questions")
```
Then configure and run verification:
```python
# Answering model: manual (reads from trace store)
manual_config = ModelConfig(
    interface="manual",
    manual_traces=manual_traces,
)

# Parsing model: a live LLM (required for template extraction)
judge_config = ModelConfig(
    id="claude-haiku",
    model_provider="anthropic",
    model_name="claude-haiku-4-5",
    interface="langchain",
)

config = VerificationConfig(
    answering_models=[manual_config],
    parsing_models=[judge_config],
)

results = benchmark.run_verification(config)
print(f"Verified {len(results)} questions")
```
**Automatic defaults for manual ModelConfig**
When interface="manual", ModelConfig automatically sets id and model_name to "manual" if you leave them unset. You do not need to specify model_provider.
5.2. CLI¶
```shell
karenina verify checkpoint.jsonld \
  --interface manual \
  --manual-traces traces/my_traces.json \
  --parsing-model claude-haiku-4-5 \
  --parsing-provider anthropic
```
The CLI automatically sets the parsing interface to "langchain" when --interface manual is specified. The --manual-traces flag is required with --interface manual.
5.3. Registration Behaviors¶
| Registration method | `map_to_id` | Identifier type | Notes |
|---|---|---|---|
| `register_trace(hash, trace)` | `False` (default) | MD5 hash | Hash format validated (32 hex chars) |
| `register_trace(text, trace, map_to_id=True)` | `True` | Question text | Text must match a question in the benchmark exactly (case-sensitive, whitespace-sensitive) |
| `register_traces(dict, map_to_id=True)` | `True` | Question text | Batch version; calls `register_trace()` for each entry |
When using map_to_id=True, ManualTraces builds a lazy index of question text to hash on first use, so subsequent lookups are O(1).
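The lazy index amounts to hashing each question's text once and caching the mapping. An illustrative sketch (the class is hypothetical, not Karenina's code), assuming hashing works as described in Section 3.2:

```python
import hashlib


class LazyQuestionIndex:
    """Illustrative text-to-hash index, built once on first lookup."""

    def __init__(self, question_texts: list[str]):
        self._texts = question_texts
        self._index: dict[str, str] | None = None

    def hash_for(self, text: str) -> str:
        if self._index is None:  # build on first use; O(1) per lookup afterwards
            self._index = {
                t: hashlib.md5(t.encode("utf-8")).hexdigest() for t in self._texts
            }
        return self._index[text]  # KeyError unless the text matches exactly


index = LazyQuestionIndex(["What is 2+2?", "What is 3+3?"])
print(index.hash_for("What is 2+2?"))
```

Because lookups are by exact key, any deviation in case or whitespace misses the index, which is why the registration rules above are strict about exact matching.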
6. Manual Interface vs TaskEval¶
Both mechanisms evaluate pre-recorded text, but they serve different workflows.
| Dimension | Manual Interface | TaskEval |
|---|---|---|
| Operates within | Benchmark pipeline (all 13 stages run) | Standalone evaluation engine (no benchmark required) |
| Requires | Benchmark with questions, checkpoint, `ManualTraces` | Just a template and/or rubric |
| Input | Trace per question, keyed by question hash | Free text logged via `task.log()` |
| Use case | Re-evaluate benchmark answers under new configs | Evaluate arbitrary text (production logs, human writing, one-off outputs) |
| Results stored in | Benchmark results (database, JSON-LD checkpoint) | TaskEval result object |
Use the manual interface when you already have a benchmark and want to re-run evaluation on captured traces (template iteration, parsing model comparison, rubric refinement).
Use TaskEval when you have text to evaluate but no benchmark context: one-off evaluations, production output spot-checks, or evaluating text from systems that do not produce Karenina checkpoints.
7. Constraints¶
| Constraint | Reason |
|---|---|
| No MCP support | Configuring mcp_urls_dict on a manual ModelConfig raises ValueError. Pre-recorded traces cannot invoke live tools. |
| No live LLM calls from the answering model | ManualLLMAdapter and ManualParserAdapter raise ManualInterfaceError if invoked. Parsing and rubric evaluation use the separate parsing model. |
| Traces excluded from serialization | manual_traces is Field(exclude=True) on ModelConfig. When loading a preset with a manual config, traces must be re-registered. |
| Exact text matching | When using map_to_id=True, question text must match the benchmark exactly (case-sensitive, including whitespace). |
| No tools or structured output on the answering model | The manual adapter spec has supports_mcp=False and supports_tools=False. |
8. Next Steps¶
- Manual Interface Workflow: Step-by-step walkthrough with executable examples
- Adapters: How the manual interface fits into the port/adapter architecture
- TaskEval: Evaluating free text without a benchmark
- Running Verification: Overview of all verification methods