Manual Interface¶
This scenario uses the manual interface for offline evaluation — you supply pre-recorded LLM responses instead of generating them live. This is useful for evaluating cached responses, comparing parsing models on the same answers, or iterating on templates without re-running expensive LLM calls.
What you'll learn:
- When to use manual (offline) verification
- How to prepare and register pre-recorded traces
- How to configure the manual interface with a parsing model
- Common patterns: template iteration, parsing model comparison
- The CLI workflow for manual verification
When to Use¶
| Scenario | Why Manual? |
|---|---|
| Template iteration | Refine templates against the same responses without re-generating |
| Parsing model comparison | Try different judge models on identical inputs |
| Cost control | Evaluate expensive model outputs without re-running them |
| Reproducibility | Guarantee identical inputs across evaluation runs |
| Cached responses | Evaluate responses saved from a previous run or external system |
Workflow Diagram¶
Load benchmark Pre-record responses
│ │
▼ ▼
Configure manual interface Register traces (dict or file)
│ │
└───────────┬───────────────────┘
│
▼
Run verification (parsing only — no answer generation)
│
▼
Inspect results
Prepare Traces¶
Pre-recorded traces map questions to response strings, keyed either by question text (converted to internal IDs at registration with map_to_id=True) or directly by question ID:
As a Python Dictionary¶
from karenina import Benchmark

benchmark = Benchmark.load(str(_tmp))
question_ids = benchmark.get_question_ids()

# Map each question's text to a pre-recorded response
traces = {}
for qid in question_ids:
    q = benchmark.get_question(qid)
    traces[q["question"]] = f"Response for: {q['question'][:30]}..."

# Override with realistic responses
all_q = [benchmark.get_question(qid) for qid in question_ids]
traces[all_q[0]["question"]] = "Venetoclax is a selective BCL2 inhibitor. It targets the BCL2 protein."
traces[all_q[1]["question"]] = "Humans have 23 pairs of chromosomes, for a total of 46."
traces[all_q[2]["question"]] = "The primary neurotransmitter is norepinephrine (noradrenaline)."
traces[all_q[3]["question"]] = "Insulin is produced by the beta cells in the pancreas."
traces[all_q[4]["question"]] = "The half-life of caffeine is approximately 5 hours."
print(f"Prepared traces for {len(traces)} questions")
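Before registering a full set of traces, it can help to confirm that every question actually has a non-empty response. A minimal sketch with toy placeholder data (in the notebook, the question texts and the `traces` dict come from the benchmark as above):

```python
# Sketch: check that every question has a non-empty trace before
# registering. The question texts and traces below are toy
# placeholders for illustration.
questions = ["Q1 text", "Q2 text", "Q3 text"]
traces = {"Q1 text": "answer one", "Q2 text": ""}

# Questions with no trace at all
missing = [q for q in questions if q not in traces]
# Questions whose trace is an empty string (valid, but will likely fail parsing)
empty = [q for q in questions if traces.get(q) == ""]

print(f"Missing traces: {missing}")
print(f"Empty traces: {empty}")
```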
From a JSON File¶
import json

# Load from a JSON file (question text → response string)
with open(str(_traces_file)) as f:
    traces_from_file = json.load(f)
print(f"Loaded {len(traces_from_file)} traces from file")
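For reference, the traces file is a flat JSON object keyed by question text. A small sketch that writes and re-reads such a file (the filename `sample_traces.json` and its contents are made up for illustration):

```python
import json
from pathlib import Path

# Hypothetical traces file: a flat JSON object mapping question text
# to a pre-recorded response string.
sample = {
    "What protein does venetoclax target?": "Venetoclax is a selective BCL2 inhibitor.",
    "How many chromosomes do humans have?": "Humans have 23 pairs of chromosomes, 46 in total.",
}
path = Path("sample_traces.json")
path.write_text(json.dumps(sample, indent=2), encoding="utf-8")

# Re-read it the same way the notebook loads _traces_file
loaded = json.loads(path.read_text(encoding="utf-8"))
print(f"Wrote and re-read {len(loaded)} traces")
```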
from karenina.adapters.manual.traces import ManualTraces
# Register all traces at once (ManualTraces requires the benchmark)
# map_to_id=True converts question text keys to internal question hashes
manual_traces = ManualTraces(benchmark)
manual_traces.register_traces(traces, map_to_id=True)
print(f"Registered {len(traces)} traces")
Register Individual Traces¶
manual_traces_individual = ManualTraces(benchmark)
for question_text, response in traces.items():
    manual_traces_individual.register_trace(question_text, response, map_to_id=True)
print(f"Registered {len(traces)} traces individually")
Load from File and Register¶
import json

manual_traces_from_file = ManualTraces(benchmark)
with open(str(_traces_file)) as f:
    file_traces = json.load(f)
manual_traces_from_file.register_traces(file_traces, map_to_id=True)
print(f"Loaded and registered {len(file_traces)} traces from file")
Configure and Run¶
Set interface="manual" on the answering model. The parsing model still uses a live LLM:
# Assumes ModelConfig and VerificationConfig are already imported
config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="manual",
            model_name="manual",
            model_provider="manual",
            interface="manual",
            manual_traces=manual_traces,
        )
    ],
    parsing_models=[
        ModelConfig(
            id="haiku-parser",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    evaluation_mode="template_only",
)

results = benchmark.run_verification(config)
print(f"Results: {len(results)}")
With the manual interface, no answer generation LLM calls are made — only parsing calls.
Inspect Results¶
for result in results:
    meta = result.metadata
    t = result.template
    status = "PASS" if (t and t.verify_result) else "FAIL"
    print(f"[{status}] {meta.question_text[:50]}")
    print(f"  Model: {meta.answering.interface}/{meta.answering.model_name}")
Template Iteration¶
# Step 1: Run with initial template → inspect failures
# Step 2: Update the template code (e.g., adjust verify() logic)
# Step 3: Re-run with same traces — only parsing is repeated
#
# benchmark.update_template(question_id, new_template_code)
# results = benchmark.run_verification(config)
print("Iterate: update template → re-run → compare results")
Parsing Model Comparison¶
Evaluate the same responses with different parsing models:
parsing_models = [
    ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                model_provider="anthropic", interface="langchain", temperature=0.0),
    ModelConfig(id="sonnet-parser", model_name="claude-sonnet-4-5",
                model_provider="anthropic", interface="langchain", temperature=0.0),
]

for parser in parsing_models:
    parser_config = VerificationConfig(
        answering_models=[
            ModelConfig(id="manual", model_name="manual",
                        model_provider="manual", interface="manual",
                        manual_traces=manual_traces)
        ],
        parsing_models=[parser],
        evaluation_mode="template_only",
    )
    parser_results = benchmark.run_verification(parser_config)
    passed = sum(1 for r in parser_results if r.template and r.template.verify_result)
    print(f"Parser {parser.id}: {passed}/{len(parser_results)} passed")
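Once both runs complete, per-question verdicts can be compared directly. A sketch with toy verdict dicts (in practice, build these from each run's results, e.g. question ID mapped to `verify_result`):

```python
# Sketch: compare per-question verdicts from two parsing-model runs.
# The dicts below are toy placeholders mapping question ID -> passed.
haiku_verdicts = {"q1": True, "q2": False, "q3": True}
sonnet_verdicts = {"q1": True, "q2": True, "q3": True}

# Count questions where both parsers reached the same verdict
agree = sum(haiku_verdicts[q] == sonnet_verdicts[q] for q in haiku_verdicts)
# Questions worth inspecting by hand: the parsers disagree
disagreements = [q for q in haiku_verdicts if haiku_verdicts[q] != sonnet_verdicts[q]]

print(f"Agreement: {agree}/{len(haiku_verdicts)}")
print(f"Parsers disagree on: {disagreements}")
```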
Late Trace Population¶
Register traces incrementally, running verification as traces become available:
# Start with a subset
partial_traces = ManualTraces(benchmark)
subset_keys = list(traces.keys())[:3]
for question_text in subset_keys:
    partial_traces.register_trace(question_text, traces[question_text], map_to_id=True)
print(f"Phase 1: {len(subset_keys)} traces registered")

# Add more traces later
remaining_keys = list(traces.keys())[3:]
for question_text in remaining_keys:
    partial_traces.register_trace(question_text, traces[question_text], map_to_id=True)
print(f"Phase 2: {len(subset_keys) + len(remaining_keys)} traces registered")
CLI Workflow¶
# Manual verification via CLI:
# karenina verify benchmark.jsonld --preset base.json \
# --interface manual --manual-traces traces.json
# Compare parsing models on the same traces:
# karenina verify benchmark.jsonld --interface manual \
# --manual-traces traces.json --parsing-model claude-haiku-4-5
# karenina verify benchmark.jsonld --interface manual \
# --manual-traces traces.json --parsing-model claude-sonnet-4-5
print("CLI: karenina verify ... --interface manual --manual-traces traces.json")
Troubleshooting¶
| Issue | Cause | Fix |
|---|---|---|
| `KeyError: question_id` | Trace not registered for a question | Register traces for all questions, or use `question_ids` to verify a subset |
| Empty `raw_llm_response` | Trace is an empty string | Check the trace content — empty strings are valid but will likely fail parsing |
| Parsing errors on manual traces | Response format doesn't match template expectations | Review the template's expected fields and adjust the response or template |
Related Pages¶
- Basic Verification — Live verification walkthrough
- Full Evaluation — Add rubrics to manual evaluation
- Adapters — Manual adapter details
- CLI Reference: verify — `--interface` and `--manual-traces` options