Manual Interface¶
This scenario uses the manual interface for offline evaluation — you supply pre-recorded LLM responses instead of generating them live. This is useful for evaluating cached responses, comparing parsing models on the same answers, or iterating on templates without re-running expensive LLM calls.
What you'll learn:
- When to use manual (offline) verification
- How to prepare and register pre-recorded traces
- How to configure the manual interface with a parsing model
- Common patterns: template iteration, parsing model comparison
- The CLI workflow for manual verification
When to Use¶
| Scenario | Why Manual? |
|---|---|
| Template iteration | Refine templates against the same responses without re-generating |
| Parsing model comparison | Try different judge models on identical inputs |
| Cost control | Evaluate expensive model outputs without re-running them |
| Reproducibility | Guarantee identical inputs across evaluation runs |
| Cached responses | Evaluate responses saved from a previous run or external system |
Workflow Diagram¶
Load benchmark Pre-record responses
│ │
▼ ▼
Configure manual interface Register traces (dict or file)
│ │
└───────────┬───────────────────┘
│
▼
Run verification (parsing only — no answer generation)
│
▼
Inspect results
Prepare Traces¶
Pre-recorded traces map questions to response strings, keyed either by question text (converted to internal IDs at registration with map_to_id=True) or directly by question ID:
As a Python Dictionary¶
from karenina import Benchmark

benchmark = Benchmark.load(str(_tmp))
question_ids = benchmark.get_question_ids()

# Map each question's text to a pre-recorded response
traces = {}
for qid in question_ids:
    q = benchmark.get_question(qid)
    traces[q["question"]] = f"Response for: {q['question'][:30]}..."

# Override with realistic responses
all_q = [benchmark.get_question(qid) for qid in question_ids]
traces[all_q[0]["question"]] = "Venetoclax is a selective BCL2 inhibitor. It targets the BCL2 protein."
traces[all_q[1]["question"]] = "Humans have 23 pairs of chromosomes, for a total of 46."
traces[all_q[2]["question"]] = "The primary neurotransmitter is norepinephrine (noradrenaline)."
traces[all_q[3]["question"]] = "Insulin is produced by the beta cells in the pancreas."
traces[all_q[4]["question"]] = "The half-life of caffeine is approximately 5 hours."
print(f"Prepared traces for {len(traces)} questions")
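Before registering a full set of traces, it can help to confirm that every question actually has a non-empty response. A minimal sketch with toy placeholder data (in the notebook, the question texts and the `traces` dict come from the benchmark as above):

```python
# Sketch: check that every question has a non-empty trace before
# registering. The question texts and traces below are toy
# placeholders for illustration.
questions = ["Q1 text", "Q2 text", "Q3 text"]
traces = {"Q1 text": "answer one", "Q2 text": ""}

# Questions with no trace at all
missing = [q for q in questions if q not in traces]
# Questions whose trace is an empty string (valid, but will likely fail parsing)
empty = [q for q in questions if traces.get(q) == ""]

print(f"Missing traces: {missing}")
print(f"Empty traces: {empty}")
```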
From a JSON File¶
import json

# Load from a JSON file (question text → response string)
with open(str(_traces_file)) as f:
    traces_from_file = json.load(f)
print(f"Loaded {len(traces_from_file)} traces from file")
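For reference, the traces file is a flat JSON object keyed by question text. A small sketch that writes and re-reads such a file (the filename `sample_traces.json` and its contents are made up for illustration):

```python
import json
from pathlib import Path

# Hypothetical traces file: a flat JSON object mapping question text
# to a pre-recorded response string.
sample = {
    "What protein does venetoclax target?": "Venetoclax is a selective BCL2 inhibitor.",
    "How many chromosomes do humans have?": "Humans have 23 pairs of chromosomes, 46 in total.",
}
path = Path("sample_traces.json")
path.write_text(json.dumps(sample, indent=2), encoding="utf-8")

# Re-read it the same way the notebook loads _traces_file
loaded = json.loads(path.read_text(encoding="utf-8"))
print(f"Wrote and re-read {len(loaded)} traces")
```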
from karenina.adapters.manual.traces import ManualTraces
# Register all traces at once (ManualTraces requires the benchmark)
# map_to_id=True converts question text keys to internal question hashes
manual_traces = ManualTraces(benchmark)
manual_traces.register_traces(traces, map_to_id=True)
print(f"Registered {len(traces)} traces")
Register Individual Traces¶
manual_traces_individual = ManualTraces(benchmark)
for question_text, response in traces.items():
    manual_traces_individual.register_trace(question_text, response, map_to_id=True)
print(f"Registered {len(traces)} traces individually")
Load from File and Register¶
import json

manual_traces_from_file = ManualTraces(benchmark)
with open(str(_traces_file)) as f:
    file_traces = json.load(f)
manual_traces_from_file.register_traces(file_traces, map_to_id=True)
print(f"Loaded and registered {len(file_traces)} traces from file")
Configure and Run¶
Set interface="manual" on the answering model. The parsing model still uses a live LLM:
# Assumes ModelConfig and VerificationConfig are already imported
config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="manual",
            model_name="manual",
            model_provider="manual",
            interface="manual",
            manual_traces=manual_traces,
        )
    ],
    parsing_models=[
        ModelConfig(
            id="haiku-parser",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    evaluation_mode="template_only",
)

results = benchmark.run_verification(config)
print(f"Results: {len(results)}")
With the manual interface, no answer generation LLM calls are made — only parsing calls.
Inspect Results¶
for result in results:
    meta = result.metadata
    t = result.template
    status = "PASS" if (t and t.verify_result) else "FAIL"
    print(f"[{status}] {meta.question_text[:50]}")
    print(f"  Model: {meta.answering.interface}/{meta.answering.model_name}")
Template Iteration¶
# Step 1: Run with initial template → inspect failures
# Step 2: Update the template code (e.g., adjust verify() logic)
# Step 3: Re-run with same traces — only parsing is repeated
#
# benchmark.update_template(question_id, new_template_code)
# results = benchmark.run_verification(config)
print("Iterate: update template → re-run → compare results")
Parsing Model Comparison¶
Evaluate the same responses with different parsing models:
parsing_models = [
    ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                model_provider="anthropic", interface="langchain", temperature=0.0),
    ModelConfig(id="sonnet-parser", model_name="claude-sonnet-4-5",
                model_provider="anthropic", interface="langchain", temperature=0.0),
]

for parser in parsing_models:
    parser_config = VerificationConfig(
        answering_models=[
            ModelConfig(id="manual", model_name="manual",
                        model_provider="manual", interface="manual",
                        manual_traces=manual_traces)
        ],
        parsing_models=[parser],
        evaluation_mode="template_only",
    )
    parser_results = benchmark.run_verification(parser_config)
    passed = sum(1 for r in parser_results if r.template and r.template.verify_result)
    print(f"Parser {parser.id}: {passed}/{len(parser_results)} passed")
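Once both runs complete, per-question verdicts can be compared directly. A sketch with toy verdict dicts (in practice, build these from each run's results, e.g. question ID mapped to `verify_result`):

```python
# Sketch: compare per-question verdicts from two parsing-model runs.
# The dicts below are toy placeholders mapping question ID -> passed.
haiku_verdicts = {"q1": True, "q2": False, "q3": True}
sonnet_verdicts = {"q1": True, "q2": True, "q3": True}

# Count questions where both parsers reached the same verdict
agree = sum(haiku_verdicts[q] == sonnet_verdicts[q] for q in haiku_verdicts)
# Questions worth inspecting by hand: the parsers disagree
disagreements = [q for q in haiku_verdicts if haiku_verdicts[q] != sonnet_verdicts[q]]

print(f"Agreement: {agree}/{len(haiku_verdicts)}")
print(f"Parsers disagree on: {disagreements}")
```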
Late Trace Population¶
Register traces incrementally, running verification as traces become available:
# Start with a subset
partial_traces = ManualTraces(benchmark)
subset_keys = list(traces.keys())[:3]
for question_text in subset_keys:
    partial_traces.register_trace(question_text, traces[question_text], map_to_id=True)
print(f"Phase 1: {len(subset_keys)} traces registered")

# Add more traces later
remaining_keys = list(traces.keys())[3:]
for question_text in remaining_keys:
    partial_traces.register_trace(question_text, traces[question_text], map_to_id=True)
print(f"Phase 2: {len(subset_keys) + len(remaining_keys)} traces registered")
CLI Workflow¶
# Manual verification via CLI:
# karenina verify benchmark.jsonld --preset base.json \
# --interface manual --manual-traces traces.json
# Compare parsing models on the same traces:
# karenina verify benchmark.jsonld --interface manual \
# --manual-traces traces.json --parsing-model claude-haiku-4-5
# karenina verify benchmark.jsonld --interface manual \
# --manual-traces traces.json --parsing-model claude-sonnet-4-5
print("CLI: karenina verify ... --interface manual --manual-traces traces.json")
Troubleshooting¶
| Issue | Cause | Fix |
|---|---|---|
| `KeyError: question_id` | Trace not registered for a question | Register traces for all questions, or use `question_ids` to verify a subset |
| Empty `raw_llm_response` | Trace is an empty string | Check the trace content — empty strings are valid but will likely fail parsing |
| Parsing errors on manual traces | Response format doesn't match template expectations | Review the template's expected fields and adjust the response or template |
Related Pages¶
- Basic Verification — Live verification walkthrough
- Full Evaluation — Add rubrics to manual evaluation
- Adapters — Manual adapter details
- CLI Reference: verify — `--interface` and `--manual-traces` options