Multi-Model Comparison¶
This scenario compares multiple answering models on the same benchmark. You configure several models, leverage answer caching, use replicates for statistical robustness, and analyze results with grouping and DataFrames.
What you'll learn:
- Configure multiple answering models in one run
- Understand answer caching and cost savings
- Group and compare results by model
- Use replicates for variance measurement
- Analyze results with DataFrames
Configure Multiple Models¶
Specify multiple models in the answering_models list. Each question is verified once per answering model:
from karenina import Benchmark
# ModelConfig and VerificationConfig also come from karenina; the exact import
# path is assumed here, see the VerificationConfig Reference for details.
from karenina import ModelConfig, VerificationConfig

benchmark = Benchmark.load("benchmark.jsonld")  # path to your benchmark file

config = VerificationConfig(
    answering_models=[
        ModelConfig(id="claude-haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain"),
        ModelConfig(id="claude-sonnet", model_name="claude-sonnet-4-5",
                    model_provider="anthropic", interface="langchain"),
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
)
print(f"Answering models: {len(config.answering_models)}")
print(f"Total verifications: {len(config.answering_models)} models x {benchmark.question_count} questions = {len(config.answering_models) * benchmark.question_count}")
Answer Caching¶
When multiple parsing models evaluate the same answering model's response, karenina caches the answer generation. Each answering model generates each response only once, regardless of how many parsing models evaluate it.
With 2 answering models, 5 questions, and 1 parsing model:

- 2 answering models x 5 questions = 10 answer generations
- 1 parsing model x 10 answers = 10 parse calls
- Total LLM calls: 20

If you added a second parsing model, a naive run would regenerate every answer for each parser (2 answering x 5 questions x 2 parsers = 20 answer generations); with caching, only the original 10 are generated and the remaining 10 are served from the cache.
This makes multi-model evaluation cost-efficient — the expensive answering step is never duplicated.
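As a quick budget check, the arithmetic above generalizes. The helper below is an illustrative sketch (not part of karenina) that estimates call counts under the caching behavior described above:

def estimate_llm_calls(n_answering: int, n_questions: int, n_parsing: int) -> dict:
    # Answers are cached per (answering model, question) pair, so each response
    # is generated exactly once; every parsing model then parses every answer.
    answer_calls = n_answering * n_questions
    parse_calls = answer_calls * n_parsing
    return {"answer_calls": answer_calls, "parse_calls": parse_calls,
            "total": answer_calls + parse_calls}

print(estimate_llm_calls(n_answering=2, n_questions=5, n_parsing=1))
# {'answer_calls': 10, 'parse_calls': 10, 'total': 20}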
Run and Compare¶
results = benchmark.run_verification(config)
print(f"Total results: {len(results)}")
Group by Model¶
by_model = results.group_by_model()
for model_key, model_results in by_model.items():
    passed = sum(1 for r in model_results if r.template and r.template.verify_result)
    total = len(model_results)
    print(f"{model_key}: {passed}/{total} passed ({100*passed/total:.0f}%)")
Group by Question¶
See which questions are hardest across models:
by_question = results.group_by_question()
for qid, q_results in list(by_question.items())[:3]:
    q_text = q_results[0].metadata.question_text[:40]
    passed = sum(1 for r in q_results if r.template and r.template.verify_result)
    print(f"{q_text}... — {passed}/{len(q_results)} models passed")
Filter by Model¶
haiku_only = results.filter(answering_models=["claude-haiku-4-5"])
print(f"Claude Haiku results: {len(haiku_only)}")
Analyze with DataFrames¶
template_results = results.get_template_results()
df = template_results.to_dataframe()
print(f"DataFrame shape: {df.shape}")
print(df[["question_id", "answering_model", "verify_result"]].head(10))
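For a quick per-model pass rate without leaving this page, a simple aggregation over the columns shown above works. This sketch assumes verify_result holds booleans, as in the preview printed above:

pass_rates = (
    df.groupby("answering_model")["verify_result"]
      .mean()
      .mul(100)
      .round(1)
)
print(pass_rates)  # percentage of questions passed per answering model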
See DataFrame Analysis for advanced pivot tables and visualizations.
Replicates¶
Use replicate_count to run each model-question pair multiple times, measuring variance:
config_with_replicates = VerificationConfig(
    answering_models=[
        ModelConfig(id="claude-sonnet", model_name="claude-sonnet-4-5",
                    model_provider="anthropic", interface="langchain"),
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
    replicate_count=3,
)
print(f"Replicates: {config_with_replicates.replicate_count}")
print(f"Total verifications: {config_with_replicates.replicate_count} x {benchmark.question_count} = {config_with_replicates.replicate_count * benchmark.question_count}")
Analyze Replicate Variance¶
replicate_results = benchmark.run_verification(config_with_replicates)
by_question = replicate_results.group_by_question()
for qid, q_results in list(by_question.items())[:3]:
    q_text = q_results[0].metadata.question_text[:40]
    passes = sum(1 for r in q_results if r.template and r.template.verify_result)
    print(f"{q_text}... — {passes}/{len(q_results)} replicates passed")
CLI Usage¶
From the CLI, compare models by running the same preset once per answering model and writing each run to its own output file:

karenina verify benchmark.jsonld --preset base.json --answering-model claude-haiku-4-5 --output results_haiku.json
karenina verify benchmark.jsonld --preset base.json --answering-model claude-sonnet-4-5 --output results_sonnet.json

To add replicates to a run:

karenina verify benchmark.jsonld --preset base.json --replicate-count 3
Related Pages¶
- Basic Verification — Single-model verification walkthrough
- Full Evaluation — Add rubric evaluation to multi-model runs
- DataFrame Analysis — Advanced analysis with pandas
- VerificationConfig Reference — All configuration fields