Multi-Model Comparison¶
This scenario compares multiple answering models on the same benchmark. You configure several models, leverage answer caching, use replicates for statistical robustness, and analyze results with grouping and DataFrames.
What you'll learn:
- Configure multiple answering models in one run
- Understand answer caching and cost savings
- Group and compare results by model
- Use replicates for variance measurement
- Analyze results with DataFrames
Configure Multiple Models¶
Specify multiple models in the answering_models list. Each question is verified once per answering model:
from karenina import Benchmark
# ModelConfig and VerificationConfig also come from karenina; the exact import
# path is assumed here, see the VerificationConfig Reference for details.
from karenina import ModelConfig, VerificationConfig

benchmark = Benchmark.load("benchmark.jsonld")  # path to your benchmark file

config = VerificationConfig(
    answering_models=[
        ModelConfig(id="claude-haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain"),
        ModelConfig(id="claude-sonnet", model_name="claude-sonnet-4-5",
                    model_provider="anthropic", interface="langchain"),
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
)
print(f"Answering models: {len(config.answering_models)}")
print(f"Total verifications: {len(config.answering_models)} models x {benchmark.question_count} questions = {len(config.answering_models) * benchmark.question_count}")
Answer Caching¶
When multiple parsing models evaluate the same answering model's response, karenina caches the answer generation. Each answering model generates each response only once, regardless of how many parsing models evaluate it.
With 2 answering models, 5 questions, and 1 parsing model:

- 2 answering models x 5 questions = 10 answer generations
- 1 parsing model x 10 answers = 10 parse calls
- Total LLM calls: 20

If you added a second parsing model, a naive run would regenerate every answer for each parser (2 answering x 5 questions x 2 parsers = 20 answer generations); with caching, only the original 10 are generated and the remaining 10 are served from the cache.
This makes multi-model evaluation cost-efficient — the expensive answering step is never duplicated.
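As a quick budget check, the arithmetic above generalizes. The helper below is an illustrative sketch (not part of karenina) that estimates call counts under the caching behavior described above:

def estimate_llm_calls(n_answering: int, n_questions: int, n_parsing: int) -> dict:
    # Answers are cached per (answering model, question) pair, so each response
    # is generated exactly once; every parsing model then parses every answer.
    answer_calls = n_answering * n_questions
    parse_calls = answer_calls * n_parsing
    return {"answer_calls": answer_calls, "parse_calls": parse_calls,
            "total": answer_calls + parse_calls}

print(estimate_llm_calls(n_answering=2, n_questions=5, n_parsing=1))
# {'answer_calls': 10, 'parse_calls': 10, 'total': 20}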
Run and Compare¶
results = benchmark.run_verification(config)
print(f"Total results: {len(results)}")
Group by Model¶
by_model = results.group_by_model()
for model_key, model_results in by_model.items():
    passed = sum(1 for r in model_results if r.template and r.template.verify_result)
    total = len(model_results)
    print(f"{model_key}: {passed}/{total} passed ({100*passed/total:.0f}%)")
Group by Question¶
See which questions are hardest across models:
by_question = results.group_by_question()
for qid, q_results in list(by_question.items())[:3]:
    q_text = q_results[0].metadata.question_text[:40]
    passed = sum(1 for r in q_results if r.template and r.template.verify_result)
    print(f"{q_text}... — {passed}/{len(q_results)} models passed")
Filter by Model¶
haiku_only = results.filter(answering_models=["claude-haiku-4-5"])
print(f"Claude Haiku results: {len(haiku_only)}")
Analyze with DataFrames¶
template_results = results.get_template_results()
df = template_results.to_dataframe()
print(f"DataFrame shape: {df.shape}")
print(df[["question_id", "answering_model", "verify_result"]].head(10))
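For a quick per-model pass rate without leaving this page, a simple aggregation over the columns shown above works. This sketch assumes verify_result holds booleans, as in the preview printed above:

pass_rates = (
    df.groupby("answering_model")["verify_result"]
      .mean()
      .mul(100)
      .round(1)
)
print(pass_rates)  # percentage of questions passed per answering model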
See DataFrame Analysis for advanced pivot tables and visualizations.
Replicates¶
Use replicate_count to run each model-question pair multiple times, measuring variance:
config_with_replicates = VerificationConfig(
    answering_models=[
        ModelConfig(id="claude-sonnet", model_name="claude-sonnet-4-5",
                    model_provider="anthropic", interface="langchain"),
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
    replicate_count=3,
)
print(f"Replicates: {config_with_replicates.replicate_count}")
print(f"Total verifications: {config_with_replicates.replicate_count} x {benchmark.question_count} = {config_with_replicates.replicate_count * benchmark.question_count}")
Analyze Replicate Variance¶
replicate_results = benchmark.run_verification(config_with_replicates)
by_question = replicate_results.group_by_question()
for qid, q_results in list(by_question.items())[:3]:
    q_text = q_results[0].metadata.question_text[:40]
    passes = sum(1 for r in q_results if r.template and r.template.verify_result)
    print(f"{q_text}... — {passes}/{len(q_results)} replicates passed")
CLI Usage¶
From the CLI, compare models by running the same preset once per answering model and writing each run to its own output file:

karenina verify benchmark.jsonld --preset base.json --answering-model claude-haiku-4-5 --output results_haiku.json
karenina verify benchmark.jsonld --preset base.json --answering-model claude-sonnet-4-5 --output results_sonnet.json

To add replicates to a run:

karenina verify benchmark.jsonld --preset base.json --replicate-count 3
Related Pages¶
- Basic Verification — Single-model verification walkthrough
- Full Evaluation — Add rubric evaluation to multi-model runs
- DataFrame Analysis — Advanced analysis with pandas
- VerificationConfig Reference — All configuration fields