# Analyzing Results
This section covers how to work with verification results — from understanding the result structure to building DataFrames for analysis, exporting data, and iterating on your benchmark based on findings.
## Workflow Overview
```text
Run verification (returns VerificationResultSet)
        │
        ▼
Explore result structure (metadata, template, rubric, deep judgment)
        │
        ▼
Filter and group results (by question, model, replicate)
        │
        ▼
Build DataFrames for analysis (template fields, rubric traits, judgment excerpts)
        │
        ▼
Export results (JSON, CSV, file) ─── and/or ─── Iterate (fix templates, improve rubrics, re-run)
```
Each step has a dedicated page with detailed instructions and examples.
## Workflow Steps
### 1. Understand the Result Structure
Every call to run_verification() returns a VerificationResultSet — a collection of VerificationResult objects, one per question × model × replicate combination:
```python
results = benchmark.run_verification(config)

# Iterate over individual results
for result in results:
    meta = result.metadata
    print(f"Q: {meta.question_id} | Model: {meta.answering.model_name}")
```
Each VerificationResult contains four sections:
| Section | Contains | Present When |
|---|---|---|
| metadata | Question ID, model info, timing, execution status | Always |
| template | Parsed answers, verify result, regex results, embedding similarity | Template evaluation ran |
| rubric | Trait scores by type (LLM, regex, callable, metric) | Rubric evaluation ran |
| deep_judgment | Excerpts, reasoning, hallucination risk | Deep judgment enabled |
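When a section did not run, you can skip it during iteration. The sketch below assumes absent sections are `None`, consistent with the "Present When" column above:

```python
# Check which sections each result carries. Assumption: sections that
# did not run (per the "Present When" column) are None on the result.
for result in results:
    meta = result.metadata
    present = {
        "template": result.template is not None,
        "rubric": result.rubric is not None,
        "deep_judgment": result.deep_judgment is not None,
    }
    print(f"{meta.question_id}: {present}")
```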
Understand the full result structure →
### 2. Filter and Group Results
VerificationResultSet provides built-in methods for slicing results:
```python
# Filter to specific questions or models
subset = results.filter(
    question_ids=["q1", "q2"],
    completed_only=True,
)

# Group by question, model, or replicate
by_question = results.group_by_question()
by_model = results.group_by_model()
by_replicate = results.group_by_replicate()
```
You can also get a summary of all results:
```python
summary = results.get_summary()
print(f"Total: {summary['num_results']}, Completed: {summary['num_completed']}")
```
### 3. Build DataFrames for Analysis
Convert results into pandas DataFrames for detailed analysis using three specialized builders:
| Builder | Rows | Best For |
|---|---|---|
| TemplateDataFrameBuilder | One row per parsed field | Field-by-field comparison, pass/fail analysis |
| RubricDataFrameBuilder | One row per rubric trait | Trait score distributions, quality analysis |
| JudgmentDataFrameBuilder | One row per excerpt | Deep judgment inspection, hallucination review |
```python
# Template results as a DataFrame
template_results = results.get_template_results()
df = template_results.to_dataframe()

# Rubric results as a DataFrame
rubric_results = results.get_rubrics_results()
df_rubric = rubric_results.to_dataframe()
```
The resulting DataFrames support filtering and aggregation with standard pandas operations:
```python
# Pass rate by model
df.groupby("answering_model_name")["field_match"].mean()

# Trait scores by question
df_rubric.groupby("question_id")["trait_score"].mean()
```
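For cross-model comparison, a standard pandas pivot works on the same DataFrame. This is plain pandas; the `answering_model_name` column is assumed to exist in the rubric DataFrame as it does in the template one:

```python
# Mean trait score per question and model. Pure pandas; assumes the
# rubric DataFrame also carries an "answering_model_name" column.
pivot = df_rubric.pivot_table(
    index="question_id",
    columns="answering_model_name",
    values="trait_score",
    aggfunc="mean",
)
print(pivot.round(2))
```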
Analyze results with DataFrames →
### 4. Persist to Database
Save benchmarks and verification results to a database for long-term storage, querying, and cross-run comparison:
```python
from karenina.storage import DBConfig
from karenina.storage.operations import save_benchmark, save_verification_results

db_config = DBConfig(storage_url="sqlite:///results.db")
save_benchmark(benchmark, db_config)
save_verification_results(results_dict, db_config, run_id="run-001", benchmark_name="my-benchmark")
```
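To sanity-check what was written, you can inspect the database with plain SQLAlchemy. This generic sketch makes no assumptions about the schema karenina creates; it simply lists whatever tables are present:

```python
from sqlalchemy import create_engine, inspect

# List the tables created in the SQLite file. Generic SQLAlchemy;
# no karenina-specific schema is assumed.
engine = create_engine("sqlite:///results.db")
print(inspect(engine).get_table_names())
```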
### 5. Export Results
Save results for sharing, external analysis, or archival:
```python
# Export as JSON string
json_str = benchmark.export_verification_results(format="json")

# Export to file (format inferred from extension)
benchmark.export_verification_results_to_file("results.json")
benchmark.export_verification_results_to_file("results.csv")

# Export DataFrames directly
df.to_csv("template_analysis.csv", index=False)
```
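For archival, it helps to timestamp the JSON export so successive runs don't overwrite each other. The snippet below is standard library only and reuses the `json_str` produced above:

```python
from datetime import datetime, timezone
from pathlib import Path

# Write the JSON export to a timestamped file so repeated runs are
# kept side by side. Stdlib only; json_str comes from the export above.
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
Path(f"results-{stamp}.json").write_text(json_str)
```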
### 6. Iterate on Your Benchmark
Use analysis findings to improve your benchmark:
- Failing templates — Identify questions where `verify_result` is `False` and refine template logic or field descriptions (see the sketch after this list)
- Low rubric scores — Find traits with consistently low scores and adjust descriptions or thresholds
- Re-run verification — After making changes, re-run to measure improvement
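As a starting point for the first item, the template DataFrame can surface failing questions directly. This is a sketch: the `verify_result` and `question_id` column names are assumptions inferred from the field names used in this section.

```python
# List questions with at least one failed template verification.
# Assumption: the template DataFrame exposes "verify_result" and
# "question_id" columns matching the field names used above.
failing = df[df["verify_result"] == False]
print(sorted(failing["question_id"].unique()))
```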
## Result Access Patterns
The VerificationResultSet provides specialized accessors for different analysis needs:
| Accessor | Returns | Use Case |
|---|---|---|
| `get_template_results()` | `TemplateResults` | Template field comparisons, regex matches, token usage |
| `get_rubrics_results()` | `RubricResults` | Rubric trait scores by type |
| `get_judgment_results()` | `JudgmentResults` | Deep judgment excerpts and reasoning |
| `get_rubric_judgments_results()` | `RubricJudgmentResults` | Per-trait deep judgment details |
| `filter(...)` | `VerificationResultSet` | Subset by question, model, or completion status |
| `group_by_question()` | `dict[str, VerificationResultSet]` | Per-question analysis |
| `group_by_model()` | `dict[str, VerificationResultSet]` | Cross-model comparison |
| `group_by_replicate()` | `dict[int, VerificationResultSet]` | Replicate consistency |
| `get_summary()` | `dict` | Aggregate statistics (counts, pass rates, timing) |
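These accessors chain cleanly. As a sketch, a replicate-consistency check for a single question can be written using only the methods in the table above (`"q1"` is a placeholder question ID, following the earlier examples):

```python
# Replicate consistency for one question, using only accessors from
# the table above. "q1" is a placeholder question ID.
q1_results = results.filter(question_ids=["q1"])
for replicate, rep_results in q1_results.group_by_replicate().items():
    s = rep_results.get_summary()
    print(f"replicate {replicate}: {s['num_completed']}/{s['num_results']} completed")
```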
## What You Get by Evaluation Mode
The data available in results depends on which evaluation mode you used:
| Evaluation Mode | Template Results | Rubric Results | Deep Judgment |
|---|---|---|---|
| `template_only` | Parsed fields, verify result, regex, embedding | — | If enabled |
| `template_and_rubric` | Parsed fields, verify result, regex, embedding | Trait scores for all trait types | If enabled |
| `rubric_only` | — | Trait scores for all trait types | If enabled (rubric only) |
## Next Steps
- Running Verification — If you haven't run verification yet
- Creating Benchmarks — If you need to build or modify a benchmark