# DataFrame Analysis
Karenina provides a DataFrame-first approach for analyzing verification results. By converting results to pandas DataFrames, you can use the full power of pandas for filtering, grouping, aggregation, and visualization.
## Overview

After running verification, you receive a `VerificationResultSet` containing all results. The result set provides three specialized accessors that convert results to pandas DataFrames:
| Accessor | Returns | Rows Represent |
|---|---|---|
| `get_template_results()` | `TemplateResults` | One row per parsed field comparison |
| `get_rubrics_results()` | `RubricResults` | One row per rubric trait evaluated |
| `get_judgment_results()` | `JudgmentResults` | One row per (attribute × excerpt) pair |
Each accessor returns a wrapper object with a `.to_dataframe()` method plus filtering, grouping, and aggregation helpers.
## Getting Started
The basic workflow is: extract a result type, convert to DataFrame, analyze with pandas.
```python
# Extract template results and convert to DataFrame
template_results = results.get_template_results()
df = template_results.to_dataframe()

print(f"DataFrame shape: {df.shape}")
print(f"Columns: {list(df.columns[:8])}...")
```

```text
DataFrame shape: (6, 34)
Columns: ['completed_without_errors', 'error', 'recursion_limit_reached', 'question_id', 'template_id', 'question_text', 'keywords', 'replicate']...
```
## Template DataFrames
`TemplateResults` provides three DataFrame methods:

| Method | Exploded By | Use Case |
|---|---|---|
| `to_dataframe()` | Parsed fields | Field-level pass/fail analysis |
| `to_regex_dataframe()` | Regex patterns | Format compliance analysis |
| `to_usage_dataframe()` | Token usage stages | Cost analysis |
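The usage DataFrame lends itself to cost analysis with ordinary pandas grouping. As a pure-pandas sketch: the column names below (`stage`, `input_tokens`, `output_tokens`) are hypothetical stand-ins for whatever `to_usage_dataframe()` actually returns, so check your real column list first.

```python
import pandas as pd

# Hypothetical usage DataFrame; real column names come from to_usage_dataframe()
usage_df = pd.DataFrame(
    {
        "answering_model": ["langchain:gpt-4o"] * 2
        + ["claude_agent_sdk:claude-sonnet-4-20250514"] * 2,
        "stage": ["answering", "parsing"] * 2,
        "input_tokens": [1200, 800, 1100, 750],
        "output_tokens": [300, 150, 280, 140],
    }
)

# Total token spend per model, summed across all stages
totals = usage_df.groupby("answering_model")[["input_tokens", "output_tokens"]].sum()
print(totals)
```

The same groupby works unchanged on the real usage DataFrame once you substitute its actual column names.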
### Field-Level Analysis
The main template DataFrame creates one row per parsed field, enabling field-level comparison between ground truth and LLM-extracted values.
```python
template_results = results.get_template_results()
df = template_results.to_dataframe()

# Key columns for field-level analysis
print("Field comparison columns:")
print(
    df[["question_id", "answering_model", "field_name", "gt_value", "llm_value", "field_match"]].to_string(index=False)
)
```

```text
Field comparison columns:
question_id                           answering_model field_name gt_value llm_value field_match
         q1                          langchain:gpt-4o    capital    Paris     Paris        True
         q2                          langchain:gpt-4o     result       42        42        True
         q3                          langchain:gpt-4o    element   Oxygen  Nitrogen       False
         q1 claude_agent_sdk:claude-sonnet-4-20250514    capital    Paris     Paris        True
         q2 claude_agent_sdk:claude-sonnet-4-20250514     result       42        42        True
         q3 claude_agent_sdk:claude-sonnet-4-20250514    element   Oxygen    Oxygen        True
```
### Pass Rate by Model
A common analysis pattern: calculate template verification pass rates grouped by answering model.
```python
# Use the built-in aggregation helper
pass_rates = template_results.aggregate_pass_rate(by="answering_model")
print("Pass rate by model:")
for model, rate in pass_rates.items():
    print(f"  {model}: {rate:.0%}")
```

```text
Pass rate by model:
  claude_agent_sdk:claude-sonnet-4-20250514: 100%
  langchain:gpt-4o: 67%
```
### Pass Rate by Question

```python
pass_rates_by_q = template_results.aggregate_pass_rate(by="question_id")
print("Pass rate by question:")
for qid, rate in pass_rates_by_q.items():
    print(f"  {qid}: {rate:.0%}")
```

```text
Pass rate by question:
  q1: 100%
  q2: 100%
  q3: 50%
```
### Filtering Results

`TemplateResults` supports filtering before DataFrame conversion:

```python
# Filter to only failed results
failed = template_results.filter(failed_only=True)
df_failed = failed.to_dataframe()
print(f"Failed results: {len(failed)} (fields in DataFrame: {len(df_failed)})")

# Filter by model (use the full display string: "interface:model_name")
gpt_results = template_results.filter(answering_models=["langchain:gpt-4o"])
print(f"GPT-4o results: {len(gpt_results)}")
```

```text
Failed results: 1 (fields in DataFrame: 1)
GPT-4o results: 3
```
### Summary Statistics

```python
summary = template_results.get_template_summary()
print(f"Total results: {summary['num_results']}")
print(f"Passed: {summary['num_passed']}, Failed: {summary['num_failed']}")
print(f"Pass rate: {summary['pass_rate']:.0%}")
print(f"Unique questions: {summary['num_questions']}")
```

```text
Total results: 6
Passed: 5, Failed: 1
Pass rate: 83%
Unique questions: 3
```
## Rubric DataFrames

`RubricResults` converts rubric evaluation scores to DataFrames, with one row per trait evaluated. It supports filtering by trait type.
### Trait Type Filtering

The `to_dataframe()` method accepts a `trait_type` parameter:

| Value | Includes |
|---|---|
| `"all"` | All trait types (default) |
| `"llm"` | All LLM traits (score + binary + literal) |
| `"llm_score"` | LLM score traits only (1-5 scale) |
| `"llm_binary"` | LLM binary traits only (True/False) |
| `"llm_literal"` | LLM literal traits only (categorical) |
| `"regex"` | Regex traits |
| `"callable"` | Callable traits |
| `"metric"` | Metric traits (exploded by metric name) |
```python
rubric_results = results.get_rubrics_results()
df_all = rubric_results.to_dataframe()
print("All rubric traits:")
print(df_all[["question_id", "answering_model", "trait_name", "trait_score", "trait_type"]].to_string(index=False))
```

```text
All rubric traits:
question_id                           answering_model  trait_name trait_score trait_type
         q1                          langchain:gpt-4o     clarity           4  llm_score
         q1                          langchain:gpt-4o conciseness        True llm_binary
         q1                          langchain:gpt-4o  no_hedging        True      regex
         q2                          langchain:gpt-4o     clarity           5  llm_score
         q2                          langchain:gpt-4o conciseness        True llm_binary
         q3                          langchain:gpt-4o     clarity           3  llm_score
         q3                          langchain:gpt-4o conciseness       False llm_binary
         q1 claude_agent_sdk:claude-sonnet-4-20250514     clarity           5  llm_score
         q1 claude_agent_sdk:claude-sonnet-4-20250514 conciseness        True llm_binary
         q1 claude_agent_sdk:claude-sonnet-4-20250514  no_hedging        True      regex
         q2 claude_agent_sdk:claude-sonnet-4-20250514     clarity           5  llm_score
         q2 claude_agent_sdk:claude-sonnet-4-20250514 conciseness        True llm_binary
         q3 claude_agent_sdk:claude-sonnet-4-20250514     clarity           4  llm_score
         q3 claude_agent_sdk:claude-sonnet-4-20250514 conciseness        True llm_binary
```
### Filtering by Trait Type

```python
# Get only LLM score traits (numeric 1-5 scale)
df_scores = rubric_results.to_dataframe(trait_type="llm_score")
print(f"\nLLM score traits: {len(df_scores)} rows")
if len(df_scores) > 0:
    print(df_scores[["question_id", "answering_model", "trait_name", "trait_score"]].to_string(index=False))
```

```text
LLM score traits: 6 rows
question_id                           answering_model trait_name trait_score
         q1                          langchain:gpt-4o    clarity           4
         q2                          langchain:gpt-4o    clarity           5
         q3                          langchain:gpt-4o    clarity           3
         q1 claude_agent_sdk:claude-sonnet-4-20250514    clarity           5
         q2 claude_agent_sdk:claude-sonnet-4-20250514    clarity           5
         q3 claude_agent_sdk:claude-sonnet-4-20250514    clarity           4
```
### Aggregating Trait Scores

```python
# Average LLM trait scores by model
avg_by_model = rubric_results.aggregate_llm_traits(strategy="mean", by="answering_model")
print("Average LLM trait scores by model:")
for model, traits in avg_by_model.items():
    print(f"  {model}:")
    for trait, score in traits.items():
        print(f"    {trait}: {score:.1f}")
```

```text
Average LLM trait scores by model:
  claude_agent_sdk:claude-sonnet-4-20250514:
    clarity: 4.7
    conciseness: 1.0
  langchain:gpt-4o:
    clarity: 4.0
    conciseness: 0.7
```
### Trait Summary

```python
trait_summary = rubric_results.get_trait_summary()
print(f"Results with rubric data: {trait_summary['num_results']}")
print(f"LLM traits: {trait_summary['llm_traits']}")
print(f"Regex traits: {trait_summary['regex_traits']}")
print(f"Callable traits: {trait_summary['callable_traits']}")
```

```text
Results with rubric data: 6
LLM traits: ['clarity', 'conciseness']
Regex traits: ['no_hedging']
Callable traits: []
```
## Deep Judgment DataFrames

`JudgmentResults` handles deep judgment data, creating one row per (attribute × excerpt) pair. This is the most granular DataFrame; use it when deep judgment is enabled in your verification configuration.

```python
# Access judgment results (empty if deep judgment was not enabled)
judgment_results = results.get_judgment_results()
print(f"Results with deep judgment: {len(judgment_results.get_results_with_judgment())}")
```

```text
Results with deep judgment: 0
```
When deep judgment is enabled, the DataFrame provides columns for excerpt text, confidence scores, similarity scores, hallucination risk, and reasoning traces.
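Once you have the judgment DataFrame, flagging risky excerpts is a plain pandas filter. A minimal sketch: the column names below (`attribute`, `excerpt_text`, `hallucination_risk`) are hypothetical placeholders for whatever `JudgmentResults.to_dataframe()` actually emits, so verify them against your real columns.

```python
import pandas as pd

# Hypothetical judgment DataFrame; actual column names come from
# JudgmentResults.to_dataframe() when deep judgment is enabled
judgment_df = pd.DataFrame(
    {
        "question_id": ["q1", "q1", "q2"],
        "attribute": ["capital", "capital", "result"],
        "excerpt_text": ["Paris is the capital", "of France", "The answer is 42"],
        "hallucination_risk": [0.1, 0.8, 0.05],
    }
)

# Keep only excerpts whose hallucination risk exceeds a chosen threshold
risky = judgment_df[judgment_df["hallucination_risk"] > 0.5]
print(risky[["question_id", "attribute", "excerpt_text"]])
```

The threshold (0.5 here) is arbitrary; tune it to your tolerance for false positives.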
### Including Deep Judgment in Rubric DataFrames

You can also include deep judgment columns in rubric DataFrames:

```python
# Include trait reasoning and excerpts in rubric DataFrame
rubric_with_dj = results.get_rubrics_results(include_deep_judgment=True)
df = rubric_with_dj.to_dataframe()

# When deep judgment is enabled, additional columns appear:
# trait_reasoning, trait_excerpts, trait_hallucination_risk
print(f"Rubric DataFrame columns: {len(df.columns)}")
```

```text
Rubric DataFrame columns: 21
```
## Raw Pandas Analysis

The built-in helpers cover common cases, but the same aggregations are available directly through pandas on the exported DataFrame:

```python
# Template pass rates by model
template_df = results.get_template_results().to_dataframe()
model_pass = template_df.drop_duplicates(subset=["result_index"]).groupby("answering_model")["verify_result"].mean()
print("Template pass rate by model:")
print(model_pass.to_string())
```

```text
Template pass rate by model:
answering_model
claude_agent_sdk:claude-sonnet-4-20250514    1.000000
langchain:gpt-4o                             0.666667
```
### Question Difficulty

Identify which questions are hardest by looking at pass rates across all models:

```python
question_pass = (
    template_df.drop_duplicates(subset=["result_index"])
    .groupby("question_id")["verify_result"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "pass_rate", "count": "num_runs"})
    .sort_values("pass_rate")
)
print("\nQuestion difficulty (sorted by pass rate):")
print(question_pass.to_string())
```

```text
Question difficulty (sorted by pass rate):
             pass_rate  num_runs
question_id
q3                 0.5         2
q1                 1.0         2
q2                 1.0         2
```
## Exporting to CSV

```python
import os
import tempfile

# Export template results to CSV
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False, mode="w") as f:
    template_df.to_csv(f.name, index=False)
    print(f"Exported {len(template_df)} rows to CSV")

os.unlink(f.name)
```

```text
Exported 6 rows to CSV
```
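After exporting, it can be worth reading the file back to confirm the round trip preserved the data. A minimal self-contained sketch (the two-column `template_df` below is a stand-in for the real exported frame):

```python
import os
import tempfile

import pandas as pd

# Minimal stand-in for the template DataFrame exported above
template_df = pd.DataFrame({"question_id": ["q1", "q2"], "field_match": [True, False]})

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False, mode="w") as f:
    template_df.to_csv(f.name, index=False)

# Read the file back and confirm nothing was lost in the round trip
restored = pd.read_csv(f.name)
assert restored.shape == template_df.shape
os.unlink(f.name)
```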
## Result Access Methods Summary

All three result types share a consistent interface:

| Method | TemplateResults | RubricResults | JudgmentResults |
|---|---|---|---|
| `to_dataframe()` | field-level | trait-level | attribute × excerpt |
| `filter()` | by model, question, pass/fail | by model, question | by model, question, search |
| `group_by_question()` | dict of TemplateResults | dict of RubricResults | dict of JudgmentResults |
| `group_by_model()` | dict of TemplateResults | dict of RubricResults | dict of JudgmentResults |
| `get_*_summary()` | template stats | trait inventory | judgment stats |
The `VerificationResultSet` itself provides higher-level operations:

- `filter()` — filter by question IDs, models, completion status, etc.
- `group_by_question()` / `group_by_model()` / `group_by_replicate()` — group results
- `get_summary()` — comprehensive statistics including pass rates, token usage, and tool usage
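On the pandas side, the template and rubric DataFrames share `question_id` and `answering_model` columns (as the outputs above show), so they can be joined for cross-cutting analysis. A minimal sketch with stand-in frames; the real ones come from the accessors' `to_dataframe()` methods:

```python
import pandas as pd

# Stand-in for the template DataFrame (field pass/fail per question and model)
template_df = pd.DataFrame(
    {
        "question_id": ["q1", "q3"],
        "answering_model": ["langchain:gpt-4o"] * 2,
        "field_match": [True, False],
    }
)

# Stand-in for the rubric DataFrame (trait scores per question and model)
rubric_df = pd.DataFrame(
    {
        "question_id": ["q1", "q3"],
        "answering_model": ["langchain:gpt-4o"] * 2,
        "trait_name": ["clarity", "clarity"],
        "trait_score": [4, 3],
    }
)

# Join on the shared keys to relate template pass/fail to rubric scores
merged = template_df.merge(rubric_df, on=["question_id", "answering_model"])
print(merged[["question_id", "field_match", "trait_score"]])
```

This lets you ask questions neither frame answers alone, such as whether failed template verifications also receive lower rubric scores.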
## Next Steps
- VerificationResult Structure — understand the complete result hierarchy
- Exporting Results — save results to JSON, CSV, or files
- Iterating on Benchmarks — use analysis to improve templates and rubrics
- Running Verification — how to generate results