# DataFrame Analysis
Karenina provides a DataFrame-first approach for analyzing verification results. By converting results to pandas DataFrames, you can use the full power of pandas for filtering, grouping, aggregation, and visualization.
## Overview

After running verification, you receive a `VerificationResultSet` containing all results. The result set provides three specialized accessors that convert results to pandas DataFrames:
| Accessor | Returns | Rows Represent |
|---|---|---|
| `get_template_results()` | `TemplateResults` | One row per parsed field comparison |
| `get_rubrics_results()` | `RubricResults` | One row per rubric trait evaluated |
| `get_judgment_results()` | `JudgmentResults` | One row per (attribute × excerpt) pair |
Each accessor returns a wrapper object with a `.to_dataframe()` method plus filtering, grouping, and aggregation helpers.
## Getting Started
The basic workflow is: extract a result type, convert to DataFrame, analyze with pandas.
```python
# Extract template results and convert to DataFrame
template_results = results.get_template_results()
df = template_results.to_dataframe()

print(f"DataFrame shape: {df.shape}")
print(f"Columns: {list(df.columns[:8])}...")
```

```text
DataFrame shape: (6, 34)
Columns: ['completed_without_errors', 'error', 'recursion_limit_reached', 'question_id', 'template_id', 'question_text', 'keywords', 'replicate']...
```
## Template DataFrames
`TemplateResults` provides three DataFrame methods:

| Method | Exploded By | Use Case |
|---|---|---|
| `to_dataframe()` | Parsed fields | Field-level pass/fail analysis |
| `to_regex_dataframe()` | Regex patterns | Format compliance analysis |
| `to_usage_dataframe()` | Token usage stages | Cost analysis |
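The usage DataFrame lends itself to cost analysis with ordinary pandas grouping. As a pure-pandas sketch: the column names below (`stage`, `input_tokens`, `output_tokens`) are hypothetical stand-ins for whatever `to_usage_dataframe()` actually returns, so check your real column list first.

```python
import pandas as pd

# Hypothetical usage DataFrame; real column names come from to_usage_dataframe()
usage_df = pd.DataFrame(
    {
        "answering_model": ["langchain:gpt-4o"] * 2
        + ["claude_agent_sdk:claude-sonnet-4-20250514"] * 2,
        "stage": ["answering", "parsing"] * 2,
        "input_tokens": [1200, 800, 1100, 750],
        "output_tokens": [300, 150, 280, 140],
    }
)

# Total token spend per model, summed across all stages
totals = usage_df.groupby("answering_model")[["input_tokens", "output_tokens"]].sum()
print(totals)
```

The same groupby works unchanged on the real usage DataFrame once you substitute its actual column names.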
### Field-Level Analysis
The main template DataFrame creates one row per parsed field, enabling field-level comparison between ground truth and LLM-extracted values.
```python
template_results = results.get_template_results()
df = template_results.to_dataframe()

# Key columns for field-level analysis
print("Field comparison columns:")
print(
    df[["question_id", "answering_model", "field_name", "gt_value", "llm_value", "field_match"]].to_string(index=False)
)
```

```text
Field comparison columns:
question_id                           answering_model field_name gt_value llm_value field_match
         q1                          langchain:gpt-4o    capital    Paris     Paris        True
         q2                          langchain:gpt-4o     result       42        42        True
         q3                          langchain:gpt-4o    element   Oxygen  Nitrogen       False
         q1 claude_agent_sdk:claude-sonnet-4-20250514    capital    Paris     Paris        True
         q2 claude_agent_sdk:claude-sonnet-4-20250514     result       42        42        True
         q3 claude_agent_sdk:claude-sonnet-4-20250514    element   Oxygen    Oxygen        True
```
### Pass Rate by Model
A common analysis pattern: calculate template verification pass rates grouped by answering model.
```python
# Use the built-in aggregation helper
pass_rates = template_results.aggregate_pass_rate(by="answering_model")
print("Pass rate by model:")
for model, rate in pass_rates.items():
    print(f"  {model}: {rate:.0%}")
```

```text
Pass rate by model:
  claude_agent_sdk:claude-sonnet-4-20250514: 100%
  langchain:gpt-4o: 67%
```
### Pass Rate by Question

```python
pass_rates_by_q = template_results.aggregate_pass_rate(by="question_id")
print("Pass rate by question:")
for qid, rate in pass_rates_by_q.items():
    print(f"  {qid}: {rate:.0%}")
```

```text
Pass rate by question:
  q1: 100%
  q2: 100%
  q3: 50%
```
### Filtering Results

`TemplateResults` supports filtering before DataFrame conversion:

```python
# Filter to only failed results
failed = template_results.filter(failed_only=True)
df_failed = failed.to_dataframe()
print(f"Failed results: {len(failed)} (fields in DataFrame: {len(df_failed)})")

# Filter by model (use the full display string: "interface:model_name")
gpt_results = template_results.filter(answering_models=["langchain:gpt-4o"])
print(f"GPT-4o results: {len(gpt_results)}")
```

```text
Failed results: 1 (fields in DataFrame: 1)
GPT-4o results: 3
```
### Summary Statistics

```python
summary = template_results.get_template_summary()
print(f"Total results: {summary['num_results']}")
print(f"Passed: {summary['num_passed']}, Failed: {summary['num_failed']}")
print(f"Pass rate: {summary['pass_rate']:.0%}")
print(f"Unique questions: {summary['num_questions']}")
```

```text
Total results: 6
Passed: 5, Failed: 1
Pass rate: 83%
Unique questions: 3
```
## Rubric DataFrames

`RubricResults` converts rubric evaluation scores to DataFrames, with one row per trait evaluated. It supports filtering by trait type.
### Trait Type Filtering

The `to_dataframe()` method accepts a `trait_type` parameter:

| Value | Includes |
|---|---|
| `"all"` | All trait types (default) |
| `"llm"` | All LLM traits (score + binary + literal) |
| `"llm_score"` | LLM score traits only (1-5 scale) |
| `"llm_binary"` | LLM binary traits only (True/False) |
| `"llm_literal"` | LLM literal traits only (categorical) |
| `"regex"` | Regex traits |
| `"callable"` | Callable traits |
| `"metric"` | Metric traits (exploded by metric name) |
```python
rubric_results = results.get_rubrics_results()
df_all = rubric_results.to_dataframe()
print("All rubric traits:")
print(df_all[["question_id", "answering_model", "trait_name", "trait_score", "trait_type"]].to_string(index=False))
```

```text
All rubric traits:
question_id                           answering_model  trait_name trait_score trait_type
         q1                          langchain:gpt-4o     clarity           4  llm_score
         q1                          langchain:gpt-4o conciseness        True llm_binary
         q1                          langchain:gpt-4o  no_hedging        True      regex
         q2                          langchain:gpt-4o     clarity           5  llm_score
         q2                          langchain:gpt-4o conciseness        True llm_binary
         q3                          langchain:gpt-4o     clarity           3  llm_score
         q3                          langchain:gpt-4o conciseness       False llm_binary
         q1 claude_agent_sdk:claude-sonnet-4-20250514     clarity           5  llm_score
         q1 claude_agent_sdk:claude-sonnet-4-20250514 conciseness        True llm_binary
         q1 claude_agent_sdk:claude-sonnet-4-20250514  no_hedging        True      regex
         q2 claude_agent_sdk:claude-sonnet-4-20250514     clarity           5  llm_score
         q2 claude_agent_sdk:claude-sonnet-4-20250514 conciseness        True llm_binary
         q3 claude_agent_sdk:claude-sonnet-4-20250514     clarity           4  llm_score
         q3 claude_agent_sdk:claude-sonnet-4-20250514 conciseness        True llm_binary
```
### Filtering by Trait Type

```python
# Get only LLM score traits (numeric 1-5 scale)
df_scores = rubric_results.to_dataframe(trait_type="llm_score")
print(f"\nLLM score traits: {len(df_scores)} rows")
if len(df_scores) > 0:
    print(df_scores[["question_id", "answering_model", "trait_name", "trait_score"]].to_string(index=False))
```

```text
LLM score traits: 6 rows
question_id                           answering_model trait_name trait_score
         q1                          langchain:gpt-4o    clarity           4
         q2                          langchain:gpt-4o    clarity           5
         q3                          langchain:gpt-4o    clarity           3
         q1 claude_agent_sdk:claude-sonnet-4-20250514    clarity           5
         q2 claude_agent_sdk:claude-sonnet-4-20250514    clarity           5
         q3 claude_agent_sdk:claude-sonnet-4-20250514    clarity           4
```
### Aggregating Trait Scores

```python
# Average LLM trait scores by model
avg_by_model = rubric_results.aggregate_llm_traits(strategy="mean", by="answering_model")
print("Average LLM trait scores by model:")
for model, traits in avg_by_model.items():
    print(f"  {model}:")
    for trait, score in traits.items():
        print(f"    {trait}: {score:.1f}")
```

```text
Average LLM trait scores by model:
  claude_agent_sdk:claude-sonnet-4-20250514:
    clarity: 4.7
    conciseness: 1.0
  langchain:gpt-4o:
    clarity: 4.0
    conciseness: 0.7
```
### Trait Summary

```python
trait_summary = rubric_results.get_trait_summary()
print(f"Results with rubric data: {trait_summary['num_results']}")
print(f"LLM traits: {trait_summary['llm_traits']}")
print(f"Regex traits: {trait_summary['regex_traits']}")
print(f"Callable traits: {trait_summary['callable_traits']}")
```

```text
Results with rubric data: 6
LLM traits: ['clarity', 'conciseness']
Regex traits: ['no_hedging']
Callable traits: []
```
## Deep Judgment DataFrames

`JudgmentResults` handles deep judgment data, creating one row per (attribute × excerpt) pair. This is the most granular DataFrame; use it when deep judgment is enabled in your verification configuration.

```python
# Access judgment results (empty if deep judgment was not enabled)
judgment_results = results.get_judgment_results()
print(f"Results with deep judgment: {len(judgment_results.get_results_with_judgment())}")
```

```text
Results with deep judgment: 0
```
When deep judgment is enabled, the DataFrame provides columns for excerpt text, confidence scores, similarity scores, hallucination risk, and reasoning traces.
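Once you have the judgment DataFrame, flagging risky excerpts is a plain pandas filter. A minimal sketch: the column names below (`attribute`, `excerpt_text`, `hallucination_risk`) are hypothetical placeholders for whatever `JudgmentResults.to_dataframe()` actually emits, so verify them against your real columns.

```python
import pandas as pd

# Hypothetical judgment DataFrame; actual column names come from
# JudgmentResults.to_dataframe() when deep judgment is enabled
judgment_df = pd.DataFrame(
    {
        "question_id": ["q1", "q1", "q2"],
        "attribute": ["capital", "capital", "result"],
        "excerpt_text": ["Paris is the capital", "of France", "The answer is 42"],
        "hallucination_risk": [0.1, 0.8, 0.05],
    }
)

# Keep only excerpts whose hallucination risk exceeds a chosen threshold
risky = judgment_df[judgment_df["hallucination_risk"] > 0.5]
print(risky[["question_id", "attribute", "excerpt_text"]])
```

The threshold (0.5 here) is arbitrary; tune it to your tolerance for false positives.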
### Including Deep Judgment in Rubric DataFrames

You can also include deep judgment columns in rubric DataFrames:

```python
# Include trait reasoning and excerpts in rubric DataFrame
rubric_with_dj = results.get_rubrics_results(include_deep_judgment=True)
df = rubric_with_dj.to_dataframe()

# When deep judgment is enabled, additional columns appear:
# trait_reasoning, trait_excerpts, trait_hallucination_risk
print(f"Rubric DataFrame columns: {len(df.columns)}")
```

```text
Rubric DataFrame columns: 21
```
## Raw Pandas Analysis

The built-in helpers cover common cases, but the same aggregations are available directly through pandas on the exported DataFrame:

```python
# Template pass rates by model
template_df = results.get_template_results().to_dataframe()
model_pass = template_df.drop_duplicates(subset=["result_index"]).groupby("answering_model")["verify_result"].mean()
print("Template pass rate by model:")
print(model_pass.to_string())
```

```text
Template pass rate by model:
answering_model
claude_agent_sdk:claude-sonnet-4-20250514    1.000000
langchain:gpt-4o                             0.666667
```
### Question Difficulty

Identify which questions are hardest by looking at pass rates across all models:

```python
question_pass = (
    template_df.drop_duplicates(subset=["result_index"])
    .groupby("question_id")["verify_result"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "pass_rate", "count": "num_runs"})
    .sort_values("pass_rate")
)
print("\nQuestion difficulty (sorted by pass rate):")
print(question_pass.to_string())
```

```text
Question difficulty (sorted by pass rate):
             pass_rate  num_runs
question_id
q3                 0.5         2
q1                 1.0         2
q2                 1.0         2
```
## Exporting to CSV

```python
import os
import tempfile

# Export template results to CSV
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False, mode="w") as f:
    template_df.to_csv(f.name, index=False)
    print(f"Exported {len(template_df)} rows to CSV")

os.unlink(f.name)
```

```text
Exported 6 rows to CSV
```
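After exporting, it can be worth reading the file back to confirm the round trip preserved the data. A minimal self-contained sketch (the two-column `template_df` below is a stand-in for the real exported frame):

```python
import os
import tempfile

import pandas as pd

# Minimal stand-in for the template DataFrame exported above
template_df = pd.DataFrame({"question_id": ["q1", "q2"], "field_match": [True, False]})

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False, mode="w") as f:
    template_df.to_csv(f.name, index=False)

# Read the file back and confirm nothing was lost in the round trip
restored = pd.read_csv(f.name)
assert restored.shape == template_df.shape
os.unlink(f.name)
```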
## Result Access Methods Summary

All three result types share a consistent interface:

| Method | TemplateResults | RubricResults | JudgmentResults |
|---|---|---|---|
| `to_dataframe()` | field-level | trait-level | attribute × excerpt |
| `filter()` | by model, question, pass/fail | by model, question | by model, question, search |
| `group_by_question()` | dict of TemplateResults | dict of RubricResults | dict of JudgmentResults |
| `group_by_model()` | dict of TemplateResults | dict of RubricResults | dict of JudgmentResults |
| `get_*_summary()` | template stats | trait inventory | judgment stats |
The `VerificationResultSet` itself provides higher-level operations:

- `filter()` — filter by question IDs, models, completion status, etc.
- `group_by_question()` / `group_by_model()` / `group_by_replicate()` — group results
- `get_summary()` — comprehensive statistics including pass rates, token usage, and tool usage
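On the pandas side, the template and rubric DataFrames share `question_id` and `answering_model` columns (as the outputs above show), so they can be joined for cross-cutting analysis. A minimal sketch with stand-in frames; the real ones come from the accessors' `to_dataframe()` methods:

```python
import pandas as pd

# Stand-in for the template DataFrame (field pass/fail per question and model)
template_df = pd.DataFrame(
    {
        "question_id": ["q1", "q3"],
        "answering_model": ["langchain:gpt-4o"] * 2,
        "field_match": [True, False],
    }
)

# Stand-in for the rubric DataFrame (trait scores per question and model)
rubric_df = pd.DataFrame(
    {
        "question_id": ["q1", "q3"],
        "answering_model": ["langchain:gpt-4o"] * 2,
        "trait_name": ["clarity", "clarity"],
        "trait_score": [4, 3],
    }
)

# Join on the shared keys to relate template pass/fail to rubric scores
merged = template_df.merge(rubric_df, on=["question_id", "answering_model"])
print(merged[["question_id", "field_match", "trait_score"]])
```

This lets you ask questions neither frame answers alone, such as whether failed template verifications also receive lower rubric scores.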
## Next Steps
- VerificationResult Structure — understand the complete result hierarchy
- Exporting Results — save results to JSON, CSV, or files
- Iterating on Benchmarks — use analysis to improve templates and rubrics
- Running Verification — how to generate results