Results and Scoring¶
Every question that passes through the verification pipeline produces a VerificationResult: a nested Pydantic model that captures everything that happened during evaluation, from the raw response to the final pass/fail verdict. This page explains the result data model, how scoring works, how to access and aggregate results, and how to export them for analysis.
1. What Results Capture¶
The most important idea is: a result is a complete evidence record, not just a score. It preserves every intermediate artifact (the raw response, the parsed fields, each optional check's outcome, every rubric trait score) so that downstream analysis can always trace a verdict back to its inputs. Nothing is discarded.
A single VerificationResult corresponds to one question evaluated by one answering model and parsed by one judge model, in one replicate. If you evaluate 10 questions with 2 answering models and 3 replicates, you get 60 results.
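The result count is a plain cross product of questions, answering models, and replicates; a quick sketch with `itertools.product`:

```python
from itertools import product

questions = [f"q{i}" for i in range(1, 11)]  # 10 questions
models = ["model_a", "model_b"]              # 2 answering models
replicates = [1, 2, 3]                       # 3 replicates

# One VerificationResult per (question, answering model, replicate) combination
combos = list(product(questions, models, replicates))
print(len(combos))  # 60
```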
1.1. Result Structure at a Glance¶
VerificationResult uses nested composition: five sub-objects — metadata is always present, the other four are populated only when the corresponding evaluation ran — each grouping a coherent slice of the evidence. This is the only access path; flat property accessors do not exist.
VerificationResult
├── metadata ← Always present: identification, timing, model info
├── template ← Present when template evaluation ran
├── rubric ← Present when rubric evaluation ran
├── deep_judgment ← Present when deep judgment ran (templates)
├── deep_judgment_rubric ← Present when deep judgment ran (rubrics)
│
│ (Root-level fields for MCP agent trace filtering)
├── evaluation_input ← The text passed to evaluation stages
├── used_full_trace ← Whether the full agent trace was used
└── trace_extraction_error ← Error if final AI message extraction failed
Access fields through their sub-objects:
# Correct: nested access
print(result.metadata.question_id)
print(result.template.verify_result)
print(result.rubric.llm_trait_scores)
q1
True
{'safety': True, 'clarity': 4}
# Wrong: flat access (removed, will raise AttributeError)
try:
    result.question_id
except AttributeError as e:
    print(f"AttributeError: {e}")
AttributeError: 'VerificationResult' object has no attribute 'question_id'
2. Metadata: Identity and Execution Context¶
Every result carries a VerificationResultMetadata sub-object regardless of evaluation mode. It identifies what was evaluated, by which models, and when.
| Field | Type | Description |
|---|---|---|
| `question_id` | `str` | MD5 hash of the question text (32-char hex) |
| `question_text` | `str` | Full question text |
| `raw_answer` | `str \| None` | Human-readable ground truth from the checkpoint |
| `template_id` | `str` | MD5 hash of the template code, or `"no_template"` |
| `answering` | `ModelIdentity` | Answering model (interface, model_name, tools) |
| `parsing` | `ModelIdentity` | Parsing/judge model (interface, model_name) |
| `answering_system_prompt` | `str \| None` | System prompt used for the answering model |
| `parsing_system_prompt` | `str \| None` | System prompt used for the parsing model |
| `execution_time` | `float` | Pipeline execution time in seconds |
| `timestamp` | `str` | ISO timestamp of when the result was produced |
| `result_id` | `str` | Deterministic 16-character SHA256 hash (see below) |
| `run_name` | `str \| None` | Organizing label for verification runs |
| `replicate` | `int \| None` | Replicate number (1, 2, 3, ...) for repeated runs |
| `keywords` | `list[str] \| None` | Keywords associated with the question |
| `completed_without_errors` | `bool` | Whether the pipeline ran without errors |
| `error` | `str \| None` | Error message if something went wrong |
| `few_shot_enabled` | `bool` | Whether few-shot prompting was active (default `False`) |
| `few_shot_example_count` | `int` | Number of few-shot examples used (default 0) |
| `evaluation_mode` | `str \| None` | Evaluation mode used (e.g., `"template_only"`, `"template_and_rubric"`) |
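Per the table, `question_id` is the MD5 hexdigest of the question text; a minimal sketch of how such an ID could be derived (the helper name is illustrative, not the library's API):

```python
import hashlib

def make_question_id(question_text: str) -> str:
    # MD5 of the UTF-8 question text yields a stable 32-character hex identifier
    return hashlib.md5(question_text.encode("utf-8")).hexdigest()

qid = make_question_id("What is the putative target of venetoclax?")
print(len(qid))  # 32
```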
meta = result.metadata
print(f"Question: {meta.question_id}")
print(f"Model: {meta.answering.display_string}")
print(f"Judge: {meta.parsing.display_string}")
print(f"Time: {meta.execution_time}s")
print(f"Result ID: {meta.result_id}")
print(f"Replicate: {meta.replicate}")
print(f"Success: {meta.completed_without_errors}")
Question: q1
Model: langchain:claude-sonnet-4-5-20250514
Judge: langchain:claude-haiku-4-5-20251001
Time: 1.2s
Result ID: e6972c44b765e7c3
Replicate: 1
Success: True
2.1. ModelIdentity¶
Models are identified by a composite ModelIdentity object, not a plain string. This distinguishes the same model used with different interfaces or MCP tool sets:
| Field | Description |
|---|---|
| `interface` | The adapter interface (e.g., `"langchain"`, `"claude_sdk"`) |
| `model_name` | The model name (e.g., `"claude-sonnet-4-6"`) |
| `tools` | Sorted list of MCP server names (answering models only; always `[]` for parsing) |
identity = result.metadata.answering
print(f"Interface: {identity.interface}")
print(f"Model name: {identity.model_name}")
print(f"Tools: {identity.tools}")
print(f"Display string: {identity.display_string}")
print(f"Canonical key: {identity.canonical_key}")
Interface: langchain
Model name: claude-sonnet-4-5-20250514
Tools: []
Display string: langchain:claude-sonnet-4-5-20250514
Canonical key: langchain:claude-sonnet-4-5-20250514:
2.2. Deterministic Result IDs¶
Each result gets a result_id: a 16-character SHA256 hash computed from (question_id, answering, parsing, timestamp, replicate). The same inputs always produce the same ID, enabling deduplication across runs. The ID is computed by VerificationResultMetadata.compute_result_id().
# Same inputs always produce the same ID
id1 = VerificationResultMetadata.compute_result_id(
    question_id="q1", answering=answering_a, parsing=parsing,
    timestamp="2025-06-15T10:00:00Z", replicate=1,
)
id2 = VerificationResultMetadata.compute_result_id(
    question_id="q1", answering=answering_a, parsing=parsing,
    timestamp="2025-06-15T10:00:00Z", replicate=1,
)
print(f"ID 1: {id1}")
print(f"ID 2: {id2}")
print(f"Match: {id1 == id2}")
print(f"Length: {len(id1)} characters")
ID 1: e6972c44b765e7c3
ID 2: e6972c44b765e7c3
Match: True
Length: 16 characters
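The exact serialization inside `compute_result_id()` is internal to the library; this sketch only assumes the five inputs are joined deterministically and hashed with SHA-256, truncated to 16 hex characters:

```python
import hashlib

def result_id_sketch(question_id: str, answering: str, parsing: str,
                     timestamp: str, replicate: int) -> str:
    # Deterministic: the same five inputs always hash to the same ID
    payload = "|".join([question_id, answering, parsing, timestamp, str(replicate)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

a = result_id_sketch("q1", "langchain:model-a", "langchain:judge",
                     "2025-06-15T10:00:00Z", 1)
b = result_id_sketch("q1", "langchain:model-a", "langchain:judge",
                     "2025-06-15T10:00:00Z", 1)
print(a == b, len(a))  # True 16
```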
3. Template Results: The Correctness Record¶
The template sub-object (VerificationResultTemplate) is present whenever template evaluation ran (template_only or template_and_rubric evaluation modes). It records the full chain from raw response to pass/fail verdict.
3.1. The Primary Correctness Signal: verify_result¶
verify_result is a bool | None that captures whether the template's verify() method returned True. This is the core correctness output. When template evaluation did not run (e.g., rubric_only mode), this field is None.
Several pipeline stages can override this value before finalization:
| Stage | Override Behavior |
|---|---|
| Abstention check | Sets verify_result to False if the model refused to answer |
| Sufficiency check | Sets verify_result to False if the response lacks sufficient information |
| Embedding check | Can override verify_result based on semantic similarity threshold |
The corresponding *_override_applied boolean fields record whether an override occurred, so you can always distinguish "failed on its own merits" from "overridden by a guard stage."
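The override pattern can be sketched as follows (a hypothetical helper — the real stages mutate the result inside the pipeline context, but the control flow is the same idea):

```python
def apply_guards(verify_result, abstention_detected, sufficiency_failed):
    """Sketch of the guard-stage override pattern.

    Returns the final verdict plus flags recording which guard, if any,
    overrode the template's own verify() outcome.
    """
    abstention_override_applied = False
    sufficiency_override_applied = False
    if abstention_detected:
        verify_result = False
        abstention_override_applied = True
    if sufficiency_failed:
        verify_result = False
        sufficiency_override_applied = True
    return verify_result, abstention_override_applied, sufficiency_override_applied

# A refusal forces the verdict to False even if verify() returned True
final, abst, suff = apply_guards(True, abstention_detected=True, sufficiency_failed=False)
print(final, abst, suff)  # False True False
```

The `*_override_applied` flags are what let analysis distinguish a genuine template failure from a guard-stage override.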
tmpl = result.template
print(f"verify_result: {tmpl.verify_result}")
print(f"Parsed LLM response: {tmpl.parsed_llm_response}")
print(f"Parsed ground truth: {tmpl.parsed_gt_response}")
print(f"Abstention check performed: {tmpl.abstention_check_performed}")
print(f"Embedding check performed: {tmpl.embedding_check_performed}")
print(f"Embedding similarity score: {tmpl.embedding_similarity_score}")
verify_result: True
Parsed LLM response: {'target': 'BCL2'}
Parsed ground truth: {'target': 'BCL2'}
Abstention check performed: False
Embedding check performed: True
Embedding similarity score: 0.92
3.2. Response and Parsing Artifacts¶
| Field | Type | Description |
|---|---|---|
| `raw_llm_response` | `str` | The answering model's full text response |
| `trace_messages` | `list[dict]` | Structured message trace (for MCP agent runs) |
| `parsed_llm_response` | `dict \| None` | Fields extracted by the Judge LLM |
| `parsed_gt_response` | `dict \| None` | Ground truth parsed into the same template fields |
| `verify_granular_result` | `Any \| None` | Per-field verification detail (if `verify_granular()` is implemented) |
| `field_verification_error` | `str \| None` | Error message if `verify()` raised an exception (non-fatal) |
| `field_results` | `dict[str, bool] \| None` | Per-field primitive verification results (from `_compute_field_results()`) |
| `composition_strategy` | `str \| None` | Composition strategy used: `"all_of"`, `"any_of"`, or `"at_least_n(N)"` |
3.3. Optional Check Results¶
Each optional check records three pieces of state: whether it was attempted, what it found, and whether it overrode the verdict.
| Check | Key Fields |
|---|---|
| Abstention | abstention_check_performed, abstention_detected, abstention_override_applied, abstention_reasoning |
| Sufficiency | sufficiency_check_performed, sufficiency_detected, sufficiency_override_applied, sufficiency_reasoning |
| Embedding | embedding_check_performed, embedding_similarity_score (0.0 to 1.0), embedding_override_applied, embedding_model_used |
| Regex | regex_validations_performed, regex_validation_results (per-pattern dict), regex_overall_success, regex_extraction_results |
3.4. Execution Metadata¶
| Field | Type | Description |
|---|---|---|
| `recursion_limit_reached` | `bool` | Whether an MCP agent hit its recursion limit |
| `answering_mcp_servers` | `list[str] \| None` | MCP servers attached to the answering model |
| `usage_metadata` | `dict \| None` | Token usage breakdown by stage (answer_generation, parsing, rubric_evaluation, abstention_check, total) |
| `agent_metrics` | `dict \| None` | MCP agent metrics: iterations, tool_calls, tools_used, suspect_failed_tool_calls, suspect_failed_tools |
4. Rubric Results: The Quality Record¶
The rubric sub-object (VerificationResultRubric) is present whenever rubric evaluation ran (template_and_rubric or rubric_only modes). Trait scores are split by type into separate dictionaries, all keyed by trait name.
| Field | Type | Description |
|---|---|---|
| `llm_trait_scores` | `dict[str, int \| bool] \| None` | LLM-evaluated traits (boolean or 1-5 scale) |
| `llm_trait_labels` | `dict[str, str] \| None` | Class labels for literal-kind LLM traits (index-to-name mapping) |
| `regex_trait_scores` | `dict[str, bool] \| None` | Regex trait pass/fail results |
| `callable_trait_scores` | `dict[str, bool \| int] \| None` | Callable trait results |
| `metric_trait_scores` | `dict[str, dict[str, float]] \| None` | Metric trait metrics (precision, recall, F1, etc.) |
| `metric_trait_confusion_lists` | `dict[str, dict[str, list[str]]] \| None` | Per-metric confusion lists (tp, tn, fp, fn containing excerpts) |
| `rubric_evaluation_strategy` | `str \| None` | `"batch"` or `"sequential"` |
4.1. Accessing Trait Scores¶
# Individual trait types
print("LLM trait (boolean):", result.rubric.llm_trait_scores["safety"])
print("LLM trait (score): ", result.rubric.llm_trait_scores["clarity"])
print("Regex trait: ", result.rubric.regex_trait_scores["has_citations"])
print("Callable trait: ", result.rubric.callable_trait_scores["under_150w"])
LLM trait (boolean): True
LLM trait (score): 4
Regex trait: True
Callable trait: True
# Literal-kind LLM traits: score is the class index, label is the class name
print("Literal label:", result.rubric.llm_trait_labels["response_type"])
Literal label: Factual
# Metric traits: nested dict of float metrics
print("Metric scores:", result.rubric.metric_trait_scores["drug_coverage"])
Metric scores: {'tp': 3.0, 'fn': 1.0, 'fp': 0.0, 'precision': 1.0, 'recall': 0.75, 'f1': 0.857}
# Confusion lists for metric traits: which items were found/missed
print("Confusion lists:", result.rubric.metric_trait_confusion_lists["drug_coverage"])
Confusion lists: {'tp': ['aspirin', 'ibuprofen', 'acetaminophen'], 'fn': ['naproxen'], 'fp': [], 'tn': []}
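Assuming the standard definitions, the float metrics in `metric_trait_scores` follow directly from the confusion lists:

```python
def metrics_from_confusion(tp: int, fp: int, fn: int) -> dict:
    # precision = tp / (tp + fp); recall = tp / (tp + fn); F1 is their harmonic mean
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": round(f1, 3)}

confusion = {"tp": ["aspirin", "ibuprofen", "acetaminophen"], "fn": ["naproxen"], "fp": []}
m = metrics_from_confusion(len(confusion["tp"]), len(confusion["fp"]), len(confusion["fn"]))
print(m)  # {'precision': 1.0, 'recall': 0.75, 'f1': 0.857}
```

This reproduces the `drug_coverage` values shown above: three drugs found, one missed, none hallucinated.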
# Flat access across all types
all_scores = result.rubric.get_all_trait_scores()
print("All trait scores:", all_scores)
All trait scores: {'safety': True, 'clarity': 4, 'has_citations': True, 'under_150w': True, 'drug_coverage': {'tp': 3.0, 'fn': 1.0, 'fp': 0.0, 'precision': 1.0, 'recall': 0.75, 'f1': 0.857}}
# Look up a trait by name (returns value and type)
print("Trait lookup:", result.rubric.get_trait_by_name("safety"))
print("Trait lookup:", result.rubric.get_trait_by_name("has_citations"))
Trait lookup: (True, 'llm')
Trait lookup: (True, 'regex')
5. Deep Judgment Results (Optional)¶
When deep judgment is enabled, additional evidence-based results are captured. Deep judgment adds excerpt extraction, per-attribute reasoning, and optional hallucination risk assessment on top of standard evaluation.
5.1. Template Deep Judgment¶
The deep_judgment sub-object (VerificationResultDeepJudgment) records per-attribute evidence:
| Field | Type | Description |
|---|---|---|
| `extracted_excerpts` | `dict[str, list[dict]]` | Per-attribute verbatim passages with confidence (low/medium/high), similarity score, and optional search results |
| `attribute_reasoning` | `dict[str, str]` | LLM reasoning for each attribute (present even when no excerpts were found) |
| `hallucination_risk_assessment` | `dict[str, str]` | Risk level per attribute (none/low/medium/high); only populated when search is enabled |
| `deep_judgment_stages_completed` | `list[str]` | Which stages ran: `"excerpts"`, `"reasoning"`, `"parameters"` |
| `attributes_without_excerpts` | `list[str]` | Attributes with no corroborating excerpts |
| `deep_judgment_model_calls` | `int` | Number of LLM invocations |
5.2. Rubric Deep Judgment¶
The deep_judgment_rubric sub-object (VerificationResultDeepJudgmentRubric) records per-trait evidence for rubric traits with deep judgment enabled:
| Field | Type | Description |
|---|---|---|
| `extracted_rubric_excerpts` | `dict[str, list[dict]]` | Per-trait excerpts (only for traits with `deep_judgment_excerpt_enabled=True`) |
| `rubric_trait_reasoning` | `dict[str, str]` | Per-trait reasoning (all deep-judgment-enabled traits) |
| `deep_judgment_rubric_scores` | `dict[str, int \| bool]` | Scores from deep-judgment evaluation |
| `standard_rubric_scores` | `dict[str, int \| bool]` | Scores for non-deep-judgment traits (for comparison) |
| `traits_without_valid_excerpts` | `list[str]` | Traits that exhausted retries without valid excerpts |
| `trait_metadata` | `dict[str, dict]` | Per-trait tracking (stages completed, model calls, retry counts) |
6. How Results Vary by Evaluation Mode¶
The evaluation mode determines which sub-objects are populated:
| Sub-object | `template_only` | `template_and_rubric` | `rubric_only` |
|---|---|---|---|
| `metadata` | Always | Always | Always |
| `template` | Present | Present | `None` |
| `template.verify_result` | `bool` | `bool` | N/A |
| `rubric` | `None` | Present | Present |
| `deep_judgment` | Optional | Optional | `None` |
| `deep_judgment_rubric` | `None` | Optional | Optional |
In rubric_only mode, no template parsing occurs. The rubric trait scores evaluated against the raw response are the primary output. In template_only mode (the default), the rubric sub-object is None.
7. Working with Result Collections¶
Benchmark.run_verification() returns a VerificationResultSet: the top-level container that holds all individual results and provides specialized views, filtering, grouping, and DataFrame conversion.
7.1. Specialized Views¶
The result set provides four accessor methods, each returning a purpose-built wrapper with its own analysis API:
| Accessor Method | Returns | Purpose |
|---|---|---|
| `get_template_results()` | `TemplateResults` | Pass/fail rates, embedding scores, regex results, abstention detection, parsed responses |
| `get_rubrics_results()` | `RubricResults` | Trait scores by type, aggregation, confusion matrices |
| `get_judgment_results()` | `JudgmentResults` | Extracted excerpts, reasoning traces, hallucination risk |
| `get_rubric_judgments_results()` | `RubricJudgmentResults` | Excerpt-level explosion (one row per trait per excerpt) |
# Template analysis
template_results = result_set.get_template_results()
print(f"TemplateResults with {len(template_results)} results")
print(f"Summary: {template_results.get_template_summary()}")
TemplateResults with 4 results
Summary: {'num_results': 4, 'num_passed': 3, 'num_failed': 1, 'pass_rate': 0.75, 'num_with_embedding': 1, 'num_with_regex': 0, 'num_with_abstention': 0, 'num_questions': 2}
# Rubric analysis
rubric_results = result_set.get_rubrics_results()
print(f"RubricResults with {len(rubric_results)} results")
print(f"Summary: {rubric_results.get_trait_summary()}")
RubricResults with 4 results
Summary: {'num_results': 4, 'llm_traits': ['clarity', 'safety'], 'regex_traits': ['has_citations'], 'callable_traits': ['under_150w'], 'metric_traits': ['drug_coverage'], 'num_questions': 2}
7.2. Filtering and Grouping¶
Both VerificationResultSet and the specialized views support filtering and grouping. Filtering returns a new instance of the same type with a subset of results.
# Filter at the result set level
filtered = result_set.filter(
    question_ids=["q1"],
    completed_only=True,
)
print(f"Filtered to {len(filtered)} results (question q1 only)")
# Group by different dimensions
by_question = result_set.group_by_question()
for qid, qresults in by_question.items():
    print(f" Question {qid}: {len(qresults)} results")
Filtered to 2 results (question q1 only)
 Question q1: 2 results
 Question q2: 2 results
# Group by model
by_model = result_set.group_by_model()
for model_key, model_results in by_model.items():
    print(f" Model {model_key}: {len(model_results)} results")
 Model langchain:claude-sonnet-4-5-20250514: 2 results
 Model langchain:gpt-4.1-mini-2025-04-14: 2 results
# Specialized views also support filtering
passed = template_results.filter(passed_only=True)
failed = template_results.filter(failed_only=True)
print(f"Passed: {len(passed)}, Failed: {len(failed)}")
Passed: 3, Failed: 1
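The filter/group semantics can be illustrated with a plain stand-in model (a hypothetical `MiniResult`, not the library's classes): filtering yields a new collection holding the matching subset, while grouping buckets results by a key.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class MiniResult:  # stand-in for VerificationResult
    question_id: str
    model: str
    passed: bool

results = [
    MiniResult("q1", "model_a", True),
    MiniResult("q1", "model_b", False),
    MiniResult("q2", "model_a", True),
]

# filter(): a new collection of the same shape, containing only matching results
q1_only = [r for r in results if r.question_id == "q1"]

# group_by_question(): bucket results by question_id
by_question = defaultdict(list)
for r in results:
    by_question[r.question_id].append(r)

print(len(q1_only), sorted(by_question))  # 2 ['q1', 'q2']
```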
7.3. Iteration¶
All containers support standard Python iteration:
for r in result_set:
    print(f" {r.metadata.question_id}: verify={r.template.verify_result}, "
          f"model={r.metadata.answering.model_name}")
print(f"\nTotal results: {len(result_set)}")
print(f"First result question: {result_set[0].metadata.question_text}")
 q1: verify=True, model=claude-sonnet-4-5-20250514
 q1: verify=False, model=gpt-4.1-mini-2025-04-14
 q2: verify=True, model=claude-sonnet-4-5-20250514
 q2: verify=True, model=gpt-4.1-mini-2025-04-14

Total results: 4
First result question: What is the putative target of venetoclax?
8. DataFrame Export¶
Every specialized view converts to pandas DataFrames for tabular analysis. The DataFrame structures are designed around a specific "explosion" axis: each row represents the finest-grained unit for that view type.
8.1. Template DataFrames¶
TemplateResults provides three DataFrame exports:
| Method | Row Granularity | Key Columns |
|---|---|---|
| `to_dataframe()` | One row per parsed field per result | `field_name`, `gt_value`, `llm_value`, `field_match`, `verify_result` |
| `to_regex_dataframe()` | One row per regex pattern per result | `pattern_name`, `pattern_regex`, `matched`, `extracted_value` |
| `to_usage_dataframe()` | One row per usage stage per result | `usage_stage`, `input_tokens`, `output_tokens`, `total_tokens`, `model_used` |
template_results = result_set.get_template_results()
# Field-level comparison: ground truth vs LLM extraction
df = template_results.to_dataframe()
print(f"Template DataFrame: {len(df)} rows, {len(df.columns)} columns")
print(f"Columns: {list(df.columns)}")
print()
print(df[["question_id", "field_name", "gt_value", "llm_value", "field_match", "verify_result"]].to_string(index=False))
Template DataFrame: 4 rows, 34 columns
Columns: ['completed_without_errors', 'error', 'recursion_limit_reached', 'question_id', 'template_id', 'question_text', 'keywords', 'replicate', 'answering_mcp_servers', 'answering_model', 'parsing_model', 'answering_system_prompt', 'parsing_system_prompt', 'raw_llm_response', 'field_name', 'gt_value', 'llm_value', 'field_match', 'field_type', 'verify_result', 'embedding_check_performed', 'embedding_similarity_score', 'embedding_model_used', 'embedding_override_applied', 'abstention_check_performed', 'abstention_detected', 'abstention_reasoning', 'abstention_override_applied', 'regex_validations_performed', 'regex_overall_success', 'execution_time', 'timestamp', 'run_name', 'result_index']
question_id field_name gt_value llm_value field_match verify_result
q1 target BCL2 BCL2 True True
q1 target BCL2 BCL2 True False
q2 target BCL2 BCL2 True True
q2 target BCL2 BCL2 True True
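The "one row per parsed field" explosion can be sketched with pandas over hypothetical pre-flattened results (illustrative data shapes, not the library's internals):

```python
import pandas as pd

# Each result carries a dict of parsed fields and a matching ground-truth dict
results = [
    {"question_id": "q1", "verify_result": True,
     "parsed": {"target": "BCL2"}, "gt": {"target": "BCL2"}},
    {"question_id": "q2", "verify_result": True,
     "parsed": {"target": "BCL2"}, "gt": {"target": "BCL2"}},
]

rows = []
for r in results:
    for field, llm_value in r["parsed"].items():  # one row per parsed field
        gt_value = r["gt"].get(field)
        rows.append({
            "question_id": r["question_id"],
            "field_name": field,
            "gt_value": gt_value,
            "llm_value": llm_value,
            "field_match": gt_value == llm_value,
            "verify_result": r["verify_result"],
        })

df = pd.DataFrame(rows)
print(len(df))  # 2
```

A template with three parsed fields would contribute three rows per result, which is why the row count of `to_dataframe()` can exceed the number of results.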
8.2. Rubric DataFrames¶
RubricResults.to_dataframe() produces one row per trait (or per metric for metric traits). Filter by trait type using the trait_type parameter:
| `trait_type` | Includes |
|---|---|
| `"all"` (default) | All trait types combined |
| `"llm"` | All LLM traits (score, binary, and literal) |
| `"llm_score"` | LLM traits with 1-5 scale |
| `"llm_binary"` | LLM traits with boolean scores |
| `"llm_literal"` | LLM traits with categorical classification |
| `"regex"` | Regex traits (boolean) |
| `"callable"` | Callable traits (boolean or integer) |
| `"metric"` | Metric traits (exploded by metric name) |
Key columns: trait_name, trait_score, trait_label (for literal kinds), trait_type, metric_name (for metrics), confusion_tp/fp/fn/tn (for metrics).
rubric_results = result_set.get_rubrics_results()
# All traits
df = rubric_results.to_dataframe()
print(f"Rubric DataFrame: {len(df)} rows")
print()
print(df[["question_id", "trait_name", "trait_score", "trait_type"]].to_string(index=False))
Rubric DataFrame: 19 rows
question_id trait_name trait_score trait_type
q1 safety True llm_binary
q1 clarity 4 llm_score
q1 has_citations True regex
q1 under_150w True callable
q1 drug_coverage 3.0 metric
q1 drug_coverage 1.0 metric
q1 drug_coverage 0.0 metric
q1 drug_coverage 1.0 metric
q1 drug_coverage 0.75 metric
q1 drug_coverage 0.857 metric
q1 safety True llm_binary
q1 clarity 3 llm_score
q1 has_citations False regex
q2 safety True llm_binary
q2 clarity 5 llm_score
q2 has_citations True regex
q2 safety True llm_binary
q2 clarity 4 llm_score
q2 has_citations True regex
# Just LLM traits
df_llm = rubric_results.to_dataframe(trait_type="llm")
print(f"LLM traits only: {len(df_llm)} rows")
print(df_llm[["question_id", "answering_model", "trait_name", "trait_score"]].to_string(index=False))
LLM traits only: 8 rows
question_id answering_model trait_name trait_score
q1 langchain:claude-sonnet-4-5-20250514 safety True
q1 langchain:claude-sonnet-4-5-20250514 clarity 4
q1 langchain:gpt-4.1-mini-2025-04-14 safety True
q1 langchain:gpt-4.1-mini-2025-04-14 clarity 3
q2 langchain:claude-sonnet-4-5-20250514 safety True
q2 langchain:claude-sonnet-4-5-20250514 clarity 5
q2 langchain:gpt-4.1-mini-2025-04-14 safety True
q2 langchain:gpt-4.1-mini-2025-04-14 clarity 4
9. Aggregation¶
Both TemplateResults and RubricResults provide built-in aggregation methods. Aggregation groups results by a column (e.g., question_id, answering_model, replicate) and applies a strategy.
9.1. Built-in Aggregation Strategies¶
| Strategy | Behavior | Best For |
|---|---|---|
| `"mean"` | Arithmetic mean | Numeric scores, similarity scores |
| `"median"` | Median value | Numeric scores with outliers |
| `"mode"` | Most common value | Categorical values |
| `"majority_vote"` | `True` if >50% are `True` (configurable threshold) | Boolean traits, pass/fail |
| `"first"` | First non-null value | Metadata fields |
| `"count"` | Count occurrences of each value | Distribution analysis |
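The built-in strategies reduce to simple operations over one group's values; a minimal sketch of three of them (the library applies the chosen strategy per group after splitting on the `by` column):

```python
from statistics import mean, median

def aggregate(values, strategy: str, threshold: float = 0.5):
    # Minimal sketch of three built-in strategies over one group's values
    if strategy == "mean":
        return mean(values)
    if strategy == "median":
        return median(values)
    if strategy == "majority_vote":
        # True when the fraction of truthy values exceeds the threshold
        return sum(bool(v) for v in values) / len(values) > threshold
    raise ValueError(f"unknown strategy: {strategy}")

print(aggregate([4, 3], "mean"))                        # 3.5
print(aggregate([True, True, False], "majority_vote"))  # True
```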
9.2. Template Aggregation¶
template_results = result_set.get_template_results()
# Pass rate by question
pass_rates = template_results.aggregate_pass_rate(by="question_id")
print("Pass rates by question:", pass_rates)
Pass rates by question: {'q1': 0.5, 'q2': 1.0}
9.3. Rubric Aggregation¶
rubric_results = result_set.get_rubrics_results()
# Average LLM trait scores by question
avg = rubric_results.aggregate_llm_traits(strategy="mean", by="question_id")
print("Average LLM trait scores by question:")
for qid, scores in avg.items():
    print(f" {qid}: {scores}")
Average LLM trait scores by question:
q1: {'clarity': 3.5, 'safety': 1.0}
q2: {'clarity': 4.5, 'safety': 1.0}
# Majority vote on regex traits by model
regex_agg = rubric_results.aggregate_regex_traits(strategy="majority_vote", by="answering_model")
print("Regex trait majority vote by model:")
for model, scores in regex_agg.items():
    print(f" {model}: {scores}")
Regex trait majority vote by model:
langchain:claude-sonnet-4-5-20250514: {'has_citations': True}
langchain:gpt-4.1-mini-2025-04-14: {'has_citations': False}
9.4. Custom Aggregators¶
Register custom aggregation strategies by implementing the ResultAggregator protocol:
class WeightedMeanAggregator:
    """Custom aggregator that computes a weighted mean."""

    def aggregate(self, series, **kwargs):
        # Simple mean as fallback (weights would come from kwargs)
        return series.mean()
rubric_results.register_aggregator("weighted_mean", WeightedMeanAggregator())
weighted = rubric_results.aggregate_llm_traits(strategy="weighted_mean", by="question_id")
print("Available aggregators:", rubric_results.list_aggregators())
print("Weighted mean result:", weighted)
Available aggregators: ['mean', 'median', 'mode', 'majority_vote', 'first', 'count', 'weighted_mean']
Weighted mean result: {'q1': {'clarity': 3.5, 'safety': 1.0}, 'q2': {'clarity': 4.5, 'safety': 1.0}}
10. In-Memory Storage and Export¶
ResultsManager stores verification results in memory during a session. Results are organized by run name and can be exported to JSON or CSV. The export format is auto-detected from the file extension, or can be specified explicitly.
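Extension-based format detection can be sketched with `pathlib` (a hypothetical helper; the library's detection logic may differ):

```python
from pathlib import Path

def detect_export_format(path: Path) -> str:
    # Map the file extension to an export format; reject anything else
    formats = {".json": "json", ".csv": "csv"}
    suffix = path.suffix.lower()
    if suffix not in formats:
        raise ValueError(f"unsupported export extension: {suffix}")
    return formats[suffix]

print(detect_export_format(Path("results.json")))  # json
print(detect_export_format(Path("results.csv")))   # csv
```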
# ResultsManager API (shown here without a live Benchmark for reference):
#
# from pathlib import Path
#
# # Results are stored automatically after run_verification
# results = benchmark.results.get_verification_results(run_name="my_run")
#
# # Export to file (format auto-detected from extension)
# benchmark.results.export_results_to_file(Path("results.json"))
# benchmark.results.export_results_to_file(Path("results.csv"))
#
# # Get summary statistics for a run
# summary = benchmark.results.get_verification_summary(run_name="my_run")
# # {"total_results": 60, "successful_count": 58, "success_rate": 96.67, ...}
print("ResultsManager public methods:")
print([m for m in dir(ResultsManager) if not m.startswith("_")])
ResultsManager public methods: ['clear_verification_results', 'export_results_to_file', 'export_verification_results', 'get_all_run_names', 'get_latest_results', 'get_results_by_question', 'get_results_by_run', 'get_results_statistics_by_run', 'get_verification_history', 'get_verification_results', 'get_verification_summary', 'has_results', 'load_results_from_file', 'store_verification_results']
Results are not checkpointed
ResultsManager stores results in memory only. They are not saved to the benchmark checkpoint file. To persist results across sessions, use export_results_to_file() or save the VerificationResultSet directly.
11. How Results Are Built: The FinalizeResult Stage¶
The FinalizeResult stage (stage 13) always runs as the last step in the pipeline. It constructs the VerificationResult from the accumulated VerificationContext:
- Collects all artifacts written by previous stages
- Extracts parsed ground truth and LLM responses from the parsed answer object
- Determines which verification types were performed (template, rubric)
- Aggregates token usage metadata across all stages
- Computes the deterministic result_id
- Assembles the nested sub-objects (metadata, template, rubric, deep_judgment, deep_judgment_rubric)
- Handles partial failure: whatever artifacts are available get populated; missing data remains None
This stage handles both success and error cases. If the pipeline errors at stage 5, the finalize stage still runs and captures whatever was collected up to that point, with completed_without_errors=False and the error message in metadata.error.
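The partial-failure behavior can be sketched as a plain dictionary assembly (hypothetical shapes, not the real Pydantic models): whatever the context holds gets populated, and sub-objects for stages that never ran stay None.

```python
def finalize(context: dict) -> dict:
    # Assemble the result from whatever artifacts the pipeline context holds;
    # sub-objects for stages that never ran remain None
    error = context.get("error")
    return {
        "metadata": {
            "completed_without_errors": error is None,
            "error": error,
        },
        "template": context.get("template_artifacts"),
        "rubric": context.get("rubric_artifacts"),
    }

# A pipeline that errored mid-run still produces a complete result record
partial = finalize({"error": "parsing failed at stage 5"})
print(partial["metadata"]["completed_without_errors"], partial["template"])  # False None
```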
12. Next Steps¶
- Verification Pipeline: The 13 stages that produce results
- Evaluation Modes: How modes affect which result sub-objects are populated
- Rubrics: Defining the traits that populate rubric results
- Answer Templates: Writing the verify() logic that produces verify_result