Results and Scoring¶
Every question that passes through the verification pipeline produces a VerificationResult: a nested Pydantic model that captures everything that happened during evaluation, from the raw response to the final pass/fail verdict. This page explains the result data model, how scoring works, how to access and aggregate results, and how to export them for analysis.
1. What Results Capture¶
The most important idea is: a result is a complete evidence record, not just a score. It preserves every intermediate artifact (the raw response, the parsed fields, each optional check's outcome, every rubric trait score) so that downstream analysis can always trace a verdict back to its inputs. Nothing is discarded.
A single VerificationResult corresponds to one question evaluated by one answering model and parsed by one judge model, in one replicate. If you evaluate 10 questions with 2 answering models and 3 replicates, you get 60 results.
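The result count is a plain cross product of questions, answering models, and replicates; a quick sketch with `itertools.product`:

```python
from itertools import product

questions = [f"q{i}" for i in range(1, 11)]  # 10 questions
models = ["model_a", "model_b"]              # 2 answering models
replicates = [1, 2, 3]                       # 3 replicates

# One VerificationResult per (question, answering model, replicate) combination
combos = list(product(questions, models, replicates))
print(len(combos))  # 60
```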
1.1. Result Structure at a Glance¶
VerificationResult uses nested composition: five sub-objects — metadata is always present, the other four are populated only when the corresponding evaluation ran — each grouping a coherent slice of the evidence. This is the only access path; flat property accessors do not exist.
VerificationResult
├── metadata ← Always present: identification, timing, model info
├── template ← Present when template evaluation ran
├── rubric ← Present when rubric evaluation ran
├── deep_judgment ← Present when deep judgment ran (templates)
├── deep_judgment_rubric ← Present when deep judgment ran (rubrics)
│
│ (Root-level fields for MCP agent trace filtering)
├── evaluation_input ← The text passed to evaluation stages
├── used_full_trace ← Whether the full agent trace was used
└── trace_extraction_error ← Error if final AI message extraction failed
Access fields through their sub-objects:
# Correct: nested access
print(result.metadata.question_id)
print(result.template.verify_result)
print(result.rubric.llm_trait_scores)
q1
True
{'safety': True, 'clarity': 4}
# Wrong: flat access (removed, will raise AttributeError)
try:
    result.question_id
except AttributeError as e:
    print(f"AttributeError: {e}")
AttributeError: 'VerificationResult' object has no attribute 'question_id'
2. Metadata: Identity and Execution Context¶
Every result carries a VerificationResultMetadata sub-object regardless of evaluation mode. It identifies what was evaluated, by which models, and when.
| Field | Type | Description |
|---|---|---|
| `question_id` | `str` | MD5 hash of the question text (32-char hex) |
| `question_text` | `str` | Full question text |
| `raw_answer` | `str \| None` | Human-readable ground truth from the checkpoint |
| `template_id` | `str` | MD5 hash of the template code, or `"no_template"` |
| `answering` | `ModelIdentity` | Answering model (interface, model_name, tools) |
| `parsing` | `ModelIdentity` | Parsing/judge model (interface, model_name) |
| `answering_system_prompt` | `str \| None` | System prompt used for the answering model |
| `parsing_system_prompt` | `str \| None` | System prompt used for the parsing model |
| `execution_time` | `float` | Pipeline execution time in seconds |
| `timestamp` | `str` | ISO timestamp of when the result was produced |
| `result_id` | `str` | Deterministic 16-character SHA256 hash (see below) |
| `run_name` | `str \| None` | Organizing label for verification runs |
| `replicate` | `int \| None` | Replicate number (1, 2, 3, ...) for repeated runs |
| `keywords` | `list[str] \| None` | Keywords associated with the question |
| `completed_without_errors` | `bool` | Whether the pipeline ran without errors |
| `error` | `str \| None` | Error message if something went wrong |
| `few_shot_enabled` | `bool` | Whether few-shot prompting was active (default `False`) |
| `few_shot_example_count` | `int` | Number of few-shot examples used (default 0) |
| `evaluation_mode` | `str \| None` | Evaluation mode used (e.g., `"template_only"`, `"template_and_rubric"`) |
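Per the table, `question_id` is the MD5 hexdigest of the question text; a minimal sketch of how such an ID could be derived (the helper name is illustrative, not the library's API):

```python
import hashlib

def make_question_id(question_text: str) -> str:
    # MD5 of the UTF-8 question text yields a stable 32-character hex identifier
    return hashlib.md5(question_text.encode("utf-8")).hexdigest()

qid = make_question_id("What is the putative target of venetoclax?")
print(len(qid))  # 32
```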
meta = result.metadata
print(f"Question: {meta.question_id}")
print(f"Model: {meta.answering.display_string}")
print(f"Judge: {meta.parsing.display_string}")
print(f"Time: {meta.execution_time}s")
print(f"Result ID: {meta.result_id}")
print(f"Replicate: {meta.replicate}")
print(f"Success: {meta.completed_without_errors}")
Question: q1
Model: langchain:claude-sonnet-4-5-20250514
Judge: langchain:claude-haiku-4-5-20251001
Time: 1.2s
Result ID: e6972c44b765e7c3
Replicate: 1
Success: True
2.1. ModelIdentity¶
Models are identified by a composite ModelIdentity object, not a plain string. This distinguishes the same model used with different interfaces or MCP tool sets:
| Field | Description |
|---|---|
| `interface` | The adapter interface (e.g., `"langchain"`, `"claude_sdk"`) |
| `model_name` | The model name (e.g., `"claude-sonnet-4-6"`) |
| `tools` | Sorted list of MCP server names (answering models only; always `[]` for parsing) |
identity = result.metadata.answering
print(f"Interface: {identity.interface}")
print(f"Model name: {identity.model_name}")
print(f"Tools: {identity.tools}")
print(f"Display string: {identity.display_string}")
print(f"Canonical key: {identity.canonical_key}")
Interface: langchain
Model name: claude-sonnet-4-5-20250514
Tools: []
Display string: langchain:claude-sonnet-4-5-20250514
Canonical key: langchain:claude-sonnet-4-5-20250514:
2.2. Deterministic Result IDs¶
Each result gets a result_id: a 16-character SHA256 hash computed from (question_id, answering, parsing, timestamp, replicate). The same inputs always produce the same ID, enabling deduplication across runs. The ID is computed by VerificationResultMetadata.compute_result_id().
# Same inputs always produce the same ID
id1 = VerificationResultMetadata.compute_result_id(
    question_id="q1", answering=answering_a, parsing=parsing,
    timestamp="2025-06-15T10:00:00Z", replicate=1,
)
id2 = VerificationResultMetadata.compute_result_id(
    question_id="q1", answering=answering_a, parsing=parsing,
    timestamp="2025-06-15T10:00:00Z", replicate=1,
)
print(f"ID 1: {id1}")
print(f"ID 2: {id2}")
print(f"Match: {id1 == id2}")
print(f"Length: {len(id1)} characters")
ID 1: e6972c44b765e7c3
ID 2: e6972c44b765e7c3
Match: True
Length: 16 characters
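The exact serialization inside `compute_result_id()` is internal to the library; this sketch only assumes the five inputs are joined deterministically and hashed with SHA-256, truncated to 16 hex characters:

```python
import hashlib

def result_id_sketch(question_id: str, answering: str, parsing: str,
                     timestamp: str, replicate: int) -> str:
    # Deterministic: the same five inputs always hash to the same ID
    payload = "|".join([question_id, answering, parsing, timestamp, str(replicate)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

a = result_id_sketch("q1", "langchain:model-a", "langchain:judge",
                     "2025-06-15T10:00:00Z", 1)
b = result_id_sketch("q1", "langchain:model-a", "langchain:judge",
                     "2025-06-15T10:00:00Z", 1)
print(a == b, len(a))  # True 16
```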
3. Template Results: The Correctness Record¶
The template sub-object (VerificationResultTemplate) is present whenever template evaluation ran (template_only or template_and_rubric evaluation modes). It records the full chain from raw response to pass/fail verdict.
3.1. The Primary Correctness Signal: verify_result¶
verify_result is a bool | None that captures whether the template's verify() method returned True. This is the core correctness output. When template evaluation did not run (e.g., rubric_only mode), this field is None.
Several pipeline stages can override this value before finalization:
| Stage | Override Behavior |
|---|---|
| Abstention check | Sets verify_result to False if the model refused to answer |
| Sufficiency check | Sets verify_result to False if the response lacks sufficient information |
| Embedding check | Can override verify_result based on semantic similarity threshold |
The corresponding *_override_applied boolean fields record whether an override occurred, so you can always distinguish "failed on its own merits" from "overridden by a guard stage."
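The override pattern can be sketched as follows (a hypothetical helper — the real stages mutate the result inside the pipeline context, but the control flow is the same idea):

```python
def apply_guards(verify_result, abstention_detected, sufficiency_failed):
    """Sketch of the guard-stage override pattern.

    Returns the final verdict plus flags recording which guard, if any,
    overrode the template's own verify() outcome.
    """
    abstention_override_applied = False
    sufficiency_override_applied = False
    if abstention_detected:
        verify_result = False
        abstention_override_applied = True
    if sufficiency_failed:
        verify_result = False
        sufficiency_override_applied = True
    return verify_result, abstention_override_applied, sufficiency_override_applied

# A refusal forces the verdict to False even if verify() returned True
final, abst, suff = apply_guards(True, abstention_detected=True, sufficiency_failed=False)
print(final, abst, suff)  # False True False
```

The `*_override_applied` flags are what let analysis distinguish a genuine template failure from a guard-stage override.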
tmpl = result.template
print(f"verify_result: {tmpl.verify_result}")
print(f"Parsed LLM response: {tmpl.parsed_llm_response}")
print(f"Parsed ground truth: {tmpl.parsed_gt_response}")
print(f"Abstention check performed: {tmpl.abstention_check_performed}")
print(f"Embedding check performed: {tmpl.embedding_check_performed}")
print(f"Embedding similarity score: {tmpl.embedding_similarity_score}")
verify_result: True
Parsed LLM response: {'target': 'BCL2'}
Parsed ground truth: {'target': 'BCL2'}
Abstention check performed: False
Embedding check performed: True
Embedding similarity score: 0.92
3.2. Response and Parsing Artifacts¶
| Field | Type | Description |
|---|---|---|
| `raw_llm_response` | `str` | The answering model's full text response |
| `trace_messages` | `list[dict]` | Structured message trace (for MCP agent runs) |
| `parsed_llm_response` | `dict \| None` | Fields extracted by the Judge LLM |
| `parsed_gt_response` | `dict \| None` | Ground truth parsed into the same template fields |
| `verify_granular_result` | `Any \| None` | Per-field verification detail (if `verify_granular()` is implemented) |
| `field_verification_error` | `str \| None` | Error message if `verify()` raised an exception (non-fatal) |
| `field_results` | `dict[str, bool] \| None` | Per-field primitive verification results (from `_compute_field_results()`) |
| `composition_strategy` | `str \| None` | Composition strategy used: `"all_of"`, `"any_of"`, or `"at_least_n(N)"` |
3.3. Optional Check Results¶
Each optional check records three pieces of state: whether it was attempted, what it found, and whether it overrode the verdict.
| Check | Key Fields |
|---|---|
| Abstention | abstention_check_performed, abstention_detected, abstention_override_applied, abstention_reasoning |
| Sufficiency | sufficiency_check_performed, sufficiency_detected, sufficiency_override_applied, sufficiency_reasoning |
| Embedding | embedding_check_performed, embedding_similarity_score (0.0 to 1.0), embedding_override_applied, embedding_model_used |
| Regex | regex_validations_performed, regex_validation_results (per-pattern dict), regex_overall_success, regex_extraction_results |
3.4. Execution Metadata¶
| Field | Type | Description |
|---|---|---|
| `recursion_limit_reached` | `bool` | Whether an MCP agent hit its recursion limit |
| `answering_mcp_servers` | `list[str] \| None` | MCP servers attached to the answering model |
| `usage_metadata` | `dict \| None` | Token usage breakdown by stage (answer_generation, parsing, rubric_evaluation, abstention_check, total) |
| `agent_metrics` | `dict \| None` | MCP agent metrics: iterations, tool_calls, tools_used, suspect_failed_tool_calls, suspect_failed_tools |
4. Rubric Results: The Quality Record¶
The rubric sub-object (VerificationResultRubric) is present whenever rubric evaluation ran (template_and_rubric or rubric_only modes). Trait scores are split by type into separate dictionaries, all keyed by trait name.
| Field | Type | Description |
|---|---|---|
| `llm_trait_scores` | `dict[str, int \| bool] \| None` | LLM-evaluated traits (boolean or 1-5 scale) |
| `llm_trait_labels` | `dict[str, str] \| None` | Class labels for literal-kind LLM traits (index-to-name mapping) |
| `regex_trait_scores` | `dict[str, bool] \| None` | Regex trait pass/fail results |
| `callable_trait_scores` | `dict[str, bool \| int] \| None` | Callable trait results |
| `metric_trait_scores` | `dict[str, dict[str, float]] \| None` | Metric trait metrics (precision, recall, F1, etc.) |
| `metric_trait_confusion_lists` | `dict[str, dict[str, list[str]]] \| None` | Per-metric confusion lists (tp, tn, fp, fn containing excerpts) |
| `rubric_evaluation_strategy` | `str \| None` | `"batch"` or `"sequential"` |
4.1. Accessing Trait Scores¶
# Individual trait types
print("LLM trait (boolean):", result.rubric.llm_trait_scores["safety"])
print("LLM trait (score): ", result.rubric.llm_trait_scores["clarity"])
print("Regex trait: ", result.rubric.regex_trait_scores["has_citations"])
print("Callable trait: ", result.rubric.callable_trait_scores["under_150w"])
LLM trait (boolean): True
LLM trait (score): 4
Regex trait: True
Callable trait: True
# Literal-kind LLM traits: score is the class index, label is the class name
print("Literal label:", result.rubric.llm_trait_labels["response_type"])
Literal label: Factual
# Metric traits: nested dict of float metrics
print("Metric scores:", result.rubric.metric_trait_scores["drug_coverage"])
Metric scores: {'tp': 3.0, 'fn': 1.0, 'fp': 0.0, 'precision': 1.0, 'recall': 0.75, 'f1': 0.857}
# Confusion lists for metric traits: which items were found/missed
print("Confusion lists:", result.rubric.metric_trait_confusion_lists["drug_coverage"])
Confusion lists: {'tp': ['aspirin', 'ibuprofen', 'acetaminophen'], 'fn': ['naproxen'], 'fp': [], 'tn': []}
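Assuming the standard definitions, the float metrics in `metric_trait_scores` follow directly from the confusion lists:

```python
def metrics_from_confusion(tp: int, fp: int, fn: int) -> dict:
    # precision = tp / (tp + fp); recall = tp / (tp + fn); F1 is their harmonic mean
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": round(f1, 3)}

confusion = {"tp": ["aspirin", "ibuprofen", "acetaminophen"], "fn": ["naproxen"], "fp": []}
m = metrics_from_confusion(len(confusion["tp"]), len(confusion["fp"]), len(confusion["fn"]))
print(m)  # {'precision': 1.0, 'recall': 0.75, 'f1': 0.857}
```

This reproduces the `drug_coverage` values shown above: three drugs found, one missed, none hallucinated.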
# Flat access across all types
all_scores = result.rubric.get_all_trait_scores()
print("All trait scores:", all_scores)
All trait scores: {'safety': True, 'clarity': 4, 'has_citations': True, 'under_150w': True, 'drug_coverage': {'tp': 3.0, 'fn': 1.0, 'fp': 0.0, 'precision': 1.0, 'recall': 0.75, 'f1': 0.857}}
# Look up a trait by name (returns value and type)
print("Trait lookup:", result.rubric.get_trait_by_name("safety"))
print("Trait lookup:", result.rubric.get_trait_by_name("has_citations"))
Trait lookup: (True, 'llm')
Trait lookup: (True, 'regex')
5. Deep Judgment Results (Optional)¶
When deep judgment is enabled, additional evidence-based results are captured. Deep judgment adds excerpt extraction, per-attribute reasoning, and optional hallucination risk assessment on top of standard evaluation.
5.1. Template Deep Judgment¶
The deep_judgment sub-object (VerificationResultDeepJudgment) records per-attribute evidence:
| Field | Type | Description |
|---|---|---|
| `extracted_excerpts` | `dict[str, list[dict]]` | Per-attribute verbatim passages with confidence (low/medium/high), similarity score, and optional search results |
| `attribute_reasoning` | `dict[str, str]` | LLM reasoning for each attribute (present even when no excerpts were found) |
| `hallucination_risk_assessment` | `dict[str, str]` | Risk level per attribute (none/low/medium/high); only populated when search is enabled |
| `deep_judgment_stages_completed` | `list[str]` | Which stages ran: `"excerpts"`, `"reasoning"`, `"parameters"` |
| `attributes_without_excerpts` | `list[str]` | Attributes with no corroborating excerpts |
| `deep_judgment_model_calls` | `int` | Number of LLM invocations |
5.2. Rubric Deep Judgment¶
The deep_judgment_rubric sub-object (VerificationResultDeepJudgmentRubric) records per-trait evidence for rubric traits with deep judgment enabled:
| Field | Type | Description |
|---|---|---|
| `extracted_rubric_excerpts` | `dict[str, list[dict]]` | Per-trait excerpts (only for traits with `deep_judgment_excerpt_enabled=True`) |
| `rubric_trait_reasoning` | `dict[str, str]` | Per-trait reasoning (all deep-judgment-enabled traits) |
| `deep_judgment_rubric_scores` | `dict[str, int \| bool]` | Scores from deep-judgment evaluation |
| `standard_rubric_scores` | `dict[str, int \| bool]` | Scores for non-deep-judgment traits (for comparison) |
| `traits_without_valid_excerpts` | `list[str]` | Traits that exhausted retries without valid excerpts |
| `trait_metadata` | `dict[str, dict]` | Per-trait tracking (stages completed, model calls, retry counts) |
6. How Results Vary by Evaluation Mode¶
The evaluation mode determines which sub-objects are populated:
| Sub-object | `template_only` | `template_and_rubric` | `rubric_only` |
|---|---|---|---|
| `metadata` | Always | Always | Always |
| `template` | Present | Present | `None` |
| `template.verify_result` | `bool` | `bool` | N/A |
| `rubric` | `None` | Present | Present |
| `deep_judgment` | Optional | Optional | `None` |
| `deep_judgment_rubric` | `None` | Optional | Optional |
In rubric_only mode, no template parsing occurs. The rubric trait scores evaluated against the raw response are the primary output. In template_only mode (the default), the rubric sub-object is None.
7. Working with Result Collections¶
Benchmark.run_verification() returns a VerificationResultSet: the top-level container that holds all individual results and provides specialized views, filtering, grouping, and DataFrame conversion.
7.1. Specialized Views¶
The result set provides four accessor methods, each returning a purpose-built wrapper with its own analysis API:
| Accessor Method | Returns | Purpose |
|---|---|---|
| `get_template_results()` | `TemplateResults` | Pass/fail rates, embedding scores, regex results, abstention detection, parsed responses |
| `get_rubrics_results()` | `RubricResults` | Trait scores by type, aggregation, confusion matrices |
| `get_judgment_results()` | `JudgmentResults` | Extracted excerpts, reasoning traces, hallucination risk |
| `get_rubric_judgments_results()` | `RubricJudgmentResults` | Excerpt-level explosion (one row per trait per excerpt) |
# Template analysis
template_results = result_set.get_template_results()
print(f"TemplateResults with {len(template_results)} results")
print(f"Summary: {template_results.get_template_summary()}")
TemplateResults with 4 results
Summary: {'num_results': 4, 'num_passed': 3, 'num_failed': 1, 'pass_rate': 0.75, 'num_with_embedding': 1, 'num_with_regex': 0, 'num_with_abstention': 0, 'num_questions': 2}
# Rubric analysis
rubric_results = result_set.get_rubrics_results()
print(f"RubricResults with {len(rubric_results)} results")
print(f"Summary: {rubric_results.get_trait_summary()}")
RubricResults with 4 results
Summary: {'num_results': 4, 'llm_traits': ['clarity', 'safety'], 'regex_traits': ['has_citations'], 'callable_traits': ['under_150w'], 'metric_traits': ['drug_coverage'], 'num_questions': 2}
7.2. Filtering and Grouping¶
Both VerificationResultSet and the specialized views support filtering and grouping. Filtering returns a new instance of the same type with a subset of results.
# Filter at the result set level
filtered = result_set.filter(
    question_ids=["q1"],
    completed_only=True,
)
print(f"Filtered to {len(filtered)} results (question q1 only)")
# Group by different dimensions
by_question = result_set.group_by_question()
for qid, qresults in by_question.items():
    print(f" Question {qid}: {len(qresults)} results")
Filtered to 2 results (question q1 only)
 Question q1: 2 results
 Question q2: 2 results
# Group by model
by_model = result_set.group_by_model()
for model_key, model_results in by_model.items():
    print(f" Model {model_key}: {len(model_results)} results")
 Model langchain:claude-sonnet-4-5-20250514: 2 results
 Model langchain:gpt-4.1-mini-2025-04-14: 2 results
# Specialized views also support filtering
passed = template_results.filter(passed_only=True)
failed = template_results.filter(failed_only=True)
print(f"Passed: {len(passed)}, Failed: {len(failed)}")
Passed: 3, Failed: 1
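The filter/group semantics can be illustrated with a plain stand-in model (a hypothetical `MiniResult`, not the library's classes): filtering yields a new collection holding the matching subset, while grouping buckets results by a key.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class MiniResult:  # stand-in for VerificationResult
    question_id: str
    model: str
    passed: bool

results = [
    MiniResult("q1", "model_a", True),
    MiniResult("q1", "model_b", False),
    MiniResult("q2", "model_a", True),
]

# filter(): a new collection of the same shape, containing only matching results
q1_only = [r for r in results if r.question_id == "q1"]

# group_by_question(): bucket results by question_id
by_question = defaultdict(list)
for r in results:
    by_question[r.question_id].append(r)

print(len(q1_only), sorted(by_question))  # 2 ['q1', 'q2']
```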
7.3. Iteration¶
All containers support standard Python iteration:
for r in result_set:
    print(f" {r.metadata.question_id}: verify={r.template.verify_result}, "
          f"model={r.metadata.answering.model_name}")
print(f"\nTotal results: {len(result_set)}")
print(f"First result question: {result_set[0].metadata.question_text}")
 q1: verify=True, model=claude-sonnet-4-5-20250514
 q1: verify=False, model=gpt-4.1-mini-2025-04-14
 q2: verify=True, model=claude-sonnet-4-5-20250514
 q2: verify=True, model=gpt-4.1-mini-2025-04-14

Total results: 4
First result question: What is the putative target of venetoclax?
8. DataFrame Export¶
Every specialized view converts to pandas DataFrames for tabular analysis. The DataFrame structures are designed around a specific "explosion" axis: each row represents the finest-grained unit for that view type.
8.1. Template DataFrames¶
TemplateResults provides three DataFrame exports:
| Method | Row Granularity | Key Columns |
|---|---|---|
| `to_dataframe()` | One row per parsed field per result | `field_name`, `gt_value`, `llm_value`, `field_match`, `verify_result` |
| `to_regex_dataframe()` | One row per regex pattern per result | `pattern_name`, `pattern_regex`, `matched`, `extracted_value` |
| `to_usage_dataframe()` | One row per usage stage per result | `usage_stage`, `input_tokens`, `output_tokens`, `total_tokens`, `model_used` |
template_results = result_set.get_template_results()
# Field-level comparison: ground truth vs LLM extraction
df = template_results.to_dataframe()
print(f"Template DataFrame: {len(df)} rows, {len(df.columns)} columns")
print(f"Columns: {list(df.columns)}")
print()
print(df[["question_id", "field_name", "gt_value", "llm_value", "field_match", "verify_result"]].to_string(index=False))
Template DataFrame: 4 rows, 34 columns
Columns: ['completed_without_errors', 'error', 'recursion_limit_reached', 'question_id', 'template_id', 'question_text', 'keywords', 'replicate', 'answering_mcp_servers', 'answering_model', 'parsing_model', 'answering_system_prompt', 'parsing_system_prompt', 'raw_llm_response', 'field_name', 'gt_value', 'llm_value', 'field_match', 'field_type', 'verify_result', 'embedding_check_performed', 'embedding_similarity_score', 'embedding_model_used', 'embedding_override_applied', 'abstention_check_performed', 'abstention_detected', 'abstention_reasoning', 'abstention_override_applied', 'regex_validations_performed', 'regex_overall_success', 'execution_time', 'timestamp', 'run_name', 'result_index']
question_id field_name gt_value llm_value field_match verify_result
q1 target BCL2 BCL2 True True
q1 target BCL2 BCL2 True False
q2 target BCL2 BCL2 True True
q2 target BCL2 BCL2 True True
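The "one row per parsed field" explosion can be sketched with pandas over hypothetical pre-flattened results (illustrative data shapes, not the library's internals):

```python
import pandas as pd

# Each result carries a dict of parsed fields and a matching ground-truth dict
results = [
    {"question_id": "q1", "verify_result": True,
     "parsed": {"target": "BCL2"}, "gt": {"target": "BCL2"}},
    {"question_id": "q2", "verify_result": True,
     "parsed": {"target": "BCL2"}, "gt": {"target": "BCL2"}},
]

rows = []
for r in results:
    for field, llm_value in r["parsed"].items():  # one row per parsed field
        gt_value = r["gt"].get(field)
        rows.append({
            "question_id": r["question_id"],
            "field_name": field,
            "gt_value": gt_value,
            "llm_value": llm_value,
            "field_match": gt_value == llm_value,
            "verify_result": r["verify_result"],
        })

df = pd.DataFrame(rows)
print(len(df))  # 2
```

A template with three parsed fields would contribute three rows per result, which is why the row count of `to_dataframe()` can exceed the number of results.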
8.2. Rubric DataFrames¶
RubricResults.to_dataframe() produces one row per trait (or per metric for metric traits). Filter by trait type using the trait_type parameter:
| `trait_type` | Includes |
|---|---|
| `"all"` (default) | All trait types combined |
| `"llm"` | All LLM traits (score, binary, and literal) |
| `"llm_score"` | LLM traits with 1-5 scale |
| `"llm_binary"` | LLM traits with boolean scores |
| `"llm_literal"` | LLM traits with categorical classification |
| `"regex"` | Regex traits (boolean) |
| `"callable"` | Callable traits (boolean or integer) |
| `"metric"` | Metric traits (exploded by metric name) |
Key columns: trait_name, trait_score, trait_label (for literal kinds), trait_type, metric_name (for metrics), confusion_tp/fp/fn/tn (for metrics).
rubric_results = result_set.get_rubrics_results()
# All traits
df = rubric_results.to_dataframe()
print(f"Rubric DataFrame: {len(df)} rows")
print()
print(df[["question_id", "trait_name", "trait_score", "trait_type"]].to_string(index=False))
Rubric DataFrame: 19 rows
question_id trait_name trait_score trait_type
q1 safety True llm_binary
q1 clarity 4 llm_score
q1 has_citations True regex
q1 under_150w True callable
q1 drug_coverage 3.0 metric
q1 drug_coverage 1.0 metric
q1 drug_coverage 0.0 metric
q1 drug_coverage 1.0 metric
q1 drug_coverage 0.75 metric
q1 drug_coverage 0.857 metric
q1 safety True llm_binary
q1 clarity 3 llm_score
q1 has_citations False regex
q2 safety True llm_binary
q2 clarity 5 llm_score
q2 has_citations True regex
q2 safety True llm_binary
q2 clarity 4 llm_score
q2 has_citations True regex
# Just LLM traits
df_llm = rubric_results.to_dataframe(trait_type="llm")
print(f"LLM traits only: {len(df_llm)} rows")
print(df_llm[["question_id", "answering_model", "trait_name", "trait_score"]].to_string(index=False))
LLM traits only: 8 rows
question_id answering_model trait_name trait_score
q1 langchain:claude-sonnet-4-5-20250514 safety True
q1 langchain:claude-sonnet-4-5-20250514 clarity 4
q1 langchain:gpt-4.1-mini-2025-04-14 safety True
q1 langchain:gpt-4.1-mini-2025-04-14 clarity 3
q2 langchain:claude-sonnet-4-5-20250514 safety True
q2 langchain:claude-sonnet-4-5-20250514 clarity 5
q2 langchain:gpt-4.1-mini-2025-04-14 safety True
q2 langchain:gpt-4.1-mini-2025-04-14 clarity 4
9. Aggregation¶
Both TemplateResults and RubricResults provide built-in aggregation methods. Aggregation groups results by a column (e.g., question_id, answering_model, replicate) and applies a strategy.
9.1. Built-in Aggregation Strategies¶
| Strategy | Behavior | Best For |
|---|---|---|
| `"mean"` | Arithmetic mean | Numeric scores, similarity scores |
| `"median"` | Median value | Numeric scores with outliers |
| `"mode"` | Most common value | Categorical values |
| `"majority_vote"` | `True` if >50% are `True` (configurable threshold) | Boolean traits, pass/fail |
| `"first"` | First non-null value | Metadata fields |
| `"count"` | Count occurrences of each value | Distribution analysis |
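The built-in strategies reduce to simple operations over one group's values; a minimal sketch of three of them (the library applies the chosen strategy per group after splitting on the `by` column):

```python
from statistics import mean, median

def aggregate(values, strategy: str, threshold: float = 0.5):
    # Minimal sketch of three built-in strategies over one group's values
    if strategy == "mean":
        return mean(values)
    if strategy == "median":
        return median(values)
    if strategy == "majority_vote":
        # True when the fraction of truthy values exceeds the threshold
        return sum(bool(v) for v in values) / len(values) > threshold
    raise ValueError(f"unknown strategy: {strategy}")

print(aggregate([4, 3], "mean"))                        # 3.5
print(aggregate([True, True, False], "majority_vote"))  # True
```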
9.2. Template Aggregation¶
template_results = result_set.get_template_results()
# Pass rate by question
pass_rates = template_results.aggregate_pass_rate(by="question_id")
print("Pass rates by question:", pass_rates)
Pass rates by question: {'q1': 0.5, 'q2': 1.0}
9.3. Rubric Aggregation¶
rubric_results = result_set.get_rubrics_results()
# Average LLM trait scores by question
avg = rubric_results.aggregate_llm_traits(strategy="mean", by="question_id")
print("Average LLM trait scores by question:")
for qid, scores in avg.items():
    print(f" {qid}: {scores}")
Average LLM trait scores by question:
q1: {'clarity': 3.5, 'safety': 1.0}
q2: {'clarity': 4.5, 'safety': 1.0}
# Majority vote on regex traits by model
regex_agg = rubric_results.aggregate_regex_traits(strategy="majority_vote", by="answering_model")
print("Regex trait majority vote by model:")
for model, scores in regex_agg.items():
    print(f" {model}: {scores}")
Regex trait majority vote by model:
langchain:claude-sonnet-4-5-20250514: {'has_citations': True}
langchain:gpt-4.1-mini-2025-04-14: {'has_citations': False}
9.4. Custom Aggregators¶
Register custom aggregation strategies by implementing the ResultAggregator protocol:
class WeightedMeanAggregator:
    """Custom aggregator that computes a weighted mean."""

    def aggregate(self, series, **kwargs):
        # Simple mean as fallback (weights would come from kwargs)
        return series.mean()
rubric_results.register_aggregator("weighted_mean", WeightedMeanAggregator())
weighted = rubric_results.aggregate_llm_traits(strategy="weighted_mean", by="question_id")
print("Available aggregators:", rubric_results.list_aggregators())
print("Weighted mean result:", weighted)
Available aggregators: ['mean', 'median', 'mode', 'majority_vote', 'first', 'count', 'weighted_mean']
Weighted mean result: {'q1': {'clarity': 3.5, 'safety': 1.0}, 'q2': {'clarity': 4.5, 'safety': 1.0}}
10. In-Memory Storage and Export¶
ResultsManager stores verification results in memory during a session. Results are organized by run name and can be exported to JSON or CSV. The export format is auto-detected from the file extension, or can be specified explicitly.
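Extension-based format detection can be sketched with `pathlib` (a hypothetical helper; the library's detection logic may differ):

```python
from pathlib import Path

def detect_export_format(path: Path) -> str:
    # Map the file extension to an export format; reject anything else
    formats = {".json": "json", ".csv": "csv"}
    suffix = path.suffix.lower()
    if suffix not in formats:
        raise ValueError(f"unsupported export extension: {suffix}")
    return formats[suffix]

print(detect_export_format(Path("results.json")))  # json
print(detect_export_format(Path("results.csv")))   # csv
```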
# ResultsManager API (shown here without a live Benchmark for reference):
#
# from pathlib import Path
#
# # Results are stored automatically after run_verification
# results = benchmark.results.get_verification_results(run_name="my_run")
#
# # Export to file (format auto-detected from extension)
# benchmark.results.export_results_to_file(Path("results.json"))
# benchmark.results.export_results_to_file(Path("results.csv"))
#
# # Get summary statistics for a run
# summary = benchmark.results.get_verification_summary(run_name="my_run")
# # {"total_results": 60, "successful_count": 58, "success_rate": 96.67, ...}
print("ResultsManager public methods:")
print([m for m in dir(ResultsManager) if not m.startswith("_")])
ResultsManager public methods: ['clear_verification_results', 'export_results_to_file', 'export_verification_results', 'get_all_run_names', 'get_latest_results', 'get_results_by_question', 'get_results_by_run', 'get_results_statistics_by_run', 'get_verification_history', 'get_verification_results', 'get_verification_summary', 'has_results', 'load_results_from_file', 'store_verification_results']
Results are not checkpointed
ResultsManager stores results in memory only. They are not saved to the benchmark checkpoint file. To persist results across sessions, use export_results_to_file() or save the VerificationResultSet directly.
11. How Results Are Built: The FinalizeResult Stage¶
The FinalizeResult stage (stage 13) always runs as the last step in the pipeline. It constructs the VerificationResult from the accumulated VerificationContext:
- Collects all artifacts written by previous stages
- Extracts parsed ground truth and LLM responses from the parsed answer object
- Determines which verification types were performed (template, rubric)
- Aggregates token usage metadata across all stages
- Computes the deterministic result_id
- Assembles the nested sub-objects (metadata, template, rubric, deep_judgment, deep_judgment_rubric)
- Handles partial failure: whatever artifacts are available get populated; missing data remains None
This stage handles both success and error cases. If the pipeline errors at stage 5, the finalize stage still runs and captures whatever was collected up to that point, with completed_without_errors=False and the error message in metadata.error.
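The partial-failure behavior can be sketched as a plain dictionary assembly (hypothetical shapes, not the real Pydantic models): whatever the context holds gets populated, and sub-objects for stages that never ran stay None.

```python
def finalize(context: dict) -> dict:
    # Assemble the result from whatever artifacts the pipeline context holds;
    # sub-objects for stages that never ran remain None
    error = context.get("error")
    return {
        "metadata": {
            "completed_without_errors": error is None,
            "error": error,
        },
        "template": context.get("template_artifacts"),
        "rubric": context.get("rubric_artifacts"),
    }

# A pipeline that errored mid-run still produces a complete result record
partial = finalize({"error": "parsing failed at stage 5"})
print(partial["metadata"]["completed_without_errors"], partial["template"])  # False None
```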
12. Next Steps¶
- Verification Pipeline: The 13 stages that produce results
- Evaluation Modes: How modes affect which result sub-objects are populated
- Rubrics: Defining the traits that populate rubric results
- Answer Templates: Writing the verify() logic that produces verify_result