VerificationResult Structure
Every call to run_verification() returns a VerificationResultSet containing VerificationResult objects, one per question verified. This page documents the complete structure of both VerificationResultSet and VerificationResult so you know exactly what data is available for analysis.
VerificationResultSet Fields
| Field | Type | Description |
| --- | --- | --- |
| results | list[VerificationResult] | All individual verification results from the run |
| scenario_results | list[ScenarioExecutionResult] \| None | Present only for scenario runs. Each entry is a ScenarioExecutionResult containing the full execution trace (path taken, turn history, outcome criteria results, final state) for one scenario. None for non-scenario verification. |
| errors | list[tuple[str, BaseException]] \| None | Errors from failed scenario executions, as (description, exception) tuples. None when no scenario errors occurred. |
result_set = benchmark.run_verification(config)

# Standard question results (template is optional, so guard against None)
for vr in result_set.results:
    verify_result = vr.template.verify_result if vr.template else None
    print(vr.metadata.question_id, verify_result)

# Scenario-specific data (only present for scenario runs)
if result_set.scenario_results:
    for sr in result_set.scenario_results:
        print(f"Scenario {sr.scenario_id}: {sr.status}, path={sr.path}")
        print(f"  Outcomes: {sr.outcome_results}")

# Errors from failed scenario executions
if result_set.errors:
    for desc, exc in result_set.errors:
        print(f"Failed: {desc}: {exc}")
VerificationResult Overview
A VerificationResult has five top-level sections:
VerificationResult
├── metadata                  # Always present
│   ├── question_id, template_id, result_id
│   ├── answering (ModelIdentity), parsing (ModelIdentity)
│   ├── execution_time, timestamp
│   └── completed_without_errors, error
├── template                  # Present when template evaluation ran
│   ├── raw_llm_response, trace_messages
│   ├── parsed_llm_response, parsed_gt_response
│   ├── verify_result, verify_granular_result
│   ├── embedding_*, regex_*, abstention_*, sufficiency_*
│   └── usage_metadata, agent_metrics
├── rubric                    # Present when rubric evaluation ran
│   ├── llm_trait_scores, llm_trait_labels
│   ├── regex_trait_scores, callable_trait_scores
│   └── metric_trait_scores, metric_trait_confusion_lists
├── deep_judgment             # Present when deep judgment enabled (templates)
│   ├── extracted_excerpts, attribute_reasoning
│   └── hallucination_risk_assessment
└── deep_judgment_rubric      # Present when deep judgment enabled (rubrics)
    ├── extracted_rubric_excerpts, rubric_trait_reasoning
    ├── deep_judgment_rubric_scores, standard_rubric_scores
    └── trait_metadata, traits_without_valid_excerpts
Plus three shared trace-filtering fields at the root level:
| Field | Type | Description |
| --- | --- | --- |
| evaluation_input | str \| None | The input text passed to evaluation (full trace or final AI message) |
| used_full_trace | bool | Whether the full trace was used (True) or only the final AI message (False) |
| trace_extraction_error | str \| None | Error message if final AI message extraction failed |
Accessing Fields
Fields are accessed through their nested section objects, not directly on the result:
# Correct: access through section
result.metadata.question_id
result.template.verify_result
result.rubric.llm_trait_scores

# Optional sections need a None check
if result.template:
    print(result.template.verify_result)
if result.rubric:
    print(result.rubric.get_all_trait_scores())
Metadata
The metadata section is always present on every result. It identifies the question, models, and execution context.
Identification Fields
| Field | Type | Description |
| --- | --- | --- |
| question_id | str | Question identifier (URN format) |
| template_id | str | MD5 hash of the template code, or "no_template" if none |
| result_id | str | Deterministic 16-char SHA256 hash computed from question, models, timestamp, and replicate |
| question_text | str | Full text of the question |
| raw_answer | str \| None | Ground truth answer from the checkpoint (if provided) |
| keywords | list[str] \| None | Keywords associated with the question |
| run_name | str \| None | Optional name for this verification run |
| replicate | int \| None | Replicate number (1, 2, 3, ...) for repeated runs |
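The result_id description implies a truncated SHA-256 digest over the identifying inputs. The following is a minimal sketch of that idea only. The helper name make_result_id, the "|" separator, and the field order are hypothetical and are not confirmed by the library.

```python
import hashlib
from typing import Optional


def make_result_id(question_id: str, answering: str, parsing: str,
                   timestamp: str, replicate: Optional[int]) -> str:
    """Hypothetical helper: hash the identifying inputs, keep 16 hex chars."""
    # Assumption: the inputs are joined with "|" before hashing.
    payload = "|".join([question_id, answering, parsing, timestamp, str(replicate)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Made-up inputs for illustration
rid = make_result_id("urn:uuid:question-1", "langchain:claude-haiku-4-5",
                     "langchain:claude-sonnet-4-5", "2025-01-01T00:00:00", 1)
print(rid, len(rid))
```

Under this scheme identical inputs always yield the same ID, and because the timestamp is one of the inputs, a later rerun of the same question yields a different one.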
| Field | Type | Description |
| --- | --- | --- |
| answering | ModelIdentity | Identity of the answering model |
| parsing | ModelIdentity | Identity of the parsing (judge) model |
| answering_system_prompt | str \| None | System prompt used for the answering model |
| parsing_system_prompt | str \| None | System prompt used for the parsing model |
A ModelIdentity contains:
| Field | Type | Description |
| --- | --- | --- |
| interface | str | Adapter interface (e.g., "langchain", "claude_agent_sdk") |
| model_name | str | Model name (e.g., "claude-haiku-4-5", "claude-sonnet-4-5") |
| tools | list[str] | MCP server names (only for answering models; empty for parsing models) |
Convenience properties:
metadata.answering_model — Returns answering.display_string (e.g., "langchain:claude-haiku-4-5")
metadata.parsing_model — Returns parsing.display_string
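To make the relationship between ModelIdentity and its display string concrete, here is an illustrative stand-in. ModelIdentitySketch is not the library class, and the "interface:model_name" format is inferred from the single documented example; how attached tools affect the string is not covered here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelIdentitySketch:
    """Illustrative stand-in for ModelIdentity; not the library class."""
    interface: str
    model_name: str
    tools: List[str] = field(default_factory=list)

    @property
    def display_string(self) -> str:
        # Assumption: "<interface>:<model_name>", matching the documented example.
        return f"{self.interface}:{self.model_name}"


answering = ModelIdentitySketch("langchain", "claude-haiku-4-5")
print(answering.display_string)  # langchain:claude-haiku-4-5
```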
Execution Fields
| Field | Type | Description |
| --- | --- | --- |
| completed_without_errors | bool | Whether verification completed successfully |
| error | str \| None | Error message if verification failed |
| execution_time | float | Execution time in seconds |
| timestamp | str | ISO timestamp of when verification was run |
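A common first step in analysis is separating failed results from successful ones. The sketch below uses SimpleNamespace objects as stand-ins that expose only the execution fields documented above; the sample values and the meta helper are made up.

```python
from types import SimpleNamespace


def meta(question_id, ok, error, seconds):
    """Build a stand-in result exposing the documented execution fields."""
    return SimpleNamespace(metadata=SimpleNamespace(
        question_id=question_id, completed_without_errors=ok,
        error=error, execution_time=seconds))


# Made-up sample data
results = [
    meta("urn:q:1", True, None, 1.5),
    meta("urn:q:2", False, "parsing timed out", 30.0),
    meta("urn:q:3", True, None, 2.5),
]

# Map each failed question to its error message
failed = {r.metadata.question_id: r.metadata.error
          for r in results if not r.metadata.completed_without_errors}

# Average execution time over successful results only
ok_times = [r.metadata.execution_time
            for r in results if r.metadata.completed_without_errors]
mean_ok_time = sum(ok_times) / len(ok_times) if ok_times else None

print(failed)
print(mean_ok_time)
```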
Scenario Linking Fields
These fields are populated only for results produced by scenario execution. For standalone (non-scenario) questions, all four are None.
| Field | Type | Description |
| --- | --- | --- |
| scenario_id | str \| None | Name of the scenario that produced this result |
| scenario_node | str \| None | Node ID within the scenario graph for this turn |
| scenario_turn | int \| None | Zero-based turn index within the scenario execution |
| scenario_path | list[str] \| None | Ordered list of node IDs visited up to and including this turn |
# Scenario results carry linking metadata
for vr in result_set.results:
    if vr.metadata.scenario_id:
        print(f"Scenario: {vr.metadata.scenario_id}, "
              f"node: {vr.metadata.scenario_node}, "
              f"turn: {vr.metadata.scenario_turn}")
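Because results from several scenarios and standalone questions can share one flat list, it can help to regroup them. A minimal sketch, using stand-in objects with only the linking fields documented above (the scenario and node names are made up):

```python
from collections import defaultdict
from types import SimpleNamespace


def turn(scenario_id, node, index):
    """Build a stand-in result exposing the scenario linking fields."""
    return SimpleNamespace(metadata=SimpleNamespace(
        scenario_id=scenario_id, scenario_node=node, scenario_turn=index))


results = [
    turn("triage", "follow_up", 1),
    turn(None, None, None),        # standalone question: all linking fields None
    turn("triage", "intake", 0),
]

# Group scenario results by scenario name, skipping standalone questions
by_scenario = defaultdict(list)
for vr in results:
    if vr.metadata.scenario_id is not None:
        by_scenario[vr.metadata.scenario_id].append(vr)

# Within each scenario, order the visited nodes by zero-based turn index
ordered_nodes = {
    name: [vr.metadata.scenario_node
           for vr in sorted(turns, key=lambda v: v.metadata.scenario_turn)]
    for name, turns in by_scenario.items()
}
print(ordered_nodes)
```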
Template Results
The template section is present when template evaluation was performed (evaluation mode template_only or template_and_rubric). Access via result.template.
Answer Generation
| Field | Type | Description |
| --- | --- | --- |
| raw_llm_response | str | Raw text response from the answering model |
| trace_messages | list[dict] | Full message trace (for multi-turn/agent interactions) |
Parsed Responses
| Field | Type | Description |
| --- | --- | --- |
| parsed_llm_response | dict \| None | Fields extracted by the judge LLM (excludes id and correct) |
| parsed_gt_response | dict \| None | Ground truth values from the template's correct field |
Example:
if result.template and result.template.parsed_llm_response:
    print("Extracted:", result.template.parsed_llm_response)
    # {"tissue": "pancreas", "gene": "KRAS"}
    print("Expected:", result.template.parsed_gt_response)
    # {"tissue": "pancreas", "gene": "KRAS"}
Verification Outcomes
| Field | Type | Description |
| --- | --- | --- |
| template_verification_performed | bool | Whether verify() was executed |
| verify_result | bool \| None | Template verification result (True/False, or None if skipped) |
| verify_granular_result | Any \| None | Granular verification result from verify_granular() (e.g., 0.67 for partial credit) |
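Since verify_result has three states and the template section itself can be absent, a pass rate needs to treat None separately from False. The sketch below shows one way to tally outcomes on stand-in objects; the outcome helper and the choice to exclude skipped results from the denominator are assumptions, not library behavior.

```python
from collections import Counter
from types import SimpleNamespace

# Made-up stand-ins covering each case
results = [
    SimpleNamespace(template=SimpleNamespace(verify_result=True)),
    SimpleNamespace(template=SimpleNamespace(verify_result=False)),
    SimpleNamespace(template=SimpleNamespace(verify_result=None)),  # skipped
    SimpleNamespace(template=None),                                 # no template section
]


def outcome(vr):
    """Classify a result as passed, failed, or skipped."""
    if vr.template is None or vr.template.verify_result is None:
        return "skipped"
    return "passed" if vr.template.verify_result else "failed"


counts = Counter(outcome(vr) for vr in results)
graded = counts["passed"] + counts["failed"]
pass_rate = counts["passed"] / graded if graded else None
print(dict(counts), pass_rate)
```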
Embedding Check
| Field | Type | Description |
| --- | --- | --- |
| embedding_check_performed | bool | Whether embedding check was attempted |
| embedding_similarity_score | float \| None | Similarity score between 0.0 and 1.0 |
| embedding_override_applied | bool | Whether the embedding check overrode the verify result |
| embedding_model_used | str \| None | Name of the embedding model used |
Regex Validations
| Field | Type | Description |
| --- | --- | --- |
| regex_validations_performed | bool | Whether regex validation was attempted |
| regex_validation_results | dict[str, bool] \| None | Per-pattern pass/fail results |
| regex_validation_details | dict[str, dict] \| None | Detailed match information per pattern |
| regex_overall_success | bool \| None | Overall regex validation result |
| regex_extraction_results | dict[str, Any] \| None | What the regex patterns actually extracted |
Abstention Detection
| Field | Type | Description |
| --- | --- | --- |
| abstention_check_performed | bool | Whether abstention check was attempted |
| abstention_detected | bool \| None | Whether the model refused or abstained from answering |
| abstention_override_applied | bool | Whether abstention check overrode the result |
| abstention_reasoning | str \| None | LLM's reasoning for the abstention determination |
Sufficiency Detection
| Field | Type | Description |
| --- | --- | --- |
| sufficiency_check_performed | bool | Whether sufficiency check was attempted |
| sufficiency_detected | bool \| None | Whether the response has sufficient information (True = sufficient) |
| sufficiency_override_applied | bool | Whether sufficiency check overrode the result |
| sufficiency_reasoning | str \| None | LLM's reasoning for the sufficiency determination |
MCP and Agent Metrics
| Field | Type | Description |
| --- | --- | --- |
| recursion_limit_reached | bool | Whether the agent hit its recursion limit |
| answering_mcp_servers | list[str] \| None | Names of MCP servers attached to the answering model |
| agent_metrics | dict \| None | MCP agent execution metrics (see structure below) |
Agent metrics structure (only present when an agent was used):
{
    "iterations": 3,                  # Number of agent think-act cycles
    "tool_calls": 5,                  # Total tool invocations
    "tools_used": ["mcp__brave_search", "mcp__read_resource"],
    "suspect_failed_tool_calls": 2,   # Tool calls with error-like output
    "suspect_failed_tools": ["mcp__brave_search"]
}
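One use of these metrics is estimating how often tool calls looked like failures. A small sketch over the example dictionary above; the suspect_failure_rate helper is hypothetical, and it returns None when no agent was used or no tools were called.

```python
agent_metrics = {
    "iterations": 3,
    "tool_calls": 5,
    "tools_used": ["mcp__brave_search", "mcp__read_resource"],
    "suspect_failed_tool_calls": 2,
    "suspect_failed_tools": ["mcp__brave_search"],
}


def suspect_failure_rate(metrics):
    """Share of tool calls with error-like output, or None if not applicable."""
    if not metrics or not metrics.get("tool_calls"):
        return None
    return metrics.get("suspect_failed_tool_calls", 0) / metrics["tool_calls"]


rate = suspect_failure_rate(agent_metrics)
print(rate)  # 2 of 5 calls
```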
Token Usage
| Field | Type | Description |
| --- | --- | --- |
| usage_metadata | dict \| None | Token usage breakdown by verification stage |
Usage metadata structure:
{
    "answer_generation": {
        "input_tokens": 150, "output_tokens": 200, "total_tokens": 350,
        "model": "claude-haiku-4-5"
    },
    "parsing": {"input_tokens": 200, "output_tokens": 80, "total_tokens": 280},
    "rubric_evaluation": {...},
    "abstention_check": {...},
    "total": {"input_tokens": 600, "output_tokens": 360, "total_tokens": 960}
}
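The per-stage breakdown can be re-aggregated for cost analysis. The sketch below sums every stage except "total" and compares the sum with the reported total. The rubric_evaluation and abstention_check numbers are made up to fill the elided entries, and the idea that "total" equals the sum of the stages is an assumption drawn from the example.

```python
usage_metadata = {
    "answer_generation": {"input_tokens": 150, "output_tokens": 200,
                          "total_tokens": 350, "model": "claude-haiku-4-5"},
    "parsing": {"input_tokens": 200, "output_tokens": 80, "total_tokens": 280},
    # Hypothetical values for the stages elided in the example above
    "rubric_evaluation": {"input_tokens": 180, "output_tokens": 60, "total_tokens": 240},
    "abstention_check": {"input_tokens": 70, "output_tokens": 20, "total_tokens": 90},
    "total": {"input_tokens": 600, "output_tokens": 360, "total_tokens": 960},
}

KEYS = ("input_tokens", "output_tokens", "total_tokens")


def sum_stages(usage):
    """Add up token counts over all stages, ignoring the 'total' entry."""
    totals = dict.fromkeys(KEYS, 0)
    for stage, counts in usage.items():
        if stage == "total":
            continue
        for key in KEYS:
            totals[key] += counts.get(key, 0)
    return totals


stage_sum = sum_stages(usage_metadata)
print(stage_sum)
print(stage_sum == usage_metadata["total"])
```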
Rubric Results
The rubric section is present when rubric evaluation was performed (evaluation mode template_and_rubric or rubric_only). Access via result.rubric.
Evaluation Status
| Field | Type | Description |
| --- | --- | --- |
| rubric_evaluation_performed | bool | Whether rubric evaluation was executed |
| rubric_evaluation_strategy | str \| None | Strategy used: "batch" or "sequential" |
Trait Scores by Type
Scores are split by trait type for type-safe access:
| Field | Type | Description |
| --- | --- | --- |
| llm_trait_scores | dict[str, int \| bool] \| None | LLM-evaluated traits: boolean (True/False) for boolean kind, integer score for score kind, class index for literal kind |
| llm_trait_labels | dict[str, str] \| None | Human-readable class names for literal kind traits (e.g., {"tone": "Professional"}) |
| regex_trait_scores | dict[str, bool] \| None | Regex-based traits (boolean pass/fail) |
| callable_trait_scores | dict[str, bool \| int] \| None | Callable-based traits (boolean or integer score) |
| metric_trait_scores | dict[str, dict[str, float]] \| None | Metric traits with nested metrics (e.g., {"extraction": {"precision": 1.0, "recall": 0.8, "f1": 0.89}}) |
| agentic_trait_scores | dict[str, int \| bool] \| None | Agentic rubric trait scores, keyed by trait name. Same value types as LLM traits (boolean, integer score, or class index for literal kind). |
| agentic_trait_investigation_traces | dict[str, str] \| None | Raw investigation traces from agentic trait agents, keyed by trait name. Each trace is the full text of the agent's investigation session. |
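For literal kind traits, llm_trait_scores holds a class index while llm_trait_labels holds the readable name, so the two dictionaries are usually read together. A minimal sketch with made-up values; the describe helper is hypothetical.

```python
# Made-up sample: "tone" is a literal kind trait stored as a class index
llm_trait_scores = {"clarity": 4, "has_citations": True, "tone": 1}
llm_trait_labels = {"tone": "Professional"}


def describe(scores, labels):
    """Prefer the class label when one exists; otherwise keep the raw score."""
    labels = labels or {}
    return {name: labels.get(name, value) for name, value in (scores or {}).items()}


readable = describe(llm_trait_scores, llm_trait_labels)
print(readable)
```

Both arguments may be None on a real result, which is why the helper falls back to empty dictionaries.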
Metric Trait Details
| Field | Type | Description |
| --- | --- | --- |
| metric_trait_confusion_lists | dict[str, dict[str, list[str]]] \| None | Confusion matrix lists per metric trait |
Confusion lists structure:
{
    "feature_extraction": {
        "tp": ["feature_A", "feature_B"],   # True positives
        "tn": ["irrelevant_1"],             # True negatives
        "fp": ["hallucinated_1"],           # False positives
        "fn": ["missed_feature"]            # False negatives
    }
}
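The confusion lists are enough to recompute the standard metrics by hand, which is useful for checking metric_trait_scores. The sketch below applies the textbook definitions of precision, recall, and F1 to the example above; the library's own handling of empty lists may differ from the zero fallback used here.

```python
confusion = {
    "feature_extraction": {
        "tp": ["feature_A", "feature_B"],
        "tn": ["irrelevant_1"],
        "fp": ["hallucinated_1"],
        "fn": ["missed_feature"],
    }
}


def precision_recall_f1(lists):
    """Compute precision, recall, and F1 from tp/fp/fn item lists."""
    tp, fp, fn = len(lists["tp"]), len(lists["fp"]), len(lists["fn"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


metrics = {trait: precision_recall_f1(lists) for trait, lists in confusion.items()}
print(metrics)
```

With two true positives, one false positive, and one false negative, precision, recall, and F1 all come out to 2/3.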
Convenience Methods
The VerificationResultRubric provides helper methods for working with trait scores:
| Method | Returns | Description |
| --- | --- | --- |
| get_all_trait_scores() | dict | All trait scores across all types in a flat dictionary |
| get_trait_by_name(name) | tuple \| None | Look up a trait by name; returns (value, trait_type) or None |
| get_llm_trait_labels() | dict[str, str] | Class labels for literal kind LLM traits |
Example:
if result.rubric:
    # Get all scores at once
    all_scores = result.rubric.get_all_trait_scores()
    # {"clarity": 4, "has_citations": True, "extraction": {"precision": 1.0, ...}}

    # Look up a specific trait
    match = result.rubric.get_trait_by_name("clarity")
    if match:
        value, trait_type = match  # (4, "llm")

    # Get literal trait labels
    labels = result.rubric.get_llm_trait_labels()
    # {"response_type": "Factual", "tone": "Professional"}
Deep Judgment (Templates)
The deep_judgment section is present when deep judgment was enabled for template evaluation. It provides multi-stage parsing with excerpts and reasoning for each template attribute. Access via result.deep_judgment.
Status Fields
| Field | Type | Description |
| --- | --- | --- |
| deep_judgment_enabled | bool | Whether deep judgment was configured |
| deep_judgment_performed | bool | Whether deep judgment was successfully executed |
| deep_judgment_stages_completed | list[str] \| None | Stages completed: ["excerpts", "reasoning", "parameters"] |
| deep_judgment_model_calls | int | Number of LLM invocations for deep judgment |
| deep_judgment_excerpt_retry_count | int | Number of retries for excerpt validation |
| attributes_without_excerpts | list[str] \| None | Attributes with no corroborating excerpts found |
Excerpts and Reasoning
| Field | Type | Description |
| --- | --- | --- |
| extracted_excerpts | dict[str, list[dict]] \| None | Extracted excerpts per attribute |
| attribute_reasoning | dict[str, str] \| None | Reasoning traces per attribute |
Excerpt structure:
{
    "tissue": [
        {
            "text": "KRAS is most essential in the pancreas",
            "confidence": "high",  # "low", "medium", or "high"
            "similarity_score": 0.92,
            # Only when search enabled:
            "search_results": "External validation text...",
            "hallucination_risk": "none",  # "none", "low", "medium", or "high"
            "hallucination_justification": "Strong external evidence supports this claim"
        }
    ]
}
An empty list [] for an attribute indicates no excerpts were found (e.g., the model refused to answer or no corroborating evidence exists). Reasoning can still exist for attributes without excerpts, explaining why none were found.
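To find the attributes that came back without excerpts and read the reasoning recorded for them, the two dictionaries can be joined by attribute name. A minimal sketch on made-up data shaped like the structure above:

```python
# Made-up sample: "gene" has no excerpts but still has reasoning
extracted_excerpts = {
    "tissue": [{"text": "KRAS is most essential in the pancreas",
                "confidence": "high", "similarity_score": 0.92}],
    "gene": [],
}
attribute_reasoning = {
    "tissue": "The response states the tissue explicitly.",
    "gene": "The response never names a gene.",
}

# Attributes whose excerpt list is empty
missing = [attr for attr, excerpts in extracted_excerpts.items() if not excerpts]

# Pair each one with its reasoning, if any was recorded
explanations = {attr: attribute_reasoning.get(attr, "no reasoning recorded")
                for attr in missing}
print(explanations)
```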
Search-Enhanced Fields
| Field | Type | Description |
| --- | --- | --- |
| deep_judgment_search_enabled | bool | Whether search enhancement was enabled |
| hallucination_risk_assessment | dict[str, str] \| None | Per-attribute hallucination risk ("none", "low", "medium", "high") |
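Because the risk levels are ordered strings, flagging attributes at or above a threshold needs an explicit ordering. The sketch below assumes the order "none" < "low" < "medium" < "high"; the sample assessment and the at_or_above helper are hypothetical.

```python
# Assumed ordering of the documented risk levels, lowest to highest
RISK_ORDER = ["none", "low", "medium", "high"]

# Made-up per-attribute assessment
hallucination_risk_assessment = {"tissue": "none", "gene": "medium", "pathway": "high"}


def at_or_above(assessment, threshold):
    """Return attribute names whose risk is at least the given level."""
    floor = RISK_ORDER.index(threshold)
    return sorted(attr for attr, risk in (assessment or {}).items()
                  if RISK_ORDER.index(risk) >= floor)


flagged = at_or_above(hallucination_risk_assessment, "medium")
print(flagged)
```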
Deep Judgment (Rubrics)
The deep_judgment_rubric section is present when deep judgment was enabled for rubric trait evaluation. It provides per-trait excerpts, reasoning, and scores. Access via result.deep_judgment_rubric.
Status and Scores
| Field | Type | Description |
| --- | --- | --- |
| deep_judgment_rubric_performed | bool | Whether deep judgment rubric evaluation was executed |
| deep_judgment_rubric_scores | dict[str, int \| bool] \| None | Scores for traits evaluated with deep judgment |
| standard_rubric_scores | dict[str, int \| bool] \| None | Scores for traits evaluated without deep judgment (in the same rubric) |
Per-Trait Excerpts and Reasoning
| Field | Type | Description |
| --- | --- | --- |
| extracted_rubric_excerpts | dict[str, list[dict]] \| None | Extracted excerpts per trait (same structure as template excerpts) |
| rubric_trait_reasoning | dict[str, str] \| None | Reasoning text per trait explaining how the score was determined |
| Field | Type | Description |
| --- | --- | --- |
| trait_metadata | dict[str, dict] \| None | Detailed tracking per trait |
Trait metadata structure:
{
    "clarity": {
        "stages_completed": ["excerpt_extraction", "reasoning_generation", "score_extraction"],
        "model_calls": 3,
        "had_excerpts": True,
        "excerpt_retry_count": 1,
        "excerpt_validation_failed": False
    }
}
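Per-trait metadata can be rolled up to see where deep judgment spent its model calls and retries. A minimal sketch on made-up data (the "accuracy" entry is hypothetical). Whether these sums match the section's aggregate fields is an assumption, not something stated here.

```python
trait_metadata = {
    "clarity": {
        "stages_completed": ["excerpt_extraction", "reasoning_generation", "score_extraction"],
        "model_calls": 3,
        "had_excerpts": True,
        "excerpt_retry_count": 1,
        "excerpt_validation_failed": False,
    },
    # Hypothetical second trait for illustration
    "accuracy": {
        "stages_completed": ["excerpt_extraction"],
        "model_calls": 4,
        "had_excerpts": False,
        "excerpt_retry_count": 2,
        "excerpt_validation_failed": True,
    },
}

# Roll up per-trait counters
total_calls = sum(m["model_calls"] for m in trait_metadata.values())
total_retries = sum(m["excerpt_retry_count"] for m in trait_metadata.values())

# Traits whose excerpt validation failed
failed_validation = sorted(name for name, m in trait_metadata.items()
                           if m["excerpt_validation_failed"])
print(total_calls, total_retries, failed_validation)
```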
Auto-Fail and Search Fields
| Field | Type | Description |
| --- | --- | --- |
| traits_without_valid_excerpts | list[str] \| None | Trait names that failed to extract valid excerpts after all retries (triggers auto-fail) |
| rubric_hallucination_risk_assessment | dict[str, dict] \| None | Per-trait hallucination risk assessment |
Hallucination risk structure:
{
    "clarity": {
        "overall_risk": "low",
        "per_excerpt_risks": ["none", "low"]
    }
}
Aggregate Statistics
| Field | Type | Description |
| --- | --- | --- |
| total_deep_judgment_model_calls | int | Total LLM calls across all deep judgment traits |
| total_traits_evaluated | int | Number of traits evaluated with deep judgment |
| total_excerpt_retries | int | Total retry attempts across all traits |
Field Count Summary
For verification against source code, here are the field counts per section:
| Section | Fields | Convenience Methods |
| --- | --- | --- |
| Root level | 3 (evaluation_input, used_full_trace, trace_extraction_error) | - |
| metadata | 20 fields (16 core + 4 scenario linking) + 2 properties | answering_model, parsing_model |
| template | 28 fields | - |
| rubric | 10 fields (including 2 agentic trait fields) | get_all_trait_scores(), get_trait_by_name(), get_llm_trait_labels() |
| deep_judgment | 10 fields | - |
| deep_judgment_rubric | 11 fields | - |
Next Steps