Iterating on Your Benchmark¶
Verification results often reveal that some templates or rubric traits need refinement. This page covers the workflow for identifying failures, making targeted improvements, and re-running verification to measure progress.
The Iteration Cycle¶
Run verification
│
▼
Identify failures (failing templates, low rubric scores)
│
▼
Diagnose root causes (inspect parsed responses, field mismatches)
│
▼
Make targeted fixes (update templates, adjust rubrics)
│
▼
Re-run verification (on affected questions only)
│
▼
Measure improvement (compare pass rates, trait scores)
Each step uses specific APIs covered below.
Identifying Failing Templates¶
Filter to Failures¶
The TemplateResults accessor provides direct filtering for failed verifications:
results = benchmark.run_verification(config)
# Get template results and filter to failures
template_results = results.get_template_results()
failed = template_results.filter(failed_only=True)
print(f"Failed: {len(failed.results)} / {len(template_results.results)}")
Find Failing Questions¶
Use a DataFrame to identify which questions are failing and why:
df = template_results.to_dataframe()
# Questions where verify_result is False
failing_rows = df[df["verify_result"] == False]
failing_question_ids = failing_rows["question_id"].unique().tolist()
print(f"Questions failing: {len(failing_question_ids)}")
Pass Rate by Question¶
Aggregate pass rates to find the weakest questions:
pass_rates = template_results.aggregate_pass_rate(by="question_id")
for q_id, rate in sorted(pass_rates.items(), key=lambda x: x[1]):
    if rate < 1.0:
        print(f" {q_id}: {rate:.0%}")
Get a Quick Summary¶
summary = template_results.get_template_summary()
print(f"Overall pass rate: {summary['pass_rate']:.0%}")
print(f"Passed: {summary['num_passed']} / {summary['num_results']}")
Diagnosing Template Failures¶
Once you know which questions are failing, inspect the parsed responses to understand why.
Inspect Field-Level Mismatches¶
The template DataFrame includes field-by-field comparison columns:
df = failed.to_dataframe()
# Each row represents one field comparison
for _, row in df.iterrows():
    if not row["field_match"]:
        print(f"Question: {row['question_id']}")
        print(f" Field: {row['field_name']}")
        print(f" Expected (GT): {row['gt_value']}")
        print(f" Parsed (LLM): {row['llm_value']}")
        print()
This reveals common issues:
- Case differences: "BCL2" vs "bcl2" — add .lower() normalization in verify()
- Format differences: "42.0" vs "42" — use numeric comparison with tolerance
- Extra text: "The answer is Paris" vs "Paris" — improve field description to ask for the value only
- Missing fields: None vs expected value — the Judge LLM couldn't extract the field
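The first two fixes in the list above can be combined into one comparison helper. The following is a minimal sketch, not part of Karenina: `values_match` and its tolerance defaults are hypothetical names and values chosen for illustration, and the same logic could be inlined in a template's verify() method.

```python
import math


def values_match(expected, parsed, rel_tol=1e-9, abs_tol=1e-6):
    """Compare a ground-truth value with a parsed value, tolerating format noise.

    Hypothetical helper for illustration; tolerances are assumptions.
    """
    # A missing value only matches another missing value.
    if expected is None or parsed is None:
        return expected is parsed
    # Try a numeric comparison first so "42.0" and "42" are treated as equal.
    try:
        return math.isclose(
            float(expected), float(parsed), rel_tol=rel_tol, abs_tol=abs_tol
        )
    except (TypeError, ValueError):
        pass
    # Fall back to a case- and whitespace-insensitive string comparison.
    return str(expected).strip().lower() == str(parsed).strip().lower()


print(values_match("BCL2", " bcl2 "))               # case and whitespace differences
print(values_match("42.0", "42"))                   # numeric format differences
print(values_match("Paris", "The answer is Paris"))  # extra text still fails
print(values_match("BCL2", None))                   # missing field fails
```

Extra text is deliberately left as a mismatch here, since that case is better fixed in the field description than hidden by a looser comparison.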
Inspect Raw Responses¶
For deeper diagnosis, look at the full verification result:
for result in results:
    if result.template and result.template.verify_result is False:
        meta = result.metadata
        print(f"Question: {meta.question_text[:80]}")
        print(f"Model: {meta.answering.model_name}")
        # What the Judge LLM parsed
        if result.template.parsed_llm_response:
            print(f"Parsed LLM response: {result.template.parsed_llm_response}")
        if result.template.parsed_gt_response:
            print(f"Parsed GT response: {result.template.parsed_gt_response}")
        print()
Fixing Templates¶
Update a Template¶
Once you've diagnosed the issue, update the template code in memory; there is no need to save and reload:
new_template = '''
from pydantic import Field
from karenina.schemas.entities import BaseAnswer
class Answer(BaseAnswer):
    gene_symbol: str = Field(
        description="The official HGNC gene symbol mentioned in the response"
    )
    def ground_truth(self):
        self.correct = {"gene_symbol": "BCL2"}
    def verify(self) -> bool:
        return self.gene_symbol.strip().upper() == self.correct["gene_symbol"].upper()
'''
benchmark.update_template(question_id, new_template)
update_template() replaces the template for that question immediately. The change lives in the in-memory Benchmark object, so you can re-run verification without calling save() first.
Validate After Editing¶
After modifying templates, validate that the code is syntactically correct:
is_valid, errors = benchmark.validate_templates()
if not is_valid:
    for err in errors:
        print(f" {err['question_id']}: {err['error']}")
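A template string can also be checked locally before it is passed to update_template(). The sketch below uses only the standard library `ast` module and is not a substitute for validate_templates(). The `check_template_syntax` helper is hypothetical, and the check for a top-level class named `Answer` assumes templates follow the shape of the example above.

```python
import ast


def check_template_syntax(template_code):
    """Return (is_valid, message) for a template source string.

    Hypothetical pre-check; it parses the code but never executes it.
    """
    try:
        tree = ast.parse(template_code)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    # Assumption: a template defines a top-level class named Answer.
    has_answer = any(
        isinstance(node, ast.ClassDef) and node.name == "Answer"
        for node in tree.body
    )
    if not has_answer:
        return False, "no top-level class named 'Answer' found"
    return True, "ok"


good_template = '''
class Answer:
    def verify(self) -> bool:
        return True
'''

# The colon after the method signature is missing on purpose.
broken_template = '''
class Answer:
    def verify(self) -> bool
        return True
'''

print(check_template_syntax(good_template))
print(check_template_syntax(broken_template))
```

Because the code is only parsed, this catches syntax and indentation errors but not import failures or logic bugs in verify().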
Improving Rubric Traits¶
Identify Low-Scoring Traits¶
Use rubric DataFrames to find traits with consistently low scores:
rubric_results = results.get_rubrics_results()
df_rubric = rubric_results.to_dataframe()
# Average score per trait
trait_scores = df_rubric.groupby("trait_name")["trait_score"].mean()
for name, score in trait_scores.sort_values().items():
    print(f" {name}: {score:.2f}")
Update Global Rubric Traits¶
Remove a poorly performing trait and replace it with an improved version:
from karenina.schemas import LLMRubricTrait
# Remove the old trait
benchmark.remove_global_rubric_trait("clarity")
# Add an improved version with a better description
benchmark.add_global_rubric_trait(
    LLMRubricTrait(
        name="clarity",
        kind="score",
        description="Rate how clearly the response communicates its answer. "
        "A clear response states the answer directly without unnecessary "
        "preamble, hedging, or tangential information. Score 1 for "
        "unclear, 5 for excellent clarity.",
        min_score=1,
        max_score=5,
        higher_is_better=True,
    )
)
Update Question-Specific Traits¶
For traits that only apply to certain questions:
from karenina.schemas import RegexRubricTrait, Rubric
# Replace a question-specific rubric entirely
benchmark.set_question_rubric(
    question_id,
    Rubric(regex_traits=[
        RegexRubricTrait(
            name="citation_format",
            pattern=r"\[\d+\]",
            description="Response includes numbered citations",
            higher_is_better=True,
        )
    ])
)
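Before attaching a regex trait, it can help to try the pattern against a few hand-written responses. This is a minimal sketch using the standard `re` module. The sample responses are invented for illustration, and it assumes the trait matches anywhere in the response (`re.search` semantics), which may differ from how Karenina applies the pattern.

```python
import re

CITATION_PATTERN = r"\[\d+\]"

# Hypothetical responses mapped to whether the pattern should match them.
samples = {
    "Aspirin inhibits COX-1 [1] and COX-2 [2].": True,
    "Aspirin inhibits COX-1 (Smith 2020).": False,
    "See reference [a] for details.": False,
}

compiled = re.compile(CITATION_PATTERN)
mismatches = []
for text, should_match in samples.items():
    # Assumption: the pattern may match anywhere in the response.
    matched = compiled.search(text) is not None
    if matched != should_match:
        mismatches.append(text)
    print(f"{'match' if matched else 'no match':8} | {text}")

print(f"Unexpected results: {len(mismatches)}")
```

Keeping a few negative samples alongside the positive ones makes an overly permissive pattern easier to spot.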
Re-Running Verification¶
Re-Run Only Failing Questions¶
The key to efficient iteration is to re-run verification only on the questions that failed:
# Collect failing question IDs from earlier analysis
failing_question_ids = [...]
# Re-run only those questions
results_v2 = benchmark.run_verification(
    config,
    question_ids=failing_question_ids,
)
This avoids re-running questions that already pass, saving time and API costs.
Compare Before and After¶
# Before (every question in the original run)
summary_v1 = results.get_template_results().get_template_summary()
# After (on the subset that was re-run)
summary_v2 = results_v2.get_template_results().get_template_summary()
print(f"Before: {summary_v1['pass_rate']:.0%} ({summary_v1['num_passed']}/{summary_v1['num_results']})")
print(f"After: {summary_v2['pass_rate']:.0%} ({summary_v2['num_passed']}/{summary_v2['num_results']})")
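The two summaries above cover different sets of questions: the first spans the whole run, the second only the re-run subset. For a like-for-like comparison, one option is to overlay the re-run pass rates on the original per-question rates. The sketch below assumes plain dicts of question ID to pass rate, shaped like the output of aggregate_pass_rate(by="question_id"). The helper names and sample values are hypothetical.

```python
def merge_pass_rates(before, rerun):
    """Overlay re-run pass rates on top of the original per-question rates."""
    merged = dict(before)
    merged.update(rerun)
    return merged


def fully_passing_share(pass_rates):
    """Fraction of questions whose pass rate is exactly 1.0."""
    if not pass_rates:
        return 0.0
    passing = sum(1 for rate in pass_rates.values() if rate == 1.0)
    return passing / len(pass_rates)


# Hypothetical values for illustration only.
rates_v1 = {"q1": 1.0, "q2": 0.0, "q3": 0.5, "q4": 1.0}
rates_v2 = {"q2": 1.0, "q3": 0.5}  # only the failing questions were re-run

merged = merge_pass_rates(rates_v1, rates_v2)
improved = sorted(q for q in rates_v2 if rates_v2[q] > rates_v1.get(q, 0.0))
still_failing = sorted(q for q, rate in merged.items() if rate < 1.0)

print(f"Before: {fully_passing_share(rates_v1):.0%} of questions fully passing")
print(f"After: {fully_passing_share(merged):.0%} of questions fully passing")
print(f"Improved: {improved}")
print(f"Still failing: {still_failing}")
```

Questions that were not re-run keep their original rate, which is only valid if their templates and rubrics were left unchanged.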
Tag Runs for Tracking¶
Use run_name to label iteration runs so you can distinguish them later:
results_v1 = benchmark.run_verification(config, run_name="v1-initial")
# ... make fixes ...
results_v2 = benchmark.run_verification(
    config,
    question_ids=failing_question_ids,
    run_name="v2-fixed-templates",
)
Run names appear in result.metadata.run_name and in exported results.
Common Iteration Patterns¶
Pattern 1: Fix-and-Verify Loop¶
For systematic template improvement:
for q_id in failing_question_ids:
    # Inspect the current template
    current = benchmark.get_template(q_id)
    print(f"\n--- {q_id} ---")
    print(current)
    # Write and apply a fix
    fixed_template = """...""" # Your improved template
    benchmark.update_template(q_id, fixed_template)
    # Verify just this question
    single_result = benchmark.run_verification(config, question_ids=[q_id])
    summary = single_result.get_template_results().get_template_summary()
    print(f"Result: {'PASS' if summary['pass_rate'] == 1.0 else 'FAIL'}")
Pattern 2: Multi-Model Failure Analysis¶
Identify questions that fail on specific models:
template_results = results.get_template_results()
pass_rates = template_results.aggregate_pass_rate(by="question_id")
by_model = template_results.group_by_model()
for model_name, model_results in by_model.items():
    model_pass = model_results.aggregate_pass_rate(by="question_id")
    for q_id, rate in model_pass.items():
        if rate < 1.0 and pass_rates.get(q_id, 0) < 1.0:
            print(f" {q_id} fails on {model_name} (rate: {rate:.0%})")
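Once per-model pass rates are available, it is often useful to separate questions that fail on every model, which usually points at the template, from questions that fail on only some models. The sketch below works on plain nested dicts of model name to per-question pass rates. The `classify_failures` helper, model names, and values are hypothetical.

```python
def classify_failures(rates_by_model):
    """Split failing questions into universal and model-specific failures."""
    question_ids = set()
    for rates in rates_by_model.values():
        question_ids.update(rates)

    universal = []
    model_specific = {}
    for q_id in sorted(question_ids):
        # Only consider models that were actually evaluated on this question.
        evaluated = [m for m, rates in rates_by_model.items() if q_id in rates]
        failing = sorted(m for m in evaluated if rates_by_model[m][q_id] < 1.0)
        if not failing:
            continue
        if len(failing) == len(evaluated):
            universal.append(q_id)
        else:
            model_specific[q_id] = failing
    return universal, model_specific


# Hypothetical pass rates for illustration only.
rates_by_model = {
    "model-a": {"q1": 1.0, "q2": 0.0, "q3": 0.5},
    "model-b": {"q1": 1.0, "q2": 0.0, "q3": 1.0},
}

universal, model_specific = classify_failures(rates_by_model)
print(f"Fail on every model: {universal}")
print(f"Fail on some models: {model_specific}")
```

With a single model, every failure is classified as universal, so the split is only informative for multi-model runs.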
Pattern 3: Save After Iterating¶
Once your templates and rubrics are refined, save the benchmark to preserve changes:
# Save updated benchmark (templates + rubrics persisted)
benchmark.save("benchmark_v2.jsonld")
# Or save to database
benchmark.save_to_db("sqlite:///dbs/benchmarks.db")
Tips for Effective Iteration¶
- Start with templates: Template failures are the most actionable. Fix verify() logic before tuning rubric trait descriptions.
- Use question_ids for targeted re-runs: Avoid re-running the entire benchmark when only a few questions need attention.
- Improve field descriptions first: Many parsing failures come from ambiguous field descriptions. A clearer description in the template often resolves extraction issues without changing verify().
- Normalize in verify(): Add .strip().lower() or numeric tolerance to handle format differences the Judge LLM introduces.
- Check abstention and sufficiency: If many questions fail, check whether the answering model is abstaining or giving insufficient responses. These show up in result.template.abstention_detected and result.template.sufficiency_detected.
- Use run_name to track iterations: This makes it easy to compare results across refinement cycles.
Next Steps¶
- VerificationResult Structure — Understand all available result fields
- DataFrame Analysis — Detailed DataFrame analysis patterns
- Exporting Results — Save results for sharing or archival
- Factual QA Benchmark — Template patterns for complex verify logic
- Full Evaluation Benchmark — Creating and modifying rubric traits