Full Evaluation¶
This scenario runs verification with both template and rubric evaluation — the most comprehensive single-model workflow. You enable rubrics alongside templates, configure quality checks (abstention, sufficiency, embedding), customize prompts with PromptConfig, and use presets for repeatable configurations.
What you'll learn:
- Configure template+rubric evaluation mode
- Enable abstention, sufficiency, and embedding checks
- Customize judge prompts with PromptConfig
- Use presets for repeatable configurations
- Inspect combined template and rubric results
Load Benchmark¶
from karenina import Benchmark
benchmark = Benchmark.load(str(_tmp))
print(f"{benchmark.name}: {benchmark.question_count} questions")
For details on loading and inspecting benchmarks, see Basic Verification.
Configure Template+Rubric Mode¶
Enable rubric evaluation alongside templates, plus optional quality checks:
from karenina.schemas.verification import VerificationConfig, ModelConfig  # import path assumed to match PromptConfig's module

config = VerificationConfig(
answering_models=[
ModelConfig(id="haiku", model_name="claude-haiku-4-5",
model_provider="anthropic", interface="langchain")
],
parsing_models=[
ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
model_provider="anthropic", interface="langchain",
temperature=0.0)
],
# Evaluation mode
evaluation_mode="template_and_rubric",
# Quality checks
abstention_enabled=True,
sufficiency_enabled=True,
# Embedding similarity
embedding_check_enabled=True,
embedding_check_threshold=0.85,
)
print(f"Mode: {config.evaluation_mode}")
print(f"Abstention: {config.abstention_enabled}")
print(f"Sufficiency: {config.sufficiency_enabled}")
print(f"Embedding: {config.embedding_check_enabled}")
When abstention is detected, the parsing and rubric stages are skipped and the result is auto-failed. When the response is judged insufficient to answer the question, the same skipping applies. See Evaluation Modes for the full stage-skipping rules.
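The skip logic can be modeled as a plain function. This is an illustrative sketch of the rules above, not karenina internals; the stage names are stand-ins:

```python
def stages_to_run(abstention_detected: bool, sufficient: bool) -> list[str]:
    """Model the stage-skipping rules: abstention or insufficiency
    short-circuits parsing and rubric evaluation, auto-failing the result."""
    if abstention_detected or not sufficient:
        return ["auto_fail"]
    return ["parsing", "template_verification", "rubric_evaluation"]

print(stages_to_run(abstention_detected=True, sufficient=True))
print(stages_to_run(abstention_detected=False, sufficient=True))
```

Either quality check failing routes the result straight to auto-fail, so no parsing or rubric tokens are spent on it.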
Customize with PromptConfig¶
PromptConfig lets you inject custom instructions into the judge prompts:
from karenina.schemas.verification import PromptConfig
prompt_config = PromptConfig(
parsing="Focus on extracting exact values. If the response contains hedging language, extract the most definitive statement.",
rubric_evaluation="Evaluate rubric traits strictly — partial compliance should score lower.",
)
config_with_prompts = VerificationConfig(
answering_models=[
ModelConfig(id="haiku", model_name="claude-haiku-4-5",
model_provider="anthropic", interface="langchain")
],
parsing_models=[
ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
model_provider="anthropic", interface="langchain",
temperature=0.0)
],
evaluation_mode="template_and_rubric",
prompt_config=prompt_config,
)
print(f"Parsing prompt: {config_with_prompts.prompt_config.parsing[:50]}...")
print(f"Rubric prompt: {config_with_prompts.prompt_config.rubric_evaluation[:50]}...")
PromptConfig fields correspond to pipeline stages. See PromptConfig Reference for all injection points.
Use Presets¶
Presets save a full VerificationConfig as JSON for repeatable runs:
Load from Preset File¶
# config = VerificationConfig.from_preset(Path("presets/production.json"))
print("Load with: VerificationConfig.from_preset(Path('presets/production.json'))")
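A preset is just the serialized configuration. The sketch below round-trips a hand-written dict with the standard `json` module; the field names follow this page's examples, but the authoritative format is the Preset Schema Reference:

```python
import json
import tempfile
from pathlib import Path

# Hand-written preset mirroring the config fields used on this page
preset = {
    "evaluation_mode": "template_and_rubric",
    "abstention_enabled": True,
    "sufficiency_enabled": True,
    "embedding_check_enabled": True,
    "embedding_check_threshold": 0.85,
}

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "production.json"
    path.write_text(json.dumps(preset, indent=2))
    loaded = json.loads(path.read_text())

print(loaded["evaluation_mode"])
```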
Override Preset Values¶
# Apply overrides on top of a preset base
config_override = VerificationConfig.from_overrides(
answering_id="haiku", answering_model="claude-haiku-4-5",
parsing_id="haiku-parser", parsing_model="claude-haiku-4-5",
evaluation_mode="template_and_rubric",
abstention=True,
)
print(f"Mode: {config_override.evaluation_mode}")
CLI with Preset + Overrides¶
# Combine preset with command-line overrides:
# karenina verify checkpoint.jsonld --preset production.json \
# --evaluation-mode template_and_rubric --abstention --sufficiency
print("CLI: karenina verify ... --preset production.json --abstention --sufficiency")
Run Verification¶
results = benchmark.run_verification(config)
print(f"Total results: {len(results)}")
for result in results:
meta = result.metadata
t = result.template
if t and t.abstention_detected:
print(f"[ABSTAINED] {meta.question_text[:50]}")
elif t and t.sufficiency_detected is False:
print(f"[INSUFFICIENT] {meta.question_text[:50]}")
elif t and t.template_verification_performed:
status = "PASS" if t.verify_result else "FAIL"
print(f"[{status}] {meta.question_text[:50]}")
else:
print(f"[SKIPPED] {meta.question_text[:50]}")
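For larger runs, a per-status tally reads better than one line per result. A sketch over mock status strings classified as in the loop above (the list is a stand-in for real results):

```python
from collections import Counter

# Stand-in statuses, as classified by the loop above
statuses = ["PASS", "FAIL", "ABSTAINED", "PASS", "INSUFFICIENT", "PASS"]

summary = Counter(statuses)
total = len(statuses)
for status, count in summary.most_common():
    print(f"{status}: {count} ({count / total:.0%})")
```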
Rubric Scores¶
for result in results:
if result.rubric and result.rubric.rubric_evaluation_performed:
scores = result.rubric.get_all_trait_scores()
print(f"Q: {result.metadata.question_text[:40]} Traits: {scores}")
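To compare traits across the whole run, average each trait's score over the results where it was evaluated. A sketch over plain dicts shaped like get_all_trait_scores() output (the sample values are illustrative):

```python
from collections import defaultdict

# Per-result trait scores; booleans count as 0/1
all_scores = [
    {"interaction_safety": True, "dosing_clarity": 4},
    {"dosing_clarity": 5},
    {"interaction_safety": False, "dosing_clarity": 3},
]

totals: defaultdict[str, list[float]] = defaultdict(list)
for scores in all_scores:
    for trait, value in scores.items():
        totals[trait].append(float(value))

averages = {trait: sum(vals) / len(vals) for trait, vals in totals.items()}
print(averages)
```

Traits absent from a result (e.g. skipped dynamic traits) simply contribute nothing to that trait's average.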
Embedding Scores¶
for result in results:
t = result.template
if t and t.embedding_check_performed:
print(f"Q: {result.metadata.question_text[:40]} "
f"Similarity: {t.embedding_similarity_score:.2f} "
f"Override: {t.embedding_override_applied}")
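The embedding check passes when the cosine similarity between the answer's and the ground truth's embeddings clears embedding_check_threshold. A pure-Python sketch of that comparison, with toy 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

answer_vec = [0.9, 0.1, 0.3]  # toy embedding of the model's answer
truth_vec = [0.8, 0.2, 0.3]   # toy embedding of the ground-truth answer

similarity = cosine_similarity(answer_vec, truth_vec)
print(f"Similarity: {similarity:.2f}  Passes 0.85 threshold: {similarity >= 0.85}")
```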
Dynamic Rubric¶
A DynamicRubric allows conditional rubric evaluation: traits are only scored when their concept is detected in the response. Attach a DynamicRubric to individual questions so that traits irrelevant to a particular response are skipped rather than evaluated against unrelated content.
from karenina.schemas.entities.rubric import DynamicRubric, LLMRubricTrait
dynamic = DynamicRubric(
llm_traits=[
LLMRubricTrait(
name="interaction_safety",
summary="drug interaction warnings",
description="Answer True if the response includes drug interaction warnings.",
kind="boolean",
higher_is_better=True,
),
LLMRubricTrait(
name="dosing_clarity",
summary="dosing instructions",
description="Rate dosing clarity from 1 to 5.",
kind="score",
higher_is_better=True,
),
],
)
# Attach per-question
benchmark.add_question(
question="What is the recommended treatment for condition X?",
raw_answer="Drug A, 500mg twice daily",
dynamic_rubric=dynamic,
)
# Run with rubric mode enabled
config = VerificationConfig(
answering_models=[
ModelConfig(id="haiku", model_name="claude-haiku-4-5",
model_provider="anthropic", interface="langchain")
],
parsing_models=[
ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
model_provider="anthropic", interface="langchain")
],
evaluation_mode="template_and_rubric",
)
results = benchmark.run_verification(config)
# Inspect dynamic rubric metadata
for result in results:
if result.rubric:
print(f"Promoted: {result.rubric.dynamic_rubric_promoted_traits}")
print(f"Skipped: {result.rubric.dynamic_rubric_skipped_traits}")
The presence check runs automatically at the start of Stage 11 (RubricEvaluation). Skipped traits do not incur evaluation cost. Static rubric traits (attached via benchmark.set_global_rubric()) are always evaluated; only dynamic rubric traits are gated by presence.
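A simplified model of presence gating: the real detector is a semantic check inside Stage 11, but a keyword version conveys the control flow. The trait names and keywords below are illustrative:

```python
def gate_traits(response: str, traits: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Split traits into promoted (concept present in the response)
    and skipped (concept absent); only promoted traits get scored."""
    text = response.lower()
    promoted = [name for name, kws in traits.items() if any(k in text for k in kws)]
    skipped = [name for name in traits if name not in promoted]
    return promoted, skipped

traits = {
    "interaction_safety": ["interaction", "contraindicated"],
    "dosing_clarity": ["mg", "dose", "daily"],
}
promoted, skipped = gate_traits("Take 500mg twice daily with food.", traits)
print(f"Promoted: {promoted}  Skipped: {skipped}")
```

Only the promoted traits would be forwarded to the rubric judge, which is why skipped traits incur no evaluation cost.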
Related Pages¶
- Basic Verification — Simpler template-only path
- Deep Judgment — Add excerpt-based reasoning to template and rubric evaluation
- VerificationConfig Reference — All configuration fields
- PromptConfig Reference — Prompt injection points
- Preset Schema Reference — Preset JSON format