## 1. What It Is
| Mode | Template | Rubric | Default | Best for |
|---|---|---|---|---|
| `template_only` | Yes | No | Yes | Closed-answer questions where correctness is the main signal |
| `template_and_rubric` | Yes | Yes | No | Runs where you need correctness and quality side by side |
| `rubric_only` | No | Yes | No | Open-ended outputs, trace audits, or quality-only evaluation |
Templates answer "did the model get it right?" by parsing the response into a schema and running deterministic `verify()` code. Rubrics answer "was the response good?" by evaluating the raw response or trace with LLM, regex, callable, and metric traits. The evaluation mode decides whether a run activates one path or both.
Two litmus tests:

- If you would struggle to write a defensible `verify()` method, prefer `rubric_only`.
- If a rubric trait would just restate a known factual answer, keep the core decision in a template instead.
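For the first litmus test, "defensible" usually means a short deterministic check against stored ground truth. A minimal illustrative sketch, where the class name and field layout are stand-ins rather than Karenina's actual template base class:

```python
from dataclasses import dataclass


@dataclass
class DrugTargetAnswer:
    """Illustrative answer schema; not Karenina's actual base class."""

    target: str            # field the parser extracts from the raw response
    correct: str = "BCL2"  # ground truth, hidden from the parsing step

    def verify(self) -> bool:
        # Deterministic after parsing: normalize and compare, no LLM involved.
        return self.target.strip().upper() == self.correct
```

If a check like this is easy to write, the core decision belongs in a template; if it is not, that is the signal to reach for `rubric_only`.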
## 2. What Each Side Sees
The two evaluation systems operate on different inputs.
| System | What it sees | What stays hidden or handled elsewhere |
|---|---|---|
| Template flow | Question text, the evaluation input passed to parsing, and the Answer schema | Ground truth stored on `self.correct`; rubric traits |
| Rubric flow | Question text, the rubric evaluation input, and the attached rubric traits | Template `verify()` logic; structured parsed fields unless your trait recreates them itself |
For trace-based runs, the defaults are intentionally asymmetric:
- Template parsing uses only the final AI message by default: `use_full_trace_for_template=False`
- Rubric evaluation uses the full trace by default: `use_full_trace_for_rubric=True`
Templates judge the final answer; rubrics often need to inspect the broader behavior that produced it.
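Assuming these two flags are accepted by `VerificationConfig` (the parameter names come from the defaults listed above), overriding the asymmetry would look roughly like this sketch:

```python
# Hypothetical override: let the template parser see the whole trace too.
config = VerificationConfig(
    answering_models=answering,
    parsing_models=parsing,
    evaluation_mode="template_and_rubric",
    use_full_trace_for_template=True,  # default: False (final AI message only)
    use_full_trace_for_rubric=True,    # default: True (full trace)
)
```

Keeping the defaults is usually right: a template should judge the answer, not the journey.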
### Primary outputs by mode
| Mode | Primary outputs | Determinism profile |
|---|---|---|
| `template_only` | `result.template.verify_result`, parsed fields, optional embedding/deep-judgment metadata | `verify()` is deterministic after parsing; parsing may still use an LLM |
| `template_and_rubric` | Template outputs plus `result.rubric.*_trait_scores` | Mixed: deterministic template verification plus whatever trait types you use |
| `rubric_only` | Rubric trait scores | Depends on trait types: regex/callable can be deterministic; LLM and metric traits are not |
## 3. How It Works
At the orchestration layer, evaluation mode controls which stages `StageOrchestrator.from_config(...)` adds to the pipeline.
| Stage | `template_only` | `template_and_rubric` | `rubric_only` |
|---|---|---|---|
| `ValidateTemplate` | Yes | Yes | No |
| `GenerateAnswer` | Yes | Yes | Yes |
| `RecursionLimitAutoFail` | Yes | Yes | Yes |
| `TraceValidationAutoFail` | Yes | Yes | Yes |
| `AbstentionCheck` | Optional | Optional | Optional |
| `SufficiencyCheck` | Optional | Optional | No |
| `ParseTemplate` | Yes | Yes | No |
| `VerifyTemplate` | Yes | Yes | No |
| `EmbeddingCheck` | Present; only runs when enabled and needed | Present; only runs when enabled and needed | No |
| `DeepJudgmentAutoFail` | Optional | Optional | No |
| `RubricEvaluation` | No | Yes | Yes |
| `AgenticRubricEvaluation` | No | If agentic traits | If agentic traits |
| `DeepJudgmentRubricAutoFail` | No | Optional | Optional |
| `FinalizeResult` | Yes | Yes | Yes |
Four execution details matter in practice:
- `rubric_only` truly skips template stages at the orchestrator level.
- `template_and_rubric` evaluates rubrics on the raw response, not on parsed template fields.
- Rubric stages are only added when a non-empty `Rubric` is actually attached.
- Dynamic rubric traits participate in `template_and_rubric` and `rubric_only` modes. In `template_only` mode, dynamic rubric traits are ignored because rubric evaluation does not run. The presence check that gates dynamic traits occurs at the start of Stage 11 (`RubricEvaluation`), so it only executes in modes that include rubric stages.
That means `template_and_rubric` is not "template mode with nicer reporting." It is a combined run with two distinct evaluation paths:
```
raw response
├─ template path: parse -> verify() -> optional embedding/deep judgment
└─ rubric path: score raw text/trace with attached traits
```
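The mode-to-stage mapping in the table above can be sketched as a simple gating function. This is an illustrative reconstruction, not `StageOrchestrator`'s actual code; optional stages (abstention, sufficiency, embedding, deep judgment, agentic rubric) are elided:

```python
# Simplified reconstruction of the mode gating, not the real orchestrator.
def stages_for(mode: str) -> list[str]:
    use_template = mode in ("template_only", "template_and_rubric")
    use_rubric = mode in ("template_and_rubric", "rubric_only")
    # Stages shared by every mode.
    stages = ["GenerateAnswer", "RecursionLimitAutoFail", "TraceValidationAutoFail"]
    if use_template:
        # Template path wraps the shared stages: validate first, parse/verify after.
        stages = ["ValidateTemplate"] + stages + ["ParseTemplate", "VerifyTemplate"]
    if use_rubric:
        stages.append("RubricEvaluation")
    stages.append("FinalizeResult")
    return stages
```

The key property the sketch preserves is that `rubric_only` never even constructs the template stages; they are absent from the pipeline, not skipped at runtime.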
## 4. Patterns / Worked Examples

### Canonical pattern: one factual question, one optional quality layer
Suppose your benchmark asks:
> Which protein does venetoclax target?
You have:
- a template that verifies the extracted target is `BCL2`
- a rubric with traits like `conciseness` and `has_citation`
The same benchmark can be run in three ways:
| Mode | What Karenina checks | What you read from the result |
|---|---|---|
| `template_only` | Did the parsed target match `BCL2`? | `result.template.verify_result` |
| `template_and_rubric` | Did the parsed target match `BCL2`? And was the response concise / cited? | `result.template.verify_result` and `result.rubric.*_trait_scores` |
| `rubric_only` | Only the quality traits on the raw output | `result.rubric.*_trait_scores` |
### Configuration example
```python
answering = [
    ModelConfig(
        id="answerer",
        model_name="claude-haiku-4-5",
        model_provider="anthropic",
    )
]
parsing = [
    ModelConfig(
        id="judge",
        model_name="claude-haiku-4-5",
        model_provider="anthropic",
    )
]

# template_only is the default, so evaluation_mode can be omitted.
template_only_config = VerificationConfig(
    answering_models=answering,
    parsing_models=parsing,
)

template_and_rubric_config = VerificationConfig(
    answering_models=answering,
    parsing_models=parsing,
    evaluation_mode="template_and_rubric",
)

rubric_only_config = VerificationConfig(
    answering_models=answering,
    parsing_models=parsing,
    evaluation_mode="rubric_only",
)
```
## 5. Detailed Reference

### Configuration rules
`evaluation_mode` alone controls which evaluation paths are active. Setting it to `template_and_rubric` or `rubric_only` automatically enables rubric evaluation; no separate flag is needed.
### Result-shape guidance
When rubric evaluation runs, rubric outputs are stored under `result.rubric`.
When template stages do not run, Karenina still returns a template section with the raw response and related metadata, but template-specific correctness fields behave differently:
- in ordinary `rubric_only` runs, `result.template.verify_result` is typically unset (`None`)
- guard stages such as recursion-limit, trace-validation, or abstention checks can still force `result.template.verify_result` to `False`

So in `rubric_only`, treat rubric scores as the primary output and `verify_result` as a guard/failure signal rather than a correctness verdict.
### Low-level runner behavior
At the low-level `run_single_model_verification(...)` API, passing a non-empty rubric while leaving `evaluation_mode="template_only"` causes the runner to upgrade internally to `template_and_rubric`. `VerificationConfig` does not do that silently; it validates consistency up front.
### TaskEval behavior
TaskEval infers the mode from what you attach:
- template only -> `template_only`
- rubric only -> `rubric_only`
- both -> `template_and_rubric`

In `rubric_only`, TaskEval can evaluate logs even when no explicit template is attached; internally it creates a minimal synthetic template so the shared pipeline can still run.
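The inference rule above can be sketched in a few lines; this is an illustrative version, not TaskEval's actual code, and the treatment of the "nothing attached" case is an assumption:

```python
def infer_mode(template_attached: bool, rubric_attached: bool) -> str:
    """Illustrative sketch of TaskEval's mode inference."""
    if template_attached and rubric_attached:
        return "template_and_rubric"
    if rubric_attached:
        return "rubric_only"
    # Template only; a bare run with nothing attached is treated the same
    # way here purely for illustration.
    return "template_only"
```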
## 6. Mode Tracking in Results
The `evaluation_mode` used for each verification is stored in `result.metadata.evaluation_mode`. This lets you distinguish results produced under different modes when analyzing a benchmark that mixes evaluation strategies across runs.
## 7. Next Steps
- Templates vs Rubrics: the conceptual distinction between correctness and quality
- Answer Templates: how parsing and `verify()` work
- Rubrics: trait types and selection guidance
- Verification Pipeline: the full stage-by-stage execution model
- Results and Scoring: how to read `VerificationResult`
- VerificationConfig Reference: complete configuration fields