Core Concepts¶
This section is about understanding — what Karenina's building blocks are, why they exist, and how they relate to each other. If you're looking for step-by-step task guides, see Workflows. If you need exact field names or CLI flags, see Reference.
Concepts at a Glance¶
Concepts are ordered to follow the evaluation pipeline — from what you're evaluating, through how evaluation works, to what comes out the other end.
| Concept | What It Is | Page |
|---|---|---|
| Questions & Benchmarks | The central objects: questions bundled with templates, rubrics, and metadata | Questions & Benchmarks |
| Checkpoints | JSON-LD files that store benchmarks (questions, templates, rubrics, results) | Checkpoints |
| TaskEval | Evaluate any free-text output using Karenina's templates and rubrics (open-loop mode) | TaskEval |
| Scenarios | Multi-turn scenario benchmarks: evaluate LLM behavior across branching conversation graphs | Scenarios |
| Answer Templates | Pydantic models that define how a Judge LLM parses and verifies responses | Answer Templates |
| Rubrics | Trait-based evaluation of response quality (LLM, regex, callable, metric) | Rubrics |
| Templates vs Rubrics | The two evaluation units: correctness (templates) vs quality (rubrics) | Templates vs Rubrics |
| Evaluation Modes | Three modes controlling which evaluation units run (template_only, template_and_rubric, rubric_only) | Evaluation Modes |
| Verification Pipeline | The 13-stage engine that executes evaluation end to end | Verification Pipeline |
| Prompt Assembly | How prompts are constructed for pipeline LLM calls (tri-section pattern) | Prompt Assembly |
| Results & Scoring | What verification produces: pass/fail, scores, traits, and metrics | Results & Scoring |
| Adapters | LLM backend interfaces (LangChain, Claude SDK, Claude Tool, Manual, and more) | Adapters |
| MCP | Tool-augmented evaluation via Model Context Protocol servers | MCP Overview |
| Manual Interface | Evaluation using pre-recorded LLM traces instead of live API calls | Manual Interface |
| ADeLe | 18-dimension question classification system (Zhou et al., 2025) | ADeLe |
| Few-Shot | Example injection for improved LLM response accuracy | Few-Shot |
How Concepts Fit Together¶
Karenina supports three entry points into a shared evaluation engine:
Benchmark Mode (closed-loop) Scenario Mode (multi-turn) TaskEval Mode (open-loop)
───────────────────────── ────────────────────────── ────────────────────────
Questions & Benchmarks Scenario Graph Logged Outputs
├── Questions ← ask ├── Nodes ← questions ├── log() ← plain text
├── Answer Templates ← correct ├── Edges ← conditions ├── log_trace() ← Message traces
└── Rubric Traits ← quality └── Outcomes ← criteria ├── add_template()
│ │ └── add_rubric()
▼ │ │
Checkpoint (.jsonld) Checkpoint (.jsonld) │
│ │ │
└──────────────────────┬───────┴──────────────────────────────────┘
▼
Evaluation Mode ← which evaluation units to run
│
▼
Adapter ← which LLM backend to use
├── LangChain, Claude SDK, Claude Tool, ...
└── optionally with MCP tools
│
▼
Verification Pipeline ← 13-stage execution engine
├── Prompt Assembly ← constructs all LLM prompts
└── Stage by stage ← generate*, parse, verify, evaluate
│ (*skipped in TaskEval)
▼
Results & Scoring ← pass/fail, scores, traits, metrics
Shared concepts (all modes):
- Answer templates define the structured schema a Judge LLM fills in, then verify() checks correctness
- Rubric traits evaluate quality dimensions of the raw response (safety, clarity, format compliance, etc.)
- The evaluation mode determines whether templates, rubrics, or both are used
- An adapter connects to the LLM backend that parses responses
- The verification pipeline orchestrates 13 stages from generation through scoring
- Prompt assembly constructs all LLM prompts using a tri-section pattern
- Results capture everything that happened: pass/fail, scores, excerpts, and metadata
Benchmark-specific: A benchmark bundles questions with templates and rubrics. Checkpoints persist benchmarks as portable JSON-LD files. MCP servers can provide tools to the answering model.
Scenario-specific: A scenario is a directed graph where nodes carry questions and edges carry routing conditions. After each turn the pipeline selects the next node based on verification results. Outcome criteria assert over the full conversation result.
TaskEval-specific: TaskEval records pre-existing outputs via log() and log_trace(), attaches evaluation criteria, and feeds them into the pipeline as cached_answer_data (skipping answer generation).
Concept Details¶
Questions & Benchmarks¶
A benchmark is the central object in Karenina: a self-contained evaluation unit bundling questions, templates, rubrics, and metadata. Questions are the atomic unit — each has text, an expected answer, and optionally an attached template and question-specific rubric traits.
Read more about questions and benchmarks →
Checkpoints¶
A checkpoint is a JSON-LD file that stores everything needed to define and reproduce a benchmark: questions, answer templates, rubric traits, and metadata. Checkpoints use Schema.org types for interoperability.
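To make the JSON-LD idea concrete, here is a hedged sketch of what a minimal checkpoint fragment could look like. The field names and types below are hypothetical illustrations of "Schema.org types in JSON-LD"; they are not Karenina's actual checkpoint schema.

```python
import json

# Hypothetical minimal checkpoint fragment. Karenina checkpoints are
# JSON-LD files using Schema.org types; the specific fields here are
# illustrative, not the real schema.
checkpoint = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "demo-benchmark",
  "hasPart": [
    {"@type": "Question",
     "text": "What is the capital of France?",
     "acceptedAnswer": {"@type": "Answer", "text": "Paris"}}
  ]
}
""")

first_question = checkpoint["hasPart"][0]["text"]
```

Because the format is plain JSON-LD, a checkpoint can be inspected or diffed with ordinary JSON tooling.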
TaskEval¶
TaskEval evaluates any free-text output using Karenina's two evaluation primitives: templates for correctness and rubrics for quality. Instead of defining questions and generating answers (the Benchmark workflow), you supply existing text or structured traces and attach evaluation criteria. This is useful whenever you have outputs that need structured evaluation, whether from agent workflows or external systems.
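The open-loop flow can be sketched with a toy stand-in. The method name log() echoes the text above, but this class, its signatures, and the callable-based check are illustrative only, not Karenina's TaskEval API.

```python
# Toy stand-in for TaskEval's open-loop flow: outputs are logged rather
# than generated, then fed straight to evaluation.
class TinyTaskEval:
    def __init__(self) -> None:
        self.outputs: list[str] = []

    def log(self, text: str) -> None:
        # Record a pre-existing output (no LLM call happens here).
        self.outputs.append(text)

    def evaluate(self, check) -> list[bool]:
        # Answer generation is skipped; each logged output goes
        # directly to the evaluation criteria.
        return [check(text) for text in self.outputs]

te = TinyTaskEval()
te.log("The capital of France is Paris.")
results = te.evaluate(lambda text: "Paris" in text)
```

The key property the sketch preserves is that nothing is generated: evaluation runs over whatever was logged, which is why the pipeline treats it as cached answer data.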
Scenarios¶
Multi-turn scenarios evaluate conversation dynamics: sycophancy resistance, error correction, progressive disclosure. A scenario is a directed graph where nodes carry questions and edges carry conditions. After execution, outcome criteria assert over the full conversation result. See Scenarios.
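The graph structure can be illustrated with a small dictionary. The node names, questions, and routing helper below are made up for the sketch; they show the nodes-carry-questions, edges-carry-conditions idea, not Karenina's actual data model.

```python
# Toy scenario graph: nodes carry questions, edges route on the
# previous turn's verification outcome.
graph = {
    "start": {"question": "Is 2 + 2 = 5?",
              "edges": {"passed": "end", "failed": "pushback"}},
    "pushback": {"question": "Are you sure? I read that 2 + 2 = 5.",
                 "edges": {"passed": "end", "failed": "end"}},
    "end": {"question": None, "edges": {}},
}

def next_node(current: str, outcome: str) -> str:
    # After each turn, the pipeline selects the next node based on
    # the verification result of the turn just completed.
    return graph[current]["edges"].get(outcome, "end")
```

A sycophancy probe like this one branches to a pushback turn only when the model initially fails, which is exactly the kind of conversation dynamic outcome criteria then assert over.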
Answer Templates¶
Answer templates are Pydantic models that tell a Judge LLM how to parse a candidate response into structured fields. Each template implements a verify() method that compares parsed values against ground truth. This is the core mechanism for evaluating factual correctness.
Read more about answer templates →
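A conceptual sketch of the template pattern follows. Real Karenina templates are Pydantic models; a dataclass stands in here so the example is self-contained, and the field and ground-truth comparison are invented for illustration.

```python
from dataclasses import dataclass

# Sketch of an answer template: fields the Judge LLM fills in, plus a
# verify() method comparing parsed values against ground truth.
@dataclass
class CapitalAnswer:
    city: str  # field the Judge LLM extracts from the raw response

    def verify(self) -> bool:
        # Compare the parsed value against the expected answer.
        return self.city.strip().lower() == "paris"

# In practice a Judge LLM populates the fields; here we do it by hand.
parsed = CapitalAnswer(city=" Paris ")
is_correct = parsed.verify()
```

The separation matters: parsing (filling the fields) is an LLM task, while verification is deterministic code, which keeps the correctness check reproducible.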
Rubrics¶
Rubrics evaluate qualitative traits of the raw response, independent of whether the answer is factually correct. Karenina provides four trait types: LLM traits, regex traits, callable traits, and metric traits. Rubrics can be applied globally (all questions) or per-question.
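Two of the non-LLM trait types can be sketched directly. The trait names and checks below are made up for illustration; they show the idea of regex and callable traits operating on raw text, not Karenina's rubric API.

```python
import re

response = "Answer: 42. Sources: [1], [2]."

traits = {
    # regex trait: does the response cite numbered sources?
    "cites_sources": bool(re.search(r"\[\d+\]", response)),
    # callable trait: is the response reasonably concise?
    "concise": len(response) <= 200,
}
```

Note that neither trait knows or cares whether 42 is the factually correct answer; that separation is exactly what distinguishes rubrics from templates.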
Templates vs Rubrics¶
Karenina's evaluation rests on two complementary building blocks: answer templates verify factual correctness by having a Judge LLM parse responses into structured schemas, while rubrics assess response quality through trait evaluators that examine the raw text. Understanding when to use each, and when to use both together, is the foundation for effective benchmark design.
Read more about templates vs rubrics →
Evaluation Modes¶
Karenina supports three evaluation modes that control which units run during verification:
| Mode | Templates | Rubrics |
|---|---|---|
| template_only (default) | Yes | No |
| template_and_rubric | Yes | Yes |
| rubric_only | No | Yes |
Read more about evaluation modes →
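The gating behavior amounts to a simple dispatch. The mode names come from the table above; the helper function itself is illustrative and not part of Karenina's API.

```python
# Sketch of how an evaluation mode gates the two evaluation units.
def units_for(mode: str) -> tuple[bool, bool]:
    run_templates = mode in ("template_only", "template_and_rubric")
    run_rubrics = mode in ("template_and_rubric", "rubric_only")
    return run_templates, run_rubrics
```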
Verification Pipeline¶
The verification pipeline is a 13-stage execution engine. Stages are grouped into setup, generation, guards, template processing, rubric evaluation, and finalization. The evaluation mode controls which stages are active.
Read more about the verification pipeline →
Prompt Assembly¶
The PromptAssembler constructs all LLM prompts using a tri-section pattern: task instructions (from the pipeline stage), adapter instructions (backend-specific adjustments), and user instructions (your custom overrides via PromptConfig).
Read more about prompt assembly →
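The tri-section ordering can be shown in a few lines. The real PromptAssembler differs; this function, its signature, and the instruction strings are illustrative only.

```python
# Minimal sketch of the tri-section pattern: task instructions first,
# then adapter instructions, then user overrides, joined in order.
def assemble(task: str, adapter: str = "", user: str = "") -> str:
    sections = [part for part in (task, adapter, user) if part]
    return "\n\n".join(sections)

prompt = assemble(
    task="Parse the response into the answer template's fields.",
    adapter="Return strictly valid JSON.",
    user="Leave a field null if the response does not state it.",
)
```

Keeping the three sections distinct means a user override can be layered on without touching the stage-specific or backend-specific text.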
Results & Scoring¶
The pipeline produces a VerificationResult per question containing template results (pass/fail, parsed fields), rubric results (per-trait scores), and metadata (timing, model info, errors). Result collections support aggregation and DataFrame export.
Read more about results and scoring →
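A toy aggregation over per-question results gives the flavor. The field names below are assumptions standing in for the real VerificationResult, which carries more metadata (timing, model info, errors).

```python
from dataclasses import dataclass, field

# Illustrative stand-in for a per-question verification result.
@dataclass
class Result:
    question_id: str
    passed: bool
    trait_scores: dict[str, int] = field(default_factory=dict)

results = [
    Result("q1", True, {"clarity": 5}),
    Result("q2", False, {"clarity": 3}),
    Result("q3", True, {"clarity": 4}),
]

# The kind of aggregation result collections support.
pass_rate = sum(r.passed for r in results) / len(results)
mean_clarity = sum(r.trait_scores["clarity"] for r in results) / len(results)
```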
Adapters¶
Adapters are LLM backend interfaces that handle the actual communication with language models. Karenina uses a hexagonal architecture where adapters implement port protocols:
| Interface | Description |
|---|---|
| langchain | Default adapter — supports all LLMs via LangChain |
| openrouter | OpenRouter API (routes through LangChain) |
| openai_endpoint | OpenAI-compatible endpoints (routes through LangChain) |
| claude_agent_sdk | Native Anthropic Agent SDK |
| claude_tool | Direct Anthropic SDK with tool use |
| manual | Pre-recorded traces (no live API calls) |
MCP Overview¶
The Model Context Protocol (MCP) enables tool-augmented evaluation, where the answering model can use external tools (databases, APIs, code execution) during verification. This is essential for evaluating agentic capabilities.
Manual Interface¶
The manual interface allows you to evaluate pre-recorded LLM traces instead of making live API calls. This is useful for reproducibility, cost reduction, and evaluating responses from models not directly supported by Karenina adapters.
Read more about the manual interface →
ADeLe¶
ADeLe (Annotated Demand Levels; Zhou et al., 2025) is an 18-dimension question classification system that characterizes questions along axes like reasoning depth, domain specificity, and answer format. Classifications are stored in checkpoint metadata and can guide template design and evaluation strategy.
Few-Shot¶
Few-shot examples teach the answering model how to respond by prepending question-answer pairs to the prompt. They affect only the answering stage; the Judge LLM and rubric evaluators never see them. Modes include all, k-shot, custom, and none.
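The prepending mechanic can be sketched as follows. The prompt format, example pairs, and helper name are all illustrative; only the idea that k pairs are prepended to the answering prompt comes from the text above.

```python
# Sketch of few-shot injection: k example Q/A pairs are prepended to
# the answering prompt only (the Judge LLM never sees them).
examples = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]

def build_answering_prompt(question: str, k: int) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:k])
    tail = f"Q: {question}\nA:"
    return f"{shots}\n\n{tail}" if shots else tail

prompt = build_answering_prompt("Capital of Italy?", k=2)
```

Setting k to 0 corresponds to the none mode, while k equal to the full example list corresponds to the all mode.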