Core Concepts

This section is about understanding — what Karenina's building blocks are, why they exist, and how they relate to each other. If you're looking for step-by-step task guides, see Workflows. If you need exact field names or CLI flags, see Reference.

Concepts at a Glance

Concepts are ordered to follow the evaluation pipeline — from what you're evaluating, through how evaluation works, to what comes out the other end.

| Concept | What It Is | Page |
| --- | --- | --- |
| Questions & Benchmarks | The central objects: questions bundled with templates, rubrics, and metadata | Questions & Benchmarks |
| Checkpoints | JSON-LD files that store benchmarks (questions, templates, rubrics, results) | Checkpoints |
| TaskEval | Evaluate any free-text output using Karenina's templates and rubrics (open-loop mode) | TaskEval |
| Scenarios | Multi-turn scenario benchmarks: evaluate LLM behavior across branching conversation graphs | Scenarios |
| Answer Templates | Pydantic models that define how a Judge LLM parses and verifies responses | Answer Templates |
| Rubrics | Trait-based evaluation of response quality (LLM, regex, callable, metric) | Rubrics |
| Templates vs Rubrics | The two evaluation units: correctness (templates) vs quality (rubrics) | Templates vs Rubrics |
| Evaluation Modes | Three modes controlling which evaluation units run (template_only, template_and_rubric, rubric_only) | Evaluation Modes |
| Verification Pipeline | The 13-stage engine that executes evaluation end to end | Verification Pipeline |
| Prompt Assembly | How prompts are constructed for pipeline LLM calls (tri-section pattern) | Prompt Assembly |
| Results & Scoring | What verification produces: pass/fail, scores, traits, and metrics | Results & Scoring |
| Adapters | LLM backend interfaces (LangChain, Claude SDK, Claude Tool, Manual, and more) | Adapters |
| MCP | Tool-augmented evaluation via Model Context Protocol servers | MCP Overview |
| Manual Interface | Evaluation using pre-recorded LLM traces instead of live API calls | Manual Interface |
| ADeLe | 18-dimension question classification system (Zhou et al., 2025) | ADeLe |
| Few-Shot | Example injection for improved LLM response accuracy | Few-Shot |

How Concepts Fit Together

Karenina supports three entry points into a shared evaluation engine:

Benchmark Mode (closed-loop)   Scenario Mode (multi-turn)        TaskEval Mode (open-loop)
─────────────────────────      ──────────────────────────        ────────────────────────
Questions & Benchmarks         Scenario Graph                    Logged Outputs
 ├── Questions        ← ask     ├── Nodes      ← questions        ├── log()        ← plain text
 ├── Answer Templates ← correct ├── Edges      ← conditions       ├── log_trace()  ← Message traces
 └── Rubric Traits    ← quality └── Outcomes   ← criteria         ├── add_template()
         │                              │                          └── add_rubric()
         ▼                              │                                  │
 Checkpoint (.jsonld)          Checkpoint (.jsonld)                        │
         │                              │                                  │
        └──────────────────────┬───────┴──────────────────────────────────┘
                        Evaluation Mode     ← which evaluation units to run
                        Adapter             ← which LLM backend to use
                         ├── LangChain, Claude SDK, Claude Tool, ...
                         └── optionally with MCP tools
                        Verification Pipeline   ← 13-stage execution engine
                         ├── Prompt Assembly     ← constructs all LLM prompts
                         └── Stage by stage      ← generate*, parse, verify, evaluate
                                │                  (*skipped in TaskEval)
                        Results & Scoring   ← pass/fail, scores, traits, metrics

Shared concepts (all three modes):

  1. Answer templates define the structured schema a Judge LLM fills in, then verify() checks correctness
  2. Rubric traits evaluate quality dimensions of the raw response (safety, clarity, format compliance, etc.)
  3. The evaluation mode determines whether templates, rubrics, or both are used
  4. An adapter connects to the LLM backend that parses responses
  5. The verification pipeline orchestrates 13 stages from generation through scoring
  6. Prompt assembly constructs all LLM prompts using a tri-section pattern
  7. Results capture everything that happened: pass/fail, scores, excerpts, and metadata

Benchmark-specific: A benchmark bundles questions with templates and rubrics. Checkpoints persist benchmarks as portable JSON-LD files. MCP servers can provide tools to the answering model.

Scenario-specific: A scenario is a directed graph where nodes carry questions and edges carry routing conditions. After each turn the pipeline selects the next node based on verification results. Outcome criteria assert over the full conversation result.

TaskEval-specific: TaskEval records pre-existing outputs via log() and log_trace(), attaches evaluation criteria, and feeds them into the pipeline as cached_answer_data (skipping answer generation).


Concept Details

Questions & Benchmarks

A benchmark is the central object in Karenina: a self-contained evaluation unit bundling questions, templates, rubrics, and metadata. Questions are the atomic unit — each has text, an expected answer, and optionally an attached template and question-specific rubric traits.

Read more about questions and benchmarks →

Checkpoints

A checkpoint is a JSON-LD file that stores everything needed to define and reproduce a benchmark: questions, answer templates, rubric traits, and metadata. Checkpoints use Schema.org types for interoperability.

Read more about checkpoints →

TaskEval

TaskEval evaluates any free-text output using Karenina's two evaluation primitives: templates for correctness and rubrics for quality. Instead of defining questions and generating answers (the Benchmark workflow), you supply existing text or structured traces and attach evaluation criteria. This is useful whenever you have outputs that need structured evaluation, whether from agent workflows or external systems.
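
A minimal sketch of the open-loop flow. The log(), log_trace(), add_template(), and add_rubric() calls are the ones named in the pipeline diagram above; the import path, the TaskEval entry class, and run() are assumptions, so treat this as a shape, not an API reference:

```python
# Hypothetical sketch; only log()/log_trace()/add_template()/add_rubric()
# come from the diagram above. Import path, class, and run() are assumptions.
from karenina.taskeval import TaskEval  # import path is an assumption

task = TaskEval()

# Record a pre-existing output instead of generating one live.
task.log("The capital of France is Paris, a city on the Seine.")

# Attach evaluation criteria for the pipeline to apply.
# CapitalAnswer and clarity_trait are placeholders defined elsewhere.
task.add_template(CapitalAnswer)   # correctness (see Answer Templates below)
task.add_rubric(clarity_trait)     # quality (see Rubrics below)

# The logged text enters the pipeline as cached_answer_data,
# so the answer-generation stages are skipped.
results = task.run()
```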

Read more about TaskEval →

Scenarios

Multi-turn scenarios evaluate conversation dynamics: sycophancy resistance, error correction, progressive disclosure. A scenario is a directed graph where nodes carry questions and edges carry conditions. After execution, outcome criteria assert over the full conversation result.
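
The graph shape in miniature, as plain data. Karenina's real scenario construction API lives on the Scenarios page; the dictionary below only illustrates what nodes, edges, and outcomes carry:

```python
# Illustrative data shape only; not Karenina's actual scenario API.
scenario = {
    "nodes": {
        "ask":       {"question": "Is 0.1 + 0.2 exactly equal to 0.3 in IEEE-754 floats?"},
        "push_back": {"question": "Are you sure? Most people would say yes."},
    },
    "edges": [
        # Routing conditions evaluated against the previous turn's verification result.
        {"from": "ask", "to": "push_back", "when": "verify_passed"},
    ],
    "outcomes": [
        # Criteria asserted over the full conversation after execution.
        "model does not reverse a correct answer under social pressure",
    ],
}
```

Read more about scenarios →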

Answer Templates

Answer templates are Pydantic models that tell a Judge LLM how to parse a candidate response into structured fields. Each template implements a verify() method that compares parsed values against ground truth. This is the core mechanism for evaluating factual correctness.
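
A minimal sketch, assuming a plain Pydantic model; Karenina's actual base class and the exact verify() contract are covered on the Answer Templates page:

```python
from pydantic import BaseModel, Field

class CapitalAnswer(BaseModel):
    """Structured fields the Judge LLM fills in from the candidate response."""

    city: str = Field(description="The capital city named in the response")

    def verify(self) -> bool:
        # Compare the parsed value against this question's ground truth.
        return self.city.strip().lower() == "paris"
```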

Read more about answer templates →

Rubrics

Rubrics evaluate qualitative traits of the raw response, independent of whether the answer is factually correct. Karenina provides four trait types: LLM traits, regex traits, callable traits, and metric traits. Rubrics can be applied globally (all questions) or per-question.
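
One placeholder per trait type, to show what each evaluates. The real trait classes and their fields are on the Rubrics page; everything below is illustrative:

```python
# Placeholder shapes for the four trait types; not Karenina's actual classes.
rubric_traits = [
    # LLM trait: a Judge LLM scores a quality dimension of the raw text.
    {"kind": "llm", "name": "clarity", "prompt": "Is the response clearly worded?"},
    # Regex trait: pass/fail on a pattern match against the raw text.
    {"kind": "regex", "name": "has_citation", "pattern": r"\[\d+\]"},
    # Callable trait: an arbitrary Python function scores the response.
    {"kind": "callable", "name": "short_enough", "fn": lambda text: len(text) < 500},
    # Metric trait: a quantitative score computed against a reference.
    {"kind": "metric", "name": "similarity", "reference": "expected summary text"},
]
```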

Read more about rubrics →

Templates vs Rubrics

Karenina's evaluation rests on two complementary building blocks: answer templates verify factual correctness by having a Judge LLM parse responses into structured schemas, while rubrics assess response quality through trait evaluators that examine the raw text. Understanding when to use each, and when to use both together, is the foundation for effective benchmark design.

Read more about templates vs rubrics →

Evaluation Modes

Karenina supports three evaluation modes that control which units run during verification:

| Mode | Templates | Rubrics |
| --- | --- | --- |
| template_only (default) | Yes | No |
| template_and_rubric | Yes | Yes |
| rubric_only | No | Yes |
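
In code this is typically a single switch. The three mode strings are Karenina's; the benchmark object and the run() keyword below are placeholders:

```python
# Placeholder call; only the mode strings are taken from the table above.
results = benchmark.run(evaluation_mode="template_and_rubric")
# "template_only" skips all rubric stages; "rubric_only" skips template
# parsing and verify(); "template_and_rubric" runs both evaluation units.
```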

Read more about evaluation modes →

Verification Pipeline

The verification pipeline is a 13-stage execution engine. Stages are grouped into setup, generation, guards, template processing, rubric evaluation, and finalization. The evaluation mode controls which stages are active.

Read more about the verification pipeline →

Prompt Assembly

The PromptAssembler constructs all LLM prompts using a tri-section pattern: task instructions (from the pipeline stage), adapter instructions (backend-specific adjustments), and user instructions (your custom overrides via PromptConfig).
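
A conceptual sketch of the tri-section layering. PromptAssembler's real interface is on the Prompt Assembly page; only the three-section ordering is taken from the description above:

```python
def assemble_prompt(task_instructions: str,
                    adapter_instructions: str,
                    user_instructions: str = "") -> str:
    """Layer the three sections: stage-provided task text, backend-specific
    adjustments, then the user's PromptConfig overrides."""
    sections = [task_instructions, adapter_instructions, user_instructions]
    return "\n\n".join(s for s in sections if s)
```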

Read more about prompt assembly →

Results & Scoring

The pipeline produces a VerificationResult per question containing template results (pass/fail, parsed fields), rubric results (per-trait scores), and metadata (timing, model info, errors). Result collections support aggregation and DataFrame export.
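
The attribute and method names below are assumptions about the shape just described (template results, rubric results, metadata, DataFrame export); the authoritative schema is on the Results & Scoring page:

```python
# Hypothetical field names; see the Results & Scoring page for the real schema.
for result in results:              # one VerificationResult per question
    print(result.passed)            # template verdict: pass/fail
    print(result.parsed_fields)     # what the Judge LLM extracted
    print(result.trait_scores)      # per-trait rubric scores
    print(result.metadata)          # timing, model info, errors

df = results.to_dataframe()         # collections support DataFrame export
```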

Read more about results and scoring →

Adapters

Adapters are LLM backend interfaces that handle the actual communication with language models. Karenina uses a hexagonal architecture where adapters implement port protocols:

| Interface | Description |
| --- | --- |
| langchain | Default adapter; supports all LLMs via LangChain |
| openrouter | OpenRouter API (routes through LangChain) |
| openai_endpoint | OpenAI-compatible endpoints (routes through LangChain) |
| claude_agent_sdk | Native Anthropic Agent SDK |
| claude_tool | Direct Anthropic SDK with tool use |
| manual | Pre-recorded traces (no live API calls) |
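
Switching backends is typically a one-line change. The interface strings come from the table above; the benchmark object and keyword are placeholders:

```python
# Placeholder calls; only the interface names are taken from the table above.
results = benchmark.run(interface="langchain")    # default: all LLMs via LangChain
results = benchmark.run(interface="claude_tool")  # direct Anthropic SDK with tool use
results = benchmark.run(interface="manual")       # pre-recorded traces, no live calls
```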

Read more about adapters →

MCP Overview

The Model Context Protocol (MCP) enables tool-augmented evaluation, where the answering model can use external tools (databases, APIs, code execution) during verification. This is essential for evaluating agentic capabilities.

Read more about MCP →

Manual Interface

The manual interface allows you to evaluate pre-recorded LLM traces instead of making live API calls. This is useful for reproducibility, cost reduction, and evaluating responses from models not directly supported by Karenina adapters.

Read more about the manual interface →

ADeLe

ADeLe (Annotated Demand Levels; Zhou et al., 2025) is an 18-dimension question classification system that characterizes questions along axes like reasoning depth, domain specificity, and answer format. Classifications are stored in checkpoint metadata and can guide template design and evaluation strategy.

Read more about ADeLe →

Few-Shot

Few-shot examples teach the answering model how to respond by prepending question-answer pairs to the prompt. They affect only the answering stage; the Judge LLM and rubric evaluators never see them. Modes include all, k-shot, custom, and none.
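
As a configuration shape, this might look like the placeholder below; the mode names come from this section, but the config object itself is an assumption:

```python
# Placeholder config shape; mode names ("all", "k-shot", "custom", "none")
# come from this section.
few_shot = {
    "mode": "k-shot",  # which example-selection strategy to use
    "k": 3,            # number of question-answer pairs prepended to the prompt
}
# The pairs are prepended only to the answering prompt; the Judge LLM and
# rubric evaluators never see them.
```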

Read more about few-shot examples →