## Reading Paths
Choose the path that matches your goal:
- **New User** — Learn Karenina from the ground up:
  Installation → Quick Start: Benchmark → Core Concepts → Creating Benchmarks → Running Verification → Analyzing Results
- **TaskEval User** — Evaluate existing outputs (agent traces, external text):
  Installation → Quick Start: TaskEval → Evaluating with TaskEval → Answer Templates → Rubrics → Analyzing Results
- **Power User** — Dive into advanced features:
  Evaluation Modes → Verification Pipeline → Prompt Assembly → Pipeline Internals → Adapter Architecture
- **CLI User** — Use Karenina from the command line:
  Installation → Workspace Init → CLI Reference → Configuration Reference
- **Contributor** — Extend Karenina with custom adapters or pipeline stages:
  Adapters → Pipeline Internals → Adapter Architecture → Contributing Guide
## Getting Started
| Section | What You'll Learn |
|---|---|
| Installation | Requirements, install commands, optional dependencies, troubleshooting |
| Quick Start: Benchmark | Hands-on walkthrough from zero to a working benchmark |
| Quick Start: TaskEval | Evaluate pre-recorded outputs (agent traces, external text) |
| Workspace Init | Set up a project directory with karenina init |
## Core Concepts
| Section | What You'll Learn |
|---|---|
| Overview | How all concepts fit together, ordered by pipeline flow |
| Questions & Benchmarks | The central objects: questions bundled with templates, rubrics, and metadata |
| Checkpoints | The JSON-LD benchmark format: questions, templates, rubrics, and metadata |
| Answer Templates | Pydantic models that define how a Judge LLM evaluates correctness |
| Rubrics | Quality assessment with four trait types: LLM, regex, callable, metric |
| Templates vs Rubrics | When to use which evaluation unit, and when to use both together |
| Evaluation Modes | Template-only, template-and-rubric, and rubric-only evaluation |
| Verification Pipeline | The 13-stage engine that executes evaluation end to end |
| Prompt Assembly | How prompts are constructed for pipeline LLM calls |
| Results & Scoring | What verification produces: pass/fail, scores, traits, and metrics |
| Adapters | LLM backend interfaces: LangChain, Claude SDK, OpenRouter, and more |
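To make the Answer Templates row above concrete, here is a minimal sketch of the idea: a Pydantic model whose fields a Judge LLM fills in from a response, plus a check against ground truth. The class name, field, and `verify` method are illustrative assumptions, not Karenina's actual API.

```python
from pydantic import BaseModel


class CapitalAnswer(BaseModel):
    """Hypothetical answer template for 'What is the capital of France?'.

    The Judge LLM populates `capital` from the response being evaluated;
    `verify` then compares it with the expected value.
    """

    capital: str  # city name extracted from the model's response

    def verify(self) -> bool:
        # Pass only if the extracted city matches the ground truth.
        return self.capital.strip().lower() == "paris"
```

A template like this turns free-text grading into structured extraction plus a deterministic comparison, which is what makes template-based verification reproducible.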
## Workflows
| Section | What You'll Learn |
|---|---|
| Configuration | Configuration hierarchy: CLI args, presets, environment variables, defaults |
| Evaluating with TaskEval | Evaluate pre-recorded agent traces against templates and rubrics |
| Creating Benchmarks | Author questions, write templates, define rubrics, and save checkpoints |
| Running Verification | Configure and execute evaluation via Python API or CLI |
| Analyzing Results | Inspect results, build DataFrames, export data, and iterate |
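The Analyzing Results workflow above centers on turning per-question outcomes into a DataFrame for inspection. A hedged sketch, assuming results are available as per-question dicts — the keys and column names here are invented for illustration, not Karenina's actual output schema:

```python
import pandas as pd

# Illustrative per-question verification outcomes.
results = [
    {"question_id": "q1", "passed": True, "score": 1.0},
    {"question_id": "q2", "passed": False, "score": 0.4},
    {"question_id": "q3", "passed": True, "score": 0.9},
]

# Build a DataFrame and compute an aggregate pass rate.
df = pd.DataFrame(results)
pass_rate = df["passed"].mean()  # fraction of questions that passed
```

From here, standard pandas operations (grouping, filtering, export to CSV) support the inspect-and-iterate loop the workflow describes.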
## Reference
| Section | What You'll Learn |
|---|---|
| CLI Reference | Complete documentation for all CLI commands |
| Configuration Reference | Exhaustive tables for all configuration options |
## Advanced
| Section | What You'll Learn |
|---|---|
| Pipeline Internals | The 13-stage verification pipeline, deep judgment, and prompt assembly |
| Adapter Architecture | Ports and adapters pattern, custom adapter creation, MCP deep dive |
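The ports-and-adapters pattern named above can be sketched in a few lines: the pipeline depends on an abstract port, and each LLM backend supplies an adapter implementing it. The names (`LLMPort`, `EchoAdapter`, `run`) are hypothetical, not Karenina's actual interfaces.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMPort(Protocol):
    """Port: the only surface the pipeline is allowed to depend on."""

    def generate(self, prompt: str) -> str: ...


class EchoAdapter:
    """Trivial adapter: returns the prompt unchanged (handy in tests)."""

    def generate(self, prompt: str) -> str:
        return prompt


def run(port: LLMPort, prompt: str) -> str:
    # Pipeline code calls the port; which backend answers is invisible here.
    return port.generate(prompt)
```

Swapping backends then means writing one new adapter class; no pipeline code changes, which is the point of the pattern.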
## Contributing
| Section | What You'll Learn |
|---|---|
| Contributing Guide | How to create adapters, extend the pipeline, and contribute |