## Reading Paths
Choose the path that matches your goal:
- **New User** — Learn Karenina from the ground up:
  Installation → Quick Start: Benchmark → Core Concepts → Creating Benchmarks → Running Verification → Analyzing Results
- **TaskEval User** — Evaluate existing outputs (agent traces, external text):
  Installation → Quick Start: TaskEval → Evaluating with TaskEval → Answer Templates → Rubrics → Analyzing Results
- **Power User** — Dive into advanced features:
  Evaluation Modes → Verification Pipeline → Prompt Assembly → Pipeline Internals → Adapter Architecture
- **CLI User** — Use Karenina from the command line:
  Installation → Workspace Init → CLI Reference → Configuration Reference
- **Contributor** — Extend Karenina with custom adapters or pipeline stages:
  Adapters → Pipeline Internals → Adapter Architecture → Contributing Guide
## Getting Started
| Section | What You'll Learn |
|---|---|
| Installation | Requirements, install commands, optional dependencies, troubleshooting |
| Quick Start: Benchmark | Hands-on walkthrough from zero to a working benchmark |
| Quick Start: TaskEval | Evaluate pre-recorded outputs (agent traces, external text) |
| Workspace Init | Set up a project directory with karenina init |
## Core Concepts
| Section | What You'll Learn |
|---|---|
| Overview | How all concepts fit together, ordered by pipeline flow |
| Questions & Benchmarks | The central objects: questions bundled with templates, rubrics, and metadata |
| Checkpoints | The JSON-LD benchmark format: questions, templates, rubrics, and metadata |
| Answer Templates | Pydantic models that define how a Judge LLM evaluates correctness |
| Rubrics | Quality assessment with four trait types: LLM, regex, callable, metric |
| Templates vs Rubrics | When to use which evaluation unit, and when to use both together |
| Evaluation Modes | Template-only, template-and-rubric, and rubric-only evaluation |
| Verification Pipeline | The 13-stage engine that executes evaluation end to end |
| Prompt Assembly | How prompts are constructed for pipeline LLM calls |
| Results & Scoring | What verification produces: pass/fail, scores, traits, and metrics |
| Adapters | LLM backend interfaces: LangChain, Claude SDK, OpenRouter, and more |
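To make the Answer Templates row above concrete, here is a minimal sketch of the idea: a Pydantic model whose fields a Judge LLM fills in from a response, plus a check against ground truth. The class name, field, and `verify` method are illustrative assumptions, not Karenina's actual API.

```python
from pydantic import BaseModel


class CapitalAnswer(BaseModel):
    """Hypothetical answer template for 'What is the capital of France?'.

    The Judge LLM populates `capital` from the response being evaluated;
    `verify` then compares it with the expected value.
    """

    capital: str  # city name extracted from the model's response

    def verify(self) -> bool:
        # Pass only if the extracted city matches the ground truth.
        return self.capital.strip().lower() == "paris"
```

A template like this turns free-text grading into structured extraction plus a deterministic comparison, which is what makes template-based verification reproducible.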
## Workflows
| Section | What You'll Learn |
|---|---|
| Configuration | Configuration hierarchy: CLI args, presets, environment variables, defaults |
| Evaluating with TaskEval | Evaluate pre-recorded agent traces against templates and rubrics |
| Creating Benchmarks | Author questions, write templates, define rubrics, and save checkpoints |
| Running Verification | Configure and execute evaluation via Python API or CLI |
| Analyzing Results | Inspect results, build DataFrames, export data, and iterate |
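The Analyzing Results workflow above centers on turning per-question outcomes into a DataFrame for inspection. A hedged sketch, assuming results are available as per-question dicts — the keys and column names here are invented for illustration, not Karenina's actual output schema:

```python
import pandas as pd

# Illustrative per-question verification outcomes.
results = [
    {"question_id": "q1", "passed": True, "score": 1.0},
    {"question_id": "q2", "passed": False, "score": 0.4},
    {"question_id": "q3", "passed": True, "score": 0.9},
]

# Build a DataFrame and compute an aggregate pass rate.
df = pd.DataFrame(results)
pass_rate = df["passed"].mean()  # fraction of questions that passed
```

From here, standard pandas operations (grouping, filtering, export to CSV) support the inspect-and-iterate loop the workflow describes.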
## Reference
| Section | What You'll Learn |
|---|---|
| CLI Reference | Complete documentation for all CLI commands |
| Configuration Reference | Exhaustive tables for all configuration options |
## Advanced
| Section | What You'll Learn |
|---|---|
| Pipeline Internals | The 13-stage verification pipeline, deep judgment, and prompt assembly |
| Adapter Architecture | Ports and adapters pattern, custom adapter creation, MCP deep dive |
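The ports-and-adapters pattern named above can be sketched in a few lines: the pipeline depends on an abstract port, and each LLM backend supplies an adapter implementing it. The names (`LLMPort`, `EchoAdapter`, `run`) are hypothetical, not Karenina's actual interfaces.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMPort(Protocol):
    """Port: the only surface the pipeline is allowed to depend on."""

    def generate(self, prompt: str) -> str: ...


class EchoAdapter:
    """Trivial adapter: returns the prompt unchanged (handy in tests)."""

    def generate(self, prompt: str) -> str:
        return prompt


def run(port: LLMPort, prompt: str) -> str:
    # Pipeline code calls the port; which backend answers is invisible here.
    return port.generate(prompt)
```

Swapping backends then means writing one new adapter class; no pipeline code changes, which is the point of the pattern.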
## Contributing
| Section | What You'll Learn |
|---|---|
| Contributing Guide | How to create adapters, extend the pipeline, and contribute |