
Reading Paths

Choose the path that matches your goal:

New User — Learn Karenina from the ground up:

Installation → Quick Start → Core Concepts → Creating Benchmarks → Running Verification → Analyzing Results

TaskEval User — Evaluate existing outputs (agent traces, external text):

Installation → TaskEval → TaskEval Workflow → Answer Templates → Rubrics → Analyzing Results

Power User — Dive into advanced features:

Core Concepts → Pipeline Internals → Adapter Architecture

CLI User — Use Karenina from the command line:

Installation → Configuration → CLI Reference

Contributor — Extend Karenina with custom adapters or pipeline stages:

Adapter Architecture → Contributing


Getting Started

Installation — Requirements, install commands, optional dependencies, troubleshooting
Quick Start: Benchmark — Hands-on walkthrough from zero to a working benchmark
Quick Start: TaskEval — Evaluate pre-recorded outputs (agent traces, external text)
Workspace Init — Set up a project directory with karenina init

Core Concepts

Overview — How all concepts fit together, ordered by pipeline flow
Questions & Benchmarks — The central objects: questions bundled with templates, rubrics, and metadata
Checkpoints — The JSON-LD benchmark format: questions, templates, rubrics, and metadata
Answer Templates — Pydantic models that define how a Judge LLM evaluates correctness
Rubrics — Quality assessment with four trait types: LLM, regex, callable, metric
Templates vs Rubrics — When to use which evaluation unit, and when to use both together
Evaluation Modes — Template-only, template-and-rubric, and rubric-only evaluation
Verification Pipeline — The 13-stage engine that executes evaluation end to end
Prompt Assembly — How prompts are constructed for pipeline LLM calls
Results & Scoring — What verification produces: pass/fail, scores, traits, and metrics
Adapters — LLM backend interfaces: LangChain, Claude SDK, OpenRouter, and more

Workflows

Configuration — Configuration hierarchy: CLI args, presets, environment variables, defaults
Evaluating with TaskEval — Evaluate pre-recorded agent traces against templates and rubrics
Creating Benchmarks — Author questions, write templates, define rubrics, and save checkpoints
Running Verification — Configure and execute evaluation via Python API or CLI
Analyzing Results — Inspect results, build DataFrames, export data, and iterate

Reference

CLI Reference — Complete documentation for all CLI commands
Configuration Reference — Exhaustive tables for all configuration options

Advanced

Pipeline Internals — The 13-stage verification pipeline, deep judgment, and prompt assembly
Adapter Architecture — Ports and adapters pattern, custom adapter creation, MCP deep dive

Contributing

Contributing Guide — How to create adapters, extend the pipeline, and contribute