Creating Benchmarks¶
This section walks through building benchmarks end-to-end — from creating an empty checkpoint to saving a fully populated benchmark with questions, templates, rubrics, and few-shot examples.
Each tutorial is a complete, self-contained scenario. Choose the one that matches your evaluation needs, or work through them in order to learn all the tools.
Choose Your Scenario¶
| Scenario | Evaluation Strategy | What You'll Learn |
|---|---|---|
| Benchmark Operations | All | Full Benchmark API: creating, populating, templates, rubrics, readiness, filtering, collection protocols |
| Factual QA Benchmark | Template-only | Hand-written templates: boolean check, string normalization, numeric tolerance, regex in verify(), partial credit |
| Full Evaluation Benchmark | Template + rubric | Custom templates combined with all 6 rubric trait types (LLM boolean, LLM score, LLM literal, regex, callable, metric) |
| Quality Assessment | Rubric-only | No templates: quality evaluation for tasks with no single correct answer (safety, empathy, clarity) |
| Choosing Rubric Traits | Template + rubric | Need-driven trait selection: 7 evaluation needs mapped to the right trait type, decision flowchart |
| Scaled Authoring | Power user | Bulk ingestion, generate_all_templates(), AnswerBuilder, ADeLe classification, few-shot examples |
| AI-Assisted Authoring | Template generation | generate_answer_template(), AnswerBuilder, two-phase generation, batch generate_all_templates() |
| ADeLe Classification | Question profiling | 18-dimension classification, QuestionClassifier, batch scoring, ADeLe rubrics, complexity filtering |
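The verification techniques named in the Factual QA row above (string normalization, numeric tolerance, regex in verify()) can be sketched in plain Python. These are illustrative stand-ins, not the Karenina template API:

```python
import re

def normalized_match(got: str, expected: str) -> bool:
    """String normalization: compare case- and whitespace-insensitively."""
    return got.strip().lower() == expected.strip().lower()

def within_tolerance(got: float, expected: float, rel_tol: float = 0.01) -> bool:
    """Numeric tolerance: accept values within a relative tolerance of the expected answer."""
    return abs(got - expected) <= rel_tol * abs(expected)

def regex_match(got: str, pattern: str) -> bool:
    """Regex check: accept any answer that fully matches the pattern."""
    return re.fullmatch(pattern, got.strip()) is not None
```

A real template's verify() method would typically combine checks like these, returning a boolean (or a fractional score for partial credit).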
Evaluation Strategies¶
Karenina supports three evaluation strategies. Every benchmark uses one of these:
┌─────────────────────────────────────────────────────────┐
│ Template-only │
│ Questions + Templates → Correctness (pass/fail) │
│ "Is the extracted information correct?" │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Template + Rubric │
│ Questions + Templates + Rubrics → Correctness + Quality │
│ "Is it correct AND well-written?" │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Rubric-only │
│ Questions + Rubrics → Quality assessment │
│ "Is it safe, clear, and appropriate?" │
└─────────────────────────────────────────────────────────┘
| Strategy | Required | Optional | Best For |
|---|---|---|---|
| Template-only | Questions, templates | ADeLe, few-shot | Factual questions with definitive answers |
| Template + rubric | Questions, templates, rubrics | ADeLe, few-shot | Comprehensive evaluation (correctness + quality) |
| Rubric-only | Questions, rubrics | ADeLe, few-shot | Subjective tasks, communication, safety |
See Evaluation Modes for details on how these strategies map to pipeline behavior.
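The three strategies differ only in which checks run. A minimal sketch of that difference, using hypothetical check functions rather than Karenina's actual API:

```python
import re

def template_check(answer: str) -> bool:
    # Correctness check: does the answer contain the expected fact?
    return "paris" in answer.lower()

def rubric_check(answer: str) -> dict:
    # Quality traits: e.g. a callable trait for brevity, a regex trait for hedging.
    return {
        "concise": len(answer.split()) <= 20,
        "no_hedging": re.search(r"\b(maybe|perhaps)\b", answer.lower()) is None,
    }

answer = "The capital of France is Paris."
template_only = template_check(answer)                         # pass/fail
template_plus_rubric = (template_check(answer), rubric_check(answer))
rubric_only = rubric_check(answer)                             # quality only
```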
Common Workflow¶
All scenarios follow this general pattern:
Create benchmark
│
▼
Add questions (with or without templates)
│
▼
Define evaluation criteria (templates, rubrics, or both)
│
▼
Save checkpoint
│
▼
Reload and verify round-trip
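The steps above can be sketched with a toy stand-in class. This is not the Karenina Benchmark class; it only illustrates the create → add questions → save → reload → verify round-trip pattern:

```python
import json
import os
import tempfile

class MiniBenchmark:
    """Toy stand-in (NOT Karenina's Benchmark) for the checkpoint round-trip."""

    def __init__(self, name: str):
        self.name = name
        self.questions = []

    @classmethod
    def create(cls, name: str) -> "MiniBenchmark":
        return cls(name)

    def add_question(self, question: str, raw_answer: str) -> None:
        self.questions.append({"question": question, "raw_answer": raw_answer})

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump({"name": self.name, "questions": self.questions}, f)

    @classmethod
    def load(cls, path: str) -> "MiniBenchmark":
        with open(path) as f:
            data = json.load(f)
        bench = cls(data["name"])
        bench.questions = data["questions"]
        return bench

# Round-trip: save, reload, and verify nothing was lost.
bench = MiniBenchmark.create("demo")
bench.add_question("Capital of France?", "Paris")
path = os.path.join(tempfile.gettempdir(), "demo_checkpoint.json")
bench.save(path)
reloaded = MiniBenchmark.load(path)
assert reloaded.name == bench.name and reloaded.questions == bench.questions
```

Verifying the round-trip after every save is a cheap way to catch serialization bugs before running a long verification job against a checkpoint.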
Key APIs¶
| Operation | Method | Covered In |
|---|---|---|
| Create benchmark | Benchmark.create(name, description, version) | All scenarios |
| Add question | benchmark.add_question(question, raw_answer, answer_template=...) | All scenarios |
| Add template to existing question | benchmark.add_answer_template(question_id, code_string) | Factual QA |
| Generate templates automatically | benchmark.generate_all_templates(model, model_provider) | Scaled Authoring |
| Build templates programmatically | AnswerBuilder().add_attribute(...).compile() | Scaled Authoring |
| Add global rubric trait | benchmark.add_global_rubric_trait(trait) | Full Evaluation, Quality Assessment |
| Add per-question rubric trait | benchmark.add_question_rubric_trait(question_id, trait) | Full Evaluation, Quality Assessment |
| Check readiness | benchmark.check_readiness() | Benchmark Operations |
| Filter questions | benchmark.filter_questions(finished=True, has_template=True) | Benchmark Operations |
| Save checkpoint | benchmark.save("path.jsonld") | All scenarios |
| Load checkpoint | Benchmark.load("path.jsonld") | All scenarios |
Core Concepts¶
These concept pages provide the foundational knowledge that the scenarios build on:
- Answer Templates — What templates are, field types, verify() semantics
- Rubrics — Trait types (LLM, regex, callable, metric), global vs per-question
- Checkpoints — JSON-LD format, save/load behavior
- Evaluation Modes — How template-only, template+rubric, and rubric-only map to pipeline stages
- ADeLe Classification — Question complexity dimensions
- Few-Shot Examples — Configuration modes and example selection
Next Steps¶
Once your benchmark is built and saved, proceed to:
- Running Verification — Execute the benchmark against LLMs
- Analyzing Results — Inspect and compare verification outcomes