Creating Benchmarks¶
This section walks through building benchmarks end-to-end — from creating an empty checkpoint to saving a fully populated benchmark with questions, templates, rubrics, and few-shot examples.
Each tutorial is a complete, self-contained scenario. Choose the one that matches your evaluation needs, or work through them in order to learn all the tools.
Choose Your Scenario¶
| Scenario | Evaluation Strategy | What You'll Learn |
|---|---|---|
| Benchmark Operations | All | Full Benchmark API: creating, populating, templates, rubrics, readiness, filtering, collection protocols |
| Factual QA Benchmark | Template-only | Hand-written templates: boolean check, string normalization, numeric tolerance, regex in verify(), partial credit |
| Full Evaluation Benchmark | Template + rubric | Custom templates combined with all 6 rubric trait types (LLM boolean, LLM score, LLM literal, regex, callable, metric) |
| Quality Assessment | Rubric-only | No templates: quality evaluation for tasks with no single correct answer (safety, empathy, clarity) |
| Choosing Rubric Traits | Template + rubric | Need-driven trait selection: 7 evaluation needs mapped to the right trait type, decision flowchart |
| Scaled Authoring | Power user | Bulk ingestion, generate_all_templates(), AnswerBuilder, ADeLe classification, few-shot examples |
| AI-Assisted Authoring | Template generation | generate_answer_template(), AnswerBuilder, two-phase generation, batch generate_all_templates() |
| ADeLe Classification | Question profiling | 18-dimension classification, QuestionClassifier, batch scoring, ADeLe rubrics, complexity filtering |
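The verification techniques named in the Factual QA row above (string normalization, numeric tolerance, regex in verify()) can be sketched in plain Python. These are illustrative stand-ins, not the Karenina template API:

```python
import re

def normalized_match(got: str, expected: str) -> bool:
    """String normalization: compare case- and whitespace-insensitively."""
    return got.strip().lower() == expected.strip().lower()

def within_tolerance(got: float, expected: float, rel_tol: float = 0.01) -> bool:
    """Numeric tolerance: accept values within a relative tolerance of the expected answer."""
    return abs(got - expected) <= rel_tol * abs(expected)

def regex_match(got: str, pattern: str) -> bool:
    """Regex check: accept any answer that fully matches the pattern."""
    return re.fullmatch(pattern, got.strip()) is not None
```

A real template's verify() method would typically combine checks like these, returning a boolean (or a fractional score for partial credit).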
Evaluation Strategies¶
Karenina supports three evaluation strategies. Every benchmark uses one of these:
┌─────────────────────────────────────────────────────────┐
│ Template-only │
│ Questions + Templates → Correctness (pass/fail) │
│ "Is the extracted information correct?" │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Template + Rubric │
│ Questions + Templates + Rubrics → Correctness + Quality │
│ "Is it correct AND well-written?" │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Rubric-only │
│ Questions + Rubrics → Quality assessment │
│ "Is it safe, clear, and appropriate?" │
└─────────────────────────────────────────────────────────┘
| Strategy | Required | Optional | Best For |
|---|---|---|---|
| Template-only | Questions, templates | ADeLe, few-shot | Factual questions with definitive answers |
| Template + rubric | Questions, templates, rubrics | ADeLe, few-shot | Comprehensive evaluation (correctness + quality) |
| Rubric-only | Questions, rubrics | ADeLe, few-shot | Subjective tasks, communication, safety |
See Evaluation Modes for details on how these strategies map to pipeline behavior.
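The three strategies differ only in which checks run. A minimal sketch of that difference, using hypothetical check functions rather than Karenina's actual API:

```python
import re

def template_check(answer: str) -> bool:
    # Correctness check: does the answer contain the expected fact?
    return "paris" in answer.lower()

def rubric_check(answer: str) -> dict:
    # Quality traits: e.g. a callable trait for brevity, a regex trait for hedging.
    return {
        "concise": len(answer.split()) <= 20,
        "no_hedging": re.search(r"\b(maybe|perhaps)\b", answer.lower()) is None,
    }

answer = "The capital of France is Paris."
template_only = template_check(answer)                         # pass/fail
template_plus_rubric = (template_check(answer), rubric_check(answer))
rubric_only = rubric_check(answer)                             # quality only
```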
Common Workflow¶
All scenarios follow this general pattern:
Create benchmark
│
▼
Add questions (with or without templates)
│
▼
Define evaluation criteria (templates, rubrics, or both)
│
▼
Save checkpoint
│
▼
Reload and verify round-trip
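The steps above can be sketched with a toy stand-in class. This is not the Karenina Benchmark class; it only illustrates the create → add questions → save → reload → verify round-trip pattern:

```python
import json
import os
import tempfile

class MiniBenchmark:
    """Toy stand-in (NOT Karenina's Benchmark) for the checkpoint round-trip."""

    def __init__(self, name: str):
        self.name = name
        self.questions = []

    @classmethod
    def create(cls, name: str) -> "MiniBenchmark":
        return cls(name)

    def add_question(self, question: str, raw_answer: str) -> None:
        self.questions.append({"question": question, "raw_answer": raw_answer})

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump({"name": self.name, "questions": self.questions}, f)

    @classmethod
    def load(cls, path: str) -> "MiniBenchmark":
        with open(path) as f:
            data = json.load(f)
        bench = cls(data["name"])
        bench.questions = data["questions"]
        return bench

# Round-trip: save, reload, and verify nothing was lost.
bench = MiniBenchmark.create("demo")
bench.add_question("Capital of France?", "Paris")
path = os.path.join(tempfile.gettempdir(), "demo_checkpoint.json")
bench.save(path)
reloaded = MiniBenchmark.load(path)
assert reloaded.name == bench.name and reloaded.questions == bench.questions
```

Verifying the round-trip after every save is a cheap way to catch serialization bugs before running a long verification job against a checkpoint.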
Key APIs¶
| Operation | Method | Covered In |
|---|---|---|
| Create benchmark | Benchmark.create(name, description, version) | All scenarios |
| Add question | benchmark.add_question(question, raw_answer, answer_template=...) | All scenarios |
| Add template to existing question | benchmark.add_answer_template(question_id, code_string) | Factual QA |
| Generate templates automatically | benchmark.generate_all_templates(model, model_provider) | Scaled Authoring |
| Build templates programmatically | AnswerBuilder().add_attribute(...).compile() | Scaled Authoring |
| Add global rubric trait | benchmark.add_global_rubric_trait(trait) | Full Evaluation, Quality Assessment |
| Add per-question rubric trait | benchmark.add_question_rubric_trait(question_id, trait) | Full Evaluation, Quality Assessment |
| Check readiness | benchmark.check_readiness() | Benchmark Operations |
| Filter questions | benchmark.filter_questions(finished=True, has_template=True) | Benchmark Operations |
| Save checkpoint | benchmark.save("path.jsonld") | All scenarios |
| Load checkpoint | Benchmark.load("path.jsonld") | All scenarios |
Core Concepts¶
These concept pages provide the foundational knowledge that the scenarios build on:
- Answer Templates — What templates are, field types, verify() semantics
- Rubrics — Trait types (LLM, regex, callable, metric), global vs per-question
- Checkpoints — JSON-LD format, save/load behavior
- Evaluation Modes — How template-only, template+rubric, and rubric-only map to pipeline stages
- ADeLe Classification — Question complexity dimensions
- Few-Shot Examples — Configuration modes and example selection
Next Steps¶
Once your benchmark is built and saved, proceed to:
- Running Verification — Execute the benchmark against LLMs
- Analyzing Results — Inspect and compare verification outcomes