
Creating Benchmarks

This section walks through building benchmarks end-to-end — from creating an empty checkpoint to saving a fully populated benchmark with questions, templates, rubrics, and few-shot examples.

Each tutorial is a complete, self-contained scenario. Choose the one that matches your evaluation needs, or work through them in order to learn all the tools.

Choose Your Scenario

| Scenario | Evaluation Strategy | What You'll Learn |
|---|---|---|
| Benchmark Operations | All | Full Benchmark API: creating, populating, templates, rubrics, readiness, filtering, collection protocols |
| Factual QA Benchmark | Template-only | Hand-written templates: boolean check, string normalization, numeric tolerance, regex in `verify()`, partial credit |
| Full Evaluation Benchmark | Template + rubric | Custom templates combined with all 6 rubric trait types (LLM boolean, LLM score, LLM literal, regex, callable, metric) |
| Quality Assessment | Rubric-only | No templates: quality evaluation for tasks with no single correct answer (safety, empathy, clarity) |
| Choosing Rubric Traits | Template + rubric | Need-driven trait selection: 7 evaluation needs mapped to the right trait type, decision flowchart |
| Scaled Authoring | Power user | Bulk ingestion, `generate_all_templates()`, `AnswerBuilder`, ADeLe classification, few-shot examples |
| AI-Assisted Authoring | Template generation | `generate_answer_template()`, `AnswerBuilder`, two-phase generation, batch `generate_all_templates()` |
| ADeLe Classification | Question profiling | 18-dimension classification, `QuestionClassifier`, batch scoring, ADeLe rubrics, complexity filtering |
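To give a flavor of the hand-written checks named in the Factual QA row (boolean comparison, string normalization, numeric tolerance, regex in `verify()`), here is a plain-Python sketch. These classes are illustrative stand-ins only, not Karenina's real template base class; the actual conventions are covered in the Factual QA tutorial.

```python
import re

# Hypothetical stand-in templates (illustration only).
class BoilingPointAnswer:
    def __init__(self, value_celsius: float):
        self.value_celsius = value_celsius

    def verify(self) -> bool:
        # Numeric tolerance: accept answers within 0.5 degrees C.
        return abs(self.value_celsius - 100.0) <= 0.5


class CapitalAnswer:
    def __init__(self, city: str):
        self.city = city

    def verify(self) -> bool:
        # String normalization + regex: case- and whitespace-insensitive match.
        normalized = re.sub(r"\s+", " ", self.city.strip().lower())
        return bool(re.fullmatch(r"paris", normalized))
```

The key idea is that each template's `verify()` returns a boolean verdict from deterministic logic, so correctness checks stay cheap and reproducible.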

Evaluation Strategies

Karenina supports three evaluation strategies. Every benchmark uses one of these:

┌─────────────────────────────────────────────────────────┐
│                      Template-only                      │
│  Questions + Templates → Correctness (pass/fail)        │
│  "Is the extracted information correct?"                │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                    Template + Rubric                    │
│  Questions + Templates + Rubrics → Correctness + Quality│
│  "Is it correct AND well-written?"                      │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                       Rubric-only                       │
│  Questions + Rubrics → Quality assessment               │
│  "Is it safe, clear, and appropriate?"                  │
└─────────────────────────────────────────────────────────┘

| Strategy | Required | Optional | Best For |
|---|---|---|---|
| Template-only | Questions, templates | ADeLe, few-shot | Factual questions with definitive answers |
| Template + rubric | Questions, templates, rubrics | ADeLe, few-shot | Comprehensive evaluation (correctness + quality) |
| Rubric-only | Questions, rubrics | ADeLe, few-shot | Subjective tasks, communication, safety |
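The deterministic rubric trait types (regex and callable) can be pictured as simple predicates over the response text. The functions below are hypothetical illustrations of that idea, not Karenina's actual trait classes; the real trait API appears in the Full Evaluation and Quality Assessment tutorials.

```python
import re

# Hypothetical regex-style trait: fail responses that hedge.
def no_hedging(response: str) -> bool:
    return re.search(r"\b(maybe|possibly)\b", response, re.IGNORECASE) is None

# Hypothetical callable-style trait: pass responses under 50 words.
def concise(response: str) -> bool:
    return len(response.split()) < 50
```

LLM-judged traits (boolean, score, literal) follow the same shape conceptually, but delegate the verdict to a judge model instead of local code.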

See Evaluation Modes for details on how these strategies map to pipeline behavior.


Common Workflow

All scenarios follow the same general pattern:

1. Create benchmark
2. Add questions (with or without templates)
3. Define evaluation criteria (templates, rubrics, or both)
4. Save checkpoint
5. Reload and verify round-trip
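To make the round-trip concrete, here is a toy in-memory stand-in that mirrors the documented method names (`create`, `add_question`, `save`, `load`). It is not the real `Benchmark` class, which persists full JSON-LD checkpoints and much more; this sketch only shows the shape of the workflow.

```python
import json
import os
import tempfile

# Toy stand-in (illustration only) mirroring the documented method names.
class ToyBenchmark:
    def __init__(self, name: str):
        self.name = name
        self.questions = []

    @classmethod
    def create(cls, name: str) -> "ToyBenchmark":
        return cls(name)

    def add_question(self, question, raw_answer, answer_template=None):
        self.questions.append({
            "question": question,
            "raw_answer": raw_answer,
            "answer_template": answer_template,
        })

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump({"name": self.name, "questions": self.questions}, f)

    @classmethod
    def load(cls, path: str) -> "ToyBenchmark":
        with open(path) as f:
            data = json.load(f)
        bench = cls(data["name"])
        bench.questions = data["questions"]
        return bench

# 1. Create, 2. add questions, 4. save, 5. reload and verify round-trip.
bench = ToyBenchmark.create("demo")
bench.add_question("What is 2 + 2?", "4")
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
bench.save(path)
reloaded = ToyBenchmark.load(path)
assert reloaded.questions == bench.questions
```

Step 3 (defining templates and rubrics) is where the scenarios diverge, which is why each tutorial covers it separately.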

Key APIs

| Operation | Method | Covered In |
|---|---|---|
| Create benchmark | `Benchmark.create(name, description, version)` | All scenarios |
| Add question | `benchmark.add_question(question, raw_answer, answer_template=...)` | All scenarios |
| Add template to existing question | `benchmark.add_answer_template(question_id, code_string)` | Factual QA |
| Generate templates automatically | `benchmark.generate_all_templates(model, model_provider)` | Scaled Authoring |
| Build templates programmatically | `AnswerBuilder().add_attribute(...).compile()` | Scaled Authoring |
| Add global rubric trait | `benchmark.add_global_rubric_trait(trait)` | Full Evaluation, Quality Assessment |
| Add per-question rubric trait | `benchmark.add_question_rubric_trait(question_id, trait)` | Full Evaluation, Quality Assessment |
| Check readiness | `benchmark.check_readiness()` | Benchmark Operations |
| Filter questions | `benchmark.filter_questions(finished=True, has_template=True)` | Benchmark Operations |
| Save checkpoint | `benchmark.save("path.jsonld")` | All scenarios |
| Load checkpoint | `Benchmark.load("path.jsonld")` | All scenarios |

Core Concepts

These concept pages provide the foundational knowledge that the scenarios build on:


Next Steps

Once your benchmark is built and saved, proceed to: