
Questions and Benchmarks

A benchmark is Karenina's self-contained evaluation package: questions, answer templates, rubric traits, and metadata bundled into a single portable unit. A question is the building block inside a benchmark, carrying the text sent to the LLM and a reference answer. This page provides a structural overview; the sub-pages cover each component in depth.

1. Benchmark Structure

A benchmark organizes its content in a tree. Understanding this tree is the key to understanding how Karenina's pieces fit together:

Benchmark
├── Metadata (name, version, description, creator, timestamps)
├── Custom Properties          ← arbitrary key-value pairs at the benchmark level
├── Global Rubric Traits       ← quality checks applied to every question
└── Questions[]
    ├── Question text          ← what to ask the LLM
    ├── Expected answer        ← raw_answer: human-readable reference answer
    ├── Answer notes           ← optional free-text notes for interpreting the answer
    ├── Answer template        ← correctness verification code (Pydantic model)
    ├── Question-specific traits ← quality checks for this question only
    ├── Few-shot examples      ← optional parsing guidance for the Judge LLM
    ├── Intrinsic metadata     ← keywords, author, sources, timestamps, custom fields
    └── Registry entry         ← finished flag, date_added (benchmark membership state)

The sub-pages cover each layer in depth:

  • Benchmarks: the benchmark as a package, metadata, persistence (checkpoints and database)
  • Questions: the Question schema, deterministic IDs, raw_answer vs ground_truth, the finished flag
  • Checkpoints: the JSON-LD file format used for portable benchmark persistence

2. Questions: Two Layers of Data

Each question stores data at two levels: the Question object itself (text, raw_answer, keywords, template, rubric traits, metadata) and a membership record tracking the question's state within this benchmark (finished flag, date_added). This split exists because the same question can belong to multiple benchmarks with different membership states. See Questions for the full field reference.
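The sub-page covers deterministic IDs in detail. Purely as an illustration of why such IDs let one question belong to several benchmarks, one common approach is hashing the question text; this sketch is an assumption, not necessarily Karenina's exact scheme:

```python
import hashlib

def question_id(text: str) -> str:
    # Illustrative only: derive a stable ID from the question text, so the
    # same question maps to the same ID regardless of which benchmark holds it.
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()[:16]

# The same text always yields the same ID; membership state (finished,
# date_added) can then be stored per benchmark, keyed by this ID.
print(question_id("What is 2+2?") == question_id("What is 2+2?"))  # → True
```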

3. How Questions, Templates, and Rubrics Connect

Each question can optionally have an answer template and question-specific rubric traits. Additionally, global rubric traits defined at the benchmark level apply to all questions. These components are independently attachable: a question can have a template, a rubric, both, or neither.

                    ┌────────────────────┐
                    │     Benchmark      │
                    │                    │
                    │  Global Rubric ────┼──── applies to ALL questions
                    └─────────┬──────────┘
              ┌───────────────┼───────────────┐
              │               │               │
       ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
       │ Question 1  │ │ Question 2  │ │ Question 3  │
       │             │ │             │ │             │
       │ Template ✓  │ │ Template ✓  │ │ No template │
       │ Q-Rubric ✓  │ │ No Q-Rubric │ │ Q-Rubric ✓  │
       └─────────────┘ └─────────────┘ └─────────────┘
  • Question 1: Evaluated with its template (correctness) + global rubric + question-specific rubric (quality)
  • Question 2: Evaluated with its template + global rubric only
  • Question 3: No template; can only be evaluated in rubric_only mode
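Since templates are Pydantic models, a minimal one might look like the sketch below. The class and field are hypothetical, invented for illustration; see Answer Templates for what real templates contain:

```python
from pydantic import BaseModel, Field

class Answer(BaseModel):
    # Hypothetical template: the Judge LLM parses the raw LLM response into
    # this structure, and the parsed field is compared against the ground truth.
    value: int = Field(description="Numeric answer extracted from the response")

parsed = Answer(value=4)
print(parsed.value)  # → 4
```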

For details on what templates and rubrics do, see Answer Templates, Rubrics, and Templates vs Rubrics.

4. The finished Flag

Only questions marked finished=True enter the verification pipeline; unfinished questions are skipped. Defaults and troubleshooting are covered in Questions.
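In effect, selecting the questions that enter the pipeline amounts to a filter on the registry entries. A sketch, using hypothetical (text, finished) pairs rather than Karenina's actual registry type:

```python
# Hypothetical registry entries: (question_text, finished flag)
registry = [
    ("What is 2+2?", True),
    ("Draft question, not yet reviewed", False),
]

# Only finished questions enter the verification pipeline.
to_verify = [text for text, finished in registry if finished]
print(to_verify)  # → ['What is 2+2?']
```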

5. Evaluation Modes

The benchmark's composition (which questions have templates, which have rubrics) determines which evaluation mode to use:

Mode                  Templates   Rubrics   When to Use
template_only         Yes         No        Pure correctness verification (default)
template_and_rubric   Yes         Yes       Correctness + quality assessment
rubric_only           No          Yes       Quality-only evaluation (open-ended questions)
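The table reduces to a simple decision on what each question carries. A sketch with a hypothetical helper (the mode names are from the table above; the function is not Karenina's API):

```python
def evaluation_mode(has_template: bool, has_rubric: bool) -> str:
    # Maps a question's composition to the evaluation mode from the table.
    if has_template and has_rubric:
        return "template_and_rubric"
    if has_template:
        return "template_only"
    if has_rubric:
        return "rubric_only"
    raise ValueError("a question needs a template, a rubric, or both")

print(evaluation_mode(True, False))  # → template_only
```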

See Evaluation Modes for the complete stage matrix and configuration details.

6. Definition vs Execution

The benchmark defines what to evaluate: which questions to ask, how to verify correctness, and what quality traits to assess. Runtime settings (which models to use, how many replicates, timeouts, caching) are specified separately in VerificationConfig. This separation means the same benchmark can be run against different models or configurations without modification. Results are stored in the database, not inside the benchmark.
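The separation can be sketched as follows. The config fields here are hypothetical placeholders for runtime settings; VerificationConfig's real fields are documented elsewhere:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VerificationConfig:
    # Hypothetical runtime settings, kept outside the benchmark definition.
    model: str
    replicates: int = 1
    timeout_s: float = 60.0

benchmark = {"name": "demo", "questions": ["What is 2+2?"]}  # what to evaluate

# The same benchmark object runs under different configs without modification.
runs = [
    (benchmark, VerificationConfig(model="model-a")),
    (benchmark, VerificationConfig(model="model-b", replicates=3)),
]
print(len(runs))  # → 2
```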

7. Next Steps