Checkpoints: The Memory of Evaluation¶
While a Benchmark is the logical package for your evaluation, a Checkpoint is its physical reality. It is the "Record of Truth": a single, portable file that captures the complete state of a benchmark so it can be shared, version-controlled, and reproduced exactly in any environment.
Think of a checkpoint as the Memory of your evaluation. It doesn't just store questions; it stores the precise logic, quality standards, and provenance that define why a result is a pass or a fail.
1. The "Record of Truth" Philosophy¶
Karenina uses checkpoints to solve the "it works on my machine" problem in LLM evaluation. A checkpoint is designed to be:
- Self-Contained: It includes the actual Python source code of your Answer Templates. You don't need a central repository to run a checkpoint; the logic travels with the data.
- Human-Readable: Even though it's a machine-interpretable format, you can open a checkpoint in any text editor and understand exactly what is being evaluated.
- Semantically Rich: By using JSON-LD, we anchor our evaluation data in the global Schema.org standard, making your benchmarks interoperable with other AI safety and evaluation tools.
2. Anatomy of a Checkpoint¶
A checkpoint organizes your benchmark into a clear, nested hierarchy. When you look inside, you are seeing a snapshot of the Four Pillars:
- Benchmark Metadata: The identity (name, version, creator) and the timeline (when it was born and last modified).
- The Global Standards: Rubric traits that apply to every question in the set.
- The Questions: A collection of Question objects, each wrapped in a unique identity.
- The Local Logic: The specific Answer Templates and question-specific rubrics attached to individual prompts.
┌───────────────────────────────────────────────────────────┐
│ DataFeed (The Benchmark Root) │
│ │
│ Identity Metadata Global Rubric Traits │
│ (Name, Version, Creator) (Safety, Conciseness) │
│ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ DataFeedItems (The Questions) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Question 1 │ │ Question 2 │ ... │ │
│ │ │ (and ID) │ │ (and ID) │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │
│ │ ┌──────▼─────────────────▼──────────────────┐ │ │
│ │ │ Inside each Question │ │ │
│ │ │ │ │ │
│ │ │ - Answer Template (Python source) │ │ │
│ │ │ - Question-Specific Rubrics │ │ │
│ │ │ - Local Metadata (Author, Sources) │ │ │
│ │ └───────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
3. The Journey of a Checkpoint¶
3.1. Capturing State (save)¶
When you save a benchmark, Karenina serializes the in-memory Pydantic models into a clean, indented JSON-LD file. It automatically updates the "last modified" timestamp and ensures that all Python logic is safely converted to strings.
```python
from pathlib import Path

# Capture the current state of the benchmark
benchmark.save(Path("drug_target_v1.jsonld"))
```
Deep judgment configuration stripping. By default, save() strips deep judgment configuration fields (e.g., deep_judgment_enabled, deep_judgment_excerpt_enabled) from LLM rubric traits before writing the file. This keeps checkpoint files focused on the benchmark definition and avoids coupling saved checkpoints to a particular deep judgment configuration. To preserve deep judgment settings in the checkpoint (required for use_checkpoint mode), pass save_deep_judgment_config=True:
```python
# Preserve deep judgment trait settings in the checkpoint
benchmark.save(Path("drug_target_v1.jsonld"), save_deep_judgment_config=True)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | `Path` | (required) | File path for the checkpoint (`.jsonld` or `.json`) |
| `save_deep_judgment_config` | `bool` | `False` | If `True`, include deep judgment configuration in LLM rubric traits. If `False`, deep judgment settings are stripped before saving. |
Trait field round-trip reliability. All trait fields (including summary, min_score, max_score, invert_result, higher_is_better, and deep judgment settings) are preserved through save/load cycles. Each trait type serializer writes these fields as additionalProperty entries in the JSON-LD Rating object, and the deserializer restores them faithfully.
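The round-trip mechanism described above can be illustrated with plain JSON. This is a simplified sketch, not Karenina's actual serializer: it shows how trait fields such as `min_score`, `max_score`, and `invert_result` can ride along as `additionalProperty` entries on a JSON-LD `Rating` object and survive a serialize/deserialize cycle.

```python
import json

# Illustrative sketch (not the real Karenina serializer): trait fields are
# carried as additionalProperty entries on the JSON-LD Rating object.
trait_fields = {"min_score": 0, "max_score": 5, "invert_result": False}

rating = {
    "@type": "Rating",
    "name": "conciseness",
    "additionalProperty": [
        {"@type": "PropertyValue", "name": k, "value": v}
        for k, v in trait_fields.items()
    ],
}

# Round-trip through text, as save()/load() do with the checkpoint file
restored = json.loads(json.dumps(rating))
recovered = {p["name"]: p["value"] for p in restored["additionalProperty"]}
assert recovered == trait_fields
```

Because the extra fields live in `additionalProperty` rather than in ad-hoc keys, the document remains valid JSON-LD while carrying Karenina-specific state.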
3.2. Portability & Sharing¶
Because it's a single file, a checkpoint can be committed to Git, sent to a colleague, or archived as part of a research paper. It captures the Definition of the evaluation, not the results, keeping the file lightweight and focused.
3.3. Restoring Context (load)¶
Loading a checkpoint restores the complete evaluation context. Karenina validates the file structure, rebuilds the internal question cache, and prepares the Python templates for execution.
```python
from karenina import Benchmark
from pathlib import Path

# Restore the benchmark from a file
benchmark = Benchmark.load(Path("drug_target_v1.jsonld"))
```
4. Why JSON-LD?¶
Karenina chose JSON-LD (JSON for Linked Data) over plain JSON or CSV for three critical reasons:
| Benefit | Impact on Your Evaluation |
|---|---|
| Semantic Clarity | Explicitly defines what is a Question, an Answer, or a Rating using standard types. |
| Interoperability | Your benchmarks aren't locked into Karenina; they speak the language of the web (Schema.org). |
| Stability | Format versioning lets the framework evolve while ensuring your older benchmarks still load correctly. |
5. Detailed Reference: The Checkpoint Specification¶
For power users and tool developers, this section breaks down the technical mapping of a checkpoint file.
5.1. Schema.org Mapping¶
| Karenina Concept | Schema.org Type | Purpose |
|---|---|---|
| Benchmark | `DataFeed` | The root container for the evaluation set. |
| Question Wrapper | `DataFeedItem` | Holds the unique ID and membership timestamps. |
| Prompt | `Question` | The literal text and nested components. |
| Reference Answer | `Answer` | The human-readable `raw_answer`. |
| Verification Logic | `SoftwareSourceCode` | The Python code for the `answer_template`. |
| Rubric Trait | `Rating` | Qualitative assessments (global or local). |
| Keywords | `keywords` on `Question` | Topic labels for categorization (native schema.org property). |
| Metadata | `PropertyValue` | Arbitrary key-value pairs (notes, author, sources, etc.). |
5.2. Deterministic IDs¶
Question IDs in a checkpoint are content-addressable fingerprints. They are generated using an MD5 hash of the question text:
urn:uuid:question-{readable-prefix}-{8-char-hash}
This ensures that the same question text always produces the same identity across any checkpoint file.
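The ID scheme can be sketched in a few lines of standard-library Python. Note that the exact slugging rules (how many words go into the readable prefix, how punctuation is collapsed) are assumptions here; only the overall shape `urn:uuid:question-{prefix}-{8-char-md5}` comes from the specification above.

```python
import hashlib
import re

def question_id(text: str, prefix_words: int = 6) -> str:
    """Sketch of a content-addressable question ID.

    The slug rules (word count, punctuation handling) are assumptions;
    Karenina's real implementation may differ in those details.
    """
    # First 8 hex characters of the MD5 digest of the question text
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()[:8]
    # Human-readable prefix: lowercase, non-alphanumerics collapsed to dashes
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    slug = "-".join(slug.split("-")[:prefix_words])
    return f"urn:uuid:question-{slug}-{digest}"

# The same text always produces the same identity
assert question_id("What is X?") == question_id("What is X?")
```

Content addressing means two checkpoints that contain the same question text can be diffed or merged by ID without any coordination between their authors.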
5.3. The @context Block¶
The @context tells JSON-LD processors how to interpret property names. Karenina's canonical context:
```json
{
  "@context": {
    "@version": 1.1,
    "@vocab": "https://schema.org/",
    "karenina": "urn:karenina:vocab:",
    "dataFeedElement": { "@id": "dataFeedElement", "@container": "@set" },
    "item": { "@id": "item", "@type": "@id" },
    "acceptedAnswer": { "@id": "acceptedAnswer", "@type": "@id" },
    "rating": { "@id": "contentRating", "@container": "@set" },
    "additionalProperty": { "@id": "additionalProperty", "@container": "@set" },
    "keywords": { "@id": "keywords", "@container": "@set" }
  }
}
```
Key points:
- `@vocab` maps all unqualified terms to `https://schema.org/`. Only entries that add semantic information (container types, ID references, or remappings) are included explicitly.
- `karenina` defines a namespace prefix for Karenina-specific vocabulary. All `additionalType` values on `Rating` objects use this prefix (e.g., `karenina:GlobalRubricTrait`, `karenina:QuestionSpecificRegexTrait`).
- `rating` → `contentRating` remaps the JSON key `rating` to schema.org's `contentRating` property, which is the valid property on `CreativeWork` for accepting `Rating` values.
5.4. The karenina: Vocabulary Namespace¶
Rubric traits are stored as Rating objects with an additionalType that identifies the trait kind and scope. All values use the karenina: namespace prefix:
| `additionalType` | Trait Type | Scope |
|---|---|---|
| `karenina:GlobalRubricTrait` | LLM (boolean/score) | Global |
| `karenina:GlobalLLMRubricTrait` | LLM (literal) | Global |
| `karenina:GlobalRegexTrait` | Regex | Global |
| `karenina:GlobalCallableTrait` | Callable | Global |
| `karenina:GlobalMetricRubricTrait` | Metric | Global |
| `karenina:GlobalDynamicRubricTrait` | Dynamic | Global |
| `karenina:GlobalAgenticRubricTrait` | Agentic | Global |
| `karenina:QuestionSpecificRubricTrait` | LLM (boolean/score) | Per-question |
| `karenina:QuestionSpecificLLMRubricTrait` | LLM (literal) | Per-question |
| `karenina:QuestionSpecificRegexTrait` | Regex | Per-question |
| `karenina:QuestionSpecificCallableTrait` | Callable | Per-question |
| `karenina:QuestionSpecificMetricRubricTrait` | Metric | Per-question |
| `karenina:QuestionSpecificDynamicRubricTrait` | Dynamic | Per-question |
| `karenina:QuestionSpecificAgenticRubricTrait` | Agentic | Per-question |
Old checkpoints without the karenina: prefix are normalized automatically on load.
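The normalization step can be sketched as follows. This is a hypothetical helper, not Karenina's actual loader code; it only illustrates the idea of prefixing bare trait names on load.

```python
def normalize_additional_type(value: str) -> str:
    """Hypothetical sketch of legacy-checkpoint normalization: bare trait
    names (no namespace prefix) get the karenina: prefix prepended.
    The real loader's logic may differ."""
    if ":" not in value:
        return f"karenina:{value}"
    return value

# A pre-namespace value is upgraded; an already-prefixed one is untouched
assert normalize_additional_type("GlobalRubricTrait") == "karenina:GlobalRubricTrait"
assert normalize_additional_type("karenina:GlobalRubricTrait") == "karenina:GlobalRubricTrait"
```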
5.5. Example Structure (Annotated JSON-LD)¶
```json
{
  "@context": { "..." : "see above" },
  "@type": "DataFeed",
  "name": "Documentation Test Benchmark",
  "version": "1.0.0",
  "rating": [
    {
      "@type": "Rating",
      "name": "safety",
      "description": "Response is safe and appropriate",
      "bestRating": 1.0,
      "worstRating": 0.0,
      "additionalType": "karenina:GlobalRubricTrait"
    }
  ],
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "@id": "urn:uuid:question-what-is-the-capital-of-france-cb0b4aaf",
      "item": {
        "@type": "Question",
        "text": "What is the capital of France?",
        "keywords": ["geography", "europe"],
        "acceptedAnswer": { "@type": "Answer", "text": "Paris" },
        "hasPart": {
          "@type": "SoftwareSourceCode",
          "text": "class Answer(BaseAnswer): ...",
          "programmingLanguage": "Python"
        }
      }
    }
  ]
}
```
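Because a checkpoint is plain JSON-LD, any tool can inspect it without Karenina installed. A minimal reader over a fragment of the structure above, using only the standard library:

```python
import json

# A trimmed-down fragment of the annotated example structure
checkpoint = json.loads("""
{
  "@type": "DataFeed",
  "name": "Documentation Test Benchmark",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "@id": "urn:uuid:question-what-is-the-capital-of-france-cb0b4aaf",
      "item": {
        "@type": "Question",
        "text": "What is the capital of France?",
        "acceptedAnswer": {"@type": "Answer", "text": "Paris"}
      }
    }
  ]
}
""")

# Walk the DataFeedItems and pull out each prompt and reference answer
for element in checkpoint["dataFeedElement"]:
    question = element["item"]
    print(question["text"], "->", question["acceptedAnswer"]["text"])
# prints: What is the capital of France? -> Paris
```

This is the interoperability payoff of section 4: the same traversal works in any language with a JSON parser.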
6. Next Steps¶
- Answer Templates: Understanding how the code inside a checkpoint is executed.
- Rubrics: How different trait types are represented as `Rating` objects.
- Evaluation Modes: How to run the evaluation defined in your checkpoint.
- Creating Benchmarks: Step-by-step guides for building your first checkpoint.