Quick Start¶
Get started with Karenina in minutes. This guide walks you through creating a benchmark, adding questions, writing answer templates, defining rubric traits, running verification, and inspecting results.
By the end you will have a working benchmark that evaluates LLM responses for both correctness (via answer templates) and quality (via rubric traits).
Prerequisites¶
- Python 3.11+
- Karenina installed (see Installation)
- API keys for the LLM providers you plan to use:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AI..."
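The examples below read these keys from the environment. A minimal stdlib sketch (the helper name is ours, not part of Karenina) to confirm which keys are visible to Python before you start:

```python
import os

def missing_api_keys(env=os.environ):
    """Return the names of expected provider keys that are unset or empty."""
    expected = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
    return [name for name in expected if not env.get(name)]

if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print(f"Missing keys: {', '.join(missing)}")
```

You only need the keys for the providers you actually call; treat any missing-key warning for an unused provider as informational.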
Step 1: Create a Benchmark¶
A benchmark is the top-level container that holds questions, answer templates, rubric traits, and verification results.
from karenina import Benchmark
benchmark = Benchmark.create(
    name="Genomics Knowledge Benchmark",
    description="Testing LLM knowledge of genomics and molecular biology",
    version="1.0.0",
    creator="Your Name",
)
print(f"Created benchmark: {benchmark.name}")
Created benchmark: Genomics Knowledge Benchmark
Learn more: Creating Benchmarks · Core Concepts
Step 2: Add Question and Answer Pairs¶
Each question has a text prompt and a reference answer (the ground truth).
questions = [
    {
        "question": "How many chromosomes are in a human somatic cell?",
        "answer": "46",
    },
    {
        "question": "What is the approved drug target of Venetoclax?",
        "answer": "BCL2",
    },
    {
        "question": "How many protein subunits does hemoglobin A have?",
        "answer": "4",
    },
]
question_ids = []
for q in questions:
    qid = benchmark.add_question(
        question=q["question"],
        raw_answer=q["answer"],
        author={"name": "Bio Curator", "email": "curator@example.com"},
    )
    question_ids.append(qid)
print(f"Added {len(question_ids)} questions")
Added 3 questions
Learn more: Factual QA Benchmark — including bulk import from Excel, CSV, and TSV files
Step 3: Write Answer Templates¶
Answer templates are Pydantic models that define how a Judge LLM should parse and verify a model's response. At its core, designing a template means establishing the exact validation logic used to evaluate the model's answers. Each field uses VerifiedField to declare:
- A description that tells the judge what to extract
- The ground truth value (what a correct answer looks like)
- A verification primitive (verify_with) that checks the extracted value against ground truth
The class must inherit from BaseAnswer.
Case A: Automatic Generation¶
The fastest way to get started is to let Karenina generate templates for you using an LLM. The model analyses each question and its reference answer, then produces a complete template:
benchmark.generate_all_templates(
    model="claude-haiku-4-5",
    model_provider="anthropic",
    temperature=0.0,
)
print(f"Generated templates for {benchmark.question_count} questions")
Generated templates for 3 questions
You can review a generated template to see what the LLM produced:
generated_code = benchmark.get_template(question_ids[0])
print(generated_code)
from karenina.schemas.entities import BaseAnswer, VerifiedField, NumericExact
class Answer(BaseAnswer):
    chromosome_count: int = VerifiedField(
        description="Extract the integer number of chromosomes stated in the response as the count present in a human somatic cell.",
        ground_truth=46,
        verify_with=NumericExact(),
    )
Case B: Manual Definition (Class-Based)¶
Alternatively, when you need precise control over verification logic, define templates as Python classes and pass them directly. Each field uses VerifiedField to declare what to extract, the correct answer, and how to compare:
from karenina.schemas.entities import BaseAnswer, VerifiedField
from karenina.schemas.primitives import BooleanMatch
class Answer(BaseAnswer):
    identifies_bcl2_as_target: bool = VerifiedField(
        description=(
            "True if the response identifies BCL2 (including Bcl-2, BCL-2, or "
            "B-cell lymphoma 2) as the direct pharmacological target. False if "
            "BCL2 is mentioned only as a pathway member or a different protein "
            "is identified as the primary target."
        ),
        ground_truth=True,
        verify_with=BooleanMatch(),
    )
benchmark.update_template(question_ids[1], Answer)
print("Updated template for Venetoclax question with class-based definition")
Updated template for Venetoclax question with class-based definition
Learn more: Factual QA Benchmark · Scaled Authoring · Answer Templates (Concepts)
Step 4: Add Rubric Traits¶
While templates verify correctness, rubrics assess quality — properties of the raw response like conciseness, safety, or format compliance.
Karenina supports four trait types: LLM, regex, callable, and metric. Here we use two.
Global Trait: evaluated for every question¶
from karenina.schemas import LLMRubricTrait
benchmark.add_global_rubric_trait(
    LLMRubricTrait(
        name="Conciseness",
        description="Rate how concise the answer is on a scale of 1-5, where 1 is very verbose and 5 is extremely concise.",
        kind="score",
    )
)
print("Added global rubric trait: Conciseness (score 1-5)")
Added global rubric trait: Conciseness (score 1-5)
Question-Specific Trait: evaluated for one question¶
This LLM trait checks that the Venetoclax answer discusses the drug's safety profile:
from karenina.schemas import LLMRubricTrait
venetoclax_qid = question_ids[1] # "What is the approved drug target of Venetoclax?"
benchmark.add_question_rubric_trait(
    venetoclax_qid,
    LLMRubricTrait(
        name="Discusses Safety Profile",
        description=(
            "Answer True if the response discusses the safety profile of venetoclax, "
            "including any mention of adverse effects, toxicity, contraindications, "
            "or risk factors. Answer False if the response omits safety considerations "
            "entirely."
        ),
        kind="boolean",
    ),
)
print(f"Added LLM trait 'Discusses Safety Profile' to question {venetoclax_qid}")
Added LLM trait 'Discusses Safety Profile' to question urn:uuid:question-what-is-the-approved-drug-target-of-venetoclax-2a9de717
Learn more: Full Evaluation Benchmark · All Four Trait Types — LLM, regex, callable, and metric traits
Step 5: Run Verification¶
Configure the answering model (the model being evaluated) and the parsing model (the judge), then run verification.
from karenina.schemas import ModelConfig, VerificationConfig
config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-haiku-4-5",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.7,
            system_prompt="You are a knowledgeable assistant. Answer accurately and concisely.",
        )
    ],
    parsing_models=[
        ModelConfig(
            id="claude-haiku-4-5",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    evaluation_mode="template_and_rubric",
)
results = benchmark.run_verification(config)
print(f"Verification complete — {len(results.results)} results")
Verification complete — 3 results
Learn more: Verification Config · Multi-Model Evaluation · Model Config Reference · CLI Verification
Step 6: Inspect Results¶
VerificationResultSet provides specialized accessors that convert results into pandas DataFrames for analysis.
Template results¶
Use get_template_results() to access pass/fail data and field-level comparisons:
template_results = results.get_template_results()
df_templates = template_results.to_dataframe()
df_templates[["question_id", "field_name", "gt_value", "llm_value", "field_match"]]
| | question_id | field_name | gt_value | llm_value | field_match |
|---|---|---|---|---|---|
| 0 | urn:uuid:question-how-many-chromosomes-are-in-... | chromosome_count | 46 | 46 | True |
| 1 | urn:uuid:question-what-is-the-approved-drug-ta... | identifies_bcl2_as_target | True | True | True |
| 2 | urn:uuid:question-how-many-protein-subunits-do... | hemoglobin_a_subunit_count | 4 | 4 | True |
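Because to_dataframe() returns an ordinary pandas DataFrame, standard filtering applies. For instance, you can isolate the fields that failed verification, sketched here on a hand-built frame with the same columns as the table above (the rows are illustrative, not real Karenina output):

```python
import pandas as pd

# Stand-in for template_results.to_dataframe(), using the columns shown above
df_templates = pd.DataFrame([
    {"question_id": "q1", "field_name": "chromosome_count",
     "gt_value": 46, "llm_value": 46, "field_match": True},
    {"question_id": "q2", "field_name": "identifies_bcl2_as_target",
     "gt_value": True, "llm_value": False, "field_match": False},
])

# Keep only the fields where the judged value disagreed with ground truth
failures = df_templates[~df_templates["field_match"]]
print(failures[["question_id", "field_name", "gt_value", "llm_value"]])
```

On a fully passing run like the one above, this frame is simply empty.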
Pass rate¶
template_results.aggregate_pass_rate(by="question_id")
{'urn:uuid:question-how-many-chromosomes-are-in-a-human-somatic-cell-3e6df3f9': 1.0,
'urn:uuid:question-how-many-protein-subunits-does-hemoglobin-a-have-99d0c100': 1.0,
'urn:uuid:question-what-is-the-approved-drug-target-of-venetoclax-2a9de717': 1.0}
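The returned mapping is a plain dict of question ID to pass rate, so an overall average needs no extra tooling (the keys below are shortened placeholders, not real IDs):

```python
# Per-question pass rates, shaped like the output of
# aggregate_pass_rate(by="question_id")
pass_rates = {
    "urn:uuid:question-chromosomes": 1.0,
    "urn:uuid:question-hemoglobin": 1.0,
    "urn:uuid:question-venetoclax": 1.0,
}

overall = sum(pass_rates.values()) / len(pass_rates)
print(f"Overall pass rate: {overall:.0%}")  # → 100%
```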
Rubric results¶
Use get_rubrics_results() to access trait scores as a DataFrame:
rubric_results = results.get_rubrics_results()
df_rubrics = rubric_results.to_dataframe()
df_rubrics[["question_id", "trait_name", "trait_score", "trait_type"]]
| | question_id | trait_name | trait_score | trait_type |
|---|---|---|---|---|
| 0 | urn:uuid:question-how-many-chromosomes-are-in-... | Conciseness | 4 | llm_score |
| 1 | urn:uuid:question-what-is-the-approved-drug-ta... | Conciseness | 3 | llm_score |
| 2 | urn:uuid:question-what-is-the-approved-drug-ta... | Discusses Safety Profile | False | llm_binary |
| 3 | urn:uuid:question-how-many-protein-subunits-do... | Conciseness | 4 | llm_score |
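Numeric trait scores aggregate like any other DataFrame column. A sketch of averaging the 1-5 score traits per trait name, using hand-built rows shaped like the table above (boolean traits are excluded, since a mean is not meaningful for them):

```python
import pandas as pd

# Stand-in for rubric_results.to_dataframe(), with the columns shown above
df_rubrics = pd.DataFrame([
    {"question_id": "q1", "trait_name": "Conciseness",
     "trait_score": 4, "trait_type": "llm_score"},
    {"question_id": "q2", "trait_name": "Conciseness",
     "trait_score": 3, "trait_type": "llm_score"},
    {"question_id": "q2", "trait_name": "Discusses Safety Profile",
     "trait_score": False, "trait_type": "llm_binary"},
    {"question_id": "q3", "trait_name": "Conciseness",
     "trait_score": 4, "trait_type": "llm_score"},
])

# Average only the score-type traits; the mixed column is cast to float first
scores = df_rubrics[df_rubrics["trait_type"] == "llm_score"].copy()
scores["trait_score"] = scores["trait_score"].astype(float)
means = scores.groupby("trait_name")["trait_score"].mean()
print(means)
```

For boolean traits, a fraction-true summary (e.g. `.mean()` over the True/False values) is usually the more useful aggregate.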
Learn more: DataFrame Analysis · VerificationResult · Exporting Results
Step 7: Save and Load¶
Save the benchmark — including questions, templates, rubrics, and results — as a JSON-LD checkpoint file.
from pathlib import Path
checkpoint_path = Path("genomics_benchmark.jsonld")
benchmark.save(checkpoint_path)
print("Saved to genomics_benchmark.jsonld")
Saved to genomics_benchmark.jsonld
Load it back later:
loaded = Benchmark.load(checkpoint_path)
print(f"Loaded '{loaded.name}' with {loaded.question_count} questions")
Loaded 'Genomics Knowledge Benchmark' with 3 questions
Learn more: Checkpoints · Factual QA Benchmark · Loading Benchmarks