Quick Start¶
Get started with Karenina in minutes. This guide walks you through creating a benchmark, adding questions, writing answer templates, defining rubric traits, running verification, and inspecting results.
By the end you will have a working benchmark that evaluates LLM responses for both correctness (via answer templates) and quality (via rubric traits).
Prerequisites¶
- Python 3.11+
- Karenina installed (see Installation)
- API keys for the LLM providers you plan to use:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AI..."
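The examples below read these keys from the environment. A minimal stdlib sketch (the helper name is ours, not part of Karenina) to confirm which keys are visible to Python before you start:

```python
import os

def missing_api_keys(env=os.environ):
    """Return the names of expected provider keys that are unset or empty."""
    expected = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
    return [name for name in expected if not env.get(name)]

if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print(f"Missing keys: {', '.join(missing)}")
```

You only need the keys for the providers you actually call; treat any missing-key warning for an unused provider as informational.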
Step 1: Create a Benchmark¶
A benchmark is the top-level container that holds questions, answer templates, rubric traits, and verification results.
from karenina import Benchmark
benchmark = Benchmark.create(
    name="Genomics Knowledge Benchmark",
    description="Testing LLM knowledge of genomics and molecular biology",
    version="1.0.0",
    creator="Your Name",
)
print(f"Created benchmark: {benchmark.name}")
Created benchmark: Genomics Knowledge Benchmark
Learn more: Creating Benchmarks · Core Concepts
Step 2: Add Question and Answer Pairs¶
Each question has a text prompt and a reference answer (the ground truth).
questions = [
    {
        "question": "How many chromosomes are in a human somatic cell?",
        "answer": "46",
    },
    {
        "question": "What is the approved drug target of Venetoclax?",
        "answer": "BCL2",
    },
    {
        "question": "How many protein subunits does hemoglobin A have?",
        "answer": "4",
    },
]
question_ids = []
for q in questions:
    qid = benchmark.add_question(
        question=q["question"],
        raw_answer=q["answer"],
        author={"name": "Bio Curator", "email": "curator@example.com"},
    )
    question_ids.append(qid)
print(f"Added {len(question_ids)} questions")
Added 3 questions
Learn more: Factual QA Benchmark — including bulk import from Excel, CSV, and TSV files
Step 3: Write Answer Templates¶
Answer templates are Pydantic models that define how a Judge LLM should parse and verify a model's response. At its core, designing a template means establishing the exact validation logic used to evaluate the model's answers. Each field uses VerifiedField to declare:
- A description that tells the judge what to extract
- The ground truth value (what a correct answer looks like)
- A verification primitive (verify_with) that checks the extracted value against ground truth
The class must inherit from BaseAnswer.
Case A: Automatic Generation¶
The fastest way to get started is to let Karenina generate templates for you using an LLM. The model analyses each question and its reference answer, then produces a complete template:
benchmark.generate_all_templates(
    model="claude-haiku-4-5",
    model_provider="anthropic",
    temperature=0.0,
)
print(f"Generated templates for {benchmark.question_count} questions")
Generated templates for 3 questions
You can review a generated template to see what the LLM produced:
generated_code = benchmark.get_template(question_ids[0])
print(generated_code)
from karenina.schemas.entities import BaseAnswer, VerifiedField, NumericExact
class Answer(BaseAnswer):
    chromosome_count: int = VerifiedField(
        description="Extract the integer number of chromosomes stated in the response as the count present in a human somatic cell.",
        ground_truth=46,
        verify_with=NumericExact(),
    )
Case B: Manual Definition (Class-Based)¶
Alternatively, when you need precise control over verification logic, define templates as Python classes and pass them directly. Each field uses VerifiedField to declare what to extract, the correct answer, and how to compare:
from karenina.schemas.entities import BaseAnswer, VerifiedField
from karenina.schemas.primitives import BooleanMatch
class Answer(BaseAnswer):
    identifies_bcl2_as_target: bool = VerifiedField(
        description=(
            "True if the response identifies BCL2 (including Bcl-2, BCL-2, or "
            "B-cell lymphoma 2) as the direct pharmacological target. False if "
            "BCL2 is mentioned only as a pathway member or a different protein "
            "is identified as the primary target."
        ),
        ground_truth=True,
        verify_with=BooleanMatch(),
    )
benchmark.update_template(question_ids[1], Answer)
print("Updated template for Venetoclax question with class-based definition")
Updated template for Venetoclax question with class-based definition
Learn more: Factual QA Benchmark · Scaled Authoring · Answer Templates (Concepts)
Step 4: Add Rubric Traits¶
While templates verify correctness, rubrics assess quality — properties of the raw response like conciseness, safety, or format compliance.
Karenina supports four trait types: LLM, regex, callable, and metric. Here we use two.
Global Trait: evaluated for every question¶
from karenina.schemas import LLMRubricTrait
benchmark.add_global_rubric_trait(
    LLMRubricTrait(
        name="Conciseness",
        description="Rate how concise the answer is on a scale of 1-5, where 1 is very verbose and 5 is extremely concise.",
        kind="score",
    )
)
print("Added global rubric trait: Conciseness (score 1-5)")
Added global rubric trait: Conciseness (score 1-5)
Question-Specific Trait: evaluated for one question¶
This LLM trait checks that the Venetoclax answer discusses the drug's safety profile:
from karenina.schemas import LLMRubricTrait
venetoclax_qid = question_ids[1] # "What is the approved drug target of Venetoclax?"
benchmark.add_question_rubric_trait(
    venetoclax_qid,
    LLMRubricTrait(
        name="Discusses Safety Profile",
        description=(
            "Answer True if the response discusses the safety profile of venetoclax, "
            "including any mention of adverse effects, toxicity, contraindications, "
            "or risk factors. Answer False if the response omits safety considerations "
            "entirely."
        ),
        kind="boolean",
    ),
)
print(f"Added LLM trait 'Discusses Safety Profile' to question {venetoclax_qid}")
Added LLM trait 'Discusses Safety Profile' to question urn:uuid:question-what-is-the-approved-drug-target-of-venetoclax-2a9de717
Learn more: Full Evaluation Benchmark · All Four Trait Types — LLM, regex, callable, and metric traits
Step 5: Run Verification¶
Configure the answering model (the model being evaluated) and the parsing model (the judge), then run verification.
from karenina.schemas import ModelConfig, VerificationConfig
config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-haiku-4-5",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.7,
            system_prompt="You are a knowledgeable assistant. Answer accurately and concisely.",
        )
    ],
    parsing_models=[
        ModelConfig(
            id="claude-haiku-4-5",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    evaluation_mode="template_and_rubric",
)
results = benchmark.run_verification(config)
print(f"Verification complete — {len(results.results)} results")
Verification complete — 3 results
Learn more: Verification Config · Multi-Model Evaluation · Model Config Reference · CLI Verification
Step 6: Inspect Results¶
VerificationResultSet provides specialized accessors that convert results into pandas DataFrames for analysis.
Template results¶
Use get_template_results() to access pass/fail data and field-level comparisons:
template_results = results.get_template_results()
df_templates = template_results.to_dataframe()
df_templates[["question_id", "field_name", "gt_value", "llm_value", "field_match"]]
| | question_id | field_name | gt_value | llm_value | field_match |
|---|---|---|---|---|---|
| 0 | urn:uuid:question-how-many-chromosomes-are-in-... | chromosome_count | 46 | 46 | True |
| 1 | urn:uuid:question-what-is-the-approved-drug-ta... | identifies_bcl2_as_target | True | True | True |
| 2 | urn:uuid:question-how-many-protein-subunits-do... | hemoglobin_a_subunit_count | 4 | 4 | True |
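Because to_dataframe() returns an ordinary pandas DataFrame, standard filtering applies. For instance, you can isolate the fields that failed verification, sketched here on a hand-built frame with the same columns as the table above (the rows are illustrative, not real Karenina output):

```python
import pandas as pd

# Stand-in for template_results.to_dataframe(), using the columns shown above
df_templates = pd.DataFrame([
    {"question_id": "q1", "field_name": "chromosome_count",
     "gt_value": 46, "llm_value": 46, "field_match": True},
    {"question_id": "q2", "field_name": "identifies_bcl2_as_target",
     "gt_value": True, "llm_value": False, "field_match": False},
])

# Keep only the fields where the judged value disagreed with ground truth
failures = df_templates[~df_templates["field_match"]]
print(failures[["question_id", "field_name", "gt_value", "llm_value"]])
```

On a fully passing run like the one above, this frame is simply empty.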
Pass rate¶
template_results.aggregate_pass_rate(by="question_id")
{'urn:uuid:question-how-many-chromosomes-are-in-a-human-somatic-cell-3e6df3f9': 1.0,
'urn:uuid:question-how-many-protein-subunits-does-hemoglobin-a-have-99d0c100': 1.0,
'urn:uuid:question-what-is-the-approved-drug-target-of-venetoclax-2a9de717': 1.0}
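The returned mapping is a plain dict of question ID to pass rate, so an overall average needs no extra tooling (the keys below are shortened placeholders, not real IDs):

```python
# Per-question pass rates, shaped like the output of
# aggregate_pass_rate(by="question_id")
pass_rates = {
    "urn:uuid:question-chromosomes": 1.0,
    "urn:uuid:question-hemoglobin": 1.0,
    "urn:uuid:question-venetoclax": 1.0,
}

overall = sum(pass_rates.values()) / len(pass_rates)
print(f"Overall pass rate: {overall:.0%}")  # → 100%
```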
Rubric results¶
Use get_rubrics_results() to access trait scores as a DataFrame:
rubric_results = results.get_rubrics_results()
df_rubrics = rubric_results.to_dataframe()
df_rubrics[["question_id", "trait_name", "trait_score", "trait_type"]]
| | question_id | trait_name | trait_score | trait_type |
|---|---|---|---|---|
| 0 | urn:uuid:question-how-many-chromosomes-are-in-... | Conciseness | 4 | llm_score |
| 1 | urn:uuid:question-what-is-the-approved-drug-ta... | Conciseness | 3 | llm_score |
| 2 | urn:uuid:question-what-is-the-approved-drug-ta... | Discusses Safety Profile | False | llm_binary |
| 3 | urn:uuid:question-how-many-protein-subunits-do... | Conciseness | 4 | llm_score |
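Numeric trait scores aggregate like any other DataFrame column. A sketch of averaging the 1-5 score traits per trait name, using hand-built rows shaped like the table above (boolean traits are excluded, since a mean is not meaningful for them):

```python
import pandas as pd

# Stand-in for rubric_results.to_dataframe(), with the columns shown above
df_rubrics = pd.DataFrame([
    {"question_id": "q1", "trait_name": "Conciseness",
     "trait_score": 4, "trait_type": "llm_score"},
    {"question_id": "q2", "trait_name": "Conciseness",
     "trait_score": 3, "trait_type": "llm_score"},
    {"question_id": "q2", "trait_name": "Discusses Safety Profile",
     "trait_score": False, "trait_type": "llm_binary"},
    {"question_id": "q3", "trait_name": "Conciseness",
     "trait_score": 4, "trait_type": "llm_score"},
])

# Average only the score-type traits; the mixed column is cast to float first
scores = df_rubrics[df_rubrics["trait_type"] == "llm_score"].copy()
scores["trait_score"] = scores["trait_score"].astype(float)
means = scores.groupby("trait_name")["trait_score"].mean()
print(means)
```

For boolean traits, a fraction-true summary (e.g. `.mean()` over the True/False values) is usually the more useful aggregate.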
Learn more: DataFrame Analysis · VerificationResult · Exporting Results
Step 7: Save and Load¶
Save the benchmark — including questions, templates, rubrics, and results — as a JSON-LD checkpoint file.
from pathlib import Path
checkpoint_path = Path("genomics_benchmark.jsonld")
benchmark.save(checkpoint_path)
print("Saved to genomics_benchmark.jsonld")
Saved to genomics_benchmark.jsonld
Load it back later:
loaded = Benchmark.load(checkpoint_path)
print(f"Loaded '{loaded.name}' with {loaded.question_count} questions")
Loaded 'Genomics Knowledge Benchmark' with 3 questions
Learn more: Checkpoints · Factual QA Benchmark · Loading Benchmarks