Verification Primitives¶
Verification primitives are deterministic comparison functions that decide whether a judge-extracted value is correct. Each primitive is assigned to a field via VerifiedField(verify_with=...) and runs after the Judge LLM has parsed the response into a structured answer template. Primitives never call an LLM; they execute pure Python comparison logic against a known ground truth or against parameters embedded in the primitive itself.
```python
from typing import Literal

from karenina.schemas.entities import BaseAnswer, VerifiedField
from karenina.schemas.primitives import (
    BooleanMatch,
    ContainsAll,
    ContainsAny,
    DateMatch,
    DateRange,
    DateTolerance,
    ExactMatch,
    LiteralMatch,
    NumericExact,
    NumericRange,
    NumericTolerance,
    OrderedMatch,
    RegexMatch,
    SemanticMatch,
    SetContainment,
    TraceContains,
    TraceLength,
    TraceRegex,
)
from karenina.schemas.entities.normalizers import SynonymMap
```
1. What Verification Primitives Are¶
A verification primitive is a small, stateless function object that answers one question: does the extracted value match what was expected? It receives two inputs (the judge-extracted value and the ground truth), applies a comparison rule, and returns True or False. That is its entire job.
Primitives are the last link in the answer template evaluation chain:
Response (free text) → Judge LLM → Extracted values → Primitives → Pass/Fail
Primitives do not see the original question, the full response text, or the judge's reasoning. They see only the extracted value and the ground truth (or, for trace primitives, the raw response text). This deliberate narrowness makes them deterministic: the same extracted value always produces the same verdict, regardless of which judge model was used.
What Primitives Are Not¶
Primitives are not rubric traits. Rubrics assess observable qualities of the response (safety, conciseness, citation style) without requiring ground truth. Primitives verify correctness against a known expected answer. Rubrics answer "how well?"; primitives answer "right or wrong?"
Primitives are also not the verification pipeline. The verification pipeline orchestrates 13 stages including answer generation, judge parsing, and result storage. Primitives participate in a single stage: verify_template (stage 8).
2. Core Idea: Deterministic Checks After LLM Parsing¶
The most important idea behind verification primitives is the separation between LLM judgment and programmatic verification.
In Karenina's LLM-as-judge approach, only the Judge LLM reads the response and extracts structured data. Once the judge has filled in the template, everything that follows is deterministic code. Primitives are that deterministic code. They never consult an LLM. Swap out the judge model, and the same primitives still produce the same verdict from the same extracted values.
This separation provides three properties:
- Reproducibility: given the same extracted values, verification always produces the same result
- Transparency: you can inspect exactly why a field passed or failed by examining the primitive's parameters and the extracted value
- Speed: primitives execute in microseconds with no API calls, no latency, no cost
The alternative, asking an LLM "is this answer correct?", is what rubric traits do for subjective qualities. For factual correctness with known ground truth, primitives are faster, cheaper, and more reliable.
Walkthrough: From Response to Verdict¶
Suppose the benchmark asks: "How many pairs of chromosomes does a normal human somatic cell have?"
You define the template:
```python
class Answer(BaseAnswer):
    pair_count: int = VerifiedField(
        description="The number of chromosome pairs in a normal human somatic cell",
        ground_truth=23,
        verify_with=NumericExact(),
    )
```
The answering model responds:
"Human somatic cells are diploid, containing 46 chromosomes organized into 23 pairs..."
The Judge LLM extracts: {"pair_count": 23} (guided by the field description and type)
The primitive runs the check: NumericExact().check(23, 23) returns True
The judge does not see ground_truth or verify_with. The primitive does not see the original response. Each party does one job.
```python
# Direct instantiation to demonstrate the primitive in isolation.
# In practice, the pipeline handles parsing and verification automatically.
parsed = Answer(pair_count=23)
print(f"Correct: pair_count=23 -> verify(): {parsed.verify()}")

parsed_wrong = Answer(pair_count=46)
print(f"Wrong: pair_count=46 -> verify(): {parsed_wrong.verify()}")
```
3. Two Categories: Parsed vs Trace¶
Karenina provides 18 primitives in two categories:
| Category | Count | Operates on | Included in judge schema? | Method |
|---|---|---|---|---|
| Parsed | 15 | Judge-extracted field value + ground truth | Yes | check(extracted, expected) |
| Trace | 3 | Raw LLM response text | No (field excluded from schema) | check_trace(raw_trace) |
Parsed primitives are the default. The Judge LLM sees the field in the JSON schema, extracts a value, and the primitive compares that value against ground_truth. Most evaluation uses parsed primitives.
Trace primitives bypass the judge entirely for that field. The field is removed from the JSON schema sent to the judge, so the judge never attempts to extract it. Instead, the pipeline runs the primitive directly against the raw response text. Trace fields must be typed as bool because the primitive returns a boolean (pattern found or not found), and that result is compared against ground_truth to determine pass or fail (see Section 6 for details).
When to Use Each Category¶
| Situation | Use |
|---|---|
| Value must be interpreted from natural language | Parsed primitive |
| Value must be extracted from context (synonyms, paraphrases) | Parsed primitive |
| Check is a simple pattern match (regex, substring) | Trace primitive |
| Check is a mechanical constraint (response length) | Trace primitive |
| You need the extracted value in results for analysis | Parsed primitive |
Use trace primitives when the check is a pure pattern match or length constraint that does not benefit from LLM interpretation. For example, checking whether the response contains a clinical trial identifier (NCT\d{8}) is more reliable as a regex than as a judge extraction.
4. Choosing the Right Primitive¶
By Data Type¶
| Field Type | Natural Primitive | When to Use an Alternative |
|---|---|---|
| `bool` | `BooleanMatch` | Always use `BooleanMatch` for parsed booleans |
| `str` | `ExactMatch` | `ContainsAny`/`ContainsAll` for multiple acceptable answers; `RegexMatch` for format validation; `SemanticMatch` for meaning-based comparison |
| `int`, `float` | `NumericExact` | `NumericTolerance` for measurements with acceptable variance; `NumericRange` when no single correct value exists |
| `list[str]` | `SetContainment` | `OrderedMatch` when element order matters |
| `Literal[...]` | `LiteralMatch` | Always use `LiteralMatch` for `Literal` fields |
| `str` (date) | `DateMatch` | `DateTolerance` for approximate dates; `DateRange` when any date in a window is acceptable |
By Verification Need¶
| Need | Primitive | Key Parameter |
|---|---|---|
| Exact string after normalization | `ExactMatch` | `normalize` |
| Any of several acceptable substrings | `ContainsAny` | `substrings` |
| All required terms present | `ContainsAll` | `substrings` |
| Format matches a pattern | `RegexMatch` | `pattern` |
| Meaning is similar (requires embeddings) | `SemanticMatch` | `threshold` |
| Exact number | `NumericExact` | (none) |
| Number within tolerance | `NumericTolerance` | `tolerance`, `mode` |
| Number in a range | `NumericRange` | `min`, `max` |
| Set membership | `SetContainment` | `mode` |
| Ordered list equality | `OrderedMatch` | `normalize` |
| Fixed category match | `LiteralMatch` | (none) |
| Date equality | `DateMatch` | `format` |
| Date within tolerance | `DateTolerance` | `tolerance`, `unit` |
| Date in a range | `DateRange` | `min`, `max` |
| Regex in raw response | `TraceRegex` | `pattern`, `count_min` |
| Substring in raw response | `TraceContains` | `substring` |
| Response length constraint | `TraceLength` | `min`, `max`, `unit` |
Decision Heuristics¶
- **Start simple.** `BooleanMatch`, `ExactMatch`, and `NumericExact` cover most cases. Reach for more complex primitives only when these are insufficient.
- **Use normalization before adding alternatives.** If `ExactMatch` fails because of case or whitespace differences, add normalizers rather than switching to `ContainsAny`.
- **Prefer parsed primitives over trace primitives** unless the check is a pure pattern match. Parsed primitives benefit from the judge's ability to interpret context and synonyms.
- **Use `ground_truth` when a single correct answer exists.** Use parameter-based primitives (`ContainsAny`, `NumericRange`, `DateRange`) when the answer space is a set or range rather than a single point.
5. Parsed Primitives Reference¶
All parsed primitives subclass VerificationPrimitive and implement check(extracted, expected) -> bool. The extracted argument is the value the Judge LLM produced; expected is the ground_truth from VerifiedField.
5.1. Boolean¶
BooleanMatch¶
Compare the extracted boolean to the ground truth boolean. Both values are coerced to bool before comparison.
No parameters.
Applies to: bool
```python
class Answer(BaseAnswer):
    is_approved: bool = VerifiedField(
        description="Whether the drug is FDA-approved",
        ground_truth=True,
        verify_with=BooleanMatch(),
    )

parsed = Answer(is_approved=True)
print(f"True -> verify(): {parsed.verify()}")

parsed_wrong = Answer(is_approved=False)
print(f"False -> verify(): {parsed_wrong.verify()}")
```
5.2. String¶
ExactMatch¶
Pass if the extracted string equals the ground truth after applying the configured normalizers.
Applies to: str

```python
class Answer(BaseAnswer):
    target: str = VerifiedField(
        description="Protein target name",
        ground_truth="BCL2",
        verify_with=ExactMatch(normalize=["lowercase", "strip"]),
    )

for value in ["BCL2", "bcl2", " Bcl2 ", "KRAS"]:
    parsed = Answer(target=value)
    print(f"{value!r:>10} -> verify(): {parsed.verify()}")
```
ContainsAny¶
Pass if the extracted text contains at least one of the specified substrings. This primitive ignores ground_truth; the expected values are supplied via the substrings parameter.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `substrings` | `list[str]` | required | At least one must appear in the extracted value |
| `normalize` | `list[Normalizer]` | `[]` | Normalizers applied before comparison |
Applies to: str
```python
class Answer(BaseAnswer):
    mechanism: str = VerifiedField(
        description="Mechanism of action described in the response",
        ground_truth="N/A",  # ignored by ContainsAny; required by VerifiedField
        verify_with=ContainsAny(substrings=["apoptosis", "autophagy"]),
    )

parsed = Answer(mechanism="Induces apoptosis by inhibiting BCL2")
print(f"Contains 'apoptosis': {parsed.verify()}")

parsed_miss = Answer(mechanism="Inhibits cell proliferation")
print(f"Contains neither: {parsed_miss.verify()}")
```
Primitives that ignore ground_truth
ContainsAny, ContainsAll, RegexMatch, NumericRange, and DateRange carry their expected values in constructor parameters, not in ground_truth. You must still provide a ground_truth value because it is a required parameter of VerifiedField, but the primitive does not use it. By convention, use a placeholder such as "N/A" for strings or 0 for numbers.
ContainsAll¶
Pass if the extracted text contains all of the specified substrings. This primitive ignores ground_truth.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `substrings` | `list[str]` | required | All must appear in the extracted value |
| `normalize` | `list[Normalizer]` | `[]` | Normalizers applied before comparison |
Applies to: str
```python
class Answer(BaseAnswer):
    summary: str = VerifiedField(
        description="Trial design summary",
        ground_truth="N/A",
        verify_with=ContainsAll(substrings=["phase III", "randomized", "double-blind"]),
    )

parsed = Answer(summary="A phase III, randomized, double-blind study")
print(f"Contains all: {parsed.verify()}")

parsed_miss = Answer(summary="A phase III, open-label study")
print(f"Missing terms: {parsed_miss.verify()}")
```
RegexMatch¶
Pass if the extracted text matches the specified regex pattern. This primitive ignores ground_truth. Uses re.search(), so the pattern can match anywhere in the string.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `pattern` | `str` | required | Regular expression to match against the extracted value |
| `flags` | `list[str]` | `[]` | Regex flag names (e.g., `["IGNORECASE"]`) |
Applies to: str
```python
class Answer(BaseAnswer):
    identifier: str = VerifiedField(
        description="ClinicalTrials.gov identifier",
        ground_truth="N/A",
        verify_with=RegexMatch(pattern=r"NCT\d{8}"),
    )

parsed = Answer(identifier="NCT02141282")
print(f"Valid ID: {parsed.verify()}")

parsed_bad = Answer(identifier="CT-2014-001")
print(f"Invalid ID: {parsed_bad.verify()}")
```
SemanticMatch¶
Pass if the embedding similarity between the extracted value and ground_truth meets the threshold.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `threshold` | `float` | `0.85` | Minimum cosine similarity score to pass |
Applies to: str
```python
class Answer(BaseAnswer):
    rationale: str = VerifiedField(
        description="Clinical rationale for the treatment",
        ground_truth="Targets the BCL2 anti-apoptotic protein",
        verify_with=SemanticMatch(threshold=0.80),
    )
```
SemanticMatch requires the embedding_check pipeline stage
SemanticMatch cannot be called directly; its check() method raises NotImplementedError. It serves as a marker that tells the pipeline's embedding_check stage (stage 9) to compute embedding similarity for this field. You must enable embedding_check in your VerificationConfig and configure an embedding model. Calling verify() directly on a template with SemanticMatch fields will raise an error.
5.3. Numeric¶
NumericExact¶
Pass if the extracted value equals the ground truth after float coercion.
No parameters.
Applies to: int, float
```python
class Answer(BaseAnswer):
    patient_count: int = VerifiedField(
        description="Number of patients enrolled",
        ground_truth=342,
        verify_with=NumericExact(),
    )

parsed = Answer(patient_count=342)
print(f"342 -> {parsed.verify()}")

parsed_wrong = Answer(patient_count=340)
print(f"340 -> {parsed_wrong.verify()}")
```
NumericTolerance¶
Pass if the extracted value is within a specified tolerance of the ground truth.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `tolerance` | `float` | required | Allowed deviation from ground truth |
| `mode` | `Literal["relative", "absolute"]` | `"relative"` | `"relative"`: fraction of ground truth; `"absolute"`: raw difference |
Applies to: int, float
In "relative" mode, the check is |extracted - expected| / |expected| <= tolerance. When the expected value is zero, only an exact match passes. In "absolute" mode, the check is |extracted - expected| <= tolerance.
```python
class Answer(BaseAnswer):
    hazard_ratio: float = VerifiedField(
        description="Hazard ratio from primary analysis",
        ground_truth=0.72,
        verify_with=NumericTolerance(tolerance=0.05, mode="absolute"),
    )

for value in [0.72, 0.70, 0.77, 0.80]:
    parsed = Answer(hazard_ratio=value)
    print(f"{value} -> verify(): {parsed.verify()}")
```
Choosing tolerance values
Use NumericExact() for exact counts (chromosomes, enrolled patients). Use NumericTolerance(tolerance=..., mode="absolute") for physical measurements with known precision (body temperature, boiling points). Use NumericTolerance(tolerance=..., mode="relative") for values that span wide ranges where a percentage margin makes more sense (e.g., tolerance=0.1 to accept within 10%).
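The tolerance math itself is simple enough to sanity-check in isolation. Below is a minimal stand-in for the check described above (`within_tolerance` is an illustrative helper, not the library's internal implementation):

```python
# Minimal sketch of the NumericTolerance check in both modes.
# Illustrative only -- not the library's internal code.
def within_tolerance(extracted: float, expected: float,
                     tolerance: float, mode: str = "relative") -> bool:
    diff = abs(extracted - expected)
    if mode == "absolute":
        return diff <= tolerance
    # relative mode: a zero expected value passes only on an exact match
    if expected == 0:
        return diff == 0
    return diff / abs(expected) <= tolerance

print(within_tolerance(0.70, 0.72, 0.05, mode="absolute"))  # True: off by 0.02
print(within_tolerance(0.80, 0.72, 0.05, mode="absolute"))  # False: off by 0.08
print(within_tolerance(110, 100, 0.1))                      # True: within 10%
print(within_tolerance(5, 0, 0.1))                          # False: expected is zero
```

Note that in relative mode the denominator is the ground truth, not the extracted value, so the margin scales with the expected magnitude.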
NumericRange¶
Pass if the extracted value falls within the specified bounds (inclusive). This primitive ignores ground_truth.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `min` | `float \| None` | `None` | Lower bound (inclusive); no lower bound if `None` |
| `max` | `float \| None` | `None` | Upper bound (inclusive); no upper bound if `None` |
Applies to: int, float
```python
class Answer(BaseAnswer):
    p_value: float = VerifiedField(
        description="Primary endpoint p-value",
        ground_truth=0,
        verify_with=NumericRange(min=0.0, max=0.05),
    )

for value in [0.001, 0.05, 0.10]:
    parsed = Answer(p_value=value)
    print(f"{value} -> verify(): {parsed.verify()}")
```
5.4. List¶
SetContainment¶
Compare lists as sets with configurable containment modes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `str` | `"exact"` | `"exact"`, `"subset"`, `"superset"`, or `"overlap"` |
| `min_overlap` | `int \| None` | `None` | Minimum shared elements; only used in `"overlap"` mode (defaults to 1 if not set) |
Applies to: list[str]
| Mode | Passes when | Use case |
|---|---|---|
| `"exact"` | Extracted set equals expected set | Must name all and only the expected items |
| `"subset"` | Extracted is a subset of expected | Extracted items must all be valid; not all expected items are required |
| `"superset"` | Extracted is a superset of expected | All expected items must appear; extra items are acceptable |
| `"overlap"` | At least `min_overlap` elements in common | Partial credit or flexible matching |
```python
class Answer(BaseAnswer):
    indications: list[str] = VerifiedField(
        description="Approved indications listed in the response",
        ground_truth=["CLL", "SLL", "AML"],
        verify_with=SetContainment(mode="superset"),
    )

parsed = Answer(indications=["CLL", "SLL", "AML", "NHL"])
print(f"Superset (extra OK): {parsed.verify()}")

parsed_missing = Answer(indications=["CLL", "SLL"])
print(f"Missing AML: {parsed_missing.verify()}")
```
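The four modes reduce to ordinary set algebra. A standalone sketch of how each mode decides (illustrative reimplementation, not the library's code):

```python
# Set-algebra sketch of the containment modes from the table above.
# Illustrative only -- not the library's implementation.
def set_containment(extracted: list[str], expected: list[str],
                    mode: str = "exact", min_overlap: int = 1) -> bool:
    e, g = set(extracted), set(expected)
    if mode == "exact":
        return e == g
    if mode == "subset":
        return e <= g                       # every extracted item is expected
    if mode == "superset":
        return e >= g                       # every expected item was extracted
    return len(e & g) >= min_overlap        # "overlap"

gt = ["CLL", "SLL", "AML"]
print(set_containment(["CLL", "SLL", "AML", "NHL"], gt, mode="superset"))  # True
print(set_containment(["CLL", "SLL"], gt, mode="superset"))                # False
print(set_containment(["CLL", "MCL"], gt, mode="overlap"))                 # True
```

Because the comparison is set-based, duplicates and element order in the extracted list have no effect in any mode.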
OrderedMatch¶
Pass if each element of the extracted list matches the corresponding element of the ground truth list after normalization. Lists must have the same length.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `normalize` | `list[Normalizer]` | `["lowercase", "strip"]` | Normalizers applied to each element before comparison |
Applies to: list[str]
```python
class Answer(BaseAnswer):
    authors: list[str] = VerifiedField(
        description="Author names in order of appearance",
        ground_truth=["Smith J", "Jones A", "Patel R"],
        verify_with=OrderedMatch(normalize=["lowercase", "strip"]),
    )

parsed = Answer(authors=["smith j", "jones a", "patel r"])
print(f"Correct order: {parsed.verify()}")

parsed_wrong = Answer(authors=["Jones A", "Smith J", "Patel R"])
print(f"Wrong order: {parsed_wrong.verify()}")
```
5.5. Literal¶
LiteralMatch¶
Pass if the extracted value equals the ground truth. Use it for Literal-typed fields, where the judge is already constrained to a fixed set of categories.
No parameters.
Applies to: Literal[...]

```python
class Answer(BaseAnswer):
    trial_phase: Literal["I", "II", "III", "IV"] = VerifiedField(
        description="Clinical trial phase",
        ground_truth="III",
        verify_with=LiteralMatch(),
    )

parsed = Answer(trial_phase="III")
print(f"Correct phase: {parsed.verify()}")

parsed_wrong = Answer(trial_phase="II")
print(f"Wrong phase: {parsed_wrong.verify()}")
```
5.6. Date and Time¶
DateMatch¶
Parse both values as dates and compare for equality. Only the date portion is compared; time is ignored. When format is None, python-dateutil is used for flexible date parsing.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `format` | `str \| None` | `None` | `strptime` format string; uses python-dateutil if `None` |
Applies to: str
```python
class Answer(BaseAnswer):
    approval_date: str = VerifiedField(
        description="FDA approval date",
        ground_truth="2016-04-11",
        verify_with=DateMatch(),
    )

parsed = Answer(approval_date="April 11, 2016")
print(f"Flexible parsing: {parsed.verify()}")

parsed_wrong = Answer(approval_date="2016-04-12")
print(f"Wrong date: {parsed_wrong.verify()}")
```
DateTolerance¶
Pass if the extracted date is within the specified tolerance of the ground truth date.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `tolerance` | `int` | required | Maximum allowed difference |
| `unit` | `Literal["days", "hours", "minutes"]` | `"days"` | Time unit: `"days"`, `"hours"`, or `"minutes"` |
Applies to: str
```python
class Answer(BaseAnswer):
    approval_date: str = VerifiedField(
        description="FDA approval date",
        ground_truth="2016-04-11",
        verify_with=DateTolerance(tolerance=30, unit="days"),
    )

parsed = Answer(approval_date="2016-04-25")
print(f"Within 30 days: {parsed.verify()}")

parsed_far = Answer(approval_date="2016-06-15")
print(f"Beyond 30 days: {parsed_far.verify()}")
```
DateRange¶
Pass if the extracted date falls within the specified bounds (inclusive). This primitive ignores ground_truth.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `min` | `str \| None` | `None` | Earliest acceptable date; no lower bound if `None` |
| `max` | `str \| None` | `None` | Latest acceptable date; no upper bound if `None` |
Applies to: str
```python
class Answer(BaseAnswer):
    submission_date: str = VerifiedField(
        description="NDA submission date",
        ground_truth="N/A",
        verify_with=DateRange(min="2015-01-01", max="2015-12-31"),
    )

parsed = Answer(submission_date="2015-06-15")
print(f"In range: {parsed.verify()}")

parsed_out = Answer(submission_date="2016-02-01")
print(f"Out of range: {parsed_out.verify()}")
```
6. Trace Primitives¶
Trace primitives operate on the raw LLM response text, bypassing the judge entirely. Fields that use a trace primitive are removed from the JSON schema sent to the judge, so the judge never sees or attempts to extract them. The pipeline evaluates them directly after parsing completes.
How Trace Primitives Use ground_truth¶
Trace fields must be typed as bool. The primitive's check_trace() method returns a boolean (pattern found, substring present, length within bounds), and the pipeline compares that result against bool(ground_truth):
```python
# Conceptually (pass is a Python keyword, so the verdict is named field_passes):
field_passes = primitive.check_trace(raw_response) == bool(ground_truth)
```
Setting ground_truth=True means "the check should succeed" (the pattern should be found, the substring should be present). Setting ground_truth=False inverts the logic: the field passes when the check does not succeed. This lets you test for both presence and absence using the same primitive.
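This inversion can be sanity-checked with plain Python. The sketch below stands in for the pipeline's comparison; the regex check and the `trace_field_passes` helper are illustrative, not library internals:

```python
import re

# Stand-in for the pipeline comparison: the trace check result
# is compared against bool(ground_truth).
def trace_field_passes(raw_response: str, pattern: str, ground_truth: bool) -> bool:
    found = re.search(pattern, raw_response) is not None
    return found == bool(ground_truth)

resp = "The MURANO trial (NCT02005471) demonstrated superior PFS."
print(trace_field_passes(resp, r"NCT\d{8}", True))   # True: pattern expected and found
print(trace_field_passes(resp, r"NCT\d{8}", False))  # False: pattern should be absent
print(trace_field_passes("No identifiers here.", r"NCT\d{8}", False))  # True: absent as required
```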
TraceRegex¶
Pass if the specified regex pattern is found in the raw response. When count_min is set, pass only if the pattern matches at least that many times.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `pattern` | `str` | required | Regular expression to search for in the raw response |
| `count_min` | `int \| None` | `None` | Minimum number of matches required; any match passes if `None` |
```python
class Answer(BaseAnswer):
    cites_trial: bool = VerifiedField(
        description="Whether the response cites a clinical trial",
        ground_truth=True,
        verify_with=TraceRegex(pattern=r"NCT\d{8}"),
    )

# Trace primitives normally receive _raw_trace from the pipeline.
# Here we set it manually to demonstrate the primitive in isolation.
parsed = Answer(cites_trial=True)
parsed._raw_trace = "The MURANO trial (NCT02005471) demonstrated superior PFS."
print(f"Contains NCT ID: {parsed.verify()}")

parsed_no = Answer(cites_trial=True)
parsed_no._raw_trace = "The trial demonstrated superior PFS."
print(f"No NCT ID: {parsed_no.verify()}")
```
TraceContains¶
Pass if the specified substring appears in the raw response.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `substring` | `str` | required | Text to search for in the raw response |
```python
class Answer(BaseAnswer):
    mentions_limitations: bool = VerifiedField(
        description="Whether the response mentions study limitations",
        ground_truth=True,
        verify_with=TraceContains(substring="limitation"),
    )
```
Testing for absence: set ground_truth=False to verify that a pattern or substring does not appear. For example, to check that a response avoids a specific brand name:
```python
class Answer(BaseAnswer):
    avoids_brand_name: bool = VerifiedField(
        description="Whether the response avoids the brand name",
        ground_truth=False,
        verify_with=TraceContains(substring="Venclexta"),
    )

# TraceContains returns True if "Venclexta" is found.
# Since ground_truth=False, the pipeline compares True == False -> fail.
# If the substring is absent, the pipeline compares False == False -> pass.
parsed = Answer(avoids_brand_name=True)
parsed._raw_trace = "The selective BCL2 inhibitor was approved for CLL."
print(f"Brand name absent: {parsed.verify()}")

parsed_found = Answer(avoids_brand_name=True)
parsed_found._raw_trace = "Venclexta (venetoclax) was approved for CLL."
print(f"Brand name present: {parsed_found.verify()}")
```
TraceLength¶
Pass if the raw response length falls within the specified bounds.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `min` | `int \| None` | `None` | Minimum length; no lower bound if `None` |
| `max` | `int \| None` | `None` | Maximum length; no upper bound if `None` |
| `unit` | `str` | `"chars"` | Unit of measurement: `"chars"` or `"words"` |
```python
class Answer(BaseAnswer):
    is_substantive: bool = VerifiedField(
        description="Whether the response is substantive in length",
        ground_truth=True,
        verify_with=TraceLength(min=100, unit="words"),
    )
```
7. Normalizers¶
Normalizers preprocess string values before comparison. They are used by ExactMatch, ContainsAny, ContainsAll, and OrderedMatch. Normalizers are applied sequentially to both the extracted value and the expected value (or to each substring for ContainsAny/ContainsAll).
Built-in Normalizers¶
| Normalizer | Description |
|---|---|
| `"lowercase"` | Convert to lowercase |
| `"strip"` | Remove leading and trailing whitespace |
| `"remove_punctuation"` | Remove all punctuation characters (`string.punctuation`) |
| `"collapse_whitespace"` | Replace runs of whitespace with a single space, then strip |
| `SynonymMap(mapping={...})` | Map known synonyms to canonical forms via exact key lookup |
The Normalizer type alias is str | SynonymMap. Normalizers are applied in the order they are listed.
SynonymMap¶
SynonymMap performs exact key lookup: if the entire input string matches a key in the mapping, it is replaced with the corresponding value. If no key matches, the string passes through unchanged.
```python
class Answer(BaseAnswer):
    gene: str = VerifiedField(
        description="Gene name",
        ground_truth="BCL2",
        verify_with=ExactMatch(normalize=[
            "lowercase",
            "strip",
            SynonymMap(mapping={"bcl-2": "bcl2", "b-cell lymphoma 2": "bcl2"}),
        ]),
    )

for value in ["BCL2", "Bcl-2", "B-cell lymphoma 2", "KRAS"]:
    parsed = Answer(gene=value)
    print(f"{value!r:>24} -> verify(): {parsed.verify()}")
```
Normalizer ordering matters
SynonymMap uses exact key lookup against the entire input string. If your mapping keys are lowercase (e.g., "bcl-2"), place "lowercase" before the SynonymMap so that "Bcl-2" is first lowercased to "bcl-2" before the synonym lookup runs. The order of normalizers in the list determines the order of application.
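The ordering effect is easy to reproduce with a plain-Python stand-in for the normalizer chain. In the sketch below, a dict substitutes for SynonymMap and `apply_normalizers` is an illustrative helper, not the library's implementation:

```python
# Sequential-normalization sketch; a plain dict stands in for SynonymMap.
def apply_normalizers(value: str, normalizers: list) -> str:
    for n in normalizers:
        if n == "lowercase":
            value = value.lower()
        elif n == "strip":
            value = value.strip()
        elif isinstance(n, dict):
            # SynonymMap stand-in: exact key lookup over the whole string
            value = n.get(value, value)
    return value

synonyms = {"bcl-2": "bcl2"}
print(apply_normalizers(" Bcl-2 ", ["lowercase", "strip", synonyms]))  # bcl2
print(apply_normalizers(" Bcl-2 ", [synonyms, "lowercase", "strip"]))  # bcl-2 (lookup ran too early)
```

In the second call the synonym lookup sees the raw string " Bcl-2 ", finds no exact key match, and passes it through, so the canonical form is never reached.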
8. Writing Custom Primitives¶
When none of the 18 built-in primitives fit your verification need, you can write a custom one. Both base classes are in karenina.schemas.primitives.
Custom Parsed Primitive¶
Subclass VerificationPrimitive and implement check(extracted, expected) -> bool:
```python
from typing import Any

from karenina.schemas.primitives import VerificationPrimitive

class CaseInsensitiveContains(VerificationPrimitive):
    """Pass if ground truth appears as a substring of the extracted value."""

    def check(self, extracted: Any, expected: Any) -> bool:
        return str(expected).lower() in str(extracted).lower()
```
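Because the check is plain Python, it can be sanity-checked without the pipeline. A standalone version with the library base class omitted so the snippet runs on its own:

```python
from typing import Any

# Same comparison logic as the custom primitive above, minus the base class,
# so it can be exercised directly.
class CaseInsensitiveContains:
    def check(self, extracted: Any, expected: Any) -> bool:
        return str(expected).lower() in str(extracted).lower()

prim = CaseInsensitiveContains()
print(prim.check("Approved for CLL and SLL", "cll"))  # True
print(prim.check("Approved for AML", "cll"))          # False
```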
Custom Trace Primitive¶
Subclass TracePrimitive and implement check_trace(raw_trace) -> bool. Expected values are stored as constructor parameters on the primitive, not passed as arguments:
```python
from karenina.schemas.primitives import TracePrimitive

class TraceWordCount(TracePrimitive):
    """Pass if the response word count falls within bounds."""

    min_words: int = 0
    max_words: int = 10000

    def check_trace(self, raw_trace: str) -> bool:
        count = len(raw_trace.split())
        return self.min_words <= count <= self.max_words
```
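The word-count logic can likewise be verified in isolation. A standalone sketch with the library base class replaced by a plain `__init__` so the snippet runs on its own:

```python
# Standalone version of the TraceWordCount example above, with the
# library base class swapped for a plain constructor.
class TraceWordCount:
    def __init__(self, min_words: int = 0, max_words: int = 10000) -> None:
        self.min_words = min_words
        self.max_words = max_words

    def check_trace(self, raw_trace: str) -> bool:
        count = len(raw_trace.split())
        return self.min_words <= count <= self.max_words

prim = TraceWordCount(min_words=3, max_words=10)
print(prim.check_trace("A short but complete answer."))  # True (5 words)
print(prim.check_trace("No."))                           # False (1 word)
```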
Registration¶
Custom primitives can be registered using the @_register_primitive decorator, which enables serialization and deserialization through the primitive registry. This is currently a private API intended as an internal extension point; its interface may change between releases. For most use cases, passing the primitive instance directly to VerifiedField(verify_with=...) works without registration.
9. Next Steps¶
- Answer Templates: how primitives fit into the template lifecycle, field patterns, and `VerifiedField` parameters
- Rubrics: quality evaluation without ground truth (the complement to primitives)
- Verification Pipeline: how the `verify_template` and `embedding_check` stages execute primitives
- Templates vs Rubrics: when to use primitives (correctness) vs rubric traits (quality)
- Evaluation Modes: combining template verification and rubric evaluation