Quick Start: TaskEval¶
Evaluate any text output using Karenina's judge LLM machinery. This guide walks you through logging outputs, attaching evaluation criteria, running evaluation, and inspecting results.
By the end you will have evaluated a pre-recorded LLM response for both correctness (via an answer template) and quality (via rubric traits), with no question definition or answer generation required.
Prerequisites¶
- Python 3.11+
- Karenina installed (see Installation)
- API key for the judge LLM provider:
export ANTHROPIC_API_KEY="sk-ant-..."
Step 1: Create a TaskEval Instance¶
TaskEval is a recording and evaluation container. You log text into it, attach evaluation criteria, then run evaluation. Create one with an optional task ID and metadata for tracking.
from karenina.benchmark.task_eval import TaskEval
task = TaskEval(
    task_id="drug-target-eval",
    metadata={"model": "claude-haiku-4-5", "scenario": "pharmacology"},
)
print(f"Created TaskEval: {task.task_id}")
Created TaskEval: drug-target-eval
Learn more: TaskEval Concepts
Step 2: Log the Output to Evaluate¶
TaskEval evaluates text that you supply. Use log() to record any string as content for evaluation.
Here we log an LLM response about a drug target. This could come from any source: an agent run, a CI pipeline, an external API, or a manual experiment.
task.log(
    "The approved drug target of venetoclax is BCL2 (B-cell lymphoma 2). "
    "Venetoclax is a selective BCL2 inhibitor that works by displacing pro-apoptotic "
    "proteins, triggering programmed cell death in cancer cells [1]."
)
print(f"Logged {len(task.global_logs)} event(s)")
Logged 1 event(s)
Learn more: Logging methods · Structured trace logging
Step 3: Define an Answer Template¶
An answer template is a Pydantic schema that defines what to extract from the logged output and how to verify it. Each field uses VerifiedField to declare what to extract, the correct value, and a verification primitive that checks the result.
from karenina.schemas.entities import BaseAnswer, VerifiedField
from karenina.schemas.primitives import BooleanMatch
class Answer(BaseAnswer):
    identifies_bcl2_as_target: bool = VerifiedField(
        description=(
            "True if the response identifies BCL2 (including Bcl-2, BCL-2, or "
            "B-cell lymphoma 2) as the direct pharmacological target of venetoclax. "
            "False if BCL2 is mentioned only as a pathway member or a different "
            "protein is identified as the primary target."
        ),
        ground_truth=True,
        verify_with=BooleanMatch(),
    )

    mentions_mechanism: bool = VerifiedField(
        description=(
            "True if the response explains the mechanism of action (e.g., inhibiting "
            "BCL2 to trigger apoptosis). False if only the target is named without "
            "any mechanistic explanation."
        ),
        ground_truth=True,
        verify_with=BooleanMatch(),
    )
task.add_template(Answer)
print("Attached answer template with 2 verification fields")
Attached answer template with 2 verification fields
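Conceptually, `BooleanMatch` passes a field when the value the judge extracts from the logged text equals the field's `ground_truth`. The sketch below is a pure-Python illustration of that idea, not Karenina's actual implementation:

```python
def boolean_match(extracted: bool, ground_truth: bool) -> bool:
    """Sketch of a boolean verification primitive: pass iff the values agree."""
    return extracted is ground_truth

# Judge extracts True for identifies_bcl2_as_target; ground_truth is True -> pass.
print(boolean_match(True, True))   # True
# If the judge had extracted False for mentions_mechanism -> fail.
print(boolean_match(False, True))  # False
```

The descriptions in the template matter: they are the instructions the judge LLM follows when deciding what value to extract, so spelling out edge cases (synonyms like "Bcl-2", what counts as a mechanistic explanation) makes verification more reliable.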
Learn more: Answer Templates · Template Authoring skill
Step 4: Add Rubric Traits¶
While templates verify correctness, rubrics assess quality. Each trait evaluates one dimension of the response independently. Here we add an LLM-judged score trait and a regex pattern trait.
from karenina.schemas.entities.rubric import LLMRubricTrait, RegexRubricTrait, Rubric
rubric = Rubric(
    llm_traits=[
        LLMRubricTrait(
            name="Conciseness",
            description=(
                "Rate how concise the response is on a scale of 1-5, where "
                "1 is very verbose and 5 is extremely concise."
            ),
            kind="score",
        ),
    ],
    regex_traits=[
        RegexRubricTrait(
            name="Has Citations",
            description="The response includes numbered citations in bracket notation (e.g., [1]).",
            pattern=r"\[\d+\]",
            case_sensitive=False,
        ),
    ],
)
task.add_rubric(rubric)
print(f"Added rubric with {len(rubric.llm_traits)} LLM trait(s) and {len(rubric.regex_traits)} regex trait(s)")
Added rubric with 1 LLM trait(s) and 1 regex trait(s)
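Regex traits need no LLM call; they are plain pattern matches against the logged text. You can sanity-check the `Has Citations` pattern with the standard `re` module before attaching it (here `re.IGNORECASE` mirrors `case_sensitive=False`, though it makes no difference for a digits-and-brackets pattern):

```python
import re

# Same pattern as the Has Citations trait above.
citation = re.compile(r"\[\d+\]", re.IGNORECASE)

with_ref = "Venetoclax is a selective BCL2 inhibitor [1]."
without_ref = "Venetoclax is a selective BCL2 inhibitor."

print(bool(citation.search(with_ref)))     # True
print(bool(citation.search(without_ref)))  # False
```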
Learn more: Rubrics · All trait types: LLM, regex, callable, metric, literal
Step 5: Run Evaluation¶
Configure the judge LLM and run evaluation. TaskEval uses parsing_only=True since no answering model is needed.
from karenina.schemas.config.models import ModelConfig
from karenina.schemas.verification.config import VerificationConfig
config = VerificationConfig(
    parsing_models=[
        ModelConfig(
            id="claude-haiku-4-5",
            model_name="claude-haiku-4-5",
            model_provider="anthropic",
            interface="langchain",
            temperature=0.0,
        )
    ],
    parsing_only=True,
)
result = task.evaluate(config)
print(f"Evaluation complete: {result.summary()}")
Evaluation complete: 1/1 template verifications passed | 2/2 rubric traits passed
Learn more: Evaluation modes · Model Config
Step 6: Inspect Results¶
evaluate() returns a TaskEvalResult. The quickest way to see what happened is display(), which prints a formatted report with template pass/fail status and rubric scores:
print(result.display())
════════════════════════════════════════════════════════════════════════════════
TASK EVALUATION RESULTS
════════════════════════════════════════════════════════════════════════════════
Task ID: drug-target-eval
Model: claude-haiku-4-5
Scenario: pharmacology
Timestamp: 2026-03-05 11:47:10
────────────────────────────────────────────────────────────
GLOBAL EVALUATION
────────────────────────────────────────────────────────────
Verification Results:
Question: answer
Status: ✓ PASSED
Output: "--- AI Message ---
The approved drug target of venetoclax is BCL2 (B-cell lymphoma 2). Venetoclax is a selective BCL2 inhibitor that works by displacing pro-apoptotic proteins, triggering programmed cell death in cancer cells [1]."
Rubric: Conciseness=4, Has Citations=✓
════════════════════════════════════════════════════════════════════════════════
SUMMARY: 1/1 template verifications passed | 2/2 rubric traits passed
════════════════════════════════════════════════════════════════════════════════
For a one-line overview, use summary():
print(result.summary())
1/1 template verifications passed | 2/2 rubric traits passed
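If a script (say, a CI gate) needs the pass counts rather than the full report, the one-line summary can be parsed directly. This is a minimal sketch that assumes the exact `N/M ... | N/M ...` format shown above:

```python
summary = "1/1 template verifications passed | 2/2 rubric traits passed"

# Split into the template part and the rubric part, then pull out the N/M counts.
template_part, rubric_part = summary.split(" | ")
tmpl_passed, tmpl_total = (int(n) for n in template_part.split()[0].split("/"))
rubric_passed, rubric_total = (int(n) for n in rubric_part.split()[0].split("/"))

print(tmpl_passed, tmpl_total, rubric_passed, rubric_total)  # 1 1 2 2
```

For anything beyond a quick gate, prefer the structured accessors on `TaskEvalResult` over string parsing.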
You can also export the full results as JSON or Markdown for downstream analysis or archiving:
print(result.export_json()[:300])
{
"task_id": "drug-target-eval",
"metadata": {
"model": "claude-haiku-4-5",
"scenario": "pharmacology"
},
"timestamp": "2026-03-05T11:47:10.909758",
"summary": "1/1 template verifications passed | 2/2 rubric traits passed",
"global_evaluation": {
"verification_results": {
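For archiving, the exported JSON string can be written straight to disk and re-loaded later. A minimal sketch, using a stub payload in place of `result.export_json()` so it runs standalone:

```python
import json
import tempfile
from pathlib import Path

# Stub standing in for result.export_json(); the real export has the shape shown above.
payload = json.dumps({
    "task_id": "drug-target-eval",
    "summary": "1/1 template verifications passed | 2/2 rubric traits passed",
})

# Archive the export and re-load it for downstream analysis.
path = Path(tempfile.gettempdir()) / "drug-target-eval.json"
path.write_text(payload)

data = json.loads(path.read_text())
print(data["task_id"])  # drug-target-eval
```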
For programmatic access to individual fields (template verdicts, rubric scores, metadata), see the results inspection reference.
Learn more: Results inspection · DataFrame analysis
Bonus: Multi-Step Evaluation¶
TaskEval supports step-scoped evaluation for multi-phase agent workflows. Each step gets its own logs, templates, and rubric traits, evaluated independently.
multi_task = TaskEval(task_id="multi-step-agent")

# Log per-step outputs
multi_task.log(
    "Found 3 relevant papers on venetoclax mechanism of action.",
    step_id="retrieval",
)
multi_task.log(
    "BCL2 is the primary target of venetoclax. It selectively inhibits BCL2, "
    "triggering apoptosis in CLL cells.",
    step_id="synthesis",
)

# Step-specific rubrics
multi_task.add_rubric(
    Rubric(llm_traits=[
        LLMRubricTrait(
            name="retrieval_quality",
            kind="boolean",
            description="True if relevant sources were found for the query.",
        )
    ]),
    step_id="retrieval",
)
multi_task.add_rubric(
    Rubric(llm_traits=[
        LLMRubricTrait(
            name="synthesis_accuracy",
            kind="boolean",
            description="True if the synthesis accurately reflects the retrieved information.",
        )
    ]),
    step_id="synthesis",
)

# Evaluate each step individually
for step_id in ["retrieval", "synthesis"]:
    step_result = multi_task.evaluate(config, step_id=step_id)
    step_eval = step_result.per_step[step_id]
    stats = step_eval.get_summary_stats()
    print(f"Step '{step_id}': {stats['rubric_traits_passed']}/{stats['rubric_traits_total']} traits passed")
Step 'retrieval': 1/1 traits passed
Step 'synthesis': 1/1 traits passed
Learn more: Multi-step evaluation · Step scoping
Next Steps¶
- TaskEval Concepts: Merge strategies, object structure, pipeline integration
- Answer Templates: Field types, descriptions, and writing verify() methods
- Rubrics: All five trait types (LLM, regex, callable, metric, literal)
- Benchmark Quick Start: If you need Karenina to generate responses and evaluate them