Scaled Authoring¶
When building large benchmarks with dozens or hundreds of questions, manual question entry and template writing don't scale. This scenario demonstrates the power-user workflow: bulk question ingestion, automated template generation, programmatic template construction, ADeLe question classification, and few-shot example injection.
What you'll learn:
- Bulk question ingestion from files
- Automated template generation with generate_all_templates()
- Programmatic templates with AnswerBuilder
- ADeLe question classification
- Few-shot examples (per-question and via FewShotConfig)
- Progress callbacks
Create the Benchmark¶
from karenina import Benchmark
benchmark = Benchmark.create(
    name="Pharmacology Knowledge Base",
    description="Large-scale evaluation of LLM pharmacology knowledge",
    version="1.0.0",
)
print(f"Created: {benchmark.name}")
Bulk Question Ingestion¶
For large benchmarks, questions typically come from spreadsheets or CSV files rather than manual entry. The extract_questions_from_file() function reads tabular data and returns (Question, metadata) pairs ready to add to a benchmark.
The function supports Excel, CSV, and TSV formats:
| Format | Extension | Notes |
|---|---|---|
| Excel | .xlsx, .xls | Supports multiple sheets |
| CSV | .csv | Comma-separated |
| TSV | .tsv | Tab-separated |
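To make the ingestion step concrete, here is a minimal stdlib sketch of the same idea. The helper name extract_question_rows and its column handling are illustrative only, not karenina's implementation — the real function and its signature are shown below.

```python
import csv
import io

def extract_question_rows(fh, question_column, answer_column,
                          keyword_column=None, separator=","):
    """Yield (question_text, metadata) pairs from tabular data, roughly
    mirroring the shape extract_questions_from_file() returns."""
    for row in csv.DictReader(fh):
        metadata = {"raw_answer": row[answer_column]}
        if keyword_column and row.get(keyword_column):
            # Split a delimited keyword cell into a clean list
            metadata["keywords"] = [k.strip() for k in row[keyword_column].split(separator)]
        yield row[question_column], metadata

# A two-row CSV standing in for questions.xlsx
sample = io.StringIO(
    "Question,Expected Answer,Tags\n"
    'What class of drug is omeprazole?,Proton pump inhibitor (PPI),"pharmacology, GI"\n'
)
pairs = list(extract_question_rows(sample, "Question", "Expected Answer", "Tags"))
```

Each yielded pair can then be unpacked straight into benchmark.add_question, as the commented workflow below shows.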
# In a real workflow, you'd point to an actual file:
#
# questions = extract_questions_from_file(
#     file_path="questions.xlsx",
#     question_column="Question",
#     answer_column="Expected Answer",
#     author_name_column="Author",
#     keywords_columns=[
#         {"column": "Tags", "separator": ","},
#     ],
# )
#
# for question, metadata in questions:
#     benchmark.add_question(question, **metadata)
For this tutorial, we add questions manually — simulating what extract_questions_from_file() would produce:
questions = [
    ("What is the mechanism of action of metformin?", "Metformin activates AMP-activated protein kinase (AMPK) and reduces hepatic glucose production."),
    ("What is the half-life of amoxicillin?", "Approximately 1 hour."),
    ("Name three common side effects of statins.", "Muscle pain, liver enzyme elevation, and digestive problems."),
    ("What is the antidote for acetaminophen overdose?", "N-acetylcysteine (NAC)."),
    ("How does warfarin prevent blood clots?", "Warfarin inhibits vitamin K epoxide reductase, blocking synthesis of clotting factors II, VII, IX, and X."),
    ("What is the therapeutic index of lithium?", "Narrow — therapeutic range is 0.6-1.2 mEq/L, with toxicity above 1.5 mEq/L."),
    ("What class of drug is omeprazole?", "Proton pump inhibitor (PPI)."),
    ("What is the primary indication for naloxone?", "Reversal of opioid overdose."),
]

for q_text, answer in questions:
    benchmark.add_question(question=q_text, raw_answer=answer)
print(f"Added {benchmark.question_count} questions")
print(f"With templates: {len(benchmark.get_finished_templates())}")
Automated Template Generation¶
With many questions added, generate_all_templates() uses an LLM to produce Answer classes for every question that lacks a template. The only_missing=True default skips questions that already have templates, making this safe for incremental workflows — add new questions, then run generation to fill in only the new ones.
# Generate templates for all questions that don't have one
results = benchmark.generate_all_templates(
    model="claude-haiku-4-5",
    model_provider="anthropic",
    only_missing=True,
    progress_callback=lambda pct, msg: print(f"  {pct:.0f}%: {msg}"),
)
# Summarize results
generated = sum(1 for r in results.values() if r["success"] and not r.get("skipped"))
skipped = sum(1 for r in results.values() if r.get("skipped"))
failed = sum(1 for r in results.values() if not r["success"])
print(f"\nGenerated: {generated}")
print(f"Skipped: {skipped}")
print(f"Failed: {failed}")
print(f"Progress: {benchmark.get_progress()}%")
Review a generated template to check the output:
first_id = benchmark.get_question_ids()[0]
template = benchmark.get_template(first_id)
print(template)
Programmatic Templates with AnswerBuilder¶
For questions where you want precise control without writing template class code by hand, AnswerBuilder provides a fluent interface. Each attribute you add is mapped to a VerifiedField with a matching verification primitive (BooleanMatch for booleans, ExactMatch for strings, NumericTolerance for floats, and so on), and compile() turns the accumulated fields into an executable Answer class.
from karenina.benchmark.authoring.answers.builder import AnswerBuilder
builder = (
    AnswerBuilder()
    .add_attribute("mentions_ampk", "bool", "Whether AMPK activation is mentioned", True)
    .add_attribute("mentions_hepatic", "bool", "Whether hepatic glucose reduction is mentioned", True)
    .add_regex("has_mechanism_keyword", r"\b(activates|inhibits|blocks)\b", expected=True, match_type="contains")
)
Answer = builder.compile()
# Replace the auto-generated template for the metformin question
metformin_id = benchmark.get_question_ids()[0] # First question
benchmark.update_template(metformin_id, Answer)
print(f"Updated template for: {metformin_id[:50]}...")
The compiled class uses VerifiedField internally. For example, add_attribute("mentions_ampk", "bool", ..., True) produces a field equivalent to:
mentions_ampk: bool = VerifiedField(
    description="Whether AMPK activation is mentioned",
    ground_truth=True,
    verify_with=BooleanMatch(),
)
BaseAnswer auto-generates ground_truth(), verify(), and verify_granular() from the VerifiedField metadata, so the compiled class needs no hand-written verification methods.
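To illustrate the idea (this is a conceptual sketch, not karenina's actual source), verification methods can be derived entirely from per-field metadata: each field carries a ground truth and a comparison primitive, and a shared base class walks those fields.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class VerifiedFieldSpec:
    """Ground truth plus a comparison primitive (default: equality)."""
    ground_truth: Any
    check: Callable[[Any, Any], bool] = lambda got, want: got == want

class AutoVerify:
    """Derives verify()/verify_granular() from field metadata alone."""
    _fields: dict = {}

    def __init__(self, **values):
        self._values = values

    def verify_granular(self):
        # One pass/fail result per declared field
        return {name: spec.check(self._values.get(name), spec.ground_truth)
                for name, spec in self._fields.items()}

    def verify(self):
        return all(self.verify_granular().values())

class Answer(AutoVerify):
    _fields = {
        "mentions_ampk": VerifiedFieldSpec(True),
        "mentions_hepatic": VerifiedFieldSpec(True),
    }

partial = Answer(mentions_ampk=True, mentions_hepatic=False)
granular = partial.verify_granular()
```

The compiled class therefore needs no hand-written verification logic; declaring fields is the whole job.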
When to use each approach:
- AnswerBuilder: Quick boolean/regex templates built from data, no string manipulation needed.
- Class definitions or code strings: Complex verification logic, custom normalization, multi-step checks that need full Python expressiveness.
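As an example of the second case — logic a builder primitive cannot express — here is a hedged sketch of a unit-insensitive numeric check, the kind of helper you might write inside a hand-authored template class (the function name and tolerance are hypothetical, not part of karenina):

```python
import re

def half_life_matches(answer_text: str, expected_hours: float, tol: float = 0.25) -> bool:
    """Parse 'about 1 hour' / '60 minutes' style answers and compare in hours."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(hour|hr|minute|min)", answer_text, re.I)
    if not m:
        return False
    value = float(m.group(1))
    if m.group(2).lower().startswith("min"):
        value /= 60  # normalize minutes to hours before comparing
    return abs(value - expected_hours) <= tol
```

Multi-step parse-normalize-compare checks like this are where full Python expressiveness earns its keep over AnswerBuilder primitives.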
ADeLe Question Classification¶
ADeLe (Annotated Demand Levels) classifies questions across 18 cognitive complexity dimensions, producing scores from 0 (none) to 5 (very high) on each dimension. This helps you understand what your benchmark measures and filter questions by difficulty.
Classification requires LLM access. In a real workflow with API keys configured:
# In a real workflow:
#
# classifier = QuestionClassifier(
#     model_name="claude-haiku-4-5",
#     provider="anthropic",
# )
#
# question_pairs = [
#     (qid, benchmark.get_question(qid)["question"])
#     for qid in benchmark.get_question_ids()
# ]
#
# results = classifier.classify_batch(
#     questions=question_pairs,
#     on_progress=lambda done, total: print(f"{done}/{total}"),
# )
#
# for q_id, result in results.items():
#     print(f"{q_id[:30]}... volume={result.scores['volume']}, "
#           f"reasoning={result.scores['logical_reasoning_logic']}")
ADeLe provides:
- 18 cognitive complexity dimensions — volume, reasoning depth, domain specificity, and more
- Ordinal scores from 0 (none) to 5 (very high) on each dimension
- Filtering support — select question subsets by complexity for targeted evaluation
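Once scores are in hand, filtering is plain dictionary work. The output shape below is assumed from the classify_batch sketch above, and the score values are made up for illustration:

```python
# Hypothetical per-question ADeLe scores keyed by question ID
scores_by_question = {
    "q1": {"volume": 1, "logical_reasoning_logic": 4},
    "q2": {"volume": 3, "logical_reasoning_logic": 1},
    "q3": {"volume": 2, "logical_reasoning_logic": 5},
}

# Select the high-reasoning subset for a targeted evaluation run
hard_subset = [
    qid for qid, scores in scores_by_question.items()
    if scores["logical_reasoning_logic"] >= 4
]
```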
See ADeLe Concepts for the full dimension list and scoring reference.
Adding Few-Shot Examples¶
Few-shot examples guide the answering model toward the expected response format by prepending question-answer pairs to the prompt during verification.
Per-Question Examples¶
Attach examples directly to individual questions when adding them:
# Add few-shot examples to specific questions
benchmark.add_question(
    question="What is the mechanism of action of clopidogrel?",
    raw_answer="Clopidogrel irreversibly inhibits the P2Y12 ADP receptor on platelets.",
    few_shot_examples=[
        {"question": "How does warfarin prevent blood clots?", "answer": "Inhibits vitamin K epoxide reductase"},
        {"question": "How does heparin prevent blood clots?", "answer": "Activates antithrombin III"},
    ],
)
print("Added question with 2 few-shot examples")
FewShotConfig for Verification¶
FewShotConfig controls which examples are used during verification. Global external examples are appended to every question. The global_k parameter limits how many per-question examples are included.
from karenina.schemas import FewShotConfig
few_shot_config = FewShotConfig(
    global_mode="k-shot",
    global_k=2,
    global_external_examples=[
        {"question": "What class of drug is aspirin?", "answer": "NSAID"},
    ],
)
print(f"Mode: {few_shot_config.global_mode}")
print(f"K: {few_shot_config.global_k}")
print(f"Global external examples: {len(few_shot_config.global_external_examples)}")
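Conceptually, k-shot assembly prepends up to global_k per-question examples plus every global external example ahead of the target question. The prompt format below is an assumption for illustration, not karenina's actual template:

```python
def build_prompt(question, per_question_examples, global_examples, k):
    """Prepend up to k per-question examples plus all global external
    examples before the target question."""
    shots = per_question_examples[:k] + global_examples
    blocks = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in shots]
    blocks.append(f"Q: {question}\nA:")  # the question under verification
    return "\n\n".join(blocks)

prompt = build_prompt(
    "What class of drug is omeprazole?",
    per_question_examples=[
        {"question": "How does warfarin prevent blood clots?", "answer": "Inhibits vitamin K epoxide reductase"},
        {"question": "How does heparin prevent blood clots?", "answer": "Activates antithrombin III"},
        {"question": "What is the antidote for acetaminophen overdose?", "answer": "N-acetylcysteine"},
    ],
    global_examples=[{"question": "What class of drug is aspirin?", "answer": "NSAID"}],
    k=2,
)
```

With k=2, only the first two per-question examples survive, while the single global external example is always included — four Q/A slots in total.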
Save the Benchmark¶
import tempfile
from pathlib import Path

tmpdir = tempfile.mkdtemp()
checkpoint_path = Path(tmpdir) / "pharmacology_knowledge_base.jsonld"
benchmark.save(checkpoint_path)
loaded = Benchmark.load(checkpoint_path)
print(f"Questions: {loaded.question_count}")
print(f"Templates: {len(loaded.get_finished_templates())}")
print(f"Progress: {loaded.get_progress()}%")
Cleanup¶
import shutil
shutil.rmtree(tmpdir, ignore_errors=True)
Workflow Summary¶
Extract from file (or add manually)
│
▼
generate_all_templates(only_missing=True)
│
▼
Review → Replace with AnswerBuilder or custom code where needed
│
▼
Classify with ADeLe [optional]
│
▼
Add few-shot examples [optional]
│
▼
Save checkpoint
Next Steps¶
- Factual QA Benchmark — Detailed template patterns (boolean, numeric, regex)
- Full Evaluation Benchmark — Add rubric traits alongside templates
- Quality Assessment — Rubric-only evaluation
- Answer Templates — Template concepts
- ADeLe Concepts — Full ADeLe dimension reference
- Few-Shot Configuration — Advanced few-shot options