AI-Assisted Template Authoring¶
When building benchmarks with many questions, hand-writing every answer template gives full control but scales poorly. AI-assisted generation speeds up the process: generate a draft from the question and its reference answer, review the output, and refine where needed. For cases where you want programmatic control without writing raw class code, AnswerBuilder provides a fluent interface for constructing templates from attribute and regex specifications.
This tutorial covers both approaches and the practical workflow pattern that combines them: bulk generate, review each template, then replace problematic ones with hand-written or builder-constructed versions.
What you'll learn:

- Generate a template from a question using `generate_answer_template()`
- Understand the two-phase generation process (ground truth extraction, then field descriptions)
- Review and edit generated template code
- Build templates programmatically with `AnswerBuilder`
- Combine `AnswerBuilder` attributes with regex patterns
- Use `generate_all_templates()` for batch generation
- Workflow pattern: generate, review, refine
When to Use Each Tool¶
Three approaches exist for creating answer templates. Choose based on how much control you need and how many templates you are writing.
| Tool | Best For | Speed | Control |
|---|---|---|---|
| `generate_answer_template()` | Quick drafts, large benchmarks | Fast | Review needed |
| `AnswerBuilder` | Programmatic construction, uniform patterns | Medium | High |
| Hand-written templates | Complex `verify()` logic, custom validation | Slow | Full |
In practice, most large benchmarks use a combination: generate drafts for the majority, replace a few with AnswerBuilder or hand-written versions where the generated output needs adjustment.
Generate a Template¶
generate_answer_template() takes a Question object and returns Python code for an Answer(BaseAnswer) class. The generation uses a two-phase approach: first, an LLM extracts the ground truth fields from the question and reference answer; second, it generates field descriptions that instruct the judge on what to extract from responses.
# Create a Question object
question = Question(
question="What is the putative target of venetoclax?",
raw_answer="BCL2 (B-cell lymphoma 2), an anti-apoptotic protein",
)
# Generate the template (mocked for this tutorial)
with patch(
"karenina.benchmark.authoring.answers.generator.generate_answer_template",
_mock_generate_answer_template,
):
code_string = _mock_generate_answer_template(question, model="claude-haiku-4-5", model_provider="anthropic")
print("Generated template code:")
print(code_string)
Generated template code:
from karenina.schemas.entities import BaseAnswer
from pydantic import Field

class Answer(BaseAnswer):
    identifies_target: bool = Field(
        description="True if the response identifies BCL2 as the primary pharmacological target of venetoclax."
    )
    mentions_mechanism: bool = Field(
        description="True if the response describes venetoclax as a selective BCL2 inhibitor that mimics BH3-only proteins."
    )

    def ground_truth(self):
        self.correct = {"identifies_target": True, "mentions_mechanism": True}

    def verify(self) -> bool:
        for field, expected in self.correct.items():
            if getattr(self, field) != expected:
                return False
        return True
The two phases happen automatically inside generate_answer_template(). Phase 1 (ground truth extraction) determines what fields the template needs and what their correct values are. Phase 2 (field description generation) writes the instructional text that the judge LLM follows when parsing responses. You never call these phases separately; the function returns the final Python code string ready to use.
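The `ground_truth()`/`verify()` pattern in the generated code can be exercised in isolation. Below is a minimal sketch using a stand-in dataclass (karenina's `BaseAnswer` and `Field` are omitted so the snippet is self-contained):

```python
from dataclasses import dataclass

@dataclass
class Answer:
    # Stand-in mirroring the generated template's fields
    identifies_target: bool
    mentions_mechanism: bool

    def ground_truth(self):
        # Same shape as the generated method: one expected value per field
        self.correct = {"identifies_target": True, "mentions_mechanism": True}

    def verify(self) -> bool:
        # Every field must equal its expected value
        return all(getattr(self, f) == v for f, v in self.correct.items())

ans = Answer(identifies_target=True, mentions_mechanism=False)
ans.ground_truth()
print(ans.verify())  # False: mentions_mechanism does not match
```

This all-fields-must-match loop is the default verification strategy; partial-credit schemes require custom `verify()` logic instead.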
Review Generated Code¶
Before attaching a generated template to your benchmark, review it for correctness. Check these elements:

- Field types: Are `bool`, `str`, `float`, and `Literal` types appropriate for each field?
- Descriptions: Do they give the judge enough context to parse correctly?
- `ground_truth()` values: Does `self.correct` match the reference answer?
- `verify()` logic: Does it compare every field against its expected value?
# The generated code above defines:
#   identifies_target: bool  -> expected True (set in ground_truth())
#   mentions_mechanism: bool -> expected True (set in ground_truth())
# verify() iterates over self.correct and checks each field.
# Inspect the structure
lines = code_string.strip().split("\n")
print(f"Total lines: {len(lines)}")
print("Fields defined: identifies_target, mentions_mechanism")
print("Ground truth: identifies_target=True, mentions_mechanism=True")
print("Verify logic: iterates over self.correct, checks each field")
Total lines: 19
Fields defined: identifies_target, mentions_mechanism
Ground truth: identifies_target=True, mentions_mechanism=True
Verify logic: iterates over self.correct, checks each field
For boolean fields, the most common issue is descriptions that are too vague. A description like "True if the response mentions BCL2" may return True for responses that mention BCL2 in a different context. The generated code above is more precise: it specifies "as the primary pharmacological target of venetoclax."
Add to Benchmark¶
Once you are satisfied with the generated code, attach it to a benchmark question using add_answer_template(), which accepts a code string directly.
benchmark = Benchmark.create(
name="Drug Target Identification",
description="Evaluate LLM accuracy on identifying drug targets",
version="1.0.0",
)
q1_id = benchmark.add_question(
question="What is the putative target of venetoclax?",
raw_answer="BCL2 (B-cell lymphoma 2), an anti-apoptotic protein",
)
benchmark.add_answer_template(q1_id, code_string)
print(f"Template added: {benchmark.has_template(q1_id)}")
print(f"Progress: {benchmark.get_progress()}%")
Template added: True
Progress: 100.0%
Edit Generated Templates¶
Generated templates are code strings, so you can modify them with standard string operations before attaching. This is useful when a field description needs refinement, a ground truth value needs correction, or you want to swap a primitive for a more appropriate one.
# Refine the identifies_target description to be more precise,
# covering common case variants of the target name
edited_code = code_string.replace(
    "True if the response identifies BCL2 as the primary pharmacological "
    "target of venetoclax.",
    "True if the response identifies BCL2 (including Bcl-2, BCL-2, or "
    "B-cell lymphoma 2) as the direct pharmacological target of venetoclax. "
    "False if BCL2 is mentioned only as a pathway member.",
)
# Re-attach the edited version
benchmark.add_answer_template(q1_id, edited_code)
print(f"Template updated for: {q1_id[:50]}...")
print(f"Has template: {benchmark.has_template(q1_id)}")
Template updated for: urn:uuid:question-what-is-the-putative-target-of-v...
Has template: True
For minor edits (tweaking a description, adjusting a ground truth value, swapping a primitive), string replacement works well. For larger structural changes (adding fields, introducing custom composition logic), consider writing the template by hand or using AnswerBuilder instead.
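One pitfall with `str.replace()` is that a mismatched target string silently changes nothing, and you would attach the original template believing it was edited. A small guard is worth adding (illustrative snippet; the strings here are shortened placeholders, not the full template):

```python
code = 'description="True if the response identifies BCL2 as the primary target."'
target = "the primary target"
replacement = "the direct pharmacological target"

edited = code.replace(target, replacement)
# replace() returns the input unchanged when the target is absent;
# fail loudly instead of silently attaching an unedited template.
if edited == code:
    raise ValueError(f"edit target not found: {target!r}")
print(edited)
```

The same guard applies before `benchmark.add_answer_template(q1_id, edited_code)`: asserting `edited_code != code_string` catches typos in the search string immediately.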
Build with AnswerBuilder¶
AnswerBuilder provides a fluent interface for constructing templates without writing class code. Chain add_attribute() calls to define fields, then call compile() to produce an executable Answer class.
builder = (
AnswerBuilder()
.add_attribute(
"identifies_target", "bool",
"True if the response identifies BCL2 as the primary target of venetoclax.",
True,
)
.add_attribute(
"mentions_mechanism", "bool",
"True if the response describes venetoclax as a BH3 mimetic.",
True,
)
.add_attribute(
"specifies_indication", "bool",
"True if the response mentions CLL or AML as an approved indication.",
True,
)
)
Answer = builder.compile()
print(f"Compiled class: {Answer.__name__}")
print(f"Builder state:\n{builder}")
Compiled class: Answer
Builder state:
AnswerBuilder:
Classical Attributes (3):
- identifies_target: bool = True
True if the response identifies BCL2 as the primary target of venetoclax.
- mentions_mechanism: bool = True
True if the response describes venetoclax as a BH3 mimetic.
- specifies_indication: bool = True
True if the response mentions CLL or AML as an approved indication.
Regex Patterns (0):
(none)
add_attribute() accepts four arguments: the field name, the type as a string ("bool", "int", "float", "Literal['a', 'b']"), a description for the judge, and the ground truth value. The builder validates names and types as each attribute is added, catching errors before compile().
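As an illustration of the kind of check this implies, here is a hypothetical validator (not karenina's actual implementation): a field name must be a legal Python identifier, and the type string must be one the builder understands.

```python
import keyword
import re

ALLOWED_TYPES = {"bool", "int", "float", "str"}
LITERAL_RE = re.compile(r"^Literal\[.+\]$")

def validate_attribute(name: str, type_str: str) -> None:
    # Field names become class attributes, so they must be valid identifiers
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"invalid field name: {name!r}")
    # Accept the simple scalar types plus Literal[...] strings
    if type_str not in ALLOWED_TYPES and not LITERAL_RE.match(type_str):
        raise ValueError(f"unsupported type: {type_str!r}")

validate_attribute("identifies_target", "bool")   # passes silently
validate_attribute("grade", "Literal['a', 'b']")  # passes silently
try:
    validate_attribute("class", "bool")           # 'class' is a keyword
except ValueError as exc:
    print(exc)  # -> invalid field name: 'class'
```

Failing at `add_attribute()` time rather than at `compile()` time keeps the error message close to the offending call.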
Add Regex Patterns¶
AnswerBuilder also supports regex patterns via add_regex(). Regex checks run against the raw LLM response trace, independent of parsed fields. You can combine attributes and regex patterns in a single builder.
builder_with_regex = (
AnswerBuilder()
.add_attribute(
"identifies_drug_class", "bool",
"True if the response identifies venetoclax as a BH3 mimetic or BCL2 inhibitor.",
True,
)
.add_regex(
"mentions_bcl2",
r"(?i)\bBCL[-\s]?2\b",
expected="BCL2",
match_type="contains",
description="Raw response contains a mention of BCL2.",
)
.add_regex(
"has_citation",
r"\[\d+\]",
expected=1,
match_type="count",
description="Response includes at least one bracketed citation.",
)
)
Answer = builder_with_regex.compile()
print(f"Compiled class: {Answer.__name__}")
print(f"Builder state:\n{builder_with_regex}")
Compiled class: Answer
Builder state:
AnswerBuilder:
Classical Attributes (1):
- identifies_drug_class: bool = True
True if the response identifies venetoclax as a BH3 mimetic or BCL2 inhibitor.
Regex Patterns (2):
- mentions_bcl2: (?i)\bBCL[-\s]?2\b (contains, expected='BCL2')
Raw response contains a mention of BCL2.
- has_citation: \[\d+\] (count, expected=1)
Response includes at least one bracketed citation.
The match_type parameter controls how regex results are evaluated:

| `match_type` | `expected` type | Behavior |
|---|---|---|
| `"exact"` | `str` | Exactly one match, equal to `expected` |
| `"contains"` | `str` | `expected` appears among matches |
| `"count"` | `int` | Number of matches equals `expected` |
| `"all"` | `list` | All items in `expected` found in matches |
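The four match types can be sketched with the standard `re` module (a stand-in for illustration, not karenina's evaluator):

```python
import re

def check_regex(pattern: str, text: str, expected, match_type: str) -> bool:
    matches = re.findall(pattern, text)
    if match_type == "exact":
        return len(matches) == 1 and matches[0] == expected
    if match_type == "contains":
        return expected in matches
    if match_type == "count":
        return len(matches) == expected
    if match_type == "all":
        return all(item in matches for item in expected)
    raise ValueError(f"unknown match_type: {match_type!r}")

text = "Venetoclax targets BCL2 [1], unlike BCL-2 family member MCL1 [2]."
print(check_regex(r"(?i)\bBCL[-\s]?2\b", text, "BCL2", "contains"))  # True
print(check_regex(r"\[\d+\]", text, 2, "count"))                     # True
print(check_regex(r"\[\d+\]", text, "[1]", "exact"))                 # False: two matches
```

Note that `"exact"` fails here because the citation pattern matches twice; `"contains"` and `"count"` are the forgiving options when a pattern can legitimately occur more than once.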
Batch Generation¶
For benchmarks with many questions, generate_all_templates() generates templates for every question that lacks one. The only_missing=True default makes this safe for incremental workflows: add new questions, then run generation to fill in only the gaps.
# Add several questions without templates
q2_id = benchmark.add_question(
question="What is the mechanism of action of metformin?",
raw_answer="Activates AMP-activated protein kinase (AMPK) and reduces hepatic glucose production.",
)
q3_id = benchmark.add_question(
question="What is the half-life of amoxicillin?",
raw_answer="Approximately 1 hour.",
)
q4_id = benchmark.add_question(
question="What is the antidote for acetaminophen overdose?",
raw_answer="N-acetylcysteine (NAC).",
)
print(f"Questions: {benchmark.question_count}")
print(f"With templates: {len(benchmark.get_finished_templates())}")
Questions: 4
With templates: 1
# Generate templates for all questions missing one
with patch.object(type(benchmark), "generate_all_templates", _mock_generate_all_templates):
results = benchmark.generate_all_templates(
model="claude-haiku-4-5",
model_provider="anthropic",
only_missing=True,
progress_callback=lambda pct, msg: print(f" {pct:.0f}%: {msg}"),
)
generated = sum(1 for r in results.values() if r["success"] and not r.get("skipped"))
skipped = sum(1 for r in results.values() if r.get("skipped"))
print(f"\nGenerated: {generated}")
print(f"Skipped (already had template): {skipped}")
print(f"Progress: {benchmark.get_progress()}%")
  25%: Generated 1/4
  50%: Generated 2/4
  75%: Generated 3/4
  100%: Generated 4/4

Generated: 3
Skipped (already had template): 1
Progress: 100.0%
The progress_callback receives a percentage (0 to 100) and a status message, useful for displaying progress in scripts or the GUI. For production use, pass a real model and provider; the function calls generate_answer_template() internally for each question.
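A `progress_callback` is just a callable taking a percentage and a message. For example, collecting updates for later logging (a generic sketch, independent of karenina):

```python
updates: list[tuple[float, str]] = []

def progress_callback(pct: float, msg: str) -> None:
    # Store each update; a real callback might drive a progress bar instead
    updates.append((pct, msg))

# Simulate the callback being invoked once per question in a batch of 4
for i in range(1, 5):
    progress_callback(i / 4 * 100, f"Generated {i}/4")

print(updates[-1])  # (100.0, 'Generated 4/4')
```

Any callable with this signature works, so the same benchmark code can feed a CLI spinner, a log file, or a GUI progress bar.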
Workflow: Generate then Refine¶
The practical workflow for large benchmarks combines all three tools. Start with bulk generation, then review and replace where needed.
generate_all_templates(only_missing=True)
|
v
Review each generated template
|
v
Replace problematic ones:
- AnswerBuilder for uniform patterns
- Hand-written code for complex verify() logic
|
v
Save checkpoint
# Step 1: Bulk generation already done above
# Step 2: Review a generated template
first_id = benchmark.get_question_ids()[0]
template = benchmark.get_template(first_id)
print(f"Reviewing template for: {first_id[:50]}...")
print(f"Template present: {template is not None}")
# Step 3: Replace one with an AnswerBuilder version
metformin_builder = (
AnswerBuilder()
.add_attribute("mentions_ampk", "bool", "Whether AMPK activation is mentioned", True)
.add_attribute("mentions_hepatic", "bool", "Whether hepatic glucose reduction is mentioned", True)
)
benchmark.update_template(q2_id, metformin_builder.compile())
print("\nReplaced template for metformin question")
# Step 4: Save
tmpdir = tempfile.mkdtemp()
checkpoint_path = Path(tmpdir) / "drug_targets.jsonld"
benchmark.save(checkpoint_path)
print(f"Saved to: {checkpoint_path.name}")
print(f"Questions: {benchmark.question_count}")
print(f"Templates: {len(benchmark.get_finished_templates())}")
Reviewing template for: urn:uuid:question-what-is-the-putative-target-of-v...
Template present: True

Replaced template for metformin question
Saved to: drug_targets.jsonld
Questions: 4
Templates: 4
This pattern scales to hundreds of questions. The initial bulk generation handles the majority; targeted replacement keeps quality high for questions where the generated output falls short.
Cleanup¶
import shutil
shutil.rmtree(tmpdir, ignore_errors=True)
Next Steps¶
- Answer Templates: Deep dive into template concepts, field types, and `verify()` patterns
- Factual QA Benchmark: Hand-written template patterns (boolean, string, numeric, regex)
- Scaled Authoring: Bulk ingestion, ADeLe classification, and few-shot examples
- Running Verification: Execute the benchmark against LLMs