Evaluating with TaskEval¶
TaskEval evaluates pre-recorded text or agent traces without requiring you to define questions or generate answers. You log outputs, attach evaluation criteria (templates for correctness checks, rubrics for quality judgments, or both), and run the judge LLM. For the underlying concepts, see TaskEval.
Overview¶
Log outputs → Attach criteria → Evaluate → Inspect results
Choose Your Scenario¶
| Scenario | Focus Area | What You'll Learn |
|---|---|---|
| Basic Evaluation | Template + rubric | Create TaskEval, log text/traces, attach templates and rubrics, configure VerificationConfig, inspect results |
| Quality Assessment | Rubric-only | LLM, regex, and callable traits, rubric-only evaluation, compare scores across outputs |
| Multi-Step Evaluation | Step-scoped | Named steps, target routing, step-scoped criteria, per-step vs global evaluation |
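The Quality Assessment scenario mentions callable traits. As a rough illustration, a callable trait can be thought of as a plain function that scores an output; the `(str) -> float` signature and the trait below are assumptions for illustration, not the library's actual contract:

```python
import re

def cites_sources(output: str) -> float:
    """Hypothetical callable trait: score 1.0 if the output contains at
    least one bracketed citation like [1], else 0.0. The signature is
    an assumption; the library defines the real trait interface."""
    return 1.0 if re.search(r"\[\d+\]", output) else 0.0

print(cites_sources("Water boils at 100 C at sea level [2]."))  # 1.0
print(cites_sources("Water boils at 100 C."))                   # 0.0
```

Regex traits work similarly but apply a pattern directly, while LLM traits delegate the judgment to the judge model.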
Common Workflow¶
All three scenarios follow this general pattern:
Create TaskEval
│
▼
Log outputs (text, traces, or both)
│
▼
Attach evaluation criteria (templates, rubrics, or both)
│
▼
Configure VerificationConfig (parsing_only=True)
│
▼
Evaluate and inspect results
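A minimal sketch of the steps above, assuming a `taskeval` package that exposes the methods listed in the Key APIs table (the import path, rubric variable, and metadata values are illustrative, not the library's confirmed interface):

```python
# Sketch only: assumes a `taskeval` package with the methods shown in
# this guide; rubric contents and metadata are illustrative.
from taskeval import TaskEval, VerificationConfig

task = TaskEval(task_id="demo-1", metadata={"source": "docs-example"})

# 1. Log outputs (text, traces, or both)
task.log("The capital of France is Paris.")

# 2. Attach evaluation criteria (templates, rubrics, or both)
task.add_rubric(my_rubric)  # my_rubric defined elsewhere

# 3. Configure verification; parsing_only=True skips answer generation
config = VerificationConfig(parsing_only=True)

# 4. Evaluate and inspect results
result = task.evaluate(config)
result.summary()
```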
Key APIs¶
| Operation | Method | Covered In |
|---|---|---|
| Create instance | `TaskEval(task_id=..., metadata=...)` | All scenarios |
| Log text | `task.log(text)` | Basic Evaluation |
| Log traces | `task.log_trace(messages)` | Basic Evaluation, Multi-Step |
| Add template | `task.add_template(AnswerClass)` | Basic Evaluation, Multi-Step |
| Add rubric | `task.add_rubric(rubric)` | All scenarios |
| Evaluate globally | `task.evaluate(config)` | All scenarios |
| Evaluate one step | `task.evaluate(config, step_id="...")` | Multi-Step |
| Inspect results | `result.summary()`, `result.display()` | All scenarios |
| Export results | `result.export_json()`, `result.export_markdown()` | Basic Evaluation |
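The `task.log_trace(messages)` operation takes a conversation trace; a plausible shape is a chat-style list of role/content messages (the exact message schema below is an assumption for illustration):

```python
# Hypothetical chat-style trace: a list of role/content messages.
messages = [
    {"role": "user", "content": "Summarize the report in one sentence."},
    {"role": "assistant", "content": "The report projects steady growth."},
]

# Each message pairs a speaker role with its utterance.
roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant']
```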
Core Concepts¶
These concept pages provide the foundational knowledge that the scenarios build on:
- TaskEval: Object structure, pipeline integration, merge strategies
- Answer Templates: Template structure, field types, `verify()` semantics
- Rubrics: Trait types (LLM, regex, callable, metric), global vs per-question
- Evaluation Modes: How template-only, template+rubric, and rubric-only map to pipeline stages
- Verification Pipeline: The 13-stage engine that TaskEval feeds into
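To make the template idea concrete before diving into the concept pages, here is a plain-Python stand-in for an answer template (the real base class and field system come from the library; the class and fields below are hypothetical, sketched to show `verify()` semantics: parse an output into typed fields, then check them):

```python
from dataclasses import dataclass

@dataclass
class CapitalAnswer:
    # Hypothetical stand-in for a library answer template:
    # typed fields extracted from the model output, plus verify().
    city: str
    confidence: float

    def verify(self) -> bool:
        """Return True when the parsed fields satisfy the template's
        correctness criteria (here: expected city, confidence in range)."""
        return self.city.lower() == "paris" and 0.0 <= self.confidence <= 1.0

ans = CapitalAnswer(city="Paris", confidence=0.9)
print(ans.verify())  # True
```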
Next Steps¶
- Analyzing Results: DataFrame analysis, export, and iteration
- Running Verification: Benchmark-mode verification workflows
- Creating Benchmarks: Build benchmarks with questions and templates