Evaluating with TaskEval¶
TaskEval evaluates pre-recorded text or agent traces without requiring you to define questions or generate answers. You log outputs, attach evaluation criteria (templates for correctness checks, rubrics for quality judgments, or both), and run the judge LLM. For the underlying concepts, see TaskEval.
Overview¶
Log outputs → Attach criteria → Evaluate → Inspect results
Choose Your Scenario¶
| Scenario | Focus Area | What You'll Learn |
|---|---|---|
| Basic Evaluation | Template + rubric | Create TaskEval, log text/traces, attach templates and rubrics, configure VerificationConfig, inspect results |
| Quality Assessment | Rubric-only | LLM, regex, and callable traits, rubric-only evaluation, compare scores across outputs |
| Multi-Step Evaluation | Step-scoped | Named steps, target routing, step-scoped criteria, per-step vs global evaluation |
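The Quality Assessment scenario mentions callable traits. As a rough illustration, a callable trait can be thought of as a plain function that scores an output; the `(str) -> float` signature and the trait below are assumptions for illustration, not the library's actual contract:

```python
import re

def cites_sources(output: str) -> float:
    """Hypothetical callable trait: score 1.0 if the output contains at
    least one bracketed citation like [1], else 0.0. The signature is
    an assumption; the library defines the real trait interface."""
    return 1.0 if re.search(r"\[\d+\]", output) else 0.0

print(cites_sources("Water boils at 100 C at sea level [2]."))  # 1.0
print(cites_sources("Water boils at 100 C."))                   # 0.0
```

Regex traits work similarly but apply a pattern directly, while LLM traits delegate the judgment to the judge model.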
Common Workflow¶
All three scenarios follow this general pattern:
Create TaskEval
│
▼
Log outputs (text, traces, or both)
│
▼
Attach evaluation criteria (templates, rubrics, or both)
│
▼
Configure VerificationConfig (parsing_only=True)
│
▼
Evaluate and inspect results
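A minimal sketch of the steps above, assuming a `taskeval` package that exposes the methods listed in the Key APIs table (the import path, rubric variable, and metadata values are illustrative, not the library's confirmed interface):

```python
# Sketch only: assumes a `taskeval` package with the methods shown in
# this guide; rubric contents and metadata are illustrative.
from taskeval import TaskEval, VerificationConfig

task = TaskEval(task_id="demo-1", metadata={"source": "docs-example"})

# 1. Log outputs (text, traces, or both)
task.log("The capital of France is Paris.")

# 2. Attach evaluation criteria (templates, rubrics, or both)
task.add_rubric(my_rubric)  # my_rubric defined elsewhere

# 3. Configure verification; parsing_only=True skips answer generation
config = VerificationConfig(parsing_only=True)

# 4. Evaluate and inspect results
result = task.evaluate(config)
result.summary()
```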
Key APIs¶
| Operation | Method | Covered In |
|---|---|---|
| Create instance | `TaskEval(task_id=..., metadata=...)` | All scenarios |
| Log text | `task.log(text)` | Basic Evaluation |
| Log traces | `task.log_trace(messages)` | Basic Evaluation, Multi-Step |
| Add template | `task.add_template(AnswerClass)` | Basic Evaluation, Multi-Step |
| Add rubric | `task.add_rubric(rubric)` | All scenarios |
| Evaluate globally | `task.evaluate(config)` | All scenarios |
| Evaluate one step | `task.evaluate(config, step_id="...")` | Multi-Step |
| Inspect results | `result.summary()`, `result.display()` | All scenarios |
| Export results | `result.export_json()`, `result.export_markdown()` | Basic Evaluation |
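The `task.log_trace(messages)` operation takes a conversation trace; a plausible shape is a chat-style list of role/content messages (the exact message schema below is an assumption for illustration):

```python
# Hypothetical chat-style trace: a list of role/content messages.
messages = [
    {"role": "user", "content": "Summarize the report in one sentence."},
    {"role": "assistant", "content": "The report projects steady growth."},
]

# Each message pairs a speaker role with its utterance.
roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant']
```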
Core Concepts¶
These concept pages provide the foundational knowledge that the scenarios build on:
- TaskEval: Object structure, pipeline integration, merge strategies
- Answer Templates: Template structure, field types, `verify()` semantics
- Rubrics: Trait types (LLM, regex, callable, metric), global vs per-question
- Evaluation Modes: How template-only, template+rubric, and rubric-only map to pipeline stages
- Verification Pipeline: The 13-stage engine that TaskEval feeds into
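To make the template idea concrete before diving into the concept pages, here is a plain-Python stand-in for an answer template (the real base class and field system come from the library; the class and fields below are hypothetical, sketched to show `verify()` semantics: parse an output into typed fields, then check them):

```python
from dataclasses import dataclass

@dataclass
class CapitalAnswer:
    # Hypothetical stand-in for a library answer template:
    # typed fields extracted from the model output, plus verify().
    city: str
    confidence: float

    def verify(self) -> bool:
        """Return True when the parsed fields satisfy the template's
        correctness criteria (here: expected city, confidence in range)."""
        return self.city.lower() == "paris" and 0.0 <= self.confidence <= 1.0

ans = CapitalAnswer(city="Paris", confidence=0.9)
print(ans.verify())  # True
```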
Next Steps¶
- Analyzing Results: DataFrame analysis, export, and iteration
- Running Verification: Benchmark-mode verification workflows
- Creating Benchmarks: Build benchmarks with questions and templates