# Running Verification
This section walks through running verification end-to-end — from loading a saved benchmark to inspecting results. Each tutorial is a complete, self-contained scenario focused on a specific verification workflow.
Choose the scenario that matches your evaluation needs, or work through them in order of increasing complexity.
## Choose Your Scenario
| Scenario | Focus Area | What You'll Learn |
|---|---|---|
| Basic Verification | Template-only evaluation | Load, configure, run, inspect — the simplest verification path with VerificationConfig, result iteration, CLI equivalents |
| Full Evaluation | Template + rubric | Enable rubrics, abstention/sufficiency checks, embedding verification, PromptConfig, presets |
| Multi-Model Comparison | Comparing models | Multiple answering models, answer caching, replicates, DataFrames, model-level grouping |
| Deep Judgment | Excerpt-based reasoning | Deep judgment for templates and rubrics, excerpt extraction, hallucination risk, search validation |
| MCP Agent Evaluation | Tool-using agents | MCP tool configuration, agent middleware, trace handling, recursion limits |
| Agentic Evaluation | Workspace-based agents | Agentic parsing, investigation judge, VerifiedField primitives, agentic rubric traits, context modes |
| Manual Interface | Pre-recorded traces | Offline evaluation with pre-recorded responses, template iteration, parsing model comparison |
| Progressive Save | Resumable runs | Checkpoint progress incrementally, resume interrupted runs, .state/.tmp files, ProgressiveSaveManager |
| Few-Shot Configuration | Example injection | Global modes (all, k-shot, custom, none), per-question overrides, FewShotConfig, example resolution |
## Common Workflow
All nine scenarios follow this general pattern:
```
Load benchmark
      │
      ▼
Configure verification (models, evaluation mode, features)
      │
      ▼
Run verification (all questions or a subset)
      │
      ▼
Inspect results (iterate, filter, group, summarize)
```
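As a minimal sketch, the four steps above map onto the Python API like this. The method names come from the Key APIs table in this page; the import paths are assumptions and may differ in your installation:

```python
# Sketch of the common workflow. Import paths are assumed;
# method names are taken from the Key APIs table on this page.
from karenina import Benchmark
from karenina.verification import VerificationConfig

# 1. Load a saved benchmark checkpoint
benchmark = Benchmark.load("checkpoint.jsonld")

# 2. Configure verification (answering models, evaluation mode, features)
config = VerificationConfig(answering_models=[...])

# 3. Run verification (pass question_ids=[...] to run a subset)
results = benchmark.run_verification(config)

# 4. Inspect results
print(results.get_summary())
for result in results:
    ...
```

Each scenario below varies this skeleton: swapping in `VerificationConfig.from_preset(...)`, adding rubrics, or replacing the run step with the CLI.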
## Key APIs
| Operation | Method | Covered In |
|---|---|---|
| Load benchmark | `Benchmark.load("checkpoint.jsonld")` | All scenarios |
| Full configuration | `VerificationConfig(answering_models=[...], ...)` | Basic Verification |
| Quick configuration | `VerificationConfig.from_overrides(...)` | Basic Verification |
| Load from preset | `VerificationConfig.from_preset(path)` | Full Evaluation |
| Run all questions | `benchmark.run_verification(config)` | All scenarios |
| Run subset | `benchmark.run_verification(config, question_ids=[...])` | Basic Verification |
| Iterate results | `for result in results: ...` | All scenarios |
| Summary statistics | `results.get_summary()` | Basic Verification |
| Filter results | `results.filter(completed_only=True)` | Basic Verification |
| Group by model | `results.group_by_model()` | Multi-Model Comparison |
| Group by question | `results.group_by_question()` | Multi-Model Comparison |
| DataFrame analysis | `results.get_template_results().to_dataframe()` | Multi-Model Comparison |
| CLI verification | `karenina verify checkpoint.jsonld --preset ...` | All scenarios |
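The result-inspection helpers compose naturally. A sketch, assuming a `results` object from `run_verification` (the shape of the grouped return value is an assumption; see the Multi-Model Comparison scenario for the full treatment):

```python
# Sketch combining the filtering, grouping, and DataFrame helpers
# listed in the table above. `results` comes from run_verification.
completed = results.filter(completed_only=True)   # drop incomplete runs

by_model = completed.group_by_model()             # one bucket per answering model

# Flatten template results into a pandas DataFrame for downstream analysis
df = results.get_template_results().to_dataframe()
print(df.head())
```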
## Core Concepts
These concept pages provide the foundational knowledge that the scenarios build on:
- Evaluation Modes — How template-only, template+rubric, and rubric-only map to pipeline stages
- Answer Templates — Template structure, field types, `verify()` semantics
- Rubrics — Trait types (LLM, regex, callable, metric), global vs per-question
- Adapters — Port/adapter architecture, available backends
- Checkpoints — JSON-LD format, save/load behavior
## Next Steps
- Creating Benchmarks — Build a benchmark to verify
- Analyzing Results — DataFrame analysis, export, iteration