# Running Verification
This section walks through running verification end-to-end — from loading a saved benchmark to inspecting results. Each tutorial is a complete, self-contained scenario focused on a specific verification workflow.
Choose the scenario that matches your evaluation needs, or work through them in order of increasing complexity.
## Choose Your Scenario
| Scenario | Focus Area | What You'll Learn |
|---|---|---|
| Basic Verification | Template-only evaluation | Load, configure, run, inspect — the simplest verification path with VerificationConfig, result iteration, CLI equivalents |
| Full Evaluation | Template + rubric | Enable rubrics, abstention/sufficiency checks, embedding verification, PromptConfig, presets |
| Multi-Model Comparison | Comparing models | Multiple answering models, answer caching, replicates, DataFrames, model-level grouping |
| Deep Judgment | Excerpt-based reasoning | Deep judgment for templates and rubrics, excerpt extraction, hallucination risk, search validation |
| MCP Agent Evaluation | Tool-using agents | MCP tool configuration, agent middleware, trace handling, recursion limits |
| Agentic Evaluation | Workspace-based agents | Agentic parsing, investigation judge, VerifiedField primitives, agentic rubric traits, context modes |
| Manual Interface | Pre-recorded traces | Offline evaluation with pre-recorded responses, template iteration, parsing model comparison |
| Progressive Save | Resumable runs | Checkpoint progress incrementally, resume interrupted runs, .state/.tmp files, ProgressiveSaveManager |
| Few-Shot Configuration | Example injection | Global modes (all, k-shot, custom, none), per-question overrides, FewShotConfig, example resolution |
## Common Workflow
All nine scenarios follow this general pattern:
```
Load benchmark
      │
      ▼
Configure verification (models, evaluation mode, features)
      │
      ▼
Run verification (all questions or a subset)
      │
      ▼
Inspect results (iterate, filter, group, summarize)
```
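As a minimal sketch, the four steps above map onto the Python API like this. The method names come from the Key APIs table in this page; the import paths are assumptions and may differ in your installation:

```python
# Sketch of the common workflow. Import paths are assumed;
# method names are taken from the Key APIs table on this page.
from karenina import Benchmark
from karenina.verification import VerificationConfig

# 1. Load a saved benchmark checkpoint
benchmark = Benchmark.load("checkpoint.jsonld")

# 2. Configure verification (answering models, evaluation mode, features)
config = VerificationConfig(answering_models=[...])

# 3. Run verification (pass question_ids=[...] to run a subset)
results = benchmark.run_verification(config)

# 4. Inspect results
print(results.get_summary())
for result in results:
    ...
```

Each scenario below varies this skeleton: swapping in `VerificationConfig.from_preset(...)`, adding rubrics, or replacing the run step with the CLI.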
## Key APIs
| Operation | Method | Covered In |
|---|---|---|
| Load benchmark | `Benchmark.load("checkpoint.jsonld")` | All scenarios |
| Full configuration | `VerificationConfig(answering_models=[...], ...)` | Basic Verification |
| Quick configuration | `VerificationConfig.from_overrides(...)` | Basic Verification |
| Load from preset | `VerificationConfig.from_preset(path)` | Full Evaluation |
| Run all questions | `benchmark.run_verification(config)` | All scenarios |
| Run subset | `benchmark.run_verification(config, question_ids=[...])` | Basic Verification |
| Iterate results | `for result in results: ...` | All scenarios |
| Summary statistics | `results.get_summary()` | Basic Verification |
| Filter results | `results.filter(completed_only=True)` | Basic Verification |
| Group by model | `results.group_by_model()` | Multi-Model Comparison |
| Group by question | `results.group_by_question()` | Multi-Model Comparison |
| DataFrame analysis | `results.get_template_results().to_dataframe()` | Multi-Model Comparison |
| CLI verification | `karenina verify checkpoint.jsonld --preset ...` | All scenarios |
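The result-inspection helpers compose naturally. A sketch, assuming a `results` object from `run_verification` (the shape of the grouped return value is an assumption; see the Multi-Model Comparison scenario for the full treatment):

```python
# Sketch combining the filtering, grouping, and DataFrame helpers
# listed in the table above. `results` comes from run_verification.
completed = results.filter(completed_only=True)   # drop incomplete runs

by_model = completed.group_by_model()             # one bucket per answering model

# Flatten template results into a pandas DataFrame for downstream analysis
df = results.get_template_results().to_dataframe()
print(df.head())
```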
## Core Concepts
These concept pages provide the foundational knowledge that the scenarios build on:
- Evaluation Modes — How template-only, template+rubric, and rubric-only map to pipeline stages
- Answer Templates — Template structure, field types, `verify()` semantics
- Rubrics — Trait types (LLM, regex, callable, metric), global vs per-question
- Adapters — Port/adapter architecture, available backends
- Checkpoints — JSON-LD format, save/load behavior
## Next Steps
- Creating Benchmarks — Build a benchmark to verify
- Analyzing Results — DataFrame analysis, export, iteration