Running Verification

This section walks through running verification end-to-end — from loading a saved benchmark to inspecting results. Each tutorial is a complete, self-contained scenario focused on a specific verification workflow.

Choose the scenario that matches your evaluation needs, or work through them in order of increasing complexity.

Choose Your Scenario

| Scenario | Focus Area | What You'll Learn |
| --- | --- | --- |
| Basic Verification | Template-only evaluation | Load, configure, run, inspect — the simplest verification path with `VerificationConfig`, result iteration, CLI equivalents |
| Full Evaluation | Template + rubric | Enable rubrics, abstention/sufficiency checks, embedding verification, `PromptConfig`, presets |
| Multi-Model Comparison | Comparing models | Multiple answering models, answer caching, replicates, DataFrames, model-level grouping |
| Deep Judgment | Excerpt-based reasoning | Deep judgment for templates and rubrics, excerpt extraction, hallucination risk, search validation |
| MCP Agent Evaluation | Tool-using agents | MCP tool configuration, agent middleware, trace handling, recursion limits |
| Agentic Evaluation | Workspace-based agents | Agentic parsing, investigation judge, `VerifiedField` primitives, agentic rubric traits, context modes |
| Manual Interface | Pre-recorded traces | Offline evaluation with pre-recorded responses, template iteration, parsing model comparison |
| Progressive Save | Resumable runs | Checkpoint progress incrementally, resume interrupted runs, `.state`/`.tmp` files, `ProgressiveSaveManager` |
| Few-Shot Configuration | Example injection | Global modes (all, k-shot, custom, none), per-question overrides, `FewShotConfig`, example resolution |

Common Workflow

All nine scenarios follow this general pattern:

1. Load the benchmark
2. Configure verification (models, evaluation mode, features)
3. Run verification (all questions or a subset)
4. Inspect results (iterate, filter, group, summarize)
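
The four steps above can be sketched in Python. So that the control flow is visible without karenina installed, the snippet below stubs `Benchmark`, `VerificationConfig`, and the results object with minimal placeholders — the stub bodies, the `Result` class, and all example values are illustrative, not the library's real behavior; only the method names come from the Key APIs table on this page.

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the documented karenina objects. In real use,
# import Benchmark and VerificationConfig from karenina instead.

@dataclass
class VerificationConfig:
    answering_models: list = field(default_factory=list)

@dataclass
class Result:                     # illustrative shape, not karenina's
    question_id: str
    completed: bool

class Results:
    def __init__(self, items):
        self._items = items

    def __iter__(self):           # supports `for result in results: ...`
        return iter(self._items)

    def get_summary(self):        # toy summary statistics
        completed = sum(r.completed for r in self._items)
        return {"total": len(self._items), "completed": completed}

class Benchmark:
    @classmethod
    def load(cls, path):
        # Step 1: load a saved benchmark (stub ignores the file).
        return cls()

    def run_verification(self, config, question_ids=None):
        # Step 3: run all questions, or only the requested subset.
        ids = question_ids if question_ids is not None else ["q1", "q2"]
        return Results([Result(q, True) for q in ids])

benchmark = Benchmark.load("checkpoint.jsonld")                    # 1. load
config = VerificationConfig(answering_models=["model-a"])          # 2. configure
results = benchmark.run_verification(config, question_ids=["q1"])  # 3. run
for result in results:                                             # 4. inspect
    print(result.question_id, result.completed)                    # q1 True
print(results.get_summary())            # {'total': 1, 'completed': 1}
```

In a real run the same four calls apply, but `run_verification` invokes the configured answering models and the results expose the richer filtering and grouping methods listed below.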

Key APIs

| Operation | Method | Covered In |
| --- | --- | --- |
| Load benchmark | `Benchmark.load("checkpoint.jsonld")` | All scenarios |
| Full configuration | `VerificationConfig(answering_models=[...], ...)` | Basic Verification |
| Quick configuration | `VerificationConfig.from_overrides(...)` | Basic Verification |
| Load from preset | `VerificationConfig.from_preset(path)` | Full Evaluation |
| Run all questions | `benchmark.run_verification(config)` | All scenarios |
| Run subset | `benchmark.run_verification(config, question_ids=[...])` | Basic Verification |
| Iterate results | `for result in results: ...` | All scenarios |
| Summary statistics | `results.get_summary()` | Basic Verification |
| Filter results | `results.filter(completed_only=True)` | Basic Verification |
| Group by model | `results.group_by_model()` | Multi-Model Comparison |
| Group by question | `results.group_by_question()` | Multi-Model Comparison |
| DataFrame analysis | `results.get_template_results().to_dataframe()` | Multi-Model Comparison |
| CLI verification | `karenina verify checkpoint.jsonld --preset ...` | All scenarios |

Core Concepts

These concept pages provide the foundational knowledge that the scenarios build on:

  • Evaluation Modes — How template-only, template+rubric, and rubric-only map to pipeline stages
  • Answer Templates — Template structure, field types, `verify()` semantics
  • Rubrics — Trait types (LLM, regex, callable, metric), global vs per-question
  • Adapters — Port/adapter architecture, available backends
  • Checkpoints — JSON-LD format, save/load behavior

Next Steps