karenina verify
Run verification on a benchmark checkpoint file.
karenina verify [OPTIONS] [BENCHMARK_PATH]
The verify command is the primary way to evaluate LLM outputs against benchmark criteria. It loads a checkpoint, configures the verification pipeline, runs evaluation, and exports results.
Arguments
| Argument |
Type |
Required |
Description |
BENCHMARK_PATH |
PATH |
Yes (unless --resume) |
Path to the benchmark JSON-LD checkpoint file |
Options
Configuration Sources
| Option |
Type |
Default |
Description |
--preset |
PATH |
— |
Path to a preset configuration JSON file |
--interactive |
flag |
False |
Launch interactive configuration wizard |
--mode |
TEXT |
basic |
Interactive mode: basic or advanced |
Configuration priority: interactive selection > CLI arguments + preset > preset only > defaults.
Output and Filtering
| Option |
Type |
Default |
Description |
--output |
PATH |
— |
Output file path (.json or .csv). If omitted, you will be prompted. |
--questions |
TEXT |
— |
Question indices: comma-separated (0,1,2) or range (5-10) |
--question-ids |
TEXT |
— |
Comma-separated question IDs (URN format) |
--verbose |
flag |
False |
Show a progress bar during verification |
Model Configuration
| Option |
Type |
Default |
Description |
--answering-model |
TEXT |
— |
Answering model name (required without --preset) |
--answering-provider |
TEXT |
— |
Answering model provider for LangChain (required without --preset) |
--answering-id |
TEXT |
answering-1 |
Answering model ID |
--parsing-model |
TEXT |
— |
Parsing model name (required without --preset) |
--parsing-provider |
TEXT |
— |
Parsing model provider for LangChain (required without --preset) |
--parsing-id |
TEXT |
parsing-1 |
Parsing model ID |
--temperature |
FLOAT |
— |
Model temperature (0.0–2.0) |
--interface |
TEXT |
— |
Model interface (required without --preset). Values: langchain, openrouter, openai_endpoint, claude_agent_sdk, claude_tool, manual |
General Settings
| Option |
Type |
Default |
Description |
--replicate-count |
INTEGER |
— |
Number of replicates per verification task |
--evaluation-mode |
TEXT |
template_only |
Evaluation mode: template_only, template_and_rubric, or rubric_only |
Feature Flags
| Option |
Type |
Default |
Description |
--abstention |
flag |
False |
Enable abstention detection (stage 5) |
--sufficiency |
flag |
False |
Enable trace sufficiency detection (stage 6) |
--embedding-check |
flag |
False |
Enable embedding similarity check (stage 9) |
--deep-judgment |
flag |
False |
Enable deep judgment for templates (stage 10) |
Deep Judgment — Rubric Settings
| Option |
Type |
Default |
Description |
--deep-judgment-rubric-mode |
TEXT |
disabled |
Mode: disabled, enable_all, use_checkpoint, or custom |
--deep-judgment-rubric-excerpts |
flag |
True |
Enable excerpt extraction for rubric traits (enable_all mode) |
--deep-judgment-rubric-max-excerpts |
INTEGER |
7 |
Maximum excerpts per rubric trait (enable_all mode) |
--deep-judgment-rubric-fuzzy-threshold |
FLOAT |
0.8 |
Fuzzy match threshold for excerpt validation (0.0–1.0) |
--deep-judgment-rubric-retry-attempts |
INTEGER |
2 |
Retry attempts for excerpt extraction |
--deep-judgment-rubric-search |
flag |
False |
Enable search-based hallucination detection for rubric excerpts |
--deep-judgment-rubric-search-tool |
TEXT |
tavily |
Search tool for rubric hallucination detection |
--deep-judgment-rubric-config |
PATH |
— |
Path to custom deep judgment config JSON (custom mode) |
Embedding Check Settings
| Option |
Type |
Default |
Description |
--embedding-threshold |
FLOAT |
0.85 |
Embedding similarity threshold (0.0–1.0) |
--embedding-model |
TEXT |
all-MiniLM-L6-v2 |
Embedding model name |
Trace Filtering (MCP)
| Option |
Type |
Default |
Description |
--use-full-trace-for-template / --use-final-message-for-template |
flag |
False |
Use full MCP agent trace (True) or only the final AI message (False) for template parsing |
--use-full-trace-for-rubric / --use-final-message-for-rubric |
flag |
True |
Use full MCP agent trace (True) or only the final AI message (False) for rubric evaluation |
Async Execution
| Option |
Type |
Default |
Description |
--no-async |
flag |
False |
Disable async execution (run sequentially) |
--async-workers |
INTEGER |
2 or KARENINA_ASYNC_MAX_WORKERS |
Number of async workers |
Manual Traces
| Option |
Type |
Default |
Description |
--manual-traces |
PATH |
— |
JSON file with manual traces ({question_hash: trace_string}). Required when --interface manual. |
Progressive Save and Resume
| Option |
Type |
Default |
Description |
--progressive-save |
flag |
False |
Enable incremental saving to .tmp and .state files for crash recovery. Requires --output. |
--resume |
PATH |
— |
Resume from a .state file. Loads config from state; other config options are ignored. |
Exit Codes
| Code |
Meaning |
0 |
Verification completed successfully |
1 |
Error: no templates found, invalid arguments, benchmark load failure, partial batch failure (VerificationBatchError), or unexpected error |
130 |
Interrupted by user (Ctrl+C) |
If some questions fail while others succeed, the executor raises VerificationBatchError with both partial_results and errors. The CLI catches this, writes any partial results to the output file, and exits with code 1. In parallel mode, a timeout also triggers VerificationBatchError.
When interrupted with Ctrl+C during a progressive save run, the command prints a resume hint:
Interrupted by user.
To resume: karenina verify --resume results.json
Progressive Save and Resume
Progressive save enables crash recovery for long-running verification jobs. When enabled, results are saved incrementally as each question completes.
How it works
-
Start with progressive save: The command creates two sidecar files next to the output path:
<output>.tmp — Accumulated results so far
<output>.state — Job state including config, benchmark path, and task manifest
-
On crash or interrupt: The .state file preserves which tasks have completed. The .tmp file holds partial results.
-
Resume: Pass the .state file path to --resume. The command reloads config from state, identifies pending tasks, and runs only those. All other CLI options are ignored when resuming.
-
Finalize: When all tasks complete, results are written to the output file and the .tmp/.state sidecar files are cleaned up.
Check progress
Use karenina verify-status to inspect a state file:
karenina verify-status results.json.state
See verify-status for full details.
Examples
Basic verification with a preset
karenina verify checkpoint.jsonld --preset default.json
Override preset model
karenina verify checkpoint.jsonld --preset default.json \
--answering-model claude-haiku-4-5
CLI arguments only (no preset)
karenina verify checkpoint.jsonld \
--answering-model claude-haiku-4-5 \
--parsing-model claude-haiku-4-5 \
--interface langchain \
--answering-provider anthropic \
--parsing-provider anthropic
Specific questions by index
karenina verify checkpoint.jsonld --preset default.json --questions 0,1,2
Specific questions by range
karenina verify checkpoint.jsonld --preset default.json --questions 5-10
Enable feature flags
karenina verify checkpoint.jsonld --preset default.json \
--abstention --sufficiency --deep-judgment
Template and rubric evaluation
karenina verify checkpoint.jsonld --preset default.json \
--evaluation-mode template_and_rubric
Export to CSV with progress bar
karenina verify checkpoint.jsonld --preset default.json \
--output results.csv --verbose
Manual traces
karenina verify checkpoint.jsonld \
--interface manual \
--manual-traces traces/my_traces.json \
--parsing-model claude-haiku-4-5 \
--parsing-provider anthropic
Progressive save and resume
# Start with progressive save
karenina verify checkpoint.jsonld --preset default.json \
--output results.json --progressive-save
# Resume after interruption
karenina verify --resume results.json.state
Model comparison (multiple runs)
karenina verify checkpoint.jsonld --preset default.json \
--answering-model claude-haiku-4-5 --output results_haiku.json
karenina verify checkpoint.jsonld --preset default.json \
--answering-model claude-sonnet-4-5 --output results_sonnet.json
Full-featured run
karenina verify checkpoint.jsonld --preset default.json \
--evaluation-mode template_and_rubric \
--abstention --sufficiency --embedding-check --deep-judgment \
--replicate-count 3 --output results.json --verbose
Interactive mode
karenina verify checkpoint.jsonld --interactive
karenina verify checkpoint.jsonld --interactive --mode advanced