VerificationConfig Reference¶
This is the exhaustive reference for all VerificationConfig fields. For a tutorial introduction with examples, see Basic Verification.
VerificationConfig is a Pydantic model. This page documents 39 fields, organized into the 12 categories below.
Models¶
| Field | Type | Default | Description |
|---|---|---|---|
| answering_models | list[ModelConfig] | [] | List of answering model configurations. Each defines a model that generates responses to benchmark questions. Default system prompt applied automatically if not set. |
| parsing_models | list[ModelConfig] | (required) | List of parsing (judge) model configurations. At least one is required. Each defines a model that parses LLM responses into structured templates. Default system prompt applied automatically if not set. |
Default system prompts (applied when model has no explicit system_prompt):
- Answering: "You are an expert assistant. Answer the question accurately and concisely."
- Parsing: "You are a validation assistant. Parse and validate responses against the given Pydantic template."
See ModelConfig Reference for all ModelConfig fields.
Execution¶
| Field | Type | Default | Description |
|---|---|---|---|
| replicate_count | int | 1 | Number of times to run each question/model combination. Higher values allow measuring variance across runs. Must be >= 1. |
| parsing_only | bool | False | When True, only parsing models are required (no answering models needed). Used for TaskEval and similar use cases where answers are pre-generated. |
Evaluation Mode¶
| Field | Type | Default | Description |
|---|---|---|---|
| evaluation_mode | Literal["template_only", "template_and_rubric", "rubric_only"] | "template_only" | Determines which pipeline stages run. template_only: template verification only. template_and_rubric: both template and rubric evaluation. rubric_only: skip template verification, evaluate rubrics on raw response. When set to template_and_rubric or rubric_only, rubric evaluation is automatically enabled. |
| rubric_trait_names | list[str] \| None | None | Optional filter to evaluate only specific rubric traits by name. When None, all traits are evaluated. |
| rubric_evaluation_strategy | Literal["batch", "sequential"] \| None | "batch" | How LLM rubric traits are evaluated. batch: all LLM traits in a single call (efficient, requires JSON output). sequential: traits evaluated one-by-one (more reliable, higher cost). |
| agentic_rubric_strategy | Literal["individual", "shared"] | "individual" | How agentic rubric traits are evaluated. individual: one agent per trait (default, most reliable). shared: one agent evaluates all traits that share a model (efficient, but falls back to individual when models differ). |
| agentic_rubric_parallel | bool | False | Reserved for future use. When implemented, will allow parallel evaluation of independent agentic traits. |
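The mapping from evaluation_mode to pipeline stages can be summarized in a few lines. The following is a minimal sketch of the rule described in the table, not the library's implementation; `stages_for` is a hypothetical helper name.

```python
from typing import Literal

EvaluationMode = Literal["template_only", "template_and_rubric", "rubric_only"]
VALID_MODES = ("template_only", "template_and_rubric", "rubric_only")


def stages_for(mode: EvaluationMode) -> dict[str, bool]:
    """Return which stages run for a mode (illustrative helper, not a karenina API)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown evaluation_mode: {mode!r}")
    return {
        # Template verification is skipped only in rubric_only mode.
        "template": mode != "rubric_only",
        # Rubric evaluation is enabled for both rubric-bearing modes.
        "rubric": mode != "template_only",
    }


for mode in VALID_MODES:
    print(mode, stages_for(mode))
```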
Trace Filtering¶
| Field | Type | Default | Description |
|---|---|---|---|
| use_full_trace_for_template | bool | False | If True, pass full agent trace to template parsing. If False, extract only the final AI message. The full trace is always captured in raw_llm_response regardless. |
| use_full_trace_for_rubric | bool | True | If True, pass full agent trace to rubric evaluation. If False, extract only the final AI message. The full trace is always captured in raw_llm_response regardless. |
Note
If use_full_trace_for_template=False and the trace doesn't end with an AI message, the trace validation stage will fail with an error.
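To make the trace-filtering behavior concrete, here is a minimal sketch of the selection logic. It assumes a trace is a list of dicts with "role" and "content" keys, which is an illustrative shape and not the library's actual message type; `select_template_input` is a hypothetical helper.

```python
def select_template_input(trace: list[dict[str, str]], use_full_trace: bool) -> str:
    """Pick the text handed to template parsing (hypothetical helper)."""
    if use_full_trace:
        # Full trace: every message, in order.
        return "\n".join(f"{m['role']}: {m['content']}" for m in trace)
    # Final-message mode: the trace must end with an AI message.
    if not trace or trace[-1]["role"] != "ai":
        raise ValueError("trace does not end with an AI message")
    return trace[-1]["content"]


trace = [
    {"role": "human", "content": "What is the boiling point of water?"},
    {"role": "tool", "content": "lookup: 100 C at 1 atm"},
    {"role": "ai", "content": "100 degrees Celsius at sea level."},
]
final_only = select_template_input(trace, use_full_trace=False)
full = select_template_input(trace, use_full_trace=True)
```

With this shape, a trace that stops on a tool message raises in final-message mode but is still usable when the full trace is requested.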
Pre-Parsing Checks¶
| Field | Type | Default | Description |
|---|---|---|---|
| abstention_enabled | bool | False | Enable abstention/refusal detection. When the model refuses to answer, parsing is skipped and the result is auto-failed. |
| sufficiency_enabled | bool | False | Enable response sufficiency detection. When the response lacks enough information to fill the template, parsing is skipped and the result is auto-failed. |
See Full Evaluation for usage examples.
Embedding Check¶
| Field | Type | Default | Env Var | Description |
|---|---|---|---|---|
| embedding_check_enabled | bool | False | EMBEDDING_CHECK | Enable semantic similarity verification as a fallback after template verify(). |
| embedding_check_model | str | "all-MiniLM-L6-v2" | EMBEDDING_CHECK_MODEL | SentenceTransformer model name for computing embeddings. |
| embedding_check_threshold | float | 0.85 | EMBEDDING_CHECK_THRESHOLD | Cosine similarity threshold. Constrained to [0.0, 1.0]. Values above this threshold are considered semantically matching. |
Environment variable precedence: Env vars are applied only when the field is not explicitly set. Explicit arguments always take priority over env vars.
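The precedence rule can be illustrated with a small resolver for embedding_check_threshold. This is a sketch under stated assumptions: it treats `None` as "not explicitly set", which simplifies how a Pydantic model tracks explicitly set fields, and `resolve_threshold` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 0.85  # field default from the table above


def resolve_threshold(
    explicit: Optional[float] = None,
    env: Mapping[str, str] = os.environ,
) -> float:
    """Explicit argument first, then env var, then the field default."""
    if explicit is not None:
        value = explicit  # explicit arguments always win
    elif "EMBEDDING_CHECK_THRESHOLD" in env:
        value = float(env["EMBEDDING_CHECK_THRESHOLD"])
    else:
        value = DEFAULT_THRESHOLD
    if not 0.0 <= value <= 1.0:
        raise ValueError("embedding_check_threshold must be in [0.0, 1.0]")
    return value


print(resolve_threshold(0.5, {"EMBEDDING_CHECK_THRESHOLD": "0.9"}))
print(resolve_threshold(None, {"EMBEDDING_CHECK_THRESHOLD": "0.9"}))
print(resolve_threshold(None, {}))
```

Passing the environment as a mapping keeps the sketch easy to test without mutating the real process environment.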
Async Execution¶
| Field | Type | Default | Env Var | Description |
|---|---|---|---|---|
| async_enabled | bool | True | KARENINA_ASYNC_ENABLED | Enable parallel execution of verification across questions. |
| async_max_workers | int | 2 | KARENINA_ASYNC_MAX_WORKERS | Maximum number of concurrent verification workers when async is enabled. Must be >= 1. |
Both sequential and parallel execution modes collect per-question errors without aborting. If any questions fail (or the parallel batch exceeds its timeout), VerificationBatchError is raised with partial_results and errors attributes so callers can recover partial progress. See Basic Verification: Error Handling for usage examples.
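The collect-then-raise pattern described above can be sketched as follows. The exception class here is a stand-in that models only the two documented attributes; the dict-by-question-ID shape of partial_results and errors, and the `run_batch` helper, are assumptions for illustration.

```python
class VerificationBatchError(Exception):
    """Stand-in for the library's exception; only the documented attributes are modeled."""

    def __init__(self, partial_results, errors):
        super().__init__(f"{len(errors)} question(s) failed")
        self.partial_results = partial_results
        self.errors = errors


def run_batch(question_ids, verify_one):
    """Run every question, collecting failures instead of aborting on the first."""
    partial_results, errors = {}, {}
    for qid in question_ids:
        try:
            partial_results[qid] = verify_one(qid)
        except Exception as exc:  # record the failure and keep going
            errors[qid] = exc
    if errors:
        raise VerificationBatchError(partial_results, errors)
    return partial_results


def flaky(qid):
    # Hypothetical per-question verifier that fails on one question.
    if qid == "q2":
        raise RuntimeError("model timeout")
    return f"ok:{qid}"


try:
    results = run_batch(["q1", "q2", "q3"], flaky)
    failed = []
except VerificationBatchError as err:
    results = err.partial_results  # recover what did succeed
    failed = sorted(err.errors)    # question IDs worth retrying
```

A caller can persist `results` immediately and re-run only the IDs in `failed`.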
Deep Judgment — Templates¶
| Field | Type | Default | Description |
|---|---|---|---|
| deep_judgment_enabled | bool | False | Enable multi-stage deep judgment analysis for template verification. Adds excerpt extraction, fuzzy matching, and reasoning to parsed results. |
| deep_judgment_max_excerpts_per_attribute | int | 3 | Maximum number of excerpts to extract per template attribute during deep judgment. |
| deep_judgment_fuzzy_match_threshold | float | 0.80 | Fuzzy match similarity threshold for validating excerpts against the original trace. |
| deep_judgment_excerpt_retry_attempts | int | 2 | Number of retry attempts for excerpt extraction when fuzzy matching fails. |
| deep_judgment_search_enabled | bool | False | Enable search-enhanced excerpt validation. When enabled, excerpts are verified against external evidence to detect hallucination. |
| deep_judgment_search_tool | str \| Callable | "tavily" | Search tool for excerpt validation. Built-in: "tavily". Can also be any callable with signature (str \| list[str]) -> (str \| list[str]). Requires TAVILY_API_KEY for built-in tool. |
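A custom search tool only needs to match the callable signature given for deep_judgment_search_tool. The sketch below is a toy example backed by an in-memory dict; the function name, the note contents, and the choice to return one result per query for list input are all assumptions.

```python
from typing import Union

Query = Union[str, list[str]]

# Hypothetical evidence store standing in for a real search backend.
_NOTES = {
    "boiling point of water": "Water boils at 100 C at 1 atm.",
    "speed of light": "The speed of light in vacuum is 299,792,458 m/s.",
}


def local_notes_search(query: Query) -> Query:
    """Toy search tool: accepts one query or a list and mirrors the input shape."""

    def lookup(q: str) -> str:
        return _NOTES.get(q.strip().lower(), "no evidence found")

    if isinstance(query, str):
        return lookup(query)
    return [lookup(q) for q in query]


single = local_notes_search("Speed of light")
batch = local_notes_search(["boiling point of water", "unknown topic"])
```

Based on the field description, a callable like this could presumably be passed as deep_judgment_search_tool in place of "tavily", which would also avoid the TAVILY_API_KEY requirement.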
Deep Judgment — Rubrics¶
| Field | Type | Default | Description |
|---|---|---|---|
| deep_judgment_rubric_mode | Literal["disabled", "enable_all", "use_checkpoint", "custom"] | "disabled" | Controls how deep judgment is applied to rubric traits. disabled: off. enable_all: apply to all LLM traits. use_checkpoint: use settings saved in checkpoint. custom: use per-trait configuration from deep_judgment_rubric_config. |
| deep_judgment_rubric_global_excerpts | bool | True | For enable_all mode: globally enable or disable excerpt extraction for all traits. |
| deep_judgment_rubric_config | dict[str, Any] \| None | None | Per-trait configuration for custom mode. See structure below. |
| deep_judgment_rubric_max_excerpts_default | int | 7 | Default maximum excerpts per rubric trait (used as fallback when per-trait config omits this setting). |
| deep_judgment_rubric_fuzzy_match_threshold_default | float | 0.80 | Default fuzzy match threshold for rubric excerpt validation. |
| deep_judgment_rubric_excerpt_retry_attempts_default | int | 2 | Default retry attempts for rubric excerpt extraction. |
| deep_judgment_rubric_search_tool | str \| Callable | "tavily" | Search tool for rubric hallucination detection. Same options as deep_judgment_search_tool. |
Custom Mode Config Structure¶
The deep_judgment_rubric_config dict (for custom mode) expects:
{
  "global": {
    "TraitName": {
      "enabled": true,
      "excerpt_enabled": true,
      "max_excerpts": 5,
      "fuzzy_match_threshold": 0.80,
      "excerpt_retry_attempts": 2,
      "search_enabled": false
    }
  },
  "question_specific": {
    "question-id": {
      "TraitName": {
        "enabled": true,
        "excerpt_enabled": false
      }
    }
  }
}
Each trait entry is validated as a DeepJudgmentTraitConfig with these fields:
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Whether deep judgment is enabled for this trait. |
| excerpt_enabled | bool | True | Whether to extract excerpts for this trait. |
| max_excerpts | int \| None | None | Max excerpts (falls back to deep_judgment_rubric_max_excerpts_default). |
| fuzzy_match_threshold | float \| None | None | Fuzzy threshold (falls back to global default). |
| excerpt_retry_attempts | int \| None | None | Retry attempts (falls back to global default). |
| search_enabled | bool | False | Enable search validation for this trait's excerpts. |
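The fallback behavior for unset per-trait values can be sketched with a plain dataclass. `TraitSettings` is a stand-in that mirrors the DeepJudgmentTraitConfig fields listed above, and `resolve_trait` is a hypothetical helper; the default values are the documented deep_judgment_rubric_*_default values.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TraitSettings:
    """Stand-in mirroring the DeepJudgmentTraitConfig fields (not the real class)."""
    enabled: bool = True
    excerpt_enabled: bool = True
    max_excerpts: Optional[int] = None
    fuzzy_match_threshold: Optional[float] = None
    excerpt_retry_attempts: Optional[int] = None
    search_enabled: bool = False


# Documented values of the deep_judgment_rubric_*_default fields.
GLOBAL_DEFAULTS = {
    "max_excerpts": 7,
    "fuzzy_match_threshold": 0.80,
    "excerpt_retry_attempts": 2,
}


def resolve_trait(trait: TraitSettings) -> dict:
    """Fill every None-valued setting from the global defaults."""
    resolved = asdict(trait)
    for name, default in GLOBAL_DEFAULTS.items():
        if resolved[name] is None:
            resolved[name] = default
    return resolved


partial = resolve_trait(TraitSettings(max_excerpts=5, excerpt_enabled=False))
untouched = resolve_trait(TraitSettings())
```

Only the three nullable fields fall back; the boolean fields always carry their own value.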
Agentic Parsing¶
| Field | Type | Default | Description |
|---|---|---|---|
| agentic_parsing | bool | False | Enable agentic parsing (Stage 7b). The judge uses tools to independently verify artifacts before extracting structured data. Requires a parsing model with agent_tier='deep_agent'. |
| agentic_judge_context | Literal["workspace_only", "trace_and_workspace", "trace_only"] | "workspace_only" | What context the investigation agent receives. workspace_only: question + workspace path. trace_and_workspace: answering agent trace + workspace path. trace_only: equivalent to classical Stage 7a parsing. |
| agentic_parsing_max_turns | int | 15 | Max turns for the investigation agent. Must be >= 1. |
| agentic_parsing_timeout | float | 120.0 | Timeout in seconds for the investigation agent. Must be >= 0.0. |
Scenario Execution¶
| Field | Type | Default | Description |
|---|---|---|---|
| scenario_turn_limit | int | 20 | Maximum turns before forced termination in scenario execution. Must be >= 1. |
Additional Configuration¶
| Field | Type | Default | Description |
|---|---|---|---|
| few_shot_config | FewShotConfig \| None | None | Few-shot prompting configuration. Controls example injection into prompts. See Few-Shot Configuration. |
| prompt_config | PromptConfig \| None | None | Per-task prompt instruction overrides. Injects custom instructions into specific pipeline stages. See Full Evaluation for usage and PromptConfig Reference for all fields. |
| db_config | DBConfig \| None | None | DBConfig instance for automatic result persistence to a database. When set, results are saved after each verification run. See DBConfig fields below. |
DBConfig Fields¶
DBConfig controls the database connection for auto-saving verification results. It is imported from karenina.storage.
| Field | Type | Default | Description |
|---|---|---|---|
| storage_url | str | (required) | SQLAlchemy database URL (e.g. sqlite:///results.db, postgresql://user:pass@host/db) |
| auto_create | bool | True | Automatically create tables and views if missing |
| auto_commit | bool | True | Commit transactions automatically after operations |
| echo | bool | False | Log all SQL statements (useful for debugging) |
| pool_size | int | 5 | Connection pool size (non-SQLite only) |
| max_overflow | int | 10 | Max connections beyond pool_size (non-SQLite only) |
| pool_recycle | int | 3600 | Recycle connections after N seconds (-1 to disable) |
| pool_pre_ping | bool | True | Test connections before use |
SQLite databases automatically set pool_size=1 and max_overflow=0.
Auto-save is controlled by the AUTOSAVE_DATABASE environment variable (true/false, default true). Auto-save only runs when db_config is set — without it, no database writes occur. Auto-save is non-blocking: failures are logged but do not raise exceptions.
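The two conditions for auto-save, and its log-instead-of-raise behavior, can be sketched as below. The helper names are hypothetical, and the exact parsing of AUTOSAVE_DATABASE (here: case-insensitive comparison with "true") is an assumption.

```python
import logging
import os
from typing import Callable, Mapping, Optional

logger = logging.getLogger("autosave_sketch")


def autosave_active(db_config: Optional[object], env: Mapping[str, str] = os.environ) -> bool:
    """Auto-save needs a db_config AND an enabled AUTOSAVE_DATABASE (default true)."""
    if db_config is None:
        return False  # without db_config, no database writes occur
    # Assumption: any value other than "true" (case-insensitive) disables auto-save.
    return env.get("AUTOSAVE_DATABASE", "true").strip().lower() == "true"


def try_autosave(save: Callable[[], None], db_config: Optional[object],
                 env: Mapping[str, str] = os.environ) -> bool:
    """Run save() if auto-save is active; log failures instead of raising."""
    if not autosave_active(db_config, env):
        return False
    try:
        save()
    except Exception:
        logger.exception("auto-save failed")
        return False
    return True


def broken_save() -> None:
    raise ConnectionError("database unreachable")


saved = try_autosave(lambda: None, db_config=object(), env={})
failed_quietly = try_autosave(broken_save, db_config=object(), env={})
```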
Convenience Methods¶
from_overrides()¶
Create a VerificationConfig by applying selective overrides to an optional base config. This is the canonical way to construct configs programmatically.
config = VerificationConfig.from_overrides(
    answering_model="claude-haiku-4-5",
    answering_provider="anthropic",
    answering_id="my-answering",
    parsing_model="claude-haiku-4-5",
    parsing_provider="anthropic",
    parsing_id="my-parsing",
    evaluation_mode="template_and_rubric",
    abstention=True,
)
| Parameter | Maps To | Description |
|---|---|---|
| answering_model | answering_models[0].model_name | Answering model name |
| answering_provider | answering_models[0].model_provider | Answering model provider |
| answering_id | answering_models[0].id | Answering model identifier |
| answering_interface | answering_models[0].interface | Answering adapter interface |
| parsing_model | parsing_models[0].model_name | Parsing model name |
| parsing_provider | parsing_models[0].model_provider | Parsing model provider |
| parsing_id | parsing_models[0].id | Parsing model identifier |
| parsing_interface | parsing_models[0].interface | Parsing adapter interface |
| temperature | Both models' temperature | Shared temperature override |
| manual_traces | answering_models[0].manual_traces | Pre-recorded traces (sets interface to manual) |
| replicate_count | replicate_count | Number of replicates |
| abstention | abstention_enabled | Enable abstention detection |
| sufficiency | sufficiency_enabled | Enable sufficiency detection |
| embedding_check | embedding_check_enabled | Enable embedding check |
| deep_judgment | deep_judgment_enabled | Enable template deep judgment |
| evaluation_mode | evaluation_mode | Sets the evaluation mode |
| embedding_threshold | embedding_check_threshold | Embedding similarity threshold |
| embedding_model | embedding_check_model | Embedding model name |
| async_execution | async_enabled | Enable async execution |
| async_workers | async_max_workers | Number of async workers |
| use_full_trace_for_template | use_full_trace_for_template | Trace filtering for templates |
| use_full_trace_for_rubric | use_full_trace_for_rubric | Trace filtering for rubrics |
| deep_judgment_rubric_mode | deep_judgment_rubric_mode | Rubric deep judgment mode |
| deep_judgment_rubric_excerpts | deep_judgment_rubric_global_excerpts | Global excerpt toggle |
| deep_judgment_rubric_max_excerpts | deep_judgment_rubric_max_excerpts_default | Max excerpts per trait |
| deep_judgment_rubric_fuzzy_threshold | deep_judgment_rubric_fuzzy_match_threshold_default | Fuzzy match threshold |
| deep_judgment_rubric_retry_attempts | deep_judgment_rubric_excerpt_retry_attempts_default | Retry attempts |
| deep_judgment_rubric_search_tool | deep_judgment_rubric_search_tool | Rubric search tool |
| deep_judgment_rubric_config | deep_judgment_rubric_config | Custom per-trait config |
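The renaming that from_overrides() performs for scalar parameters can be pictured as a lookup table. The sketch below covers only a subset of the parameters above and is not the library's implementation; `translate_overrides` is a hypothetical helper, and treating `None` as "not overridden" is an assumption.

```python
# Subset of the parameter-to-field mapping from the table above.
OVERRIDE_TO_FIELD = {
    "abstention": "abstention_enabled",
    "sufficiency": "sufficiency_enabled",
    "embedding_check": "embedding_check_enabled",
    "embedding_threshold": "embedding_check_threshold",
    "embedding_model": "embedding_check_model",
    "async_execution": "async_enabled",
    "async_workers": "async_max_workers",
    "deep_judgment": "deep_judgment_enabled",
}


def translate_overrides(**overrides):
    """Rename override parameters to config field names (illustrative only)."""
    fields = {}
    for name, value in overrides.items():
        if value is None:
            continue  # assumption: None means the caller did not override this
        # Parameters such as replicate_count keep their name unchanged.
        fields[OVERRIDE_TO_FIELD.get(name, name)] = value
    return fields


fields = translate_overrides(
    abstention=True,
    async_workers=4,
    evaluation_mode="rubric_only",
    deep_judgment=None,
)
print(fields)
```

The model-related parameters (answering_model, parsing_provider, and so on) are omitted here because they populate nested ModelConfig entries and are not plain renames.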
Preset Methods¶
| Method | Description |
|---|---|
| save_preset(name, description, presets_dir) | Save config as a preset JSON file |
| from_preset(filepath) | Load a VerificationConfig from a preset file |
| sanitize_preset_name(name) | Convert preset name to safe filename |
| validate_preset_metadata(name, description) | Validate preset name and description |
See Presets for usage details.
Inspection Methods¶
| Method | Returns | Description |
|---|---|---|
| get_few_shot_config() | FewShotConfig \| None | Get the active few-shot configuration |
| is_few_shot_enabled() | bool | Check if few-shot prompting is enabled |
Configuration Precedence¶
Fields are resolved in this order (highest priority first):
- Explicit arguments passed to the constructor or from_overrides()
- Environment variables (only for fields that support them: embedding and async settings)
- Field defaults defined on the class
See Configuration Hierarchy for the full precedence model including presets and CLI arguments.