Deep Judgment
This scenario adds deep judgment to verification — a multi-stage evaluation process that extracts excerpts from responses, performs fuzzy matching, generates reasoning traces, and optionally validates claims against external search results. Deep judgment works for both template and rubric evaluation.
What you'll learn:
- Enable deep judgment for template verification
- Inspect extracted excerpts, reasoning, and hallucination risk
- Enable search-based validation for factual claims
- Configure deep judgment for rubric traits
- Tune deep judgment parameters
What Deep Judgment Does
Deep judgment adds a multi-stage evaluation layer between parsing and the final result:

Standard pipeline:
Parse → Verify → Finalize

With deep judgment:
Parse → Extract Excerpts → Fuzzy Match → Reason → [Search] → Verify → Finalize

For each template attribute, deep judgment:
- Extracts excerpts from the response that relate to the attribute
- Fuzzy matches excerpts against ground truth (similarity scoring)
- Generates reasoning explaining how the excerpt supports or contradicts the expected value
- Optionally searches external sources to validate factual claims
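
The fuzzy matching stage can be pictured with a minimal sketch. This is not Karenina's implementation: it assumes a simple character-level ratio from Python's difflib, and similarity and passes_threshold are hypothetical helpers. The 0.7 cutoff mirrors the documented default of deep_judgment_fuzzy_match_threshold.

```python
from difflib import SequenceMatcher


def similarity(excerpt: str, reference: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two strings (case-insensitive)."""
    return SequenceMatcher(None, excerpt.lower(), reference.lower()).ratio()


def passes_threshold(excerpt: str, reference: str, threshold: float = 0.7) -> bool:
    """True when the excerpt is at least `threshold` similar to the reference."""
    return similarity(excerpt, reference) >= threshold


# Illustrative strings, not real benchmark data.
reference = "The capital of France is Paris"
print(passes_threshold("the capital of france is paris", reference))
print(passes_threshold("Bananas are yellow", reference))
```

A higher threshold accepts only near-verbatim excerpts, while a lower one tolerates paraphrase at the cost of more false matches.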
Enable DJ for Templates
from karenina import Benchmark
from karenina.schemas import ModelConfig, VerificationConfig  # import path assumed

# _tmp: path to the benchmark file (defined outside this snippet)
benchmark = Benchmark.load(str(_tmp))
config = VerificationConfig(
    answering_models=[
        ModelConfig(id="haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain")
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
    deep_judgment_enabled=True,  # turn on deep judgment for template attributes
)
results = benchmark.run_verification(config)
print(f"Results with DJ: {len(results)}")
| Parameter | Type | Default | Description |
|---|---|---|---|
| deep_judgment_enabled | bool | False | Enable deep judgment for template verification |
| deep_judgment_search_enabled | bool | False | Enable external search validation |
| deep_judgment_excerpt_retry_attempts | int | 2 | Max retries for excerpt extraction |
| deep_judgment_fuzzy_match_threshold | float | 0.7 | Min similarity score for excerpt matching |
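
As a sketch of how the tuning parameters above might be grouped, the following collects them in a plain dictionary and sanity-checks the values before they are passed on. The validate_dj_tuning helper, its bounds, and the chosen values are illustrative assumptions; only the four parameter names come from the table.

```python
def validate_dj_tuning(params: dict) -> dict:
    """Check illustrative deep judgment settings and return them unchanged.

    Hypothetical helper: the bounds are assumptions, not documented limits.
    """
    threshold = params.get("deep_judgment_fuzzy_match_threshold", 0.7)
    retries = params.get("deep_judgment_excerpt_retry_attempts", 2)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("fuzzy match threshold should be a similarity in [0, 1]")
    if retries < 0:
        raise ValueError("retry attempts cannot be negative")
    return params


# Stricter matching and one more retry than the defaults (example values).
dj_tuning = validate_dj_tuning({
    "deep_judgment_enabled": True,
    "deep_judgment_search_enabled": False,
    "deep_judgment_excerpt_retry_attempts": 3,
    "deep_judgment_fuzzy_match_threshold": 0.8,
})
print(dj_tuning)
```

The dictionary could then be unpacked into the configuration alongside the model lists, for example VerificationConfig(answering_models=..., parsing_models=..., **dj_tuning), assuming the constructor accepts these fields as keyword arguments as the examples on this page suggest.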
Inspect DJ Template Results
Extracted Excerpts
Each attribute in the template gets a list of supporting excerpts:
for result in results:
    dj = result.deep_judgment
    if dj and dj.deep_judgment_performed:
        print(f"\nQ: {result.metadata.question_text[:50]}")
        for attr, excerpts in (dj.extracted_excerpts or {}).items():
            print(f" Attribute: {attr}")
            for exc in excerpts:
                print(f" \"{exc['text'][:60]}\" (confidence: {exc['confidence']}, similarity: {exc['similarity_score']:.2f})")
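
To focus on weak matches, the excerpt dictionaries can be filtered by similarity. The sketch below assumes the text, confidence, and similarity_score keys shown above; weak_excerpts is a hypothetical helper, and the sample data (including the string confidence values) is made up for illustration.

```python
def weak_excerpts(extracted: dict, threshold: float = 0.7) -> dict:
    """Return, per attribute, the excerpts whose similarity is below `threshold`."""
    weak = {}
    for attr, excerpts in (extracted or {}).items():
        low = [e for e in excerpts if e.get("similarity_score", 0.0) < threshold]
        if low:
            weak[attr] = low
    return weak


# Made-up data in the same shape as dj.extracted_excerpts.
sample = {
    "capital": [
        {"text": "Paris is the capital", "confidence": "high", "similarity_score": 0.95},
        {"text": "a large city in Europe", "confidence": "low", "similarity_score": 0.41},
    ],
    "population": [
        {"text": "about 2 million people", "confidence": "medium", "similarity_score": 0.82},
    ],
}

for attr, excerpts in weak_excerpts(sample).items():
    for exc in excerpts:
        print(f"{attr}: {exc['text']!r} ({exc['similarity_score']:.2f})")
```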
Reasoning Traces
for result in results[:2]:
    dj = result.deep_judgment
    if dj and dj.attribute_reasoning:
        print(f"\nQ: {result.metadata.question_text[:50]}")
        for attr, reasoning in dj.attribute_reasoning.items():
            print(f" {attr}: {reasoning[:80]}...")
DJ with Search Validation
Enable search to validate factual claims against external sources. Requires a search API key (e.g., Tavily):
config_with_search = VerificationConfig(
    answering_models=[
        ModelConfig(id="haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain")
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
    deep_judgment_enabled=True,
    deep_judgment_search_enabled=True,
)
print(f"DJ enabled: {config_with_search.deep_judgment_enabled}")
print(f"Search enabled: {config_with_search.deep_judgment_search_enabled}")
When search is enabled, each excerpt includes hallucination_risk (none/low/medium/high) and supporting search results:
# Inspect search-validated results (question 5 has search data)
result = results[4]
dj = result.deep_judgment
if dj and dj.deep_judgment_search_enabled:
    print(f"Q: {result.metadata.question_text[:50]}")
    for attr, excerpts in (dj.extracted_excerpts or {}).items():
        for exc in excerpts:
            if "search_results" in exc:
                print(f" Excerpt: \"{exc['text'][:50]}\"")
                print(f" Hallucination risk: {exc.get('hallucination_risk', 'N/A')}")
                print(f" Search: {exc['search_results'][:80]}...")
    if dj.hallucination_risk_assessment:
        print(f" Risk assessment: {dj.hallucination_risk_assessment}")
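
One way to summarize the per-excerpt risk levels is to take the highest level seen for each attribute. The result object already exposes hallucination_risk_assessment; this sketch only illustrates how such a summary could be derived from excerpt data. It assumes the four levels are ordered none < low < medium < high, max_risk_per_attribute is a hypothetical helper, and the data is illustrative.

```python
RISK_ORDER = ["none", "low", "medium", "high"]  # assumed ordering, lowest first


def max_risk_per_attribute(extracted: dict) -> dict:
    """Map each attribute to the highest hallucination_risk among its excerpts."""
    summary = {}
    for attr, excerpts in (extracted or {}).items():
        risks = [
            e["hallucination_risk"]
            for e in excerpts
            if e.get("hallucination_risk") in RISK_ORDER
        ]
        if risks:  # attributes without any risk label are skipped
            summary[attr] = max(risks, key=RISK_ORDER.index)
    return summary


# Made-up data in the same shape as search-validated excerpts.
sample_search = {
    "capital": [
        {"text": "Paris is the capital", "hallucination_risk": "none"},
        {"text": "founded in 250 BC", "hallucination_risk": "medium"},
    ],
    "currency": [
        {"text": "uses the euro", "hallucination_risk": "low"},
    ],
    "anthem": [
        {"text": "no search data here"},
    ],
}
print(max_risk_per_attribute(sample_search))
```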
DJ for Rubrics
Deep judgment also works for rubric traits, providing per-trait excerpts and reasoning:
config_rubric_dj = VerificationConfig(
    answering_models=[
        ModelConfig(id="haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain")
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_and_rubric",
    deep_judgment_rubric_mode="enable_all",
)
rubric_dj_results = benchmark.run_verification(config_rubric_dj)
print(f"Rubric DJ results: {len(rubric_dj_results)}")
Inspect Rubric DJ Results
for result in rubric_dj_results:
    djr = result.deep_judgment_rubric
    if djr and djr.deep_judgment_rubric_performed:
        print(f"\nQ: {result.metadata.question_text[:40]}")
        # Per-trait reasoning
        for trait, reasoning in (djr.rubric_trait_reasoning or {}).items():
            score = (djr.deep_judgment_rubric_scores or {}).get(trait, "N/A")
            print(f" {trait} (score={score}): {reasoning[:60]}...")
        # Per-trait excerpts
        for trait, excerpts in (djr.extracted_rubric_excerpts or {}).items():
            print(f" {trait} excerpts: {len(excerpts)}")
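
A quick consistency check is to look for traits that received a score but have no supporting excerpts. The sketch below works on plain dictionaries shaped like deep_judgment_rubric_scores and extracted_rubric_excerpts; traits_without_excerpts is a hypothetical helper, and the trait names and score types are made up.

```python
def traits_without_excerpts(scores: dict, excerpts: dict) -> list:
    """List scored traits that have no supporting excerpts, in sorted order."""
    scores = scores or {}
    excerpts = excerpts or {}
    # A trait counts as unsupported if it is missing or maps to an empty list.
    return sorted(trait for trait in scores if not excerpts.get(trait))


# Made-up data; real score types depend on how each trait is defined.
sample_scores = {"clarity": 4, "conciseness": True, "safety": 5}
sample_excerpts = {
    "clarity": [{"text": "The answer states the result up front"}],
    "safety": [],
}
print(traits_without_excerpts(sample_scores, sample_excerpts))
```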
| deep_judgment_rubric_mode | Behavior |
|---|---|
| "disabled" (default) | No rubric deep judgment |
| "enable_all" | Apply DJ to all LLM rubric traits |
| "use_checkpoint" | Use per-trait settings from checkpoint |
| "custom" | Use deep_judgment_rubric_config dict |
CLI Equivalent
# Enable deep judgment via CLI:
# karenina verify benchmark.jsonld --preset base.json --deep-judgment
# With search validation:
# karenina verify benchmark.jsonld --preset base.json --deep-judgment --deep-judgment-search
# With rubric deep judgment:
# karenina verify benchmark.jsonld --preset base.json --deep-judgment --deep-judgment-rubric-mode all
print("CLI: --deep-judgment, --deep-judgment-search, --deep-judgment-rubric-mode")
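
When the CLI is driven from a script, the flags above can be assembled programmatically. This is a minimal sketch: build_verify_command is a hypothetical helper that only uses the flags shown in the comments above, and it builds the command without running it.

```python
import shlex


def build_verify_command(benchmark, preset, deep_judgment=False,
                         search=False, rubric_mode=None):
    """Assemble a `karenina verify` argument list from the documented flags."""
    cmd = ["karenina", "verify", benchmark, "--preset", preset]
    if deep_judgment:
        cmd.append("--deep-judgment")
    if search:
        cmd.append("--deep-judgment-search")
    if rubric_mode is not None:
        cmd += ["--deep-judgment-rubric-mode", rubric_mode]
    return cmd


cmd = build_verify_command("benchmark.jsonld", "base.json",
                           deep_judgment=True, search=True)
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) in an environment where the karenina CLI is installed.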
Related Pages
- Full Evaluation — Template+rubric without deep judgment
- Basic Verification — Simplest verification path
- VerificationConfig Reference — All DJ configuration fields
- Advanced Pipeline — Pipeline stage details