Deep Judgment¶

This scenario adds deep judgment to verification — a multi-stage evaluation process that extracts excerpts from responses, performs fuzzy matching, generates reasoning traces, and optionally validates claims against external search results. Deep judgment works for both template and rubric evaluation.

What you'll learn:

Enable deep judgment for template verification
Inspect extracted excerpts, reasoning, and hallucination risk
Enable search-based validation for factual claims
Configure deep judgment for rubric traits
Tune deep judgment parameters

What Deep Judgment Does¶

Deep judgment adds a multi-stage evaluation layer between parsing and the final result:

Standard pipeline:                  With deep judgment:
Parse → Verify → Finalize          Parse → Extract Excerpts → Fuzzy Match
                                          → Reason → [Search] → Verify → Finalize

For each template attribute, deep judgment:

Extracts excerpts from the response that relate to the attribute
Fuzzy matches excerpts against ground truth (similarity scoring)
Generates reasoning explaining how the excerpt supports or contradicts the expected value
Optionally searches external sources to validate factual claims
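
The fuzzy-matching step can be illustrated with a small standalone sketch. The page does not say which similarity algorithm karenina uses, so this example substitutes Python's standard-library `difflib.SequenceMatcher`. The helper names `similarity` and `passes_threshold` are hypothetical, and the `0.7` cutoff mirrors the documented default of `deep_judgment_fuzzy_match_threshold`.

```python
from difflib import SequenceMatcher


def similarity(excerpt: str, expected: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case.

    Stand-in for karenina's internal scoring, which is not documented here.
    """
    return SequenceMatcher(None, excerpt.lower(), expected.lower()).ratio()


def passes_threshold(excerpt: str, expected: str, threshold: float = 0.7) -> bool:
    """True when the excerpt is at least `threshold` similar to the expected value."""
    return similarity(excerpt, expected) >= threshold


# Illustrative strings only; they are not taken from a real benchmark.
print(passes_threshold("Paris is the capital of France",
                       "paris is the capital of france"))
print(passes_threshold("Paris", "Berlin"))
```

A character-level ratio like this is only one possible choice; token-based or embedding-based scores would behave differently on paraphrased excerpts.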

Enable DJ for Templates¶

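from karenina import Benchmark
# Import path assumed; adjust it if your karenina version exposes these elsewhere.
from karenina.schemas import ModelConfig, VerificationConfig

# `_tmp` is assumed to be the benchmark checkpoint path defined earlier in the notebook.
benchmark = Benchmark.load(str(_tmp))

config = VerificationConfig(
    answering_models=[
        ModelConfig(id="haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain")
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
    deep_judgment_enabled=True,  # turn on the deep judgment stages
)

results = benchmark.run_verification(config)
print(f"Results with DJ: {len(results)}")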

Parameter	Type	Default	Description
`deep_judgment_enabled`	`bool`	`False`	Enable deep judgment for template verification
`deep_judgment_search_enabled`	`bool`	`False`	Enable external search validation
`deep_judgment_excerpt_retry_attempts`	`int`	`2`	Max retries for excerpt extraction
`deep_judgment_fuzzy_match_threshold`	`float`	`0.7`	Min similarity score for excerpt matching
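
As a rough sketch of tuning these parameters, the overrides can be collected in a plain dictionary and sanity-checked before they are passed to `VerificationConfig`. The field names come from the table above; the chosen values, the `check_dj_overrides` helper, and the assumption that the threshold lies between 0 and 1 are illustrative and not part of karenina.

```python
# Field names are taken from the parameter table; the values are illustrative only.
dj_overrides = {
    "deep_judgment_enabled": True,
    "deep_judgment_excerpt_retry_attempts": 3,   # table default: 2
    "deep_judgment_fuzzy_match_threshold": 0.8,  # table default: 0.7
}


def check_dj_overrides(overrides: dict) -> list:
    """Hypothetical sanity check (not part of karenina): return a list of problems."""
    problems = []
    threshold = overrides.get("deep_judgment_fuzzy_match_threshold", 0.7)
    if not 0.0 <= threshold <= 1.0:
        problems.append(f"threshold {threshold} is outside the assumed 0..1 range")
    retries = overrides.get("deep_judgment_excerpt_retry_attempts", 2)
    if not isinstance(retries, int) or retries < 0:
        problems.append(f"retry attempts must be a non-negative int, got {retries!r}")
    return problems


print(check_dj_overrides(dj_overrides))

# With karenina installed, the overrides would be passed as keyword arguments, e.g.:
# config = VerificationConfig(answering_models=[...], parsing_models=[...],
#                             evaluation_mode="template_only", **dj_overrides)
```

Since the threshold is described as a minimum similarity score, raising it should make excerpt matching stricter, while a higher retry count trades extra model calls for a better chance of extracting usable excerpts.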

Inspect DJ Template Results¶

Extracted Excerpts¶

Each attribute in the template gets a list of supporting excerpts:

for result in results:
    dj = result.deep_judgment
    if dj and dj.deep_judgment_performed:
        print(f"\nQ: {result.metadata.question_text[:50]}")
        for attr, excerpts in (dj.extracted_excerpts or {}).items():
            print(f"  Attribute: {attr}")
            for exc in excerpts:
                print(f"    \"{exc['text'][:60]}\" (confidence: {exc['confidence']}, similarity: {exc['similarity_score']:.2f})")
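
When many excerpts come back, it can help to surface only the weakly matching ones. The sketch below assumes the excerpt dictionaries carry the `text`, `confidence`, and `similarity_score` keys used in the loop above; the `weak_excerpts` helper and the sample data are hypothetical and exist only to make the example runnable without a model call.

```python
def weak_excerpts(extracted_excerpts, threshold=0.7):
    """Return (attribute, excerpt) pairs whose similarity_score is below threshold."""
    weak = []
    for attr, excerpts in (extracted_excerpts or {}).items():
        for exc in excerpts:
            # Excerpts without a score are treated as weak (assumption).
            if exc.get("similarity_score", 0.0) < threshold:
                weak.append((attr, exc))
    return weak


# Made-up data shaped like dj.extracted_excerpts; not real karenina output.
sample = {
    "capital": [
        {"text": "The capital is Paris.", "confidence": "high", "similarity_score": 0.92},
        {"text": "It lies on a river.", "confidence": "low", "similarity_score": 0.41},
    ],
    "population": [
        {"text": "About two million people.", "confidence": "medium", "similarity_score": 0.66},
    ],
}

for attr, exc in weak_excerpts(sample):
    print(f"{attr}: {exc['text']} ({exc['similarity_score']:.2f})")
```

In a real run, `sample` would be replaced by `result.deep_judgment.extracted_excerpts`.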

Reasoning Traces¶

for result in results[:2]:
    dj = result.deep_judgment
    if dj and dj.attribute_reasoning:
        print(f"\nQ: {result.metadata.question_text[:50]}")
        for attr, reasoning in dj.attribute_reasoning.items():
            print(f"  {attr}: {reasoning[:80]}...")

DJ with Search Validation¶

Enable search to validate factual claims against external sources. Requires a search API key (e.g., Tavily):

config_with_search = VerificationConfig(
    answering_models=[
        ModelConfig(id="haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain")
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_only",
    deep_judgment_enabled=True,
    deep_judgment_search_enabled=True,
)

print(f"DJ enabled: {config_with_search.deep_judgment_enabled}")
print(f"Search enabled: {config_with_search.deep_judgment_search_enabled}")

When search is enabled, each excerpt includes hallucination_risk (none/low/medium/high) and supporting search results:

# Inspect search-validated results (question 5 has search data).
# Note: this assumes `results` holds at least five entries from a run with
# deep_judgment_search_enabled=True; otherwise the guard below prints nothing.
result = results[4]
dj = result.deep_judgment
if dj and dj.deep_judgment_search_enabled:
    print(f"Q: {result.metadata.question_text[:50]}")
    for attr, excerpts in (dj.extracted_excerpts or {}).items():
        for exc in excerpts:
            if "search_results" in exc:
                print(f"  Excerpt: \"{exc['text'][:50]}\"")
                print(f"  Hallucination risk: {exc.get('hallucination_risk', 'N/A')}")
                print(f"  Search: {exc['search_results'][:80]}...")
    if dj.hallucination_risk_assessment:
        print(f"  Risk assessment: {dj.hallucination_risk_assessment}")
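
To show how the four per-excerpt risk levels could be ranked, here is a minimal sketch that picks the highest `hallucination_risk` among an attribute's excerpts. karenina already exposes `hallucination_risk_assessment` on the result, so this is not how the library computes it. The `highest_risk` helper, the ordering list, and the sample excerpts are assumptions made for illustration.

```python
# Levels as listed in the text, ordered from least to most severe.
RISK_ORDER = ["none", "low", "medium", "high"]


def highest_risk(excerpts):
    """Return the most severe hallucination_risk among excerpts, or None if absent."""
    risks = [
        exc["hallucination_risk"]
        for exc in excerpts
        if exc.get("hallucination_risk") in RISK_ORDER
    ]
    if not risks:
        return None
    return max(risks, key=RISK_ORDER.index)


# Made-up excerpts shaped like the search-validated entries; not real output.
sample_excerpts = [
    {"text": "Founded in 1889.", "hallucination_risk": "low"},
    {"text": "Won three awards.", "hallucination_risk": "high"},
    {"text": "No search performed."},
]

print(highest_risk(sample_excerpts))
```

Excerpts without a risk label are skipped rather than treated as `none`, so a missing search result is not mistaken for a clean one.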

DJ for Rubrics¶

Deep judgment also works for rubric traits, providing per-trait excerpts and reasoning:

config_rubric_dj = VerificationConfig(
    answering_models=[
        ModelConfig(id="haiku", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain")
    ],
    parsing_models=[
        ModelConfig(id="haiku-parser", model_name="claude-haiku-4-5",
                    model_provider="anthropic", interface="langchain",
                    temperature=0.0)
    ],
    evaluation_mode="template_and_rubric",
    deep_judgment_rubric_mode="enable_all",
)

rubric_dj_results = benchmark.run_verification(config_rubric_dj)
print(f"Rubric DJ results: {len(rubric_dj_results)}")

Inspect Rubric DJ Results¶

for result in rubric_dj_results:
    djr = result.deep_judgment_rubric
    if djr and djr.deep_judgment_rubric_performed:
        print(f"\nQ: {result.metadata.question_text[:40]}")
        # Per-trait reasoning
        for trait, reasoning in (djr.rubric_trait_reasoning or {}).items():
            score = (djr.deep_judgment_rubric_scores or {}).get(trait, "N/A")
            print(f"  {trait} (score={score}): {reasoning[:60]}...")
        # Per-trait excerpts
        for trait, excerpts in (djr.extracted_rubric_excerpts or {}).items():
            print(f"  {trait} excerpts: {len(excerpts)}")

`deep_judgment_rubric_mode`	Behavior
`"disabled"` (default)	No rubric deep judgment
`"enable_all"`	Apply DJ to all LLM rubric traits
`"use_checkpoint"`	Use per-trait settings from checkpoint
`"custom"`	Use `deep_judgment_rubric_config` dict

CLI Equivalent¶

# Enable deep judgment via CLI:
# karenina verify benchmark.jsonld --preset base.json --deep-judgment

# With search validation:
# karenina verify benchmark.jsonld --preset base.json --deep-judgment --deep-judgment-search

# With rubric deep judgment:
# karenina verify benchmark.jsonld --preset base.json --deep-judgment --deep-judgment-rubric-mode all

print("CLI: --deep-judgment, --deep-judgment-search, --deep-judgment-rubric-mode")

Related Pages¶

Full Evaluation: Template+rubric without deep judgment
Basic Verification: Simplest verification path
VerificationConfig Reference: All DJ configuration fields
Advanced Pipeline: Pipeline stage details
