Agentic Rubric Evaluation¶
This page covers the internal machinery of Stage 11b (AgenticRubricEvaluation): how the stage dispatches to individual or shared agent strategies, how the evaluator runs investigation and extraction, and how results flow into FinalizeResult. It is for contributors and power users who need to understand what happens under the hood.
For a conceptual overview of agentic traits and how they differ from other trait types, see the agentic traits concept page. For the general pipeline architecture, see verification-pipeline.md. For the parallel Stage 7b internals (agentic template parsing), see agentic-evaluation.md.
1. Pipeline Position¶
Stage 11b runs after the standard rubric stages, RubricEvaluation (Stage 11) and DeepJudgmentRubricAutoFail (Stage 12), and before FinalizeResult (Stage 13) in the pipeline. It evaluates only `AgenticRubricTrait` instances; standard trait types (LLM, regex, callable, metric) are handled by Stage 11.
The StageOrchestrator.from_config() method includes Stage 11b when the rubric contains agentic traits:
```python
# orchestrator.py
if evaluation_mode == "template_and_rubric" and rubric and rubric.agentic_traits:
    stages.append(AgenticRubricEvaluationStage())
```
In rubric_only mode, the same inclusion logic applies without the evaluation_mode guard.
2. AgenticTraitEvaluator¶
File: src/karenina/benchmark/verification/evaluators/rubric/agentic_trait.py.
AgenticTraitEvaluator is the core evaluation unit. It takes a resolved ModelConfig at construction time and exposes two entry points: evaluate_trait() for the full two-step flow, and run_extraction() for extraction alone (used by the shared strategy).
evaluate_trait()¶
Runs the complete two-step evaluation for a single trait:
- **Investigation** (`_run_investigation()`): launches an agent via `AgentPort.run()` to investigate the response and/or workspace. Returns the raw investigation trace string.
- **Extraction** (`run_extraction()`): sends the investigation trace to `ParserPort.parse_to_pydantic()` to extract a structured score.
Error Handling¶
Errors at each step produce different return signatures:
| Failure point | Returned `(score, trace)` | Rationale |
|---|---|---|
| Agent investigation fails | `(None, None)` | No trace was produced; nothing to preserve |
| Score extraction fails | `(None, investigation_trace)` | The trace has diagnostic value even without a score |
| Both succeed | `(score, investigation_trace)` | Normal result |
Both failure modes log at WARNING level with exc_info=True, so the exception details appear in logs without halting the pipeline.
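The contract above can be sketched as follows. The helper name and the callable-injection style are illustrative, not the actual evaluator code; only the return signatures and the WARNING-level logging come from the description above:

```python
import logging

logger = logging.getLogger(__name__)

def evaluate_trait_sketch(run_investigation, run_extraction):
    """Illustrative sketch of evaluate_trait()'s error-handling contract."""
    try:
        trace = run_investigation()
    except Exception:
        # No trace was produced; nothing to preserve.
        logger.warning("Agent investigation failed", exc_info=True)
        return None, None
    try:
        score = run_extraction(trace)
    except Exception:
        # Keep the trace: it has diagnostic value even without a score.
        logger.warning("Score extraction failed", exc_info=True)
        return None, trace
    return score, trace
```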
_run_investigation()¶
Builds the agent invocation from the trait's configuration:
- **System prompt**: identifies the agent as an evaluation agent, embeds `trait.description`, and states the expected output kind (`trait.kind`)
- **User prompt**: assembled from context mode filtering (see Section 5)
- **Agent config**: `max_turns` from `trait.max_turns`, `timeout` from `trait.timeout_seconds`, `workspace_path` from the resolved workspace
Returns result.raw_trace from the agent run.
run_extraction()¶
This method is public because the shared strategy needs to call it directly (one shared investigation, then per-trait extraction). It dispatches on trait.kind:
| Kind | Target schema | Extracted field |
|---|---|---|
| `boolean` | `SingleBooleanScore` | `.result` (bool) |
| `score` | `SingleNumericScore` | `.score` (int) |
| `literal` | `SingleLiteralClassification` | `.classification` (str, then resolved to int index) |
| `type[BaseModel]` (template kind) | The user's `BaseModel` subclass | `.model_dump()` (dict of all fields) |
For literal traits, _resolve_literal_index() maps the classification string to its position in the trait.classes dict. If the classification does not match any defined class, it returns -1.
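Assuming `trait.classes` is an ordered dict of class name to description, the index resolution can be sketched like this (the function name mirrors `_resolve_literal_index()`, but the body is a guess at the behavior described above):

```python
def resolve_literal_index(classification: str, classes: dict) -> int:
    """Map a classification string to its position in the classes dict."""
    for index, name in enumerate(classes):
        if name == classification:
            return index
    return -1  # classification did not match any defined class
```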
For template kind traits, run_extraction() delegates to _extract_template() instead of the standard extraction flow. See Section 2.1 below.
The extraction prompt includes kind-specific context: score range for score traits, class descriptions for literal traits.
_extract_template()¶
This method handles extraction for template kind traits. Instead of building a score-extraction prompt, it sends the investigation trace to ParserPort.parse_to_pydantic() with the user's BaseModel subclass as the target schema. The parser produces a populated instance of that class, and the method returns model_dump() of the result.
The system prompt instructs the parser to fill every field based on evidence from the investigation. Field descriptions on the BaseModel guide the parser toward correct extraction, making well-described fields important for accuracy.
The returned dict is stored in agentic_trait_scores with dot-notation keys: each field becomes {trait_name}.{field_name}. This flattening happens in Stage 11b's _execute_individual (or _execute_shared) after evaluate_trait() returns.
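The flattening step amounts to a dict comprehension. The helper name here is hypothetical; in the stage this happens inline:

```python
def flatten_template_scores(trait_name: str, fields: dict) -> dict:
    """Flatten a template trait's model_dump() into dot-notation keys."""
    return {f"{trait_name}.{name}": value for name, value in fields.items()}
```

For example, a trait named `code_quality` whose template has fields `has_tests` and `style` would yield the keys `code_quality.has_tests` and `code_quality.style`.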
3. Strategy Dispatch¶
File: src/karenina/benchmark/verification/stages/pipeline/agentic_rubric_evaluation.py.
The stage's execute() method reads context.agentic_rubric_strategy (default: "individual") and dispatches accordingly.
Individual Strategy (_execute_individual)¶
Evaluates each trait with its own agent. For every AgenticRubricTrait:
- Resolve the model via `_resolve_model()` (see Section 3.1)
- Create an `AgenticTraitEvaluator` with the resolved model
- Call `evaluator.evaluate_trait()` for the full investigation + extraction cycle
- Collect `(score, trace)` into the result dicts
Traits whose resolved model lacks agent_factory support are skipped with (None, None).
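The steps above amount to a loop like the following. The collaborator signatures (`resolve_model`, `make_evaluator`) are assumptions for illustration, not the stage's actual internals:

```python
def run_individual(traits, resolve_model, make_evaluator):
    """Sketch of the individual strategy: one agent per trait."""
    scores, traces = {}, {}
    for trait in traits:
        model = resolve_model(trait)
        if model is None:
            # Interface lacks agent_factory support: skip with (None, None).
            scores[trait.name], traces[trait.name] = None, None
            continue
        evaluator = make_evaluator(model)
        scores[trait.name], traces[trait.name] = evaluator.evaluate_trait(trait)
    return scores, traces
```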
Shared Strategy (_execute_shared)¶
Evaluates all traits with a single shared agent, then extracts per-trait scores:
- Resolve models for all traits
- Verify all valid models are identical (same `interface`, `model_provider`, `model_name`). If they differ, fall back to the individual strategy automatically.
- Build a combined investigation prompt listing all trait descriptions
- Run one shared agent investigation
- For each trait, call `evaluator.run_extraction()` against the shared trace
If the shared investigation fails (agent exception), the stage falls back to the individual strategy. Per-trait extraction failures within the shared strategy set score=None for that trait while preserving the shared trace.
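The identity check in the second step can be sketched as follows. The field names come from the text above; the helper itself is hypothetical:

```python
def models_compatible(models) -> bool:
    """All valid (non-None) models must share interface, provider, and name."""
    keys = {
        (m.interface, m.model_provider, m.model_name)
        for m in models
        if m is not None
    }
    return len(keys) == 1
```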
3.1. Model Resolution (_resolve_model)¶
_resolve_model() runs once per trait to produce the model configuration for that trait's agent. After resolution, the method checks `AdapterRegistry.get_spec(model.interface)` to confirm that the interface has an `agent_factory` registered. If no agent support is available, `_resolve_model()` returns `None` and the trait is skipped.
4. Artifact Contract¶
Requires¶
| Key | Source Stage |
|---|---|
| `RAW_LLM_RESPONSE` | GenerateAnswer (Stage 2) |
Produces¶
| Key | Type | Description |
|---|---|---|
| `AGENTIC_RUBRIC_EVALUATION_PERFORMED` | `bool` | Always `True` when the stage runs |
| `AGENTIC_TRAIT_SCORES` | `dict[str, int \| bool \| None]` | Trait name to score. `None` indicates evaluation failure. |
| `AGENTIC_TRAIT_INVESTIGATION_TRACES` | `dict[str, str \| None]` | Trait name to investigation trace. `None` if the agent failed before producing output. |
All three are stored as both artifacts (for downstream stages) and result fields (for FinalizeResult), using set_artifact_and_result().
should_run() Conditions¶
The stage skips itself when any of the following are true:
- A prior stage set
context.error context.rubricisNonerubric.agentic_traitsis empty
Note that the orchestrator already gates stage inclusion by evaluation_mode and the presence of rubric.agentic_traits, so should_run() only needs to check runtime conditions.
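A minimal sketch of these runtime checks; the real method lives on the stage class and reads a `VerificationContext`, so this standalone version is illustrative only:

```python
def should_run(context) -> bool:
    """Skip on a prior error, a missing rubric, or no agentic traits."""
    if context.error is not None:
        return False
    if context.rubric is None:
        return False
    return bool(context.rubric.agentic_traits)
```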
5. Context Mode Filtering¶
Each AgenticRubricTrait has a context_mode field that controls what the investigation agent receives. The evaluator builds the user prompt accordingly:
| `context_mode` | Agent sees trace? | Agent sees workspace? | Use case |
|---|---|---|---|
| `workspace_only` | No | Yes | Strictest: agent must discover everything independently from workspace artifacts |
| `trace_and_workspace` | Yes | Yes | Balanced: agent reviews the answering trace and can verify workspace artifacts |
| `trace_only` | Yes | No | No workspace access; useful when evaluation depends only on response content |
The prompt construction logic in _run_investigation():
```python
if trait.context_mode in ("trace_and_workspace", "trace_only") and raw_llm_response:
    user_parts.append(f"\n--- ANSWERING AGENT TRACE ---\n{raw_llm_response}\n--- END TRACE ---")
if workspace_path and trait.context_mode != "trace_only":
    user_parts.append(f"\nWorkspace directory: {workspace_path}")
```
The question text is always included, regardless of mode.
6. Shared Strategy Merging¶
When the shared strategy runs a single agent for multiple traits, the stage must reconcile potentially different per-trait configurations:
| Parameter | Merge rule | Rationale |
|---|---|---|
| `max_turns` | `max()` across all valid traits | The shared agent must have enough turns to investigate the most demanding trait |
| `timeout_seconds` | `max()` across all valid traits | Same reasoning for timeout |
| Include trace | `any()` trait has `trace_and_workspace` or `trace_only` | Union: if any trait needs the trace, include it |
| Include workspace | `any()` trait has `trace_and_workspace` or `workspace_only` | Union: if any trait needs workspace access, include it |
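The merge rules reduce to a few lines. The helper name and the returned dict shape are assumptions for illustration:

```python
def merge_agent_config(traits) -> dict:
    """Combine per-trait settings into the config for the single shared agent."""
    return {
        "max_turns": max(t.max_turns for t in traits),
        "timeout": max(t.timeout_seconds for t in traits),
        "include_trace": any(
            t.context_mode in ("trace_and_workspace", "trace_only") for t in traits
        ),
        "include_workspace": any(
            t.context_mode in ("trace_and_workspace", "workspace_only") for t in traits
        ),
    }
```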
The combined investigation prompt lists all trait descriptions as bullet points, and the agent is instructed to "report findings for each criterion clearly so scores can be extracted per trait."
7. Stage Interactions¶
FinalizeResult Condition¶
FinalizeResult (Stage 13) creates the VerificationResultRubric sub-object when either standard rubric evaluation or agentic rubric evaluation was performed:
```python
agentic_evaluation_performed = context.get_result_field(
    ArtifactKeys.AGENTIC_RUBRIC_EVALUATION_PERFORMED, False
)
if rubric_evaluation_performed or agentic_evaluation_performed:
    # Build VerificationResultRubric ...
```
This means a rubric with only agentic traits (no LLM/regex/callable/metric traits) will still produce a rubric result section, because agentic_evaluation_performed is True even when rubric_evaluation_performed is False.
Within VerificationResultRubric, agentic results occupy two dedicated fields:
| Field | Type | Source |
|---|---|---|
| `agentic_trait_scores` | `dict[str, int \| bool] \| None` | `AGENTIC_TRAIT_SCORES` result field |
| `agentic_trait_investigation_traces` | `dict[str, str] \| None` | `AGENTIC_TRAIT_INVESTIGATION_TRACES` result field |
Runner Auto-Upgrade Check¶
In run_single_model_verification(), the runner automatically upgrades evaluation_mode from "template_only" to "template_and_rubric" when a rubric with any traits (including agentic traits) is provided:
```python
if (
    rubric
    and (
        rubric.llm_traits or rubric.regex_traits or rubric.callable_traits
        or rubric.metric_traits or rubric.agentic_traits
    )
    and evaluation_mode == "template_only"
):
    evaluation_mode = "template_and_rubric"
```
This ensures that passing a rubric with only agentic traits triggers Stage 11b without requiring the caller to explicitly set evaluation_mode.
Orchestrator Registration Order¶
The orchestrator places Stage 11b after all standard rubric stages:
... → RubricEvaluationStage (11) → DeepJudgmentRubricAutoFailStage (12) → AgenticRubricEvaluationStage (11b) → FinalizeResultStage (13)
In rubric_only mode, Stage 11b appears after the standard rubric + deep judgment block and before FinalizeResult. Stage 11b does not depend on Stage 11's output; each handles disjoint trait types.
Other Stage Interactions¶
| Stage | Interaction |
|---|---|
| GenerateAnswer (2) | Produces RAW_LLM_RESPONSE, the only artifact Stage 11b requires. Also resolves workspace (used by agentic traits that need workspace access). |
| RubricEvaluation (11) | Handles LLM, regex, callable, and metric traits. Does not touch agentic traits; the two stages operate on disjoint trait sets. |
| DeepJudgmentRubricAutoFail (12) | Operates on standard rubric traits. Does not interact with agentic trait results. |
| FinalizeResult (13) | Reads agentic result fields and wires them into VerificationResultRubric (see above). |
8. Trace Materialization¶
When any AgenticRubricTrait in the rubric has materialize_trace=True, Stage 11b writes the answering agent trace to a file instead of inlining it in the investigation prompt. This is useful for long traces that would consume excessive context.
_write_trace_file staticmethod¶
File: benchmark/verification/stages/pipeline/agentic_rubric_evaluation.py
_write_trace_file() places the trace file under <workspace>/.karenina/traces/ when a workspace path is available. If no workspace is set, it creates a temporary directory as a fallback. The filename encodes the question ID and, when present, the scenario turn number. The method returns the Path to the written file.
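The path logic might look roughly like the following. The exact filename format is an assumption; only the `.karenina/traces/` location, the temporary-directory fallback, and the question-ID/turn encoding are stated above:

```python
import tempfile
from pathlib import Path

def trace_file_path(workspace, question_id, turn=None):
    """Sketch of where _write_trace_file() places the trace file."""
    if workspace:
        base = Path(workspace) / ".karenina" / "traces"
    else:
        base = Path(tempfile.mkdtemp())  # fallback when no workspace is set
    suffix = f"_turn{turn}" if turn is not None else ""
    return base / f"trace_{question_id}{suffix}.txt"
```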
Stage-Level Lifecycle¶
The stage handles materialization in a single pass, not per-trait:
- Before evaluating any trait, the stage checks `any(t.materialize_trace for t in traits)`. If true, it calls `_write_trace_file()` once.
- The resulting `trace_file_path` is passed to `evaluate_trait()` for every trait, regardless of whether that specific trait has `materialize_trace=True`. The evaluator checks `trait.materialize_trace` before using the path.
- After all evaluations complete, the stage checks `any(t.persist_trace for t in traits)`. If no trait requests persistence, the trace file is deleted. If any trait sets `persist_trace=True`, the file remains.
Prompt Behavior¶
When materialize_trace=True and a trace_file_path is available, _run_investigation() replaces the inline trace with a reference:
```text
The full agent trace is saved to: /path/to/.karenina/traces/trace_q_xyz.txt
Use file tools (grep, search, read) to examine it.
```
This allows the investigation agent to selectively search the trace using file tools rather than processing the entire trace in its context window.
Interaction with Context Modes¶
materialize_trace=True requires context_mode to include the trace ("trace_only" or "trace_and_workspace"). Setting it with context_mode="workspace_only" raises a ValueError at validation time, because there is no trace to materialize.
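The rule reduces to a single check, sketched here as a standalone function; in practice it is a schema-level validator on `AgenticRubricTrait`, and the function name is hypothetical:

```python
def validate_materialize_trace(materialize_trace, context_mode):
    """Reject materialize_trace=True when there is no trace to materialize."""
    if materialize_trace and context_mode == "workspace_only":
        raise ValueError(
            "materialize_trace=True requires context_mode 'trace_only' "
            "or 'trace_and_workspace'"
        )
```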
9. Key File Reference¶
| Domain | File (relative to karenina/src/karenina/) |
|---|---|
| Stage implementation (Stage 11b) | benchmark/verification/stages/pipeline/agentic_rubric_evaluation.py |
| Evaluator (investigation + extraction) | benchmark/verification/evaluators/rubric/agentic_trait.py |
| AgenticRubricTrait schema | schemas/entities/rubric.py |
| ArtifactKeys (agentic rubric section) | benchmark/verification/stages/core/base.py |
| VerificationContext (agentic rubric fields) | benchmark/verification/stages/core/base.py |
| Stage orchestrator (Stage 11b registration) | benchmark/verification/stages/core/orchestrator.py |
| Pipeline runner (auto-upgrade, config threading) | benchmark/verification/runner.py |
| FinalizeResult (rubric result assembly) | benchmark/verification/stages/pipeline/finalize_result.py |
| VerificationResultRubric (agentic fields) | schemas/verification/result_components.py |
| Extraction output schemas | schemas/outputs/rubric.py |
| Adapter registry (agent_factory check) | adapters/registry.py |
10. Next Steps¶
- Agentic Traits: conceptual overview and usage guide
- Agentic Evaluation: Stage 7b internals (agentic template parsing)
- Verification Pipeline: the 13-stage execution engine
- Rubrics: rubric architecture and trait types
- Deep Judgment Rubrics: deep judgment for standard rubric traits