Agentic Evaluation

This page covers the internal machinery of agentic evaluation: how the pipeline detects agentic adapters, resolves workspaces, runs the two-step investigation/extraction parse stage, and cleans up afterward. It is for contributors and power users who need to understand what happens under the hood.

For a conceptual overview and usage guide, see the agentic evaluation concept page. For the general pipeline architecture, see verification-pipeline.md.

1. AdapterSpec.agent_tier

File: src/karenina/adapters/registry.py, AdapterSpec dataclass.

agent_tier: str = "tool_loop"

This field distinguishes runtimes that are themselves agents with built-in tools (e.g., Claude Code, agent_tier="deep_agent") from scaffolded adapters where the adapter explicitly orchestrates each tool call turn (LangChain, Claude Tool, agent_tier="tool_loop").

The GenerateAnswer stage (stage 2) checks this field to decide whether to use AgentPort or LLMPort for the answering step:

# generate_answer.py, Step 2
spec = AdapterRegistry.get_spec(answering_model.interface)
use_agent = bool(answering_model.mcp_urls_dict) or (spec is not None and spec.agent_tier == "deep_agent")

When use_agent=True, the stage calls AgentPort.run(), which captures the full conversation trace: tool calls, tool results, and intermediate reasoning. When False, it calls LLMPort.invoke(), which returns only the final text response.
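The routing decision is a pure predicate over the adapter spec and the model config. A minimal sketch, with AdapterSpec and ModelConfig reduced to stand-ins for the real karenina types and the registry shrunk to two entries:

```python
from dataclasses import dataclass, field


@dataclass
class AdapterSpec:
    agent_tier: str = "tool_loop"


@dataclass
class ModelConfig:
    interface: str
    mcp_urls_dict: dict = field(default_factory=dict)


# Simplified stand-in for AdapterRegistry.
REGISTRY = {
    "claude_agent_sdk": AdapterSpec(agent_tier="deep_agent"),
    "langchain": AdapterSpec(agent_tier="tool_loop"),
}


def use_agent_port(model: ModelConfig) -> bool:
    """True when the answering step should go through AgentPort."""
    spec = REGISTRY.get(model.interface)
    return bool(model.mcp_urls_dict) or (spec is not None and spec.agent_tier == "deep_agent")


# A deep-agent runtime always takes the AgentPort path; a tool-loop adapter
# only does so when MCP servers are attached.
assert use_agent_port(ModelConfig("claude_agent_sdk"))
assert not use_agent_port(ModelConfig("langchain"))
assert use_agent_port(ModelConfig("langchain", {"fs": "http://localhost:8080"}))
```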

| Interface | agent_tier | Reason |
|---|---|---|
| claude_agent_sdk | "deep_agent" | The Claude CLI binary is itself an agent; the LLMPort path would lose tool call traces |
| langchain | "tool_loop" | The adapter orchestrates tool calls explicitly |
| claude_tool | "tool_loop" | Same: adapter-orchestrated |
| manual | "tool_loop" | Pre-recorded traces; no live agent |
| openai_endpoint | "tool_loop" | Routes to langchain |
| openrouter | "tool_loop" | Routes to langchain |

See Adapters for the full adapter reference.

2. Workspace Resolution (GenerateAnswer Stage)

File: src/karenina/benchmark/verification/stages/pipeline/generate_answer.py, method _resolve_workspace().

When agentic_parsing=True, the GenerateAnswer stage resolves a workspace directory before invoking the answering agent. This workspace is the working directory the agent operates in.

Resolution Logic

workspace_root (from Benchmark)
  + question_workspace_path (from Question.workspace_path)
  = source directory

If workspace_copy=True:
  source is copied to: workspace_root / {workspace_path}_run_{timestamp}_pid{pid}
  context.workspace_path = the copy (safe to modify)
  context.workspace_is_copy = True

If workspace_copy=False:
  context.workspace_path = source directly (destructive)
  context.workspace_is_copy = False

If question_workspace_path is None:
  An empty directory is created: workspace_root / {question_id}_run_{timestamp}_pid{pid}
  context.workspace_path = new directory
  context.workspace_is_copy = True

The unique suffix (run_{timestamp}_pid{pid}) includes the replicate number when present, ensuring parallel verification runs do not collide:

suffix = f"run_{timestamp}_pid{os.getpid()}"
if context.replicate is not None:
    suffix += f"_rep{context.replicate}"

The resolved workspace_path is passed to AgentConfig.workspace_path, which the Claude SDK adapter wires to ClaudeAgentOptions.cwd. This makes the agent's file system operations (Read, Bash, etc.) operate relative to the resolved workspace.
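The three branches above can be condensed into one function. This is a hedged sketch, not the real _resolve_workspace: names and the exact suffix wiring are simplified, but the branching, the unique suffix, and the missing-directory precondition follow the description here:

```python
from __future__ import annotations

import os
import shutil
import time
from pathlib import Path


def resolve_workspace(
    workspace_root: Path,
    question_workspace_path: str | None,
    question_id: str,
    workspace_copy: bool = True,
    replicate: int | None = None,
) -> tuple[Path, bool]:
    """Return (workspace_path, workspace_is_copy) per the rules above."""
    suffix = f"run_{int(time.time())}_pid{os.getpid()}"
    if replicate is not None:
        suffix += f"_rep{replicate}"

    if question_workspace_path is None:
        # No per-question workspace: create a fresh empty directory.
        target = workspace_root / f"{question_id}_{suffix}"
        target.mkdir(parents=True)
        return target, True

    source = workspace_root / question_workspace_path
    if not source.is_dir():
        raise RuntimeError(f"workspace does not exist: {source}")

    if workspace_copy:
        # Safe mode: the agent works on a disposable copy.
        target = workspace_root / f"{question_workspace_path}_{suffix}"
        shutil.copytree(source, target)
        return target, True

    # Destructive mode: the agent works on the original directory.
    return source, False
```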

Preconditions

  • context.workspace_root must be set when agentic_parsing=True. If it is None, the stage raises RuntimeError.
  • If question_workspace_path points to a directory that does not exist under workspace_root, the stage raises RuntimeError.

3. AgenticParseTemplateStage (Stage 7b)

File: src/karenina/benchmark/verification/stages/pipeline/agentic_parse_template.py.

This stage replaces ParseTemplateStage (stage 7a) when agentic_parsing=True in VerificationConfig. The selection happens in StageOrchestrator.from_config():

# orchestrator.py
if agentic_parsing:
    stages.append(AgenticParseTemplateStage())
else:
    stages.append(ParseTemplateStage())

Only one of stage 7a or 7b is ever present in a pipeline run; they are never both instantiated.

Artifact Contract

| Direction | Artifact Keys |
|---|---|
| Requires | RAW_ANSWER, ANSWER, RAW_LLM_RESPONSE |
| Produces | PARSED_ANSWER, PARSING_MODEL_STR, INVESTIGATION_TRACE, AGENTIC_PARSING_PERFORMED |

should_run() Conditions

The stage skips itself when any of the following are true:

  • A prior stage set context.error
  • agentic_parsing is False
  • recursion_limit_reached is True
  • Trace validation failed
  • Abstention was detected
  • Sufficiency was detected as insufficient
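The skip logic amounts to a single predicate. A minimal sketch, with the context reduced to just the flags listed above (the real VerificationContext carries many more fields):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Ctx:
    error: str | None = None
    agentic_parsing: bool = True
    recursion_limit_reached: bool = False
    trace_validation_failed: bool = False
    abstention_detected: bool = False
    sufficiency_insufficient: bool = False


def should_run(ctx: Ctx) -> bool:
    """Stage 7b runs only when none of the skip conditions hold."""
    return not (
        ctx.error is not None
        or not ctx.agentic_parsing
        or ctx.recursion_limit_reached
        or ctx.trace_validation_failed
        or ctx.abstention_detected
        or ctx.sufficiency_insufficient
    )


assert should_run(Ctx())
assert not should_run(Ctx(abstention_detected=True))
```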

Step 1: Investigation

Calls AgentPort.run() (via get_agent(context.parsing_model)) with:

  • System prompt: instructs the agent to independently verify the answering agent's work, with the JSON schema of the answer template (from build_parsing_schema()) embedded as the target structure
  • User prompt: built from the question text plus context controlled by agentic_judge_context
  • Workspace path: for tool access (Read, Bash, etc.)
  • Agent config: max_turns and timeout from agentic_parsing_max_turns and agentic_parsing_timeout

The agentic_judge_context field controls what context the investigation agent receives:

| Mode | Agent sees | Use case |
|---|---|---|
| workspace_only | Only workspace path (agent must discover everything independently) | Strictest evaluation; agent cannot be influenced by the answering trace |
| trace_and_workspace | Answering trace + workspace path | Balanced; agent can review the answering agent's reasoning and verify artifacts |
| trace_only | Only answering trace (equivalent to classical stage 7a) | No workspace access; useful when workspace is not relevant |

The return value is the raw text of the investigation agent's conversation trace.

Step 2: Extraction

Calls ParserPort.parse_to_pydantic() (via get_parser(context.parsing_model)) with:

  • System prompt: instructs the parser to extract structured data from the investigation report
  • Input: the investigation trace from Step 1
  • Target schema: the answer template class (same Pydantic BaseAnswer subclass that classical stage 7a would use)

The return value is a parsed answer instance, identical in type to what ParseTemplateStage would produce. All downstream stages (VerifyTemplate, EmbeddingCheck, etc.) work identically regardless of which parse stage ran.
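The two steps compose into investigate-then-extract. The sketch below uses stand-in ports with invented behavior to show the data flow only; the real stage obtains ports via get_agent()/get_parser() and builds its prompts from the question, the parsing schema, and agentic_judge_context:

```python
class StubAgentPort:
    """Stand-in: pretends to investigate a workspace and report findings."""

    def run(self, system_prompt: str, user_prompt: str) -> str:
        return "Checked output.txt in the workspace; the computed value is 42."


class StubParserPort:
    """Stand-in: pretends to extract the template fields from the report."""

    def parse_to_pydantic(self, text: str, schema: dict) -> dict:
        value = int(text.rsplit(" ", 1)[-1].rstrip("."))
        return {"value": value}


def agentic_parse(agent, parser, schema: dict):
    # Step 1: independent investigation, returning a free-text trace.
    trace = agent.run("Verify the answering agent's work.", "What is the value?")
    # Step 2: structured extraction from the investigation trace.
    parsed = parser.parse_to_pydantic(trace, schema)
    return parsed, trace


parsed, trace = agentic_parse(StubAgentPort(), StubParserPort(), {"value": "int"})
assert parsed == {"value": 42}
```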

Result Storage

The stage stores four artifacts and two result builder fields:

context.set_artifact(ArtifactKeys.PARSED_ANSWER, parsed_answer)
context.set_artifact(ArtifactKeys.PARSING_MODEL_STR, model_str)
context.set_artifact(ArtifactKeys.INVESTIGATION_TRACE, investigation_trace)
context.set_artifact(ArtifactKeys.AGENTIC_PARSING_PERFORMED, True)

context.set_result_field(ArtifactKeys.INVESTIGATION_TRACE, investigation_trace)
context.set_result_field(ArtifactKeys.AGENTIC_PARSING_PERFORMED, True)

The stage also sets TEMPLATE_EVALUATOR to None and DEEP_JUDGMENT_PERFORMED to False, since agentic parsing does not use the classical template evaluator or deep judgment extraction.

4. Ground Truth Stripping in BaseAnswer.model_json_schema()

File: src/karenina/schemas/entities/answer.py.

VerifiedField stores ground truth and verification metadata in json_schema_extra["__verification__"]. This metadata must never reach the judge LLM, as it would leak correct answers.

BaseAnswer overrides model_json_schema() to recursively strip all __verification__ keys from the generated JSON schema:

@classmethod
def model_json_schema(cls, *args, **kwargs):
    schema = super().model_json_schema(*args, **kwargs)

    def _strip_verification(obj):
        if isinstance(obj, dict):
            obj.pop("__verification__", None)
            for value in obj.values():
                _strip_verification(value)
        elif isinstance(obj, list):
            for item in obj:
                _strip_verification(item)

    _strip_verification(schema)
    return schema

This protects all code paths that generate JSON schemas: the Claude SDK parser adapter, the agentic investigation stage, and build_parsing_schema(). The stripping happens at the source rather than per-adapter, so any new code path that calls model_json_schema() is automatically protected.

Extraction hints (field descriptions used in prompt assembly) flow through Pydantic FieldInfo objects, not through model_json_schema(), so they are unaffected by this stripping.
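The stripping behavior is easy to observe on a hand-built schema dict, without involving Pydantic at all (the helper below is the same recursive function shown above):

```python
def _strip_verification(obj):
    if isinstance(obj, dict):
        obj.pop("__verification__", None)
        for value in obj.values():
            _strip_verification(value)
    elif isinstance(obj, list):
        for item in obj:
            _strip_verification(item)


schema = {
    "properties": {
        "answer": {
            "type": "integer",
            "__verification__": {"ground_truth": 42},  # must not reach the judge
        }
    },
    "anyOf": [{"__verification__": {"ground_truth": "secret"}}],
}
_strip_verification(schema)
assert schema["properties"]["answer"] == {"type": "integer"}
assert schema["anyOf"] == [{}]
```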

5. Workspace Cleanup (FinalizeResult Stage)

File: src/karenina/benchmark/verification/stages/pipeline/finalize_result.py.

FinalizeResult (stage 13) handles workspace cleanup after all other stages have completed. It applies triple-guard protection before deleting any directory:

  1. context.workspace_path is set (a workspace was resolved by GenerateAnswer)
  2. context.workspace_cleanup is True (the workspace_cleanup setting from VerificationConfig)
  3. context.workspace_is_copy is True (the directory is a working copy, not an original)

if context.workspace_path and context.workspace_cleanup and context.workspace_is_copy:
    try:
        shutil.rmtree(context.workspace_path)
    except Exception:
        logger.warning("Failed to clean up workspace: %s", context.workspace_path, exc_info=True)

Only working copies created by workspace_copy=True (or freshly created empty directories) are eligible for cleanup. Original workspace directories are never deleted. Cleanup failures are logged as warnings but do not affect the pipeline result.

6. VerificationResultTemplate Extensions

File: src/karenina/schemas/verification/result_components.py.

Two fields on VerificationResultTemplate carry agentic parsing data into the result:

| Field | Type | Set by |
|---|---|---|
| investigation_trace | str \| None | AgenticParseTemplateStage via context.set_result_field() |
| agentic_parsing_performed | bool | AgenticParseTemplateStage via context.set_result_field() |

These are wired into the VerificationResultTemplate constructor by FinalizeResult:

template = VerificationResultTemplate(
    ...
    investigation_trace=context.get_result_field(ArtifactKeys.INVESTIGATION_TRACE),
    agentic_parsing_performed=context.get_result_field(ArtifactKeys.AGENTIC_PARSING_PERFORMED, False),
    ...
)

When agentic parsing was not used, investigation_trace is None and agentic_parsing_performed is False.

7. ArtifactKeys for Agentic Parsing

File: src/karenina/benchmark/verification/stages/core/base.py, class ArtifactKeys.

Three constants in the "Agentic Parsing" section:

INVESTIGATION_TRACE = "investigation_trace"
WORKSPACE_PATH = "workspace_path"
AGENTIC_PARSING_PERFORMED = "agentic_parsing_performed"

INVESTIGATION_TRACE and AGENTIC_PARSING_PERFORMED are used as both artifact keys and result field keys. WORKSPACE_PATH is used only as an artifact key (the workspace path is not persisted in the result).

8. Pipeline Threading

The agentic configuration flows from the Benchmark facade through the batch runner and into each individual pipeline context. The chain:

Benchmark.workspace_root
  -> Benchmark.run_verification(config, ...)
    -> run_verification_batch(workspace_root=self._workspace_root, ...)
      -> generate_task_queue(workspace_root=..., ...)
        -> task dict["workspace_root"]  (overrides config value)
        -> task dict includes extract_feature_flags(config):
             agentic_parsing, agentic_judge_context, agentic_parsing_max_turns,
             agentic_parsing_timeout, workspace_copy, workspace_cleanup
      -> _run_single_task(task)
        -> run_single_model_verification(workspace_root=..., ...)
          -> VerificationContext(workspace_root=..., agentic_parsing=..., ...)
            -> GenerateAnswer._resolve_workspace()

VerificationConfig fields (agentic_parsing, workspace_copy, workspace_cleanup, agentic_judge_context, agentic_parsing_max_turns, agentic_parsing_timeout) flow via extract_feature_flags(config) into each task dict. The workspace_root is provided separately by the Benchmark facade and overrides any value in the config at the task queue generation step.
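A plausible sketch of that flag projection (the real extract_feature_flags lives in utils/task_helpers.py and presumably reads attributes off a VerificationConfig object; the mapping-based signature here is an illustrative assumption):

```python
# Hypothetical flag names taken from the list above.
AGENTIC_FLAGS = (
    "agentic_parsing",
    "agentic_judge_context",
    "agentic_parsing_max_turns",
    "agentic_parsing_timeout",
    "workspace_copy",
    "workspace_cleanup",
)


def extract_feature_flags(config: dict) -> dict:
    """Copy the agentic settings from a config mapping into a task dict."""
    return {flag: config[flag] for flag in AGENTIC_FLAGS if flag in config}


task = {"question_id": "q1"}
task.update(extract_feature_flags({"agentic_parsing": True, "workspace_copy": False}))
assert task["agentic_parsing"] is True and task["workspace_copy"] is False
```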

VerificationContext Fields

The following VerificationContext fields (in stages/core/base.py) control agentic evaluation at runtime:

| Field | Type | Default | Source |
|---|---|---|---|
| agentic_parsing | bool | False | VerificationConfig via extract_feature_flags |
| agentic_judge_context | str | "workspace_only" | VerificationConfig via extract_feature_flags |
| agentic_parsing_max_turns | int | 15 | VerificationConfig via extract_feature_flags |
| agentic_parsing_timeout | float | 120.0 | VerificationConfig via extract_feature_flags |
| question_workspace_path | str \| None | None | Question.workspace_path via task dict |
| workspace_path | Path \| None | None | Set by GenerateAnswer._resolve_workspace() |
| workspace_is_copy | bool | False | Set by GenerateAnswer._resolve_workspace() |
| workspace_root | Path \| None | None | Benchmark.workspace_root via task dict |
| workspace_copy | bool | True | VerificationConfig via extract_feature_flags |
| workspace_cleanup | bool | True | VerificationConfig via extract_feature_flags |
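The fields and defaults above can be read directly as a dataclass. A sketch of just these fields (the real VerificationContext in stages/core/base.py carries many more):

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgenticContextFields:
    agentic_parsing: bool = False
    agentic_judge_context: str = "workspace_only"
    agentic_parsing_max_turns: int = 15
    agentic_parsing_timeout: float = 120.0
    question_workspace_path: str | None = None
    workspace_path: Path | None = None
    workspace_is_copy: bool = False
    workspace_root: Path | None = None
    workspace_copy: bool = True
    workspace_cleanup: bool = True


# Defaults describe a non-agentic run with safe workspace handling.
ctx = AgenticContextFields()
assert ctx.agentic_parsing is False
assert ctx.workspace_copy and ctx.workspace_cleanup
```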

9. Interaction with Other Pipeline Stages

Agentic parsing affects several other stages:

| Stage | Interaction |
|---|---|
| ValidateTemplate (1) | Unaffected. Runs identically; produces the ANSWER and RAW_ANSWER artifacts that Stage 7b requires. |
| GenerateAnswer (2) | Resolves workspace when agentic_parsing=True. Also uses AgentPort when agent_tier=="deep_agent". |
| RecursionLimitAutoFail (3) | If the answering agent hit the recursion limit, Stage 7b skips itself. |
| AbstentionCheck (5) | If abstention is detected, Stage 7b skips itself. |
| SufficiencyCheck (6) | If the response is insufficient, Stage 7b skips itself. |
| VerifyTemplate (8) | Unaffected. Receives the same PARSED_ANSWER artifact regardless of whether Stage 7a or 7b produced it. |
| EmbeddingCheck (9) | Unaffected. Checks field_verification_result produced by Stage 8. |
| DeepJudgmentAutoFail (10) | Skips itself when agentic parsing was used, because Stage 7b sets DEEP_JUDGMENT_PERFORMED to False. |
| FinalizeResult (13) | Reads INVESTIGATION_TRACE and AGENTIC_PARSING_PERFORMED from the result builder. Handles workspace cleanup. |

10. Key File Reference

| Domain | File (relative to karenina/src/karenina/) |
|---|---|
| AdapterSpec with agent_tier | adapters/registry.py |
| Claude SDK registration (sets agent_tier="deep_agent") | adapters/claude_agent_sdk/registration.py |
| Workspace resolution and answer generation | benchmark/verification/stages/pipeline/generate_answer.py |
| Agentic parse stage (Stage 7b) | benchmark/verification/stages/pipeline/agentic_parse_template.py |
| Classical parse stage (Stage 7a) | benchmark/verification/stages/pipeline/parse_template.py |
| Stage orchestrator (selects 7a vs 7b) | benchmark/verification/stages/core/orchestrator.py |
| ArtifactKeys and VerificationContext | benchmark/verification/stages/core/base.py |
| Ground truth stripping | schemas/entities/answer.py |
| Result components (investigation_trace field) | schemas/verification/result_components.py |
| Workspace cleanup | benchmark/verification/stages/pipeline/finalize_result.py |
| Feature flag extraction | benchmark/verification/utils/task_helpers.py |
| JSON schema builder | benchmark/verification/utils/schema_builder.py |
| Pipeline runner | benchmark/verification/runner.py |
| Batch runner | benchmark/verification/batch_runner.py |
| Benchmark facade | benchmark/benchmark.py |

11. Next Steps