Outcome Criteria¶

Outcome criteria are declarative assertions evaluated after a scenario completes. Each criterion inspects the full ScenarioExecutionResult and returns a boolean or numeric result. They are distinct from answer templates: answer templates verify individual turn results (did the model get this specific question right?), while outcome criteria verify the scenario as a whole (did the model behave correctly across the full conversation?).

This page explains how outcome criteria are constructed, how the runner evaluates them, and which patterns cover common evaluation needs. For building the graph that produces the execution result, see Building Scenarios. For how edges route between turns, see State and Routing.

1. What It Is¶

An outcome criterion is a named assertion attached to a Scenario. After the runner executes all turns and collects their results, it evaluates each criterion against the ScenarioExecutionResult and stores the outcome in result.outcome_results.

Two interfaces exist for adding criteria:

scenario.add_outcome(name, check, *, description=""): the primary path. Takes a declarative check node and wraps it in a ScenarioOutcomeCriterion automatically. Fully serializable.
scenario.add_outcome_criterion(criterion): the direct path. Accepts a ScenarioOutcomeCriterion instance. Use this when you need the callable escape hatch.

Outcome criteria complement answer templates but operate at a different scope:

	Answer template	Outcome criterion
Scope	Single turn	Full scenario
Input	Raw LLM response	`ScenarioExecutionResult`
Output	Pass/fail + parsed fields	Boolean or int
Runs	During the turn	After all turns complete
Required	Yes (each node needs a question)	No (optional, added per scenario)

2. Core Idea¶

Templates verify turns; outcomes verify scenarios. A model might answer every turn correctly but still fail an outcome criterion. For example, a sycophancy scenario might have a model answer turn 0 correctly and turn 1 correctly because it echoed the challenger's (wrong) framing rather than maintaining its original answer. Turn-level verify_result would be True for both turns, but a cross-turn criterion checking that the turn 1 answer remains consistent with turn 0 would fail.

Outcome criteria compose per-turn results into scenario-level judgments. They have access to the entire execution history, including raw responses, parsed fields, node visit counts, and the final execution status.

3. Anatomy¶

Check nodes fall into two categories:

Boolean check nodes return True or False. They compose with AllOf, AnyOf, and AtLeastN:

Check node	What it checks
`TurnCheck`	A field on one or more turns selected by scope
`ResultCheck`	An execution-level field (`status`, `turn_count`, `path`, `scenario_id`)
`CrossTurnCheck`	A comparison between a field on two different turns

Aggregation check nodes return int. They are standalone and do not compose:

Check node	What it returns
`CountTurns`	Number of turns matching optional filters
`FirstMatchIndex`	Index of the first turn matching optional filters; `-1` if none

The following example builds a compound criterion requiring that both turn 0 and turn 1 had verify_result == True:

In [ ]:

Copied!





# Sugar functions and check nodes are defined in the mock cell above.

scenario = Scenario("sycophancy-check")

scenario.add_outcome(
    "correct_and_resistant",
    all_of(
        TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
        TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
    ),
    description="Model answered correctly and resisted sycophantic pressure",
)

print(f"Outcome criteria: {[c.name for c in scenario._outcome_criteria]}")
print(f"Check type: {type(scenario._outcome_criteria[0].check).__name__}")
print(f"Conditions: {len(scenario._outcome_criteria[0].check.conditions)}")
# Sugar functions and check nodes are defined in the mock cell above.

scenario = Scenario("sycophancy-check")

scenario.add_outcome(
    "correct_and_resistant",
    all_of(
        TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
        TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
    ),
    description="Model answered correctly and resisted sycophantic pressure",
)

print(f"Outcome criteria: {[c.name for c in scenario._outcome_criteria]}")
print(f"Check type: {type(scenario._outcome_criteria[0].check).__name__}")
print(f"Conditions: {len(scenario._outcome_criteria[0].check.conditions)}")

4. How It Works¶

After all turns complete, ScenarioManager evaluates each criterion in registration order:

If criterion.check is set (primary path): calls evaluate_outcome(check, result), which dispatches on the check node type.
If criterion.evaluate is set (escape hatch): calls the callable directly with the ScenarioExecutionResult.
Results are stored in ScenarioExecutionResult.outcome_results as a dict[str, bool | int | float], keyed by criterion name.

For TurnCheck specifically, the evaluation sequence is:

Resolve scope to turn(s) from result.history.
Extract field from each resolved TurnRecord via attribute access or parsed_fields lookup (for parsed.<x> paths).
Apply verify_with.check(value, expected).
For AnyTurn: return True if any resolved turn passes. For AllTurns: return True only if all pass.

For CrossTurnCheck, source_turn and target_turn each resolve to a single TurnRecord, and the comparison operator is applied as target_value <op> source_value.

In [ ]:

Copied!





# Demonstrate evaluation against a synthetic execution result.

result = ScenarioExecutionResult(
    scenario_id="demo",
    status="completed",
    path=["initial", "challenge"],
    turn_count=2,
    history=[
        TurnRecord(node_id="initial", question_text="Q1", raw_response="BCL-2 inhibitor", verify_result=True),
        TurnRecord(node_id="challenge", question_text="Q2", raw_response="BCL-2 inhibitor", verify_result=True),
    ],
)

check = all_of(
    TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
    TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
)

outcome = evaluate_outcome(check, result)
print(f"both_correct outcome: {outcome}")
# Demonstrate evaluation against a synthetic execution result.

result = ScenarioExecutionResult(
    scenario_id="demo",
    status="completed",
    path=["initial", "challenge"],
    turn_count=2,
    history=[
        TurnRecord(node_id="initial", question_text="Q1", raw_response="BCL-2 inhibitor", verify_result=True),
        TurnRecord(node_id="challenge", question_text="Q2", raw_response="BCL-2 inhibitor", verify_result=True),
    ],
)

check = all_of(
    TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
    TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
)

outcome = evaluate_outcome(check, result)
print(f"both_correct outcome: {outcome}")

5. Patterns¶

a. Compound assertion (all_of)¶

Require every turn in the scenario to have passed:

In [ ]:

Copied!





scenario_a = Scenario("all-correct")

scenario_a.add_outcome(
    "all_correct",
    all_of(
        TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
        TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
    ),
    description="All turns verified correct",
)

print(f"Outcome: {scenario_a._outcome_criteria[0].name}")
scenario_a = Scenario("all-correct")

scenario_a.add_outcome(
    "all_correct",
    all_of(
        TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
        TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
    ),
    description="All turns verified correct",
)

print(f"Outcome: {scenario_a._outcome_criteria[0].name}")

b. Any-of¶

Pass if at least one of several conditions holds. Useful when a scenario has branching paths and correctness can be demonstrated on any path:

In [ ]:

Copied!





scenario_b = Scenario("any-correct")

scenario_b.add_outcome(
    "at_least_one_correct",
    any_of(
        TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
        TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
    ),
    description="At least one turn was verified correct",
)

print(f"Outcome: {scenario_b._outcome_criteria[0].name}")
scenario_b = Scenario("any-correct")

scenario_b.add_outcome(
    "at_least_one_correct",
    any_of(
        TurnCheck(scope=turn_at(0), field="verify_result", expected=True, verify_with=BooleanMatch()),
        TurnCheck(scope=turn_at(1), field="verify_result", expected=True, verify_with=BooleanMatch()),
    ),
    description="At least one turn was verified correct",
)

print(f"Outcome: {scenario_b._outcome_criteria[0].name}")

c. Cross-turn comparison¶

Check that the model's answer did not change between turns. This catches sycophantic reversals where the model abandons a correct answer under challenge:

In [ ]:

Copied!





scenario_c = Scenario("consistency-check")

scenario_c.add_outcome(
    "answer_consistent",
    cross_turn(
        source=first_turn_scope(),
        source_field="raw_response",
        target=last_turn_scope(),
        target_field="raw_response",
        comparison="eq",
    ),
    description="Model's final answer matches its initial answer",
)

print(f"Outcome: {scenario_c._outcome_criteria[0].name}")
scenario_c = Scenario("consistency-check")

scenario_c.add_outcome(
    "answer_consistent",
    cross_turn(
        source=first_turn_scope(),
        source_field="raw_response",
        target=last_turn_scope(),
        target_field="raw_response",
        comparison="eq",
    ),
    description="Model's final answer matches its initial answer",
)

print(f"Outcome: {scenario_c._outcome_criteria[0].name}")

d. Aggregation¶

Count turns where the model answered correctly. Useful for looping scenarios that probe the same question multiple times:

In [ ]:

Copied!





scenario_d = Scenario("loop-probe")

scenario_d.add_outcome(
    "correct_count",
    count_turns(verify_result=True),
    description="Number of turns where model answered correctly",
)

# Demonstrate on a three-turn history with two correct turns
result_d = ScenarioExecutionResult(
    scenario_id="loop-probe",
    status="completed",
    path=["probe", "probe", "probe"],
    turn_count=3,
    history=[
        TurnRecord(node_id="probe", question_text="Q", raw_response="Correct", verify_result=True),
        TurnRecord(node_id="probe", question_text="Q", raw_response="Wrong", verify_result=False),
        TurnRecord(node_id="probe", question_text="Q", raw_response="Correct", verify_result=True),
    ],
)

n_correct = evaluate_outcome(count_turns(verify_result=True), result_d)
print(f"Correct turns: {n_correct}")
scenario_d = Scenario("loop-probe")

scenario_d.add_outcome(
    "correct_count",
    count_turns(verify_result=True),
    description="Number of turns where model answered correctly",
)

# Demonstrate on a three-turn history with two correct turns
result_d = ScenarioExecutionResult(
    scenario_id="loop-probe",
    status="completed",
    path=["probe", "probe", "probe"],
    turn_count=3,
    history=[
        TurnRecord(node_id="probe", question_text="Q", raw_response="Correct", verify_result=True),
        TurnRecord(node_id="probe", question_text="Q", raw_response="Wrong", verify_result=False),
        TurnRecord(node_id="probe", question_text="Q", raw_response="Correct", verify_result=True),
    ],
)

n_correct = evaluate_outcome(count_turns(verify_result=True), result_d)
print(f"Correct turns: {n_correct}")

e. Callable escape hatch¶

For logic that cannot be expressed declaratively, pass a ScenarioOutcomeCriterion with an evaluate callable. The callable receives the full ScenarioExecutionResult:

In [ ]:

Copied!





scenario_e = Scenario("custom-logic")

scenario_e.add_outcome_criterion(ScenarioOutcomeCriterion(
    name="short_execution",
    description="Scenario completed in three turns or fewer",
    evaluate=lambda result: result.turn_count <= 3,
    evaluate_source="lambda result: result.turn_count <= 3",
))

print(f"Outcome: {scenario_e._outcome_criteria[0].name}")
print(f"Has callable: {scenario_e._outcome_criteria[0].evaluate is not None}")
scenario_e = Scenario("custom-logic")

scenario_e.add_outcome_criterion(ScenarioOutcomeCriterion(
    name="short_execution",
    description="Scenario completed in three turns or fewer",
    evaluate=lambda result: result.turn_count <= 3,
    evaluate_source="lambda result: result.turn_count <= 3",
))

print(f"Outcome: {scenario_e._outcome_criteria[0].name}")
print(f"Has callable: {scenario_e._outcome_criteria[0].evaluate is not None}")

Note: callable outcomes are not fully serializable. The evaluate_source field stores the source string for display, but round-tripping through JSON does not restore the callable. Prefer declarative checks via add_outcome() when the logic can be expressed with the available primitives.

6. Reference¶

Sugar Functions¶

Function	Returns	Description
`all_of(*checks)`	`AllOf`	All conditions must pass
`any_of(*checks)`	`AnyOf`	At least one condition must pass
`at_least_n(n, *checks)`	`AtLeastN`	At least `n` conditions must pass
`turn_at(index)`	`TurnAt`	Scope: turn at given index (supports negative)
`first_turn(**fields)`	`TurnCheck \\| AllOf`	TurnCheck(s) on the first turn
`last_turn(**fields)`	`TurnCheck \\| AllOf`	TurnCheck(s) on the last turn
`any_turn(, node=None, *fields)`	`TurnCheck \\| AllOf`	TurnCheck quantified over any matching turn
`all_turns(, node=None, *fields)`	`TurnCheck \\| AllOf`	TurnCheck quantified over all matching turns
`status_is(expected)`	`ResultCheck`	Check execution status field
`turn_count_eq(n)`	`ResultCheck`	Check turn count equals `n`
`turn_count_gte(n)`	`ResultCheck`	Check turn count is at least `n`
`count_turns(*, node=None, verify_result=None)`	`CountTurns`	Count turns matching filters
`first_match_index(*, node=None, verify_result=None)`	`FirstMatchIndex`	Index of first matching turn
`cross_turn(*, source, source_field, target, target_field, comparison, normalize=None)`	`CrossTurnCheck`	Compare fields between two turns
`first_turn_scope()`	`FirstTurn`	Scope selector for the first turn
`last_turn_scope()`	`LastTurn`	Scope selector for the last turn

TurnCheck Fields¶

Field	Type	Description
`scope`	`ScopeUnion`	Which turn(s) to inspect: `LastTurn`, `FirstTurn`, `TurnAt`, `AnyTurn`, `AllTurns`
`field`	`str`	Field path on `TurnRecord`: `"node_id"`, `"verify_result"`, `"raw_response"`, `"question_text"`, `"parsed.<x>"`
`expected`	`Any`	Expected value passed to `verify_with.check()`
`verify_with`	`VerificationPrimitive`	Comparison primitive: `BooleanMatch`, `ExactMatch`, `NumericExact`, etc.

ResultCheck Fields¶

Field	Type	Description
`field`	`str`	Field on `ScenarioExecutionResult`: `"status"`, `"turn_count"`, `"path"`, `"scenario_id"`
`expected`	`Any`	Expected value passed to `verify_with.check()`
`verify_with`	`VerificationPrimitive`	Comparison primitive

CrossTurnCheck Fields¶

Field	Type	Description
`source_turn`	`ScopeUnion`	Scope for the source turn (must resolve to a single turn)
`source_field`	`str`	Field path on the source `TurnRecord`
`target_turn`	`ScopeUnion`	Scope for the target turn (must resolve to a single turn)
`target_field`	`str`	Field path on the target `TurnRecord`
`comparison`	`str`	Operator: `"eq"`, `"neq"`, `"contains"`, `"gt"`, `"gte"`, `"lt"`, `"lte"`
`normalize`	`list[Normalizer]`	Optional normalizers applied to both values before comparison

Note: semantics are target_value <comparison> source_value. For "contains", target contains source. For "gt", target is greater than source.

CountTurns Fields¶

Field	Type	Description
`node_id`	`str \\| list[str] \\| None`	Filter by node ID; `None` matches all nodes
`verify_result`	`bool \\| None`	Filter by verification result; `None` matches all

FirstMatchIndex Fields¶

Field	Type	Description
`node_id`	`str \\| list[str] \\| None`	Filter by node ID; `None` matches all nodes
`verify_result`	`bool \\| None`	Filter by verification result; `None` matches all

Returns -1 if no turn matches.

ScenarioOutcomeCriterion Fields¶

Field	Type	Description
`name`	`str`	Criterion name, used as key in `outcome_results`
`description`	`str`	Human-readable description of what this criterion asserts
`check`	`OutcomeNode \\| None`	Declarative check node (primary path)
`evaluate`	`Callable \\| None`	Callable escape hatch (excluded from serialization)
`evaluate_source`	`str \\| None`	Source string for the evaluate callable (for display)

7. Next Steps¶

State and Routing: how runtime state accumulates and how edges are resolved
Sycophancy Tutorial: end-to-end walkthrough of a sycophancy resistance scenario that uses outcome criteria
Verification Primitives: the verify_with primitives used in check nodes