Progressive Save and Resume¶
This tutorial shows how to checkpoint verification progress so you can resume interrupted runs. For long verification runs (many questions, expensive models, multiple replicates), progressive save writes results incrementally. If the run stops for any reason, you resume from the last checkpoint instead of re-evaluating everything.
What you'll learn:
- Enable progressive save with `--progressive-save` (CLI)
- Resume an interrupted run with `--resume`
- Check run status with `karenina verify-status`
- Use `ProgressiveSaveManager` directly in Python
- Understand the `.tmp` and `.state` file pair
- Verify configuration compatibility before resuming
- Read intermediate results from `.tmp` files during a run
# 1. Start a verification run with progressive save enabled:
# karenina verify checkpoint.jsonld --preset default.json \
# --output results.json --progressive-save
# 2. Check progress while a run is active (or after interruption):
# karenina verify-status results.json.state
# 3. Resume from where the run stopped:
# karenina verify --resume results.json.state
print("CLI commands shown in comments above")
CLI commands shown in comments above
The --progressive-save flag tells the runner to write results incrementally to a .tmp file and track progress in a .state file. If the process is interrupted, --resume picks up from the last completed task. The verify-status command reads the .state file and reports how many tasks are pending.
How It Works¶
Progressive save maintains two sidecar files alongside your output path:
verify --progressive-save
│
├── results.json.tmp (accumulated results)
├── results.json.state (progress tracking)
│
▼ (on completion)
results.json (final output)
The .tmp file stores results in standard export format, so you can read intermediate results at any time. The .state file tracks the task manifest (every task that needs to run), the set of completed task IDs, and a config hash for compatibility checks. Both files use atomic writes to prevent corruption if the process terminates mid-write.
On successful completion, finalize() removes the .tmp and .state files and writes the final output.
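The atomic writes mentioned above can be pictured with the standard write-to-temp-then-rename pattern. The sketch below is illustrative, not karenina's actual implementation; `save_json_atomic` and the example state payload are assumptions:

```python
import json
import os
import tempfile
from pathlib import Path

def save_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers never observe a partially written file.

    The data first goes to a temp file in the same directory; os.replace()
    then swaps it into place in a single atomic step.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".partial")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_name)
        raise

# Hypothetical state payload, written to a throwaway directory
state = {"total_tasks": 3, "completed": ["q1"]}
out = Path(tempfile.mkdtemp()) / "results.json.state"
save_json_atomic(out, state)
```

Because the rename is atomic, a concurrent reader of the `.state` file sees either the previous complete snapshot or the new one, never a torn write.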
Python API: Initialize¶
Create a ProgressiveSaveManager, define the task manifest, and call initialize():
config = VerificationConfig.from_overrides(
answering_id="claude-haiku-4-5",
answering_model="claude-haiku-4-5",
parsing_id="claude-haiku-4-5",
parsing_model="claude-haiku-4-5",
)
manager = ProgressiveSaveManager(
output_path=_output_path,
config=config,
benchmark_path=_benchmark_path,
)
# Build task manifest from TaskIdentifier objects
task_ids = []
for qid, _, _ in _questions:
tid = TaskIdentifier(
question_id=qid,
answering_canonical_key="claude-haiku-4-5",
parsing_canonical_key="claude-haiku-4-5",
replicate=0,
)
task_ids.append(tid.to_key())
manager.initialize(task_ids)
print(f"Initialized with {manager.total_tasks} tasks")
print(f"Completed so far: {manager.completed_count}")
print(f"State file: {manager.state_path.name}")
print(f"Tmp file: {manager.tmp_path.name}")
Initialized with 3 tasks
Completed so far: 0
State file: results.json.state
Tmp file: results.json.tmp
The task manifest is a list of string keys. Each key uniquely identifies one verification task by question ID, answering model, parsing model, and replicate number.
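The exact string produced by `TaskIdentifier.to_key()` is not shown in this tutorial. As an illustration only, an equivalent key could be built like this; the `TaskKey` class, delimiter, and field order are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskKey:
    """Illustrative stand-in for TaskIdentifier: one key per
    (question, answering model, parsing model, replicate) combination."""
    question_id: str
    answering_canonical_key: str
    parsing_canonical_key: str
    replicate: int

    def to_key(self) -> str:
        # Join the four identifying fields; the real delimiter may differ.
        return "|".join([
            self.question_id,
            self.answering_canonical_key,
            self.parsing_canonical_key,
            str(self.replicate),
        ])

# Three questions x one model pair x one replicate -> three unique keys
keys = [
    TaskKey(q, "claude-haiku-4-5", "claude-haiku-4-5", 0).to_key()
    for q in ("q1", "q2", "q3")
]
```

Whatever the concrete format, the important property is that the key is stable and unique, so a completed task can be matched against the manifest after a restart.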
Python API: Add Results¶
As verification completes each task, call add_result() to persist it:
# Simulate completing the first two questions
result_1 = _make_result("q1", "What is the primary target of venetoclax?", "BCL2", True)
result_2 = _make_result("q2", "How many chromosome pairs do humans have?", "23", True)
manager.add_result(result_1)
manager.add_result(result_2)
print(f"Completed: {manager.completed_count}/{manager.total_tasks}")
print(f"Pending: {len(manager.get_pending_task_ids())} tasks remain")
Completed: 2/3
Pending: 1 tasks remain
Each add_result() call atomically updates both the .tmp and .state files. If the process crashes after this call, the completed results are safely on disk.
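Because the `.tmp` file is in the standard export format, a separate process can read partial results at any point during the run. The export schema is not spelled out in this tutorial, so the `"results"` key below is an assumption:

```python
import json
from pathlib import Path

def read_partial_results(tmp_path: Path) -> list:
    """Load whatever has been saved so far; returns [] if nothing yet."""
    if not tmp_path.exists():
        return []
    data = json.loads(tmp_path.read_text())
    # Assumed shape: a top-level "results" list in the export format.
    return data.get("results", []) if isinstance(data, dict) else data

partial = read_partial_results(Path("results.json.tmp"))
print(f"{len(partial)} results saved so far")
```

This is read-only, so it is safe to run from a monitoring script while verification is still in progress.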
Python API: Inspect State¶
Use inspect_state_file() to check progress without loading the full manager:
status = inspect_state_file(manager.state_path)
print(f"Total tasks: {status.total_tasks}")
print(f"Completed: {status.completed_count}")
print(f"Pending: {status.pending_count}")
print(f"Progress: {status.progress_percent:.1f}%")
print(f"Tmp file exists: {status.tmp_file_exists}")
print(f"Tmp file size: {status.tmp_file_size} bytes")
Total tasks: 3
Completed: 2
Pending: 1
Progress: 66.7%
Tmp file exists: True
Tmp file size: 5291 bytes
ProgressiveJobStatus is a lightweight dataclass. It reads only the .state JSON file, so it works even while another process is writing results.
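Conceptually, the status check is just a read of the `.state` JSON plus a little arithmetic. The sketch below mimics that behavior; the class name, state-file keys, and field names are assumptions, not karenina's actual schema:

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class JobStatus:
    """Minimal stand-in for ProgressiveJobStatus."""
    total_tasks: int
    completed_count: int

    @property
    def pending_count(self) -> int:
        return self.total_tasks - self.completed_count

    @property
    def progress_percent(self) -> float:
        if self.total_tasks == 0:
            return 100.0
        return 100.0 * self.completed_count / self.total_tasks

def inspect_state(path: Path) -> JobStatus:
    # Read-only: safe while another process is appending results.
    raw = json.loads(path.read_text())
    return JobStatus(
        total_tasks=len(raw["task_ids"]),           # assumed key
        completed_count=len(raw["completed_ids"]),  # assumed key
    )
```

Because nothing is locked or written, this check never interferes with the running job.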
Python API: Resume¶
To resume an interrupted run, load the manager from the .state file:
resumed = ProgressiveSaveManager.load_for_resume(manager.state_path)
print(f"Resumed with {resumed.completed_count}/{resumed.total_tasks} already done")
pending = resumed.get_pending_task_ids()
print(f"Pending task IDs: {len(pending)}")
# Complete the remaining task
result_3 = _make_result("q3", "What organ produces insulin?", "Pancreas", True)
resumed.add_result(result_3)
print(f"After adding q3: {resumed.completed_count}/{resumed.total_tasks}")
Resumed with 2/3 already done
Pending task IDs: 1
After adding q3: 3/3
load_for_resume() reconstructs the full manager state: it reloads the config, task manifest, completed set, and all previously saved results from the .tmp file.
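Putting the documented calls together, a resume-aware driver loop might look like the sketch below. `run_one_task` is a hypothetical stand-in for the actual verification call, and the shape of its result is assumed:

```python
def run_one_task(task_id: str) -> dict:
    # Hypothetical placeholder for running one verification task.
    return {"task_id": task_id, "success": True}

def resume_run(manager) -> None:
    """Finish only the tasks the .state file says are still pending."""
    for task_id in manager.get_pending_task_ids():
        result = run_one_task(task_id)
        manager.add_result(result)   # persisted to .tmp/.state immediately
    if manager.completed_count == manager.total_tasks:
        manager.finalize()           # clean up sidecars, keep final output
```

Since every `add_result()` call is persisted immediately, this loop can itself be interrupted and resumed again without losing work.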
Python API: Finalize¶
Once all tasks are complete, call finalize() to clean up the sidecar files:
state_path = resumed.state_path
tmp_path = resumed.tmp_path
print(f"Before finalize: .state exists = {state_path.exists()}")
print(f"Before finalize: .tmp exists = {tmp_path.exists()}")
resumed.finalize()
print(f"After finalize: .state exists = {state_path.exists()}")
print(f"After finalize: .tmp exists = {tmp_path.exists()}")
Before finalize: .state exists = True
Before finalize: .tmp exists = True
After finalize: .state exists = False
After finalize: .tmp exists = False
After finalize(), only the final output file remains. The runner writes the complete results to the output path before calling finalize(), so the .tmp and .state files are no longer needed.
Configuration Compatibility¶
When resuming, the manager checks that the config and benchmark path match the original run. This prevents accidental result mixing:
# Create a fresh manager to test compatibility
fresh_manager = ProgressiveSaveManager(
output_path=_output_path,
config=config,
benchmark_path=_benchmark_path,
)
fresh_manager.initialize(task_ids)
# Same config and benchmark: compatible
compatible, reason = fresh_manager.is_compatible(config, _benchmark_path)
print(f"Same config: compatible={compatible}")
# Different config: incompatible
different_config = VerificationConfig.from_overrides(
answering_id="claude-sonnet-4-20250514",
answering_model="claude-sonnet-4-20250514",
parsing_id="claude-haiku-4-5",
parsing_model="claude-haiku-4-5",
)
compatible, reason = fresh_manager.is_compatible(different_config, _benchmark_path)
print(f"Different config: compatible={compatible}")
print(f"Reason: {reason}")
# Different benchmark path: incompatible
compatible, reason = fresh_manager.is_compatible(config, "/other/benchmark.jsonld")
print(f"Different path: compatible={compatible}")
print(f"Reason: {reason}")
Same config: compatible=True
Different config: compatible=False
Reason: Configuration has changed since the job started
Different path: compatible=False
Reason: Benchmark path changed: /var/folders/34/129m5tdd04vf10ptyj12w6f80000gp/T/tmpkm92u_qk/benchmark.jsonld -> /other/benchmark.jsonld
The config hash is computed from the full VerificationConfig JSON (excluding manual traces). Any change to models, evaluation mode, or pipeline settings will be detected.
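The compatibility check can be pictured as comparing digests of the serialized config. The sketch below shows the general idea; the excluded field name, canonicalization details, and choice of SHA-256 are assumptions about karenina's internals:

```python
import hashlib
import json

def config_hash(config_dict: dict) -> str:
    """Hash a config's JSON with stable key order so logically equal
    configs always hash the same. The "manual_traces" exclusion mirrors
    the tutorial's note that manual traces are left out."""
    payload = {k: v for k, v in config_dict.items() if k != "manual_traces"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = {"answering_model": "claude-haiku-4-5", "parsing_model": "claude-haiku-4-5"}
b = {"parsing_model": "claude-haiku-4-5", "answering_model": "claude-haiku-4-5"}
c = {"answering_model": "claude-sonnet-4-20250514", "parsing_model": "claude-haiku-4-5"}
```

Sorting the keys before hashing means that two configs with the same settings in a different order still compare as compatible, while any real change to a model or pipeline setting produces a different digest.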
Next Steps¶
- Basic Verification: Single-model template-only evaluation
- Full Evaluation: Template and rubric evaluation with quality checks
- CLI Reference: verify: All `karenina verify` options
- CLI Reference: verify-status: Inspect progressive save state