Progressive Save and Resume¶
This tutorial shows how to checkpoint verification progress so you can resume interrupted runs. For long verification runs (many questions, expensive models, multiple replicates), progressive save writes results incrementally. If the run stops for any reason, you resume from the last checkpoint instead of re-evaluating everything.
What you'll learn:
- Enable progressive save with `--progressive-save` (CLI)
- Resume an interrupted run with `--resume`
- Check run status with `karenina verify-status`
- Use `ProgressiveSaveManager` directly in Python
- Understand the `.tmp` and `.state` file pair
- Verify configuration compatibility before resuming
- Read intermediate results from `.tmp` files during a run
# 1. Start a verification run with progressive save enabled:
# karenina verify checkpoint.jsonld --preset default.json \
# --output results.json --progressive-save
# 2. Check progress while a run is active (or after interruption):
# karenina verify-status results.json.state
# 3. Resume from where the run stopped:
# karenina verify --resume results.json.state
print("CLI commands shown in comments above")
CLI commands shown in comments above
The --progressive-save flag tells the runner to write results incrementally to a .tmp file and track progress in a .state file. If the process is interrupted, --resume picks up from the last completed task. The verify-status command reads the .state file and reports how many tasks are pending.
How It Works¶
Progressive save maintains two sidecar files alongside your output path:
verify --progressive-save
│
├── results.json.tmp (accumulated results)
├── results.json.state (progress tracking)
│
▼ (on completion)
results.json (final output)
The .tmp file stores results in standard export format, so you can read intermediate results at any time. The .state file tracks the task manifest (every task that needs to run), the set of completed task IDs, and a config hash for compatibility checks. Both files use atomic writes to prevent corruption if the process terminates mid-write.
On successful completion, finalize() removes the .tmp and .state files and writes the final output.
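The atomic writes mentioned above can be pictured with the standard write-to-temp-then-rename pattern. The sketch below is illustrative, not karenina's actual implementation; `save_json_atomic` and the example state payload are assumptions:

```python
import json
import os
import tempfile
from pathlib import Path

def save_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers never observe a partially written file.

    The data first goes to a temp file in the same directory; os.replace()
    then swaps it into place in a single atomic step.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".partial")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_name)
        raise

# Hypothetical state payload, written to a throwaway directory
state = {"total_tasks": 3, "completed": ["q1"]}
out = Path(tempfile.mkdtemp()) / "results.json.state"
save_json_atomic(out, state)
```

Because the rename is atomic, a concurrent reader of the `.state` file sees either the previous complete snapshot or the new one, never a torn write.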
Python API: Initialize¶
Create a ProgressiveSaveManager, define the task manifest, and call initialize():
config = VerificationConfig.from_overrides(
answering_id="claude-haiku-4-5",
answering_model="claude-haiku-4-5",
parsing_id="claude-haiku-4-5",
parsing_model="claude-haiku-4-5",
)
manager = ProgressiveSaveManager(
output_path=_output_path,
config=config,
benchmark_path=_benchmark_path,
)
# Build task manifest from TaskIdentifier objects
task_ids = []
for qid, _, _ in _questions:
tid = TaskIdentifier(
question_id=qid,
answering_canonical_key="claude-haiku-4-5",
parsing_canonical_key="claude-haiku-4-5",
replicate=0,
)
task_ids.append(tid.to_key())
manager.initialize(task_ids)
print(f"Initialized with {manager.total_tasks} tasks")
print(f"Completed so far: {manager.completed_count}")
print(f"State file: {manager.state_path.name}")
print(f"Tmp file: {manager.tmp_path.name}")
Initialized with 3 tasks
Completed so far: 0
State file: results.json.state
Tmp file: results.json.tmp
The task manifest is a list of string keys. Each key uniquely identifies one verification task by question ID, answering model, parsing model, and replicate number.
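The exact string produced by `TaskIdentifier.to_key()` is not shown in this tutorial. As an illustration only, an equivalent key could be built like this; the `TaskKey` class, delimiter, and field order are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskKey:
    """Illustrative stand-in for TaskIdentifier: one key per
    (question, answering model, parsing model, replicate) combination."""
    question_id: str
    answering_canonical_key: str
    parsing_canonical_key: str
    replicate: int

    def to_key(self) -> str:
        # Join the four identifying fields; the real delimiter may differ.
        return "|".join([
            self.question_id,
            self.answering_canonical_key,
            self.parsing_canonical_key,
            str(self.replicate),
        ])

# Three questions x one model pair x one replicate -> three unique keys
keys = [
    TaskKey(q, "claude-haiku-4-5", "claude-haiku-4-5", 0).to_key()
    for q in ("q1", "q2", "q3")
]
```

Whatever the concrete format, the important property is that the key is stable and unique, so a completed task can be matched against the manifest after a restart.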
Python API: Add Results¶
As verification completes each task, call add_result() to persist it:
# Simulate completing the first two questions
result_1 = _make_result("q1", "What is the primary target of venetoclax?", "BCL2", True)
result_2 = _make_result("q2", "How many chromosome pairs do humans have?", "23", True)
manager.add_result(result_1)
manager.add_result(result_2)
print(f"Completed: {manager.completed_count}/{manager.total_tasks}")
print(f"Pending: {len(manager.get_pending_task_ids())} tasks remain")
Completed: 2/3
Pending: 1 tasks remain
Each add_result() call atomically updates both the .tmp and .state files. If the process crashes after this call, the completed results are safely on disk.
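Because the `.tmp` file is in the standard export format, a separate process can read partial results at any point during the run. The export schema is not spelled out in this tutorial, so the `"results"` key below is an assumption:

```python
import json
from pathlib import Path

def read_partial_results(tmp_path: Path) -> list:
    """Load whatever has been saved so far; returns [] if nothing yet."""
    if not tmp_path.exists():
        return []
    data = json.loads(tmp_path.read_text())
    # Assumed shape: a top-level "results" list in the export format.
    return data.get("results", []) if isinstance(data, dict) else data

partial = read_partial_results(Path("results.json.tmp"))
print(f"{len(partial)} results saved so far")
```

This is read-only, so it is safe to run from a monitoring script while verification is still in progress.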
Python API: Inspect State¶
Use inspect_state_file() to check progress without loading the full manager:
status = inspect_state_file(manager.state_path)
print(f"Total tasks: {status.total_tasks}")
print(f"Completed: {status.completed_count}")
print(f"Pending: {status.pending_count}")
print(f"Progress: {status.progress_percent:.1f}%")
print(f"Tmp file exists: {status.tmp_file_exists}")
print(f"Tmp file size: {status.tmp_file_size} bytes")
Total tasks: 3
Completed: 2
Pending: 1
Progress: 66.7%
Tmp file exists: True
Tmp file size: 5291 bytes
ProgressiveJobStatus is a lightweight dataclass. It reads only the .state JSON file, so it works even while another process is writing results.
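Conceptually, the status check is just a read of the `.state` JSON plus a little arithmetic. The sketch below mimics that behavior; the class name, state-file keys, and field names are assumptions, not karenina's actual schema:

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class JobStatus:
    """Minimal stand-in for ProgressiveJobStatus."""
    total_tasks: int
    completed_count: int

    @property
    def pending_count(self) -> int:
        return self.total_tasks - self.completed_count

    @property
    def progress_percent(self) -> float:
        if self.total_tasks == 0:
            return 100.0
        return 100.0 * self.completed_count / self.total_tasks

def inspect_state(path: Path) -> JobStatus:
    # Read-only: safe while another process is appending results.
    raw = json.loads(path.read_text())
    return JobStatus(
        total_tasks=len(raw["task_ids"]),           # assumed key
        completed_count=len(raw["completed_ids"]),  # assumed key
    )
```

Because nothing is locked or written, this check never interferes with the running job.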
Python API: Resume¶
To resume an interrupted run, load the manager from the .state file:
resumed = ProgressiveSaveManager.load_for_resume(manager.state_path)
print(f"Resumed with {resumed.completed_count}/{resumed.total_tasks} already done")
pending = resumed.get_pending_task_ids()
print(f"Pending task IDs: {len(pending)}")
# Complete the remaining task
result_3 = _make_result("q3", "What organ produces insulin?", "Pancreas", True)
resumed.add_result(result_3)
print(f"After adding q3: {resumed.completed_count}/{resumed.total_tasks}")
Resumed with 2/3 already done
Pending task IDs: 1
After adding q3: 3/3
load_for_resume() reconstructs the full manager state: it reloads the config, task manifest, completed set, and all previously saved results from the .tmp file.
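Putting the documented calls together, a resume-aware driver loop might look like the sketch below. `run_one_task` is a hypothetical stand-in for the actual verification call, and the shape of its result is assumed:

```python
def run_one_task(task_id: str) -> dict:
    # Hypothetical placeholder for running one verification task.
    return {"task_id": task_id, "success": True}

def resume_run(manager) -> None:
    """Finish only the tasks the .state file says are still pending."""
    for task_id in manager.get_pending_task_ids():
        result = run_one_task(task_id)
        manager.add_result(result)   # persisted to .tmp/.state immediately
    if manager.completed_count == manager.total_tasks:
        manager.finalize()           # clean up sidecars, keep final output
```

Since every `add_result()` call is persisted immediately, this loop can itself be interrupted and resumed again without losing work.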
Python API: Finalize¶
Once all tasks are complete, call finalize() to clean up the sidecar files:
state_path = resumed.state_path
tmp_path = resumed.tmp_path
print(f"Before finalize: .state exists = {state_path.exists()}")
print(f"Before finalize: .tmp exists = {tmp_path.exists()}")
resumed.finalize()
print(f"After finalize: .state exists = {state_path.exists()}")
print(f"After finalize: .tmp exists = {tmp_path.exists()}")
Before finalize: .state exists = True
Before finalize: .tmp exists = True
After finalize: .state exists = False
After finalize: .tmp exists = False
After finalize(), only the final output file remains. The runner writes the complete results to the output path before calling finalize(), so the .tmp and .state files are no longer needed.
Configuration Compatibility¶
When resuming, the manager checks that the config and benchmark path match the original run. This prevents accidental result mixing:
# Create a fresh manager to test compatibility
fresh_manager = ProgressiveSaveManager(
output_path=_output_path,
config=config,
benchmark_path=_benchmark_path,
)
fresh_manager.initialize(task_ids)
# Same config and benchmark: compatible
compatible, reason = fresh_manager.is_compatible(config, _benchmark_path)
print(f"Same config: compatible={compatible}")
# Different config: incompatible
different_config = VerificationConfig.from_overrides(
answering_id="claude-sonnet-4-20250514",
answering_model="claude-sonnet-4-20250514",
parsing_id="claude-haiku-4-5",
parsing_model="claude-haiku-4-5",
)
compatible, reason = fresh_manager.is_compatible(different_config, _benchmark_path)
print(f"Different config: compatible={compatible}")
print(f"Reason: {reason}")
# Different benchmark path: incompatible
compatible, reason = fresh_manager.is_compatible(config, "/other/benchmark.jsonld")
print(f"Different path: compatible={compatible}")
print(f"Reason: {reason}")
Same config: compatible=True
Different config: compatible=False
Reason: Configuration has changed since the job started
Different path: compatible=False
Reason: Benchmark path changed: /var/folders/34/129m5tdd04vf10ptyj12w6f80000gp/T/tmpkm92u_qk/benchmark.jsonld -> /other/benchmark.jsonld
The config hash is computed from the full VerificationConfig JSON (excluding manual traces). Any change to models, evaluation mode, or pipeline settings will be detected.
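The compatibility check can be pictured as comparing digests of the serialized config. The sketch below shows the general idea; the excluded field name, canonicalization details, and choice of SHA-256 are assumptions about karenina's internals:

```python
import hashlib
import json

def config_hash(config_dict: dict) -> str:
    """Hash a config's JSON with stable key order so logically equal
    configs always hash the same. The "manual_traces" exclusion mirrors
    the tutorial's note that manual traces are left out."""
    payload = {k: v for k, v in config_dict.items() if k != "manual_traces"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = {"answering_model": "claude-haiku-4-5", "parsing_model": "claude-haiku-4-5"}
b = {"parsing_model": "claude-haiku-4-5", "answering_model": "claude-haiku-4-5"}
c = {"answering_model": "claude-sonnet-4-20250514", "parsing_model": "claude-haiku-4-5"}
```

Sorting the keys before hashing means that two configs with the same settings in a different order still compare as compatible, while any real change to a model or pipeline setting produces a different digest.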
Next Steps¶
- Basic Verification: Single-model template-only evaluation
- Full Evaluation: Template and rubric evaluation with quality checks
- CLI Reference: verify: All `karenina verify` options
- CLI Reference: verify-status: Inspect progressive save state