# Hook & State Machine Cleanup - Unified Pattern

## Goal

Implement a consistent pattern across all models (Crawl, Snapshot, ArchiveResult, Dependency) for:

- Running hooks
- Processing JSONL records
- Managing background hooks
- State transitions
## Current State Analysis (ALL COMPLETE ✅)

### ✅ Crawl (archivebox/crawls/)

Status: COMPLETE

- ✅ Has state machine: `CrawlMachine`
- ✅ `Crawl.run()` - runs hooks, processes JSONL via `process_hook_records()`, creates snapshots
- ✅ `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd hooks
- ✅ Uses `OUTPUT_DIR/plugin_name/` for PWD
- ✅ State machine calls model methods:
  - `queued -> started`: calls `crawl.run()`
  - `started -> sealed`: calls `crawl.cleanup()`
### ✅ Snapshot (archivebox/core/)

Status: COMPLETE

- ✅ Has state machine: `SnapshotMachine`
- ✅ `Snapshot.run()` - creates pending ArchiveResults
- ✅ `Snapshot.cleanup()` - kills background ArchiveResult hooks, calls `update_from_output()`
- ✅ `Snapshot.has_running_background_hooks()` - checks PID files using `process_is_alive()`
- ✅ `Snapshot.from_jsonl()` - simplified, filtering moved to caller
- ✅ State machine calls model methods:
  - `queued -> started`: calls `snapshot.run()`
  - `started -> sealed`: calls `snapshot.cleanup()`
  - `is_finished()`: uses `has_running_background_hooks()`
### ✅ ArchiveResult (archivebox/core/)

Status: COMPLETE - Major refactor completed

- ✅ Has state machine: `ArchiveResultMachine`
- ✅ `ArchiveResult.run()` - runs hook, calls `update_from_output()` for foreground hooks
- ✅ `ArchiveResult.update_from_output()` - unified method for foreground and background hooks
- ✅ Uses PWD `snapshot.OUTPUT_DIR/plugin_name`
- ✅ JSONL processing via `process_hook_records()` with URL/depth filtering
- ✅ DELETED special background hook methods:
  - ❌ `check_background_completed()` - replaced by `process_is_alive()` helper
  - ❌ `finalize_background_hook()` - replaced by `update_from_output()`
  - ❌ `_populate_output_fields()` - merged into `update_from_output()`
- ✅ State machine transitions:
  - `queued -> started`: calls `archiveresult.run()`
  - `started -> succeeded/failed/skipped`: status set by `update_from_output()`
### ✅ Binary (archivebox/machine/) - NEW!

Status: COMPLETE - Replaced Dependency model entirely

- ✅ Has state machine: `BinaryMachine`
- ✅ `Binary.run()` - runs on_Binary__install_* hooks, processes JSONL
- ✅ `Binary.cleanup()` - kills background installation hooks (for consistency)
- ✅ `Binary.from_jsonl()` - handles both binaries.jsonl and hook output
- ✅ Uses PWD `data/machines/{machine_id}/binaries/{name}/{id}/plugin_name/`
- ✅ Configuration via static `plugins/*/binaries.jsonl` files
- ✅ State machine calls model methods:
  - `queued -> started`: calls `binary.run()`
  - `started -> succeeded/failed`: status set by hooks via JSONL
- ✅ Perfect symmetry with Crawl/Snapshot/ArchiveResult pattern
### ❌ Dependency Model - ELIMINATED

Status: Deleted entirely (replaced by Binary state machine)

- Static configuration now lives in `plugins/*/binaries.jsonl`
- Per-machine state tracked by Binary records
- No global singleton conflicts
- Hooks renamed from `on_Dependency__install_*` to `on_Binary__install_*`
## Unified Pattern (Target Architecture)

Pattern for ALL models:
```python
# 1. State Machine orchestrates transitions
class ModelMachine(StateMachine):
    @started.enter
    def enter_started(self):
        self.model.run()  # Do the work
        # Update status

    def is_finished(self):
        # Check if background hooks are still running
        if self.model.has_running_background_hooks():
            return False
        # Check if children are finished
        if self.model.has_pending_children():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
        # Update status


# 2. Model methods do the actual work
class Model:
    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('ModelName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir)  # + model-specific kwargs
            if result is None:  # Background hook
                continue

            # Process JSONL records
            records = result.get('records', [])
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(records, overrides=overrides)

        # Create children (e.g., ArchiveResults, Snapshots)
        self.create_children()

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        # Kill any background hooks
        if self.OUTPUT_DIR.exists():
            for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
                kill_process(pid_file)

        # Run cleanup hooks (e.g., on_ModelEnd)
        cleanup_hooks = discover_hooks('ModelEnd')
        for hook in cleanup_hooks:
            run_hook(hook)  # + model-specific kwargs

    def has_running_background_hooks(self) -> bool:
        """Check if any background hooks are still running."""
        if not self.OUTPUT_DIR.exists():
            return False
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            if process_is_alive(pid_file):
                return True
        return False
```
PWD Standard: `model.OUTPUT_DIR/plugin_name/`

- Crawl: `users/{user}/crawls/{date}/{crawl_id}/plugin_name/`
- Snapshot: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/`
- ArchiveResult: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/` (same as Snapshot)
- Dependency: `dependencies/{dependency_id}/plugin_name/` (set output_dir field directly)
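
For illustration, a minimal standalone sketch of how a hook's working directory can be derived under this standard, assuming hooks live in per-plugin directories and `model.OUTPUT_DIR` is a `pathlib.Path`; the helper name is hypothetical:

```python
from pathlib import Path

def hook_output_dir(model_output_dir: Path, hook_path: Path) -> Path:
    """Resolve the per-plugin PWD for a hook run: model.OUTPUT_DIR/plugin_name/.

    The plugin name is taken from the hook script's parent directory,
    e.g. plugins/wget/on_Snapshot__wget.py -> <model OUTPUT_DIR>/wget/.
    """
    plugin_name = hook_path.parent.name
    output_dir = model_output_dir / plugin_name
    output_dir.mkdir(parents=True, exist_ok=True)  # hooks expect their PWD to exist
    return output_dir
```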
## Implementation Plan

### Phase 1: Add unified helpers to hooks.py ✅ DONE

File: `archivebox/hooks.py`

Status: COMPLETE - Added three helper functions:

- `process_hook_records(records, overrides)` - lines 1258-1323
- `process_is_alive(pid_file)` - lines 1326-1344
- `kill_process(pid_file, sig)` - lines 1347-1362
```python
import os
from pathlib import Path
from signal import SIGTERM
from typing import Dict, List


def process_hook_records(records: List[Dict], overrides: Dict = None) -> Dict[str, int]:
    """
    Process JSONL records from hook output.
    Dispatches to Model.from_jsonl() for each record type.

    Args:
        records: List of JSONL record dicts from result['records']
        overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

    Returns:
        Dict with counts by record type
    """
    stats = {}
    for record in records:
        record_type = record.get('type')

        # Dispatch to the appropriate model
        if record_type == 'Snapshot':
            from archivebox.core.models import Snapshot
            Snapshot.from_jsonl(record, overrides)
            stats['Snapshot'] = stats.get('Snapshot', 0) + 1
        elif record_type == 'Tag':
            from archivebox.core.models import Tag
            Tag.from_jsonl(record, overrides)
            stats['Tag'] = stats.get('Tag', 0) + 1
        elif record_type == 'Binary':
            from archivebox.machine.models import Binary
            Binary.from_jsonl(record, overrides)
            stats['Binary'] = stats.get('Binary', 0) + 1
        # ... etc.
    return stats


def process_is_alive(pid_file: Path) -> bool:
    """Check if the process in a PID file is still running."""
    if not pid_file.exists():
        return False
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, 0)  # Signal 0 = check if the process exists
        return True
    except (OSError, ValueError):
        return False


def kill_process(pid_file: Path, signal=SIGTERM):
    """Kill the process in a PID file."""
    if not pid_file.exists():
        return
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, signal)
    except (OSError, ValueError):
        pass
```
### Phase 2: Add Model.from_jsonl() static methods ✅ DONE

Files: `archivebox/core/models.py`, `archivebox/machine/models.py`, `archivebox/crawls/models.py`

Status: COMPLETE - Added from_jsonl() to:

- ✅ `Tag.from_jsonl()` - core/models.py lines 93-116
- ✅ `Snapshot.from_jsonl()` - core/models.py lines 1144-1189
- ✅ `Machine.from_jsonl()` - machine/models.py lines 66-89
- ✅ `Dependency.from_jsonl()` - machine/models.py lines 203-227
- ✅ `Binary.from_jsonl()` - machine/models.py lines 401-434
Example implementations added:

```python
class Snapshot:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Snapshot from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_snapshot

        overrides = overrides or {}

        # Apply overrides (crawl, parent_snapshot, depth limits)
        crawl = overrides.get('crawl')
        snapshot = overrides.get('snapshot')  # parent
        if crawl:
            depth = record.get('depth', (snapshot.depth + 1 if snapshot else 1))
            if depth > crawl.max_depth:
                return None
            record.setdefault('crawl_id', str(crawl.id))
            record.setdefault('depth', depth)
        if snapshot:
            record.setdefault('parent_snapshot_id', str(snapshot.id))

        created_by_id = overrides.get('created_by_id')
        new_snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
        new_snapshot.status = Snapshot.StatusChoices.QUEUED
        new_snapshot.retry_at = timezone.now()
        new_snapshot.save()
        return new_snapshot


class Tag:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Tag from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_tag

        tag = get_or_create_tag(record)
        # Auto-attach to snapshot if in overrides
        if overrides and 'snapshot' in overrides:
            overrides['snapshot'].tags.add(tag)
        return tag


class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        # Implementation similar to existing create_model_record()
        ...

# Etc. for other models
```
### Phase 3: Update ArchiveResult to use unified pattern ✅ DONE

File: `archivebox/core/models.py`

Status: COMPLETE

Changes made:

- ✅ Replaced inline JSONL processing (lines 1912-1950):
  - Pre-filter Snapshot records for depth/URL constraints in ArchiveResult.run()
  - Use `self._url_passes_filters(url)` with the parent snapshot's config for proper hierarchy
  - Replaced inline Tag/Snapshot/other record creation with `process_hook_records()`
  - Removed ~60 lines of duplicate code
- ✅ Simplified Snapshot.from_jsonl() (lines 1144-1189):
  - Removed depth checking (now done in the caller)
  - Just applies crawl metadata and creates the snapshot
  - Added docstring note: "Filtering should be done by caller BEFORE calling this method"
- ✅ Preserved ArchiveResult self-update logic:
  - Status/output fields still updated from the ArchiveResult JSONL record (lines 1856-1910)
  - Special title extractor logic preserved (line 1952+)
  - Search indexing trigger preserved (line 1957+)
- ✅ Key insight: filtering happens in ArchiveResult.run() where we have the parent snapshot's context, NOT in from_jsonl() where we'd lose the config hierarchy (see the sketch below)

Note: Did NOT delete the special background hook methods (check_background_completed, finalize_background_hook) - that's Phase 6
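
A standalone sketch of that pre-filtering step, with hypothetical names (`prefilter_records` and the `url_passes_filters` callable stand in for the inline logic and the real `self._url_passes_filters()`):

```python
from typing import Callable, Dict, List

def prefilter_records(records: List[Dict], max_depth: int, parent_depth: int,
                      url_passes_filters: Callable[[str], bool]) -> List[Dict]:
    """Drop Snapshot records that fail depth/URL constraints; pass everything
    else (Tags, Binaries, ...) through untouched."""
    kept = []
    for record in records:
        if record.get('type') != 'Snapshot':
            kept.append(record)  # non-Snapshot records are not filtered here
            continue
        if parent_depth + 1 > max_depth:
            continue  # outlink would exceed the crawl's max_depth
        if not url_passes_filters(record.get('url', '')):
            continue  # URL fails the parent snapshot's allow/deny filters
        kept.append(record)
    return kept
```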
### Phase 4: Add Snapshot.cleanup() method ✅ DONE

File: `archivebox/core/models.py`

Status: COMPLETE

Changes made:

- ✅ Added Snapshot.cleanup() (lines 1144-1175), sketched below:
  - Kills background ArchiveResult hooks by scanning for `*/hook.pid` files
  - Finalizes background ArchiveResults using `finalize_background_hook()` (temporary until Phase 6)
  - Called by the state machine when entering the sealed state
- ✅ Added Snapshot.has_running_background_hooks() (lines 1177-1195):
  - Checks if any background hooks are still running using `process_is_alive()`
  - Used by the state machine in the `is_finished()` check
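
A minimal standalone sketch of the PID-scanning half of `Snapshot.cleanup()`; the function name is hypothetical, and whether the real method unlinks PID files at this point is an assumption (Phase 6's `update_from_output()` is described as cleaning them up):

```python
import os
import signal
from pathlib import Path

def kill_background_hooks(snapshot_output_dir: Path) -> int:
    """Kill any still-running background hooks for a snapshot by scanning
    */hook.pid files. A sketch of the first half of Snapshot.cleanup();
    the real method then finalizes each background ArchiveResult."""
    killed = 0
    for pid_file in snapshot_output_dir.glob('*/hook.pid'):
        try:
            pid = int(pid_file.read_text().strip())
            os.kill(pid, signal.SIGTERM)
            killed += 1
        except (OSError, ValueError):
            pass  # process already exited, or stale/corrupt pid file
        pid_file.unlink(missing_ok=True)  # assumption: stale pid files removed here
    return killed
```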
### Phase 5: Update SnapshotMachine to use cleanup() ✅ DONE

File: `archivebox/core/statemachines.py`

Status: COMPLETE

Changes made:

- ✅ Simplified is_finished() (lines 58-72):
  - Removed inline background hook checking and finalization (lines 67-76 deleted)
  - Now uses `self.snapshot.has_running_background_hooks()` (line 68)
  - Removed ~12 lines of duplicate logic
- ✅ Added cleanup() to sealed.enter (lines 102-111):
  - Calls `self.snapshot.cleanup()` to kill background hooks (line 105)
  - Follows the unified pattern: cleanup happens on seal, not in is_finished()
### Phase 6: Add ArchiveResult.update_from_output() and simplify run() ✅ DONE

File: `archivebox/core/models.py`

Status: COMPLETE - the BIG refactor (removed ~200 lines of duplication)

Changes made:

- ✅ Added `ArchiveResult.update_from_output()` (lines 1908-2061):
  - Unified method for both foreground and background hooks
  - Reads stdout.log and parses JSONL records
  - Updates status/output_str/output_json from the ArchiveResult JSONL record
  - Walks the filesystem to populate output_files/output_size/output_mimetypes (see the sketch below)
  - Filters Snapshot records for depth/URL constraints (same as run())
  - Processes side-effect records via `process_hook_records()`
  - Updates the snapshot title if run by a title extractor
  - Triggers search indexing if succeeded
  - Cleans up PID files and empty logs
  - ~160 lines of comprehensive logic
- ✅ Simplified `ArchiveResult.run()` (lines 1841-1906):
  - Removed ~120 lines of duplicate filesystem-reading logic
  - Now just sets start_ts/pwd and calls `update_from_output()`
  - Background hooks: return immediately after saving status=STARTED
  - Foreground hooks: call `update_from_output()` to do all the work
  - Removed ~10 lines of duplicate code
- ✅ Updated `Snapshot.cleanup()` (line 1172):
  - Changed from `ar.finalize_background_hook()` to `ar.update_from_output()`
  - Uses the unified method instead of the old special-case method
- ✅ Deleted `_populate_output_fields()` (was ~45 lines):
  - Logic merged into `update_from_output()`
  - Eliminates duplicated filesystem-walking code
- ✅ Deleted `check_background_completed()` (was ~20 lines):
  - Replaced by `process_is_alive(pid_file)` from hooks.py
  - Generic helper, also used by Snapshot.has_running_background_hooks()
- ✅ Deleted `finalize_background_hook()` (was ~85 lines):
  - Completely replaced by `update_from_output()`
  - Was a duplicate of the foreground hook finalization logic

Total lines removed: ~280 lines of duplicate code
Total lines added: ~160 lines of unified code
Net reduction: ~120 lines (-43%)
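
As a concrete illustration of the filesystem-walking piece, a standalone sketch computing the three output_* fields (the helper name and the set of excluded bookkeeping files are assumptions):

```python
import mimetypes
from pathlib import Path
from typing import Dict, List, Tuple

BOOKKEEPING_FILES = {'stdout.log', 'stderr.log', 'hook.pid'}  # assumed exclusions

def summarize_output_dir(output_dir: Path) -> Tuple[List[str], int, Dict[str, int]]:
    """Walk a hook's output dir and compute the three output_* fields:
    relative file paths, total size in bytes, and file counts per mimetype."""
    output_files: List[str] = []
    output_size = 0
    output_mimetypes: Dict[str, int] = {}
    for path in sorted(output_dir.rglob('*')):
        if not path.is_file() or path.name in BOOKKEEPING_FILES:
            continue
        output_files.append(str(path.relative_to(output_dir)))
        output_size += path.stat().st_size
        mimetype = mimetypes.guess_type(path.name)[0] or 'application/octet-stream'
        output_mimetypes[mimetype] = output_mimetypes.get(mimetype, 0) + 1
    return output_files, output_size, output_mimetypes
```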
### Phases 7-8: Dependency State Machine ❌ NOT NEEDED

Status: Intentionally skipped - Dependency doesn't need a state machine

Why no state machine for Dependency?

- Wrong granularity: Dependency is a GLOBAL singleton (one record per binary name)
  - Multiple machines would race to update the same `status`/`retry_at` fields
  - No clear semantics: "started" on which machine? "failed" on Machine A but "succeeded" on Machine B?
- Wrong timing: installation should be SYNCHRONOUS, not queued
  - When a worker needs wget, it should install wget NOW, not queue it for later
  - No benefit to async state machine transitions
- State lives elsewhere: Binary records are the actual state
  - Each machine has its own Binary records (one per machine per binary)
  - Binary.machine FK provides proper per-machine state tracking

Correct architecture:

```
Dependency (global, no state machine):
├─ Configuration: bin_name, bin_providers, overrides
├─ run() method: synchronous installation attempt
└─ NO status, NO retry_at, NO state_machine_name

Binary (per-machine, has machine FK):
├─ State: is this binary installed on this specific machine?
├─ Created via JSONL output from on_Dependency hooks
└─ unique_together = (machine, name, abspath, version, sha256)
```
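
The same split, condensed into a hypothetical Django sketch (the field names and the unique_together constraint come from this doc; the field types, max_lengths, and options are assumptions):

```python
from django.db import models

class Dependency(models.Model):
    # Global configuration only -- deliberately NO status/retry_at/state_machine_name
    bin_name = models.CharField(max_length=64, unique=True)  # one record per binary name
    bin_providers = models.CharField(max_length=128)         # e.g. 'pip,apt,brew,env'
    overrides = models.JSONField(default=dict)

class Binary(models.Model):
    # Per-machine installation state
    machine = models.ForeignKey('machine.Machine', on_delete=models.CASCADE)
    name = models.CharField(max_length=64)
    abspath = models.CharField(max_length=255, blank=True)
    version = models.CharField(max_length=32, blank=True)
    sha256 = models.CharField(max_length=64, blank=True)

    class Meta:
        unique_together = ('machine', 'name', 'abspath', 'version', 'sha256')
```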
What was implemented:

- ✅ Refactored `Dependency.run()` (lines 249-324):
  - Uses `discover_hooks()` and `process_hook_records()` for consistency
  - Added comprehensive docstring explaining why no state machine is needed
  - Synchronous execution: returns a Binary or None immediately (see the sketch below)
  - Uses the unified JSONL processing pattern
- ✅ Kept Dependency simple: just configuration fields, no state fields
- ✅ Multi-machine support: each machine independently runs Dependency.run() and creates its own Binary
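
A standalone sketch of that synchronous contract (the names are hypothetical; the real `Dependency.run()` discovers hook scripts and parses their JSONL output rather than calling Python callables):

```python
from typing import Callable, List, Optional

def run_dependency(install_hooks: List[Callable[[], Optional[dict]]]) -> Optional[dict]:
    """Try each provider hook in order; return the first Binary produced.

    Returns immediately -- no queueing, no retry_at, no state machine.
    Each callable stands in for one install hook and returns a
    Binary-shaped dict on success or None on skip/failure.
    """
    for hook in install_hooks:
        binary = hook()
        if binary is not None:
            return binary  # first provider to succeed wins
    return None  # no provider could install the binary on this machine
```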
## Summary of Changes

Progress: 6/6 core phases complete ✅ + 2 phases skipped (intentionally)

ALL core functionality is now complete! The unified pattern is consistently implemented across Crawl, Snapshot, and ArchiveResult. Dependency is intentionally kept simple (no state machine needed).
Files Modified:

- ✅ DONE `archivebox/hooks.py` - added unified helpers:
  - ✅ `process_hook_records(records, overrides)` - dispatcher (lines 1258-1323)
  - ✅ `process_is_alive(pid_file)` - check if a PID is still running (lines 1326-1344)
  - ✅ `kill_process(pid_file)` - kill a process (lines 1347-1362)
- ✅ DONE `archivebox/crawls/models.py` - already updated:
  - ✅ `Crawl.run()` - runs hooks, processes JSONL, creates snapshots
  - ✅ `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd
- ✅ DONE `archivebox/core/models.py`:
  - ✅ `Tag.from_jsonl()` - lines 93-116
  - ✅ `Snapshot.from_jsonl()` - lines 1197-1234 (simplified, removed filtering)
  - ✅ `Snapshot.cleanup()` - lines 1144-1172 (kills background hooks, calls ar.update_from_output())
  - ✅ `Snapshot.has_running_background_hooks()` - lines 1174-1193 (checks PIDs)
  - ✅ `ArchiveResult.run()` - simplified, uses `update_from_output()` (lines 1841-1906)
  - ✅ `ArchiveResult.update_from_output()` - unified filesystem reading (lines 1908-2061)
  - ✅ DELETED `ArchiveResult.check_background_completed()` - replaced by `process_is_alive()`
  - ✅ DELETED `ArchiveResult.finalize_background_hook()` - replaced by `update_from_output()`
  - ✅ DELETED `ArchiveResult._populate_output_fields()` - merged into `update_from_output()`
- ✅ DONE `archivebox/core/statemachines.py`:
  - ✅ Simplified `SnapshotMachine.is_finished()` - uses `has_running_background_hooks()` (line 68)
  - ✅ Added cleanup call to `SnapshotMachine.sealed.enter` (line 105)
- ✅ DONE `archivebox/machine/models.py`:
  - ✅ `Machine.from_jsonl()` - lines 66-89
  - ✅ `Dependency.from_jsonl()` - lines 203-227
  - ✅ `Binary.from_jsonl()` - lines 401-434
  - ✅ Refactored `Dependency.run()` to use the unified pattern (lines 249-324)
  - ✅ Added comprehensive docstring explaining why Dependency doesn't need a state machine
  - ✅ Kept Dependency simple: no state fields, synchronous execution only
Code Metrics:

- Lines removed: ~280 lines of duplicate code
- Lines added: ~160 lines of unified code
- Net reduction: ~120 lines total (-43%)
- Files created: 0 (no new files needed)

Key Benefits:

- Consistency: all stateful models (Crawl, Snapshot, ArchiveResult) follow the same unified state machine pattern
- Simplicity: eliminated special-case background hook handling (~280 lines of duplicate code)
- Correctness: background hooks are properly cleaned up on the seal transition
- Maintainability: unified `process_hook_records()` dispatcher for all JSONL processing
- Testability: a consistent pattern makes testing easier
- Clear separation: stateful work items (Crawl/Snapshot/ArchiveResult) vs. stateless config (Dependency)
- Multi-machine support: Dependency remains simple synchronous config, Binary tracks per-machine state
## Final Unified Pattern

All models now follow this consistent architecture:

### State Machine Structure

```python
class ModelMachine(StateMachine):
    queued = State(initial=True)
    started = State()
    sealed = State(final=True)  # or succeeded/failed, depending on the model

    @started.enter
    def enter_started(self):
        self.model.run()  # Execute the work

    @sealed.enter  # or @succeeded.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
```
### Model Methods

```python
class Model:
    # State machine fields
    status = CharField(default='queued')
    retry_at = DateTimeField(default=timezone.now)
    output_dir = CharField(default='', blank=True)
    state_machine_name = 'app.statemachines.ModelMachine'

    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('EventName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir)  # + model-specific kwargs
            if result is None:  # Background hook
                continue
            # Process JSONL records
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(result['records'], overrides=overrides)

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)
        # Update children from filesystem
        for child in self.children:  # model-specific children queryset
            child.update_from_output()

    def update_and_requeue(self, **fields):
        """Update fields and bump modified_at."""
        for field, value in fields.items():
            setattr(self, field, value)
        self.save(update_fields=[*fields.keys(), 'modified_at'])

    @staticmethod
    def from_jsonl(record: dict, overrides: dict = None):
        """Create/update model from JSONL record."""
        # Implementation is model-specific; called by process_hook_records()
        ...
```
### Hook Processing Flow

1. Model.run() discovers hooks
2. Hooks execute and output JSONL to stdout
3. JSONL records are dispatched via process_hook_records()
4. Each record type is handled by Model.from_jsonl()
5. Background hooks are tracked via hook.pid files
6. Model.cleanup() kills background hooks on seal
7. Children are updated via update_from_output()
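
Steps 2-4 hinge on parsing JSONL from the hook's captured stdout. A minimal sketch of such a parser (the function name is hypothetical; this doc doesn't show where the real parsing lives, though `archivebox.misc.jsonl` is referenced elsewhere):

```python
import json
from pathlib import Path
from typing import Dict, Iterator

def iter_jsonl_records(stdout_log: Path) -> Iterator[Dict]:
    """Yield JSONL record dicts from a hook's captured stdout, skipping
    any non-JSON lines (hooks may also print plain log output)."""
    if not stdout_log.exists():
        return
    for line in stdout_log.read_text().splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # plain log line, not a record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line; ignore rather than fail the hook
        if isinstance(record, dict) and 'type' in record:
            yield record
```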
### Multi-Machine Coordination

- Work items (Crawl, Snapshot, ArchiveResult): no machine FK, any worker can claim
- Resources (Binary): machine FK, one per machine per binary
- Configuration (Dependency): no machine FK, global singleton, synchronous execution
- Execution tracking (ArchiveResult.iface): FK to NetworkInterface for observability
## Testing Checklist
- Test Crawl → Snapshot creation with hooks
- Test Snapshot → ArchiveResult creation
- Test ArchiveResult foreground hooks (JSONL processing)
- Test ArchiveResult background hooks (PID tracking, cleanup)
- Test Dependency.run() synchronous installation
- Test background hook cleanup on seal transition
- Test multi-machine Crawl execution
- Test Binary creation per machine (one per machine per binary)
- Verify Dependency.run() can be called concurrently from multiple machines safely
## FINAL ARCHITECTURE (Phases 1-8 Complete)

### ✅ Phases 1-6: Core Models Unified

All core models (Crawl, Snapshot, ArchiveResult) now follow the unified pattern:

- State machines orchestrate transitions
- `.run()` methods execute hooks and process JSONL
- `.cleanup()` methods kill background hooks
- `.update_and_requeue()` methods update state for worker coordination
- Consistent use of `process_hook_records()` for JSONL dispatching
### ✅ Phases 7-8: Binary State Machine (Dependency Model Eliminated)

Key Decision: Eliminated the Dependency model entirely and made Binary the state machine.

#### New Architecture

- Static configuration: `plugins/{plugin}/dependencies.jsonl` files define binary requirements:

  ```jsonl
  {"type": "Binary", "name": "yt-dlp", "bin_providers": "pip,brew,apt,env"}
  {"type": "Binary", "name": "node", "bin_providers": "apt,brew,env", "overrides": {"apt": {"packages": ["nodejs"]}}}
  {"type": "Binary", "name": "ffmpeg", "bin_providers": "apt,brew,env"}
  ```

- Dynamic state: the `Binary` model tracks per-machine installation state
  - Fields: `machine`, `name`, `bin_providers`, `overrides`, `abspath`, `version`, `sha256`, `binprovider`
  - State machine: `queued → started → succeeded/failed`
  - Output dir: `data/machines/{machine_id}/binaries/{binary_name}/{binary_id}/`
#### Binary State Machine Flow

```python
class BinaryMachine(StateMachine):
    # queued → started → succeeded/failed

    @started.enter
    def enter_started(self):
        self.binary.run()  # Runs on_Binary__install_* hooks


class Binary(models.Model):
    def run(self):
        """
        Runs ALL on_Binary__install_* hooks.
        Each hook checks bin_providers and decides if it can handle this binary.
        The first hook to succeed wins.
        Outputs JSONL with abspath, version, sha256, binprovider.
        """
        hooks = discover_hooks('Binary')
        for hook in hooks:
            result = run_hook(
                hook,
                output_dir=self.OUTPUT_DIR / hook.parent.name,  # plugin_name dir
                binary_id=self.id,
                machine_id=self.machine_id,
                name=self.name,
                bin_providers=self.bin_providers,
                overrides=json.dumps(self.overrides),
            )
            # Hook outputs: {"type": "Binary", "name": "wget", "abspath": "/usr/bin/wget",
            #                "version": "1.21", "binprovider": "apt"}
            # Binary.from_jsonl() updates self with the installation results
```
#### Hook Naming Convention

- Before: `on_Dependency__install_using_pip_provider.py`
- After: `on_Binary__install_using_pip_provider.py`

Each hook checks the `--bin-providers` CLI argument:

```python
if 'pip' not in bin_providers.split(','):
    sys.exit(0)  # Skip this binary
```
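
A fuller, hypothetical skeleton of such a provider hook (argument names follow the `--bin-providers` convention above; the exact CLI surface and exit-code semantics are assumptions):

```python
#!/usr/bin/env python3
# Hypothetical on_Binary__install_using_pip_provider.py skeleton.
import argparse
import json
import shutil
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--name', required=True)
parser.add_argument('--bin-providers', default='')
args, _unknown = parser.parse_known_args()

if 'pip' not in args.bin_providers.split(','):
    sys.exit(0)  # Skip this binary: pip isn't an allowed provider for it

# Attempt installation with this provider (failure is re-checked below)
subprocess.run([sys.executable, '-m', 'pip', 'install', args.name], check=False)

abspath = shutil.which(args.name)
if abspath:
    # Emit a Binary JSONL record for Binary.from_jsonl() to pick up
    print(json.dumps({'type': 'Binary', 'name': args.name,
                      'abspath': abspath, 'binprovider': 'pip'}))
    sys.exit(0)

sys.exit(1)  # Could not install; a later provider hook may still succeed
```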
#### Perfect Symmetry Achieved

All models now follow identical patterns:

```
Crawl(queued)         → CrawlMachine         → Crawl.run()         → sealed
Snapshot(queued)      → SnapshotMachine      → Snapshot.run()      → sealed
ArchiveResult(queued) → ArchiveResultMachine → ArchiveResult.run() → succeeded/failed
Binary(queued)        → BinaryMachine        → Binary.run()        → succeeded/failed
```
#### Benefits of Eliminating Dependency

- No global singleton conflicts: Binary is per-machine, no race conditions
- Simpler data model: one table instead of two (Dependency + InstalledBinary)
- Static configuration: dependencies.jsonl lives in version control, not the database
- Consistent state machine: Binary follows the same pattern as the other models
- Cleaner hooks: hooks check bin_providers themselves instead of the orchestrator parsing hook names

#### Multi-Machine Coordination

- Work items (Crawl, Snapshot, ArchiveResult): no machine FK, any worker can claim
- Resources (Binary): machine FK, one per machine per binary name
- Configuration: static files in `plugins/*/dependencies.jsonl`
- Execution tracking: ArchiveResult.iface FK to NetworkInterface for observability
## Testing Checklist (Updated)

- Core models use unified hook pattern (Phases 1-6)
- Binary installation via state machine
- Multiple machines can install the same binary independently
- Hook bin_providers filtering works correctly
- Binary.from_jsonl() handles both dependencies.jsonl and hook output
- Binary OUTPUT_DIR structure: `data/machines/{machine_id}/binaries/{name}/{id}/`