Hook & State Machine Cleanup - Unified Pattern

Goal

Implement a consistent pattern across all models (Crawl, Snapshot, ArchiveResult, Dependency) for:

  1. Running hooks
  2. Processing JSONL records
  3. Managing background hooks
  4. State transitions

Current State Analysis (ALL COMPLETE)

Crawl (archivebox/crawls/)

Status: COMPLETE

  • Has state machine: CrawlMachine
  • Crawl.run() - runs hooks, processes JSONL via process_hook_records(), creates snapshots
  • Crawl.cleanup() - kills background hooks, runs on_CrawlEnd hooks
  • Uses OUTPUT_DIR/plugin_name/ for PWD
  • State machine calls model methods:
    • queued -> started: calls crawl.run()
    • started -> sealed: calls crawl.cleanup()

Snapshot (archivebox/core/)

Status: COMPLETE

  • Has state machine: SnapshotMachine
  • Snapshot.run() - creates pending ArchiveResults
  • Snapshot.cleanup() - kills background ArchiveResult hooks, calls update_from_output()
  • Snapshot.has_running_background_hooks() - checks PID files using process_is_alive()
  • Snapshot.from_jsonl() - simplified, filtering moved to caller
  • State machine calls model methods:
    • queued -> started: calls snapshot.run()
    • started -> sealed: calls snapshot.cleanup()
    • is_finished(): uses has_running_background_hooks()

ArchiveResult (archivebox/core/)

Status: COMPLETE - Major refactor completed

  • Has state machine: ArchiveResultMachine
  • ArchiveResult.run() - runs hook, calls update_from_output() for foreground hooks
  • ArchiveResult.update_from_output() - unified method for foreground and background hooks
  • Uses PWD snapshot.OUTPUT_DIR/plugin_name
  • JSONL processing via process_hook_records() with URL/depth filtering
  • DELETED special background hook methods:
    • check_background_completed() - replaced by process_is_alive() helper
    • finalize_background_hook() - replaced by update_from_output()
    • _populate_output_fields() - merged into update_from_output()
  • State machine transitions:
    • queued -> started: calls archiveresult.run()
    • started -> succeeded/failed/skipped: status set by update_from_output()

Binary (archivebox/machine/) - NEW!

Status: COMPLETE - Replaced Dependency model entirely

  • Has state machine: BinaryMachine
  • Binary.run() - runs on_Binary__install_* hooks, processes JSONL
  • Binary.cleanup() - kills background installation hooks (for consistency)
  • Binary.from_jsonl() - handles both binaries.jsonl and hook output
  • Uses PWD data/machines/{machine_id}/binaries/{name}/{id}/plugin_name/
  • Configuration via static plugins/*/binaries.jsonl files
  • State machine calls model methods:
    • queued -> started: calls binary.run()
    • started -> succeeded/failed: status set by hooks via JSONL
  • Perfect symmetry with Crawl/Snapshot/ArchiveResult pattern

Dependency Model - ELIMINATED

Status: Deleted entirely (replaced by Binary state machine)

  • Static configuration now lives in plugins/*/binaries.jsonl
  • Per-machine state tracked by Binary records
  • No global singleton conflicts
  • Hooks renamed from on_Dependency__install_* to on_Binary__install_*

Unified Pattern (Target Architecture)

Pattern for ALL models:

# 1. State Machine orchestrates transitions
class ModelMachine(StateMachine):
    @started.enter
    def enter_started(self):
        self.model.run()  # Do the work
        # Update status

    def is_finished(self):
        # Check if background hooks still running
        if self.model.has_running_background_hooks():
            return False
        # Check if children finished
        if self.model.has_pending_children():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
        # Update status

# 2. Model methods do the actual work
class Model:
    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('ModelName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)

            if result is None:  # Background hook
                continue

            # Process JSONL records
            records = result.get('records', [])
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(records, overrides=overrides)

        # Create children (e.g., ArchiveResults, Snapshots)
        self.create_children()

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        # Kill any background hooks
        if self.OUTPUT_DIR.exists():
            for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
                kill_process(pid_file)

        # Run cleanup hooks (e.g., on_ModelEnd)
        cleanup_hooks = discover_hooks('ModelEnd')
        for hook in cleanup_hooks:
            run_hook(hook, ...)

    def has_running_background_hooks(self) -> bool:
        """Check if any background hooks still running."""
        if not self.OUTPUT_DIR.exists():
            return False
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            if process_is_alive(pid_file):
                return True
        return False

PWD Standard:

model.OUTPUT_DIR/plugin_name/
  • Crawl: users/{user}/crawls/{date}/{crawl_id}/plugin_name/
  • Snapshot: users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/
  • ArchiveResult: users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/ (same as Snapshot)
  • Dependency: dependencies/{dependency_id}/plugin_name/ (set output_dir field directly)
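
As a purely hypothetical illustration of the layout above, a Snapshot.OUTPUT_DIR property might be computed like this (DATA_DIR, created_by, bookmarked_at, and domain are all assumed names, not confirmed fields):

# Hypothetical OUTPUT_DIR derivation matching the PWD standard above:
from pathlib import Path

class Snapshot:
    @property
    def OUTPUT_DIR(self) -> Path:
        # users/{user}/snapshots/{date}/{domain}/{snapshot_id}/
        return (
            DATA_DIR / 'users' / self.created_by.username / 'snapshots'
            / self.bookmarked_at.strftime('%Y%m%d') / self.domain / str(self.id)
        )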

Implementation Plan

Phase 1: Add unified helpers to hooks.py DONE

File: archivebox/hooks.py

Status: COMPLETE - Added three helper functions:

  • process_hook_records(records, overrides) - lines 1258-1323
  • process_is_alive(pid_file) - lines 1326-1344
  • kill_process(pid_file, sig) - lines 1347-1362

import os
from pathlib import Path
from signal import SIGTERM
from typing import Dict, List

def process_hook_records(records: List[Dict], overrides: Dict = None) -> Dict[str, int]:
    """
    Process JSONL records from hook output.
    Dispatches to Model.from_jsonl() for each record type.

    Args:
        records: List of JSONL record dicts from result['records']
        overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

    Returns:
        Dict with counts by record type
    """
    stats = {}
    for record in records:
        record_type = record.get('type')

        # Dispatch to appropriate model
        if record_type == 'Snapshot':
            from archivebox.core.models import Snapshot
            Snapshot.from_jsonl(record, overrides)
            stats['Snapshot'] = stats.get('Snapshot', 0) + 1
        elif record_type == 'Tag':
            from archivebox.core.models import Tag
            Tag.from_jsonl(record, overrides)
            stats['Tag'] = stats.get('Tag', 0) + 1
        elif record_type == 'Binary':
            from archivebox.machine.models import Binary
            Binary.from_jsonl(record, overrides)
            stats['Binary'] = stats.get('Binary', 0) + 1
        # ... etc
    return stats

def process_is_alive(pid_file: Path) -> bool:
    """Check if process in PID file is still running."""
    if not pid_file.exists():
        return False
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, 0)  # Signal 0 = check if exists
        return True
    except (OSError, ValueError):
        return False

def kill_process(pid_file: Path, sig=SIGTERM):
    """Kill process in PID file."""
    if not pid_file.exists():
        return
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, sig)
    except (OSError, ValueError):
        pass
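
Roughly how the three helpers compose inside a model's run()/polling logic (a sketch; hook, output_dir, snapshot, sealing, and the finalize step are stand-in names, not real code):

# Sketch of the three helpers composed together (stand-in variable names):
result = run_hook(hook, output_dir=output_dir)
if result is not None:
    # Foreground hook: dispatch its JSONL records immediately
    stats = process_hook_records(result.get('records', []),
                                 overrides={'snapshot': snapshot,
                                            'created_by_id': snapshot.created_by_id})
    # stats maps record type -> count, e.g. {'Snapshot': 3, 'Tag': 2}
else:
    # Background hook: it left a hook.pid behind in its output dir
    pid_file = output_dir / 'hook.pid'
    if not process_is_alive(pid_file):
        pass  # exited on its own: finalize via update_from_output()
    elif sealing:
        kill_process(pid_file)  # sends SIGTERM by default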

Phase 2: Add Model.from_jsonl() static methods DONE

Files: archivebox/core/models.py, archivebox/machine/models.py, archivebox/crawls/models.py

Status: COMPLETE - Added from_jsonl() to:

  • Tag.from_jsonl() - core/models.py lines 93-116
  • Snapshot.from_jsonl() - core/models.py lines 1144-1189
  • Machine.from_jsonl() - machine/models.py lines 66-89
  • Dependency.from_jsonl() - machine/models.py lines 203-227
  • Binary.from_jsonl() - machine/models.py lines 401-434

Example implementations added:

class Snapshot:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Snapshot from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_snapshot
        from django.utils import timezone
        overrides = overrides or {}

        # Apply overrides (crawl, parent_snapshot, depth limits)
        crawl = overrides.get('crawl')
        snapshot = overrides.get('snapshot')  # parent

        if crawl:
            depth = record.get('depth', (snapshot.depth + 1 if snapshot else 1))
            if depth > crawl.max_depth:
                return None
            record.setdefault('crawl_id', str(crawl.id))
            record.setdefault('depth', depth)
            if snapshot:
                record.setdefault('parent_snapshot_id', str(snapshot.id))

        created_by_id = overrides.get('created_by_id')
        new_snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
        new_snapshot.status = Snapshot.StatusChoices.QUEUED
        new_snapshot.retry_at = timezone.now()
        new_snapshot.save()
        return new_snapshot

class Tag:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Tag from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_tag
        tag = get_or_create_tag(record)
        # Auto-attach to snapshot if in overrides
        if overrides and 'snapshot' in overrides:
            overrides['snapshot'].tags.add(tag)
        return tag

class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        # Implementation similar to existing create_model_record()
        ...

# Etc for other models
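
For illustration, a fleshed-out Binary.from_jsonl() could look like the sketch below (hypothetical: the lookup fields mirror the unique_together constraint noted later in this document, and get_current_machine() is an assumed helper):

# Hypothetical Binary.from_jsonl() implementation sketch:
class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        overrides = overrides or {}
        machine = overrides.get('machine') or get_current_machine()
        binary, _created = Binary.objects.update_or_create(
            machine=machine,
            name=record['name'],
            abspath=record.get('abspath', ''),
            version=record.get('version', ''),
            sha256=record.get('sha256', ''),
            defaults={'binprovider': record.get('binprovider', '')},
        )
        return binary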

Phase 3: Update ArchiveResult to use unified pattern DONE

File: archivebox/core/models.py

Status: COMPLETE

Changes made:

  1. Replaced inline JSONL processing (lines 1912-1950):

    • Pre-filter Snapshot records for depth/URL constraints in ArchiveResult.run()
    • Use self._url_passes_filters(url) with parent snapshot's config for proper hierarchy
    • Replaced inline Tag/Snapshot/other record creation with process_hook_records()
    • Removed ~60 lines of duplicate code
  2. Simplified Snapshot.from_jsonl() (lines 1144-1189):

    • Removed depth checking (now done in caller)
    • Just applies crawl metadata and creates snapshot
    • Added docstring note: "Filtering should be done by caller BEFORE calling this method"
  3. Preserved ArchiveResult self-update logic:

    • Status/output fields still updated from ArchiveResult JSONL record (lines 1856-1910)
    • Special title extractor logic preserved (line 1952+)
    • Search indexing trigger preserved (line 1957+)
  4. Key insight: Filtering happens in ArchiveResult.run() where we have parent snapshot context, NOT in from_jsonl() where we'd lose config hierarchy

Note: Did NOT delete special background hook methods (check_background_completed, finalize_background_hook) - that's Phase 6
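
Sketched out, the pre-filtering in ArchiveResult.run() looks roughly like this (variable names are approximations of the changes described above):

# Approximate shape of the pre-filtering inside ArchiveResult.run():
filtered_records = []
for record in result.get('records', []):
    if record.get('type') == 'Snapshot':
        # Depth check uses the parent snapshot's crawl, available here in run()
        depth = record.get('depth', self.snapshot.depth + 1)
        if self.snapshot.crawl and depth > self.snapshot.crawl.max_depth:
            continue
        # URL filtering uses the parent snapshot's config hierarchy
        if not self._url_passes_filters(record.get('url', '')):
            continue
    filtered_records.append(record)

process_hook_records(filtered_records, overrides={
    'snapshot': self.snapshot,
    'crawl': self.snapshot.crawl,
    'created_by_id': self.created_by_id,
})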

Phase 4: Add Snapshot.cleanup() method DONE

File: archivebox/core/models.py

Status: COMPLETE

Changes made:

  1. Added Snapshot.cleanup() (lines 1144-1175):

    • Kills background ArchiveResult hooks by scanning for */hook.pid files
    • Finalizes background ArchiveResults using finalize_background_hook() (temporary until Phase 6)
    • Called by state machine when entering sealed state
  2. Added Snapshot.has_running_background_hooks() (lines 1177-1195):

    • Checks if any background hooks still running using process_is_alive()
    • Used by state machine in is_finished() check
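
In sketch form, the two new methods (the reverse-relation and status names are assumptions):

# Sketch of Snapshot.cleanup() / has_running_background_hooks() per the above:
def cleanup(self):
    """Kill background ArchiveResult hooks, then finalize their records."""
    if self.OUTPUT_DIR.exists():
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)

    # Finalize ArchiveResults that were still running in the background
    for ar in self.archiveresult_set.filter(status=ArchiveResult.StatusChoices.STARTED):
        ar.update_from_output()  # was finalize_background_hook() until Phase 6

def has_running_background_hooks(self) -> bool:
    """True if any */hook.pid in OUTPUT_DIR points at a live process."""
    if not self.OUTPUT_DIR.exists():
        return False
    return any(process_is_alive(p) for p in self.OUTPUT_DIR.glob('*/hook.pid'))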

Phase 5: Update SnapshotMachine to use cleanup() DONE

File: archivebox/core/statemachines.py

Status: COMPLETE

Changes made:

  1. Simplified is_finished() (lines 58-72):

    • Removed inline background hook checking and finalization (lines 67-76 deleted)
    • Now uses self.snapshot.has_running_background_hooks() (line 68)
    • Removed ~12 lines of duplicate logic
  2. Added cleanup() to sealed.enter (lines 102-111):

    • Calls self.snapshot.cleanup() to kill background hooks (line 105)
    • Follows unified pattern: cleanup happens on seal, not in is_finished()
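
Put together, the resulting SnapshotMachine shape looks roughly like this (a sketch; the pending-ArchiveResult check and status values are assumptions):

# Rough shape of SnapshotMachine after Phases 4-5 (details assumed, not verbatim):
class SnapshotMachine(StateMachine):
    def is_finished(self) -> bool:
        # Background hooks still running? Not finished yet.
        if self.snapshot.has_running_background_hooks():
            return False
        # Any ArchiveResults still pending? Not finished yet (assumed check).
        if self.snapshot.pending_archiveresults().exists():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        # Cleanup happens on seal, not inside is_finished()
        self.snapshot.cleanup()
        self.snapshot.update_and_requeue(status='sealed', retry_at=None)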

Phase 6: Add ArchiveResult.update_from_output() and simplify run() DONE

File: archivebox/core/models.py

Status: COMPLETE - The BIG refactor (removed ~200 lines of duplication)

Changes made:

  1. Added ArchiveResult.update_from_output() (lines 1908-2061):

    • Unified method for both foreground and background hooks
    • Reads stdout.log and parses JSONL records
    • Updates status/output_str/output_json from ArchiveResult JSONL record
    • Walks filesystem to populate output_files/output_size/output_mimetypes
    • Filters Snapshot records for depth/URL constraints (same as run())
    • Processes side-effect records via process_hook_records()
    • Updates snapshot title if title extractor
    • Triggers search indexing if succeeded
    • Cleans up PID files and empty logs
    • ~160 lines of comprehensive logic
  2. Simplified ArchiveResult.run() (lines 1841-1906):

    • Removed ~120 lines of duplicate filesystem reading logic
    • Now just sets start_ts/pwd and calls update_from_output()
    • Background hooks: return immediately after saving status=STARTED
    • Foreground hooks: call update_from_output() to do all the work
    • Removed ~10 lines of duplicate code
  3. Updated Snapshot.cleanup() (line 1172):

    • Changed from ar.finalize_background_hook() to ar.update_from_output()
    • Uses the unified method instead of the old special-case method
  4. Deleted _populate_output_fields() (was ~45 lines):

    • Logic merged into update_from_output()
    • Eliminates duplication of filesystem walking code
  5. Deleted check_background_completed() (was ~20 lines):

    • Replaced by process_is_alive(pid_file) from hooks.py
    • Generic helper used by Snapshot.has_running_background_hooks()
  6. Deleted finalize_background_hook() (was ~85 lines):

    • Completely replaced by update_from_output()
    • Was duplicate of foreground hook finalization logic

Total lines removed: ~280 lines of duplicate code. Total lines added: ~160 lines of unified code. Net reduction: ~120 lines (-43%).
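
For orientation, a heavily condensed sketch of what update_from_output() does, per the list above (the real method is ~160 lines; all names below are approximations, not the actual code):

# Condensed, hypothetical sketch of the update_from_output() flow:
import json
from pathlib import Path

def update_from_output(self):
    """Approximation of the unified finalization flow for fg/bg hooks."""
    plugin_dir = Path(self.pwd)

    # 1. Read stdout.log and parse JSONL records
    records = []
    stdout_log = plugin_dir / 'stdout.log'
    if stdout_log.exists():
        for line in stdout_log.read_text().splitlines():
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue

    # 2. Apply this ArchiveResult's own record (status/output_str/output_json)
    for record in records:
        if record.get('type') == 'ArchiveResult':
            self.status = record.get('status', self.status)
            self.output_str = record.get('output_str', self.output_str)

    # 3. Walk the filesystem for output_files/output_size/output_mimetypes (elided)
    # 4. Filter Snapshot records for depth/URL, then dispatch side-effects
    process_hook_records(records, overrides={'snapshot': self.snapshot,
                                             'created_by_id': self.created_by_id})

    # 5. Clean up the PID file and empty logs, then persist
    (plugin_dir / 'hook.pid').unlink(missing_ok=True)
    self.save()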

Phase 7-8: Dependency State Machine NOT NEEDED

Status: Intentionally skipped - Dependency doesn't need a state machine

Why no state machine for Dependency?

  1. Wrong Granularity: Dependency is a GLOBAL singleton (one record per binary name)

    • Multiple machines would race to update the same status/retry_at fields
    • No clear semantics: "started" on which machine? "failed" on Machine A but "succeeded" on Machine B?
  2. Wrong Timing: Installation should be SYNCHRONOUS, not queued

    • When a worker needs wget, it should install wget NOW, not queue it for later
    • No benefit to async state machine transitions
  3. State Lives Elsewhere: Binary records are the actual state

    • Each machine has its own Binary records (one per machine per binary)
    • Binary.machine FK provides proper per-machine state tracking

Correct Architecture:

Dependency (global, no state machine):
  ├─ Configuration: bin_name, bin_providers, overrides
  ├─ run() method: synchronous installation attempt
  └─ NO status, NO retry_at, NO state_machine_name

Binary (per-machine, has machine FK):
  ├─ State: is this binary installed on this specific machine?
  ├─ Created via JSONL output from on_Dependency hooks
  └─ unique_together = (machine, name, abspath, version, sha256)

What was implemented:

  • Refactored Dependency.run() (lines 249-324):
    • Uses discover_hooks() and process_hook_records() for consistency
    • Added comprehensive docstring explaining why no state machine
    • Synchronous execution: returns Binary or None immediately
    • Uses unified JSONL processing pattern
  • Kept Dependency simple: Just configuration fields, no state fields
  • Multi-machine support: Each machine independently runs Dependency.run() and creates its own Binary
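
A hedged sketch of the synchronous Dependency.run() described above (this_machine() and the final Binary lookup are assumed, not the actual code):

# Sketch of synchronous Dependency.run() (assumed helper names):
def run(self) -> 'Binary | None':
    """Try each install hook synchronously; return the new Binary or None."""
    hooks = discover_hooks('Dependency')
    for hook in hooks:
        output_dir = Path(self.output_dir) / hook.parent.name
        result = run_hook(hook, output_dir=output_dir,
                          name=self.bin_name, bin_providers=self.bin_providers)
        if result is None:
            continue  # install hooks run synchronously; None means nothing ran

        stats = process_hook_records(result.get('records', []),
                                     overrides={'dependency': self})
        if stats.get('Binary'):
            # First hook to produce a Binary record wins
            return Binary.objects.filter(machine=this_machine(),
                                         name=self.bin_name).latest('created_at')
    return None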

Summary of Changes

Progress: 6/6 Core Phases Complete + 2 Phases Skipped (Intentionally)

ALL core functionality is now complete! The unified pattern is consistently implemented across Crawl, Snapshot, and ArchiveResult. Dependency intentionally kept simple (no state machine needed).

Files Modified:

  1. DONE archivebox/hooks.py - Add unified helpers:

    • process_hook_records(records, overrides) - dispatcher (lines 1258-1323)
    • process_is_alive(pid_file) - check if PID still running (lines 1326-1344)
    • kill_process(pid_file) - kill process (lines 1347-1362)
  2. DONE archivebox/crawls/models.py - Already updated:

    • Crawl.run() - runs hooks, processes JSONL, creates snapshots
    • Crawl.cleanup() - kills background hooks, runs on_CrawlEnd
  3. DONE archivebox/core/models.py:

    • Tag.from_jsonl() - lines 93-116
    • Snapshot.from_jsonl() - lines 1197-1234 (simplified, removed filtering)
    • Snapshot.cleanup() - lines 1144-1172 (kill background hooks, calls ar.update_from_output())
    • Snapshot.has_running_background_hooks() - lines 1174-1193 (check PIDs)
    • ArchiveResult.run() - simplified, uses update_from_output() (lines 1841-1906)
    • ArchiveResult.update_from_output() - unified filesystem reading (lines 1908-2061)
    • DELETED ArchiveResult.check_background_completed() - replaced by process_is_alive()
    • DELETED ArchiveResult.finalize_background_hook() - replaced by update_from_output()
    • DELETED ArchiveResult._populate_output_fields() - merged into update_from_output()
  4. DONE archivebox/core/statemachines.py:

    • Simplified SnapshotMachine.is_finished() - uses has_running_background_hooks() (line 68)
    • Added cleanup call to SnapshotMachine.sealed.enter (line 105)
  5. DONE archivebox/machine/models.py:

    • Machine.from_jsonl() - lines 66-89
    • Dependency.from_jsonl() - lines 203-227
    • Binary.from_jsonl() - lines 401-434
    • Refactored Dependency.run() to use unified pattern (lines 249-324)
    • Added comprehensive docstring explaining why Dependency doesn't need state machine
    • Kept Dependency simple: no state fields, synchronous execution only

Code Metrics:

  • Lines removed: ~280 lines of duplicate code
  • Lines added: ~160 lines of unified code
  • Net reduction: ~120 lines total (-43%)
  • Files created: 0 (no new files needed)

Key Benefits:

  1. Consistency: All stateful models (Crawl, Snapshot, ArchiveResult) follow the same unified state machine pattern
  2. Simplicity: Eliminated special-case background hook handling (~280 lines of duplicate code)
  3. Correctness: Background hooks are properly cleaned up on seal transition
  4. Maintainability: Unified process_hook_records() dispatcher for all JSONL processing
  5. Testability: Consistent pattern makes testing easier
  6. Clear Separation: Stateful work items (Crawl/Snapshot/ArchiveResult) vs stateless config (Dependency)
  7. Multi-Machine Support: Dependency remains simple synchronous config, Binary tracks per-machine state

Final Unified Pattern

All models now follow this consistent architecture:

State Machine Structure

class ModelMachine(StateMachine):
    queued = State(initial=True)
    started = State()
    sealed = State(final=True)  # or succeeded/failed for models with terminal outcomes

    @started.enter
    def enter_started(self):
        self.model.run()  # Execute the work

    @sealed.enter  # or @succeeded.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks

Model Methods

class Model:
    # State machine fields
    status = CharField(default='queued')
    retry_at = DateTimeField(default=timezone.now)
    output_dir = CharField(default='', blank=True)
    state_machine_name = 'app.statemachines.ModelMachine'

    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('EventName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)

            if result is None:  # Background hook
                continue

            # Process JSONL records
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(result['records'], overrides=overrides)

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)

        # Finalize children from their filesystem output (queryset is model-specific)
        for child in self.pending_children():
            child.update_from_output()

    def update_and_requeue(self, **fields):
        """Update fields and bump modified_at."""
        for field, value in fields.items():
            setattr(self, field, value)
        self.save(update_fields=[*fields.keys(), 'modified_at'])

    @staticmethod
    def from_jsonl(record: dict, overrides: dict = None):
        """Create/update model from JSONL record."""
        # Implementation specific to model
        # Called by process_hook_records()

Hook Processing Flow

1. Model.run() discovers hooks
2. Hooks execute and output JSONL to stdout
3. JSONL records dispatched via process_hook_records()
4. Each record type handled by Model.from_jsonl()
5. Background hooks tracked via hook.pid files
6. Model.cleanup() kills background hooks on seal
7. Children updated via update_from_output()
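
To make step 2 concrete, a toy hook script might look like this (purely illustrative; the filename and record values are made up, but the JSONL shapes match the record types used above):

#!/usr/bin/env python3
# Toy example hook, e.g. plugins/example/on_Snapshot__example.py (hypothetical):
# a hook does its work, then prints one JSONL record per line to stdout.
import json
import sys

def main() -> int:
    # ... do the actual archiving work here ...

    # Side-effect records get dispatched to Tag/Snapshot.from_jsonl():
    print(json.dumps({'type': 'Tag', 'name': 'example'}))
    print(json.dumps({'type': 'Snapshot', 'url': 'https://example.com/next', 'depth': 1}))

    # The hook's own result record updates the parent ArchiveResult:
    print(json.dumps({'type': 'ArchiveResult', 'status': 'succeeded', 'output_str': 'ok'}))
    return 0

if __name__ == '__main__':
    sys.exit(main())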

Multi-Machine Coordination

  • Work Items (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
  • Resources (Binary): Machine FK, one per machine per binary
  • Configuration (Dependency): No machine FK, global singleton, synchronous execution
  • Execution Tracking (ArchiveResult.iface): FK to NetworkInterface for observability

Testing Checklist

  • Test Crawl → Snapshot creation with hooks
  • Test Snapshot → ArchiveResult creation
  • Test ArchiveResult foreground hooks (JSONL processing)
  • Test ArchiveResult background hooks (PID tracking, cleanup)
  • Test Dependency.run() synchronous installation
  • Test background hook cleanup on seal transition
  • Test multi-machine Crawl execution
  • Test Binary creation per machine (one per machine per binary)
  • Verify Dependency.run() can be called concurrently from multiple machines safely

FINAL ARCHITECTURE (Phases 1-8 Complete)

Phases 1-6: Core Models Unified

All core models (Crawl, Snapshot, ArchiveResult) now follow the unified pattern:

  • State machines orchestrate transitions
  • .run() methods execute hooks and process JSONL
  • .cleanup() methods kill background hooks
  • .update_and_requeue() methods update state for worker coordination
  • Consistent use of process_hook_records() for JSONL dispatching

Phases 7-8: Binary State Machine (Dependency Model Eliminated)

Key Decision: Eliminated Dependency model entirely and made Binary the state machine.

New Architecture

  • Static Configuration: plugins/{plugin}/dependencies.jsonl files define binary requirements

    {"type": "Binary", "name": "yt-dlp", "bin_providers": "pip,brew,apt,env"}
    {"type": "Binary", "name": "node", "bin_providers": "apt,brew,env", "overrides": {"apt": {"packages": ["nodejs"]}}}
    {"type": "Binary", "name": "ffmpeg", "bin_providers": "apt,brew,env"}
    
  • Dynamic State: Binary model tracks per-machine installation state

    • Fields: machine, name, bin_providers, overrides, abspath, version, sha256, binprovider
    • State machine: queued → started → succeeded/failed
    • Output dir: data/machines/{machine_id}/binaries/{binary_name}/{binary_id}/

Binary State Machine Flow

class BinaryMachine(StateMachine):
    # states: queued → started → succeeded/failed

    @started.enter
    def enter_started(self):
        self.binary.run()  # Runs on_Binary__install_* hooks

class Binary(models.Model):
    def run(self):
        """
        Runs ALL on_Binary__install_* hooks.
        Each hook checks bin_providers and decides if it can handle this binary.
        First hook to succeed wins.
        Outputs JSONL with abspath, version, sha256, binprovider.
        """
        hooks = discover_hooks('Binary')
        for hook in hooks:
            result = run_hook(hook, output_dir=self.OUTPUT_DIR / hook.parent.name,
                              binary_id=self.id, machine_id=self.machine_id,
                              name=self.name, bin_providers=self.bin_providers,
                              overrides=json.dumps(self.overrides))
            
            # Hook outputs: {"type": "Binary", "name": "wget", "abspath": "/usr/bin/wget", "version": "1.21", "binprovider": "apt"}
            # Binary.from_jsonl() updates self with installation results

Hook Naming Convention

  • Before: on_Dependency__install_using_pip_provider.py
  • After: on_Binary__install_using_pip_provider.py

Each hook checks --bin-providers CLI argument:

if 'pip' not in bin_providers.split(','):
    sys.exit(0)  # pip isn't an allowed provider for this binary; skip this hook
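
A slightly fuller version of that check, as it might appear at the top of an install hook (argument names beyond --bin-providers are assumptions):

# Hypothetical top of on_Binary__install_using_pip_provider.py:
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--name')
parser.add_argument('--bin-providers', default='')
args, _unknown = parser.parse_known_args()

if 'pip' not in args.bin_providers.split(','):
    sys.exit(0)  # pip not allowed for this binary; another provider hook will try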

Perfect Symmetry Achieved

All models now follow identical patterns:

Crawl(queued)         → CrawlMachine         → Crawl.run()         → sealed
Snapshot(queued)      → SnapshotMachine      → Snapshot.run()      → sealed
ArchiveResult(queued) → ArchiveResultMachine → ArchiveResult.run() → succeeded/failed
Binary(queued)        → BinaryMachine        → Binary.run()        → succeeded/failed

Benefits of Eliminating Dependency

  1. No global singleton conflicts: Binary is per-machine, no race conditions
  2. Simpler data model: One table instead of two (Dependency + InstalledBinary)
  3. Static configuration: dependencies.jsonl in version control, not database
  4. Consistent state machine: Binary follows same pattern as other models
  5. Cleaner hooks: Hooks check bin_providers themselves instead of orchestrator parsing names

Multi-Machine Coordination

  • Work Items (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
  • Resources (Binary): Machine FK, one per machine per binary name
  • Configuration: Static files in plugins/*/dependencies.jsonl
  • Execution Tracking: ArchiveResult.iface FK to NetworkInterface for observability

Testing Checklist (Updated)

  • Core models use unified hook pattern (Phases 1-6)
  • Binary installation via state machine
  • Multiple machines can install same binary independently
  • Hook bin_providers filtering works correctly
  • Binary.from_jsonl() handles both dependencies.jsonl and hook output
  • Binary OUTPUT_DIR structure: data/machines/{machine_id}/binaries/{name}/{id}/