# Hook & State Machine Cleanup - Unified Pattern

## Goal

Implement a **consistent pattern** across all models (Crawl, Snapshot, ArchiveResult, Dependency) for:

1. Running hooks
2. Processing JSONL records
3. Managing background hooks
4. State transitions

## Current State Analysis (ALL COMPLETE ✅)

### ✅ Crawl (archivebox/crawls/)

**Status**: COMPLETE

- ✅ Has state machine: `CrawlMachine`
- ✅ `Crawl.run()` - runs hooks, processes JSONL via `process_hook_records()`, creates snapshots
- ✅ `Crawl.cleanup()` - kills background hooks, runs `on_CrawlEnd` hooks
- ✅ Uses `OUTPUT_DIR/plugin_name/` as the hook PWD
- ✅ State machine calls model methods:
  - `queued -> started`: calls `crawl.run()`
  - `started -> sealed`: calls `crawl.cleanup()`

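For reference, an `on_Crawl` hook's stdout might contain JSONL records like these (field names follow the `Snapshot.from_jsonl()` example later in this doc; the URLs and tag name are made up for illustration):

```jsonl
{"type": "Snapshot", "url": "https://example.com/page1", "depth": 1}
{"type": "Tag", "name": "example-tag"}
```
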
### ✅ Snapshot (archivebox/core/)

**Status**: COMPLETE

- ✅ Has state machine: `SnapshotMachine`
- ✅ `Snapshot.run()` - creates pending ArchiveResults
- ✅ `Snapshot.cleanup()` - kills background ArchiveResult hooks, calls `update_from_output()`
- ✅ `Snapshot.has_running_background_hooks()` - checks PID files using `process_is_alive()`
- ✅ `Snapshot.from_jsonl()` - simplified; filtering moved to the caller
- ✅ State machine calls model methods:
  - `queued -> started`: calls `snapshot.run()`
  - `started -> sealed`: calls `snapshot.cleanup()`
  - `is_finished()`: uses `has_running_background_hooks()`

### ✅ ArchiveResult (archivebox/core/)

**Status**: COMPLETE - Major refactor completed

- ✅ Has state machine: `ArchiveResultMachine`
- ✅ `ArchiveResult.run()` - runs the hook, calls `update_from_output()` for foreground hooks
- ✅ `ArchiveResult.update_from_output()` - unified method for foreground and background hooks
- ✅ Uses PWD `snapshot.OUTPUT_DIR/plugin_name`
- ✅ JSONL processing via `process_hook_records()` with URL/depth filtering
- ✅ **DELETED** special background hook methods:
  - ❌ `check_background_completed()` - replaced by the `process_is_alive()` helper
  - ❌ `finalize_background_hook()` - replaced by `update_from_output()`
  - ❌ `_populate_output_fields()` - merged into `update_from_output()`
- ✅ State machine transitions:
  - `queued -> started`: calls `archiveresult.run()`
  - `started -> succeeded/failed/skipped`: status set by `update_from_output()`

### ✅ Binary (archivebox/machine/) - NEW!

**Status**: COMPLETE - Replaced the Dependency model entirely

- ✅ Has state machine: `BinaryMachine`
- ✅ `Binary.run()` - runs `on_Binary__install_*` hooks, processes JSONL
- ✅ `Binary.cleanup()` - kills background installation hooks (for consistency)
- ✅ `Binary.from_jsonl()` - handles both binaries.jsonl and hook output
- ✅ Uses PWD `data/machines/{machine_id}/binaries/{name}/{id}/plugin_name/`
- ✅ Configuration via static `plugins/*/binaries.jsonl` files
- ✅ State machine calls model methods:
  - `queued -> started`: calls `binary.run()`
  - `started -> succeeded/failed`: status set by hooks via JSONL
- ✅ Perfect symmetry with the Crawl/Snapshot/ArchiveResult pattern

### ❌ Dependency Model - ELIMINATED

**Status**: Deleted entirely (replaced by the Binary state machine)

- Static configuration now lives in `plugins/*/binaries.jsonl`
- Per-machine state is tracked by Binary records
- No global singleton conflicts
- Hooks renamed from `on_Dependency__install_*` to `on_Binary__install_*`

## Unified Pattern (Target Architecture)

### Pattern for ALL models:

```python
# 1. State machine orchestrates transitions
class ModelMachine(StateMachine):
    @started.enter
    def enter_started(self):
        self.model.run()  # Do the work
        # Update status

    def is_finished(self):
        # Check if background hooks are still running
        if self.model.has_running_background_hooks():
            return False
        # Check if children are finished
        if self.model.has_pending_children():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
        # Update status


# 2. Model methods do the actual work
class Model:
    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('ModelName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)

            if result is None:  # Background hook
                continue

            # Process JSONL records
            records = result.get('records', [])
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(records, overrides=overrides)

        # Create children (e.g., ArchiveResults, Snapshots)
        self.create_children()

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        # Kill any background hooks
        if self.OUTPUT_DIR.exists():
            for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
                kill_process(pid_file)

        # Run cleanup hooks (e.g., on_ModelEnd)
        cleanup_hooks = discover_hooks('ModelEnd')
        for hook in cleanup_hooks:
            run_hook(hook, ...)

    def has_running_background_hooks(self) -> bool:
        """Check if any background hooks are still running."""
        if not self.OUTPUT_DIR.exists():
            return False
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            if process_is_alive(pid_file):
                return True
        return False
```

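The `*/hook.pid` convention above assumes each background hook leaves its PID in its output dir so `cleanup()` and `has_running_background_hooks()` can find it later. A minimal sketch of that contract from the hook's side (the real mechanism may live in `run_hook()` instead):

```python
# Minimal sketch of the assumed background-hook PID contract (not the actual implementation)
import os
from pathlib import Path

output_dir = Path.cwd()  # hooks run with PWD = model.OUTPUT_DIR/plugin_name/
(output_dir / 'hook.pid').write_text(str(os.getpid()))
# ... long-running work; JSONL results go to stdout (captured as stdout.log) ...
```
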
### PWD Standard:
```
model.OUTPUT_DIR/plugin_name/
```
- Crawl: `users/{user}/crawls/{date}/{crawl_id}/plugin_name/`
- Snapshot: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/`
- ArchiveResult: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/` (same as Snapshot)
- Dependency: `dependencies/{dependency_id}/plugin_name/` (set the output_dir field directly)

## Implementation Plan

### Phase 1: Add unified helpers to hooks.py ✅ DONE

**File**: `archivebox/hooks.py`

**Status**: COMPLETE - Added three helper functions:

- `process_hook_records(records, overrides)` - lines 1258-1323
- `process_is_alive(pid_file)` - lines 1326-1344
- `kill_process(pid_file, sig)` - lines 1347-1362

```python
import os
from pathlib import Path
from signal import SIGTERM
from typing import Dict, List


def process_hook_records(records: List[Dict], overrides: Dict = None) -> Dict[str, int]:
    """
    Process JSONL records from hook output.
    Dispatches to Model.from_jsonl() for each record type.

    Args:
        records: List of JSONL record dicts from result['records']
        overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

    Returns:
        Dict with counts by record type
    """
    stats = {}
    for record in records:
        record_type = record.get('type')

        # Dispatch to the appropriate model
        if record_type == 'Snapshot':
            from archivebox.core.models import Snapshot
            Snapshot.from_jsonl(record, overrides)
            stats['Snapshot'] = stats.get('Snapshot', 0) + 1
        elif record_type == 'Tag':
            from archivebox.core.models import Tag
            Tag.from_jsonl(record, overrides)
            stats['Tag'] = stats.get('Tag', 0) + 1
        elif record_type == 'Binary':
            from archivebox.machine.models import Binary
            Binary.from_jsonl(record, overrides)
            stats['Binary'] = stats.get('Binary', 0) + 1
        # ... etc
    return stats


def process_is_alive(pid_file: Path) -> bool:
    """Check if the process in the PID file is still running."""
    if not pid_file.exists():
        return False
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, 0)  # Signal 0 = check if the process exists
        return True
    except (OSError, ValueError):
        return False


def kill_process(pid_file: Path, sig=SIGTERM):
    """Kill the process in the PID file."""
    if not pid_file.exists():
        return
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, sig)
    except (OSError, ValueError):
        pass
```

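A hypothetical call site for these helpers inside a model's `run()`, matching the dispatcher's inputs and return value (`hook`, `output_dir`, and `snapshot` come from the surrounding context):

```python
# Hypothetical usage: dispatch a hook's JSONL records and inspect the counts
result = run_hook(hook, output_dir=output_dir)
if result is not None:  # None means the hook went to the background
    stats = process_hook_records(
        result.get('records', []),
        overrides={'snapshot': snapshot, 'created_by_id': snapshot.created_by_id},
    )
    # e.g. stats == {'Snapshot': 2, 'Tag': 1}
```
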
### Phase 2: Add Model.from_jsonl() static methods ✅ DONE

**Files**: `archivebox/core/models.py`, `archivebox/machine/models.py`, `archivebox/crawls/models.py`

**Status**: COMPLETE - Added from_jsonl() to:

- ✅ `Tag.from_jsonl()` - core/models.py lines 93-116
- ✅ `Snapshot.from_jsonl()` - core/models.py lines 1144-1189
- ✅ `Machine.from_jsonl()` - machine/models.py lines 66-89
- ✅ `Dependency.from_jsonl()` - machine/models.py lines 203-227
- ✅ `Binary.from_jsonl()` - machine/models.py lines 401-434

Example implementations added:

```python
from typing import Dict

from django.utils import timezone  # needed for retry_at below


class Snapshot:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Snapshot from a JSONL record."""
        from archivebox.misc.jsonl import get_or_create_snapshot
        overrides = overrides or {}

        # Apply overrides (crawl, parent_snapshot, depth limits)
        crawl = overrides.get('crawl')
        snapshot = overrides.get('snapshot')  # parent

        if crawl:
            depth = record.get('depth', (snapshot.depth + 1 if snapshot else 1))
            if depth > crawl.max_depth:
                return None
            record.setdefault('crawl_id', str(crawl.id))
            record.setdefault('depth', depth)
            if snapshot:
                record.setdefault('parent_snapshot_id', str(snapshot.id))

        created_by_id = overrides.get('created_by_id')
        new_snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
        new_snapshot.status = Snapshot.StatusChoices.QUEUED
        new_snapshot.retry_at = timezone.now()
        new_snapshot.save()
        return new_snapshot


class Tag:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Tag from a JSONL record."""
        from archivebox.misc.jsonl import get_or_create_tag
        tag = get_or_create_tag(record)
        # Auto-attach to snapshot if in overrides
        if overrides and 'snapshot' in overrides:
            overrides['snapshot'].tags.add(tag)
        return tag


class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from a JSONL record."""
        # Implementation similar to existing create_model_record()
        ...


# Etc for other models
```

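A hypothetical end-to-end call, showing how a record emitted by a hook reaches `Snapshot.from_jsonl()` via the dispatcher (the record values are illustrative; `crawl` is an existing Crawl instance):

```python
# Hypothetical example record as an on_Crawl hook might emit it
record = {'type': 'Snapshot', 'url': 'https://example.com/feed', 'depth': 1}

# process_hook_records() routes it here; returns None if depth > crawl.max_depth
new_snapshot = Snapshot.from_jsonl(
    record,
    overrides={'crawl': crawl, 'created_by_id': crawl.created_by_id},
)
```
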
### Phase 3: Update ArchiveResult to use unified pattern ✅ DONE

**File**: `archivebox/core/models.py`

**Status**: COMPLETE

**Changes made**:

1. ✅ **Replaced inline JSONL processing** (lines 1912-1950):
   - Pre-filter Snapshot records for depth/URL constraints in ArchiveResult.run()
   - Use `self._url_passes_filters(url)` with the parent snapshot's config for proper hierarchy
   - Replaced inline Tag/Snapshot/other record creation with `process_hook_records()`
   - Removed ~60 lines of duplicate code

2. ✅ **Simplified Snapshot.from_jsonl()** (lines 1144-1189):
   - Removed depth checking (now done in the caller)
   - Just applies crawl metadata and creates the snapshot
   - Added docstring note: "Filtering should be done by caller BEFORE calling this method"

3. ✅ **Preserved ArchiveResult self-update logic**:
   - Status/output fields still updated from the ArchiveResult JSONL record (lines 1856-1910)
   - Special title extractor logic preserved (line 1952+)
   - Search indexing trigger preserved (line 1957+)

4. ✅ **Key insight**: Filtering happens in ArchiveResult.run() where we have parent snapshot context, NOT in from_jsonl() where we'd lose the config hierarchy (a sketch of this pre-filtering follows below)

**Note**: Did NOT delete the special background hook methods (`check_background_completed`, `finalize_background_hook`) - that's Phase 6

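A sketch of the pre-filtering described in change 1, under the assumption that it runs over `result['records']` before dispatch (the actual code at lines 1912-1950 may differ):

```python
# Sketch: drop out-of-scope Snapshot records before dispatching the rest
filtered = []
for record in result.get('records', []):
    if record.get('type') == 'Snapshot':
        depth = record.get('depth', self.snapshot.depth + 1)
        if depth > self.snapshot.crawl.max_depth:
            continue  # too deep for this crawl
        if not self._url_passes_filters(record.get('url', '')):
            continue  # blocked by the parent snapshot's URL filters
    filtered.append(record)

process_hook_records(filtered, overrides={'snapshot': self.snapshot})
```
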
### Phase 4: Add Snapshot.cleanup() method ✅ DONE

**File**: `archivebox/core/models.py`

**Status**: COMPLETE

**Changes made**:

1. ✅ **Added Snapshot.cleanup()** (lines 1144-1175):
   - Kills background ArchiveResult hooks by scanning for `*/hook.pid` files
   - Finalizes background ArchiveResults using `finalize_background_hook()` (temporary until Phase 6)
   - Called by the state machine when entering the sealed state

2. ✅ **Added Snapshot.has_running_background_hooks()** (lines 1177-1195):
   - Checks if any background hooks are still running using `process_is_alive()`
   - Used by the state machine in its `is_finished()` check

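A sketch of the resulting method in its final (post-Phase 6) form, per the bullets above (`archiveresult_set` is the assumed Django reverse-FK name, not confirmed here):

```python
# Sketch only: Snapshot.cleanup() as described in Phases 4 and 6
def cleanup(self):
    # Kill any still-running background ArchiveResult hooks
    if self.OUTPUT_DIR.exists():
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)

    # Finalize background ArchiveResults from whatever they wrote to disk
    for ar in self.archiveresult_set.filter(status='started'):
        ar.update_from_output()
```
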
### Phase 5: Update SnapshotMachine to use cleanup() ✅ DONE

**File**: `archivebox/core/statemachines.py`

**Status**: COMPLETE

**Changes made**:

1. ✅ **Simplified is_finished()** (lines 58-72):
   - Removed inline background hook checking and finalization (lines 67-76 deleted)
   - Now uses `self.snapshot.has_running_background_hooks()` (line 68)
   - Removed ~12 lines of duplicate logic

2. ✅ **Added cleanup() to sealed.enter** (lines 102-111):
   - Calls `self.snapshot.cleanup()` to kill background hooks (line 105)
   - Follows the unified pattern: cleanup happens on seal, not in is_finished()

### Phase 6: Add ArchiveResult.update_from_output() and simplify run() ✅ DONE

**File**: `archivebox/core/models.py`

**Status**: COMPLETE - The BIG refactor (removed ~200 lines of duplication)

**Changes made**:

1. ✅ **Added `ArchiveResult.update_from_output()`** (lines 1908-2061):
   - Unified method for both foreground and background hooks
   - Reads stdout.log and parses JSONL records
   - Updates status/output_str/output_json from the ArchiveResult JSONL record
   - Walks the filesystem to populate output_files/output_size/output_mimetypes
   - Filters Snapshot records for depth/URL constraints (same as run())
   - Processes side-effect records via `process_hook_records()`
   - Updates the snapshot title if run by a title extractor
   - Triggers search indexing if succeeded
   - Cleans up PID files and empty logs
   - ~160 lines of comprehensive logic

2. ✅ **Simplified `ArchiveResult.run()`** (lines 1841-1906):
   - Removed ~120 lines of duplicate filesystem reading logic
   - Now just sets start_ts/pwd and calls `update_from_output()`
   - Background hooks: return immediately after saving status=STARTED
   - Foreground hooks: call `update_from_output()` to do all the work
   - Removed ~10 lines of duplicate code

3. ✅ **Updated `Snapshot.cleanup()`** (line 1172):
   - Changed from `ar.finalize_background_hook()` to `ar.update_from_output()`
   - Uses the unified method instead of the old special-case method

4. ✅ **Deleted `_populate_output_fields()`** (was ~45 lines):
   - Logic merged into `update_from_output()`
   - Eliminates duplication of filesystem walking code

5. ✅ **Deleted `check_background_completed()`** (was ~20 lines):
   - Replaced by `process_is_alive(pid_file)` from hooks.py
   - Generic helper used by Snapshot.has_running_background_hooks()

6. ✅ **Deleted `finalize_background_hook()`** (was ~85 lines):
   - Completely replaced by `update_from_output()`
   - Was a duplicate of the foreground hook finalization logic

**Total lines removed**: ~280 lines of duplicate code
**Total lines added**: ~160 lines of unified code
**Net reduction**: ~120 lines (-43%)

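As a concrete illustration of the "reads stdout.log and parses JSONL records" step above, here is a minimal standalone parser of the kind `update_from_output()` needs (the real parsing lives elsewhere in ArchiveBox; this is an assumed sketch):

```python
import json
from pathlib import Path
from typing import Dict, List

def read_jsonl_records(log_file: Path) -> List[Dict]:
    """Parse JSONL records out of a hook's stdout log, skipping non-JSON lines."""
    records = []
    if not log_file.exists():
        return records
    for line in log_file.read_text().splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # hooks may interleave plain log lines with JSONL
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and 'type' in record:
            records.append(record)
    return records
```
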
### Phase 7-8: Dependency State Machine ❌ NOT NEEDED

**Status**: Intentionally skipped - Dependency doesn't need a state machine

**Why no state machine for Dependency?**

1. **Wrong Granularity**: Dependency is a GLOBAL singleton (one record per binary name)
   - Multiple machines would race to update the same `status`/`retry_at` fields
   - No clear semantics: "started" on which machine? "failed" on Machine A but "succeeded" on Machine B?

2. **Wrong Timing**: Installation should be SYNCHRONOUS, not queued
   - When a worker needs wget, it should install wget NOW, not queue it for later
   - No benefit to async state machine transitions

3. **State Lives Elsewhere**: Binary records are the actual state
   - Each machine has its own Binary records (one per machine per binary)
   - The Binary.machine FK provides proper per-machine state tracking

**Correct Architecture:**
```
Dependency (global, no state machine):
├─ Configuration: bin_name, bin_providers, overrides
├─ run() method: synchronous installation attempt
└─ NO status, NO retry_at, NO state_machine_name

Binary (per-machine, has machine FK):
├─ State: is this binary installed on this specific machine?
├─ Created via JSONL output from on_Dependency hooks
└─ unique_together = (machine, name, abspath, version, sha256)
```

**What was implemented:**

- ✅ **Refactored `Dependency.run()`** (lines 249-324):
  - Uses `discover_hooks()` and `process_hook_records()` for consistency
  - Added a comprehensive docstring explaining why there is no state machine
  - Synchronous execution: returns a Binary or None immediately
  - Uses the unified JSONL processing pattern
- ✅ **Kept Dependency simple**: just configuration fields, no state fields
- ✅ **Multi-machine support**: each machine independently runs Dependency.run() and creates its own Binary

## Summary of Changes

### Progress: 6/6 Core Phases Complete ✅ + 2 Phases Skipped (Intentionally)

**ALL core functionality is now complete!** The unified pattern is consistently implemented across Crawl, Snapshot, and ArchiveResult. Dependency is intentionally kept simple (no state machine needed).

### Files Modified:

1. ✅ **DONE** `archivebox/hooks.py` - Added unified helpers:
   - ✅ `process_hook_records(records, overrides)` - dispatcher (lines 1258-1323)
   - ✅ `process_is_alive(pid_file)` - check if a PID is still running (lines 1326-1344)
   - ✅ `kill_process(pid_file)` - kill a process (lines 1347-1362)

2. ✅ **DONE** `archivebox/crawls/models.py` - Already updated:
   - ✅ `Crawl.run()` - runs hooks, processes JSONL, creates snapshots
   - ✅ `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd

3. ✅ **DONE** `archivebox/core/models.py`:
   - ✅ `Tag.from_jsonl()` - lines 93-116
   - ✅ `Snapshot.from_jsonl()` - lines 1197-1234 (simplified, removed filtering)
   - ✅ `Snapshot.cleanup()` - lines 1144-1172 (kills background hooks, calls ar.update_from_output())
   - ✅ `Snapshot.has_running_background_hooks()` - lines 1174-1193 (checks PIDs)
   - ✅ `ArchiveResult.run()` - simplified, uses `update_from_output()` (lines 1841-1906)
   - ✅ `ArchiveResult.update_from_output()` - unified filesystem reading (lines 1908-2061)
   - ✅ **DELETED** `ArchiveResult.check_background_completed()` - replaced by `process_is_alive()`
   - ✅ **DELETED** `ArchiveResult.finalize_background_hook()` - replaced by `update_from_output()`
   - ✅ **DELETED** `ArchiveResult._populate_output_fields()` - merged into `update_from_output()`

4. ✅ **DONE** `archivebox/core/statemachines.py`:
   - ✅ Simplified `SnapshotMachine.is_finished()` - uses `has_running_background_hooks()` (line 68)
   - ✅ Added a cleanup call to `SnapshotMachine.sealed.enter` (line 105)

5. ✅ **DONE** `archivebox/machine/models.py`:
   - ✅ `Machine.from_jsonl()` - lines 66-89
   - ✅ `Dependency.from_jsonl()` - lines 203-227
   - ✅ `Binary.from_jsonl()` - lines 401-434
   - ✅ Refactored `Dependency.run()` to use the unified pattern (lines 249-324)
   - ✅ Added a comprehensive docstring explaining why Dependency doesn't need a state machine
   - ✅ Kept Dependency simple: no state fields, synchronous execution only

### Code Metrics:

- **Lines removed**: ~280 lines of duplicate code
- **Lines added**: ~160 lines of unified code
- **Net reduction**: ~120 lines total (-43%)
- **Files created**: 0 (no new files needed)

### Key Benefits:

1. **Consistency**: All stateful models (Crawl, Snapshot, ArchiveResult) follow the same unified state machine pattern
2. **Simplicity**: Eliminated special-case background hook handling (~280 lines of duplicate code)
3. **Correctness**: Background hooks are properly cleaned up on the seal transition
4. **Maintainability**: Unified `process_hook_records()` dispatcher for all JSONL processing
5. **Testability**: A consistent pattern makes testing easier
6. **Clear Separation**: Stateful work items (Crawl/Snapshot/ArchiveResult) vs stateless config (Dependency)
7. **Multi-Machine Support**: Dependency remains simple synchronous config; Binary tracks per-machine state

## Final Unified Pattern

All models now follow this consistent architecture:

### State Machine Structure
```python
class ModelMachine(StateMachine):
    queued = State(initial=True)
    started = State()
    sealed = State(final=True)  # or succeeded/failed for leaf models

    @started.enter
    def enter_started(self):
        self.model.run()  # Execute the work

    @sealed.enter  # or @succeeded.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
```

### Model Methods
```python
class Model:
    # State machine fields
    status = CharField(default='queued')
    retry_at = DateTimeField(default=timezone.now)
    output_dir = CharField(default='', blank=True)
    state_machine_name = 'app.statemachines.ModelMachine'

    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('EventName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)

            if result is None:  # Background hook
                continue

            # Process JSONL records
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(result['records'], overrides=overrides)

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)
        # Update children from the filesystem
        for child in self.children:  # e.g. snapshot's ArchiveResults
            child.update_from_output()

    def update_and_requeue(self, **fields):
        """Update fields and bump modified_at."""
        for field, value in fields.items():
            setattr(self, field, value)
        self.save(update_fields=[*fields.keys(), 'modified_at'])

    @staticmethod
    def from_jsonl(record: dict, overrides: dict = None):
        """Create/update model from a JSONL record."""
        # Implementation specific to each model
        # Called by process_hook_records()
```

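For example, a worker might use `update_and_requeue()` like this (the field names come from the block above; the 60-second retry delay is an arbitrary illustration):

```python
from datetime import timedelta
from django.utils import timezone

# Hypothetical worker usage: mark as started and schedule the next status check
snapshot.update_and_requeue(
    status='started',
    retry_at=timezone.now() + timedelta(seconds=60),
)
```
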
### Hook Processing Flow
```
1. Model.run() discovers hooks
2. Hooks execute and output JSONL to stdout
3. JSONL records are dispatched via process_hook_records()
4. Each record type is handled by Model.from_jsonl()
5. Background hooks are tracked via hook.pid files
6. Model.cleanup() kills background hooks on seal
7. Children are updated via update_from_output()
```

### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): no machine FK, any worker can claim them
- **Resources** (Binary): machine FK, one per machine per binary
- **Configuration** (Dependency): no machine FK, global singleton, synchronous execution
- **Execution Tracking** (ArchiveResult.iface): FK to NetworkInterface for observability

## Testing Checklist

- [ ] Test Crawl → Snapshot creation with hooks
- [ ] Test Snapshot → ArchiveResult creation
- [ ] Test ArchiveResult foreground hooks (JSONL processing)
- [ ] Test ArchiveResult background hooks (PID tracking, cleanup)
- [ ] Test Dependency.run() synchronous installation
- [ ] Test background hook cleanup on the seal transition
- [ ] Test multi-machine Crawl execution
- [ ] Test Binary creation per machine (one per machine per binary)
- [ ] Verify Dependency.run() can be called concurrently from multiple machines safely

## FINAL ARCHITECTURE (Phases 1-8 Complete)

### ✅ Phases 1-6: Core Models Unified

All core models (Crawl, Snapshot, ArchiveResult) now follow the unified pattern:
- State machines orchestrate transitions
- `.run()` methods execute hooks and process JSONL
- `.cleanup()` methods kill background hooks
- `.update_and_requeue()` methods update state for worker coordination
- Consistent use of `process_hook_records()` for JSONL dispatching

### ✅ Phases 7-8: Binary State Machine (Dependency Model Eliminated)

**Key Decision**: Eliminated the `Dependency` model entirely and made `Binary` the state machine.

#### New Architecture
- **Static Configuration**: `plugins/{plugin}/dependencies.jsonl` files define binary requirements

```jsonl
{"type": "Binary", "name": "yt-dlp", "bin_providers": "pip,brew,apt,env"}
{"type": "Binary", "name": "node", "bin_providers": "apt,brew,env", "overrides": {"apt": {"packages": ["nodejs"]}}}
{"type": "Binary", "name": "ffmpeg", "bin_providers": "apt,brew,env"}
```

- **Dynamic State**: the `Binary` model tracks per-machine installation state (a model sketch follows below)
  - Fields: `machine`, `name`, `bin_providers`, `overrides`, `abspath`, `version`, `sha256`, `binprovider`
  - State machine: `queued → started → succeeded/failed`
  - Output dir: `data/machines/{machine_id}/binaries/{binary_name}/{binary_id}/`

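A hedged Django sketch of the Binary model implied by the fields above (the field types are assumptions; the `unique_together` tuple comes from the Correct Architecture diagram earlier):

```python
from django.db import models

class Binary(models.Model):
    # Per-machine installation state: one row per machine per binary
    machine = models.ForeignKey('machine.Machine', on_delete=models.CASCADE)
    name = models.CharField(max_length=63)
    bin_providers = models.CharField(max_length=255)  # e.g. "pip,brew,apt,env"
    overrides = models.JSONField(default=dict)
    abspath = models.CharField(max_length=255, blank=True)
    version = models.CharField(max_length=32, blank=True)
    sha256 = models.CharField(max_length=64, blank=True)
    binprovider = models.CharField(max_length=32, blank=True)
    status = models.CharField(max_length=15, default='queued')
    retry_at = models.DateTimeField(null=True, blank=True)

    class Meta:
        unique_together = ('machine', 'name', 'abspath', 'version', 'sha256')
```
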
#### Binary State Machine Flow
```python
class BinaryMachine(StateMachine):
    # queued → started → succeeded/failed

    @started.enter
    def enter_started(self):
        self.binary.run()  # Runs on_Binary__install_* hooks


class Binary(models.Model):
    def run(self):
        """
        Runs ALL on_Binary__install_* hooks.
        Each hook checks bin_providers and decides if it can handle this binary.
        First hook to succeed wins.
        Outputs JSONL with abspath, version, sha256, binprovider.
        """
        hooks = discover_hooks('Binary')
        for hook in hooks:
            result = run_hook(hook, output_dir=self.OUTPUT_DIR / hook.parent.name,
                              binary_id=self.id, machine_id=self.machine_id,
                              name=self.name, bin_providers=self.bin_providers,
                              overrides=json.dumps(self.overrides))

        # Hook outputs: {"type": "Binary", "name": "wget", "abspath": "/usr/bin/wget", "version": "1.21", "binprovider": "apt"}
        # Binary.from_jsonl() updates self with the installation results
```

#### Hook Naming Convention
- **Before**: `on_Dependency__install_using_pip_provider.py`
- **After**: `on_Binary__install_using_pip_provider.py`

Each hook checks the `--bin-providers` CLI argument:
```python
if 'pip' not in bin_providers.split(','):
    sys.exit(0)  # Skip this binary
```

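Putting it together, a minimal hypothetical `on_Binary__install_using_pip_provider.py` skeleton (the argument names mirror the `run_hook()` kwargs above; the real hooks likely differ):

```python
#!/usr/bin/env python3
# Hypothetical skeleton of an on_Binary__install_* hook (argument parsing assumed)
import argparse
import json
import shutil
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--name', required=True)
parser.add_argument('--bin-providers', default='')
args = parser.parse_args()

if 'pip' not in args.bin_providers.split(','):
    sys.exit(0)  # this provider can't handle the binary; let another hook try

# ... attempt the actual `pip install` here ...

abspath = shutil.which(args.name)
if abspath:
    # Emit a Binary JSONL record for Binary.from_jsonl() to apply
    print(json.dumps({'type': 'Binary', 'name': args.name,
                      'abspath': abspath, 'binprovider': 'pip'}))
```
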
#### Perfect Symmetry Achieved
All models now follow identical patterns:
```
Crawl(queued)         → CrawlMachine         → Crawl.run()         → sealed
Snapshot(queued)      → SnapshotMachine      → Snapshot.run()      → sealed
ArchiveResult(queued) → ArchiveResultMachine → ArchiveResult.run() → succeeded/failed
Binary(queued)        → BinaryMachine        → Binary.run()        → succeeded/failed
```

#### Benefits of Eliminating Dependency
1. **No global singleton conflicts**: Binary is per-machine, so there are no race conditions
2. **Simpler data model**: one table instead of two (Dependency + InstalledBinary)
3. **Static configuration**: dependencies.jsonl lives in version control, not the database
4. **Consistent state machine**: Binary follows the same pattern as the other models
5. **Cleaner hooks**: hooks check bin_providers themselves instead of the orchestrator parsing hook names

#### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): no machine FK, any worker can claim them
- **Resources** (Binary): machine FK, one per machine per binary name
- **Configuration**: static files in `plugins/*/dependencies.jsonl`
- **Execution Tracking**: ArchiveResult.iface FK to NetworkInterface for observability

### Testing Checklist (Updated)
- [x] Core models use the unified hook pattern (Phases 1-6)
- [ ] Binary installation via state machine
- [ ] Multiple machines can install the same binary independently
- [ ] Hook bin_providers filtering works correctly
- [ ] Binary.from_jsonl() handles both dependencies.jsonl and hook output
- [ ] Binary OUTPUT_DIR structure: data/machines/{machine_id}/binaries/{name}/{id}/