Hook & State Machine Cleanup - Unified Pattern

Goal

Implement a consistent pattern across all models (Crawl, Snapshot, ArchiveResult, Dependency) for:

  1. Running hooks
  2. Processing JSONL records
  3. Managing background hooks
  4. State transitions

Current State Analysis (ALL COMPLETE)

Crawl (archivebox/crawls/)

Status: COMPLETE

  • Has state machine: CrawlMachine
  • Crawl.run() - runs hooks, processes JSONL via process_hook_records(), creates snapshots
  • Crawl.cleanup() - kills background hooks, runs on_CrawlEnd hooks
  • Uses OUTPUT_DIR/plugin_name/ for PWD
  • State machine calls model methods:
    • queued -> started: calls crawl.run()
    • started -> sealed: calls crawl.cleanup()

Snapshot (archivebox/core/)

Status: COMPLETE

  • Has state machine: SnapshotMachine
  • Snapshot.run() - creates pending ArchiveResults
  • Snapshot.cleanup() - kills background ArchiveResult hooks, calls update_from_output()
  • Snapshot.has_running_background_hooks() - checks PID files using process_is_alive()
  • Snapshot.from_jsonl() - simplified, filtering moved to caller
  • State machine calls model methods:
    • queued -> started: calls snapshot.run()
    • started -> sealed: calls snapshot.cleanup()
    • is_finished(): uses has_running_background_hooks()

ArchiveResult (archivebox/core/)

Status: COMPLETE - Major refactor completed

  • Has state machine: ArchiveResultMachine
  • ArchiveResult.run() - runs hook, calls update_from_output() for foreground hooks
  • ArchiveResult.update_from_output() - unified method for foreground and background hooks
  • Uses PWD snapshot.OUTPUT_DIR/plugin_name
  • JSONL processing via process_hook_records() with URL/depth filtering
  • DELETED special background hook methods:
    • check_background_completed() - replaced by process_is_alive() helper
    • finalize_background_hook() - replaced by update_from_output()
    • _populate_output_fields() - merged into update_from_output()
  • State machine transitions:
    • queued -> started: calls archiveresult.run()
    • started -> succeeded/failed/skipped: status set by update_from_output()

Binary (archivebox/machine/) - NEW!

Status: COMPLETE - Replaced Dependency model entirely

  • Has state machine: BinaryMachine
  • Binary.run() - runs on_Binary__install_* hooks, processes JSONL
  • Binary.cleanup() - kills background installation hooks (for consistency)
  • Binary.from_jsonl() - handles both binaries.jsonl and hook output
  • Uses PWD data/machines/{machine_id}/binaries/{name}/{id}/plugin_name/
  • Configuration via static plugins/*/binaries.jsonl files
  • State machine calls model methods:
    • queued -> started: calls binary.run()
    • started -> succeeded/failed: status set by hooks via JSONL
  • Perfect symmetry with Crawl/Snapshot/ArchiveResult pattern

Dependency Model - ELIMINATED

Status: Deleted entirely (replaced by Binary state machine)

  • Static configuration now lives in plugins/*/binaries.jsonl
  • Per-machine state tracked by Binary records
  • No global singleton conflicts
  • Hooks renamed from on_Dependency__install_* to on_Binary__install_*

Unified Pattern (Target Architecture)

Pattern for ALL models:

# 1. State Machine orchestrates transitions
class ModelMachine(StateMachine):
    @started.enter
    def enter_started(self):
        self.model.run()  # Do the work
        # Update status

    def is_finished(self):
        # Check if background hooks still running
        if self.model.has_running_background_hooks():
            return False
        # Check if children finished
        if self.model.has_pending_children():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
        # Update status

# 2. Model methods do the actual work
class Model:
    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('ModelName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)

            if result is None:  # Background hook
                continue

            # Process JSONL records
            records = result.get('records', [])
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(records, overrides=overrides)

        # Create children (e.g., ArchiveResults, Snapshots)
        self.create_children()

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        # Kill any background hooks
        if self.OUTPUT_DIR.exists():
            for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
                kill_process(pid_file)

        # Run cleanup hooks (e.g., on_ModelEnd)
        cleanup_hooks = discover_hooks('ModelEnd')
        for hook in cleanup_hooks:
            run_hook(hook, ...)

    def has_running_background_hooks(self) -> bool:
        """Check if any background hooks still running."""
        if not self.OUTPUT_DIR.exists():
            return False
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            if process_is_alive(pid_file):
                return True
        return False

PWD Standard:

model.OUTPUT_DIR/plugin_name/
  • Crawl: users/{user}/crawls/{date}/{crawl_id}/plugin_name/
  • Snapshot: users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/
  • ArchiveResult: users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/ (same as Snapshot)
  • Dependency: dependencies/{dependency_id}/plugin_name/ (set output_dir field directly)
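
As a purely hypothetical illustration of the layout above, a Snapshot.OUTPUT_DIR property might be computed like this (DATA_DIR, created_by, bookmarked_at, and domain are all assumed names, not confirmed fields):

# Hypothetical OUTPUT_DIR derivation matching the PWD standard above:
from pathlib import Path

class Snapshot:
    @property
    def OUTPUT_DIR(self) -> Path:
        # users/{user}/snapshots/{date}/{domain}/{snapshot_id}/
        return (
            DATA_DIR / 'users' / self.created_by.username / 'snapshots'
            / self.bookmarked_at.strftime('%Y%m%d') / self.domain / str(self.id)
        )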

Implementation Plan

Phase 1: Add unified helpers to hooks.py DONE

File: archivebox/hooks.py

Status: COMPLETE - Added three helper functions:

  • process_hook_records(records, overrides) - lines 1258-1323
  • process_is_alive(pid_file) - lines 1326-1344
  • kill_process(pid_file, sig) - lines 1347-1362

import os
from pathlib import Path
from signal import SIGTERM
from typing import Dict, List

def process_hook_records(records: List[Dict], overrides: Dict = None) -> Dict[str, int]:
    """
    Process JSONL records from hook output.
    Dispatches to Model.from_jsonl() for each record type.

    Args:
        records: List of JSONL record dicts from result['records']
        overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

    Returns:
        Dict with counts by record type
    """
    stats = {}
    for record in records:
        record_type = record.get('type')

        # Dispatch to appropriate model
        if record_type == 'Snapshot':
            from archivebox.core.models import Snapshot
            Snapshot.from_jsonl(record, overrides)
            stats['Snapshot'] = stats.get('Snapshot', 0) + 1
        elif record_type == 'Tag':
            from archivebox.core.models import Tag
            Tag.from_jsonl(record, overrides)
            stats['Tag'] = stats.get('Tag', 0) + 1
        elif record_type == 'Binary':
            from archivebox.machine.models import Binary
            Binary.from_jsonl(record, overrides)
            stats['Binary'] = stats.get('Binary', 0) + 1
        # ... etc
    return stats

def process_is_alive(pid_file: Path) -> bool:
    """Check if process in PID file is still running."""
    if not pid_file.exists():
        return False
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, 0)  # Signal 0 = check if exists
        return True
    except (OSError, ValueError):
        return False

def kill_process(pid_file: Path, sig=SIGTERM):
    """Kill process in PID file."""
    if not pid_file.exists():
        return
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, sig)
    except (OSError, ValueError):
        pass
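
Roughly how the three helpers compose inside a model's run()/polling logic (a sketch; hook, output_dir, snapshot, sealing, and the finalize step are stand-in names, not real code):

# Sketch of the three helpers composed together (stand-in variable names):
result = run_hook(hook, output_dir=output_dir)
if result is not None:
    # Foreground hook: dispatch its JSONL records immediately
    stats = process_hook_records(result.get('records', []),
                                 overrides={'snapshot': snapshot,
                                            'created_by_id': snapshot.created_by_id})
    # stats maps record type -> count, e.g. {'Snapshot': 3, 'Tag': 2}
else:
    # Background hook: it left a hook.pid behind in its output dir
    pid_file = output_dir / 'hook.pid'
    if not process_is_alive(pid_file):
        pass  # exited on its own: finalize via update_from_output()
    elif sealing:
        kill_process(pid_file)  # sends SIGTERM by default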

Phase 2: Add Model.from_jsonl() static methods DONE

Files: archivebox/core/models.py, archivebox/machine/models.py, archivebox/crawls/models.py

Status: COMPLETE - Added from_jsonl() to:

  • Tag.from_jsonl() - core/models.py lines 93-116
  • Snapshot.from_jsonl() - core/models.py lines 1144-1189
  • Machine.from_jsonl() - machine/models.py lines 66-89
  • Dependency.from_jsonl() - machine/models.py lines 203-227
  • Binary.from_jsonl() - machine/models.py lines 401-434

Example implementations added:

class Snapshot:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Snapshot from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_snapshot
        from django.utils import timezone
        overrides = overrides or {}

        # Apply overrides (crawl, parent_snapshot, depth limits)
        crawl = overrides.get('crawl')
        snapshot = overrides.get('snapshot')  # parent

        if crawl:
            depth = record.get('depth', (snapshot.depth + 1 if snapshot else 1))
            if depth > crawl.max_depth:
                return None
            record.setdefault('crawl_id', str(crawl.id))
            record.setdefault('depth', depth)
            if snapshot:
                record.setdefault('parent_snapshot_id', str(snapshot.id))

        created_by_id = overrides.get('created_by_id')
        new_snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
        new_snapshot.status = Snapshot.StatusChoices.QUEUED
        new_snapshot.retry_at = timezone.now()
        new_snapshot.save()
        return new_snapshot

class Tag:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Tag from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_tag
        tag = get_or_create_tag(record)
        # Auto-attach to snapshot if in overrides
        if overrides and 'snapshot' in overrides:
            overrides['snapshot'].tags.add(tag)
        return tag

class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        # Implementation similar to existing create_model_record()
        ...

# Etc for other models
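
For illustration, a fleshed-out Binary.from_jsonl() could look like the sketch below (hypothetical: the lookup fields mirror the unique_together constraint noted later in this document, and get_current_machine() is an assumed helper):

# Hypothetical Binary.from_jsonl() implementation sketch:
class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        overrides = overrides or {}
        machine = overrides.get('machine') or get_current_machine()
        binary, _created = Binary.objects.update_or_create(
            machine=machine,
            name=record['name'],
            abspath=record.get('abspath', ''),
            version=record.get('version', ''),
            sha256=record.get('sha256', ''),
            defaults={'binprovider': record.get('binprovider', '')},
        )
        return binary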

Phase 3: Update ArchiveResult to use unified pattern DONE

File: archivebox/core/models.py

Status: COMPLETE

Changes made:

  1. Replaced inline JSONL processing (lines 1912-1950):

    • Pre-filter Snapshot records for depth/URL constraints in ArchiveResult.run()
    • Use self._url_passes_filters(url) with parent snapshot's config for proper hierarchy
    • Replaced inline Tag/Snapshot/other record creation with process_hook_records()
    • Removed ~60 lines of duplicate code
  2. Simplified Snapshot.from_jsonl() (lines 1144-1189):

    • Removed depth checking (now done in caller)
    • Just applies crawl metadata and creates snapshot
    • Added docstring note: "Filtering should be done by caller BEFORE calling this method"
  3. Preserved ArchiveResult self-update logic:

    • Status/output fields still updated from ArchiveResult JSONL record (lines 1856-1910)
    • Special title extractor logic preserved (line 1952+)
    • Search indexing trigger preserved (line 1957+)
  4. Key insight: Filtering happens in ArchiveResult.run() where we have parent snapshot context, NOT in from_jsonl() where we'd lose config hierarchy

Note: Did NOT delete special background hook methods (check_background_completed, finalize_background_hook) - that's Phase 6
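
Sketched out, the pre-filtering in ArchiveResult.run() looks roughly like this (variable names are approximations of the changes described above):

# Approximate shape of the pre-filtering inside ArchiveResult.run():
filtered_records = []
for record in result.get('records', []):
    if record.get('type') == 'Snapshot':
        # Depth check uses the parent snapshot's crawl, available here in run()
        depth = record.get('depth', self.snapshot.depth + 1)
        if self.snapshot.crawl and depth > self.snapshot.crawl.max_depth:
            continue
        # URL filtering uses the parent snapshot's config hierarchy
        if not self._url_passes_filters(record.get('url', '')):
            continue
    filtered_records.append(record)

process_hook_records(filtered_records, overrides={
    'snapshot': self.snapshot,
    'crawl': self.snapshot.crawl,
    'created_by_id': self.created_by_id,
})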

Phase 4: Add Snapshot.cleanup() method DONE

File: archivebox/core/models.py

Status: COMPLETE

Changes made:

  1. Added Snapshot.cleanup() (lines 1144-1175):

    • Kills background ArchiveResult hooks by scanning for */hook.pid files
    • Finalizes background ArchiveResults using finalize_background_hook() (temporary until Phase 6)
    • Called by state machine when entering sealed state
  2. Added Snapshot.has_running_background_hooks() (lines 1177-1195):

    • Checks if any background hooks still running using process_is_alive()
    • Used by state machine in is_finished() check
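
In sketch form, the two new methods (the reverse-relation and status names are assumptions):

# Sketch of Snapshot.cleanup() / has_running_background_hooks() per the above:
def cleanup(self):
    """Kill background ArchiveResult hooks, then finalize their records."""
    if self.OUTPUT_DIR.exists():
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)

    # Finalize ArchiveResults that were still running in the background
    for ar in self.archiveresult_set.filter(status=ArchiveResult.StatusChoices.STARTED):
        ar.update_from_output()  # was finalize_background_hook() until Phase 6

def has_running_background_hooks(self) -> bool:
    """True if any */hook.pid in OUTPUT_DIR points at a live process."""
    if not self.OUTPUT_DIR.exists():
        return False
    return any(process_is_alive(p) for p in self.OUTPUT_DIR.glob('*/hook.pid'))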

Phase 5: Update SnapshotMachine to use cleanup() DONE

File: archivebox/core/statemachines.py

Status: COMPLETE

Changes made:

  1. Simplified is_finished() (lines 58-72):

    • Removed inline background hook checking and finalization (lines 67-76 deleted)
    • Now uses self.snapshot.has_running_background_hooks() (line 68)
    • Removed ~12 lines of duplicate logic
  2. Added cleanup() to sealed.enter (lines 102-111):

    • Calls self.snapshot.cleanup() to kill background hooks (line 105)
    • Follows unified pattern: cleanup happens on seal, not in is_finished()
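
Put together, the resulting SnapshotMachine shape looks roughly like this (a sketch; the pending-ArchiveResult check and status values are assumptions):

# Rough shape of SnapshotMachine after Phases 4-5 (details assumed, not verbatim):
class SnapshotMachine(StateMachine):
    def is_finished(self) -> bool:
        # Background hooks still running? Not finished yet.
        if self.snapshot.has_running_background_hooks():
            return False
        # Any ArchiveResults still pending? Not finished yet (assumed check).
        if self.snapshot.pending_archiveresults().exists():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        # Cleanup happens on seal, not inside is_finished()
        self.snapshot.cleanup()
        self.snapshot.update_and_requeue(status='sealed', retry_at=None)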

Phase 6: Add ArchiveResult.update_from_output() and simplify run() DONE

File: archivebox/core/models.py

Status: COMPLETE - The BIG refactor (removed ~200 lines of duplication)

Changes made:

  1. Added ArchiveResult.update_from_output() (lines 1908-2061):

    • Unified method for both foreground and background hooks
    • Reads stdout.log and parses JSONL records
    • Updates status/output_str/output_json from ArchiveResult JSONL record
    • Walks filesystem to populate output_files/output_size/output_mimetypes
    • Filters Snapshot records for depth/URL constraints (same as run())
    • Processes side-effect records via process_hook_records()
    • Updates snapshot title if title extractor
    • Triggers search indexing if succeeded
    • Cleans up PID files and empty logs
    • ~160 lines of comprehensive logic
  2. Simplified ArchiveResult.run() (lines 1841-1906):

    • Removed ~120 lines of duplicate filesystem reading logic
    • Now just sets start_ts/pwd and calls update_from_output()
    • Background hooks: return immediately after saving status=STARTED
    • Foreground hooks: call update_from_output() to do all the work
    • Removed ~10 lines of duplicate code
  3. Updated Snapshot.cleanup() (line 1172):

    • Changed from ar.finalize_background_hook() to ar.update_from_output()
    • Uses the unified method instead of the old special-case method
  4. Deleted _populate_output_fields() (was ~45 lines):

    • Logic merged into update_from_output()
    • Eliminates duplication of filesystem walking code
  5. Deleted check_background_completed() (was ~20 lines):

    • Replaced by process_is_alive(pid_file) from hooks.py
    • Generic helper used by Snapshot.has_running_background_hooks()
  6. Deleted finalize_background_hook() (was ~85 lines):

    • Completely replaced by update_from_output()
    • Was duplicate of foreground hook finalization logic

Total lines removed: ~280 lines of duplicate code. Total lines added: ~160 lines of unified code. Net reduction: ~120 lines (-43%).
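
For orientation, a heavily condensed sketch of what update_from_output() does, per the list above (the real method is ~160 lines; all names below are approximations, not the actual code):

# Condensed, hypothetical sketch of the update_from_output() flow:
import json
from pathlib import Path

def update_from_output(self):
    """Approximation of the unified finalization flow for fg/bg hooks."""
    plugin_dir = Path(self.pwd)

    # 1. Read stdout.log and parse JSONL records
    records = []
    stdout_log = plugin_dir / 'stdout.log'
    if stdout_log.exists():
        for line in stdout_log.read_text().splitlines():
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue

    # 2. Apply this ArchiveResult's own record (status/output_str/output_json)
    for record in records:
        if record.get('type') == 'ArchiveResult':
            self.status = record.get('status', self.status)
            self.output_str = record.get('output_str', self.output_str)

    # 3. Walk the filesystem for output_files/output_size/output_mimetypes (elided)
    # 4. Filter Snapshot records for depth/URL, then dispatch side-effects
    process_hook_records(records, overrides={'snapshot': self.snapshot,
                                             'created_by_id': self.created_by_id})

    # 5. Clean up the PID file and empty logs, then persist
    (plugin_dir / 'hook.pid').unlink(missing_ok=True)
    self.save()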

Phase 7-8: Dependency State Machine NOT NEEDED

Status: Intentionally skipped - Dependency doesn't need a state machine

Why no state machine for Dependency?

  1. Wrong Granularity: Dependency is a GLOBAL singleton (one record per binary name)

    • Multiple machines would race to update the same status/retry_at fields
    • No clear semantics: "started" on which machine? "failed" on Machine A but "succeeded" on Machine B?
  2. Wrong Timing: Installation should be SYNCHRONOUS, not queued

    • When a worker needs wget, it should install wget NOW, not queue it for later
    • No benefit to async state machine transitions
  3. State Lives Elsewhere: Binary records are the actual state

    • Each machine has its own Binary records (one per machine per binary)
    • Binary.machine FK provides proper per-machine state tracking

Correct Architecture:

Dependency (global, no state machine):
  ├─ Configuration: bin_name, bin_providers, overrides
  ├─ run() method: synchronous installation attempt
  └─ NO status, NO retry_at, NO state_machine_name

Binary (per-machine, has machine FK):
  ├─ State: is this binary installed on this specific machine?
  ├─ Created via JSONL output from on_Dependency hooks
  └─ unique_together = (machine, name, abspath, version, sha256)

What was implemented:

  • Refactored Dependency.run() (lines 249-324):
    • Uses discover_hooks() and process_hook_records() for consistency
    • Added comprehensive docstring explaining why no state machine
    • Synchronous execution: returns Binary or None immediately
    • Uses unified JSONL processing pattern
  • Kept Dependency simple: Just configuration fields, no state fields
  • Multi-machine support: Each machine independently runs Dependency.run() and creates its own Binary
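
A hedged sketch of the synchronous Dependency.run() described above (this_machine() and the final Binary lookup are assumed, not the actual code):

# Sketch of synchronous Dependency.run() (assumed helper names):
def run(self) -> 'Binary | None':
    """Try each install hook synchronously; return the new Binary or None."""
    hooks = discover_hooks('Dependency')
    for hook in hooks:
        output_dir = Path(self.output_dir) / hook.parent.name
        result = run_hook(hook, output_dir=output_dir,
                          name=self.bin_name, bin_providers=self.bin_providers)
        if result is None:
            continue  # install hooks run synchronously; None means nothing ran

        stats = process_hook_records(result.get('records', []),
                                     overrides={'dependency': self})
        if stats.get('Binary'):
            # First hook to produce a Binary record wins
            return Binary.objects.filter(machine=this_machine(),
                                         name=self.bin_name).latest('created_at')
    return None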

Summary of Changes

Progress: 6/6 Core Phases Complete + 2 Phases Skipped (Intentionally)

ALL core functionality is now complete! The unified pattern is consistently implemented across Crawl, Snapshot, and ArchiveResult. Dependency intentionally kept simple (no state machine needed).

Files Modified:

  1. DONE archivebox/hooks.py - Add unified helpers:

    • process_hook_records(records, overrides) - dispatcher (lines 1258-1323)
    • process_is_alive(pid_file) - check if PID still running (lines 1326-1344)
    • kill_process(pid_file) - kill process (lines 1347-1362)
  2. DONE archivebox/crawls/models.py - Already updated:

    • Crawl.run() - runs hooks, processes JSONL, creates snapshots
    • Crawl.cleanup() - kills background hooks, runs on_CrawlEnd
  3. DONE archivebox/core/models.py:

    • Tag.from_jsonl() - lines 93-116
    • Snapshot.from_jsonl() - lines 1197-1234 (simplified, removed filtering)
    • Snapshot.cleanup() - lines 1144-1172 (kill background hooks, calls ar.update_from_output())
    • Snapshot.has_running_background_hooks() - lines 1174-1193 (check PIDs)
    • ArchiveResult.run() - simplified, uses update_from_output() (lines 1841-1906)
    • ArchiveResult.update_from_output() - unified filesystem reading (lines 1908-2061)
    • DELETED ArchiveResult.check_background_completed() - replaced by process_is_alive()
    • DELETED ArchiveResult.finalize_background_hook() - replaced by update_from_output()
    • DELETED ArchiveResult._populate_output_fields() - merged into update_from_output()
  4. DONE archivebox/core/statemachines.py:

    • Simplified SnapshotMachine.is_finished() - uses has_running_background_hooks() (line 68)
    • Added cleanup call to SnapshotMachine.sealed.enter (line 105)
  5. DONE archivebox/machine/models.py:

    • Machine.from_jsonl() - lines 66-89
    • Dependency.from_jsonl() - lines 203-227
    • Binary.from_jsonl() - lines 401-434
    • Refactored Dependency.run() to use unified pattern (lines 249-324)
    • Added comprehensive docstring explaining why Dependency doesn't need state machine
    • Kept Dependency simple: no state fields, synchronous execution only

Code Metrics:

  • Lines removed: ~280 lines of duplicate code
  • Lines added: ~160 lines of unified code
  • Net reduction: ~120 lines total (-43%)
  • Files created: 0 (no new files needed)

Key Benefits:

  1. Consistency: All stateful models (Crawl, Snapshot, ArchiveResult) follow the same unified state machine pattern
  2. Simplicity: Eliminated special-case background hook handling (~280 lines of duplicate code)
  3. Correctness: Background hooks are properly cleaned up on seal transition
  4. Maintainability: Unified process_hook_records() dispatcher for all JSONL processing
  5. Testability: Consistent pattern makes testing easier
  6. Clear Separation: Stateful work items (Crawl/Snapshot/ArchiveResult) vs stateless config (Dependency)
  7. Multi-Machine Support: Dependency remains simple synchronous config, Binary tracks per-machine state

Final Unified Pattern

All models now follow this consistent architecture:

State Machine Structure

class ModelMachine(StateMachine):
    queued = State(initial=True)
    started = State()
    sealed = State(final=True)  # or succeeded/failed for models with terminal outcomes

    @started.enter
    def enter_started(self):
        self.model.run()  # Execute the work

    @sealed.enter  # or @succeeded.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks

Model Methods

class Model:
    # State machine fields
    status = CharField(default='queued')
    retry_at = DateTimeField(default=timezone.now)
    output_dir = CharField(default='', blank=True)
    state_machine_name = 'app.statemachines.ModelMachine'

    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('EventName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)

            if result is None:  # Background hook
                continue

            # Process JSONL records
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(result['records'], overrides=overrides)

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)

        # Finalize children from their filesystem output (queryset is model-specific)
        for child in self.pending_children():
            child.update_from_output()

    def update_and_requeue(self, **fields):
        """Update fields and bump modified_at."""
        for field, value in fields.items():
            setattr(self, field, value)
        self.save(update_fields=[*fields.keys(), 'modified_at'])

    @staticmethod
    def from_jsonl(record: dict, overrides: dict = None):
        """Create/update model from JSONL record."""
        # Implementation specific to model
        # Called by process_hook_records()

Hook Processing Flow

1. Model.run() discovers hooks
2. Hooks execute and output JSONL to stdout
3. JSONL records dispatched via process_hook_records()
4. Each record type handled by Model.from_jsonl()
5. Background hooks tracked via hook.pid files
6. Model.cleanup() kills background hooks on seal
7. Children updated via update_from_output()
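
To make step 2 concrete, a toy hook script might look like this (purely illustrative; the filename and record values are made up, but the JSONL shapes match the record types used above):

#!/usr/bin/env python3
# Toy example hook, e.g. plugins/example/on_Snapshot__example.py (hypothetical):
# a hook does its work, then prints one JSONL record per line to stdout.
import json
import sys

def main() -> int:
    # ... do the actual archiving work here ...

    # Side-effect records get dispatched to Tag/Snapshot.from_jsonl():
    print(json.dumps({'type': 'Tag', 'name': 'example'}))
    print(json.dumps({'type': 'Snapshot', 'url': 'https://example.com/next', 'depth': 1}))

    # The hook's own result record updates the parent ArchiveResult:
    print(json.dumps({'type': 'ArchiveResult', 'status': 'succeeded', 'output_str': 'ok'}))
    return 0

if __name__ == '__main__':
    sys.exit(main())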

Multi-Machine Coordination

  • Work Items (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
  • Resources (Binary): Machine FK, one per machine per binary
  • Configuration (Dependency): No machine FK, global singleton, synchronous execution
  • Execution Tracking (ArchiveResult.iface): FK to NetworkInterface for observability

Testing Checklist

  • Test Crawl → Snapshot creation with hooks
  • Test Snapshot → ArchiveResult creation
  • Test ArchiveResult foreground hooks (JSONL processing)
  • Test ArchiveResult background hooks (PID tracking, cleanup)
  • Test Dependency.run() synchronous installation
  • Test background hook cleanup on seal transition
  • Test multi-machine Crawl execution
  • Test Binary creation per machine (one per machine per binary)
  • Verify Dependency.run() can be called concurrently from multiple machines safely

FINAL ARCHITECTURE (Phases 1-8 Complete)

Phases 1-6: Core Models Unified

All core models (Crawl, Snapshot, ArchiveResult) now follow the unified pattern:

  • State machines orchestrate transitions
  • .run() methods execute hooks and process JSONL
  • .cleanup() methods kill background hooks
  • .update_and_requeue() methods update state for worker coordination
  • Consistent use of process_hook_records() for JSONL dispatching

Phases 7-8: Binary State Machine (Dependency Model Eliminated)

Key Decision: Eliminated Dependency model entirely and made Binary the state machine.

New Architecture

  • Static Configuration: plugins/{plugin}/dependencies.jsonl files define binary requirements

    {"type": "Binary", "name": "yt-dlp", "bin_providers": "pip,brew,apt,env"}
    {"type": "Binary", "name": "node", "bin_providers": "apt,brew,env", "overrides": {"apt": {"packages": ["nodejs"]}}}
    {"type": "Binary", "name": "ffmpeg", "bin_providers": "apt,brew,env"}
    
  • Dynamic State: Binary model tracks per-machine installation state

    • Fields: machine, name, bin_providers, overrides, abspath, version, sha256, binprovider
    • State machine: queued → started → succeeded/failed
    • Output dir: data/machines/{machine_id}/binaries/{binary_name}/{binary_id}/

Binary State Machine Flow

class BinaryMachine(StateMachine):
    # states: queued → started → succeeded/failed

    @started.enter
    def enter_started(self):
        self.binary.run()  # Runs on_Binary__install_* hooks

class Binary(models.Model):
    def run(self):
        """
        Runs ALL on_Binary__install_* hooks.
        Each hook checks bin_providers and decides if it can handle this binary.
        First hook to succeed wins.
        Outputs JSONL with abspath, version, sha256, binprovider.
        """
        hooks = discover_hooks('Binary')
        for hook in hooks:
            result = run_hook(hook, output_dir=self.OUTPUT_DIR / hook.parent.name,
                              binary_id=self.id, machine_id=self.machine_id,
                              name=self.name, bin_providers=self.bin_providers,
                              overrides=json.dumps(self.overrides))
            
            # Hook outputs: {"type": "Binary", "name": "wget", "abspath": "/usr/bin/wget", "version": "1.21", "binprovider": "apt"}
            # Binary.from_jsonl() updates self with installation results

Hook Naming Convention

  • Before: on_Dependency__install_using_pip_provider.py
  • After: on_Binary__install_using_pip_provider.py

Each hook checks --bin-providers CLI argument:

if 'pip' not in bin_providers.split(','):
    sys.exit(0)  # pip isn't an allowed provider for this binary; skip this hook
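
A slightly fuller version of that check, as it might appear at the top of an install hook (argument names beyond --bin-providers are assumptions):

# Hypothetical top of on_Binary__install_using_pip_provider.py:
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--name')
parser.add_argument('--bin-providers', default='')
args, _unknown = parser.parse_known_args()

if 'pip' not in args.bin_providers.split(','):
    sys.exit(0)  # pip not allowed for this binary; another provider hook will try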

Perfect Symmetry Achieved

All models now follow identical patterns:

Crawl(queued)         → CrawlMachine         → Crawl.run()         → sealed
Snapshot(queued)      → SnapshotMachine      → Snapshot.run()      → sealed
ArchiveResult(queued) → ArchiveResultMachine → ArchiveResult.run() → succeeded/failed
Binary(queued)        → BinaryMachine        → Binary.run()        → succeeded/failed

Benefits of Eliminating Dependency

  1. No global singleton conflicts: Binary is per-machine, no race conditions
  2. Simpler data model: One table instead of two (Dependency + InstalledBinary)
  3. Static configuration: dependencies.jsonl in version control, not database
  4. Consistent state machine: Binary follows same pattern as other models
  5. Cleaner hooks: Hooks check bin_providers themselves instead of orchestrator parsing names

Multi-Machine Coordination

  • Work Items (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
  • Resources (Binary): Machine FK, one per machine per binary name
  • Configuration: Static files in plugins/*/dependencies.jsonl
  • Execution Tracking: ArchiveResult.iface FK to NetworkInterface for observability

Testing Checklist (Updated)

  • Core models use unified hook pattern (Phases 1-6)
  • Binary installation via state machine
  • Multiple machines can install same binary independently
  • Hook bin_providers filtering works correctly
  • Binary.from_jsonl() handles both dependencies.jsonl and hook output
  • Binary OUTPUT_DIR structure: data/machines/{machine_id}/binaries/{name}/{id}/