move todos

This commit is contained in:
Nick Sweeting
2025-12-28 04:44:38 -08:00
parent bd265c0083
commit d2e65cfd38
8 changed files with 0 additions and 956 deletions

View File

@@ -0,0 +1,431 @@
# Chrome Plugin Consolidation - COMPLETED ✓
## Core Principle: One ArchiveResult Per Plugin
**Critical Realization:** Each plugin must produce at most ONE ArchiveResult output (infrastructure plugins such as `chrome` produce none). This is fundamental to ArchiveBox's architecture: a single plugin can never produce multiple ArchiveResults.
### CRITICAL ARCHITECTURE CLARIFICATION
**DO NOT CONFUSE THESE CONCEPTS:**
1. **Plugin** = Directory name (e.g., `chrome`, `consolelog`, `screenshot`)
- Lives in `archivebox/plugins/<plugin_name>/`
- Can contain MULTIPLE hook files
- Produces ONE output directory: `users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/`
- Creates at most ONE ArchiveResult record per snapshot (infrastructure plugins create none)
2. **Hook** = Individual script file (e.g., `on_Snapshot__20_chrome_tab.bg.js`)
- Lives inside a plugin directory
- One plugin can have MANY hooks
- All hooks in a plugin run sequentially when that plugin's ArchiveResult is processed
- All hooks write to the SAME output directory (the plugin directory)
3. **Extractor** = ArchiveResult.extractor field = PLUGIN NAME (not hook name)
- `ArchiveResult.extractor = 'chrome'` (plugin name)
- NOT `ArchiveResult.extractor = '20_chrome_tab.bg'` (hook name)
4. **Output Directory** = `users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/`
- One output directory per plugin (0.9.x structure)
- ALL hooks in that plugin write to this same directory
- Example: `users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/` contains outputs from ALL chrome hooks
- Legacy: `archive/{timestamp}/` with symlink for backwards compatibility
**Example 1: Chrome Plugin (Infrastructure - NO ArchiveResult)**
```
Plugin name: 'chrome'
ArchiveResult: NONE (infrastructure only)
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/
Hooks:
- on_Snapshot__20_chrome_tab.bg.js # Launches Chrome, opens tab
- on_Snapshot__30_chrome_navigate.js # Navigates to URL
- on_Snapshot__45_chrome_tab_cleanup.py # Kills Chrome on cleanup
Writes (temporary infrastructure files, deleted on cleanup):
- chrome/cdp_url.txt # Other plugins read this to connect
- chrome/target_id.txt # Tab ID for CDP connection
- chrome/page_loaded.txt # Navigation completion marker
- chrome/navigation.json # Navigation state
- chrome/hook.pid # For cleanup
NO ArchiveResult JSON is produced - this is pure infrastructure.
On SIGTERM: Chrome exits, chrome/ directory is deleted.
```
**Example 2: Screenshot Plugin (Output Plugin - CREATES ArchiveResult)**
```
Plugin name: 'screenshot'
ArchiveResult.extractor: 'screenshot'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/screenshot/
Hooks:
- on_Snapshot__34_screenshot.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Takes screenshot
4. Writes to: screenshot/screenshot.png
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'screenshot.png': {}}
```
**Example 3: PDF Plugin (Output Plugin - CREATES ArchiveResult)**
```
Plugin name: 'pdf'
ArchiveResult.extractor: 'pdf'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/pdf/
Hooks:
- on_Snapshot__35_pdf.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Generates PDF
4. Writes to: pdf/output.pdf
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'output.pdf': {}}
```
**Lifecycle:**
```
1. Chrome hooks run → create chrome/ dir with infrastructure files
2. Screenshot/PDF/etc hooks run → read chrome/cdp_url.txt, write to their own dirs
3. Snapshot.cleanup() called → sends SIGTERM to background hooks
4. Chrome receives SIGTERM → exits, deletes chrome/ dir
5. Screenshot/PDF/etc dirs remain with their outputs
```
**DO NOT:**
- Create one ArchiveResult per hook
- Use hook names as extractor values
- Create separate output directories per hook
**DO:**
- Create one ArchiveResult per plugin
- Use plugin directory name as extractor value
- Run all hooks in a plugin when processing its ArchiveResult
- Write all hook outputs to the same plugin directory
This principle drove the entire consolidation strategy:
- **Chrome plugin** = Infrastructure only (NO ArchiveResult)
- **Output plugins** = Each produces ONE distinct ArchiveResult (kept separate)
## Final Structure
### 1. Chrome Plugin (Infrastructure - No Output)
**Location:** `archivebox/plugins/chrome/`
This plugin provides shared Chrome infrastructure for other plugins. It manages the browser lifecycle but **produces NO ArchiveResult** - only infrastructure files in a single `chrome/` output directory.
**Consolidates these former plugins:**
- `chrome_session/` → Merged
- `chrome_navigate/` → Merged
- `chrome_cleanup/` → Merged
- `chrome_extensions/` → Utilities merged
**Hook Files:**
```
chrome/
├── on_Crawl__00_chrome_install_config.py # Configure Chrome settings
├── on_Crawl__00_chrome_install.py # Install Chrome binary
├── on_Crawl__20_chrome_launch.bg.js # Launch Chrome (Crawl-level, bg)
├── on_Snapshot__20_chrome_tab.bg.js # Open tab (Snapshot-level, bg)
├── on_Snapshot__30_chrome_navigate.js # Navigate to URL (foreground)
├── on_Snapshot__45_chrome_tab_cleanup.py # Close tab, kill bg hooks
├── chrome_extension_utils.js # Extension utilities
├── config.json # Configuration
└── tests/test_chrome.py # Tests
```
**Output Directory (Infrastructure Only):**
```
chrome/
├── cdp_url.txt # WebSocket URL for CDP connection
├── pid.txt # Chrome process PID
├── target_id.txt # Current tab target ID
├── page_loaded.txt # Navigation completion marker
├── final_url.txt # Final URL after redirects
├── navigation.json # Navigation state (NEW)
└── hook.pid # Background hook PIDs (for cleanup)
```
**New: navigation.json**
Tracks navigation state with wait condition and timing:
```json
{
"waitUntil": "networkidle2",
"elapsed": 1523,
"url": "https://example.com",
"finalUrl": "https://example.com/",
"status": 200,
"timestamp": "2025-12-27T22:15:30.123Z"
}
```
Fields:
- `waitUntil` - Wait condition: `networkidle0`, `networkidle2`, `domcontentloaded`, or `load`
- `elapsed` - Navigation time in milliseconds
- `url` - Original requested URL
- `finalUrl` - Final URL after redirects (success only)
- `status` - HTTP status code (success only)
- `error` - Error message (failure only)
- `timestamp` - ISO 8601 completion timestamp
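For illustration, a downstream extractor hook could consume this file roughly like the following Python sketch (the relative path and the failure check are assumptions for illustration, not part of the spec):

```python
import json
from pathlib import Path

# Hypothetical consumer: decide whether navigation succeeded before extracting.
nav = json.loads(Path('../chrome/navigation.json').read_text())

if 'error' in nav or nav.get('status', 200) >= 400:
    print(f"navigation failed after {nav.get('elapsed', '?')}ms: {nav.get('error') or nav.get('status')}")
else:
    print(f"loaded {nav['finalUrl']} (HTTP {nav['status']}) in {nav['elapsed']}ms via waitUntil={nav['waitUntil']}")
```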
### 2. Output Plugins (Each = One ArchiveResult)
These remain **SEPARATE** plugins because each produces a distinct output/ArchiveResult. Each plugin references `../chrome` for infrastructure.
#### consolelog Plugin
```
archivebox/plugins/consolelog/
└── on_Snapshot__21_consolelog.bg.js
```
- **Output:** `console.jsonl` (browser console messages)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
#### ssl Plugin
```
archivebox/plugins/ssl/
└── on_Snapshot__23_ssl.bg.js
```
- **Output:** `ssl.jsonl` (SSL/TLS certificate details)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
#### responses Plugin
```
archivebox/plugins/responses/
└── on_Snapshot__24_responses.bg.js
```
- **Output:** `responses/` directory with `index.jsonl` (network responses)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
#### redirects Plugin
```
archivebox/plugins/redirects/
└── on_Snapshot__31_redirects.bg.js
```
- **Output:** `redirects.jsonl` (redirect chain)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
- **Changed:** Converted to background hook, now uses CDP `Network.requestWillBeSent` to capture redirects from initial request
#### staticfile Plugin
```
archivebox/plugins/staticfile/
└── on_Snapshot__31_staticfile.bg.js
```
- **Output:** Downloaded static file (PDF, image, video, etc.)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
- **Changed:** Converted from Python to JavaScript, now uses CDP to detect Content-Type from initial response and download via CDP
## What Changed
### 1. Plugin Consolidation
- Merged `chrome_session`, `chrome_navigate`, `chrome_cleanup`, `chrome_extensions` → `chrome/`
- Chrome plugin now has **single output directory**: `chrome/`
- All Chrome infrastructure hooks reference `.` (same directory)
### 2. Background Hook Conversions
**redirects Plugin:**
- **Before:** Ran AFTER navigation, reconnected to Chrome to check for redirects
- **After:** Background hook that sets up CDP listeners BEFORE navigation to capture redirects from initial request
- **Method:** Uses CDP `Network.requestWillBeSent` event with `redirectResponse` parameter
**staticfile Plugin:**
- **Before:** Python script that ran AFTER navigation, checked response headers
- **After:** Background JavaScript hook that sets up CDP listeners BEFORE navigation
- **Method:** Uses CDP `page.on('response')` to capture Content-Type from initial request
- **Language:** Converted from Python to JavaScript/Node.js for consistency
### 3. Navigation State Tracking
- **Added:** `navigation.json` file in `chrome/` output directory
- **Contains:** `waitUntil` condition and `elapsed` milliseconds
- **Purpose:** Track navigation performance and wait conditions for analysis
### 4. Cleanup
- **Deleted:** `chrome_session/on_CrawlEnd__99_chrome_cleanup.py` (manual cleanup hook)
- **Reason:** Automatic cleanup via state machines is sufficient
- **Verified:** Cleanup mechanisms in `core/models.py` and `crawls/models.py` work correctly
## Hook Execution Order
```
═══ CRAWL LEVEL ═══
00. chrome_install_config.py Configure Chrome settings
00. chrome_install.py Install Chrome binary
20. chrome_launch.bg.js Launch Chrome browser (STAYS RUNNING)
═══ PER-SNAPSHOT LEVEL ═══
Phase 1: PRE-NAVIGATION (Background hooks setup)
20. chrome_tab.bg.js Open new tab (STAYS ALIVE)
21. consolelog.bg.js Setup console listener (STAYS ALIVE)
23. ssl.bg.js Setup SSL listener (STAYS ALIVE)
24. responses.bg.js Setup network response listener (STAYS ALIVE)
31. redirects.bg.js Setup redirect listener (STAYS ALIVE)
31. staticfile.bg.js Setup staticfile detector (STAYS ALIVE)
Phase 2: NAVIGATION (Foreground - synchronization point)
30. chrome_navigate.js Navigate to URL (BLOCKS until page loaded)
Writes navigation.json with waitUntil & elapsed
Writes page_loaded.txt marker
All background hooks can now finalize
Phase 3: POST-NAVIGATION (Background hooks finalize)
(All .bg hooks save their data and wait for cleanup signal)
Phase 4: OTHER EXTRACTORS (use loaded page)
34. screenshot.js
37. singlefile.js
... (other extractors that need loaded page)
Phase 5: CLEANUP
45. chrome_tab_cleanup.py Close tab
Kill background hooks (SIGTERM → SIGKILL)
Update ArchiveResults
```
## Background Hook Pattern
All `.bg.js` hooks follow this pattern:
1. **Setup:** Create CDP listeners BEFORE navigation
2. **Capture:** Collect data incrementally as events occur
3. **Write:** Save data to filesystem continuously
4. **Wait:** Keep process alive until SIGTERM
5. **Finalize:** On SIGTERM, emit final JSONL result to stdout
6. **Exit:** Clean exit with status code
**Key files written:**
- `hook.pid` - Process ID for cleanup mechanism
- Output files (e.g., `console.jsonl`, `ssl.jsonl`, etc.)
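For reference, a minimal sketch of the same lifecycle in Python (hooks in this commit are mostly `.bg.js`, and the file names below are illustrative, but the steps map 1:1 to the pattern above):

```python
#!/usr/bin/env python3
# Illustrative background-hook skeleton, not an actual plugin file.
import json, os, signal, sys, time
from pathlib import Path

OUTPUT_DIR = Path(os.environ.get('OUTPUT_DIR', '.'))
(OUTPUT_DIR / 'hook.pid').write_text(str(os.getpid()))      # register for cleanup

stop = False
def on_sigterm(signum, frame):
    global stop
    stop = True
signal.signal(signal.SIGTERM, on_sigterm)

events_file = OUTPUT_DIR / 'events.jsonl'                    # hypothetical output file
while not stop:
    # capture data incrementally and flush it to disk as events occur
    with events_file.open('a') as f:
        f.write(json.dumps({'ts': time.time()}) + '\n')
    time.sleep(1)                                             # stay alive until SIGTERM

# on SIGTERM: emit the final JSONL result record to stdout, then exit cleanly
print(json.dumps({'type': 'ArchiveResult', 'status': 'succeeded',
                  'output_files': {events_file.name: {}}}))
sys.exit(0)
```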
## Automatic Cleanup Mechanism
**Snapshot-level cleanup** (`core/models.py`):
```python
def cleanup(self):
    """Kill background hooks and close resources."""
    # Scan OUTPUT_DIR for hook.pid files
    # Send SIGTERM to processes
    # Wait for graceful exit
    # Send SIGKILL if process still alive
    # Update ArchiveResults to FAILED if needed
```
**Crawl-level cleanup** (`crawls/models.py`):
```python
def cleanup(self):
    """Kill Crawl-level background hooks (Chrome browser)."""
    # Similar pattern for Crawl-level resources
    # Kills Chrome launch process
```
**State machine integration:**
- Both `SnapshotMachine` and `CrawlMachine` call `cleanup()` when entering `sealed` state
- Ensures all background processes are cleaned up properly
- No manual cleanup hooks needed
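Condensed, the cleanup both levels rely on amounts to something like this (a sketch assuming the `hook.pid` layout above; the helper name and grace period are illustrative):

```python
import os, signal, time
from pathlib import Path

def kill_background_hooks(output_dir: Path, grace_seconds: float = 5.0) -> None:
    """Illustrative: SIGTERM every */hook.pid under output_dir, SIGKILL any stragglers."""
    pids = []
    for pid_file in output_dir.glob('*/hook.pid'):
        try:
            pid = int(pid_file.read_text().strip())
            os.kill(pid, signal.SIGTERM)           # request graceful shutdown
            pids.append(pid)
        except (OSError, ValueError):
            continue                               # already gone or malformed pid file

    deadline = time.monotonic() + grace_seconds
    for pid in pids:
        while time.monotonic() < deadline:
            try:
                os.kill(pid, 0)                    # signal 0: is it still alive?
            except OSError:
                break                              # exited cleanly within the grace period
            time.sleep(0.1)
        else:
            try:
                os.kill(pid, signal.SIGKILL)       # force-kill if SIGTERM was ignored
            except OSError:
                pass
```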
## Directory References
**Crawl output structure:**
- Crawls output to: `users/{user_id}/crawls/{YYYYMMDD}/{crawl_id}/`
- Example: `users/1/crawls/20251227/abc-def-123/`
- Crawl-level plugins create subdirectories: `users/1/crawls/20251227/abc-def-123/chrome/`
**Snapshot output structure:**
- Snapshots output to: `users/{username}/snapshots/{YYYYMMDD}/{domain}/{snapshot_id}/` (0.9.x structure; legacy `archive/{timestamp}/` remains as a symlink)
- Snapshot-level plugins create subdirectories: `users/{username}/snapshots/{YYYYMMDD}/{domain}/{snapshot_id}/chrome/`, `.../consolelog/`, etc.
**Within chrome plugin:**
- Hooks use `.` or `OUTPUT_DIR` to reference the `chrome/` directory they're running in
- Example: `fs.writeFileSync(path.join(OUTPUT_DIR, 'navigation.json'), ...)`
**From output plugins to chrome (same snapshot):**
- Hooks use `../chrome` to reference Chrome infrastructure in same snapshot
- Example: `const CHROME_SESSION_DIR = '../chrome';`
- Used to read: `cdp_url.txt`, `target_id.txt`, `page_loaded.txt`
**From snapshot hooks to crawl chrome:**
- Snapshot hooks receive `CRAWL_OUTPUT_DIR` environment variable (set by hooks.py)
- Use: `path.join(process.env.CRAWL_OUTPUT_DIR, 'chrome')` to find crawl-level Chrome
- This allows snapshots to reuse the crawl's shared Chrome browser
**Navigation synchronization:**
- All hooks wait for `../chrome/page_loaded.txt` before finalizing
- This file is written by `chrome_navigate.js` after navigation completes
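From an output plugin, that wait could look roughly like this (a sketch; the timeout and polling interval are arbitrary choices, not values from the codebase):

```python
import time
from pathlib import Path

def wait_for_navigation(chrome_dir: str = '../chrome', timeout: float = 60.0) -> bool:
    """Illustrative: block until chrome_navigate.js writes page_loaded.txt, or give up."""
    marker = Path(chrome_dir) / 'page_loaded.txt'
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(0.25)
    return False
```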
## Design Principles
1. **One ArchiveResult Per Plugin**
- Each plugin produces exactly ONE output/ArchiveResult
- Infrastructure plugins (like chrome) produce NO ArchiveResult
2. **Chrome as Infrastructure**
- Provides shared CDP connection, PIDs, navigation state
- No ArchiveResult output of its own
- Single output directory for all infrastructure files
3. **Background Hooks for CDP**
- Hooks that need CDP listeners BEFORE navigation are background (`.bg.js`)
- They capture events from the initial request/response
- Stay alive through navigation and cleanup
4. **Foreground for Synchronization**
- `chrome_navigate.js` is foreground (not `.bg`)
- Provides synchronization point - blocks until page loaded
- All other hooks wait for its completion marker
5. **Automatic Cleanup**
- State machines handle background hook cleanup
- No manual cleanup hooks needed
- SIGTERM for graceful exit, SIGKILL as backup
6. **Clear Separation**
- Infrastructure vs outputs
- One output directory per plugin
- Predictable, maintainable architecture
## Benefits
**Architectural Clarity** - Clear separation between infrastructure and outputs
**Correct Output Model** - One ArchiveResult per plugin
**Better Performance** - CDP listeners capture data from initial request
**No Duplication** - Single Chrome infrastructure used by all
**Proper Lifecycle** - Background hooks cleaned up automatically
**Maintainable** - Easy to understand, debug, and extend
**Consistent** - All background hooks follow same pattern
**Observable** - Navigation state tracked for debugging
## Testing
Run tests:
```bash
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/plugins/chrome/tests/ -v'
```
## Migration Notes
**For developers:**
- Chrome infrastructure is now in `chrome/` output dir (not `chrome_session/`)
- Reference `../chrome/cdp_url.txt` from output plugins
- Navigation marker is `../chrome/page_loaded.txt`
- Navigation details in `../chrome/navigation.json`
**For users:**
- No user-facing changes
- Output structure remains the same
- All extractors continue to work

old/TODO_fs_migrations.md (1240 lines, normal file)

File diff suppressed because it is too large

old/TODO_hook_architecture.md (1976 lines, executable file)

File diff suppressed because it is too large

View File

@@ -0,0 +1,665 @@
# Hook & State Machine Cleanup - Unified Pattern
## Goal
Implement a **consistent pattern** across all models (Crawl, Snapshot, ArchiveResult, Dependency) for:
1. Running hooks
2. Processing JSONL records
3. Managing background hooks
4. State transitions
## Current State Analysis (ALL COMPLETE ✅)
### ✅ Crawl (archivebox/crawls/)
**Status**: COMPLETE
- ✅ Has state machine: `CrawlMachine`
- `Crawl.run()` - runs hooks, processes JSONL via `process_hook_records()`, creates snapshots
- `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd hooks
- ✅ Uses `OUTPUT_DIR/plugin_name/` for PWD
- ✅ State machine calls model methods:
- `queued -> started`: calls `crawl.run()`
- `started -> sealed`: calls `crawl.cleanup()`
### ✅ Snapshot (archivebox/core/)
**Status**: COMPLETE
- ✅ Has state machine: `SnapshotMachine`
- `Snapshot.run()` - creates pending ArchiveResults
- `Snapshot.cleanup()` - kills background ArchiveResult hooks, calls `update_from_output()`
- `Snapshot.has_running_background_hooks()` - checks PID files using `process_is_alive()`
- `Snapshot.from_jsonl()` - simplified, filtering moved to caller
- ✅ State machine calls model methods:
- `queued -> started`: calls `snapshot.run()`
- `started -> sealed`: calls `snapshot.cleanup()`
- `is_finished()`: uses `has_running_background_hooks()`
### ✅ ArchiveResult (archivebox/core/)
**Status**: COMPLETE - Major refactor completed
- ✅ Has state machine: `ArchiveResultMachine`
- `ArchiveResult.run()` - runs hook, calls `update_from_output()` for foreground hooks
- `ArchiveResult.update_from_output()` - unified method for foreground and background hooks
- ✅ Uses PWD `snapshot.OUTPUT_DIR/plugin_name`
- ✅ JSONL processing via `process_hook_records()` with URL/depth filtering
- **DELETED** special background hook methods:
  - `check_background_completed()` - replaced by `process_is_alive()` helper
  - `finalize_background_hook()` - replaced by `update_from_output()`
  - `_populate_output_fields()` - merged into `update_from_output()`
- ✅ State machine transitions:
- `queued -> started`: calls `archiveresult.run()`
- `started -> succeeded/failed/skipped`: status set by `update_from_output()`
### ✅ Binary (archivebox/machine/) - NEW!
**Status**: COMPLETE - Replaced Dependency model entirely
- ✅ Has state machine: `BinaryMachine`
- `Binary.run()` - runs on_Binary__install_* hooks, processes JSONL
- `Binary.cleanup()` - kills background installation hooks (for consistency)
- `Binary.from_jsonl()` - handles both binaries.jsonl and hook output
- ✅ Uses PWD `data/machines/{machine_id}/binaries/{name}/{id}/plugin_name/`
- ✅ Configuration via static `plugins/*/binaries.jsonl` files
- ✅ State machine calls model methods:
- `queued -> started`: calls `binary.run()`
- `started -> succeeded/failed`: status set by hooks via JSONL
- ✅ Perfect symmetry with Crawl/Snapshot/ArchiveResult pattern
### ❌ Dependency Model - ELIMINATED
**Status**: Deleted entirely (replaced by Binary state machine)
- Static configuration now lives in `plugins/*/binaries.jsonl`
- Per-machine state tracked by Binary records
- No global singleton conflicts
- Hooks renamed from `on_Dependency__install_*` to `on_Binary__install_*`
## Unified Pattern (Target Architecture)
### Pattern for ALL models:
```python
# 1. State Machine orchestrates transitions
class ModelMachine(StateMachine):
    @started.enter
    def enter_started(self):
        self.model.run()  # Do the work
        # Update status

    def is_finished(self):
        # Check if background hooks still running
        if self.model.has_running_background_hooks():
            return False
        # Check if children finished
        if self.model.has_pending_children():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
        # Update status


# 2. Model methods do the actual work
class Model:
    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('ModelName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)
            if result is None:  # Background hook
                continue
            # Process JSONL records
            records = result.get('records', [])
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(records, overrides=overrides)

        # Create children (e.g., ArchiveResults, Snapshots)
        self.create_children()

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        # Kill any background hooks
        if self.OUTPUT_DIR.exists():
            for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
                kill_process(pid_file)

        # Run cleanup hooks (e.g., on_ModelEnd)
        cleanup_hooks = discover_hooks('ModelEnd')
        for hook in cleanup_hooks:
            run_hook(hook, ...)

    def has_running_background_hooks(self) -> bool:
        """Check if any background hooks still running."""
        if not self.OUTPUT_DIR.exists():
            return False
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            if process_is_alive(pid_file):
                return True
        return False
```
### PWD Standard:
```
model.OUTPUT_DIR/plugin_name/
```
- Crawl: `users/{user}/crawls/{date}/{crawl_id}/plugin_name/`
- Snapshot: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/`
- ArchiveResult: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/` (same as Snapshot)
- Dependency: `dependencies/{dependency_id}/plugin_name/` (set output_dir field directly)
## Implementation Plan
### Phase 1: Add unified helpers to hooks.py ✅ DONE
**File**: `archivebox/hooks.py`
**Status**: COMPLETE - Added three helper functions:
- `process_hook_records(records, overrides)` - lines 1258-1323
- `process_is_alive(pid_file)` - lines 1326-1344
- `kill_process(pid_file, sig)` - lines 1347-1362
```python
def process_hook_records(records: List[Dict], overrides: Dict = None) -> Dict[str, int]:
    """
    Process JSONL records from hook output.
    Dispatches to Model.from_jsonl() for each record type.

    Args:
        records: List of JSONL record dicts from result['records']
        overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

    Returns:
        Dict with counts by record type
    """
    stats = {}
    for record in records:
        record_type = record.get('type')
        # Dispatch to appropriate model
        if record_type == 'Snapshot':
            from core.models import Snapshot
            Snapshot.from_jsonl(record, overrides)
            stats['Snapshot'] = stats.get('Snapshot', 0) + 1
        elif record_type == 'Tag':
            from core.models import Tag
            Tag.from_jsonl(record, overrides)
            stats['Tag'] = stats.get('Tag', 0) + 1
        elif record_type == 'Binary':
            from machine.models import Binary
            Binary.from_jsonl(record, overrides)
            stats['Binary'] = stats.get('Binary', 0) + 1
        # ... etc
    return stats


def process_is_alive(pid_file: Path) -> bool:
    """Check if process in PID file is still running."""
    if not pid_file.exists():
        return False
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, 0)  # Signal 0 = check if exists
        return True
    except (OSError, ValueError):
        return False


def kill_process(pid_file: Path, signal=SIGTERM):
    """Kill process in PID file."""
    if not pid_file.exists():
        return
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, signal)
    except (OSError, ValueError):
        pass
```
### Phase 2: Add Model.from_jsonl() static methods ✅ DONE
**Files**: `archivebox/core/models.py`, `archivebox/machine/models.py`, `archivebox/crawls/models.py`
**Status**: COMPLETE - Added from_jsonl() to:
- `Tag.from_jsonl()` - core/models.py lines 93-116
- `Snapshot.from_jsonl()` - core/models.py lines 1144-1189
- `Machine.from_jsonl()` - machine/models.py lines 66-89
- `Dependency.from_jsonl()` - machine/models.py lines 203-227
- `Binary.from_jsonl()` - machine/models.py lines 401-434
Example implementations added:
```python
class Snapshot:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Snapshot from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_snapshot

        overrides = overrides or {}
        # Apply overrides (crawl, parent_snapshot, depth limits)
        crawl = overrides.get('crawl')
        snapshot = overrides.get('snapshot')  # parent
        if crawl:
            depth = record.get('depth', (snapshot.depth + 1 if snapshot else 1))
            if depth > crawl.max_depth:
                return None
            record.setdefault('crawl_id', str(crawl.id))
            record.setdefault('depth', depth)
        if snapshot:
            record.setdefault('parent_snapshot_id', str(snapshot.id))

        created_by_id = overrides.get('created_by_id')
        new_snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
        new_snapshot.status = Snapshot.StatusChoices.QUEUED
        new_snapshot.retry_at = timezone.now()
        new_snapshot.save()
        return new_snapshot


class Tag:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Tag from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_tag

        tag = get_or_create_tag(record)
        # Auto-attach to snapshot if in overrides
        if overrides and 'snapshot' in overrides:
            overrides['snapshot'].tags.add(tag)
        return tag


class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        # Implementation similar to existing create_model_record()
        ...

# Etc for other models
```
### Phase 3: Update ArchiveResult to use unified pattern ✅ DONE
**File**: `archivebox/core/models.py`
**Status**: COMPLETE
**Changes made**:
1. **Replaced inline JSONL processing** (lines 1912-1950):
- Pre-filter Snapshot records for depth/URL constraints in ArchiveResult.run()
- Use `self._url_passes_filters(url)` with parent snapshot's config for proper hierarchy
- Replaced inline Tag/Snapshot/other record creation with `process_hook_records()`
- Removed ~60 lines of duplicate code
2. **Simplified Snapshot.from_jsonl()** (lines 1144-1189):
- Removed depth checking (now done in caller)
- Just applies crawl metadata and creates snapshot
- Added docstring note: "Filtering should be done by caller BEFORE calling this method"
3. **Preserved ArchiveResult self-update logic**:
- Status/output fields still updated from ArchiveResult JSONL record (lines 1856-1910)
- Special title extractor logic preserved (line 1952+)
- Search indexing trigger preserved (line 1957+)
4. **Key insight**: Filtering happens in ArchiveResult.run() where we have parent snapshot context, NOT in from_jsonl() where we'd lose config hierarchy
**Note**: Did NOT delete special background hook methods (`check_background_completed`, `finalize_background_hook`) - that's Phase 6
### Phase 4: Add Snapshot.cleanup() method ✅ DONE
**File**: `archivebox/core/models.py`
**Status**: COMPLETE
**Changes made**:
1. **Added Snapshot.cleanup()** (lines 1144-1175):
- Kills background ArchiveResult hooks by scanning for `*/hook.pid` files
- Finalizes background ArchiveResults using `finalize_background_hook()` (temporary until Phase 6)
- Called by state machine when entering sealed state
2. **Added Snapshot.has_running_background_hooks()** (lines 1177-1195):
- Checks if any background hooks still running using `process_is_alive()`
- Used by state machine in `is_finished()` check
### Phase 5: Update SnapshotMachine to use cleanup() ✅ DONE
**File**: `archivebox/core/statemachines.py`
**Status**: COMPLETE
**Changes made**:
1. **Simplified is_finished()** (lines 58-72):
- Removed inline background hook checking and finalization (lines 67-76 deleted)
- Now uses `self.snapshot.has_running_background_hooks()` (line 68)
- Removed ~12 lines of duplicate logic
2. **Added cleanup() to sealed.enter** (lines 102-111):
- Calls `self.snapshot.cleanup()` to kill background hooks (line 105)
- Follows unified pattern: cleanup happens on seal, not in is_finished()
### Phase 6: Add ArchiveResult.update_from_output() and simplify run() ✅ DONE
**File**: `archivebox/core/models.py`
**Status**: COMPLETE - The BIG refactor (removed ~200 lines of duplication)
**Changes made**:
1. **Added `ArchiveResult.update_from_output()`** (lines 1908-2061):
- Unified method for both foreground and background hooks
- Reads stdout.log and parses JSONL records
- Updates status/output_str/output_json from ArchiveResult JSONL record
- Walks filesystem to populate output_files/output_size/output_mimetypes
- Filters Snapshot records for depth/URL constraints (same as run())
- Processes side-effect records via `process_hook_records()`
- Updates snapshot title if title extractor
- Triggers search indexing if succeeded
- Cleans up PID files and empty logs
- ~160 lines of comprehensive logic
2. **Simplified `ArchiveResult.run()`** (lines 1841-1906):
- Removed ~120 lines of duplicate filesystem reading logic
- Now just sets start_ts/pwd and calls `update_from_output()`
- Background hooks: return immediately after saving status=STARTED
- Foreground hooks: call `update_from_output()` to do all the work
- Removed ~10 lines of duplicate code
3. **Updated `Snapshot.cleanup()`** (line 1172):
- Changed from `ar.finalize_background_hook()` to `ar.update_from_output()`
- Uses the unified method instead of the old special-case method
4. **Deleted `_populate_output_fields()`** (was ~45 lines):
- Logic merged into `update_from_output()`
- Eliminates duplication of filesystem walking code
5. **Deleted `check_background_completed()`** (was ~20 lines):
- Replaced by `process_is_alive(pid_file)` from hooks.py
- Generic helper used by Snapshot.has_running_background_hooks()
6. **Deleted `finalize_background_hook()`** (was ~85 lines):
- Completely replaced by `update_from_output()`
- Was duplicate of foreground hook finalization logic
**Total lines removed**: ~280 lines of duplicate code
**Total lines added**: ~160 lines of unified code
**Net reduction**: ~120 lines (-43%)
### Phase 7-8: Dependency State Machine ❌ NOT NEEDED
**Status**: Intentionally skipped - Dependency doesn't need a state machine
**Why no state machine for Dependency?**
1. **Wrong Granularity**: Dependency is a GLOBAL singleton (one record per binary name)
- Multiple machines would race to update the same `status`/`retry_at` fields
- No clear semantics: "started" on which machine? "failed" on Machine A but "succeeded" on Machine B?
2. **Wrong Timing**: Installation should be SYNCHRONOUS, not queued
- When a worker needs wget, it should install wget NOW, not queue it for later
- No benefit to async state machine transitions
3. **State Lives Elsewhere**: Binary records are the actual state
- Each machine has its own Binary records (one per machine per binary)
- Binary.machine FK provides proper per-machine state tracking
**Correct Architecture:**
```
Dependency (global, no state machine):
├─ Configuration: bin_name, bin_providers, overrides
├─ run() method: synchronous installation attempt
└─ NO status, NO retry_at, NO state_machine_name
Binary (per-machine, has machine FK):
├─ State: is this binary installed on this specific machine?
├─ Created via JSONL output from on_Dependency hooks
└─ unique_together = (machine, name, abspath, version, sha256)
```
**What was implemented:**
- **Refactored `Dependency.run()`** (lines 249-324):
- Uses `discover_hooks()` and `process_hook_records()` for consistency
- Added comprehensive docstring explaining why no state machine
- Synchronous execution: returns Binary or None immediately
- Uses unified JSONL processing pattern
- **Kept Dependency simple**: Just configuration fields, no state fields
- **Multi-machine support**: Each machine independently runs Dependency.run() and creates its own Binary
## Summary of Changes
### Progress: 6/6 Core Phases Complete ✅ + 2 Phases Skipped (Intentionally)
**ALL core functionality is now complete!** The unified pattern is consistently implemented across Crawl, Snapshot, and ArchiveResult. Dependency intentionally kept simple (no state machine needed).
### Files Modified:
1. **DONE** `archivebox/hooks.py` - Add unified helpers:
- `process_hook_records(records, overrides)` - dispatcher (lines 1258-1323)
- `process_is_alive(pid_file)` - check if PID still running (lines 1326-1344)
- `kill_process(pid_file)` - kill process (lines 1347-1362)
2. **DONE** `archivebox/crawls/models.py` - Already updated:
- `Crawl.run()` - runs hooks, processes JSONL, creates snapshots
- `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd
3. **DONE** `archivebox/core/models.py`:
- `Tag.from_jsonl()` - lines 93-116
- `Snapshot.from_jsonl()` - lines 1197-1234 (simplified, removed filtering)
- `Snapshot.cleanup()` - lines 1144-1172 (kill background hooks, calls ar.update_from_output())
- `Snapshot.has_running_background_hooks()` - lines 1174-1193 (check PIDs)
- `ArchiveResult.run()` - simplified, uses `update_from_output()` (lines 1841-1906)
- `ArchiveResult.update_from_output()` - unified filesystem reading (lines 1908-2061)
- **DELETED** `ArchiveResult.check_background_completed()` - replaced by `process_is_alive()`
- **DELETED** `ArchiveResult.finalize_background_hook()` - replaced by `update_from_output()`
- **DELETED** `ArchiveResult._populate_output_fields()` - merged into `update_from_output()`
4. **DONE** `archivebox/core/statemachines.py`:
- ✅ Simplified `SnapshotMachine.is_finished()` - uses `has_running_background_hooks()` (line 68)
- ✅ Added cleanup call to `SnapshotMachine.sealed.enter` (line 105)
5. **DONE** `archivebox/machine/models.py`:
- `Machine.from_jsonl()` - lines 66-89
- `Dependency.from_jsonl()` - lines 203-227
- `Binary.from_jsonl()` - lines 401-434
- ✅ Refactored `Dependency.run()` to use unified pattern (lines 249-324)
- ✅ Added comprehensive docstring explaining why Dependency doesn't need state machine
- ✅ Kept Dependency simple: no state fields, synchronous execution only
### Code Metrics:
- **Lines removed**: ~280 lines of duplicate code
- **Lines added**: ~160 lines of unified code
- **Net reduction**: ~120 lines total (-43%)
- **Files created**: 0 (no new files needed)
### Key Benefits:
1. **Consistency**: All stateful models (Crawl, Snapshot, ArchiveResult) follow the same unified state machine pattern
2. **Simplicity**: Eliminated special-case background hook handling (~280 lines of duplicate code)
3. **Correctness**: Background hooks are properly cleaned up on seal transition
4. **Maintainability**: Unified `process_hook_records()` dispatcher for all JSONL processing
5. **Testability**: Consistent pattern makes testing easier
6. **Clear Separation**: Stateful work items (Crawl/Snapshot/ArchiveResult) vs stateless config (Dependency)
7. **Multi-Machine Support**: Dependency remains simple synchronous config, Binary tracks per-machine state
## Final Unified Pattern
All models now follow this consistent architecture:
### State Machine Structure
```python
class ModelMachine(StateMachine):
    queued = State(initial=True)
    started = State()
    sealed/succeeded/failed = State(final=True)

    @started.enter
    def enter_started(self):
        self.model.run()      # Execute the work

    @sealed.enter  # or @succeeded.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
```
### Model Methods
```python
class Model:
    # State machine fields
    status = CharField(default='queued')
    retry_at = DateTimeField(default=timezone.now)
    output_dir = CharField(default='', blank=True)
    state_machine_name = 'app.statemachines.ModelMachine'

    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('EventName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)
            if result is None:  # Background hook
                continue
            # Process JSONL records
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(result['records'], overrides=overrides)

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)
        # Update children from filesystem
        child.update_from_output()

    def update_for_workers(self, **fields):
        """Update fields and bump modified_at."""
        for field, value in fields.items():
            setattr(self, field, value)
        self.save(update_fields=[*fields.keys(), 'modified_at'])

    @staticmethod
    def from_jsonl(record: dict, overrides: dict = None):
        """Create/update model from JSONL record."""
        # Implementation specific to model
        # Called by process_hook_records()
```
### Hook Processing Flow
```
1. Model.run() discovers hooks
2. Hooks execute and output JSONL to stdout
3. JSONL records dispatched via process_hook_records()
4. Each record type handled by Model.from_jsonl()
5. Background hooks tracked via hook.pid files
6. Model.cleanup() kills background hooks on seal
7. Children updated via update_from_output()
```
### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
- **Resources** (Binary): Machine FK, one per machine per binary
- **Configuration** (Dependency): No machine FK, global singleton, synchronous execution
- **Execution Tracking** (ArchiveResult.iface): FK to NetworkInterface for observability
## Testing Checklist
- [ ] Test Crawl → Snapshot creation with hooks
- [ ] Test Snapshot → ArchiveResult creation
- [ ] Test ArchiveResult foreground hooks (JSONL processing)
- [ ] Test ArchiveResult background hooks (PID tracking, cleanup)
- [ ] Test Dependency.run() synchronous installation
- [ ] Test background hook cleanup on seal transition
- [ ] Test multi-machine Crawl execution
- [ ] Test Binary creation per machine (one per machine per binary)
- [ ] Verify Dependency.run() can be called concurrently from multiple machines safely
## FINAL ARCHITECTURE (Phases 1-8 Complete)
### ✅ Phases 1-6: Core Models Unified
All core models (Crawl, Snapshot, ArchiveResult) now follow the unified pattern:
- State machines orchestrate transitions
- `.run()` methods execute hooks and process JSONL
- `.cleanup()` methods kill background hooks
- `.update_for_workers()` methods update state for worker coordination
- Consistent use of `process_hook_records()` for JSONL dispatching
### ✅ Phases 7-8: Binary State Machine (Dependency Model Eliminated)
**Key Decision**: Eliminated `Dependency` model entirely and made `Binary` the state machine.
#### New Architecture
- **Static Configuration**: `plugins/{plugin}/dependencies.jsonl` files define binary requirements
```jsonl
{"type": "Binary", "name": "yt-dlp", "bin_providers": "pip,brew,apt,env"}
{"type": "Binary", "name": "node", "bin_providers": "apt,brew,env", "overrides": {"apt": {"packages": ["nodejs"]}}}
{"type": "Binary", "name": "ffmpeg", "bin_providers": "apt,brew,env"}
```
- **Dynamic State**: `Binary` model tracks per-machine installation state
- Fields: `machine`, `name`, `bin_providers`, `overrides`, `abspath`, `version`, `sha256`, `binprovider`
- State machine: `queued → started → succeeded/failed`
- Output dir: `data/machines/{machine_id}/binaries/{binary_name}/{binary_id}/`
#### Binary State Machine Flow
```python
class BinaryMachine(StateMachine):
    # queued → started → succeeded/failed

    @started.enter
    def enter_started(self):
        self.binary.run()  # Runs on_Binary__install_* hooks


class Binary(models.Model):
    def run(self):
        """
        Runs ALL on_Binary__install_* hooks.
        Each hook checks bin_providers and decides if it can handle this binary.
        First hook to succeed wins.
        Outputs JSONL with abspath, version, sha256, binprovider.
        """
        hooks = discover_hooks('Binary')
        for hook in hooks:
            result = run_hook(hook, output_dir=self.OUTPUT_DIR/plugin_name,
                              binary_id=self.id, machine_id=self.machine_id,
                              name=self.name, bin_providers=self.bin_providers,
                              overrides=json.dumps(self.overrides))
            # Hook outputs: {"type": "Binary", "name": "wget", "abspath": "/usr/bin/wget", "version": "1.21", "binprovider": "apt"}
            # Binary.from_jsonl() updates self with installation results
```
#### Hook Naming Convention
- **Before**: `on_Dependency__install_using_pip_provider.py`
- **After**: `on_Binary__install_using_pip_provider.py`
Each hook checks `--bin-providers` CLI argument:
```python
if 'pip' not in bin_providers.split(','):
    sys.exit(0)  # Skip this binary
```
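To make the shape of such a hook concrete, here is a rough Python sketch of an install hook (the argument parsing, pip invocation, and emitted fields are assumptions based on this plan, not the actual plugin code):

```python
#!/usr/bin/env python3
# Illustrative skeleton of an on_Binary__install_using_pip_provider.py hook.
import argparse, json, shutil, subprocess, sys

parser = argparse.ArgumentParser()
parser.add_argument('--name', required=True)
parser.add_argument('--bin-providers', default='')
args = parser.parse_args()

if 'pip' not in args.bin_providers.split(','):
    sys.exit(0)  # pip is not an allowed provider for this binary; let another hook handle it

# Try to install with pip, then locate the resulting binary on PATH.
subprocess.run([sys.executable, '-m', 'pip', 'install', args.name], check=False)
abspath = shutil.which(args.name)
if not abspath:
    sys.exit(1)  # installation failed; another provider hook may still succeed

version = subprocess.run([abspath, '--version'], capture_output=True, text=True).stdout.strip()

# Emit a Binary JSONL record for Binary.from_jsonl() to pick up.
print(json.dumps({'type': 'Binary', 'name': args.name, 'abspath': abspath,
                  'version': version, 'binprovider': 'pip'}))
```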
#### Perfect Symmetry Achieved
All models now follow identical patterns:
```python
Crawl(queued) → CrawlMachine → Crawl.run() → sealed
Snapshot(queued) → SnapshotMachine → Snapshot.run() → sealed
ArchiveResult(queued) → ArchiveResultMachine → ArchiveResult.run() → succeeded/failed
Binary(queued) → BinaryMachine → Binary.run() → succeeded/failed
```
#### Benefits of Eliminating Dependency
1. **No global singleton conflicts**: Binary is per-machine, no race conditions
2. **Simpler data model**: One table instead of two (Dependency + InstalledBinary)
3. **Static configuration**: dependencies.jsonl in version control, not database
4. **Consistent state machine**: Binary follows same pattern as other models
5. **Cleaner hooks**: Hooks check bin_providers themselves instead of orchestrator parsing names
#### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
- **Resources** (Binary): Machine FK, one per machine per binary name
- **Configuration**: Static files in `plugins/*/dependencies.jsonl`
- **Execution Tracking**: ArchiveResult.iface FK to NetworkInterface for observability
### Testing Checklist (Updated)
- [x] Core models use unified hook pattern (Phases 1-6)
- [ ] Binary installation via state machine
- [ ] Multiple machines can install same binary independently
- [ ] Hook bin_providers filtering works correctly
- [ ] Binary.from_jsonl() handles both dependencies.jsonl and hook output
- [ ] Binary OUTPUT_DIR structure: data/machines/{machine_id}/binaries/{name}/{id}/

View File

@@ -0,0 +1,517 @@
# TODO: Rename Extractor to Plugin - Implementation Progress
**Status**: 🟡 In Progress (2/13 phases complete)
**Started**: 2025-12-28
**Estimated Files to Update**: ~150+ files
---
## Progress Overview
### ✅ Completed Phases (2/13)
- [x] **Phase 1**: Database Migration - Created migration 0033
- [x] **Phase 2**: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models
### 🟡 In Progress (1/13)
- [ ] **Phase 3**: Hook Execution System (hooks.py - all function renames)
### ⏳ Pending Phases (10/13)
- [ ] **Phase 4**: JSONL Import/Export (misc/jsonl.py)
- [ ] **Phase 5**: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
- [ ] **Phase 6**: API Endpoints (v1_core.py, v1_cli.py)
- [ ] **Phase 7**: Admin Interface (admin_archiveresults.py, forms.py)
- [ ] **Phase 8**: Views and Templates (views.py, templatetags, progress_monitor.html)
- [ ] **Phase 9**: Worker System (workers/worker.py)
- [ ] **Phase 10**: State Machine (statemachines.py)
- [ ] **Phase 11**: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
- [ ] **Phase 12**: Terminology Standardization (via_extractor→plugin, comments, docstrings)
- [ ] **Phase 13**: Run migrations and verify all tests pass
---
## What's Been Completed So Far
### Phase 1: Database Migration ✅
**File Created**: `archivebox/core/migrations/0033_rename_extractor_add_hook_name.py`
Changes:
- Used `migrations.RenameField()` to rename `extractor` → `plugin`
- Added `hook_name` field (CharField, max_length=255, indexed, default='')
- Preserves all existing data, indexes, and constraints
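A minimal sketch of the shape of that migration (field details are taken from this plan; this is not the actual file contents, and the 0032 dependency name is a placeholder):

```python
# Hypothetical sketch of core/migrations/0033_rename_extractor_add_hook_name.py
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [('core', '0032_previous')]  # placeholder for the real 0032 migration name

    operations = [
        # RenameField preserves existing data, indexes, and constraints
        migrations.RenameField('archiveresult', 'extractor', 'plugin'),
        # hook_name stores the specific hook filename that executed, e.g. on_Snapshot__50_wget.py
        migrations.AddField(
            'archiveresult', 'hook_name',
            models.CharField(max_length=255, blank=True, default='', db_index=True),
        ),
    ]
```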
### Phase 2: Core Models ✅
**File Updated**: `archivebox/core/models.py`
#### ArchiveResultManager
- Updated `indexable()` method to use `plugin__in` and `plugin=method`
- Changed reference from `ARCHIVE_METHODS_INDEXING_PRECEDENCE` to `EXTRACTOR_INDEXING_PRECEDENCE`
#### ArchiveResult Model
**Field Changes**:
- Renamed field: `extractor` → `plugin`
- Added field: `hook_name` (stores full filename like `on_Snapshot__50_wget.py`)
- Updated comments to reference "plugin" instead of "extractor"
**Method Updates**:
- `get_extractor_choices()` → `get_plugin_choices()`
- `__str__()`: Now uses `self.plugin`
- `save()`: Logs `plugin` instead of `extractor`
- `get_absolute_url()`: Uses `self.plugin`
- `extractor_module` property → `plugin_module` property
- `output_exists()`: Checks `self.plugin` directory
- `embed_path()`: Uses `self.plugin` for paths
- `create_output_dir()`: Creates `self.plugin` directory
- `output_dir_name`: Returns `self.plugin`
- `run()`: All references to extractor → plugin (including extractor_dir → plugin_dir)
- `update_from_output()`: All references updated to plugin/plugin_dir
- `_update_snapshot_title()`: Parameter renamed to `plugin_dir`
- `trigger_search_indexing()`: Passes `plugin=self.plugin`
- `output_dir` property: Returns plugin directory
- `is_background_hook()`: Uses `plugin_dir`
#### Snapshot Model
**Method Updates**:
- `create_pending_archiveresults()`: Uses `get_enabled_plugins()`, filters by `plugin=plugin`
- `result_icons` (calc_icons): Maps by `r.plugin`, calls `get_plugin_name()` and `get_plugin_icon()`
- `_merge_archive_results_from_index()`: Maps by `(ar.plugin, ar.start_ts)`, supports both 'extractor' and 'plugin' keys for backwards compat
- `_create_archive_result_if_missing()`: Supports both 'extractor' and 'plugin' keys, creates with `plugin=plugin`
- `write_index_json()`: Writes `'plugin': ar.plugin` in archive_results
- `canonical_outputs()`: Updates `find_best_output_in_dir()` to use `plugin_name`, accesses `result.plugin`, creates keys like `{result.plugin}_path`
- `latest_outputs()`: Uses `get_plugins()`, filters by `plugin=plugin`
- `retry_failed_archiveresults()`: Updated docstring to reference "plugins" instead of "extractors"
**Total Lines Changed in models.py**: ~50+ locations
---
## Full Implementation Plan
# ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field
## Overview
Refactor the ArchiveResult model and standardize terminology across the codebase:
1. Rename the `extractor` field to `plugin` in ArchiveResult model
2. Add a new `hook_name` field to store the specific hook filename that executed
3. Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
4. Standardize CLI flags from `--extract/--extractors` to `--plugins`
5. **Standardize terminology throughout codebase**:
- "parsers" → "parser plugins"
- "extractors" → "extractor plugins"
- "parser extractors" → "parser plugins"
- "archive methods" → "extractor plugins"
- Document apt/brew/npm/pip as "package manager plugins" in comments
## Current State Analysis
### ArchiveResult Model (archivebox/core/models.py:1679-1750)
```python
class ArchiveResult(ModelWithOutputDir, ...):
    extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
    # New fields from migration 0029:
    output_str, output_json, output_files, output_size, output_mimetypes
    binary = ForeignKey('machine.Binary', ...)
    # No hook_name field yet
```
### Hook Execution Flow
1. `ArchiveResult.run()` discovers hooks for the plugin (e.g., `wget/on_Snapshot__50_wget.py`)
2. `run_hook()` executes each hook script, captures output as HookResult
3. `update_from_output()` parses JSONL and updates ArchiveResult fields
4. Currently NO tracking of which specific hook file executed
### Field Usage Across Codebase
**extractor field** is used in ~100 locations:
- **Model**: ArchiveResult.extractor field definition, __str__, manager queries
- **CLI**: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
- **API**: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
- **Admin**: admin_archiveresults.py (list filter, display)
- **Views**: core/views.py (archiveresult_objects dict by extractor)
- **Template Tags**: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
- **Hooks**: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
- **JSONL**: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
- **Worker**: workers/worker.py (ArchiveResultWorker filters by extractor)
- **Statemachine**: statemachines.py (logs extractor in state transitions)
---
## Implementation Plan
### Phase 1: Database Migration (archivebox/core/migrations/) ✅ COMPLETE
**Create migration 0033_rename_extractor_add_hook_name.py**:
1. Rename field: `extractor` → `plugin` (preserve index, constraints)
2. Add field: `hook_name` = CharField(max_length=255, blank=True, default='', db_index=True)
- **Stores full hook filename**: `on_Snapshot__50_wget.py`, `on_Crawl__10_chrome_session.js`, etc.
- Empty string for existing records (data migration sets all to '')
3. Update any indexes or constraints that reference extractor
**Decision**: Full filename chosen for explicitness and easy grep-ability
**Critical Files to Update**:
- ✅ ArchiveResult model field definitions
- ✅ Migration dependencies (latest: 0032)
---
### Phase 2: Core Model Updates (archivebox/core/models.py) ✅ COMPLETE
**ArchiveResult Model** (lines 1679-1820):
- ✅ Rename field: `extractor` → `plugin`
- ✅ Add field: `hook_name = models.CharField(...)`
- ✅ Update __str__: `f'...-> {self.plugin}'`
- ✅ Update absolute_url: Use plugin instead of extractor
- ✅ Update embed_path: Use plugin directory name
**ArchiveResultManager** (lines 1669-1677):
- ✅ Update indexable(): `filter(plugin__in=INDEXABLE_METHODS, ...)`
- ✅ Update precedence: `When(plugin=method, ...)`
**Snapshot Model** (lines 1000-1600):
- ✅ Update canonical_outputs: Access by plugin name
- ✅ Update create_pending_archiveresults: Use plugin parameter
- ✅ All queryset filters: `archiveresult_set.filter(plugin=...)`
---
### Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS
**Function Renames**:
- [ ] `get_extractors()` → `get_plugins()` (lines 479-504)
- [ ] `get_parser_extractors()` → `get_parser_plugins()` (lines 507-514)
- [ ] `get_extractor_name()` → `get_plugin_name()` (lines 517-530)
- [ ] `is_parser_extractor()` → `is_parser_plugin()` (lines 533-536)
- [ ] `get_enabled_extractors()` → `get_enabled_plugins()` (lines 553-566)
- [ ] `get_extractor_template()` → `get_plugin_template()` (line 1048)
- [ ] `get_extractor_icon()` → `get_plugin_icon()` (line 1068)
- [ ] `get_all_extractor_icons()` → `get_all_plugin_icons()` (line 1092)
**Update HookResult TypedDict** (lines 63-73):
- [ ] Add field: `hook_name: str` to store hook filename
- [ ] Add field: `plugin: str` (if not already present)
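Sketched out, the extended TypedDict might look like this (fields other than the two additions are guesses, shown only to convey the shape):

```python
from typing import Any, Dict, List, TypedDict

class HookResult(TypedDict, total=False):
    # Existing fields (approximate; see hooks.py lines 63-73)
    returncode: int
    records: List[Dict[str, Any]]
    # New fields proposed in this phase
    plugin: str      # plugin directory name, e.g. 'wget'
    hook_name: str   # full hook filename, e.g. 'on_Snapshot__50_wget.py'
```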
**Update run_hook()** (lines 141-389):
- [ ] **Add hook_name parameter**: Pass hook filename to be stored in result
- [ ] Update HookResult to include hook_name field
- [ ] Update JSONL record output: Add `hook_name` key
**Update ArchiveResult.run()** (lines 1838-1914):
- [ ] When calling run_hook, pass the hook filename
- [ ] Store hook_name in ArchiveResult before/after execution
**Update ArchiveResult.update_from_output()** (lines 1916-2073):
- [ ] Parse hook_name from JSONL output
- [ ] Store in self.hook_name field
- [ ] If not present in JSONL, infer from directory/filename
**Constants to Rename**:
- [ ] `ARCHIVE_METHODS_INDEXING_PRECEDENCE` → `EXTRACTOR_INDEXING_PRECEDENCE`
**Comments/Docstrings**: Update all function docstrings to use "plugin" terminology
---
### Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)
**Update archiveresult_to_jsonl()** (lines 173-200):
- [ ] Change key: `'extractor': result.extractor` → `'plugin': result.plugin`
- [ ] Add key: `'hook_name': result.hook_name`
**Update JSONL parsing**:
- [ ] **Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing**
- [ ] Always write 'plugin' key in new exports (never 'extractor')
- [ ] Parse and store hook_name if present (backwards compat: empty if missing)
**Decision**: Support both keys on import for smooth migration, always export new format
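In code, the import-side compatibility could be as simple as normalizing the key before use (a sketch; the helper name is made up):

```python
def normalize_archiveresult_record(record: dict) -> dict:
    """Illustrative: accept the legacy 'extractor' key on import, use 'plugin' internally."""
    if 'plugin' not in record and 'extractor' in record:
        record['plugin'] = record.pop('extractor')
    record.setdefault('hook_name', '')   # backwards compat: empty if missing from older exports
    return record
```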
---
### Phase 5: CLI Commands (archivebox/cli/)
**archivebox_extract.py** (lines 1-230):
- [ ] Rename flag: `--plugin` stays (already correct!)
- [ ] Update internal references: extractor → plugin
- [ ] Update filter: `results.filter(plugin=plugin)`
- [ ] Update display: `result.plugin`
**archivebox_add.py**:
- [ ] Rename config key: `'EXTRACTORS': plugins` → `'PLUGINS': plugins` (if not already)
**archivebox_update.py**:
- [ ] Standardize to `--plugins` flag (currently may be --extractors or --extract)
**tests/test_oneshot.py**:
- [ ] Update flag: `--extract=...` → `--plugins=...`
---
### Phase 6: API Endpoints (archivebox/api/)
**v1_core.py** (ArchiveResult API):
- [ ] Update schema field: `extractor: str` → `plugin: str`
- [ ] Update schema field: Add `hook_name: str = ''`
- [ ] Update FilterSchema: `q=[..., 'plugin', ...]`
- [ ] Update extractor filter: `plugin: Optional[str] = Field(None, q='plugin__icontains')`
**v1_cli.py** (CLI API):
- [ ] Rename AddCommandSchema field: `extract: str` → `plugins: str`
- [ ] Rename UpdateCommandSchema field: `extractors: str` → `plugins: str`
- [ ] Update endpoint mapping: `args.plugins` → `plugins` parameter
---
### Phase 7: Admin Interface (archivebox/core/)
**admin_archiveresults.py**:
- [ ] Update all references: extractor → plugin
- [ ] Update list_filter: `'plugin'` instead of `'extractor'`
- [ ] Update ordering: `order_by('plugin')`
- [ ] Update get_plugin_icon: (rename from get_extractor_icon if exists)
**admin_snapshots.py**:
- [ ] Update any commented TODOs referencing extractor
**forms.py**:
- [ ] Rename function: `get_archive_methods()` → `get_plugin_choices()`
- [ ] Update form field: `archive_methods` → `plugins`
---
### Phase 8: Views and Templates (archivebox/core/)
**views.py**:
- [ ] Update dict building: `archiveresult_objects[result.plugin] = result`
- [ ] Update all extractor references to plugin
**templatetags/core_tags.py**:
- [ ] **Rename template tags (BREAKING CHANGE)**:
- `extractor_icon()` → `plugin_icon()`
- `extractor_thumbnail()` → `plugin_thumbnail()`
- `extractor_embed()` → `plugin_embed()`
- [ ] Update internal: `result.extractor` → `result.plugin`
**Update HTML templates** (if any directly reference extractor):
- [ ] Search for `{{ result.extractor }}` and similar
- [ ] Update to `{{ result.plugin }}`
- [ ] Update template tag calls
- [ ] **CRITICAL**: Update JavaScript in `templates/admin/progress_monitor.html`:
- Lines 491, 505: Change `extractor.extractor` and `a.extractor` to use `plugin` field
---
### Phase 9: Worker System (archivebox/workers/worker.py)
**ArchiveResultWorker**:
- [ ] Rename parameter: `extractor` → `plugin` (lines 348, 350)
- [ ] Update filter: `qs.filter(plugin=self.plugin)`
- [ ] Update subprocess passing: Use plugin parameter
---
### Phase 10: State Machine (archivebox/core/statemachines.py)
**ArchiveResultMachine**:
- [ ] Update logging: Use `self.archiveresult.plugin` instead of extractor
- [ ] Update any state metadata that includes extractor field
---
### Phase 11: Tests and Fixtures
**Update test files**:
- [ ] tests/test_migrations_*.py: Update expected field names in schema definitions
- [ ] tests/test_hooks.py: Update assertions for plugin/hook_name fields
- [ ] archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
- [ ] tests/test_recursive_crawl.py: Update SQL query `WHERE extractor = '60_parse_html_urls'` (line 163)
- [ ] archivebox/cli/tests_piping.py: Update test function names and assertions
- [ ] Any fixtures that create ArchiveResults: Use plugin parameter
- [ ] Any mock objects that set `.extractor` attribute: Change to `.plugin`
---
### Phase 12: Terminology Standardization (NEW)
This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.
**via_extractor → plugin Rename (14 files)**:
- [ ] Rename metadata field `via_extractor` to just `plugin`
- [ ] Files affected:
- archivebox/hooks.py - Set plugin in run_hook() output
- archivebox/crawls/models.py - If via_extractor field exists
- archivebox/cli/archivebox_crawl.py - References to via_extractor
- All parser plugins that set via_extractor in output
- Test files with via_extractor assertions
- [ ] Update all JSONL output from parser plugins to use "plugin" key
**Logging Functions (archivebox/misc/logging_util.py)**:
- [ ] `log_archive_method_started()` → `log_extractor_started()` (line 326)
- [ ] `log_archive_method_finished()` → `log_extractor_finished()` (line 330)
**Form Functions (archivebox/core/forms.py)**:
- [ ] `get_archive_methods()` → `get_plugin_choices()` (line 15)
- [ ] Form field `archive_methods` → `plugins` (line 24, 29)
- [ ] Update form validation and view usage
**Comments and Docstrings (81 files with "extractor" references)**:
- [ ] Update comments to say "extractor plugin" instead of just "extractor"
- [ ] Update comments to say "parser plugin" instead of "parser extractor"
- [ ] All plugin files: Update docstrings to use "extractor plugin" terminology
**Package Manager Plugin Documentation**:
- [ ] Update comments in package manager hook files to say "package manager plugin":
- archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
- archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
- archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
- archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
- archivebox/plugins/env/on_Binary__install_using_env_provider.py
- archivebox/plugins/custom/on_Binary__install_using_custom_bash.py
**String Literals in Error Messages**:
- [ ] Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
- [ ] Search for error messages containing "parser" and update to "parser plugin" where appropriate
---
## Critical Files Summary
### Must Update (Core):
1. `archivebox/core/models.py` - ArchiveResult, ArchiveResultManager, Snapshot
2. `archivebox/core/migrations/0033_*.py` - New migration
3. `archivebox/hooks.py` - All hook execution and discovery functions
4. `archivebox/misc/jsonl.py` - Serialization/deserialization
### Must Update (CLI):
5. `archivebox/cli/archivebox_extract.py`
6. `archivebox/cli/archivebox_add.py`
7. `archivebox/cli/archivebox_update.py`
### Must Update (API):
8. `archivebox/api/v1_core.py`
9. `archivebox/api/v1_cli.py`
### Must Update (Admin/Views):
10. `archivebox/core/admin_archiveresults.py`
11. `archivebox/core/views.py`
12. `archivebox/core/templatetags/core_tags.py`
### Must Update (Workers/State):
13. `archivebox/workers/worker.py`
14. `archivebox/core/statemachines.py`
### Must Update (Tests):
15. `tests/test_oneshot.py`
16. `archivebox/tests/test_hooks.py`
17. `archivebox/tests/test_migrations_helpers.py` - Schema SQL definitions
18. `tests/test_recursive_crawl.py` - SQL queries with field names
19. `archivebox/cli/tests_piping.py` - Test function docstrings
### Must Update (Terminology - Phase 12):
20. `archivebox/misc/logging_util.py` - Rename logging functions
21. `archivebox/core/forms.py` - Rename form helper and field
22. `archivebox/templates/admin/progress_monitor.html` - JavaScript field refs
23. ⏳ All 81 plugin files - Update docstrings and comments
24. ⏳ 28 files with parser terminology - Update comments consistently
---
## Migration Strategy
### Data Migration for Existing Records:
```python
def forwards(apps, schema_editor):
    ArchiveResult = apps.get_model('core', 'ArchiveResult')
    # All existing records get empty hook_name
    ArchiveResult.objects.all().update(hook_name='')
```
### Backwards Compatibility:
**BREAKING CHANGES** (per user requirements - no backwards compat):
- CLI flags: Hard cutover to `--plugins` (no aliases)
- API fields: `extractor` removed, `plugin` required
- Template tags: All renamed to `plugin_*`
**PARTIAL COMPAT** (for migration):
- JSONL: Write 'plugin', but **accept both 'extractor' and 'plugin' on import**
---
## Testing Checklist
- [ ] Migration 0033 runs successfully on test database
- [ ] All migrations tests pass (test_migrations_*.py)
- [ ] All hook tests pass (test_hooks.py)
- [ ] CLI commands work with --plugins flag
- [ ] API endpoints return plugin/hook_name fields correctly
- [ ] Admin interface displays plugin correctly
- [ ] Admin progress monitor JavaScript works (no console errors)
- [ ] JSONL export includes both plugin and hook_name
- [ ] JSONL import accepts both 'extractor' and 'plugin' keys
- [ ] Hook execution populates hook_name field
- [ ] Worker filtering by plugin works
- [ ] Template tags render with new names (plugin_icon, etc.)
- [ ] All renamed functions work correctly
- [ ] SQL queries in tests use correct field names
- [ ] Terminology is consistent across codebase
---
## Critical Issues to Address
### 1. via_extractor Field (DECISION: RENAME)
- Currently used in 14 files for tracking which parser plugin discovered a URL
- **Decision**: Rename `via_extractor` → `plugin` (not via_plugin, just "plugin")
- **Impact**: Crawler and parser plugin code - 14 files to update
- Files affected:
- archivebox/hooks.py
- archivebox/crawls/models.py
- archivebox/cli/archivebox_crawl.py
- All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
- Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
- This creates consistent naming where "plugin" is used for both:
- ArchiveResult.plugin (which extractor plugin ran)
- URL discovery metadata "plugin" (which parser plugin discovered this URL)
### 2. Field Size Constraint
- Current: `extractor = CharField(max_length=32)`
- **Decision**: Keep max_length=32 when renaming to plugin
- No size increase needed
### 3. Migration Implementation
- Use `migrations.RenameField('ArchiveResult', 'extractor', 'plugin')` for clean migration
- Preserves data, indexes, and constraints automatically
- Add hook_name field in same migration
---
## Rollout Notes
**Breaking Changes**:
1. CLI: `--extract`, `--extractors` → `--plugins` (no aliases)
2. API: `extractor` field → `plugin` field (no backwards compat)
3. Template tags: `extractor_*` → `plugin_*` (users must update custom templates)
4. Python API: All function names with "extractor" → "plugin" (import changes needed)
5. Form fields: `archive_methods` → `plugins`
6. **via_extractor → plugin** (URL discovery metadata field)
**Migration Required**: Yes - all instances must run migrations before upgrading
**Estimated Impact**: ~150+ files will need updates across the entire codebase
- 81 files: extractor terminology
- 28 files: parser terminology
- 10 files: archive_method legacy terminology
- Plus templates, JavaScript, tests, etc.
---
## Next Steps
1. **Continue with Phase 3**: Update hooks.py with all function renames and hook_name tracking
2. **Then Phase 4**: Update JSONL import/export with backwards compatibility
3. **Then Phases 5-12**: Systematically update all remaining files
4. **Finally Phase 13**: Run full test suite and verify everything works
**Note**: Migration can be tested immediately - the migration file is ready to run!