move todos

This commit is contained in:
Nick Sweeting
2025-12-28 04:44:38 -08:00
parent bd265c0083
commit d2e65cfd38
8 changed files with 0 additions and 956 deletions

View File

@@ -0,0 +1,431 @@
# Chrome Plugin Consolidation - COMPLETED ✓
## Core Principle: One ArchiveResult Per Plugin
**Critical Realization:** Each plugin must produce at most ONE ArchiveResult output (infrastructure plugins such as `chrome` produce none). This is fundamental to ArchiveBox's architecture: a single plugin can never produce multiple ArchiveResults.
### CRITICAL ARCHITECTURE CLARIFICATION
**DO NOT CONFUSE THESE CONCEPTS:**
1. **Plugin** = Directory name (e.g., `chrome`, `consolelog`, `screenshot`)
- Lives in `archivebox/plugins/<plugin_name>/`
- Can contain MULTIPLE hook files
- Produces ONE output directory: `users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/`
- Creates at most ONE ArchiveResult record per snapshot (infrastructure plugins create none)
2. **Hook** = Individual script file (e.g., `on_Snapshot__20_chrome_tab.bg.js`)
- Lives inside a plugin directory
- One plugin can have MANY hooks
- All hooks in a plugin run sequentially when that plugin's ArchiveResult is processed
- All hooks write to the SAME output directory (the plugin directory)
3. **Extractor** = ArchiveResult.extractor field = PLUGIN NAME (not hook name)
- `ArchiveResult.extractor = 'chrome'` (plugin name)
- NOT `ArchiveResult.extractor = '20_chrome_tab.bg'` (hook name)
4. **Output Directory** = `users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/`
- One output directory per plugin (0.9.x structure)
- ALL hooks in that plugin write to this same directory
- Example: `users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/` contains outputs from ALL chrome hooks
- Legacy: `archive/{timestamp}/` with symlink for backwards compatibility
**Example 1: Chrome Plugin (Infrastructure - NO ArchiveResult)**
```
Plugin name: 'chrome'
ArchiveResult: NONE (infrastructure only)
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/
Hooks:
- on_Snapshot__20_chrome_tab.bg.js # Launches Chrome, opens tab
- on_Snapshot__30_chrome_navigate.js # Navigates to URL
- on_Snapshot__45_chrome_tab_cleanup.py # Kills Chrome on cleanup
Writes (temporary infrastructure files, deleted on cleanup):
- chrome/cdp_url.txt # Other plugins read this to connect
- chrome/target_id.txt # Tab ID for CDP connection
- chrome/page_loaded.txt # Navigation completion marker
- chrome/navigation.json # Navigation state
- chrome/hook.pid # For cleanup
NO ArchiveResult JSON is produced - this is pure infrastructure.
On SIGTERM: Chrome exits, chrome/ directory is deleted.
```
**Example 2: Screenshot Plugin (Output Plugin - CREATES ArchiveResult)**
```
Plugin name: 'screenshot'
ArchiveResult.extractor: 'screenshot'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/screenshot/
Hooks:
- on_Snapshot__34_screenshot.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Takes screenshot
4. Writes to: screenshot/screenshot.png
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'screenshot.png': {}}
```
**Example 3: PDF Plugin (Output Plugin - CREATES ArchiveResult)**
```
Plugin name: 'pdf'
ArchiveResult.extractor: 'pdf'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/pdf/
Hooks:
- on_Snapshot__35_pdf.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Generates PDF
4. Writes to: pdf/output.pdf
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'output.pdf': {}}
```
**Lifecycle:**
```
1. Chrome hooks run → create chrome/ dir with infrastructure files
2. Screenshot/PDF/etc hooks run → read chrome/cdp_url.txt, write to their own dirs
3. Snapshot.cleanup() called → sends SIGTERM to background hooks
4. Chrome receives SIGTERM → exits, deletes chrome/ dir
5. Screenshot/PDF/etc dirs remain with their outputs
```
**DO NOT:**
- Create one ArchiveResult per hook
- Use hook names as extractor values
- Create separate output directories per hook
**DO:**
- Create one ArchiveResult per plugin
- Use plugin directory name as extractor value
- Run all hooks in a plugin when processing its ArchiveResult
- Write all hook outputs to the same plugin directory
This principle drove the entire consolidation strategy:
- **Chrome plugin** = Infrastructure only (NO ArchiveResult)
- **Output plugins** = Each produces ONE distinct ArchiveResult (kept separate)
## Final Structure
### 1. Chrome Plugin (Infrastructure - No Output)
**Location:** `archivebox/plugins/chrome/`
This plugin provides shared Chrome infrastructure for other plugins. It manages the browser lifecycle but **produces NO ArchiveResult** - only infrastructure files in a single `chrome/` output directory.
**Consolidates these former plugins:**
- `chrome_session/` → Merged
- `chrome_navigate/` → Merged
- `chrome_cleanup/` → Merged
- `chrome_extensions/` → Utilities merged
**Hook Files:**
```
chrome/
├── on_Crawl__00_chrome_install_config.py # Configure Chrome settings
├── on_Crawl__00_chrome_install.py # Install Chrome binary
├── on_Crawl__20_chrome_launch.bg.js # Launch Chrome (Crawl-level, bg)
├── on_Snapshot__20_chrome_tab.bg.js # Open tab (Snapshot-level, bg)
├── on_Snapshot__30_chrome_navigate.js # Navigate to URL (foreground)
├── on_Snapshot__45_chrome_tab_cleanup.py # Close tab, kill bg hooks
├── chrome_extension_utils.js # Extension utilities
├── config.json # Configuration
└── tests/test_chrome.py # Tests
```
**Output Directory (Infrastructure Only):**
```
chrome/
├── cdp_url.txt # WebSocket URL for CDP connection
├── pid.txt # Chrome process PID
├── target_id.txt # Current tab target ID
├── page_loaded.txt # Navigation completion marker
├── final_url.txt # Final URL after redirects
├── navigation.json # Navigation state (NEW)
└── hook.pid # Background hook PIDs (for cleanup)
```
**New: navigation.json**
Tracks navigation state with wait condition and timing:
```json
{
"waitUntil": "networkidle2",
"elapsed": 1523,
"url": "https://example.com",
"finalUrl": "https://example.com/",
"status": 200,
"timestamp": "2025-12-27T22:15:30.123Z"
}
```
Fields:
- `waitUntil` - Wait condition: `networkidle0`, `networkidle2`, `domcontentloaded`, or `load`
- `elapsed` - Navigation time in milliseconds
- `url` - Original requested URL
- `finalUrl` - Final URL after redirects (success only)
- `status` - HTTP status code (success only)
- `error` - Error message (failure only)
- `timestamp` - ISO 8601 completion timestamp
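For illustration, a downstream extractor hook could consume this file roughly like the following Python sketch (the relative path and the failure check are assumptions for illustration, not part of the spec):

```python
import json
from pathlib import Path

# Hypothetical consumer: decide whether navigation succeeded before extracting.
nav = json.loads(Path('../chrome/navigation.json').read_text())

if 'error' in nav or nav.get('status', 200) >= 400:
    print(f"navigation failed after {nav.get('elapsed', '?')}ms: {nav.get('error') or nav.get('status')}")
else:
    print(f"loaded {nav['finalUrl']} (HTTP {nav['status']}) in {nav['elapsed']}ms via waitUntil={nav['waitUntil']}")
```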
### 2. Output Plugins (Each = One ArchiveResult)
These remain **SEPARATE** plugins because each produces a distinct output/ArchiveResult. Each plugin references `../chrome` for infrastructure.
#### consolelog Plugin
```
archivebox/plugins/consolelog/
└── on_Snapshot__21_consolelog.bg.js
```
- **Output:** `console.jsonl` (browser console messages)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
#### ssl Plugin
```
archivebox/plugins/ssl/
└── on_Snapshot__23_ssl.bg.js
```
- **Output:** `ssl.jsonl` (SSL/TLS certificate details)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
#### responses Plugin
```
archivebox/plugins/responses/
└── on_Snapshot__24_responses.bg.js
```
- **Output:** `responses/` directory with `index.jsonl` (network responses)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
#### redirects Plugin
```
archivebox/plugins/redirects/
└── on_Snapshot__31_redirects.bg.js
```
- **Output:** `redirects.jsonl` (redirect chain)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
- **Changed:** Converted to background hook, now uses CDP `Network.requestWillBeSent` to capture redirects from initial request
#### staticfile Plugin
```
archivebox/plugins/staticfile/
└── on_Snapshot__31_staticfile.bg.js
```
- **Output:** Downloaded static file (PDF, image, video, etc.)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
- **Changed:** Converted from Python to JavaScript, now uses CDP to detect Content-Type from initial response and download via CDP
## What Changed
### 1. Plugin Consolidation
- Merged `chrome_session`, `chrome_navigate`, `chrome_cleanup`, `chrome_extensions` → `chrome/`
- Chrome plugin now has **single output directory**: `chrome/`
- All Chrome infrastructure hooks reference `.` (same directory)
### 2. Background Hook Conversions
**redirects Plugin:**
- **Before:** Ran AFTER navigation, reconnected to Chrome to check for redirects
- **After:** Background hook that sets up CDP listeners BEFORE navigation to capture redirects from initial request
- **Method:** Uses CDP `Network.requestWillBeSent` event with `redirectResponse` parameter
**staticfile Plugin:**
- **Before:** Python script that ran AFTER navigation, checked response headers
- **After:** Background JavaScript hook that sets up CDP listeners BEFORE navigation
- **Method:** Uses CDP `page.on('response')` to capture Content-Type from initial request
- **Language:** Converted from Python to JavaScript/Node.js for consistency
### 3. Navigation State Tracking
- **Added:** `navigation.json` file in `chrome/` output directory
- **Contains:** `waitUntil` condition and `elapsed` milliseconds
- **Purpose:** Track navigation performance and wait conditions for analysis
### 4. Cleanup
- **Deleted:** `chrome_session/on_CrawlEnd__99_chrome_cleanup.py` (manual cleanup hook)
- **Reason:** Automatic cleanup via state machines is sufficient
- **Verified:** Cleanup mechanisms in `core/models.py` and `crawls/models.py` work correctly
## Hook Execution Order
```
═══ CRAWL LEVEL ═══
00. chrome_install_config.py Configure Chrome settings
00. chrome_install.py Install Chrome binary
20. chrome_launch.bg.js Launch Chrome browser (STAYS RUNNING)
═══ PER-SNAPSHOT LEVEL ═══
Phase 1: PRE-NAVIGATION (Background hooks setup)
20. chrome_tab.bg.js Open new tab (STAYS ALIVE)
21. consolelog.bg.js Setup console listener (STAYS ALIVE)
23. ssl.bg.js Setup SSL listener (STAYS ALIVE)
24. responses.bg.js Setup network response listener (STAYS ALIVE)
31. redirects.bg.js Setup redirect listener (STAYS ALIVE)
31. staticfile.bg.js Setup staticfile detector (STAYS ALIVE)
Phase 2: NAVIGATION (Foreground - synchronization point)
30. chrome_navigate.js Navigate to URL (BLOCKS until page loaded)
Writes navigation.json with waitUntil & elapsed
Writes page_loaded.txt marker
All background hooks can now finalize
Phase 3: POST-NAVIGATION (Background hooks finalize)
(All .bg hooks save their data and wait for cleanup signal)
Phase 4: OTHER EXTRACTORS (use loaded page)
34. screenshot.js
37. singlefile.js
... (other extractors that need loaded page)
Phase 5: CLEANUP
45. chrome_tab_cleanup.py Close tab
Kill background hooks (SIGTERM → SIGKILL)
Update ArchiveResults
```
## Background Hook Pattern
All `.bg.js` hooks follow this pattern:
1. **Setup:** Create CDP listeners BEFORE navigation
2. **Capture:** Collect data incrementally as events occur
3. **Write:** Save data to filesystem continuously
4. **Wait:** Keep process alive until SIGTERM
5. **Finalize:** On SIGTERM, emit final JSONL result to stdout
6. **Exit:** Clean exit with status code
**Key files written:**
- `hook.pid` - Process ID for cleanup mechanism
- Output files (e.g., `console.jsonl`, `ssl.jsonl`, etc.)
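For reference, a minimal sketch of the same lifecycle in Python (hooks in this commit are mostly `.bg.js`, and the file names below are illustrative, but the steps map 1:1 to the pattern above):

```python
#!/usr/bin/env python3
# Illustrative background-hook skeleton, not an actual plugin file.
import json, os, signal, sys, time
from pathlib import Path

OUTPUT_DIR = Path(os.environ.get('OUTPUT_DIR', '.'))
(OUTPUT_DIR / 'hook.pid').write_text(str(os.getpid()))      # register for cleanup

stop = False
def on_sigterm(signum, frame):
    global stop
    stop = True
signal.signal(signal.SIGTERM, on_sigterm)

events_file = OUTPUT_DIR / 'events.jsonl'                    # hypothetical output file
while not stop:
    # capture data incrementally and flush it to disk as events occur
    with events_file.open('a') as f:
        f.write(json.dumps({'ts': time.time()}) + '\n')
    time.sleep(1)                                             # stay alive until SIGTERM

# on SIGTERM: emit the final JSONL result record to stdout, then exit cleanly
print(json.dumps({'type': 'ArchiveResult', 'status': 'succeeded',
                  'output_files': {events_file.name: {}}}))
sys.exit(0)
```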
## Automatic Cleanup Mechanism
**Snapshot-level cleanup** (`core/models.py`):
```python
def cleanup(self):
    """Kill background hooks and close resources."""
    # Scan OUTPUT_DIR for hook.pid files
    # Send SIGTERM to processes
    # Wait for graceful exit
    # Send SIGKILL if process still alive
    # Update ArchiveResults to FAILED if needed
```
**Crawl-level cleanup** (`crawls/models.py`):
```python
def cleanup(self):
    """Kill Crawl-level background hooks (Chrome browser)."""
    # Similar pattern for Crawl-level resources
    # Kills Chrome launch process
```
**State machine integration:**
- Both `SnapshotMachine` and `CrawlMachine` call `cleanup()` when entering `sealed` state
- Ensures all background processes are cleaned up properly
- No manual cleanup hooks needed
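Condensed, the cleanup both levels rely on amounts to something like this (a sketch assuming the `hook.pid` layout above; the helper name and grace period are illustrative):

```python
import os, signal, time
from pathlib import Path

def kill_background_hooks(output_dir: Path, grace_seconds: float = 5.0) -> None:
    """Illustrative: SIGTERM every */hook.pid under output_dir, SIGKILL any stragglers."""
    pids = []
    for pid_file in output_dir.glob('*/hook.pid'):
        try:
            pid = int(pid_file.read_text().strip())
            os.kill(pid, signal.SIGTERM)           # request graceful shutdown
            pids.append(pid)
        except (OSError, ValueError):
            continue                               # already gone or malformed pid file

    deadline = time.monotonic() + grace_seconds
    for pid in pids:
        while time.monotonic() < deadline:
            try:
                os.kill(pid, 0)                    # signal 0: is it still alive?
            except OSError:
                break                              # exited cleanly within the grace period
            time.sleep(0.1)
        else:
            try:
                os.kill(pid, signal.SIGKILL)       # force-kill if SIGTERM was ignored
            except OSError:
                pass
```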
## Directory References
**Crawl output structure:**
- Crawls output to: `users/{user_id}/crawls/{YYYYMMDD}/{crawl_id}/`
- Example: `users/1/crawls/20251227/abc-def-123/`
- Crawl-level plugins create subdirectories: `users/1/crawls/20251227/abc-def-123/chrome/`
**Snapshot output structure:**
- Snapshots output to: `users/{username}/snapshots/{YYYYMMDD}/{domain}/{snapshot_id}/` (0.9.x structure; legacy `archive/{timestamp}/` remains as a symlink)
- Snapshot-level plugins create subdirectories: `users/{username}/snapshots/{YYYYMMDD}/{domain}/{snapshot_id}/chrome/`, `.../consolelog/`, etc.
**Within chrome plugin:**
- Hooks use `.` or `OUTPUT_DIR` to reference the `chrome/` directory they're running in
- Example: `fs.writeFileSync(path.join(OUTPUT_DIR, 'navigation.json'), ...)`
**From output plugins to chrome (same snapshot):**
- Hooks use `../chrome` to reference Chrome infrastructure in same snapshot
- Example: `const CHROME_SESSION_DIR = '../chrome';`
- Used to read: `cdp_url.txt`, `target_id.txt`, `page_loaded.txt`
**From snapshot hooks to crawl chrome:**
- Snapshot hooks receive `CRAWL_OUTPUT_DIR` environment variable (set by hooks.py)
- Use: `path.join(process.env.CRAWL_OUTPUT_DIR, 'chrome')` to find crawl-level Chrome
- This allows snapshots to reuse the crawl's shared Chrome browser
**Navigation synchronization:**
- All hooks wait for `../chrome/page_loaded.txt` before finalizing
- This file is written by `chrome_navigate.js` after navigation completes
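From an output plugin, that wait could look roughly like this (a sketch; the timeout and polling interval are arbitrary choices, not values from the codebase):

```python
import time
from pathlib import Path

def wait_for_navigation(chrome_dir: str = '../chrome', timeout: float = 60.0) -> bool:
    """Illustrative: block until chrome_navigate.js writes page_loaded.txt, or give up."""
    marker = Path(chrome_dir) / 'page_loaded.txt'
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(0.25)
    return False
```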
## Design Principles
1. **One ArchiveResult Per Plugin**
- Each plugin produces exactly ONE output/ArchiveResult
- Infrastructure plugins (like chrome) produce NO ArchiveResult
2. **Chrome as Infrastructure**
- Provides shared CDP connection, PIDs, navigation state
- No ArchiveResult output of its own
- Single output directory for all infrastructure files
3. **Background Hooks for CDP**
- Hooks that need CDP listeners BEFORE navigation are background (`.bg.js`)
- They capture events from the initial request/response
- Stay alive through navigation and cleanup
4. **Foreground for Synchronization**
- `chrome_navigate.js` is foreground (not `.bg`)
- Provides synchronization point - blocks until page loaded
- All other hooks wait for its completion marker
5. **Automatic Cleanup**
- State machines handle background hook cleanup
- No manual cleanup hooks needed
- SIGTERM for graceful exit, SIGKILL as backup
6. **Clear Separation**
- Infrastructure vs outputs
- One output directory per plugin
- Predictable, maintainable architecture
## Benefits
**Architectural Clarity** - Clear separation between infrastructure and outputs
**Correct Output Model** - One ArchiveResult per plugin
**Better Performance** - CDP listeners capture data from initial request
**No Duplication** - Single Chrome infrastructure used by all
**Proper Lifecycle** - Background hooks cleaned up automatically
**Maintainable** - Easy to understand, debug, and extend
**Consistent** - All background hooks follow same pattern
**Observable** - Navigation state tracked for debugging
## Testing
Run tests:
```bash
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/plugins/chrome/tests/ -v'
```
## Migration Notes
**For developers:**
- Chrome infrastructure is now in `chrome/` output dir (not `chrome_session/`)
- Reference `../chrome/cdp_url.txt` from output plugins
- Navigation marker is `../chrome/page_loaded.txt`
- Navigation details in `../chrome/navigation.json`
**For users:**
- No user-facing changes
- Output structure remains the same
- All extractors continue to work

old/TODO_fs_migrations.md (1240 lines, normal file)

File diff suppressed because it is too large

old/TODO_hook_architecture.md (1976 lines, executable file)

File diff suppressed because it is too large

View File

@@ -0,0 +1,665 @@
# Hook & State Machine Cleanup - Unified Pattern
## Goal
Implement a **consistent pattern** across all models (Crawl, Snapshot, ArchiveResult, Dependency) for:
1. Running hooks
2. Processing JSONL records
3. Managing background hooks
4. State transitions
## Current State Analysis (ALL COMPLETE ✅)
### ✅ Crawl (archivebox/crawls/)
**Status**: COMPLETE
- ✅ Has state machine: `CrawlMachine`
- `Crawl.run()` - runs hooks, processes JSONL via `process_hook_records()`, creates snapshots
- `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd hooks
- ✅ Uses `OUTPUT_DIR/plugin_name/` for PWD
- ✅ State machine calls model methods:
- `queued -> started`: calls `crawl.run()`
- `started -> sealed`: calls `crawl.cleanup()`
### ✅ Snapshot (archivebox/core/)
**Status**: COMPLETE
- ✅ Has state machine: `SnapshotMachine`
- `Snapshot.run()` - creates pending ArchiveResults
- `Snapshot.cleanup()` - kills background ArchiveResult hooks, calls `update_from_output()`
- `Snapshot.has_running_background_hooks()` - checks PID files using `process_is_alive()`
- `Snapshot.from_jsonl()` - simplified, filtering moved to caller
- ✅ State machine calls model methods:
- `queued -> started`: calls `snapshot.run()`
- `started -> sealed`: calls `snapshot.cleanup()`
- `is_finished()`: uses `has_running_background_hooks()`
### ✅ ArchiveResult (archivebox/core/)
**Status**: COMPLETE - Major refactor completed
- ✅ Has state machine: `ArchiveResultMachine`
- `ArchiveResult.run()` - runs hook, calls `update_from_output()` for foreground hooks
- `ArchiveResult.update_from_output()` - unified method for foreground and background hooks
- ✅ Uses PWD `snapshot.OUTPUT_DIR/plugin_name`
- ✅ JSONL processing via `process_hook_records()` with URL/depth filtering
- **DELETED** special background hook methods:
  - `check_background_completed()` - replaced by `process_is_alive()` helper
  - `finalize_background_hook()` - replaced by `update_from_output()`
  - `_populate_output_fields()` - merged into `update_from_output()`
- ✅ State machine transitions:
- `queued -> started`: calls `archiveresult.run()`
- `started -> succeeded/failed/skipped`: status set by `update_from_output()`
### ✅ Binary (archivebox/machine/) - NEW!
**Status**: COMPLETE - Replaced Dependency model entirely
- ✅ Has state machine: `BinaryMachine`
- `Binary.run()` - runs on_Binary__install_* hooks, processes JSONL
- `Binary.cleanup()` - kills background installation hooks (for consistency)
- `Binary.from_jsonl()` - handles both binaries.jsonl and hook output
- ✅ Uses PWD `data/machines/{machine_id}/binaries/{name}/{id}/plugin_name/`
- ✅ Configuration via static `plugins/*/binaries.jsonl` files
- ✅ State machine calls model methods:
- `queued -> started`: calls `binary.run()`
- `started -> succeeded/failed`: status set by hooks via JSONL
- ✅ Perfect symmetry with Crawl/Snapshot/ArchiveResult pattern
### ❌ Dependency Model - ELIMINATED
**Status**: Deleted entirely (replaced by Binary state machine)
- Static configuration now lives in `plugins/*/binaries.jsonl`
- Per-machine state tracked by Binary records
- No global singleton conflicts
- Hooks renamed from `on_Dependency__install_*` to `on_Binary__install_*`
## Unified Pattern (Target Architecture)
### Pattern for ALL models:
```python
# 1. State Machine orchestrates transitions
class ModelMachine(StateMachine):
    @started.enter
    def enter_started(self):
        self.model.run()  # Do the work
        # Update status

    def is_finished(self):
        # Check if background hooks still running
        if self.model.has_running_background_hooks():
            return False
        # Check if children finished
        if self.model.has_pending_children():
            return False
        return True

    @sealed.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
        # Update status


# 2. Model methods do the actual work
class Model:
    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('ModelName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)
            if result is None:  # Background hook
                continue
            # Process JSONL records
            records = result.get('records', [])
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(records, overrides=overrides)

        # Create children (e.g., ArchiveResults, Snapshots)
        self.create_children()

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        # Kill any background hooks
        if self.OUTPUT_DIR.exists():
            for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
                kill_process(pid_file)

        # Run cleanup hooks (e.g., on_ModelEnd)
        cleanup_hooks = discover_hooks('ModelEnd')
        for hook in cleanup_hooks:
            run_hook(hook, ...)

    def has_running_background_hooks(self) -> bool:
        """Check if any background hooks still running."""
        if not self.OUTPUT_DIR.exists():
            return False
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            if process_is_alive(pid_file):
                return True
        return False
```
### PWD Standard:
```
model.OUTPUT_DIR/plugin_name/
```
- Crawl: `users/{user}/crawls/{date}/{crawl_id}/plugin_name/`
- Snapshot: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/`
- ArchiveResult: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/` (same as Snapshot)
- Dependency: `dependencies/{dependency_id}/plugin_name/` (set output_dir field directly)
## Implementation Plan
### Phase 1: Add unified helpers to hooks.py ✅ DONE
**File**: `archivebox/hooks.py`
**Status**: COMPLETE - Added three helper functions:
- `process_hook_records(records, overrides)` - lines 1258-1323
- `process_is_alive(pid_file)` - lines 1326-1344
- `kill_process(pid_file, sig)` - lines 1347-1362
```python
def process_hook_records(records: List[Dict], overrides: Dict = None) -> Dict[str, int]:
    """
    Process JSONL records from hook output.
    Dispatches to Model.from_jsonl() for each record type.

    Args:
        records: List of JSONL record dicts from result['records']
        overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

    Returns:
        Dict with counts by record type
    """
    stats = {}
    for record in records:
        record_type = record.get('type')
        # Dispatch to appropriate model
        if record_type == 'Snapshot':
            from core.models import Snapshot
            Snapshot.from_jsonl(record, overrides)
            stats['Snapshot'] = stats.get('Snapshot', 0) + 1
        elif record_type == 'Tag':
            from core.models import Tag
            Tag.from_jsonl(record, overrides)
            stats['Tag'] = stats.get('Tag', 0) + 1
        elif record_type == 'Binary':
            from machine.models import Binary
            Binary.from_jsonl(record, overrides)
            stats['Binary'] = stats.get('Binary', 0) + 1
        # ... etc
    return stats


def process_is_alive(pid_file: Path) -> bool:
    """Check if process in PID file is still running."""
    if not pid_file.exists():
        return False
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, 0)  # Signal 0 = check if exists
        return True
    except (OSError, ValueError):
        return False


def kill_process(pid_file: Path, signal=SIGTERM):
    """Kill process in PID file."""
    if not pid_file.exists():
        return
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, signal)
    except (OSError, ValueError):
        pass
```
### Phase 2: Add Model.from_jsonl() static methods ✅ DONE
**Files**: `archivebox/core/models.py`, `archivebox/machine/models.py`, `archivebox/crawls/models.py`
**Status**: COMPLETE - Added from_jsonl() to:
- `Tag.from_jsonl()` - core/models.py lines 93-116
- `Snapshot.from_jsonl()` - core/models.py lines 1144-1189
- `Machine.from_jsonl()` - machine/models.py lines 66-89
- `Dependency.from_jsonl()` - machine/models.py lines 203-227
- `Binary.from_jsonl()` - machine/models.py lines 401-434
Example implementations added:
```python
class Snapshot:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Snapshot from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_snapshot

        overrides = overrides or {}
        # Apply overrides (crawl, parent_snapshot, depth limits)
        crawl = overrides.get('crawl')
        snapshot = overrides.get('snapshot')  # parent
        if crawl:
            depth = record.get('depth', (snapshot.depth + 1 if snapshot else 1))
            if depth > crawl.max_depth:
                return None
            record.setdefault('crawl_id', str(crawl.id))
            record.setdefault('depth', depth)
        if snapshot:
            record.setdefault('parent_snapshot_id', str(snapshot.id))

        created_by_id = overrides.get('created_by_id')
        new_snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
        new_snapshot.status = Snapshot.StatusChoices.QUEUED
        new_snapshot.retry_at = timezone.now()
        new_snapshot.save()
        return new_snapshot


class Tag:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Tag from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_tag

        tag = get_or_create_tag(record)
        # Auto-attach to snapshot if in overrides
        if overrides and 'snapshot' in overrides:
            overrides['snapshot'].tags.add(tag)
        return tag


class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        # Implementation similar to existing create_model_record()
        ...

# Etc for other models
```
### Phase 3: Update ArchiveResult to use unified pattern ✅ DONE
**File**: `archivebox/core/models.py`
**Status**: COMPLETE
**Changes made**:
1. **Replaced inline JSONL processing** (lines 1912-1950):
- Pre-filter Snapshot records for depth/URL constraints in ArchiveResult.run()
- Use `self._url_passes_filters(url)` with parent snapshot's config for proper hierarchy
- Replaced inline Tag/Snapshot/other record creation with `process_hook_records()`
- Removed ~60 lines of duplicate code
2. **Simplified Snapshot.from_jsonl()** (lines 1144-1189):
- Removed depth checking (now done in caller)
- Just applies crawl metadata and creates snapshot
- Added docstring note: "Filtering should be done by caller BEFORE calling this method"
3. **Preserved ArchiveResult self-update logic**:
- Status/output fields still updated from ArchiveResult JSONL record (lines 1856-1910)
- Special title extractor logic preserved (line 1952+)
- Search indexing trigger preserved (line 1957+)
4. **Key insight**: Filtering happens in ArchiveResult.run() where we have parent snapshot context, NOT in from_jsonl() where we'd lose config hierarchy
**Note**: Did NOT delete special background hook methods (`check_background_completed`, `finalize_background_hook`) - that's Phase 6
### Phase 4: Add Snapshot.cleanup() method ✅ DONE
**File**: `archivebox/core/models.py`
**Status**: COMPLETE
**Changes made**:
1. **Added Snapshot.cleanup()** (lines 1144-1175):
- Kills background ArchiveResult hooks by scanning for `*/hook.pid` files
- Finalizes background ArchiveResults using `finalize_background_hook()` (temporary until Phase 6)
- Called by state machine when entering sealed state
2. **Added Snapshot.has_running_background_hooks()** (lines 1177-1195):
- Checks if any background hooks still running using `process_is_alive()`
- Used by state machine in `is_finished()` check
### Phase 5: Update SnapshotMachine to use cleanup() ✅ DONE
**File**: `archivebox/core/statemachines.py`
**Status**: COMPLETE
**Changes made**:
1. **Simplified is_finished()** (lines 58-72):
- Removed inline background hook checking and finalization (lines 67-76 deleted)
- Now uses `self.snapshot.has_running_background_hooks()` (line 68)
- Removed ~12 lines of duplicate logic
2. **Added cleanup() to sealed.enter** (lines 102-111):
- Calls `self.snapshot.cleanup()` to kill background hooks (line 105)
- Follows unified pattern: cleanup happens on seal, not in is_finished()
### Phase 6: Add ArchiveResult.update_from_output() and simplify run() ✅ DONE
**File**: `archivebox/core/models.py`
**Status**: COMPLETE - The BIG refactor (removed ~200 lines of duplication)
**Changes made**:
1. **Added `ArchiveResult.update_from_output()`** (lines 1908-2061):
- Unified method for both foreground and background hooks
- Reads stdout.log and parses JSONL records
- Updates status/output_str/output_json from ArchiveResult JSONL record
- Walks filesystem to populate output_files/output_size/output_mimetypes
- Filters Snapshot records for depth/URL constraints (same as run())
- Processes side-effect records via `process_hook_records()`
- Updates snapshot title if title extractor
- Triggers search indexing if succeeded
- Cleans up PID files and empty logs
- ~160 lines of comprehensive logic
2. **Simplified `ArchiveResult.run()`** (lines 1841-1906):
- Removed ~120 lines of duplicate filesystem reading logic
- Now just sets start_ts/pwd and calls `update_from_output()`
- Background hooks: return immediately after saving status=STARTED
- Foreground hooks: call `update_from_output()` to do all the work
- Removed ~10 lines of duplicate code
3. **Updated `Snapshot.cleanup()`** (line 1172):
- Changed from `ar.finalize_background_hook()` to `ar.update_from_output()`
- Uses the unified method instead of the old special-case method
4. **Deleted `_populate_output_fields()`** (was ~45 lines):
- Logic merged into `update_from_output()`
- Eliminates duplication of filesystem walking code
5. **Deleted `check_background_completed()`** (was ~20 lines):
- Replaced by `process_is_alive(pid_file)` from hooks.py
- Generic helper used by Snapshot.has_running_background_hooks()
6. **Deleted `finalize_background_hook()`** (was ~85 lines):
- Completely replaced by `update_from_output()`
- Was duplicate of foreground hook finalization logic
**Total lines removed**: ~280 lines of duplicate code
**Total lines added**: ~160 lines of unified code
**Net reduction**: ~120 lines (-43%)
### Phase 7-8: Dependency State Machine ❌ NOT NEEDED
**Status**: Intentionally skipped - Dependency doesn't need a state machine
**Why no state machine for Dependency?**
1. **Wrong Granularity**: Dependency is a GLOBAL singleton (one record per binary name)
- Multiple machines would race to update the same `status`/`retry_at` fields
- No clear semantics: "started" on which machine? "failed" on Machine A but "succeeded" on Machine B?
2. **Wrong Timing**: Installation should be SYNCHRONOUS, not queued
- When a worker needs wget, it should install wget NOW, not queue it for later
- No benefit to async state machine transitions
3. **State Lives Elsewhere**: Binary records are the actual state
- Each machine has its own Binary records (one per machine per binary)
- Binary.machine FK provides proper per-machine state tracking
**Correct Architecture:**
```
Dependency (global, no state machine):
├─ Configuration: bin_name, bin_providers, overrides
├─ run() method: synchronous installation attempt
└─ NO status, NO retry_at, NO state_machine_name
Binary (per-machine, has machine FK):
├─ State: is this binary installed on this specific machine?
├─ Created via JSONL output from on_Dependency hooks
└─ unique_together = (machine, name, abspath, version, sha256)
```
**What was implemented:**
- **Refactored `Dependency.run()`** (lines 249-324):
- Uses `discover_hooks()` and `process_hook_records()` for consistency
- Added comprehensive docstring explaining why no state machine
- Synchronous execution: returns Binary or None immediately
- Uses unified JSONL processing pattern
- **Kept Dependency simple**: Just configuration fields, no state fields
- **Multi-machine support**: Each machine independently runs Dependency.run() and creates its own Binary
## Summary of Changes
### Progress: 6/6 Core Phases Complete ✅ + 2 Phases Skipped (Intentionally)
**ALL core functionality is now complete!** The unified pattern is consistently implemented across Crawl, Snapshot, and ArchiveResult. Dependency intentionally kept simple (no state machine needed).
### Files Modified:
1. **DONE** `archivebox/hooks.py` - Add unified helpers:
- `process_hook_records(records, overrides)` - dispatcher (lines 1258-1323)
- `process_is_alive(pid_file)` - check if PID still running (lines 1326-1344)
- `kill_process(pid_file)` - kill process (lines 1347-1362)
2. **DONE** `archivebox/crawls/models.py` - Already updated:
- `Crawl.run()` - runs hooks, processes JSONL, creates snapshots
- `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd
3. **DONE** `archivebox/core/models.py`:
- `Tag.from_jsonl()` - lines 93-116
- `Snapshot.from_jsonl()` - lines 1197-1234 (simplified, removed filtering)
- `Snapshot.cleanup()` - lines 1144-1172 (kill background hooks, calls ar.update_from_output())
- `Snapshot.has_running_background_hooks()` - lines 1174-1193 (check PIDs)
- `ArchiveResult.run()` - simplified, uses `update_from_output()` (lines 1841-1906)
- `ArchiveResult.update_from_output()` - unified filesystem reading (lines 1908-2061)
- **DELETED** `ArchiveResult.check_background_completed()` - replaced by `process_is_alive()`
- **DELETED** `ArchiveResult.finalize_background_hook()` - replaced by `update_from_output()`
- **DELETED** `ArchiveResult._populate_output_fields()` - merged into `update_from_output()`
4. **DONE** `archivebox/core/statemachines.py`:
- ✅ Simplified `SnapshotMachine.is_finished()` - uses `has_running_background_hooks()` (line 68)
- ✅ Added cleanup call to `SnapshotMachine.sealed.enter` (line 105)
5. **DONE** `archivebox/machine/models.py`:
- `Machine.from_jsonl()` - lines 66-89
- `Dependency.from_jsonl()` - lines 203-227
- `Binary.from_jsonl()` - lines 401-434
- ✅ Refactored `Dependency.run()` to use unified pattern (lines 249-324)
- ✅ Added comprehensive docstring explaining why Dependency doesn't need state machine
- ✅ Kept Dependency simple: no state fields, synchronous execution only
### Code Metrics:
- **Lines removed**: ~280 lines of duplicate code
- **Lines added**: ~160 lines of unified code
- **Net reduction**: ~120 lines total (-43%)
- **Files created**: 0 (no new files needed)
### Key Benefits:
1. **Consistency**: All stateful models (Crawl, Snapshot, ArchiveResult) follow the same unified state machine pattern
2. **Simplicity**: Eliminated special-case background hook handling (~280 lines of duplicate code)
3. **Correctness**: Background hooks are properly cleaned up on seal transition
4. **Maintainability**: Unified `process_hook_records()` dispatcher for all JSONL processing
5. **Testability**: Consistent pattern makes testing easier
6. **Clear Separation**: Stateful work items (Crawl/Snapshot/ArchiveResult) vs stateless config (Dependency)
7. **Multi-Machine Support**: Dependency remains simple synchronous config, Binary tracks per-machine state
## Final Unified Pattern
All models now follow this consistent architecture:
### State Machine Structure
```python
class ModelMachine(StateMachine):
    queued = State(initial=True)
    started = State()
    sealed/succeeded/failed = State(final=True)

    @started.enter
    def enter_started(self):
        self.model.run()      # Execute the work

    @sealed.enter  # or @succeeded.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
```
### Model Methods
```python
class Model:
    # State machine fields
    status = CharField(default='queued')
    retry_at = DateTimeField(default=timezone.now)
    output_dir = CharField(default='', blank=True)
    state_machine_name = 'app.statemachines.ModelMachine'

    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('EventName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)
            if result is None:  # Background hook
                continue
            # Process JSONL records
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(result['records'], overrides=overrides)

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)
        # Update children from filesystem
        child.update_from_output()

    def update_for_workers(self, **fields):
        """Update fields and bump modified_at."""
        for field, value in fields.items():
            setattr(self, field, value)
        self.save(update_fields=[*fields.keys(), 'modified_at'])

    @staticmethod
    def from_jsonl(record: dict, overrides: dict = None):
        """Create/update model from JSONL record."""
        # Implementation specific to model
        # Called by process_hook_records()
```
### Hook Processing Flow
```
1. Model.run() discovers hooks
2. Hooks execute and output JSONL to stdout
3. JSONL records dispatched via process_hook_records()
4. Each record type handled by Model.from_jsonl()
5. Background hooks tracked via hook.pid files
6. Model.cleanup() kills background hooks on seal
7. Children updated via update_from_output()
```
### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
- **Resources** (Binary): Machine FK, one per machine per binary
- **Configuration** (Dependency): No machine FK, global singleton, synchronous execution
- **Execution Tracking** (ArchiveResult.iface): FK to NetworkInterface for observability
## Testing Checklist
- [ ] Test Crawl → Snapshot creation with hooks
- [ ] Test Snapshot → ArchiveResult creation
- [ ] Test ArchiveResult foreground hooks (JSONL processing)
- [ ] Test ArchiveResult background hooks (PID tracking, cleanup)
- [ ] Test Dependency.run() synchronous installation
- [ ] Test background hook cleanup on seal transition
- [ ] Test multi-machine Crawl execution
- [ ] Test Binary creation per machine (one per machine per binary)
- [ ] Verify Dependency.run() can be called concurrently from multiple machines safely
## FINAL ARCHITECTURE (Phases 1-8 Complete)
### ✅ Phases 1-6: Core Models Unified
All core models (Crawl, Snapshot, ArchiveResult) now follow the unified pattern:
- State machines orchestrate transitions
- `.run()` methods execute hooks and process JSONL
- `.cleanup()` methods kill background hooks
- `.update_for_workers()` methods update state for worker coordination
- Consistent use of `process_hook_records()` for JSONL dispatching
### ✅ Phases 7-8: Binary State Machine (Dependency Model Eliminated)
**Key Decision**: Eliminated `Dependency` model entirely and made `Binary` the state machine.
#### New Architecture
- **Static Configuration**: `plugins/{plugin}/dependencies.jsonl` files define binary requirements
```jsonl
{"type": "Binary", "name": "yt-dlp", "bin_providers": "pip,brew,apt,env"}
{"type": "Binary", "name": "node", "bin_providers": "apt,brew,env", "overrides": {"apt": {"packages": ["nodejs"]}}}
{"type": "Binary", "name": "ffmpeg", "bin_providers": "apt,brew,env"}
```
- **Dynamic State**: `Binary` model tracks per-machine installation state
- Fields: `machine`, `name`, `bin_providers`, `overrides`, `abspath`, `version`, `sha256`, `binprovider`
- State machine: `queued → started → succeeded/failed`
- Output dir: `data/machines/{machine_id}/binaries/{binary_name}/{binary_id}/`
#### Binary State Machine Flow
```python
class BinaryMachine(StateMachine):
    # queued → started → succeeded/failed

    @started.enter
    def enter_started(self):
        self.binary.run()  # Runs on_Binary__install_* hooks


class Binary(models.Model):
    def run(self):
        """
        Runs ALL on_Binary__install_* hooks.
        Each hook checks bin_providers and decides if it can handle this binary.
        First hook to succeed wins.
        Outputs JSONL with abspath, version, sha256, binprovider.
        """
        hooks = discover_hooks('Binary')
        for hook in hooks:
            result = run_hook(hook, output_dir=self.OUTPUT_DIR/plugin_name,
                              binary_id=self.id, machine_id=self.machine_id,
                              name=self.name, bin_providers=self.bin_providers,
                              overrides=json.dumps(self.overrides))
            # Hook outputs: {"type": "Binary", "name": "wget", "abspath": "/usr/bin/wget", "version": "1.21", "binprovider": "apt"}
            # Binary.from_jsonl() updates self with installation results
```
#### Hook Naming Convention
- **Before**: `on_Dependency__install_using_pip_provider.py`
- **After**: `on_Binary__install_using_pip_provider.py`
Each hook checks `--bin-providers` CLI argument:
```python
if 'pip' not in bin_providers.split(','):
    sys.exit(0)  # Skip this binary
```
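To make the shape of such a hook concrete, here is a rough Python sketch of an install hook (the argument parsing, pip invocation, and emitted fields are assumptions based on this plan, not the actual plugin code):

```python
#!/usr/bin/env python3
# Illustrative skeleton of an on_Binary__install_using_pip_provider.py hook.
import argparse, json, shutil, subprocess, sys

parser = argparse.ArgumentParser()
parser.add_argument('--name', required=True)
parser.add_argument('--bin-providers', default='')
args = parser.parse_args()

if 'pip' not in args.bin_providers.split(','):
    sys.exit(0)  # pip is not an allowed provider for this binary; let another hook handle it

# Try to install with pip, then locate the resulting binary on PATH.
subprocess.run([sys.executable, '-m', 'pip', 'install', args.name], check=False)
abspath = shutil.which(args.name)
if not abspath:
    sys.exit(1)  # installation failed; another provider hook may still succeed

version = subprocess.run([abspath, '--version'], capture_output=True, text=True).stdout.strip()

# Emit a Binary JSONL record for Binary.from_jsonl() to pick up.
print(json.dumps({'type': 'Binary', 'name': args.name, 'abspath': abspath,
                  'version': version, 'binprovider': 'pip'}))
```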
#### Perfect Symmetry Achieved
All models now follow identical patterns:
```python
Crawl(queued) → CrawlMachine → Crawl.run() → sealed
Snapshot(queued) → SnapshotMachine → Snapshot.run() → sealed
ArchiveResult(queued) → ArchiveResultMachine → ArchiveResult.run() → succeeded/failed
Binary(queued) → BinaryMachine → Binary.run() → succeeded/failed
```
#### Benefits of Eliminating Dependency
1. **No global singleton conflicts**: Binary is per-machine, no race conditions
2. **Simpler data model**: One table instead of two (Dependency + InstalledBinary)
3. **Static configuration**: dependencies.jsonl in version control, not database
4. **Consistent state machine**: Binary follows same pattern as other models
5. **Cleaner hooks**: Hooks check bin_providers themselves instead of orchestrator parsing names
#### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
- **Resources** (Binary): Machine FK, one per machine per binary name
- **Configuration**: Static files in `plugins/*/dependencies.jsonl`
- **Execution Tracking**: ArchiveResult.iface FK to NetworkInterface for observability
### Testing Checklist (Updated)
- [x] Core models use unified hook pattern (Phases 1-6)
- [ ] Binary installation via state machine
- [ ] Multiple machines can install same binary independently
- [ ] Hook bin_providers filtering works correctly
- [ ] Binary.from_jsonl() handles both dependencies.jsonl and hook output
- [ ] Binary OUTPUT_DIR structure: data/machines/{machine_id}/binaries/{name}/{id}/

View File

@@ -0,0 +1,517 @@
# TODO: Rename Extractor to Plugin - Implementation Progress
**Status**: 🟡 In Progress (2/13 phases complete)
**Started**: 2025-12-28
**Estimated Files to Update**: ~150+ files
---
## Progress Overview
### ✅ Completed Phases (2/13)
- [x] **Phase 1**: Database Migration - Created migration 0033
- [x] **Phase 2**: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models
### 🟡 In Progress (1/13)
- [ ] **Phase 3**: Hook Execution System (hooks.py - all function renames)
### ⏳ Pending Phases (10/13)
- [ ] **Phase 4**: JSONL Import/Export (misc/jsonl.py)
- [ ] **Phase 5**: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
- [ ] **Phase 6**: API Endpoints (v1_core.py, v1_cli.py)
- [ ] **Phase 7**: Admin Interface (admin_archiveresults.py, forms.py)
- [ ] **Phase 8**: Views and Templates (views.py, templatetags, progress_monitor.html)
- [ ] **Phase 9**: Worker System (workers/worker.py)
- [ ] **Phase 10**: State Machine (statemachines.py)
- [ ] **Phase 11**: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
- [ ] **Phase 12**: Terminology Standardization (via_extractor→plugin, comments, docstrings)
- [ ] **Phase 13**: Run migrations and verify all tests pass
---
## What's Been Completed So Far
### Phase 1: Database Migration ✅
**File Created**: `archivebox/core/migrations/0033_rename_extractor_add_hook_name.py`
Changes:
- Used `migrations.RenameField()` to rename `extractor` → `plugin`
- Added `hook_name` field (CharField, max_length=255, indexed, default='')
- Preserves all existing data, indexes, and constraints
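A minimal sketch of the shape of that migration (field details are taken from this plan; this is not the actual file contents, and the 0032 dependency name is a placeholder):

```python
# Hypothetical sketch of core/migrations/0033_rename_extractor_add_hook_name.py
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [('core', '0032_previous')]  # placeholder for the real 0032 migration name

    operations = [
        # RenameField preserves existing data, indexes, and constraints
        migrations.RenameField('archiveresult', 'extractor', 'plugin'),
        # hook_name stores the specific hook filename that executed, e.g. on_Snapshot__50_wget.py
        migrations.AddField(
            'archiveresult', 'hook_name',
            models.CharField(max_length=255, blank=True, default='', db_index=True),
        ),
    ]
```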
### Phase 2: Core Models ✅
**File Updated**: `archivebox/core/models.py`
#### ArchiveResultManager
- Updated `indexable()` method to use `plugin__in` and `plugin=method`
- Changed reference from `ARCHIVE_METHODS_INDEXING_PRECEDENCE` to `EXTRACTOR_INDEXING_PRECEDENCE`
#### ArchiveResult Model
**Field Changes**:
- Renamed field: `extractor` → `plugin`
- Added field: `hook_name` (stores full filename like `on_Snapshot__50_wget.py`)
- Updated comments to reference "plugin" instead of "extractor"
**Method Updates**:
- `get_extractor_choices()` → `get_plugin_choices()`
- `__str__()`: Now uses `self.plugin`
- `save()`: Logs `plugin` instead of `extractor`
- `get_absolute_url()`: Uses `self.plugin`
- `extractor_module` property → `plugin_module` property
- `output_exists()`: Checks `self.plugin` directory
- `embed_path()`: Uses `self.plugin` for paths
- `create_output_dir()`: Creates `self.plugin` directory
- `output_dir_name`: Returns `self.plugin`
- `run()`: All references to extractor → plugin (including extractor_dir → plugin_dir)
- `update_from_output()`: All references updated to plugin/plugin_dir
- `_update_snapshot_title()`: Parameter renamed to `plugin_dir`
- `trigger_search_indexing()`: Passes `plugin=self.plugin`
- `output_dir` property: Returns plugin directory
- `is_background_hook()`: Uses `plugin_dir`
#### Snapshot Model
**Method Updates**:
- `create_pending_archiveresults()`: Uses `get_enabled_plugins()`, filters by `plugin=plugin`
- `result_icons` (calc_icons): Maps by `r.plugin`, calls `get_plugin_name()` and `get_plugin_icon()`
- `_merge_archive_results_from_index()`: Maps by `(ar.plugin, ar.start_ts)`, supports both 'extractor' and 'plugin' keys for backwards compat
- `_create_archive_result_if_missing()`: Supports both 'extractor' and 'plugin' keys, creates with `plugin=plugin`
- `write_index_json()`: Writes `'plugin': ar.plugin` in archive_results
- `canonical_outputs()`: Updates `find_best_output_in_dir()` to use `plugin_name`, accesses `result.plugin`, creates keys like `{result.plugin}_path`
- `latest_outputs()`: Uses `get_plugins()`, filters by `plugin=plugin`
- `retry_failed_archiveresults()`: Updated docstring to reference "plugins" instead of "extractors"
**Total Lines Changed in models.py**: ~50+ locations
---
## Full Implementation Plan
# ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field
## Overview
Refactor the ArchiveResult model and standardize terminology across the codebase:
1. Rename the `extractor` field to `plugin` in ArchiveResult model
2. Add a new `hook_name` field to store the specific hook filename that executed
3. Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
4. Standardize CLI flags from `--extract/--extractors` to `--plugins`
5. **Standardize terminology throughout codebase**:
- "parsers" → "parser plugins"
- "extractors" → "extractor plugins"
- "parser extractors" → "parser plugins"
- "archive methods" → "extractor plugins"
- Document apt/brew/npm/pip as "package manager plugins" in comments
## Current State Analysis
### ArchiveResult Model (archivebox/core/models.py:1679-1750)
```python
class ArchiveResult(ModelWithOutputDir, ...):
    extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
    # New fields from migration 0029:
    output_str, output_json, output_files, output_size, output_mimetypes
    binary = ForeignKey('machine.Binary', ...)
    # No hook_name field yet
```
### Hook Execution Flow
1. `ArchiveResult.run()` discovers hooks for the plugin (e.g., `wget/on_Snapshot__50_wget.py`)
2. `run_hook()` executes each hook script, captures output as HookResult
3. `update_from_output()` parses JSONL and updates ArchiveResult fields
4. Currently NO tracking of which specific hook file executed
### Field Usage Across Codebase
**extractor field** is used in ~100 locations:
- **Model**: ArchiveResult.extractor field definition, __str__, manager queries
- **CLI**: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
- **API**: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
- **Admin**: admin_archiveresults.py (list filter, display)
- **Views**: core/views.py (archiveresult_objects dict by extractor)
- **Template Tags**: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
- **Hooks**: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
- **JSONL**: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
- **Worker**: workers/worker.py (ArchiveResultWorker filters by extractor)
- **Statemachine**: statemachines.py (logs extractor in state transitions)
---
## Implementation Plan
### Phase 1: Database Migration (archivebox/core/migrations/) ✅ COMPLETE
**Create migration 0033_rename_extractor_add_hook_name.py**:
1. Rename field: `extractor` → `plugin` (preserve index, constraints)
2. Add field: `hook_name` = CharField(max_length=255, blank=True, default='', db_index=True)
- **Stores full hook filename**: `on_Snapshot__50_wget.py`, `on_Crawl__10_chrome_session.js`, etc.
- Empty string for existing records (data migration sets all to '')
3. Update any indexes or constraints that reference extractor
**Decision**: Full filename chosen for explicitness and easy grep-ability
**Critical Files to Update**:
- ✅ ArchiveResult model field definitions
- ✅ Migration dependencies (latest: 0032)
---
### Phase 2: Core Model Updates (archivebox/core/models.py) ✅ COMPLETE
**ArchiveResult Model** (lines 1679-1820):
- ✅ Rename field: `extractor` → `plugin`
- ✅ Add field: `hook_name = models.CharField(...)`
- ✅ Update __str__: `f'...-> {self.plugin}'`
- ✅ Update absolute_url: Use plugin instead of extractor
- ✅ Update embed_path: Use plugin directory name
**ArchiveResultManager** (lines 1669-1677):
- ✅ Update indexable(): `filter(plugin__in=INDEXABLE_METHODS, ...)`
- ✅ Update precedence: `When(plugin=method, ...)`
**Snapshot Model** (lines 1000-1600):
- ✅ Update canonical_outputs: Access by plugin name
- ✅ Update create_pending_archiveresults: Use plugin parameter
- ✅ All queryset filters: `archiveresult_set.filter(plugin=...)`
---
### Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS
**Function Renames**:
- [ ] `get_extractors()` → `get_plugins()` (lines 479-504)
- [ ] `get_parser_extractors()` → `get_parser_plugins()` (lines 507-514)
- [ ] `get_extractor_name()` → `get_plugin_name()` (lines 517-530)
- [ ] `is_parser_extractor()` → `is_parser_plugin()` (lines 533-536)
- [ ] `get_enabled_extractors()` → `get_enabled_plugins()` (lines 553-566)
- [ ] `get_extractor_template()` → `get_plugin_template()` (line 1048)
- [ ] `get_extractor_icon()` → `get_plugin_icon()` (line 1068)
- [ ] `get_all_extractor_icons()` → `get_all_plugin_icons()` (line 1092)
**Update HookResult TypedDict** (lines 63-73):
- [ ] Add field: `hook_name: str` to store hook filename
- [ ] Add field: `plugin: str` (if not already present)
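Sketched out, the extended TypedDict might look like this (fields other than the two additions are guesses, shown only to convey the shape):

```python
from typing import Any, Dict, List, TypedDict

class HookResult(TypedDict, total=False):
    # Existing fields (approximate; see hooks.py lines 63-73)
    returncode: int
    records: List[Dict[str, Any]]
    # New fields proposed in this phase
    plugin: str      # plugin directory name, e.g. 'wget'
    hook_name: str   # full hook filename, e.g. 'on_Snapshot__50_wget.py'
```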
**Update run_hook()** (lines 141-389):
- [ ] **Add hook_name parameter**: Pass hook filename to be stored in result
- [ ] Update HookResult to include hook_name field
- [ ] Update JSONL record output: Add `hook_name` key
**Update ArchiveResult.run()** (lines 1838-1914):
- [ ] When calling run_hook, pass the hook filename
- [ ] Store hook_name in ArchiveResult before/after execution
**Update ArchiveResult.update_from_output()** (lines 1916-2073):
- [ ] Parse hook_name from JSONL output
- [ ] Store in self.hook_name field
- [ ] If not present in JSONL, infer from directory/filename
**Constants to Rename**:
- [ ] `ARCHIVE_METHODS_INDEXING_PRECEDENCE` → `EXTRACTOR_INDEXING_PRECEDENCE`
**Comments/Docstrings**: Update all function docstrings to use "plugin" terminology
---
### Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)
**Update archiveresult_to_jsonl()** (lines 173-200):
- [ ] Change key: `'extractor': result.extractor` → `'plugin': result.plugin`
- [ ] Add key: `'hook_name': result.hook_name`
**Update JSONL parsing**:
- [ ] **Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing**
- [ ] Always write 'plugin' key in new exports (never 'extractor')
- [ ] Parse and store hook_name if present (backwards compat: empty if missing)
**Decision**: Support both keys on import for smooth migration, always export new format
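In code, the import-side compatibility could be as simple as normalizing the key before use (a sketch; the helper name is made up):

```python
def normalize_archiveresult_record(record: dict) -> dict:
    """Illustrative: accept the legacy 'extractor' key on import, use 'plugin' internally."""
    if 'plugin' not in record and 'extractor' in record:
        record['plugin'] = record.pop('extractor')
    record.setdefault('hook_name', '')   # backwards compat: empty if missing from older exports
    return record
```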
---
### Phase 5: CLI Commands (archivebox/cli/)
**archivebox_extract.py** (lines 1-230):
- [ ] Rename flag: `--plugin` stays (already correct!)
- [ ] Update internal references: extractor → plugin
- [ ] Update filter: `results.filter(plugin=plugin)`
- [ ] Update display: `result.plugin`
**archivebox_add.py**:
- [ ] Rename config key: `'EXTRACTORS': plugins` → `'PLUGINS': plugins` (if not already)
**archivebox_update.py**:
- [ ] Standardize to `--plugins` flag (currently may be --extractors or --extract)
**tests/test_oneshot.py**:
- [ ] Update flag: `--extract=...` → `--plugins=...`
---
### Phase 6: API Endpoints (archivebox/api/)
**v1_core.py** (ArchiveResult API):
- [ ] Update schema field: `extractor: str` → `plugin: str`
- [ ] Update schema field: Add `hook_name: str = ''`
- [ ] Update FilterSchema: `q=[..., 'plugin', ...]`
- [ ] Update extractor filter: `plugin: Optional[str] = Field(None, q='plugin__icontains')`
**v1_cli.py** (CLI API):
- [ ] Rename AddCommandSchema field: `extract: str` → `plugins: str`
- [ ] Rename UpdateCommandSchema field: `extractors: str` → `plugins: str`
- [ ] Update endpoint mapping: `args.plugins` → `plugins` parameter
---
### Phase 7: Admin Interface (archivebox/core/)
**admin_archiveresults.py**:
- [ ] Update all references: extractor → plugin
- [ ] Update list_filter: `'plugin'` instead of `'extractor'`
- [ ] Update ordering: `order_by('plugin')`
- [ ] Update get_plugin_icon: (rename from get_extractor_icon if exists)
**admin_snapshots.py**:
- [ ] Update any commented TODOs referencing extractor
**forms.py**:
- [ ] Rename function: `get_archive_methods()` → `get_plugin_choices()`
- [ ] Update form field: `archive_methods` → `plugins`
---
### Phase 8: Views and Templates (archivebox/core/)
**views.py**:
- [ ] Update dict building: `archiveresult_objects[result.plugin] = result`
- [ ] Update all extractor references to plugin
**templatetags/core_tags.py**:
- [ ] **Rename template tags (BREAKING CHANGE)**:
- `extractor_icon()` → `plugin_icon()`
- `extractor_thumbnail()` → `plugin_thumbnail()`
- `extractor_embed()` → `plugin_embed()`
- [ ] Update internal: `result.extractor` → `result.plugin`
**Update HTML templates** (if any directly reference extractor):
- [ ] Search for `{{ result.extractor }}` and similar
- [ ] Update to `{{ result.plugin }}`
- [ ] Update template tag calls
- [ ] **CRITICAL**: Update JavaScript in `templates/admin/progress_monitor.html`:
- Lines 491, 505: Change `extractor.extractor` and `a.extractor` to use `plugin` field
---
### Phase 9: Worker System (archivebox/workers/worker.py)
**ArchiveResultWorker**:
- [ ] Rename parameter: `extractor` → `plugin` (lines 348, 350)
- [ ] Update filter: `qs.filter(plugin=self.plugin)`
- [ ] Update subprocess passing: Use plugin parameter
---
### Phase 10: State Machine (archivebox/core/statemachines.py)
**ArchiveResultMachine**:
- [ ] Update logging: Use `self.archiveresult.plugin` instead of extractor
- [ ] Update any state metadata that includes extractor field
---
### Phase 11: Tests and Fixtures
**Update test files**:
- [ ] tests/test_migrations_*.py: Update expected field names in schema definitions
- [ ] tests/test_hooks.py: Update assertions for plugin/hook_name fields
- [ ] archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
- [ ] tests/test_recursive_crawl.py: Update SQL query `WHERE extractor = '60_parse_html_urls'` (line 163)
- [ ] archivebox/cli/tests_piping.py: Update test function names and assertions
- [ ] Any fixtures that create ArchiveResults: Use plugin parameter
- [ ] Any mock objects that set `.extractor` attribute: Change to `.plugin`
---
### Phase 12: Terminology Standardization (NEW)
This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.
**via_extractor → plugin Rename (14 files)**:
- [ ] Rename metadata field `via_extractor` to just `plugin`
- [ ] Files affected:
- archivebox/hooks.py - Set plugin in run_hook() output
- archivebox/crawls/models.py - If via_extractor field exists
- archivebox/cli/archivebox_crawl.py - References to via_extractor
- All parser plugins that set via_extractor in output
- Test files with via_extractor assertions
- [ ] Update all JSONL output from parser plugins to use "plugin" key
**Logging Functions (archivebox/misc/logging_util.py)**:
- [ ] `log_archive_method_started()` → `log_extractor_started()` (line 326)
- [ ] `log_archive_method_finished()` → `log_extractor_finished()` (line 330)
**Form Functions (archivebox/core/forms.py)**:
- [ ] `get_archive_methods()` → `get_plugin_choices()` (line 15)
- [ ] Form field `archive_methods` → `plugins` (line 24, 29)
- [ ] Update form validation and view usage
**Comments and Docstrings (81 files with "extractor" references)**:
- [ ] Update comments to say "extractor plugin" instead of just "extractor"
- [ ] Update comments to say "parser plugin" instead of "parser extractor"
- [ ] All plugin files: Update docstrings to use "extractor plugin" terminology
**Package Manager Plugin Documentation**:
- [ ] Update comments in package manager hook files to say "package manager plugin":
- archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
- archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
- archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
- archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
- archivebox/plugins/env/on_Binary__install_using_env_provider.py
- archivebox/plugins/custom/on_Binary__install_using_custom_bash.py
**String Literals in Error Messages**:
- [ ] Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
- [ ] Search for error messages containing "parser" and update to "parser plugin" where appropriate
---
## Critical Files Summary
### Must Update (Core):
1. `archivebox/core/models.py` - ArchiveResult, ArchiveResultManager, Snapshot
2. `archivebox/core/migrations/0033_*.py` - New migration
3. `archivebox/hooks.py` - All hook execution and discovery functions
4. `archivebox/misc/jsonl.py` - Serialization/deserialization
### Must Update (CLI):
5. `archivebox/cli/archivebox_extract.py`
6. `archivebox/cli/archivebox_add.py`
7. `archivebox/cli/archivebox_update.py`
### Must Update (API):
8. `archivebox/api/v1_core.py`
9. `archivebox/api/v1_cli.py`
### Must Update (Admin/Views):
10. `archivebox/core/admin_archiveresults.py`
11. `archivebox/core/views.py`
12. `archivebox/core/templatetags/core_tags.py`
### Must Update (Workers/State):
13. `archivebox/workers/worker.py`
14. `archivebox/core/statemachines.py`
### Must Update (Tests):
15. `tests/test_oneshot.py`
16. `archivebox/tests/test_hooks.py`
17. `archivebox/tests/test_migrations_helpers.py` - Schema SQL definitions
18. `tests/test_recursive_crawl.py` - SQL queries with field names
19. `archivebox/cli/tests_piping.py` - Test function docstrings
### Must Update (Terminology - Phase 12):
20. `archivebox/misc/logging_util.py` - Rename logging functions
21. `archivebox/core/forms.py` - Rename form helper and field
22. `archivebox/templates/admin/progress_monitor.html` - JavaScript field refs
23. ⏳ All 81 plugin files - Update docstrings and comments
24. ⏳ 28 files with parser terminology - Update comments consistently
---
## Migration Strategy
### Data Migration for Existing Records:
```python
def forwards(apps, schema_editor):
    ArchiveResult = apps.get_model('core', 'ArchiveResult')
    # All existing records get empty hook_name
    ArchiveResult.objects.all().update(hook_name='')
```
### Backwards Compatibility:
**BREAKING CHANGES** (per user requirements - no backwards compat):
- CLI flags: Hard cutover to `--plugins` (no aliases)
- API fields: `extractor` removed, `plugin` required
- Template tags: All renamed to `plugin_*`
**PARTIAL COMPAT** (for migration):
- JSONL: Write 'plugin', but **accept both 'extractor' and 'plugin' on import**
---
## Testing Checklist
- [ ] Migration 0033 runs successfully on test database
- [ ] All migrations tests pass (test_migrations_*.py)
- [ ] All hook tests pass (test_hooks.py)
- [ ] CLI commands work with --plugins flag
- [ ] API endpoints return plugin/hook_name fields correctly
- [ ] Admin interface displays plugin correctly
- [ ] Admin progress monitor JavaScript works (no console errors)
- [ ] JSONL export includes both plugin and hook_name
- [ ] JSONL import accepts both 'extractor' and 'plugin' keys
- [ ] Hook execution populates hook_name field
- [ ] Worker filtering by plugin works
- [ ] Template tags render with new names (plugin_icon, etc.)
- [ ] All renamed functions work correctly
- [ ] SQL queries in tests use correct field names
- [ ] Terminology is consistent across codebase
---
## Critical Issues to Address
### 1. via_extractor Field (DECISION: RENAME)
- Currently used in 14 files for tracking which parser plugin discovered a URL
- **Decision**: Rename `via_extractor` → `plugin` (not via_plugin, just "plugin")
- **Impact**: Crawler and parser plugin code - 14 files to update
- Files affected:
- archivebox/hooks.py
- archivebox/crawls/models.py
- archivebox/cli/archivebox_crawl.py
- All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
- Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
- This creates consistent naming where "plugin" is used for both:
- ArchiveResult.plugin (which extractor plugin ran)
- URL discovery metadata "plugin" (which parser plugin discovered this URL)
### 2. Field Size Constraint
- Current: `extractor = CharField(max_length=32)`
- **Decision**: Keep max_length=32 when renaming to plugin
- No size increase needed
### 3. Migration Implementation
- Use `migrations.RenameField('ArchiveResult', 'extractor', 'plugin')` for clean migration
- Preserves data, indexes, and constraints automatically
- Add hook_name field in same migration
---
## Rollout Notes
**Breaking Changes**:
1. CLI: `--extract`, `--extractors` → `--plugins` (no aliases)
2. API: `extractor` field → `plugin` field (no backwards compat)
3. Template tags: `extractor_*` → `plugin_*` (users must update custom templates)
4. Python API: All function names with "extractor" → "plugin" (import changes needed)
5. Form fields: `archive_methods` → `plugins`
6. **via_extractor → plugin** (URL discovery metadata field)
**Migration Required**: Yes - all instances must run migrations before upgrading
**Estimated Impact**: ~150+ files will need updates across the entire codebase
- 81 files: extractor terminology
- 28 files: parser terminology
- 10 files: archive_method legacy terminology
- Plus templates, JavaScript, tests, etc.
---
## Next Steps
1. **Continue with Phase 3**: Update hooks.py with all function renames and hook_name tracking
2. **Then Phase 4**: Update JSONL import/export with backwards compatibility
3. **Then Phases 5-12**: Systematically update all remaining files
4. **Finally Phase 13**: Run full test suite and verify everything works
**Note**: Migration can be tested immediately - the migration file is ready to run!