mirror of https://github.com/ArchiveBox/ArchiveBox.git, synced 2026-01-03 09:25:42 +10:00

Commit: move todos

431 old/TODO_chrome_plugin_cleanup.md (new file)
# Chrome Plugin Consolidation - COMPLETED ✓

## Core Principle: One ArchiveResult Per Plugin

**Critical Realization:** Each plugin must produce exactly ONE ArchiveResult output. This is fundamental to ArchiveBox's architecture - you cannot have multiple outputs from a single plugin.

### CRITICAL ARCHITECTURE CLARIFICATION

**DO NOT CONFUSE THESE CONCEPTS:**

1. **Plugin** = Directory name (e.g., `chrome`, `consolelog`, `screenshot`)
   - Lives in `archivebox/plugins/<plugin_name>/`
   - Can contain MULTIPLE hook files
   - Produces ONE output directory: `users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/`
   - Creates ONE ArchiveResult record per snapshot

2. **Hook** = Individual script file (e.g., `on_Snapshot__20_chrome_tab.bg.js`)
   - Lives inside a plugin directory
   - One plugin can have MANY hooks
   - All hooks in a plugin run sequentially when that plugin's ArchiveResult is processed
   - All hooks write to the SAME output directory (the plugin directory)

3. **Extractor** = ArchiveResult.extractor field = PLUGIN NAME (not hook name)
   - `ArchiveResult.extractor = 'chrome'` (plugin name)
   - NOT `ArchiveResult.extractor = '20_chrome_tab.bg'` (hook name)

4. **Output Directory** = `users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/`
   - One output directory per plugin (0.9.x structure)
   - ALL hooks in that plugin write to this same directory
   - Example: `users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/` contains outputs from ALL chrome hooks
   - Legacy: `archive/{timestamp}/` with symlink for backwards compatibility
**Example 1: Chrome Plugin (Infrastructure - NO ArchiveResult)**
```
Plugin name: 'chrome'
ArchiveResult: NONE (infrastructure only)
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/

Hooks:
- on_Snapshot__20_chrome_tab.bg.js        # Launches Chrome, opens tab
- on_Snapshot__30_chrome_navigate.js      # Navigates to URL
- on_Snapshot__45_chrome_tab_cleanup.py   # Kills Chrome on cleanup

Writes (temporary infrastructure files, deleted on cleanup):
- chrome/cdp_url.txt       # Other plugins read this to connect
- chrome/target_id.txt     # Tab ID for CDP connection
- chrome/page_loaded.txt   # Navigation completion marker
- chrome/navigation.json   # Navigation state
- chrome/hook.pid          # For cleanup

NO ArchiveResult JSON is produced - this is pure infrastructure.
On SIGTERM: Chrome exits, chrome/ directory is deleted.
```

**Example 2: Screenshot Plugin (Output Plugin - CREATES ArchiveResult)**
```
Plugin name: 'screenshot'
ArchiveResult.extractor: 'screenshot'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/screenshot/

Hooks:
- on_Snapshot__34_screenshot.js

Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Takes screenshot
4. Writes to: screenshot/screenshot.png
5. Emits ArchiveResult JSON to stdout

Creates ArchiveResult with status=succeeded, output_files={'screenshot.png': {}}
```

**Example 3: PDF Plugin (Output Plugin - CREATES ArchiveResult)**
```
Plugin name: 'pdf'
ArchiveResult.extractor: 'pdf'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/pdf/

Hooks:
- on_Snapshot__35_pdf.js

Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Generates PDF
4. Writes to: pdf/output.pdf
5. Emits ArchiveResult JSON to stdout

Creates ArchiveResult with status=succeeded, output_files={'output.pdf': {}}
```
**Lifecycle:**
```
1. Chrome hooks run → create chrome/ dir with infrastructure files
2. Screenshot/PDF/etc hooks run → read chrome/cdp_url.txt, write to their own dirs
3. Snapshot.cleanup() called → sends SIGTERM to background hooks
4. Chrome receives SIGTERM → exits, deletes chrome/ dir
5. Screenshot/PDF/etc dirs remain with their outputs
```

**DO NOT:**
- Create one ArchiveResult per hook
- Use hook names as extractor values
- Create separate output directories per hook

**DO:**
- Create one ArchiveResult per plugin
- Use plugin directory name as extractor value
- Run all hooks in a plugin when processing its ArchiveResult
- Write all hook outputs to the same plugin directory

This principle drove the entire consolidation strategy:
- **Chrome plugin** = Infrastructure only (NO ArchiveResult)
- **Output plugins** = Each produces ONE distinct ArchiveResult (kept separate)
## Final Structure

### 1. Chrome Plugin (Infrastructure - No Output)

**Location:** `archivebox/plugins/chrome/`

This plugin provides shared Chrome infrastructure for other plugins. It manages the browser lifecycle but **produces NO ArchiveResult** - only infrastructure files in a single `chrome/` output directory.

**Consolidates these former plugins:**
- `chrome_session/` → Merged
- `chrome_navigate/` → Merged
- `chrome_cleanup/` → Merged
- `chrome_extensions/` → Utilities merged

**Hook Files:**
```
chrome/
├── on_Crawl__00_chrome_install_config.py   # Configure Chrome settings
├── on_Crawl__00_chrome_install.py          # Install Chrome binary
├── on_Crawl__20_chrome_launch.bg.js        # Launch Chrome (Crawl-level, bg)
├── on_Snapshot__20_chrome_tab.bg.js        # Open tab (Snapshot-level, bg)
├── on_Snapshot__30_chrome_navigate.js      # Navigate to URL (foreground)
├── on_Snapshot__45_chrome_tab_cleanup.py   # Close tab, kill bg hooks
├── chrome_extension_utils.js               # Extension utilities
├── config.json                             # Configuration
└── tests/test_chrome.py                    # Tests
```

**Output Directory (Infrastructure Only):**
```
chrome/
├── cdp_url.txt       # WebSocket URL for CDP connection
├── pid.txt           # Chrome process PID
├── target_id.txt     # Current tab target ID
├── page_loaded.txt   # Navigation completion marker
├── final_url.txt     # Final URL after redirects
├── navigation.json   # Navigation state (NEW)
└── hook.pid          # Background hook PIDs (for cleanup)
```
**New: navigation.json**

Tracks navigation state with wait condition and timing:
```json
{
  "waitUntil": "networkidle2",
  "elapsed": 1523,
  "url": "https://example.com",
  "finalUrl": "https://example.com/",
  "status": 200,
  "timestamp": "2025-12-27T22:15:30.123Z"
}
```

Fields:
- `waitUntil` - Wait condition: `networkidle0`, `networkidle2`, `domcontentloaded`, or `load`
- `elapsed` - Navigation time in milliseconds
- `url` - Original requested URL
- `finalUrl` - Final URL after redirects (success only)
- `status` - HTTP status code (success only)
- `error` - Error message (failure only)
- `timestamp` - ISO 8601 completion timestamp
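A minimal consumer sketch, assuming only the fields listed above (with `error` present only on failure):

```python
import json
from pathlib import Path

# Read the navigation state written by chrome_navigate.js (fields as documented above)
nav = json.loads(Path('../chrome/navigation.json').read_text())
if 'error' in nav:
    print(f"navigation failed after {nav['elapsed']}ms: {nav['error']}")
else:
    print(f"{nav['url']} -> {nav['finalUrl']} "
          f"(HTTP {nav['status']}, waited for {nav['waitUntil']}, {nav['elapsed']}ms)")
```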
### 2. Output Plugins (Each = One ArchiveResult)

These remain **SEPARATE** plugins because each produces a distinct output/ArchiveResult. Each plugin references `../chrome` for infrastructure.

#### consolelog Plugin
```
archivebox/plugins/consolelog/
└── on_Snapshot__21_consolelog.bg.js
```
- **Output:** `console.jsonl` (browser console messages)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL

#### ssl Plugin
```
archivebox/plugins/ssl/
└── on_Snapshot__23_ssl.bg.js
```
- **Output:** `ssl.jsonl` (SSL/TLS certificate details)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL

#### responses Plugin
```
archivebox/plugins/responses/
└── on_Snapshot__24_responses.bg.js
```
- **Output:** `responses/` directory with `index.jsonl` (network responses)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL

#### redirects Plugin
```
archivebox/plugins/redirects/
└── on_Snapshot__31_redirects.bg.js
```
- **Output:** `redirects.jsonl` (redirect chain)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
- **Changed:** Converted to background hook, now uses CDP `Network.requestWillBeSent` to capture redirects from initial request

#### staticfile Plugin
```
archivebox/plugins/staticfile/
└── on_Snapshot__31_staticfile.bg.js
```
- **Output:** Downloaded static file (PDF, image, video, etc.)
- **Type:** Background hook (CDP listener)
- **References:** `../chrome` for CDP URL
- **Changed:** Converted from Python to JavaScript, now uses CDP to detect Content-Type from initial response and download via CDP
## What Changed

### 1. Plugin Consolidation
- Merged `chrome_session`, `chrome_navigate`, `chrome_cleanup`, `chrome_extensions` → `chrome/`
- Chrome plugin now has a **single output directory**: `chrome/`
- All Chrome infrastructure hooks reference `.` (same directory)

### 2. Background Hook Conversions

**redirects Plugin:**
- **Before:** Ran AFTER navigation, reconnected to Chrome to check for redirects
- **After:** Background hook that sets up CDP listeners BEFORE navigation to capture redirects from the initial request
- **Method:** Uses the CDP `Network.requestWillBeSent` event with its `redirectResponse` parameter

**staticfile Plugin:**
- **Before:** Python script that ran AFTER navigation and checked response headers
- **After:** Background JavaScript hook that sets up CDP listeners BEFORE navigation
- **Method:** Uses Puppeteer's `page.on('response')` (over CDP) to capture the Content-Type from the initial request
- **Language:** Converted from Python to JavaScript/Node.js for consistency

### 3. Navigation State Tracking
- **Added:** `navigation.json` file in the `chrome/` output directory
- **Contains:** `waitUntil` condition and `elapsed` milliseconds
- **Purpose:** Track navigation performance and wait conditions for analysis

### 4. Cleanup
- **Deleted:** `chrome_session/on_CrawlEnd__99_chrome_cleanup.py` (manual cleanup hook)
- **Reason:** Automatic cleanup via state machines is sufficient
- **Verified:** Cleanup mechanisms in `core/models.py` and `crawls/models.py` work correctly
## Hook Execution Order

```
═══ CRAWL LEVEL ═══
00. chrome_install_config.py    Configure Chrome settings
00. chrome_install.py           Install Chrome binary
20. chrome_launch.bg.js         Launch Chrome browser (STAYS RUNNING)

═══ PER-SNAPSHOT LEVEL ═══

Phase 1: PRE-NAVIGATION (Background hooks setup)
20. chrome_tab.bg.js            Open new tab (STAYS ALIVE)
21. consolelog.bg.js            Setup console listener (STAYS ALIVE)
23. ssl.bg.js                   Setup SSL listener (STAYS ALIVE)
24. responses.bg.js             Setup network response listener (STAYS ALIVE)
31. redirects.bg.js             Setup redirect listener (STAYS ALIVE)
31. staticfile.bg.js            Setup staticfile detector (STAYS ALIVE)

Phase 2: NAVIGATION (Foreground - synchronization point)
30. chrome_navigate.js          Navigate to URL (BLOCKS until page loaded)
    ↓
    Writes navigation.json with waitUntil & elapsed
    Writes page_loaded.txt marker
    ↓
    All background hooks can now finalize

Phase 3: POST-NAVIGATION (Background hooks finalize)
    (All .bg hooks save their data and wait for cleanup signal)

Phase 4: OTHER EXTRACTORS (use loaded page)
34. screenshot.js
37. singlefile.js
    ... (other extractors that need loaded page)

Phase 5: CLEANUP
45. chrome_tab_cleanup.py       Close tab
                                Kill background hooks (SIGTERM → SIGKILL)
                                Update ArchiveResults
```
## Background Hook Pattern

All `.bg.js` hooks follow this pattern:

1. **Setup:** Create CDP listeners BEFORE navigation
2. **Capture:** Collect data incrementally as events occur
3. **Write:** Save data to filesystem continuously
4. **Wait:** Keep process alive until SIGTERM
5. **Finalize:** On SIGTERM, emit final JSONL result to stdout
6. **Exit:** Clean exit with status code

**Key files written:**
- `hook.pid` - Process ID for cleanup mechanism
- Output files (e.g., `console.jsonl`, `ssl.jsonl`, etc.)
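The real hooks are JavaScript, but the lifecycle is easy to see in a minimal Python skeleton of the same pattern (file names from above; the emitted record fields are illustrative):

```python
import json, os, signal, sys, time
from pathlib import Path

# Minimal sketch of the .bg hook lifecycle described above
Path('hook.pid').write_text(str(os.getpid()))   # advertise PID for the cleanup mechanism

def finalize(signum, frame):
    # On SIGTERM: emit the final JSONL result to stdout, then exit cleanly
    print(json.dumps({'type': 'ArchiveResult', 'status': 'succeeded'}), flush=True)
    sys.exit(0)

signal.signal(signal.SIGTERM, finalize)

while True:
    # capture events and write output files continuously here, until SIGTERM
    time.sleep(1)
```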
## Automatic Cleanup Mechanism

**Snapshot-level cleanup** (`core/models.py`):
```python
def cleanup(self):
    """Kill background hooks and close resources."""
    # Scan OUTPUT_DIR for hook.pid files
    # Send SIGTERM to processes
    # Wait for graceful exit
    # Send SIGKILL if process still alive
    # Update ArchiveResults to FAILED if needed
```

**Crawl-level cleanup** (`crawls/models.py`):
```python
def cleanup(self):
    """Kill Crawl-level background hooks (Chrome browser)."""
    # Similar pattern for Crawl-level resources
    # Kills Chrome launch process
```

**State machine integration:**
- Both `SnapshotMachine` and `CrawlMachine` call `cleanup()` when entering the `sealed` state
- Ensures all background processes are cleaned up properly
- No manual cleanup hooks needed (a runnable sketch of the kill loop follows below)
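A minimal runnable sketch of the kill loop described above (the helper name and timeout are assumptions, not the actual `core/models.py` code):

```python
import os, signal, time
from pathlib import Path

def kill_background_hooks(output_dir: Path, timeout: float = 5.0) -> None:
    """Sketch: SIGTERM each */hook.pid process, then SIGKILL any stragglers."""
    for pid_file in output_dir.glob('*/hook.pid'):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            continue
        try:
            os.kill(pid, signal.SIGTERM)           # ask for a graceful exit
            deadline = time.monotonic() + timeout
            while time.monotonic() < deadline:
                os.kill(pid, 0)                    # probe: raises OSError once it exits
                time.sleep(0.1)
            os.kill(pid, signal.SIGKILL)           # force-kill if it never exited
        except OSError:
            pass                                   # process already gone
```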
## Directory References

**Crawl output structure:**
- Crawls output to: `users/{user_id}/crawls/{YYYYMMDD}/{crawl_id}/`
- Example: `users/1/crawls/20251227/abc-def-123/`
- Crawl-level plugins create subdirectories: `users/1/crawls/20251227/abc-def-123/chrome/`

**Snapshot output structure:**
- Snapshots output to: `archive/{timestamp}/` (the legacy layout; the 0.9.x layout is `users/{user}/snapshots/{YYYYMMDD}/{domain}/{snapshot_id}/`, see above)
- Snapshot-level plugins create subdirectories: `archive/{timestamp}/chrome/`, `archive/{timestamp}/consolelog/`, etc.

**Within the chrome plugin:**
- Hooks use `.` or `OUTPUT_DIR` to reference the `chrome/` directory they're running in
- Example: `fs.writeFileSync(path.join(OUTPUT_DIR, 'navigation.json'), ...)`

**From output plugins to chrome (same snapshot):**
- Hooks use `../chrome` to reference Chrome infrastructure in the same snapshot
- Example: `const CHROME_SESSION_DIR = '../chrome';`
- Used to read: `cdp_url.txt`, `target_id.txt`, `page_loaded.txt`

**From snapshot hooks to crawl chrome:**
- Snapshot hooks receive the `CRAWL_OUTPUT_DIR` environment variable (set by hooks.py)
- Use `path.join(process.env.CRAWL_OUTPUT_DIR, 'chrome')` to find the crawl-level Chrome
- This allows snapshots to reuse the crawl's shared Chrome browser

**Navigation synchronization:**
- All hooks wait for `../chrome/page_loaded.txt` before finalizing
- This file is written by `chrome_navigate.js` after navigation completes
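Putting those references together, a minimal sketch (in Python for brevity; the real hooks are Node.js) of how an output plugin locates Chrome and waits for navigation:

```python
import time
from pathlib import Path

CHROME_DIR = Path('../chrome')  # Chrome infrastructure dir in the same snapshot

# Block until chrome_navigate.js writes the completion marker
while not (CHROME_DIR / 'page_loaded.txt').exists():
    time.sleep(0.1)

# Read the CDP WebSocket URL written by the chrome plugin, then connect to it
cdp_url = (CHROME_DIR / 'cdp_url.txt').read_text().strip()
print(f'connecting to Chrome over CDP at {cdp_url}')
```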
## Design Principles

1. **One ArchiveResult Per Plugin**
   - Each plugin produces exactly ONE output/ArchiveResult
   - Infrastructure plugins (like chrome) produce NO ArchiveResult

2. **Chrome as Infrastructure**
   - Provides shared CDP connection, PIDs, navigation state
   - No ArchiveResult output of its own
   - Single output directory for all infrastructure files

3. **Background Hooks for CDP**
   - Hooks that need CDP listeners BEFORE navigation are background (`.bg.js`)
   - They capture events from the initial request/response
   - Stay alive through navigation and cleanup

4. **Foreground for Synchronization**
   - `chrome_navigate.js` is foreground (not `.bg`)
   - Provides synchronization point - blocks until page loaded
   - All other hooks wait for its completion marker

5. **Automatic Cleanup**
   - State machines handle background hook cleanup
   - No manual cleanup hooks needed
   - SIGTERM for graceful exit, SIGKILL as backup

6. **Clear Separation**
   - Infrastructure vs outputs
   - One output directory per plugin
   - Predictable, maintainable architecture
## Benefits

✓ **Architectural Clarity** - Clear separation between infrastructure and outputs
✓ **Correct Output Model** - One ArchiveResult per plugin
✓ **Better Performance** - CDP listeners capture data from initial request
✓ **No Duplication** - Single Chrome infrastructure used by all
✓ **Proper Lifecycle** - Background hooks cleaned up automatically
✓ **Maintainable** - Easy to understand, debug, and extend
✓ **Consistent** - All background hooks follow same pattern
✓ **Observable** - Navigation state tracked for debugging

## Testing

Run tests:
```bash
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/plugins/chrome/tests/ -v'
```

## Migration Notes

**For developers:**
- Chrome infrastructure is now in the `chrome/` output dir (not `chrome_session/`)
- Reference `../chrome/cdp_url.txt` from output plugins
- Navigation marker is `../chrome/page_loaded.txt`
- Navigation details in `../chrome/navigation.json`

**For users:**
- No user-facing changes
- Output structure remains the same
- All extractors continue to work
1240 old/TODO_fs_migrations.md (new file; diff suppressed because it is too large)
1976 old/TODO_hook_architecture.md (new executable file; diff suppressed because it is too large)
665 old/TODO_hook_statemachine_cleanup.md (new file)
# Hook & State Machine Cleanup - Unified Pattern

## Goal
Implement a **consistent pattern** across all models (Crawl, Snapshot, ArchiveResult, Dependency) for:
1. Running hooks
2. Processing JSONL records
3. Managing background hooks
4. State transitions

## Current State Analysis (ALL COMPLETE ✅)

### ✅ Crawl (archivebox/crawls/)
**Status**: COMPLETE
- ✅ Has state machine: `CrawlMachine`
- ✅ `Crawl.run()` - runs hooks, processes JSONL via `process_hook_records()`, creates snapshots
- ✅ `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd hooks
- ✅ Uses `OUTPUT_DIR/plugin_name/` for PWD
- ✅ State machine calls model methods:
  - `queued -> started`: calls `crawl.run()`
  - `started -> sealed`: calls `crawl.cleanup()`

### ✅ Snapshot (archivebox/core/)
**Status**: COMPLETE
- ✅ Has state machine: `SnapshotMachine`
- ✅ `Snapshot.run()` - creates pending ArchiveResults
- ✅ `Snapshot.cleanup()` - kills background ArchiveResult hooks, calls `update_from_output()`
- ✅ `Snapshot.has_running_background_hooks()` - checks PID files using `process_is_alive()`
- ✅ `Snapshot.from_jsonl()` - simplified, filtering moved to the caller
- ✅ State machine calls model methods:
  - `queued -> started`: calls `snapshot.run()`
  - `started -> sealed`: calls `snapshot.cleanup()`
  - `is_finished()`: uses `has_running_background_hooks()`
### ✅ ArchiveResult (archivebox/core/)
**Status**: COMPLETE - Major refactor completed
- ✅ Has state machine: `ArchiveResultMachine`
- ✅ `ArchiveResult.run()` - runs the hook, calls `update_from_output()` for foreground hooks
- ✅ `ArchiveResult.update_from_output()` - unified method for foreground and background hooks
- ✅ Uses PWD `snapshot.OUTPUT_DIR/plugin_name`
- ✅ JSONL processing via `process_hook_records()` with URL/depth filtering
- ✅ **DELETED** special background hook methods:
  - ❌ `check_background_completed()` - replaced by the `process_is_alive()` helper
  - ❌ `finalize_background_hook()` - replaced by `update_from_output()`
  - ❌ `_populate_output_fields()` - merged into `update_from_output()`
- ✅ State machine transitions:
  - `queued -> started`: calls `archiveresult.run()`
  - `started -> succeeded/failed/skipped`: status set by `update_from_output()`

### ✅ Binary (archivebox/machine/) - NEW!
**Status**: COMPLETE - Replaced the Dependency model entirely
- ✅ Has state machine: `BinaryMachine`
- ✅ `Binary.run()` - runs on_Binary__install_* hooks, processes JSONL
- ✅ `Binary.cleanup()` - kills background installation hooks (for consistency)
- ✅ `Binary.from_jsonl()` - handles both binaries.jsonl and hook output
- ✅ Uses PWD `data/machines/{machine_id}/binaries/{name}/{id}/plugin_name/`
- ✅ Configuration via static `plugins/*/binaries.jsonl` files
- ✅ State machine calls model methods:
  - `queued -> started`: calls `binary.run()`
  - `started -> succeeded/failed`: status set by hooks via JSONL
- ✅ Perfect symmetry with the Crawl/Snapshot/ArchiveResult pattern

### ❌ Dependency Model - ELIMINATED
**Status**: Deleted entirely (replaced by the Binary state machine)
- Static configuration now lives in `plugins/*/binaries.jsonl`
- Per-machine state tracked by Binary records
- No global singleton conflicts
- Hooks renamed from `on_Dependency__install_*` to `on_Binary__install_*`
## Unified Pattern (Target Architecture)
|
||||
|
||||
### Pattern for ALL models:
|
||||
|
||||
```python
|
||||
# 1. State Machine orchestrates transitions
|
||||
class ModelMachine(StateMachine):
|
||||
@started.enter
|
||||
def enter_started(self):
|
||||
self.model.run() # Do the work
|
||||
# Update status
|
||||
|
||||
def is_finished(self):
|
||||
# Check if background hooks still running
|
||||
if self.model.has_running_background_hooks():
|
||||
return False
|
||||
# Check if children finished
|
||||
if self.model.has_pending_children():
|
||||
return False
|
||||
return True
|
||||
|
||||
@sealed.enter
|
||||
def enter_sealed(self):
|
||||
self.model.cleanup() # Clean up background hooks
|
||||
# Update status
|
||||
|
||||
# 2. Model methods do the actual work
|
||||
class Model:
|
||||
def run(self):
|
||||
"""Run hooks, process JSONL, create children."""
|
||||
hooks = discover_hooks('ModelName')
|
||||
for hook in hooks:
|
||||
output_dir = self.OUTPUT_DIR / hook.parent.name
|
||||
result = run_hook(hook, output_dir=output_dir, ...)
|
||||
|
||||
if result is None: # Background hook
|
||||
continue
|
||||
|
||||
# Process JSONL records
|
||||
records = result.get('records', [])
|
||||
overrides = {'model': self, 'created_by_id': self.created_by_id}
|
||||
process_hook_records(records, overrides=overrides)
|
||||
|
||||
# Create children (e.g., ArchiveResults, Snapshots)
|
||||
self.create_children()
|
||||
|
||||
def cleanup(self):
|
||||
"""Kill background hooks, run cleanup hooks."""
|
||||
# Kill any background hooks
|
||||
if self.OUTPUT_DIR.exists():
|
||||
for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
|
||||
kill_process(pid_file)
|
||||
|
||||
# Run cleanup hooks (e.g., on_ModelEnd)
|
||||
cleanup_hooks = discover_hooks('ModelEnd')
|
||||
for hook in cleanup_hooks:
|
||||
run_hook(hook, ...)
|
||||
|
||||
def has_running_background_hooks(self) -> bool:
|
||||
"""Check if any background hooks still running."""
|
||||
if not self.OUTPUT_DIR.exists():
|
||||
return False
|
||||
for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
|
||||
if process_is_alive(pid_file):
|
||||
return True
|
||||
return False
|
||||
```
|
||||
|
||||
### PWD Standard:
|
||||
```
|
||||
model.OUTPUT_DIR/plugin_name/
|
||||
```
|
||||
- Crawl: `users/{user}/crawls/{date}/{crawl_id}/plugin_name/`
|
||||
- Snapshot: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/`
|
||||
- ArchiveResult: `users/{user}/snapshots/{date}/{domain}/{snapshot_id}/plugin_name/` (same as Snapshot)
|
||||
- Dependency: `dependencies/{dependency_id}/plugin_name/` (set output_dir field directly)
## Implementation Plan

### Phase 1: Add unified helpers to hooks.py ✅ DONE

**File**: `archivebox/hooks.py`

**Status**: COMPLETE - Added three helper functions:
- `process_hook_records(records, overrides)` - lines 1258-1323
- `process_is_alive(pid_file)` - lines 1326-1344
- `kill_process(pid_file, sig)` - lines 1347-1362

```python
import os
from pathlib import Path
from signal import SIGTERM
from typing import Dict, List


def process_hook_records(records: List[Dict], overrides: Dict = None) -> Dict[str, int]:
    """
    Process JSONL records from hook output.
    Dispatches to Model.from_jsonl() for each record type.

    Args:
        records: List of JSONL record dicts from result['records']
        overrides: Dict with 'snapshot', 'crawl', 'dependency', 'created_by_id', etc.

    Returns:
        Dict with counts by record type
    """
    stats = {}
    for record in records:
        record_type = record.get('type')

        # Dispatch to the appropriate model
        if record_type == 'Snapshot':
            from core.models import Snapshot
            Snapshot.from_jsonl(record, overrides)
            stats['Snapshot'] = stats.get('Snapshot', 0) + 1
        elif record_type == 'Tag':
            from core.models import Tag
            Tag.from_jsonl(record, overrides)
            stats['Tag'] = stats.get('Tag', 0) + 1
        elif record_type == 'Binary':
            from machine.models import Binary
            Binary.from_jsonl(record, overrides)
            stats['Binary'] = stats.get('Binary', 0) + 1
        # ... etc
    return stats


def process_is_alive(pid_file: Path) -> bool:
    """Check if the process in the PID file is still running."""
    if not pid_file.exists():
        return False
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, 0)  # Signal 0 = check if the process exists
        return True
    except (OSError, ValueError):
        return False


def kill_process(pid_file: Path, sig=SIGTERM):
    """Kill the process in the PID file."""
    if not pid_file.exists():
        return
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, sig)
    except (OSError, ValueError):
        pass
```
### Phase 2: Add Model.from_jsonl() static methods ✅ DONE

**Files**: `archivebox/core/models.py`, `archivebox/machine/models.py`, `archivebox/crawls/models.py`

**Status**: COMPLETE - Added from_jsonl() to:
- ✅ `Tag.from_jsonl()` - core/models.py lines 93-116
- ✅ `Snapshot.from_jsonl()` - core/models.py lines 1144-1189
- ✅ `Machine.from_jsonl()` - machine/models.py lines 66-89
- ✅ `Dependency.from_jsonl()` - machine/models.py lines 203-227
- ✅ `Binary.from_jsonl()` - machine/models.py lines 401-434

Example implementations added:

```python
class Snapshot:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Snapshot from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_snapshot
        overrides = overrides or {}

        # Apply overrides (crawl, parent_snapshot, depth limits)
        crawl = overrides.get('crawl')
        snapshot = overrides.get('snapshot')  # parent

        if crawl:
            depth = record.get('depth', (snapshot.depth + 1 if snapshot else 1))
            if depth > crawl.max_depth:
                return None
            record.setdefault('crawl_id', str(crawl.id))
            record.setdefault('depth', depth)
            if snapshot:
                record.setdefault('parent_snapshot_id', str(snapshot.id))

        created_by_id = overrides.get('created_by_id')
        new_snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
        new_snapshot.status = Snapshot.StatusChoices.QUEUED
        new_snapshot.retry_at = timezone.now()
        new_snapshot.save()
        return new_snapshot


class Tag:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Tag from JSONL record."""
        from archivebox.misc.jsonl import get_or_create_tag
        tag = get_or_create_tag(record)
        # Auto-attach to the snapshot if present in overrides
        if overrides and 'snapshot' in overrides:
            overrides['snapshot'].tags.add(tag)
        return tag


class Binary:
    @staticmethod
    def from_jsonl(record: Dict, overrides: Dict = None):
        """Create/update Binary from JSONL record."""
        # Implementation similar to the existing create_model_record()
        ...

# Etc. for other models
```
### Phase 3: Update ArchiveResult to use unified pattern ✅ DONE

**File**: `archivebox/core/models.py`

**Status**: COMPLETE

**Changes made**:

1. ✅ **Replaced inline JSONL processing** (lines 1912-1950):
   - Pre-filter Snapshot records for depth/URL constraints in ArchiveResult.run()
   - Use `self._url_passes_filters(url)` with the parent snapshot's config for proper hierarchy
   - Replaced inline Tag/Snapshot/other record creation with `process_hook_records()`
   - Removed ~60 lines of duplicate code

2. ✅ **Simplified Snapshot.from_jsonl()** (lines 1144-1189):
   - Removed depth checking (now done in the caller)
   - Just applies crawl metadata and creates the snapshot
   - Added docstring note: "Filtering should be done by caller BEFORE calling this method"

3. ✅ **Preserved ArchiveResult self-update logic**:
   - Status/output fields still updated from the ArchiveResult JSONL record (lines 1856-1910)
   - Special title extractor logic preserved (line 1952+)
   - Search indexing trigger preserved (line 1957+)

4. ✅ **Key insight**: Filtering happens in ArchiveResult.run() where we have the parent snapshot context, NOT in from_jsonl() where we'd lose the config hierarchy

**Note**: Did NOT delete the special background hook methods (`check_background_completed`, `finalize_background_hook`) - that's Phase 6

### Phase 4: Add Snapshot.cleanup() method ✅ DONE

**File**: `archivebox/core/models.py`

**Status**: COMPLETE

**Changes made**:

1. ✅ **Added Snapshot.cleanup()** (lines 1144-1175):
   - Kills background ArchiveResult hooks by scanning for `*/hook.pid` files
   - Finalizes background ArchiveResults using `finalize_background_hook()` (temporary until Phase 6)
   - Called by the state machine when entering the sealed state

2. ✅ **Added Snapshot.has_running_background_hooks()** (lines 1177-1195):
   - Checks if any background hooks are still running using `process_is_alive()`
   - Used by the state machine in the `is_finished()` check

### Phase 5: Update SnapshotMachine to use cleanup() ✅ DONE

**File**: `archivebox/core/statemachines.py`

**Status**: COMPLETE

**Changes made**:

1. ✅ **Simplified is_finished()** (lines 58-72):
   - Removed inline background hook checking and finalization (lines 67-76 deleted)
   - Now uses `self.snapshot.has_running_background_hooks()` (line 68)
   - Removed ~12 lines of duplicate logic

2. ✅ **Added cleanup() to sealed.enter** (lines 102-111):
   - Calls `self.snapshot.cleanup()` to kill background hooks (line 105)
   - Follows the unified pattern: cleanup happens on seal, not in is_finished()
### Phase 6: Add ArchiveResult.update_from_output() and simplify run() ✅ DONE

**File**: `archivebox/core/models.py`

**Status**: COMPLETE - The BIG refactor (removed ~200 lines of duplication)

**Changes made**:

1. ✅ **Added `ArchiveResult.update_from_output()`** (lines 1908-2061):
   - Unified method for both foreground and background hooks
   - Reads stdout.log and parses JSONL records
   - Updates status/output_str/output_json from the ArchiveResult JSONL record
   - Walks the filesystem to populate output_files/output_size/output_mimetypes
   - Filters Snapshot records for depth/URL constraints (same as run())
   - Processes side-effect records via `process_hook_records()`
   - Updates the snapshot title if it is the title extractor
   - Triggers search indexing if succeeded
   - Cleans up PID files and empty logs
   - ~160 lines of comprehensive logic

2. ✅ **Simplified `ArchiveResult.run()`** (lines 1841-1906):
   - Removed ~120 lines of duplicate filesystem-reading logic
   - Now just sets start_ts/pwd and calls `update_from_output()`
   - Background hooks: return immediately after saving status=STARTED
   - Foreground hooks: call `update_from_output()` to do all the work
   - Removed ~10 lines of duplicate code

3. ✅ **Updated `Snapshot.cleanup()`** (line 1172):
   - Changed from `ar.finalize_background_hook()` to `ar.update_from_output()`
   - Uses the unified method instead of the old special-case method

4. ✅ **Deleted `_populate_output_fields()`** (was ~45 lines):
   - Logic merged into `update_from_output()`
   - Eliminates duplication of the filesystem-walking code

5. ✅ **Deleted `check_background_completed()`** (was ~20 lines):
   - Replaced by `process_is_alive(pid_file)` from hooks.py
   - Generic helper used by Snapshot.has_running_background_hooks()

6. ✅ **Deleted `finalize_background_hook()`** (was ~85 lines):
   - Completely replaced by `update_from_output()`
   - Was a duplicate of the foreground hook finalization logic

**Total lines removed**: ~280 lines of duplicate code
**Total lines added**: ~160 lines of unified code
**Net reduction**: ~120 lines (-43%)

### Phase 7-8: Dependency State Machine ❌ NOT NEEDED

**Status**: Intentionally skipped - Dependency doesn't need a state machine

**Why no state machine for Dependency?**

1. **Wrong Granularity**: Dependency is a GLOBAL singleton (one record per binary name)
   - Multiple machines would race to update the same `status`/`retry_at` fields
   - No clear semantics: "started" on which machine? "failed" on Machine A but "succeeded" on Machine B?

2. **Wrong Timing**: Installation should be SYNCHRONOUS, not queued
   - When a worker needs wget, it should install wget NOW, not queue it for later
   - No benefit to async state machine transitions

3. **State Lives Elsewhere**: Binary records are the actual state
   - Each machine has its own Binary records (one per machine per binary)
   - The Binary.machine FK provides proper per-machine state tracking

**Correct Architecture:**
```
Dependency (global, no state machine):
├─ Configuration: bin_name, bin_providers, overrides
├─ run() method: synchronous installation attempt
└─ NO status, NO retry_at, NO state_machine_name

Binary (per-machine, has machine FK):
├─ State: is this binary installed on this specific machine?
├─ Created via JSONL output from on_Dependency hooks
└─ unique_together = (machine, name, abspath, version, sha256)
```

**What was implemented:**
- ✅ **Refactored `Dependency.run()`** (lines 249-324):
  - Uses `discover_hooks()` and `process_hook_records()` for consistency
  - Added a comprehensive docstring explaining why there is no state machine
  - Synchronous execution: returns a Binary or None immediately
  - Uses the unified JSONL processing pattern
- ✅ **Kept Dependency simple**: Just configuration fields, no state fields
- ✅ **Multi-machine support**: Each machine independently runs Dependency.run() and creates its own Binary
## Summary of Changes

### Progress: 6/6 Core Phases Complete ✅ + 2 Phases Skipped (Intentionally)

**ALL core functionality is now complete!** The unified pattern is consistently implemented across Crawl, Snapshot, and ArchiveResult. Dependency is intentionally kept simple (no state machine needed).

### Files Modified:

1. ✅ **DONE** `archivebox/hooks.py` - Added unified helpers:
   - ✅ `process_hook_records(records, overrides)` - dispatcher (lines 1258-1323)
   - ✅ `process_is_alive(pid_file)` - check if a PID is still running (lines 1326-1344)
   - ✅ `kill_process(pid_file)` - kill a process (lines 1347-1362)

2. ✅ **DONE** `archivebox/crawls/models.py` - Already updated:
   - ✅ `Crawl.run()` - runs hooks, processes JSONL, creates snapshots
   - ✅ `Crawl.cleanup()` - kills background hooks, runs on_CrawlEnd

3. ✅ **DONE** `archivebox/core/models.py`:
   - ✅ `Tag.from_jsonl()` - lines 93-116
   - ✅ `Snapshot.from_jsonl()` - lines 1197-1234 (simplified, removed filtering)
   - ✅ `Snapshot.cleanup()` - lines 1144-1172 (kills background hooks, calls ar.update_from_output())
   - ✅ `Snapshot.has_running_background_hooks()` - lines 1174-1193 (checks PIDs)
   - ✅ `ArchiveResult.run()` - simplified, uses `update_from_output()` (lines 1841-1906)
   - ✅ `ArchiveResult.update_from_output()` - unified filesystem reading (lines 1908-2061)
   - ✅ **DELETED** `ArchiveResult.check_background_completed()` - replaced by `process_is_alive()`
   - ✅ **DELETED** `ArchiveResult.finalize_background_hook()` - replaced by `update_from_output()`
   - ✅ **DELETED** `ArchiveResult._populate_output_fields()` - merged into `update_from_output()`

4. ✅ **DONE** `archivebox/core/statemachines.py`:
   - ✅ Simplified `SnapshotMachine.is_finished()` - uses `has_running_background_hooks()` (line 68)
   - ✅ Added a cleanup call to `SnapshotMachine.sealed.enter` (line 105)

5. ✅ **DONE** `archivebox/machine/models.py`:
   - ✅ `Machine.from_jsonl()` - lines 66-89
   - ✅ `Dependency.from_jsonl()` - lines 203-227
   - ✅ `Binary.from_jsonl()` - lines 401-434
   - ✅ Refactored `Dependency.run()` to use the unified pattern (lines 249-324)
   - ✅ Added a comprehensive docstring explaining why Dependency doesn't need a state machine
   - ✅ Kept Dependency simple: no state fields, synchronous execution only

### Code Metrics:
- **Lines removed**: ~280 lines of duplicate code
- **Lines added**: ~160 lines of unified code
- **Net reduction**: ~120 lines total (-43%)
- **Files created**: 0 (no new files needed)

### Key Benefits:

1. **Consistency**: All stateful models (Crawl, Snapshot, ArchiveResult) follow the same unified state machine pattern
2. **Simplicity**: Eliminated special-case background hook handling (~280 lines of duplicate code)
3. **Correctness**: Background hooks are properly cleaned up on the seal transition
4. **Maintainability**: Unified `process_hook_records()` dispatcher for all JSONL processing
5. **Testability**: The consistent pattern makes testing easier
6. **Clear Separation**: Stateful work items (Crawl/Snapshot/ArchiveResult) vs stateless config (Dependency)
7. **Multi-Machine Support**: Dependency remains simple synchronous config; Binary tracks per-machine state
## Final Unified Pattern

All models now follow this consistent architecture:

### State Machine Structure
```python
class ModelMachine(StateMachine):
    queued = State(initial=True)
    started = State()
    sealed = State(final=True)  # or succeeded/failed, depending on the model

    @started.enter
    def enter_started(self):
        self.model.run()  # Execute the work

    @sealed.enter  # or @succeeded.enter
    def enter_sealed(self):
        self.model.cleanup()  # Clean up background hooks
```

### Model Methods
```python
class Model:
    # State machine fields
    status = CharField(default='queued')
    retry_at = DateTimeField(default=timezone.now)
    output_dir = CharField(default='', blank=True)
    state_machine_name = 'app.statemachines.ModelMachine'

    def run(self):
        """Run hooks, process JSONL, create children."""
        hooks = discover_hooks('EventName')
        for hook in hooks:
            output_dir = self.OUTPUT_DIR / hook.parent.name
            result = run_hook(hook, output_dir=output_dir, ...)

            if result is None:  # Background hook
                continue

            # Process JSONL records
            overrides = {'model': self, 'created_by_id': self.created_by_id}
            process_hook_records(result['records'], overrides=overrides)

    def cleanup(self):
        """Kill background hooks, run cleanup hooks."""
        for pid_file in self.OUTPUT_DIR.glob('*/hook.pid'):
            kill_process(pid_file)
        # Update children from the filesystem
        for child in self.children:  # placeholder for the model's child records
            child.update_from_output()

    def update_for_workers(self, **fields):
        """Update fields and bump modified_at."""
        for field, value in fields.items():
            setattr(self, field, value)
        self.save(update_fields=[*fields.keys(), 'modified_at'])

    @staticmethod
    def from_jsonl(record: dict, overrides: dict = None):
        """Create/update the model from a JSONL record."""
        # Implementation specific to each model
        # Called by process_hook_records()
```

### Hook Processing Flow
```
1. Model.run() discovers hooks
2. Hooks execute and output JSONL to stdout
3. JSONL records are dispatched via process_hook_records()
4. Each record type is handled by Model.from_jsonl()
5. Background hooks are tracked via hook.pid files
6. Model.cleanup() kills background hooks on seal
7. Children are updated via update_from_output()
```
### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
- **Resources** (Binary): Machine FK, one per machine per binary
- **Configuration** (Dependency): No machine FK, global singleton, synchronous execution
- **Execution Tracking** (ArchiveResult.iface): FK to NetworkInterface for observability

## Testing Checklist

- [ ] Test Crawl → Snapshot creation with hooks
- [ ] Test Snapshot → ArchiveResult creation
- [ ] Test ArchiveResult foreground hooks (JSONL processing)
- [ ] Test ArchiveResult background hooks (PID tracking, cleanup)
- [ ] Test Dependency.run() synchronous installation
- [ ] Test background hook cleanup on seal transition
- [ ] Test multi-machine Crawl execution
- [ ] Test Binary creation per machine (one per machine per binary)
- [ ] Verify Dependency.run() can be called concurrently from multiple machines safely
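For the PID-tracking items in this checklist, a minimal pytest sketch, assuming the Phase 1 helpers are importable from `archivebox.hooks` with the signatures shown earlier:

```python
import subprocess
from pathlib import Path

from archivebox.hooks import kill_process, process_is_alive  # Phase 1 helpers


def test_pid_file_lifecycle(tmp_path: Path):
    # Start a throwaway child process and record its PID like a .bg hook would
    proc = subprocess.Popen(['sleep', '60'])
    pid_file = tmp_path / 'hook.pid'
    pid_file.write_text(str(proc.pid))

    assert process_is_alive(pid_file)

    kill_process(pid_file)   # sends SIGTERM by default
    proc.wait(timeout=5)     # reap the child so the PID is released
    assert not process_is_alive(pid_file)
```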
## FINAL ARCHITECTURE (Phases 1-8 Complete)

### ✅ Phases 1-6: Core Models Unified
All core models (Crawl, Snapshot, ArchiveResult) now follow the unified pattern:
- State machines orchestrate transitions
- `.run()` methods execute hooks and process JSONL
- `.cleanup()` methods kill background hooks
- `.update_for_workers()` methods update state for worker coordination
- Consistent use of `process_hook_records()` for JSONL dispatching

### ✅ Phases 7-8: Binary State Machine (Dependency Model Eliminated)

**Key Decision**: Eliminated the `Dependency` model entirely and made `Binary` the state machine.

#### New Architecture
- **Static Configuration**: `plugins/{plugin}/dependencies.jsonl` files define binary requirements
  ```jsonl
  {"type": "Binary", "name": "yt-dlp", "bin_providers": "pip,brew,apt,env"}
  {"type": "Binary", "name": "node", "bin_providers": "apt,brew,env", "overrides": {"apt": {"packages": ["nodejs"]}}}
  {"type": "Binary", "name": "ffmpeg", "bin_providers": "apt,brew,env"}
  ```
- **Dynamic State**: The `Binary` model tracks per-machine installation state
  - Fields: `machine`, `name`, `bin_providers`, `overrides`, `abspath`, `version`, `sha256`, `binprovider`
  - State machine: `queued → started → succeeded/failed`
  - Output dir: `data/machines/{machine_id}/binaries/{binary_name}/{binary_id}/`

#### Binary State Machine Flow
```python
class BinaryMachine(StateMachine):
    # queued → started → succeeded/failed

    @started.enter
    def enter_started(self):
        self.binary.run()  # Runs on_Binary__install_* hooks


class Binary(models.Model):
    def run(self):
        """
        Runs ALL on_Binary__install_* hooks.
        Each hook checks bin_providers and decides if it can handle this binary.
        The first hook to succeed wins.
        Outputs JSONL with abspath, version, sha256, binprovider.
        """
        hooks = discover_hooks('Binary')
        for hook in hooks:
            result = run_hook(
                hook,
                output_dir=self.OUTPUT_DIR / plugin_name,
                binary_id=self.id,
                machine_id=self.machine_id,
                name=self.name,
                bin_providers=self.bin_providers,
                overrides=json.dumps(self.overrides),
            )

        # Hook outputs: {"type": "Binary", "name": "wget", "abspath": "/usr/bin/wget", "version": "1.21", "binprovider": "apt"}
        # Binary.from_jsonl() updates self with the installation results
```

#### Hook Naming Convention
- **Before**: `on_Dependency__install_using_pip_provider.py`
- **After**: `on_Binary__install_using_pip_provider.py`

Each hook checks the `--bin-providers` CLI argument:
```python
if 'pip' not in bin_providers.split(','):
    sys.exit(0)  # Skip this binary
```

#### Perfect Symmetry Achieved
All models now follow identical patterns:
```
Crawl(queued)         → CrawlMachine         → Crawl.run()         → sealed
Snapshot(queued)      → SnapshotMachine      → Snapshot.run()      → sealed
ArchiveResult(queued) → ArchiveResultMachine → ArchiveResult.run() → succeeded/failed
Binary(queued)        → BinaryMachine        → Binary.run()        → succeeded/failed
```

#### Benefits of Eliminating Dependency
1. **No global singleton conflicts**: Binary is per-machine, no race conditions
2. **Simpler data model**: One table instead of two (Dependency + InstalledBinary)
3. **Static configuration**: dependencies.jsonl lives in version control, not the database
4. **Consistent state machine**: Binary follows the same pattern as other models
5. **Cleaner hooks**: Hooks check bin_providers themselves instead of the orchestrator parsing names

#### Multi-Machine Coordination
- **Work Items** (Crawl, Snapshot, ArchiveResult): No machine FK, any worker can claim
- **Resources** (Binary): Machine FK, one per machine per binary name
- **Configuration**: Static files in `plugins/*/dependencies.jsonl`
- **Execution Tracking**: ArchiveResult.iface FK to NetworkInterface for observability

### Testing Checklist (Updated)
- [x] Core models use the unified hook pattern (Phases 1-6)
- [ ] Binary installation via state machine
- [ ] Multiple machines can install the same binary independently
- [ ] Hook bin_providers filtering works correctly
- [ ] Binary.from_jsonl() handles both dependencies.jsonl and hook output
- [ ] Binary OUTPUT_DIR structure: data/machines/{machine_id}/binaries/{name}/{id}/
517 old/TODO_rename_extractor_to_plugin.md (new file)
# TODO: Rename Extractor to Plugin - Implementation Progress

**Status**: 🟡 In Progress (2/13 phases complete)
**Started**: 2025-12-28
**Estimated Files to Update**: ~150+ files

---

## Progress Overview

### ✅ Completed Phases (2/13)

- [x] **Phase 1**: Database Migration - Created migration 0033
- [x] **Phase 2**: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models

### 🟡 In Progress (1/13)

- [ ] **Phase 3**: Hook Execution System (hooks.py - all function renames)

### ⏳ Pending Phases (10/13)

- [ ] **Phase 4**: JSONL Import/Export (misc/jsonl.py)
- [ ] **Phase 5**: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
- [ ] **Phase 6**: API Endpoints (v1_core.py, v1_cli.py)
- [ ] **Phase 7**: Admin Interface (admin_archiveresults.py, forms.py)
- [ ] **Phase 8**: Views and Templates (views.py, templatetags, progress_monitor.html)
- [ ] **Phase 9**: Worker System (workers/worker.py)
- [ ] **Phase 10**: State Machine (statemachines.py)
- [ ] **Phase 11**: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
- [ ] **Phase 12**: Terminology Standardization (via_extractor→plugin, comments, docstrings)
- [ ] **Phase 13**: Run migrations and verify all tests pass

---

## What's Been Completed So Far

### Phase 1: Database Migration ✅

**File Created**: `archivebox/core/migrations/0033_rename_extractor_add_hook_name.py`

Changes:
- Used `migrations.RenameField()` to rename `extractor` → `plugin`
- Added `hook_name` field (CharField, max_length=255, indexed, default='')
- Preserves all existing data, indexes, and constraints
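A minimal sketch of what that migration looks like; the field options follow the description above, and the `0032` dependency name is a placeholder:

```python
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [
        ('core', '0032_previous_migration'),  # placeholder name for the latest 0032 migration
    ]

    operations = [
        # RenameField preserves existing data, indexes, and constraints
        migrations.RenameField(
            model_name='archiveresult',
            old_name='extractor',
            new_name='plugin',
        ),
        # Stores the full hook filename, e.g. 'on_Snapshot__50_wget.py'
        migrations.AddField(
            model_name='archiveresult',
            name='hook_name',
            field=models.CharField(max_length=255, blank=True, default='', db_index=True),
        ),
    ]
```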
### Phase 2: Core Models ✅

**File Updated**: `archivebox/core/models.py`

#### ArchiveResultManager
- Updated `indexable()` method to use `plugin__in` and `plugin=method`
- Changed reference from `ARCHIVE_METHODS_INDEXING_PRECEDENCE` to `EXTRACTOR_INDEXING_PRECEDENCE`

#### ArchiveResult Model
**Field Changes**:
- Renamed field: `extractor` → `plugin`
- Added field: `hook_name` (stores the full filename like `on_Snapshot__50_wget.py`)
- Updated comments to reference "plugin" instead of "extractor"

**Method Updates**:
- `get_extractor_choices()` → `get_plugin_choices()`
- `__str__()`: Now uses `self.plugin`
- `save()`: Logs `plugin` instead of `extractor`
- `get_absolute_url()`: Uses `self.plugin`
- `extractor_module` property → `plugin_module` property
- `output_exists()`: Checks the `self.plugin` directory
- `embed_path()`: Uses `self.plugin` for paths
- `create_output_dir()`: Creates the `self.plugin` directory
- `output_dir_name`: Returns `self.plugin`
- `run()`: All references to extractor → plugin (including extractor_dir → plugin_dir)
- `update_from_output()`: All references updated to plugin/plugin_dir
- `_update_snapshot_title()`: Parameter renamed to `plugin_dir`
- `trigger_search_indexing()`: Passes `plugin=self.plugin`
- `output_dir` property: Returns the plugin directory
- `is_background_hook()`: Uses `plugin_dir`

#### Snapshot Model
**Method Updates**:
- `create_pending_archiveresults()`: Uses `get_enabled_plugins()`, filters by `plugin=plugin`
- `result_icons` (calc_icons): Maps by `r.plugin`, calls `get_plugin_name()` and `get_plugin_icon()`
- `_merge_archive_results_from_index()`: Maps by `(ar.plugin, ar.start_ts)`, supports both 'extractor' and 'plugin' keys for backwards compat
- `_create_archive_result_if_missing()`: Supports both 'extractor' and 'plugin' keys, creates with `plugin=plugin`
- `write_index_json()`: Writes `'plugin': ar.plugin` in archive_results
- `canonical_outputs()`: Updates `find_best_output_in_dir()` to use `plugin_name`, accesses `result.plugin`, creates keys like `{result.plugin}_path`
- `latest_outputs()`: Uses `get_plugins()`, filters by `plugin=plugin`
- `retry_failed_archiveresults()`: Updated docstring to reference "plugins" instead of "extractors"

**Total Lines Changed in models.py**: ~50+ locations

---
## Full Implementation Plan

# ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field

## Overview
Refactor the ArchiveResult model and standardize terminology across the codebase:
1. Rename the `extractor` field to `plugin` in the ArchiveResult model
2. Add a new `hook_name` field to store the specific hook filename that executed
3. Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
4. Standardize CLI flags from `--extract/--extractors` to `--plugins`
5. **Standardize terminology throughout the codebase**:
   - "parsers" → "parser plugins"
   - "extractors" → "extractor plugins"
   - "parser extractors" → "parser plugins"
   - "archive methods" → "extractor plugins"
   - Document apt/brew/npm/pip as "package manager plugins" in comments

## Current State Analysis

### ArchiveResult Model (archivebox/core/models.py:1679-1750)
```python
class ArchiveResult(ModelWithOutputDir, ...):
    extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
    # New fields from migration 0029:
    output_str, output_json, output_files, output_size, output_mimetypes
    binary = ForeignKey('machine.Binary', ...)
    # No hook_name field yet
```
### Hook Execution Flow
1. `ArchiveResult.run()` discovers hooks for the plugin (e.g., `wget/on_Snapshot__50_wget.py`)
2. `run_hook()` executes each hook script, captures output as a HookResult
3. `update_from_output()` parses JSONL and updates ArchiveResult fields
4. Currently NO tracking of which specific hook file executed

### Field Usage Across Codebase
**extractor field** is used in ~100 locations:
- **Model**: ArchiveResult.extractor field definition, __str__, manager queries
- **CLI**: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
- **API**: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
- **Admin**: admin_archiveresults.py (list filter, display)
- **Views**: core/views.py (archiveresult_objects dict by extractor)
- **Template Tags**: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
- **Hooks**: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
- **JSONL**: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
- **Worker**: workers/worker.py (ArchiveResultWorker filters by extractor)
- **Statemachine**: statemachines.py (logs extractor in state transitions)
|
||||
|
||||
---
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Database Migration (archivebox/core/migrations/) ✅ COMPLETE
|
||||
|
||||
**Create migration 0033_rename_extractor_add_hook_name.py**:
|
||||
1. Rename field: `extractor` → `plugin` (preserve index, constraints)
|
||||
2. Add field: `hook_name` = CharField(max_length=255, blank=True, default='', db_index=True)
|
||||
- **Stores full hook filename**: `on_Snapshot__50_wget.py`, `on_Crawl__10_chrome_session.js`, etc.
|
||||
- Empty string for existing records (data migration sets all to '')
|
||||
3. Update any indexes or constraints that reference extractor
|
||||
|
||||
**Decision**: Full filename chosen for explicitness and easy grep-ability
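
A minimal sketch of what migration 0033 could look like, assuming standard Django operations (the exact 0032 dependency filename is left as a placeholder):

```python
# archivebox/core/migrations/0033_rename_extractor_add_hook_name.py (sketch)
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('core', '0032_...'),  # fill in the actual name of the latest (0032) migration
    ]

    operations = [
        # RenameField preserves existing data, the db_index, and constraints automatically
        migrations.RenameField(
            model_name='archiveresult',
            old_name='extractor',
            new_name='plugin',
        ),
        # Stores the full hook filename, e.g. 'on_Snapshot__50_wget.py';
        # default='' covers all pre-existing records
        migrations.AddField(
            model_name='archiveresult',
            name='hook_name',
            field=models.CharField(max_length=255, blank=True, default='', db_index=True),
        ),
    ]
```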

**Critical Files to Update**:

- ✅ ArchiveResult model field definitions
- ✅ Migration dependencies (latest: 0032)

---

### Phase 2: Core Model Updates (archivebox/core/models.py) ✅ COMPLETE

**ArchiveResult Model** (lines 1679-1820):

- ✅ Rename field: `extractor` → `plugin`
- ✅ Add field: `hook_name = models.CharField(...)`
- ✅ Update __str__: `f'...-> {self.plugin}'`
- ✅ Update absolute_url: Use plugin instead of extractor
- ✅ Update embed_path: Use plugin directory name

**ArchiveResultManager** (lines 1669-1677):

- ✅ Update indexable(): `filter(plugin__in=INDEXABLE_METHODS, ...)`
- ✅ Update precedence: `When(plugin=method, ...)`

**Snapshot Model** (lines 1000-1600):

- ✅ Update canonical_outputs: Access by plugin name
- ✅ Update create_pending_archiveresults: Use plugin parameter
- ✅ All queryset filters: `archiveresult_set.filter(plugin=...)`
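
Taken together, a condensed sketch of the post-rename model and manager (base classes simplified, the precedence list shortened to placeholders, and only the pieces this phase touches shown):

```python
from django.db import models
from django.db.models import Case, IntegerField, Value, When

INDEXABLE_METHODS = ['readability', 'wget']  # placeholder ordering; the real list is a constant

class ArchiveResultManager(models.Manager):
    def indexable(self):
        # rank results by plugin precedence; lowest rank wins for indexing
        precedence = Case(
            *(When(plugin=method, then=Value(rank)) for rank, method in enumerate(INDEXABLE_METHODS)),
            default=Value(1000),
            output_field=IntegerField(),
        )
        return (self.get_queryset()
                    .filter(plugin__in=INDEXABLE_METHODS)
                    .annotate(indexing_precedence=precedence)
                    .order_by('indexing_precedence'))

class ArchiveResult(models.Model):  # real base classes (ModelWithOutputDir, ...) omitted
    plugin = models.CharField(max_length=32, db_index=True)  # renamed from `extractor`
    hook_name = models.CharField(max_length=255, blank=True, default='', db_index=True)

    objects = ArchiveResultManager()

    def __str__(self):
        return f'{self.pk} -> {self.plugin}'
```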

---

### Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS

**Function Renames**:

- [ ] `get_extractors()` → `get_plugins()` (lines 479-504)
- [ ] `get_parser_extractors()` → `get_parser_plugins()` (lines 507-514)
- [ ] `get_extractor_name()` → `get_plugin_name()` (lines 517-530)
- [ ] `is_parser_extractor()` → `is_parser_plugin()` (lines 533-536)
- [ ] `get_enabled_extractors()` → `get_enabled_plugins()` (lines 553-566)
- [ ] `get_extractor_template()` → `get_plugin_template()` (line 1048)
- [ ] `get_extractor_icon()` → `get_plugin_icon()` (line 1068)
- [ ] `get_all_extractor_icons()` → `get_all_plugin_icons()` (line 1092)

**Update HookResult TypedDict** (lines 63-73):

- [ ] Add field: `hook_name: str` to store the hook filename
- [ ] Add field: `plugin: str` (if not already present)

**Update run_hook()** (lines 141-389):

- [ ] **Add hook_name parameter**: Pass the hook filename to be stored in the result
- [ ] Update HookResult to include the hook_name field
- [ ] Update JSONL record output: Add a `hook_name` key
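
A minimal sketch of the planned `run_hook()` change, assuming a HookResult-shaped dict return and with interpreter dispatch and error handling simplified away:

```python
import subprocess
from pathlib import Path

def run_hook(hook_script: Path, **kwargs) -> dict:  # returns a HookResult-shaped dict
    hook_name = hook_script.name                    # e.g. 'on_Snapshot__50_wget.py'
    proc = subprocess.run(                          # .py/.js interpreter dispatch simplified away
        ['python3', str(hook_script)],
        capture_output=True, text=True,
    )
    return {
        'plugin': hook_script.parent.name,  # plugin = directory name, per this doc
        'hook_name': hook_name,             # NEW: recorded on every HookResult and JSONL record
        'returncode': proc.returncode,
        'stdout': proc.stdout,              # JSONL emitted by the hook, parsed by update_from_output()
    }
```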

**Update ArchiveResult.run()** (lines 1838-1914):

- [ ] When calling run_hook, pass the hook filename
- [ ] Store hook_name in ArchiveResult before/after execution

**Update ArchiveResult.update_from_output()** (lines 1916-2073):

- [ ] Parse hook_name from JSONL output
- [ ] Store in self.hook_name field
- [ ] If not present in JSONL, infer from directory/filename

**Constants to Rename**:

- [ ] `ARCHIVE_METHODS_INDEXING_PRECEDENCE` → `EXTRACTOR_INDEXING_PRECEDENCE`

**Comments/Docstrings**: Update all function docstrings to use "plugin" terminology

---

### Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)

**Update archiveresult_to_jsonl()** (lines 173-200):

- [ ] Change key: `'extractor': result.extractor` → `'plugin': result.plugin`
- [ ] Add key: `'hook_name': result.hook_name`

**Update JSONL parsing**:

- [ ] **Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing**
- [ ] Always write 'plugin' key in new exports (never 'extractor')
- [ ] Parse and store hook_name if present (backwards compat: empty if missing)

**Decision**: Support both keys on import for smooth migration, always export new format
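
On the import side, the compatibility shim could be as simple as this sketch (the function name and field list are illustrative):

```python
def archiveresult_fields_from_jsonl(record: dict) -> dict:
    """Accept legacy 'extractor' exports while always preferring the new 'plugin' key."""
    return {
        'plugin': record.get('plugin') or record.get('extractor') or '',
        'hook_name': record.get('hook_name', ''),  # old exports have no hook_name -> store ''
        # ... remaining fields (status, output_str, etc.) unchanged ...
    }
```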

---

### Phase 5: CLI Commands (archivebox/cli/)

**archivebox_extract.py** (lines 1-230):

- [ ] Flag: `--plugin` stays as-is (already correct!)
- [ ] Update internal references: extractor → plugin
- [ ] Update filter: `results.filter(plugin=plugin)`
- [ ] Update display: `result.plugin` (see the sketch below)
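
For illustration, the filter/display changes might reduce to something like this (surrounding command plumbing omitted; a `status` field is assumed on ArchiveResult):

```python
def print_results(snapshot, plugin: str = ''):
    results = snapshot.archiveresult_set.all()
    if plugin:                                   # value of the existing --plugin flag
        results = results.filter(plugin=plugin)  # was: results.filter(extractor=plugin)
    for result in results:
        print(f'{result.plugin:>16}  {result.status}')  # was: result.extractor
```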

**archivebox_add.py**:

- [ ] Rename config key: `'EXTRACTORS': plugins` → `'PLUGINS': plugins` (if not already)

**archivebox_update.py**:

- [ ] Standardize to `--plugins` flag (currently may be --extractors or --extract)

**tests/test_oneshot.py**:

- [ ] Update flag: `--extract=...` → `--plugins=...`

---

### Phase 6: API Endpoints (archivebox/api/)

**v1_core.py** (ArchiveResult API):

- [ ] Update schema field: `extractor: str` → `plugin: str`
- [ ] Update schema field: Add `hook_name: str = ''`
- [ ] Update FilterSchema: `q=[..., 'plugin', ...]`
- [ ] Update extractor filter: `plugin: Optional[str] = Field(None, q='plugin__icontains')`
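
A sketch of the updated django-ninja schemas in v1_core.py (class names assumed; only the affected fields shown):

```python
from typing import Optional

from ninja import Field, FilterSchema, Schema

class ArchiveResultSchema(Schema):
    plugin: str          # renamed from `extractor`
    hook_name: str = ''  # new; empty for legacy records
    # ... other fields unchanged ...

class ArchiveResultFilterSchema(FilterSchema):
    # the free-text q search should also include 'plugin' in its field list
    plugin: Optional[str] = Field(None, q='plugin__icontains')
```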

**v1_cli.py** (CLI API):

- [ ] Rename AddCommandSchema field: `extract: str` → `plugins: str`
- [ ] Rename UpdateCommandSchema field: `extractors: str` → `plugins: str`
- [ ] Update endpoint mapping: `args.plugins` → `plugins` parameter

---

### Phase 7: Admin Interface (archivebox/core/)

**admin_archiveresults.py**:

- [ ] Update all references: extractor → plugin
- [ ] Update list_filter: `'plugin'` instead of `'extractor'`
- [ ] Update ordering: `order_by('plugin')`
- [ ] Rename `get_extractor_icon` → `get_plugin_icon` (if it exists)

**admin_snapshots.py**:

- [ ] Update any commented TODOs referencing extractor

**forms.py**:

- [ ] Rename function: `get_archive_methods()` → `get_plugin_choices()`
- [ ] Update form field: `archive_methods` → `plugins`
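
A sketch of the renamed form helper and field (the form class name here is hypothetical; choice values are assumed to be the plugin names returned by the renamed `get_plugins()`):

```python
from django import forms

def get_plugin_choices():
    """Renamed from get_archive_methods(); choice values are plugin names."""
    from archivebox.hooks import get_plugins  # renamed from get_extractors()
    return [(name, name) for name in sorted(get_plugins())]

class AddLinkForm(forms.Form):            # form name assumed for illustration
    plugins = forms.MultipleChoiceField(  # renamed from `archive_methods`
        choices=get_plugin_choices,
        required=False,
    )
```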

---

### Phase 8: Views and Templates (archivebox/core/)

**views.py**:

- [ ] Update dict building: `archiveresult_objects[result.plugin] = result`
- [ ] Update all extractor references to plugin

**templatetags/core_tags.py**:

- [ ] **Rename template tags (BREAKING CHANGE)**:
  - `extractor_icon()` → `plugin_icon()`
  - `extractor_thumbnail()` → `plugin_thumbnail()`
  - `extractor_embed()` → `plugin_embed()`
- [ ] Update internal: `result.extractor` → `result.plugin`
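
A sketch of one renamed tag (the `get_plugin_icon()` call signature is assumed):

```python
# archivebox/core/templatetags/core_tags.py (sketch of one renamed tag)
from django import template

from archivebox.hooks import get_plugin_icon  # renamed from get_extractor_icon()

register = template.Library()

@register.simple_tag
def plugin_icon(result):
    """Was extractor_icon(); custom templates must switch to {% plugin_icon result %}."""
    return get_plugin_icon(result.plugin)  # was: result.extractor
```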

**Update HTML templates** (if any directly reference extractor):

- [ ] Search for `{{ result.extractor }}` and similar
- [ ] Update to `{{ result.plugin }}`
- [ ] Update template tag calls
- [ ] **CRITICAL**: Update JavaScript in `templates/admin/progress_monitor.html`:
  - Lines 491, 505: Change `extractor.extractor` and `a.extractor` to use the `plugin` field

---

### Phase 9: Worker System (archivebox/workers/worker.py)

**ArchiveResultWorker**:

- [ ] Rename parameter: `extractor` → `plugin` (lines 348, 350)
- [ ] Update filter: `qs.filter(plugin=self.plugin)`
- [ ] Update subprocess passing: Use plugin parameter
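
A condensed sketch of the filter change (the real Worker base class and claiming logic are omitted):

```python
# Sketch only; the real base class lives in archivebox/workers/worker.py
class ArchiveResultWorker:
    def __init__(self, plugin: str, queryset):  # parameter renamed from `extractor`
        self.plugin = plugin
        self.queryset = queryset

    def get_queryset(self):
        # claim only results belonging to this worker's plugin
        return self.queryset.filter(plugin=self.plugin)  # was: .filter(extractor=self.extractor)
```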

---

### Phase 10: State Machine (archivebox/core/statemachines.py)

**ArchiveResultMachine**:

- [ ] Update logging: Use `self.archiveresult.plugin` instead of extractor
- [ ] Update any state metadata that includes the extractor field

---

### Phase 11: Tests and Fixtures

**Update test files**:

- [ ] tests/test_migrations_*.py: Update expected field names in schema definitions
- [ ] tests/test_hooks.py: Update assertions for plugin/hook_name fields
- [ ] archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
- [ ] tests/test_recursive_crawl.py: Update SQL query `WHERE extractor = '60_parse_html_urls'` (line 163)
- [ ] archivebox/cli/tests_piping.py: Update test function names and assertions
- [ ] Any fixtures that create ArchiveResults: Use plugin parameter
- [ ] Any mock objects that set `.extractor` attribute: Change to `.plugin`

---

### Phase 12: Terminology Standardization (NEW)

This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.

**via_extractor → plugin Rename (14 files)**:

- [ ] Rename metadata field `via_extractor` to just `plugin`
- [ ] Files affected:
  - archivebox/hooks.py - Set plugin in run_hook() output
  - archivebox/crawls/models.py - If via_extractor field exists
  - archivebox/cli/archivebox_crawl.py - References to via_extractor
  - All parser plugins that set via_extractor in output
  - Test files with via_extractor assertions
- [ ] Update all JSONL output from parser plugins to use the "plugin" key (see the sketch below)
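
For example, a discovered-URL record emitted by a parser plugin would change shape roughly like this (the hook filename shown is hypothetical):

```python
import json

# One discovered-URL record as a parser plugin might now emit it on stdout:
record = {
    'url': 'https://example.com/article',
    'plugin': 'parse_html_urls',                        # was: 'via_extractor': 'parse_html_urls'
    'hook_name': 'on_Snapshot__60_parse_html_urls.py',  # hypothetical filename
}
print(json.dumps(record))                               # JSONL: one JSON object per line
```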

**Logging Functions (archivebox/misc/logging_util.py)**:

- [ ] `log_archive_method_started()` → `log_extractor_started()` (line 326)
- [ ] `log_archive_method_finished()` → `log_extractor_finished()` (line 330)

**Form Functions (archivebox/core/forms.py)**:

- [ ] `get_archive_methods()` → `get_plugin_choices()` (line 15)
- [ ] Form field `archive_methods` → `plugins` (lines 24, 29)
- [ ] Update form validation and view usage

**Comments and Docstrings (81 files with "extractor" references)**:

- [ ] Update comments to say "extractor plugin" instead of just "extractor"
- [ ] Update comments to say "parser plugin" instead of "parser extractor"
- [ ] All plugin files: Update docstrings to use "extractor plugin" terminology

**Package Manager Plugin Documentation**:

- [ ] Update comments in package manager hook files to say "package manager plugin":
  - archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
  - archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
  - archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
  - archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
  - archivebox/plugins/env/on_Binary__install_using_env_provider.py
  - archivebox/plugins/custom/on_Binary__install_using_custom_bash.py

**String Literals in Error Messages**:

- [ ] Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
- [ ] Search for error messages containing "parser" and update to "parser plugin" where appropriate

---

## Critical Files Summary

### Must Update (Core):

1. ✅ `archivebox/core/models.py` - ArchiveResult, ArchiveResultManager, Snapshot
2. ✅ `archivebox/core/migrations/0033_*.py` - New migration
3. ⏳ `archivebox/hooks.py` - All hook execution and discovery functions
4. ⏳ `archivebox/misc/jsonl.py` - Serialization/deserialization

### Must Update (CLI):

5. ⏳ `archivebox/cli/archivebox_extract.py`
6. ⏳ `archivebox/cli/archivebox_add.py`
7. ⏳ `archivebox/cli/archivebox_update.py`

### Must Update (API):

8. ⏳ `archivebox/api/v1_core.py`
9. ⏳ `archivebox/api/v1_cli.py`

### Must Update (Admin/Views):

10. ⏳ `archivebox/core/admin_archiveresults.py`
11. ⏳ `archivebox/core/views.py`
12. ⏳ `archivebox/core/templatetags/core_tags.py`

### Must Update (Workers/State):

13. ⏳ `archivebox/workers/worker.py`
14. ⏳ `archivebox/core/statemachines.py`

### Must Update (Tests):

15. ⏳ `tests/test_oneshot.py`
16. ⏳ `archivebox/tests/test_hooks.py`
17. ⏳ `archivebox/tests/test_migrations_helpers.py` - Schema SQL definitions
18. ⏳ `tests/test_recursive_crawl.py` - SQL queries with field names
19. ⏳ `archivebox/cli/tests_piping.py` - Test function docstrings

### Must Update (Terminology - Phase 12):

20. ⏳ `archivebox/misc/logging_util.py` - Rename logging functions
21. ⏳ `archivebox/core/forms.py` - Rename form helper and field
22. ⏳ `archivebox/templates/admin/progress_monitor.html` - JavaScript field refs
23. ⏳ All 81 plugin files - Update docstrings and comments
24. ⏳ 28 files with parser terminology - Update comments consistently

---

## Migration Strategy

### Data Migration for Existing Records:

```python
def forwards(apps, schema_editor):
    ArchiveResult = apps.get_model('core', 'ArchiveResult')
    # All existing records get an empty hook_name
    ArchiveResult.objects.all().update(hook_name='')
```

### Backwards Compatibility:

**BREAKING CHANGES** (per user requirements - no backwards compat):

- CLI flags: Hard cutover to `--plugins` (no aliases)
- API fields: `extractor` removed, `plugin` required
- Template tags: All renamed to `plugin_*`

**PARTIAL COMPAT** (for migration):

- JSONL: Write 'plugin', but **accept both 'extractor' and 'plugin' on import**

---

## Testing Checklist

- [ ] Migration 0033 runs successfully on a test database
- [ ] All migration tests pass (test_migrations_*.py)
- [ ] All hook tests pass (test_hooks.py)
- [ ] CLI commands work with the --plugins flag
- [ ] API endpoints return plugin/hook_name fields correctly
- [ ] Admin interface displays plugin correctly
- [ ] Admin progress monitor JavaScript works (no console errors)
- [ ] JSONL export includes both plugin and hook_name
- [ ] JSONL import accepts both 'extractor' and 'plugin' keys
- [ ] Hook execution populates the hook_name field
- [ ] Worker filtering by plugin works
- [ ] Template tags render with new names (plugin_icon, etc.)
- [ ] All renamed functions work correctly
- [ ] SQL queries in tests use correct field names
- [ ] Terminology is consistent across the codebase

---

## Critical Issues to Address

### 1. via_extractor Field (DECISION: RENAME)

- Currently used in 14 files for tracking which parser plugin discovered a URL
- **Decision**: Rename `via_extractor` → `plugin` (not via_plugin, just "plugin")
- **Impact**: Crawler and parser plugin code - 14 files to update
- Files affected:
  - archivebox/hooks.py
  - archivebox/crawls/models.py
  - archivebox/cli/archivebox_crawl.py
  - All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
  - Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
- This creates consistent naming where "plugin" is used for both:
  - ArchiveResult.plugin (which extractor plugin ran)
  - URL discovery metadata "plugin" (which parser plugin discovered this URL)

### 2. Field Size Constraint

- Current: `extractor = CharField(max_length=32)`
- **Decision**: Keep max_length=32 when renaming to plugin
- No size increase needed

### 3. Migration Implementation

- Use `migrations.RenameField('ArchiveResult', 'extractor', 'plugin')` for a clean migration
- Preserves data, indexes, and constraints automatically
- Add the hook_name field in the same migration

---

## Rollout Notes

**Breaking Changes**:

1. CLI: `--extract`, `--extractors` → `--plugins` (no aliases)
2. API: `extractor` field → `plugin` field (no backwards compat)
3. Template tags: `extractor_*` → `plugin_*` (users must update custom templates)
4. Python API: All function names with "extractor" → "plugin" (import changes needed)
5. Form fields: `archive_methods` → `plugins`
6. **via_extractor → plugin** (URL discovery metadata field)

**Migration Required**: Yes - all instances must run migrations before upgrading

**Estimated Impact**: ~150+ files will need updates across the entire codebase

- 81 files: extractor terminology
- 28 files: parser terminology
- 10 files: archive_method legacy terminology
- Plus templates, JavaScript, tests, etc.

---

## Next Steps

1. **Continue with Phase 3**: Update hooks.py with all function renames and hook_name tracking
2. **Then Phase 4**: Update JSONL import/export with backwards compatibility
3. **Then Phases 5-12**: Systematically update all remaining files
4. **Finally**: Run the full test suite and verify everything works

**Note**: The migration can be tested immediately - the migration file is ready to run!