diff --git a/TODO_hook_concurrency.md b/TODO_hook_concurrency.md new file mode 100644 index 00000000..82190e7f --- /dev/null +++ b/TODO_hook_concurrency.md @@ -0,0 +1,486 @@ +# ArchiveBox Hook Script Concurrency & Execution Plan + +## Overview + +Snapshot.run() should enforce that snapshot hooks are run in **10 discrete, sequential "steps"**: `0*`, `1*`, `2*`, `3*`, `4*`, `5*`, `6*`, `7*`, `8*`, `9*`. + +For every discovered hook script, ArchiveBox should create an ArchiveResult in `queued` state, then manage running them using `retry_at` and inline logic to enforce this ordering. + +## Hook Numbering Convention + +Hooks scripts are numbered `00` to `99` to control: +- **First digit (0-9)**: Which step they are part of +- **Second digit (0-9)**: Order within that step + +Hook scripts are launched **strictly sequentially** based on their filename alphabetical order, and run in sets of several per step before moving on to the next step. + +**Naming Format:** +``` +on_{ModelName}__{run_order}_{human_readable_description}[.bg].{ext} +``` + +**Examples:** +``` +on_Snapshot__00_this_would_run_first.sh +on_Snapshot__05_start_ytdlp_download.bg.sh +on_Snapshot__10_chrome_tab_opened.js +on_Snapshot__50_screenshot.js +on_Snapshot__53_media.bg.py +``` + +## Background (.bg) vs Foreground Scripts + +### Foreground Scripts (no .bg suffix) +- Run sequentially within their step +- Block step progression until they exit +- Should exit naturally when work is complete +- Get killed with SIGTERM if they exceed their `PLUGINNAME_TIMEOUT` + +### Background Scripts (.bg suffix) +- Spawned and allowed to continue running +- Do NOT block step progression +- Run until **their own `PLUGINNAME_TIMEOUT` is reached** (not until step 99) +- Get polite SIGTERM when timeout expires, then SIGKILL 60s later if not exited +- Must implement their own concurrency control using filesystem (semaphore files, locks, etc.) +- Should exit naturally when work is complete (best case) + +**Important:** If a .bg script starts at step 05 with `MEDIA_TIMEOUT=3600s`, it gets the full 3600s regardless of when step 99 completes. It runs on its own timeline. + +## Execution Step Guidelines + +These are **naming conventions and guidelines**, not enforced checkpoints. They provide semantic organization for plugin ordering: + +### Step 0: Pre-Setup +``` +00-09: Initial setup, validation, feature detection +``` + +### Step 1: Chrome Launch & Tab Creation +``` +10-19: Browser/tab lifecycle setup +- Chrome browser launch +- Tab creation and CDP connection +``` + +### Step 2: Navigation & Settlement +``` +20-29: Page loading and settling +- Navigate to URL +- Wait for page load +- Initial response capture (responses, ssl, consolelog as .bg listeners) +``` + +### Step 3: Page Adjustment +``` +30-39: DOM manipulation before archiving +- Hide popups/banners +- Solve captchas +- Expand comments/details sections +- Inject custom CSS/JS +- Accessibility modifications +``` + +### Step 4: Ready for Archiving +``` +40-49: Final pre-archiving checks +- Verify page is fully adjusted +- Wait for any pending modifications +``` + +### Step 5: DOM Extraction (Sequential, Non-BG) +``` +50-59: Extractors that need exclusive DOM access +- singlefile (MUST NOT be .bg) +- screenshot (MUST NOT be .bg) +- pdf (MUST NOT be .bg) +- dom (MUST NOT be .bg) +- title +- headers +- readability +- mercury + +These MUST run sequentially as they temporarily modify the DOM +during extraction, then revert it. Running in parallel would corrupt results. 
+``` + +### Step 6: Post-DOM Extraction +``` +60-69: Extractors that don't need DOM or run on downloaded files +- wget +- git +- media (.bg - can run for hours) +- gallerydl (.bg) +- forumdl (.bg) +- papersdl (.bg) +``` + +### Step 7: Chrome Cleanup +``` +70-79: Browser/tab teardown +- Close tabs +- Cleanup Chrome resources +``` + +### Step 8: Post-Processing +``` +80-89: Reprocess outputs from earlier extractors +- OCR of images +- Audio/video transcription +- URL parsing from downloaded content (rss, html, json, txt, csv, md) +- LLM analysis/summarization of outputs +``` + +### Step 9: Indexing & Finalization +``` +90-99: Save to indexes and finalize +- Index text content to Sonic/SQLite FTS +- Create symlinks +- Generate merkle trees +- Final status updates +``` + +## Hook Script Interface + +### Input: CLI Arguments (NOT stdin) +Hooks receive configuration as CLI flags (CSV or JSON-encoded): + +```bash +--url="https://example.com" +--snapshot-id="1234-5678-uuid" +--config='{"some_key": "some_value"}' +--plugins=git,media,favicon,title +--timeout=50 +--enable-something +``` + +### Input: Environment Variables +All configuration comes from env vars, defined in `plugin_dir/config.json` JSONSchema: + +```bash +WGET_BINARY=/usr/bin/wget +WGET_TIMEOUT=60 +WGET_USER_AGENT="Mozilla/5.0..." +WGET_EXTRA_ARGS="--no-check-certificate" +SAVE_WGET=True +``` + +**Required:** Every plugin must support `PLUGINNAME_TIMEOUT` for self-termination. + +### Output: Filesystem (CWD) +Hooks read/write files to: +- `$CWD`: Their own output subdirectory (e.g., `archive/snapshots/{id}/wget/`) +- `$CWD/..`: Parent directory (to read outputs from other hooks) + +This allows hooks to: +- Access files created by other hooks +- Keep their outputs separate by default +- Use semaphore files for coordination (if needed) + +### Output: JSONL to stdout +Hooks emit one JSONL line per database record they want to create or update: + +```jsonl +{"type": "Tag", "name": "sci-fi"} +{"type": "ArchiveResult", "id": "1234-uuid", "status": "succeeded", "output_str": "wget/index.html"} +{"type": "Snapshot", "id": "5678-uuid", "title": "Example Page"} +``` + +See `archivebox/misc/jsonl.py` and model `from_json()` / `from_jsonl()` methods for full list of supported types and fields. + +### Output: stderr for Human Logs +Hooks should emit human-readable output or debug info to **stderr**. There are no guarantees this will be persisted long-term. Use stdout JSONL or filesystem for outputs that matter. + +### Cleanup: Delete Cruft +If hooks emit no meaningful long-term outputs, they should delete any temporary files themselves to avoid wasting space. However, the ArchiveResult DB row should be kept so we know: +- It doesn't need to be retried +- It isn't missing +- What happened (status, error message) + +### Signal Handling: SIGINT/SIGTERM +Hooks are expected to listen for polite `SIGINT`/`SIGTERM` and finish hastily, then exit cleanly. Beyond that, they may be `SIGKILL'd` at ArchiveBox's discretion. + +**If hooks double-fork or spawn long-running processes:** They must output a `.pid` file in their directory so zombies can be swept safely. + +## Hook Failure Modes & Retry Logic + +Hooks can fail in several ways. ArchiveBox handles each differently: + +### 1. Soft Failure (Record & Don't Retry) +**Exit:** `0` (success) +**JSONL:** `{"type": "ArchiveResult", "status": "failed", "output_str": "404 Not Found"}` + +This means: "I ran successfully, but the resource wasn't available." Don't retry this. 
+ +**Use cases:** +- 404 errors +- Content not available +- Feature not applicable to this URL + +### 2. Hard Failure / Temporary Error (Retry Later) +**Exit:** Non-zero (1, 2, etc.) +**JSONL:** None (or incomplete) + +This means: "Something went wrong, I couldn't complete." Treat this ArchiveResult as "missing" and set `retry_at` for later. + +**Use cases:** +- 500 server errors +- Network timeouts +- Binary not found / crashed +- Transient errors + +**Behavior:** +- ArchiveBox sets `retry_at` on the ArchiveResult +- Hook will be retried during next `archivebox update` + +### 3. Partial Success (Update & Continue) +**Exit:** Non-zero +**JSONL:** Partial records emitted before crash + +**Behavior:** +- Update ArchiveResult with whatever was emitted +- Mark remaining work as "missing" with `retry_at` + +### 4. Success (Record & Continue) +**Exit:** `0` +**JSONL:** `{"type": "ArchiveResult", "status": "succeeded", "output_str": "output/file.html"}` + +This is the happy path. + +### Error Handling Rules + +- **DO NOT skip hooks** based on failures +- **Continue to next hook** regardless of foreground or background failures +- **Update ArchiveResults** with whatever information is available +- **Set retry_at** for "missing" or temporarily-failed hooks +- **Let background scripts continue** even if foreground scripts fail + +## File Structure + +``` +archivebox/plugins/{plugin_name}/ +├── config.json # JSONSchema: env var config options +├── binaries.jsonl # Runtime dependencies: apt|brew|pip|npm|env +├── on_Snapshot__XX_name.py # Hook script (foreground) +├── on_Snapshot__XX_name.bg.py # Hook script (background) +└── tests/ + └── test_name.py +``` + +## Implementation Checklist + +### Phase 1: Renumber Existing Hooks ✅ +- [ ] Renumber DOM extractors to 50-59 range +- [ ] Ensure pdf/screenshot are NOT .bg (need sequential access) +- [ ] Ensure media (ytdlp) IS .bg (can run for hours) +- [ ] Add step comments to each plugin for clarity + +### Phase 2: Timeout Consistency ✅ +- [x] All plugins support `PLUGINNAME_TIMEOUT` env var +- [x] All plugins fall back to generic `TIMEOUT` env var +- [x] Background scripts handle SIGTERM gracefully (or exit naturally) + +### Phase 3: Refactor Snapshot.run() +- [ ] Parse hook filenames to extract step number (first digit) +- [ ] Group hooks by step (0-9) +- [ ] Run each step sequentially +- [ ] Within each step: + - [ ] Launch foreground hooks sequentially + - [ ] Launch .bg hooks and track PIDs + - [ ] Wait for foreground hooks to complete before next step +- [ ] Track .bg script timeouts independently +- [ ] Send SIGTERM to .bg scripts when their timeout expires +- [ ] Send SIGKILL 60s after SIGTERM if not exited + +### Phase 4: ArchiveResult Management +- [ ] Create one ArchiveResult per hook (not per plugin) +- [ ] Set initial state to `queued` +- [ ] Update state based on JSONL output and exit code +- [ ] Set `retry_at` for hooks that exit non-zero with no JSONL +- [ ] Don't retry hooks that emit `{"status": "failed"}` + +### Phase 5: JSONL Streaming +- [ ] Parse stdout JSONL line-by-line during hook execution +- [ ] Create/update DB rows as JSONL is emitted (streaming mode) +- [ ] Handle partial JSONL on hook crash + +### Phase 6: Zombie Process Management +- [ ] Read `.pid` files from hook output directories +- [ ] Sweep zombies on cleanup +- [ ] Handle double-forked processes correctly + +## Migration Path + +### Backward Compatibility +During migration, support both old and new numbering: +1. Run hooks numbered 00-99 in step order +2. 
Run unnumbered hooks last (step 9) for compatibility +3. Log warnings for unnumbered hooks +4. Eventually require all hooks to be numbered + +### Renumbering Map + +**Current → New:** +``` +git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py +media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py +gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py +forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py +papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py + +readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py +mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__56_mercury.py + +singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py +screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js +pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js +dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js +title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js +headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js + +wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py +``` + +## Testing Strategy + +### Unit Tests +- Test hook ordering (00-99) +- Test step grouping (first digit) +- Test .bg vs foreground execution +- Test timeout enforcement +- Test JSONL parsing +- Test failure modes & retry_at logic + +### Integration Tests +- Test full Snapshot.run() with mixed hooks +- Test .bg scripts running beyond step 99 +- Test zombie process cleanup +- Test graceful SIGTERM handling +- Test concurrent .bg script coordination + +### Performance Tests +- Measure overhead of per-hook ArchiveResults +- Test with 50+ concurrent .bg scripts +- Test filesystem contention with many hooks + +## Open Questions + +### Q: Should we provide semaphore utilities? +**A:** No. Keep plugins decoupled. Let them use simple filesystem coordination if needed. + +### Q: What happens if ArchiveResult table gets huge? +**A:** We can delete old successful ArchiveResults periodically, or archive them to cold storage. The important data is in the filesystem outputs. + +### Q: Should naturally-exiting .bg scripts still be .bg? +**A:** Yes. The .bg suffix means "don't block step progression," not "run until step 99." Natural exit is the best case. 
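+
+### Q: What does the step-ordering loop in Snapshot.run() roughly look like?
+**A:** A minimal sketch of the Phase 3 grouping/ordering logic is shown below. It is illustrative only: the helper names, the bare `subprocess.Popen` call, and the omission of per-hook timeouts, env/CLI args, and ArchiveResult bookkeeping are simplifications, not the actual `Snapshot.run()` implementation.
+
+```python
+import re
+import subprocess
+from pathlib import Path
+
+HOOK_RE = re.compile(r'^on_Snapshot__(\d)(\d)_.+?(\.bg)?\.(sh|py|js)$')
+
+def group_hooks_by_step(plugin_dirs):
+    """Group discovered hook scripts into steps 0-9 using the first digit of their run_order."""
+    steps = {step: [] for step in range(10)}
+    for plugin_dir in plugin_dirs:
+        for hook in sorted(plugin_dir.glob('on_Snapshot__*')):
+            if HOOK_RE.match(hook.name):
+                steps[int(HOOK_RE.match(hook.name).group(1))].append(hook)
+    return steps
+
+def run_steps(steps, cwd, env):
+    """Run steps 0-9 sequentially: block on foreground hooks, let .bg hooks keep running."""
+    background = []
+    for step in range(10):
+        for hook in sorted(steps[step], key=lambda p: p.name):
+            proc = subprocess.Popen([str(hook)], cwd=cwd, env=env)
+            if '.bg.' in hook.name:
+                background.append(proc)   # tracked; SIGTERM'd later when its own timeout expires
+            else:
+                proc.wait()               # foreground: blocks step progression until exit
+    return background                     # .bg processes may outlive step 9 until PLUGINNAME_TIMEOUT
+```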
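+
+### Q: How do exit codes and JSONL combine into the ArchiveResult status?
+**A:** Roughly per the Hook Failure Modes section above. The sketch below is a simplification; the `'queued'` retry status and the 30-minute delay are placeholders, not real ArchiveBox settings:
+
+```python
+from datetime import datetime, timedelta, timezone
+
+RETRY_DELAY = timedelta(minutes=30)  # placeholder interval, not an actual config option
+
+def resolve_hook_result(exit_code, jsonl_status):
+    """Map a hook's exit code + last emitted ArchiveResult JSONL status to DB fields."""
+    if jsonl_status in ('succeeded', 'failed'):
+        # Hook reported its own outcome (including exit-0 soft failures): record it, never retry
+        return {'status': jsonl_status, 'retry_at': None}
+    if exit_code == 0:
+        return {'status': 'succeeded', 'retry_at': None}
+    # Non-zero exit with no (or partial) JSONL: treat as "missing" and schedule a retry
+    return {'status': 'queued', 'retry_at': datetime.now(timezone.utc) + RETRY_DELAY}
+```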
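+
+### Q: How is hook stdout JSONL consumed while the hook is still running?
+**A:** Phase 5 calls for streaming, line-by-line parsing so partial output survives a crash. A minimal sketch (the record-application step via the model `from_jsonl()` helpers is only referenced, not implemented here):
+
+```python
+import json
+
+def iter_jsonl_records(stdout_stream):
+    """Yield decoded JSONL records from a hook's stdout as they are emitted."""
+    for line in stdout_stream:
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            record = json.loads(line)
+        except json.JSONDecodeError:
+            continue  # tolerate a truncated trailing line if the hook crashed mid-write
+        if isinstance(record, dict) and record.get('type'):
+            yield record  # e.g. {"type": "ArchiveResult", "status": "succeeded", ...}
+```
+
+Each yielded record would be applied to the database as it arrives, so whatever a hook emitted before crashing is still recorded (failure mode 3, Partial Success).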
+ +## Examples + +### Foreground Hook (Sequential DOM Access) +```python +#!/usr/bin/env python3 +# archivebox/plugins/screenshot/on_Snapshot__51_screenshot.js + +# Runs at step 5, blocks step progression until complete +# Gets killed if it exceeds SCREENSHOT_TIMEOUT + +timeout = get_env_int('SCREENSHOT_TIMEOUT') or get_env_int('TIMEOUT', 60) + +try: + result = subprocess.run(cmd, capture_output=True, timeout=timeout) + if result.returncode == 0: + print(json.dumps({ + "type": "ArchiveResult", + "status": "succeeded", + "output_str": "screenshot.png" + })) + sys.exit(0) + else: + # Temporary failure - will be retried + sys.exit(1) +except subprocess.TimeoutExpired: + # Timeout - will be retried + sys.exit(1) +``` + +### Background Hook (Long-Running Download) +```python +#!/usr/bin/env python3 +# archivebox/plugins/media/on_Snapshot__63_media.bg.py + +# Runs at step 6, doesn't block step progression +# Gets full MEDIA_TIMEOUT (e.g., 3600s) regardless of when step 99 completes + +timeout = get_env_int('YTDLP_TIMEOUT') or get_env_int('MEDIA_TIMEOUT') or get_env_int('TIMEOUT', 3600) + +try: + result = subprocess.run(['yt-dlp', url], capture_output=True, timeout=timeout) + if result.returncode == 0: + print(json.dumps({ + "type": "ArchiveResult", + "status": "succeeded", + "output_str": "media/" + })) + sys.exit(0) + else: + # Hard failure - don't retry + print(json.dumps({ + "type": "ArchiveResult", + "status": "failed", + "output_str": "Video unavailable" + })) + sys.exit(0) # Exit 0 to record the failure +except subprocess.TimeoutExpired: + # Timeout - will be retried + sys.exit(1) +``` + +### Background Hook with Natural Exit +```javascript +#!/usr/bin/env node +// archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js + +// Sets up listener, captures SSL info, then exits naturally +// No SIGTERM handler needed - already exits when done + +async function main() { + const page = await connectToChrome(); + + // Set up listener + page.on('response', async (response) => { + const securityDetails = response.securityDetails(); + if (securityDetails) { + fs.writeFileSync('ssl.json', JSON.stringify(securityDetails)); + } + }); + + // Wait for navigation (done by other hook) + await waitForNavigation(); + + // Emit result + console.log(JSON.stringify({ + type: 'ArchiveResult', + status: 'succeeded', + output_str: 'ssl.json' + })); + + process.exit(0); // Natural exit - no await indefinitely +} + +main().catch(e => { + console.error(`ERROR: ${e.message}`); + process.exit(1); // Will be retried +}); +``` + +## Summary + +This plan provides: +- ✅ Clear execution ordering (10 steps, 00-99 numbering) +- ✅ Async support (.bg suffix) +- ✅ Independent timeout control per plugin +- ✅ Flexible failure handling & retry logic +- ✅ Streaming JSONL output for DB updates +- ✅ Simple filesystem-based coordination +- ✅ Backward compatibility during migration + +The main implementation work is refactoring `Snapshot.run()` to enforce step ordering and manage .bg script lifecycles. Plugin renumbering is straightforward mechanical work. 
diff --git a/archivebox/api/v1_cli.py b/archivebox/api/v1_cli.py index 9282acce..3359ca54 100644 --- a/archivebox/api/v1_cli.py +++ b/archivebox/api/v1_cli.py @@ -54,7 +54,7 @@ class AddCommandSchema(Schema): tag: str = "" depth: int = 0 parser: str = "auto" - extract: str = "" + plugins: str = "" update: bool = not ARCHIVING_CONFIG.ONLY_NEW # Default to the opposite of ARCHIVING_CONFIG.ONLY_NEW overwrite: bool = False index_only: bool = False @@ -69,7 +69,7 @@ class UpdateCommandSchema(Schema): status: Optional[StatusChoices] = StatusChoices.unarchived filter_type: Optional[str] = FilterTypeChoices.substring filter_patterns: Optional[List[str]] = ['https://example.com'] - extractors: Optional[str] = "" + plugins: Optional[str] = "" class ScheduleCommandSchema(Schema): import_path: Optional[str] = None @@ -115,7 +115,7 @@ def cli_add(request, args: AddCommandSchema): update=args.update, index_only=args.index_only, overwrite=args.overwrite, - plugins=args.extract, # extract in API maps to plugins param + plugins=args.plugins, parser=args.parser, bg=True, # Always run in background for API calls ) @@ -143,7 +143,7 @@ def cli_update(request, args: UpdateCommandSchema): status=args.status, filter_type=args.filter_type, filter_patterns=args.filter_patterns, - extractors=args.extractors, + plugins=args.plugins, ) return { "success": True, diff --git a/archivebox/api/v1_core.py b/archivebox/api/v1_core.py index 7f4f4f37..3d83d710 100644 --- a/archivebox/api/v1_core.py +++ b/archivebox/api/v1_core.py @@ -65,7 +65,8 @@ class MinimalArchiveResultSchema(Schema): created_by_username: str status: str retry_at: datetime | None - extractor: str + plugin: str + hook_name: str cmd_version: str | None cmd: list[str] | None pwd: str | None @@ -113,13 +114,14 @@ class ArchiveResultSchema(MinimalArchiveResultSchema): class ArchiveResultFilterSchema(FilterSchema): id: Optional[str] = Field(None, q=['id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith']) - search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'extractor', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith']) + search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'plugin', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith']) snapshot_id: Optional[str] = Field(None, q=['snapshot__id__startswith', 'snapshot__timestamp__startswith']) snapshot_url: Optional[str] = Field(None, q='snapshot__url__icontains') snapshot_tag: Optional[str] = Field(None, q='snapshot__tags__name__icontains') status: Optional[str] = Field(None, q='status') output_str: Optional[str] = Field(None, q='output_str__icontains') - extractor: Optional[str] = Field(None, q='extractor__icontains') + plugin: Optional[str] = Field(None, q='plugin__icontains') + hook_name: Optional[str] = Field(None, q='hook_name__icontains') cmd: Optional[str] = Field(None, q='cmd__0__icontains') pwd: Optional[str] = Field(None, q='pwd__icontains') cmd_version: Optional[str] = Field(None, q='cmd_version') diff --git a/archivebox/cli/archivebox_add.py b/archivebox/cli/archivebox_add.py index 4a848d13..f868787d 100644 --- a/archivebox/cli/archivebox_add.py +++ b/archivebox/cli/archivebox_add.py @@ -86,7 +86,7 @@ def add(urls: str | list[str], 'ONLY_NEW': not update, 'INDEX_ONLY': index_only, 'OVERWRITE': overwrite, - 
'EXTRACTORS': plugins, + 'PLUGINS': plugins, 'DEFAULT_PERSONA': persona or 'Default', 'PARSER': parser, } diff --git a/archivebox/cli/archivebox_extract.py b/archivebox/cli/archivebox_extract.py index 7ebdc385..45eeb331 100644 --- a/archivebox/cli/archivebox_extract.py +++ b/archivebox/cli/archivebox_extract.py @@ -40,7 +40,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int: """ Run extraction for a single ArchiveResult by ID (used by workers). - Triggers the ArchiveResult's state machine tick() to run the extractor. + Triggers the ArchiveResult's state machine tick() to run the extractor plugin. """ from rich import print as rprint from core.models import ArchiveResult @@ -51,7 +51,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int: rprint(f'[red]ArchiveResult {archiveresult_id} not found[/red]', file=sys.stderr) return 1 - rprint(f'[blue]Extracting {archiveresult.extractor} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr) + rprint(f'[blue]Extracting {archiveresult.plugin} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr) try: # Trigger state machine tick - this runs the actual extraction @@ -151,7 +151,7 @@ def run_plugins( # Only create for specific plugin result, created = ArchiveResult.objects.get_or_create( snapshot=snapshot, - extractor=plugin, + plugin=plugin, defaults={ 'status': ArchiveResult.StatusChoices.QUEUED, 'retry_at': timezone.now(), @@ -193,7 +193,7 @@ def run_plugins( snapshot = Snapshot.objects.get(id=snapshot_id) results = snapshot.archiveresult_set.all() if plugin: - results = results.filter(extractor=plugin) + results = results.filter(plugin=plugin) for result in results: if is_tty: @@ -202,7 +202,7 @@ def run_plugins( 'failed': 'red', 'skipped': 'yellow', }.get(result.status, 'dim') - rprint(f' [{status_color}]{result.status}[/{status_color}] {result.extractor} → {result.output_str or ""}', file=sys.stderr) + rprint(f' [{status_color}]{result.status}[/{status_color}] {result.plugin} → {result.output_str or ""}', file=sys.stderr) else: write_record(archiveresult_to_jsonl(result)) except Snapshot.DoesNotExist: diff --git a/archivebox/core/admin_archiveresults.py b/archivebox/core/admin_archiveresults.py index 749170ab..1acaf27a 100644 --- a/archivebox/core/admin_archiveresults.py +++ b/archivebox/core/admin_archiveresults.py @@ -13,16 +13,16 @@ from archivebox.config import DATA_DIR from archivebox.config.common import SERVER_CONFIG from archivebox.misc.paginators import AccelleratedPaginator from archivebox.base_models.admin import BaseModelAdmin -from archivebox.hooks import get_extractor_icon +from archivebox.hooks import get_plugin_icon from core.models import ArchiveResult, Snapshot def render_archiveresults_list(archiveresults_qs, limit=50): - """Render a nice inline list view of archive results with status, extractor, output, and actions.""" + """Render a nice inline list view of archive results with status, plugin, output, and actions.""" - results = list(archiveresults_qs.order_by('extractor').select_related('snapshot')[:limit]) + results = list(archiveresults_qs.order_by('plugin').select_related('snapshot')[:limit]) if not results: return mark_safe('
No Archive Results yet...
') @@ -40,8 +40,8 @@ def render_archiveresults_list(archiveresults_qs, limit=50): status = result.status or 'queued' color, bg = status_colors.get(status, ('#6b7280', '#f3f4f6')) - # Get extractor icon - icon = get_extractor_icon(result.extractor) + # Get plugin icon + icon = get_plugin_icon(result.plugin) # Format timestamp end_time = result.end_ts.strftime('%Y-%m-%d %H:%M:%S') if result.end_ts else '-' @@ -79,7 +79,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50): font-size: 11px; font-weight: 600; text-transform: uppercase; color: {color}; background: {bg};">{status} - + {icon} @@ -88,7 +88,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50): title="View output fullscreen" onmouseover="this.style.color='#2563eb'; this.style.textDecoration='underline';" onmouseout="this.style.color='#334155'; this.style.textDecoration='none';"> - {result.extractor} + {result.plugin} @@ -162,7 +162,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50): ID Status - Extractor + Plugin Output Completed Version @@ -185,9 +185,9 @@ class ArchiveResultInline(admin.TabularInline): parent_model = Snapshot # fk_name = 'snapshot' extra = 0 - sort_fields = ('end_ts', 'extractor', 'output_str', 'status', 'cmd_version') + sort_fields = ('end_ts', 'plugin', 'output_str', 'status', 'cmd_version') readonly_fields = ('id', 'result_id', 'completed', 'command', 'version') - fields = ('start_ts', 'end_ts', *readonly_fields, 'extractor', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str') + fields = ('start_ts', 'end_ts', *readonly_fields, 'plugin', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str') # exclude = ('id',) ordering = ('end_ts',) show_change_link = True @@ -253,9 +253,9 @@ class ArchiveResultInline(admin.TabularInline): class ArchiveResultAdmin(BaseModelAdmin): list_display = ('id', 'created_by', 'created_at', 'snapshot_info', 'tags_str', 'status', 'extractor_with_icon', 'cmd_str', 'output_str') - sort_fields = ('id', 'created_by', 'created_at', 'extractor', 'status') + sort_fields = ('id', 'created_by', 'created_at', 'plugin', 'status') readonly_fields = ('cmd_str', 'snapshot_info', 'tags_str', 'created_at', 'modified_at', 'output_summary', 'extractor_with_icon', 'iface') - search_fields = ('id', 'snapshot__url', 'extractor', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp') + search_fields = ('id', 'snapshot__url', 'plugin', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp') autocomplete_fields = ['snapshot'] fieldsets = ( @@ -263,8 +263,8 @@ class ArchiveResultAdmin(BaseModelAdmin): 'fields': ('snapshot', 'snapshot_info', 'tags_str'), 'classes': ('card', 'wide'), }), - ('Extractor', { - 'fields': ('extractor', 'extractor_with_icon', 'status', 'retry_at', 'iface'), + ('Plugin', { + 'fields': ('plugin', 'plugin_with_icon', 'status', 'retry_at', 'iface'), 'classes': ('card',), }), ('Timing', { @@ -285,7 +285,7 @@ class ArchiveResultAdmin(BaseModelAdmin): }), ) - list_filter = ('status', 'extractor', 'start_ts', 'cmd_version') + list_filter = ('status', 'plugin', 'start_ts', 'cmd_version') ordering = ['-start_ts'] list_per_page = SERVER_CONFIG.SNAPSHOTS_PER_PAGE @@ -321,14 +321,14 @@ class ArchiveResultAdmin(BaseModelAdmin): def tags_str(self, result): return result.snapshot.tags_str() - @admin.display(description='Extractor', ordering='extractor') - def extractor_with_icon(self, result): - icon = get_extractor_icon(result.extractor) + @admin.display(description='Plugin', ordering='plugin') + def 
plugin_with_icon(self, result): + icon = get_plugin_icon(result.plugin) return format_html( '{} {}', - result.extractor, + result.plugin, icon, - result.extractor, + result.plugin, ) def cmd_str(self, result): diff --git a/archivebox/core/forms.py b/archivebox/core/forms.py index a4390d96..4aa2fb9e 100644 --- a/archivebox/core/forms.py +++ b/archivebox/core/forms.py @@ -10,19 +10,19 @@ DEPTH_CHOICES = ( ('1', 'depth = 1 (archive these URLs and all URLs one hop away)'), ) -from archivebox.hooks import get_extractors +from archivebox.hooks import get_plugins -def get_archive_methods(): - """Get available archive methods from discovered hooks.""" - return [(name, name) for name in get_extractors()] +def get_plugin_choices(): + """Get available extractor plugins from discovered hooks.""" + return [(name, name) for name in get_plugins()] class AddLinkForm(forms.Form): url = forms.RegexField(label="URLs (one per line)", regex=URL_REGEX, min_length='6', strip=True, widget=forms.Textarea, required=True) tag = forms.CharField(label="Tags (comma separated tag1,tag2,tag3)", strip=True, required=False) depth = forms.ChoiceField(label="Archive depth", choices=DEPTH_CHOICES, initial='0', widget=forms.RadioSelect(attrs={"class": "depth-selection"})) - archive_methods = forms.MultipleChoiceField( - label="Archive methods (select at least 1, otherwise all will be used by default)", + plugins = forms.MultipleChoiceField( + label="Plugins (select at least 1, otherwise all will be used by default)", required=False, widget=forms.SelectMultiple, choices=[], # populated dynamically in __init__ @@ -30,7 +30,7 @@ class AddLinkForm(forms.Form): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) - self.fields['archive_methods'].choices = get_archive_methods() + self.fields['plugins'].choices = get_plugin_choices() # TODO: hook these up to the view and put them # in a collapsible UI section labeled "Advanced" # diff --git a/archivebox/core/models.py b/archivebox/core/models.py index 928abf80..673c85a9 100755 --- a/archivebox/core/models.py +++ b/archivebox/core/models.py @@ -867,6 +867,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea ArchiveResult.objects.create( snapshot=self, plugin=plugin, + hook_name=result_data.get('hook_name', ''), status=result_data.get('status', 'failed'), output_str=result_data.get('output', ''), cmd=result_data.get('cmd', []), @@ -1162,7 +1163,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea continue pid_file = plugin_dir / 'hook.pid' if pid_file.exists(): - kill_process(pid_file) + kill_process(pid_file, validate=True) # Use validation # Update the ArchiveResult from filesystem plugin_name = plugin_dir.name diff --git a/archivebox/core/statemachines.py b/archivebox/core/statemachines.py index 81c453aa..89dda0c8 100644 --- a/archivebox/core/statemachines.py +++ b/archivebox/core/statemachines.py @@ -192,7 +192,7 @@ class ArchiveResultMachine(StateMachine, strict_states=True): def is_backoff(self) -> bool: """Check if we should backoff and retry later.""" - # Backoff if status is still started (extractor didn't complete) and output_str is empty + # Backoff if status is still started (plugin didn't complete) and output_str is empty return ( self.archiveresult.status == ArchiveResult.StatusChoices.STARTED and not self.archiveresult.output_str @@ -222,19 +222,19 @@ class ArchiveResultMachine(StateMachine, strict_states=True): # Suppressed: state transition logs # Lock the object and mark start time 
self.archiveresult.update_for_workers( - retry_at=timezone.now() + timedelta(seconds=120), # 2 min timeout for extractor + retry_at=timezone.now() + timedelta(seconds=120), # 2 min timeout for plugin status=ArchiveResult.StatusChoices.STARTED, start_ts=timezone.now(), iface=NetworkInterface.current(), ) - # Run the extractor - this updates status, output, timestamps, etc. + # Run the plugin - this updates status, output, timestamps, etc. self.archiveresult.run() # Save the updated result self.archiveresult.save() - # Suppressed: extractor result logs (already logged by worker) + # Suppressed: plugin result logs (already logged by worker) @backoff.enter def enter_backoff(self): diff --git a/archivebox/core/templatetags/core_tags.py b/archivebox/core/templatetags/core_tags.py index 33a620c0..685665a4 100644 --- a/archivebox/core/templatetags/core_tags.py +++ b/archivebox/core/templatetags/core_tags.py @@ -5,7 +5,7 @@ from django.utils.safestring import mark_safe from typing import Union from archivebox.hooks import ( - get_extractor_icon, get_extractor_template, get_extractor_name, + get_plugin_icon, get_plugin_template, get_plugin_name, ) @@ -51,30 +51,30 @@ def url_replace(context, **kwargs): @register.simple_tag -def extractor_icon(extractor: str) -> str: +def plugin_icon(plugin: str) -> str: """ - Render the icon for an extractor. + Render the icon for a plugin. - Usage: {% extractor_icon "screenshot" %} + Usage: {% plugin_icon "screenshot" %} """ - return mark_safe(get_extractor_icon(extractor)) + return mark_safe(get_plugin_icon(plugin)) @register.simple_tag(takes_context=True) -def extractor_thumbnail(context, result) -> str: +def plugin_thumbnail(context, result) -> str: """ Render the thumbnail template for an archive result. - Usage: {% extractor_thumbnail result %} + Usage: {% plugin_thumbnail result %} Context variables passed to template: - result: ArchiveResult object - snapshot: Parent Snapshot object - output_path: Path to output relative to snapshot dir (from embed_path()) - - extractor: Extractor base name + - plugin: Plugin base name """ - extractor = get_extractor_name(result.extractor) - template_str = get_extractor_template(extractor, 'thumbnail') + plugin = get_plugin_name(result.plugin) + template_str = get_plugin_template(plugin, 'thumbnail') if not template_str: return '' @@ -89,7 +89,7 @@ def extractor_thumbnail(context, result) -> str: 'result': result, 'snapshot': result.snapshot, 'output_path': output_path, - 'extractor': extractor, + 'plugin': plugin, }) return mark_safe(tpl.render(ctx)) except Exception: @@ -97,14 +97,14 @@ def extractor_thumbnail(context, result) -> str: @register.simple_tag(takes_context=True) -def extractor_embed(context, result) -> str: +def plugin_embed(context, result) -> str: """ Render the embed iframe template for an archive result. 
- Usage: {% extractor_embed result %} + Usage: {% plugin_embed result %} """ - extractor = get_extractor_name(result.extractor) - template_str = get_extractor_template(extractor, 'embed') + plugin = get_plugin_name(result.plugin) + template_str = get_plugin_template(plugin, 'embed') if not template_str: return '' @@ -117,7 +117,7 @@ def extractor_embed(context, result) -> str: 'result': result, 'snapshot': result.snapshot, 'output_path': output_path, - 'extractor': extractor, + 'plugin': plugin, }) return mark_safe(tpl.render(ctx)) except Exception: @@ -125,14 +125,14 @@ def extractor_embed(context, result) -> str: @register.simple_tag(takes_context=True) -def extractor_fullscreen(context, result) -> str: +def plugin_fullscreen(context, result) -> str: """ Render the fullscreen template for an archive result. - Usage: {% extractor_fullscreen result %} + Usage: {% plugin_fullscreen result %} """ - extractor = get_extractor_name(result.extractor) - template_str = get_extractor_template(extractor, 'fullscreen') + plugin = get_plugin_name(result.plugin) + template_str = get_plugin_template(plugin, 'fullscreen') if not template_str: return '' @@ -145,7 +145,7 @@ def extractor_fullscreen(context, result) -> str: 'result': result, 'snapshot': result.snapshot, 'output_path': output_path, - 'extractor': extractor, + 'plugin': plugin, }) return mark_safe(tpl.render(ctx)) except Exception: @@ -153,10 +153,10 @@ def extractor_fullscreen(context, result) -> str: @register.filter -def extractor_name(value: str) -> str: +def plugin_name(value: str) -> str: """ - Get the base name of an extractor (strips numeric prefix). + Get the base name of a plugin (strips numeric prefix). - Usage: {{ result.extractor|extractor_name }} + Usage: {{ result.plugin|plugin_name }} """ - return get_extractor_name(value) + return get_plugin_name(value) diff --git a/archivebox/core/views.py b/archivebox/core/views.py index 1ffb20b8..df17924a 100644 --- a/archivebox/core/views.py +++ b/archivebox/core/views.py @@ -56,9 +56,9 @@ class SnapshotView(View): def render_live_index(request, snapshot): TITLE_LOADING_MSG = 'Not yet archived...' 
- # Dict of extractor -> ArchiveResult object + # Dict of plugin -> ArchiveResult object archiveresult_objects = {} - # Dict of extractor -> result info dict (for template compatibility) + # Dict of plugin -> result info dict (for template compatibility) archiveresults = {} results = snapshot.archiveresult_set.all() @@ -75,16 +75,16 @@ class SnapshotView(View): continue # Store the full ArchiveResult object for template tags - archiveresult_objects[result.extractor] = result + archiveresult_objects[result.plugin] = result result_info = { - 'name': result.extractor, + 'name': result.plugin, 'path': embed_path, 'ts': ts_to_date_str(result.end_ts), 'size': abs_path.stat().st_size or '?', 'result': result, # Include the full object for template tags } - archiveresults[result.extractor] = result_info + archiveresults[result.plugin] = result_info # Use canonical_outputs for intelligent discovery # This method now scans ArchiveResults and uses smart heuristics @@ -119,8 +119,8 @@ class SnapshotView(View): # Get available extractors from hooks (sorted by numeric prefix for ordering) # Convert to base names for display ordering - all_extractors = [get_extractor_name(e) for e in get_extractors()] - preferred_types = tuple(all_extractors) + all_plugins = [get_extractor_name(e) for e in get_extractors()] + preferred_types = tuple(all_plugins) all_types = preferred_types + tuple(result_type for result_type in archiveresults.keys() if result_type not in preferred_types) best_result = {'path': 'None', 'result': None} @@ -463,7 +463,6 @@ class AddView(UserPassesTestMixin, FormView): urls_content = sources_file.read_text() crawl = Crawl.objects.create( urls=urls_content, - extractor=parser, max_depth=depth, tags_str=tag, label=f'{self.request.user.username}@{HOSTNAME}{self.request.path} {timestamp}', @@ -598,12 +597,12 @@ def live_progress_view(request): ArchiveResult.StatusChoices.SUCCEEDED: 2, ArchiveResult.StatusChoices.FAILED: 3, } - return (status_order.get(ar.status, 4), ar.extractor) + return (status_order.get(ar.status, 4), ar.plugin) - all_extractors = [ + all_plugins = [ { 'id': str(ar.id), - 'extractor': ar.extractor, + 'plugin': ar.plugin, 'status': ar.status, } for ar in sorted(snapshot_results, key=extractor_sort_key) @@ -619,7 +618,7 @@ def live_progress_view(request): 'completed_extractors': completed_extractors, 'failed_extractors': failed_extractors, 'pending_extractors': pending_extractors, - 'all_extractors': all_extractors, + 'all_plugins': all_plugins, }) # Check if crawl can start (for debugging stuck crawls) diff --git a/archivebox/crawls/models.py b/archivebox/crawls/models.py index 3ce21d99..b143f13f 100755 --- a/archivebox/crawls/models.py +++ b/archivebox/crawls/models.py @@ -353,10 +353,18 @@ class Crawl(ModelWithOutputDir, ModelWithConfig, ModelWithHealthStats, ModelWith import signal from pathlib import Path from archivebox.hooks import run_hook, discover_hooks + from archivebox.misc.process_utils import validate_pid_file # Kill any background processes by scanning for all .pid files if self.OUTPUT_DIR.exists(): for pid_file in self.OUTPUT_DIR.glob('**/*.pid'): + # Validate PID before killing to avoid killing unrelated processes + cmd_file = pid_file.parent / 'cmd.sh' + if not validate_pid_file(pid_file, cmd_file): + # PID reused by different process or process dead + pid_file.unlink(missing_ok=True) + continue + try: pid = int(pid_file.read_text().strip()) try: diff --git a/archivebox/hooks.py b/archivebox/hooks.py index e308dc51..aff3ea22 100644 --- a/archivebox/hooks.py 
+++ b/archivebox/hooks.py @@ -292,8 +292,13 @@ def run_hook( stdout_file = output_dir / 'stdout.log' stderr_file = output_dir / 'stderr.log' pid_file = output_dir / 'hook.pid' + cmd_file = output_dir / 'cmd.sh' try: + # Write command script for validation + from archivebox.misc.process_utils import write_cmd_file + write_cmd_file(cmd_file, cmd) + # Open log files for writing with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err: process = subprocess.Popen( @@ -304,8 +309,10 @@ def run_hook( env=env, ) - # Write PID for all hooks (useful for debugging/cleanup) - pid_file.write_text(str(process.pid)) + # Write PID with mtime set to process start time for validation + from archivebox.misc.process_utils import write_pid_file_with_mtime + process_start_time = time.time() + write_pid_file_with_mtime(pid_file, process.pid, process_start_time) if is_background: # Background hook - return None immediately, don't wait @@ -1327,21 +1334,29 @@ def process_is_alive(pid_file: Path) -> bool: return False -def kill_process(pid_file: Path, sig: int = signal.SIGTERM): +def kill_process(pid_file: Path, sig: int = signal.SIGTERM, validate: bool = True): """ - Kill process in PID file. + Kill process in PID file with optional validation. Args: pid_file: Path to hook.pid file sig: Signal to send (default SIGTERM) + validate: If True, validate process identity before killing (default: True) """ - if not pid_file.exists(): - return - - try: - pid = int(pid_file.read_text().strip()) - os.kill(pid, sig) - except (OSError, ValueError): - pass + from archivebox.misc.process_utils import safe_kill_process + + if validate: + # Use safe kill with validation + cmd_file = pid_file.parent / 'cmd.sh' + safe_kill_process(pid_file, cmd_file, signal_num=sig, validate=True) + else: + # Legacy behavior - kill without validation + if not pid_file.exists(): + return + try: + pid = int(pid_file.read_text().strip()) + os.kill(pid, sig) + except (OSError, ValueError): + pass diff --git a/archivebox/misc/jsonl.py b/archivebox/misc/jsonl.py index 50cbd3e5..3e9f6e97 100644 --- a/archivebox/misc/jsonl.py +++ b/archivebox/misc/jsonl.py @@ -178,7 +178,8 @@ def archiveresult_to_jsonl(result) -> Dict[str, Any]: 'type': TYPE_ARCHIVERESULT, 'id': str(result.id), 'snapshot_id': str(result.snapshot_id), - 'extractor': result.extractor, + 'plugin': result.plugin, + 'hook_name': result.hook_name, 'status': result.status, 'output_str': result.output_str, 'start_ts': result.start_ts.isoformat() if result.start_ts else None, diff --git a/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js b/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js index 7ee41eda..3ae9a039 100644 --- a/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js +++ b/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js @@ -29,6 +29,98 @@ const http = require('http'); const EXTRACTOR_NAME = 'chrome_launch'; const OUTPUT_DIR = 'chrome'; +// Helper: Write PID file with mtime set to process start time +function writePidWithMtime(filePath, pid, startTimeSeconds) { + fs.writeFileSync(filePath, String(pid)); + // Set both atime and mtime to process start time for validation + const startTimeMs = startTimeSeconds * 1000; + fs.utimesSync(filePath, new Date(startTimeMs), new Date(startTimeMs)); +} + +// Helper: Write command script for validation +function writeCmdScript(filePath, binary, args) { + // Shell escape arguments containing spaces or special characters + const escapedArgs = args.map(arg => { + if (arg.includes(' ') || arg.includes('"') || 
arg.includes('$')) { + return `"${arg.replace(/"/g, '\\"')}"`; + } + return arg; + }); + const script = `#!/bin/bash\n${binary} ${escapedArgs.join(' ')}\n`; + fs.writeFileSync(filePath, script); + fs.chmodSync(filePath, 0o755); +} + +// Helper: Get process start time (cross-platform) +function getProcessStartTime(pid) { + try { + const { execSync } = require('child_process'); + if (process.platform === 'darwin') { + // macOS: ps -p PID -o lstart= gives start time + const output = execSync(`ps -p ${pid} -o lstart=`, { encoding: 'utf8', timeout: 1000 }); + return Date.parse(output.trim()) / 1000; // Convert to epoch seconds + } else { + // Linux: read /proc/PID/stat field 22 (starttime in clock ticks) + const stat = fs.readFileSync(`/proc/${pid}/stat`, 'utf8'); + const match = stat.match(/\) \w+ (\d+)/); + if (match) { + const startTicks = parseInt(match[1], 10); + // Convert clock ticks to seconds (assuming 100 ticks/sec) + const uptimeSeconds = parseFloat(fs.readFileSync('/proc/uptime', 'utf8').split(' ')[0]); + const bootTime = Date.now() / 1000 - uptimeSeconds; + return bootTime + (startTicks / 100); + } + } + } catch (e) { + // Can't get start time + return null; + } + return null; +} + +// Helper: Validate PID using mtime and command +function validatePid(pid, pidFile, cmdFile) { + try { + // Check process exists + try { + process.kill(pid, 0); // Signal 0 = check existence + } catch (e) { + return false; // Process doesn't exist + } + + // Check mtime matches process start time (within 5 sec tolerance) + const fileStat = fs.statSync(pidFile); + const fileMtime = fileStat.mtimeMs / 1000; // Convert to seconds + const procStartTime = getProcessStartTime(pid); + + if (procStartTime === null) { + // Can't validate - fall back to basic existence check + return true; + } + + if (Math.abs(fileMtime - procStartTime) > 5) { + // PID was reused by different process + return false; + } + + // Validate command if available + if (fs.existsSync(cmdFile)) { + const cmd = fs.readFileSync(cmdFile, 'utf8'); + // Check for Chrome/Chromium and debug port + if (!cmd.includes('chrome') && !cmd.includes('chromium')) { + return false; + } + if (!cmd.includes('--remote-debugging-port')) { + return false; + } + } + + return true; + } catch (e) { + return false; + } +} + // Global state for cleanup let chromePid = null; @@ -240,17 +332,17 @@ function killZombieChrome() { const pid = parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10); if (isNaN(pid) || pid <= 0) continue; - // Check if process exists - try { - process.kill(pid, 0); - } catch (e) { - // Process dead, remove stale PID file + // Validate PID before killing + const cmdFile = path.join(chromeDir, 'cmd.sh'); + if (!validatePid(pid, pidFile, cmdFile)) { + // PID reused or validation failed + console.error(`[!] PID ${pid} failed validation (reused or wrong process) - cleaning up`); try { fs.unlinkSync(pidFile); } catch (e) {} continue; } - // Process alive but crawl is stale - zombie! - console.error(`[!] Found zombie (PID ${pid}) from stale crawl ${crawl.name}`); + // Process alive, validated, and crawl is stale - zombie! + console.error(`[!] 
Found validated zombie (PID ${pid}) from stale crawl ${crawl.name}`); try { // Kill process group first @@ -386,15 +478,20 @@ async function launchChrome(binary) { chromeProcess.unref(); // Don't keep Node.js process running chromePid = chromeProcess.pid; + const chromeStartTime = Date.now() / 1000; // Unix epoch seconds console.error(`[*] Launched Chrome (PID: ${chromePid}), waiting for debug port...`); - // Write Chrome PID for backup cleanup (named .pid so Crawl.cleanup() finds it) - fs.writeFileSync(path.join(OUTPUT_DIR, 'chrome.pid'), String(chromePid)); + // Write Chrome PID with mtime set to start time for validation + writePidWithMtime(path.join(OUTPUT_DIR, 'chrome.pid'), chromePid, chromeStartTime); + + // Write command script for validation + writeCmdScript(path.join(OUTPUT_DIR, 'cmd.sh'), binary, chromeArgs); + fs.writeFileSync(path.join(OUTPUT_DIR, 'port.txt'), String(debugPort)); - // Write hook's own PID so Crawl.cleanup() can kill this hook process - // (which will trigger our SIGTERM handler to kill Chrome) - fs.writeFileSync(path.join(OUTPUT_DIR, 'hook.pid'), String(process.pid)); + // Write hook's own PID with mtime for validation + const hookStartTime = Date.now() / 1000; + writePidWithMtime(path.join(OUTPUT_DIR, 'hook.pid'), process.pid, hookStartTime); try { // Wait for Chrome to be ready diff --git a/archivebox/templates/admin/progress_monitor.html b/archivebox/templates/admin/progress_monitor.html index 10286104..a2be9eda 100644 --- a/archivebox/templates/admin/progress_monitor.html +++ b/archivebox/templates/admin/progress_monitor.html @@ -488,7 +488,7 @@ ${icon} - ${extractor.extractor} + ${extractor.plugin} `; @@ -499,10 +499,10 @@ const adminUrl = `/admin/core/snapshot/${snapshot.id}/change/`; let extractorHtml = ''; - if (snapshot.all_extractors && snapshot.all_extractors.length > 0) { - // Sort extractors alphabetically by name to prevent reordering on updates - const sortedExtractors = [...snapshot.all_extractors].sort((a, b) => - a.extractor.localeCompare(b.extractor) + if (snapshot.all_plugins && snapshot.all_plugins.length > 0) { + // Sort plugins alphabetically by name to prevent reordering on updates + const sortedExtractors = [...snapshot.all_plugins].sort((a, b) => + a.plugin.localeCompare(b.plugin) ); extractorHtml = `
diff --git a/archivebox/tests/test_migrations_helpers.py b/archivebox/tests/test_migrations_helpers.py index eddaa4e8..b634583b 100644 --- a/archivebox/tests/test_migrations_helpers.py +++ b/archivebox/tests/test_migrations_helpers.py @@ -158,7 +158,7 @@ CREATE TABLE IF NOT EXISTS core_snapshot_tags ( CREATE TABLE IF NOT EXISTS core_archiveresult ( id INTEGER PRIMARY KEY AUTOINCREMENT, snapshot_id CHAR(32) NOT NULL REFERENCES core_snapshot(id), - extractor VARCHAR(32) NOT NULL, + plugin VARCHAR(32) NOT NULL, cmd TEXT, pwd VARCHAR(256), cmd_version VARCHAR(128), @@ -379,7 +379,7 @@ CREATE TABLE IF NOT EXISTS crawls_seed ( created_by_id INTEGER NOT NULL REFERENCES auth_user(id), modified_at DATETIME, uri VARCHAR(2048) NOT NULL, - extractor VARCHAR(32) NOT NULL DEFAULT 'auto', + plugin VARCHAR(32) NOT NULL DEFAULT 'auto', tags_str VARCHAR(255) NOT NULL DEFAULT '', label VARCHAR(255) NOT NULL DEFAULT '', config TEXT DEFAULT '{}', @@ -465,7 +465,7 @@ CREATE TABLE IF NOT EXISTS core_archiveresult ( created_at DATETIME NOT NULL, modified_at DATETIME, snapshot_id CHAR(36) NOT NULL REFERENCES core_snapshot(id), - extractor VARCHAR(32) NOT NULL, + plugin VARCHAR(32) NOT NULL, pwd VARCHAR(256), cmd TEXT, cmd_version VARCHAR(128), diff --git a/archivebox/workers/worker.py b/archivebox/workers/worker.py index b97eb435..ca67cccc 100644 --- a/archivebox/workers/worker.py +++ b/archivebox/workers/worker.py @@ -227,10 +227,10 @@ class Worker: urls = obj.get_urls_list() url = urls[0] if urls else None - extractor = None - if hasattr(obj, 'extractor'): + plugin = None + if hasattr(obj, 'plugin'): # ArchiveResultWorker, Crawl - extractor = obj.extractor + plugin = obj.plugin log_worker_event( worker_type=worker_type_name, @@ -239,7 +239,7 @@ class Worker: pid=self.pid, worker_id=str(self.worker_id), url=url, - extractor=extractor, + plugin=plugin, metadata=start_metadata if start_metadata else None, ) @@ -262,7 +262,7 @@ class Worker: pid=self.pid, worker_id=str(self.worker_id), url=url, - extractor=extractor, + plugin=plugin, metadata=complete_metadata, ) else: @@ -345,9 +345,9 @@ class ArchiveResultWorker(Worker): name: ClassVar[str] = 'archiveresult' MAX_TICK_TIME: ClassVar[int] = 120 - def __init__(self, extractor: str | None = None, **kwargs: Any): + def __init__(self, plugin: str | None = None, **kwargs: Any): super().__init__(**kwargs) - self.extractor = extractor + self.plugin = plugin def get_model(self): from core.models import ArchiveResult @@ -359,16 +359,16 @@ class ArchiveResultWorker(Worker): qs = super().get_queue() - if self.extractor: - qs = qs.filter(extractor=self.extractor) + if self.plugin: + qs = qs.filter(plugin=self.plugin) # Note: Removed blocking logic since plugins have separate output directories - # and don't interfere with each other. Each plugin (extractor) runs independently. + # and don't interfere with each other. Each plugin runs independently. 
return qs def process_item(self, obj) -> bool: - """Process an ArchiveResult by running its extractor.""" + """Process an ArchiveResult by running its plugin.""" try: obj.sm.tick() return True @@ -378,8 +378,8 @@ class ArchiveResultWorker(Worker): return False @classmethod - def start(cls, worker_id: int | None = None, daemon: bool = False, extractor: str | None = None, **kwargs: Any) -> int: - """Fork a new worker as subprocess with optional extractor filter.""" + def start(cls, worker_id: int | None = None, daemon: bool = False, plugin: str | None = None, **kwargs: Any) -> int: + """Fork a new worker as subprocess with optional plugin filter.""" if worker_id is None: worker_id = get_next_worker_id(cls.name) @@ -387,7 +387,7 @@ class ArchiveResultWorker(Worker): proc = Process( target=_run_worker, args=(cls.name, worker_id, daemon), - kwargs={'extractor': extractor, **kwargs}, + kwargs={'plugin': plugin, **kwargs}, name=f'{cls.name}_worker_{worker_id}', ) proc.start() diff --git a/tests/test_recursive_crawl.py b/tests/test_recursive_crawl.py index ef5e223f..9ed52e16 100644 --- a/tests/test_recursive_crawl.py +++ b/tests/test_recursive_crawl.py @@ -74,17 +74,17 @@ def test_background_hooks_dont_block_parser_extractors(tmp_path, process): # Check that background hooks are running # Background hooks: consolelog, ssl, responses, redirects, staticfile bg_hooks = c.execute( - "SELECT extractor, status FROM core_archiveresult WHERE extractor IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') ORDER BY extractor" + "SELECT plugin, status FROM core_archiveresult WHERE plugin IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') ORDER BY plugin" ).fetchall() # Check that parser extractors have run (not stuck in queued) parser_extractors = c.execute( - "SELECT extractor, status FROM core_archiveresult WHERE extractor LIKE 'parse_%_urls' ORDER BY extractor" + "SELECT plugin, status FROM core_archiveresult WHERE plugin LIKE 'parse_%_urls' ORDER BY plugin" ).fetchall() # Check all extractors to see what's happening all_extractors = c.execute( - "SELECT extractor, status FROM core_archiveresult ORDER BY extractor" + "SELECT plugin, status FROM core_archiveresult ORDER BY plugin" ).fetchall() conn.close() @@ -160,7 +160,7 @@ def test_parser_extractors_emit_snapshot_jsonl(tmp_path, process): # Check that parse_html_urls ran parse_html = c.execute( - "SELECT id, status, output_str FROM core_archiveresult WHERE extractor = '60_parse_html_urls'" + "SELECT id, status, output_str FROM core_archiveresult WHERE plugin = '60_parse_html_urls'" ).fetchone() conn.close() @@ -171,7 +171,7 @@ def test_parser_extractors_emit_snapshot_jsonl(tmp_path, process): # Parser should have run assert status in ['started', 'succeeded', 'failed'], \ - f"parse_html_urls should have run, got status: {status}" + f"60_parse_html_urls should have run, got status: {status}" # If it succeeded and found links, output should contain JSON if status == 'succeeded' and output: @@ -185,39 +185,37 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process): """Test that recursive crawling creates child snapshots with proper depth and parent_snapshot_id.""" os.chdir(tmp_path) - # Disable most extractors to speed up test, but keep wget for HTML content + # Create a test HTML file with links + test_html = tmp_path / 'test.html' + test_html.write_text(''' + + +

Test Page

+ About + Blog + Contact + + + ''') + + # Minimal env for fast testing env = os.environ.copy() env.update({ - "USE_WGET": "true", # Need wget to fetch HTML for parsers - "USE_SINGLEFILE": "false", - "USE_READABILITY": "false", - "USE_MERCURY": "false", - "SAVE_HTMLTOTEXT": "false", - "SAVE_PDF": "false", - "SAVE_SCREENSHOT": "false", - "SAVE_DOM": "false", - "SAVE_HEADERS": "false", - "USE_GIT": "false", - "SAVE_MEDIA": "false", - "SAVE_ARCHIVE_DOT_ORG": "false", - "SAVE_TITLE": "false", - "SAVE_FAVICON": "false", - "USE_CHROME": "false", "URL_ALLOWLIST": r"monadical\.com/.*", # Only crawl same domain }) # Start a crawl with depth=1 (just one hop to test recursive crawling) + # Use file:// URL so it's instant, no network fetch needed proc = subprocess.Popen( - ['archivebox', 'add', '--depth=1', 'https://monadical.com'], + ['archivebox', 'add', '--depth=1', f'file://{test_html}'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, env=env, ) - # Give orchestrator time to process - parser extractors should emit child snapshots within 60s - # Even if root snapshot is still processing, child snapshots can start in parallel - time.sleep(60) + # Give orchestrator time to process - file:// is fast, should complete in 20s + time.sleep(20) # Kill the process proc.kill() @@ -231,8 +229,7 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process): # Check root snapshot (depth=0) root_snapshot = c.execute( - "SELECT id, url, depth, parent_snapshot_id FROM core_snapshot WHERE url = ? AND depth = 0", - ('https://monadical.com',) + "SELECT id, url, depth, parent_snapshot_id FROM core_snapshot WHERE depth = 0 ORDER BY created_at LIMIT 1" ).fetchone() # Check if any child snapshots were created (depth=1) @@ -247,13 +244,13 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process): # Check parser extractor status parser_status = c.execute( - "SELECT extractor, status FROM core_archiveresult WHERE snapshot_id = ? AND extractor LIKE 'parse_%_urls'", + "SELECT plugin, status FROM core_archiveresult WHERE snapshot_id = ? AND plugin LIKE 'parse_%_urls'", (root_snapshot[0] if root_snapshot else '',) ).fetchall() # Check for started extractors that might be blocking started_extractors = c.execute( - "SELECT extractor, status FROM core_archiveresult WHERE snapshot_id = ? AND status = 'started'", + "SELECT plugin, status FROM core_archiveresult WHERE snapshot_id = ? AND status = 'started'", (root_snapshot[0] if root_snapshot else '',) ).fetchall() @@ -417,12 +414,12 @@ def test_archiveresult_worker_queue_filters_by_foreground_extractors(tmp_path, p # Get background hooks that are started bg_started = c.execute( - "SELECT extractor FROM core_archiveresult WHERE extractor IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') AND status = 'started'" + "SELECT plugin FROM core_archiveresult WHERE plugin IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') AND status = 'started'" ).fetchall() # Get parser extractors that should be queued or better parser_status = c.execute( - "SELECT extractor, status FROM core_archiveresult WHERE extractor LIKE 'parse_%_urls'" + "SELECT plugin, status FROM core_archiveresult WHERE plugin LIKE 'parse_%_urls'" ).fetchall() conn.close()