diff --git a/TODO_hook_concurrency.md b/TODO_hook_concurrency.md
new file mode 100644
index 00000000..82190e7f
--- /dev/null
+++ b/TODO_hook_concurrency.md
@@ -0,0 +1,486 @@
+# ArchiveBox Hook Script Concurrency & Execution Plan
+
+## Overview
+
+Snapshot.run() should enforce that snapshot hooks are run in **10 discrete, sequential "steps"**: `0*`, `1*`, `2*`, `3*`, `4*`, `5*`, `6*`, `7*`, `8*`, `9*`.
+
+For every discovered hook script, ArchiveBox should create an ArchiveResult in the `queued` state, then manage execution using `retry_at` and inline logic to enforce this ordering.
+
+## Hook Numbering Convention
+
+Hook scripts are numbered `00` to `99` to control:
+- **First digit (0-9)**: Which step they are part of
+- **Second digit (0-9)**: Order within that step
+
+Hook scripts are launched **strictly sequentially** in filename-alphabetical order, and run as one batch per step: every hook in a step finishes (or is backgrounded) before the next step begins.
+
+**Naming Format:**
+```
+on_{ModelName}__{run_order}_{human_readable_description}[.bg].{ext}
+```
+
+**Examples:**
+```
+on_Snapshot__00_this_would_run_first.sh
+on_Snapshot__05_start_ytdlp_download.bg.sh
+on_Snapshot__10_chrome_tab_opened.js
+on_Snapshot__50_screenshot.js
+on_Snapshot__53_media.bg.py
+```
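+
+As a sketch of how this convention could be parsed, the helper below splits a
+hook filename into its ordering parts (the function and its return shape are
+illustrative, not an existing ArchiveBox API):
+
+```python
+import re
+from pathlib import Path
+
+HOOK_RE = re.compile(
+    r'^on_(?P<model>[A-Za-z]+)__(?P<order>\d{2})_(?P<name>[a-z0-9_]+)'
+    r'(?P<bg>\.bg)?\.(?P<ext>[a-z]+)$'
+)
+
+def parse_hook_filename(path: Path) -> dict | None:
+    """Split e.g. on_Snapshot__53_media.bg.py into its ordering parts."""
+    match = HOOK_RE.match(path.name)
+    if not match:
+        return None
+    return {
+        'model': match['model'],            # e.g. 'Snapshot'
+        'step': int(match['order'][0]),     # first digit: step 0-9
+        'order': int(match['order']),       # full 00-99 sort key
+        'is_bg': match['bg'] is not None,   # .bg suffix present?
+        'ext': match['ext'],
+    }
+
+assert parse_hook_filename(Path('on_Snapshot__53_media.bg.py'))['step'] == 5
+```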
+
+## Background (.bg) vs Foreground Scripts
+
+### Foreground Scripts (no .bg suffix)
+- Run sequentially within their step
+- Block step progression until they exit
+- Should exit naturally when work is complete
+- Get killed with SIGTERM if they exceed their `PLUGINNAME_TIMEOUT`
+
+### Background Scripts (.bg suffix)
+- Spawned and allowed to continue running
+- Do NOT block step progression
+- Run until **their own `PLUGINNAME_TIMEOUT` is reached** (not until step 99)
+- Get polite SIGTERM when timeout expires, then SIGKILL 60s later if not exited
+- Must implement their own concurrency control using the filesystem (semaphore files, locks, etc.; see the sketch below)
+- Should exit naturally when work is complete (best case)
+
+**Important:** If a .bg script is launched by hook `05` with `MEDIA_TIMEOUT=3600`, it gets the full 3600 seconds regardless of when step 99 completes. It runs on its own timeline.
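+
+As one example of the filesystem coordination mentioned above, a .bg hook
+could guard a shared resource with an atomic O_EXCL lock file (the lock
+filename, polling interval, and wait budget here are illustrative):
+
+```python
+import os
+import sys
+import time
+from pathlib import Path
+
+LOCK_FILE = Path('..') / '.media_download.lock'  # shared semaphore, illustrative name
+
+def acquire_lock(wait_seconds: int = 60) -> bool:
+    """Atomically create the lock file, polling until wait_seconds elapse."""
+    deadline = time.time() + wait_seconds
+    while time.time() < deadline:
+        try:
+            fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
+            os.write(fd, str(os.getpid()).encode())
+            os.close(fd)
+            return True
+        except FileExistsError:
+            time.sleep(1)  # another hook holds the lock; poll politely
+    return False
+
+if not acquire_lock():
+    sys.exit(1)  # treated as a temporary error, retried later
+try:
+    pass  # ... long-running download work goes here ...
+finally:
+    LOCK_FILE.unlink(missing_ok=True)  # always release the semaphore
+```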
+
+## Execution Step Guidelines
+
+These are **naming conventions and guidelines**, not enforced checkpoints. They provide semantic organization for plugin ordering:
+
+### Step 0: Pre-Setup
+```
+00-09: Initial setup, validation, feature detection
+```
+
+### Step 1: Chrome Launch & Tab Creation
+```
+10-19: Browser/tab lifecycle setup
+- Chrome browser launch
+- Tab creation and CDP connection
+```
+
+### Step 2: Navigation & Settlement
+```
+20-29: Page loading and settling
+- Navigate to URL
+- Wait for page load
+- Initial response capture (responses, ssl, consolelog as .bg listeners)
+```
+
+### Step 3: Page Adjustment
+```
+30-39: DOM manipulation before archiving
+- Hide popups/banners
+- Solve captchas
+- Expand comments/details sections
+- Inject custom CSS/JS
+- Accessibility modifications
+```
+
+### Step 4: Ready for Archiving
+```
+40-49: Final pre-archiving checks
+- Verify page is fully adjusted
+- Wait for any pending modifications
+```
+
+### Step 5: DOM Extraction (Sequential, Non-BG)
+```
+50-59: Extractors that need exclusive DOM access
+- singlefile (MUST NOT be .bg)
+- screenshot (MUST NOT be .bg)
+- pdf (MUST NOT be .bg)
+- dom (MUST NOT be .bg)
+- title
+- headers
+- readability
+- mercury
+
+These MUST run sequentially as they temporarily modify the DOM
+during extraction, then revert it. Running in parallel would corrupt results.
+```
+
+### Step 6: Post-DOM Extraction
+```
+60-69: Extractors that don't need DOM or run on downloaded files
+- wget
+- git
+- media (.bg - can run for hours)
+- gallerydl (.bg)
+- forumdl (.bg)
+- papersdl (.bg)
+```
+
+### Step 7: Chrome Cleanup
+```
+70-79: Browser/tab teardown
+- Close tabs
+- Cleanup Chrome resources
+```
+
+### Step 8: Post-Processing
+```
+80-89: Reprocess outputs from earlier extractors
+- OCR of images
+- Audio/video transcription
+- URL parsing from downloaded content (rss, html, json, txt, csv, md)
+- LLM analysis/summarization of outputs
+```
+
+### Step 9: Indexing & Finalization
+```
+90-99: Save to indexes and finalize
+- Index text content to Sonic/SQLite FTS
+- Create symlinks
+- Generate merkle trees
+- Final status updates
+```
+
+## Hook Script Interface
+
+### Input: CLI Arguments (NOT stdin)
+Hooks receive configuration as CLI flags (CSV or JSON-encoded):
+
+```bash
+--url="https://example.com"
+--snapshot-id="1234-5678-uuid"
+--config='{"some_key": "some_value"}'
+--plugins=git,media,favicon,title
+--timeout=50
+--enable-something
+```
+
+### Input: Environment Variables
+All configuration comes from env vars, defined in `plugin_dir/config.json` JSONSchema:
+
+```bash
+WGET_BINARY=/usr/bin/wget
+WGET_TIMEOUT=60
+WGET_USER_AGENT="Mozilla/5.0..."
+WGET_EXTRA_ARGS="--no-check-certificate"
+SAVE_WGET=True
+```
+
+**Required:** Every plugin must support `PLUGINNAME_TIMEOUT` for self-termination.
+
+### Output: Filesystem (CWD)
+Hooks read/write files to:
+- `$CWD`: Their own output subdirectory (e.g., `archive/snapshots/{id}/wget/`)
+- `$CWD/..`: Parent directory (to read outputs from other hooks)
+
+This allows hooks to:
+- Access files created by other hooks
+- Keep their outputs separate by default
+- Use semaphore files for coordination (if needed)
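+
+For example, a step-8 post-processing hook could read a step-6 output like
+this (the wget output filename is an illustrative assumption):
+
+```python
+from pathlib import Path
+
+# CWD is this hook's own subdir, e.g. archive/snapshots/{id}/ocr/
+wget_output = Path('..') / 'wget' / 'index.html'
+if wget_output.exists():
+    html = wget_output.read_text(errors='replace')  # output from another hook
+```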
+
+### Output: JSONL to stdout
+Hooks emit one JSONL line per database record they want to create or update:
+
+```jsonl
+{"type": "Tag", "name": "sci-fi"}
+{"type": "ArchiveResult", "id": "1234-uuid", "status": "succeeded", "output_str": "wget/index.html"}
+{"type": "Snapshot", "id": "5678-uuid", "title": "Example Page"}
+```
+
+See `archivebox/misc/jsonl.py` and the model `from_json()` / `from_jsonl()` methods for the full list of supported types and fields.
+
+### Output: stderr for Human Logs
+Hooks should emit human-readable output or debug info to **stderr**. There are no guarantees this will be persisted long-term. Use stdout JSONL or filesystem for outputs that matter.
+
+### Cleanup: Delete Cruft
+If hooks emit no meaningful long-term outputs, they should delete any temporary files themselves to avoid wasting space. However, the ArchiveResult DB row should be kept so we know:
+- It doesn't need to be retried
+- It isn't missing
+- What happened (status, error message)
+
+### Signal Handling: SIGINT/SIGTERM
+Hooks are expected to catch a polite `SIGINT`/`SIGTERM`, finish up hastily, and exit cleanly. Beyond that, they may be `SIGKILL`'d at ArchiveBox's discretion.
+
+**If hooks double-fork or spawn long-running processes:** They must output a `.pid` file in their directory so zombies can be swept safely.
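+
+A minimal sketch of the expected handler in a Python hook (`finish_and_flush`
+is a placeholder for whatever partial-output cleanup a given plugin needs):
+
+```python
+import signal
+import sys
+
+def finish_and_flush():
+    """Placeholder: flush partial output files, emit final JSONL records."""
+    pass
+
+def handle_term(signum, frame):
+    finish_and_flush()  # finish hastily on the polite signal
+    sys.exit(0)         # exit cleanly before SIGKILL arrives
+
+signal.signal(signal.SIGTERM, handle_term)
+signal.signal(signal.SIGINT, handle_term)
+```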
+
+## Hook Failure Modes & Retry Logic
+
+Hooks can fail in several ways. ArchiveBox handles each differently:
+
+### 1. Soft Failure (Record & Don't Retry)
+**Exit:** `0` (success)
+**JSONL:** `{"type": "ArchiveResult", "status": "failed", "output_str": "404 Not Found"}`
+
+This means: "I ran successfully, but the resource wasn't available." Don't retry this.
+
+**Use cases:**
+- 404 errors
+- Content not available
+- Feature not applicable to this URL
+
+### 2. Hard Failure / Temporary Error (Retry Later)
+**Exit:** Non-zero (1, 2, etc.)
+**JSONL:** None (or incomplete)
+
+This means: "Something went wrong, I couldn't complete." Treat this ArchiveResult as "missing" and set `retry_at` for later.
+
+**Use cases:**
+- 500 server errors
+- Network timeouts
+- Binary not found / crashed
+- Transient errors
+
+**Behavior:**
+- ArchiveBox sets `retry_at` on the ArchiveResult
+- Hook will be retried during next `archivebox update`
+
+### 3. Partial Success (Update & Continue)
+**Exit:** Non-zero
+**JSONL:** Partial records emitted before crash
+
+**Behavior:**
+- Update ArchiveResult with whatever was emitted
+- Mark remaining work as "missing" with `retry_at`
+
+### 4. Success (Record & Continue)
+**Exit:** `0`
+**JSONL:** `{"type": "ArchiveResult", "status": "succeeded", "output_str": "output/file.html"}`
+
+This is the happy path.
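+
+Taken together, the four modes reduce to a small dispatch on (exit code,
+emitted JSONL status). This sketch assumes a `next_retry()` backoff helper;
+status values and field names follow the JSONL examples above:
+
+```python
+from datetime import datetime, timedelta
+
+def next_retry() -> datetime:
+    return datetime.now() + timedelta(hours=1)  # illustrative backoff
+
+def resolve_result(exit_code: int, jsonl_status: str | None) -> dict:
+    """Map a hook's exit code + final JSONL status to ArchiveResult updates."""
+    if exit_code == 0 and jsonl_status == 'succeeded':
+        return {'status': 'succeeded'}                       # 4. success
+    if exit_code == 0 and jsonl_status == 'failed':
+        return {'status': 'failed'}                          # 1. soft failure, no retry
+    if exit_code != 0 and jsonl_status is not None:
+        # 3. partial success: keep emitted records, retry the rest
+        return {'status': 'failed', 'retry_at': next_retry()}
+    return {'status': 'failed', 'retry_at': next_retry()}    # 2. hard failure, retry
+```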
+
+### Error Handling Rules
+
+- **DO NOT skip hooks** based on failures
+- **Continue to next hook** regardless of foreground or background failures
+- **Update ArchiveResults** with whatever information is available
+- **Set retry_at** for "missing" or temporarily-failed hooks
+- **Let background scripts continue** even if foreground scripts fail
+
+## File Structure
+
+```
+archivebox/plugins/{plugin_name}/
+├── config.json # JSONSchema: env var config options
+├── binaries.jsonl # Runtime dependencies: apt|brew|pip|npm|env
+├── on_Snapshot__XX_name.py # Hook script (foreground)
+├── on_Snapshot__XX_name.bg.py # Hook script (background)
+└── tests/
+    └── test_name.py
+```
+
+## Implementation Checklist
+
+### Phase 1: Renumber Existing Hooks
+- [ ] Renumber DOM extractors to 50-59 range
+- [ ] Ensure pdf/screenshot are NOT .bg (need sequential access)
+- [ ] Ensure media (ytdlp) IS .bg (can run for hours)
+- [ ] Add step comments to each plugin for clarity
+
+### Phase 2: Timeout Consistency ✅
+- [x] All plugins support `PLUGINNAME_TIMEOUT` env var
+- [x] All plugins fall back to generic `TIMEOUT` env var
+- [x] Background scripts handle SIGTERM gracefully (or exit naturally)
+
+### Phase 3: Refactor Snapshot.run()
+- [ ] Parse hook filenames to extract step number (first digit)
+- [ ] Group hooks by step (0-9)
+- [ ] Run each step sequentially
+- [ ] Within each step:
+ - [ ] Launch foreground hooks sequentially
+ - [ ] Launch .bg hooks and track PIDs
+ - [ ] Wait for foreground hooks to complete before next step
+- [ ] Track .bg script timeouts independently
+- [ ] Send SIGTERM to .bg scripts when their timeout expires
+- [ ] Send SIGKILL 60s after SIGTERM if not exited
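+
+A rough sketch of this step loop (hook discovery and the `run_foreground` /
+`spawn_background` helpers are placeholders for the real implementation):
+
+```python
+from itertools import groupby
+
+def run_steps(hooks, run_foreground, spawn_background):
+    """Run hooks in 10 sequential steps; foreground blocks, background doesn't."""
+    # Alphabetical filename order also sorts by the 00-99 numeric prefix
+    hooks = sorted(hooks, key=lambda h: h.path.name)
+    bg_procs = []
+    for step, step_hooks in groupby(hooks, key=lambda h: h.step):
+        for hook in step_hooks:
+            if hook.is_bg:
+                # spawn, record (pid, deadline), and keep going
+                bg_procs.append(spawn_background(hook))
+            else:
+                # blocks until exit or SIGTERM at PLUGINNAME_TIMEOUT
+                run_foreground(hook)
+        # every foreground hook in this step has exited; advance to next step
+    return bg_procs  # reaped independently as their own timeouts expire
+```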
+
+### Phase 4: ArchiveResult Management
+- [ ] Create one ArchiveResult per hook (not per plugin)
+- [ ] Set initial state to `queued`
+- [ ] Update state based on JSONL output and exit code
+- [ ] Set `retry_at` for hooks that exit non-zero with no JSONL
+- [ ] Don't retry hooks that emit `{"status": "failed"}`
+
+### Phase 5: JSONL Streaming
+- [ ] Parse stdout JSONL line-by-line during hook execution
+- [ ] Create/update DB rows as JSONL is emitted (streaming mode)
+- [ ] Handle partial JSONL on hook crash
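+
+A sketch of the streaming consumer (assumes the hook was started with
+`Popen(stdout=PIPE, text=True)` and that `apply_record` routes each dict to
+the right model, e.g. via the from_json() methods mentioned earlier):
+
+```python
+import json
+
+def stream_jsonl(process, apply_record):
+    """Apply each JSONL record to the DB the moment the hook emits it."""
+    emitted = []
+    for line in process.stdout:        # reads line-by-line while the hook runs
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            record = json.loads(line)
+        except json.JSONDecodeError:
+            continue                   # tolerate a truncated line from a crash
+        apply_record(record)           # create/update the DB row immediately
+        emitted.append(record)
+    return emitted                     # used to detect partial success later
+```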
+
+### Phase 6: Zombie Process Management
+- [ ] Read `.pid` files from hook output directories
+- [ ] Sweep zombies on cleanup
+- [ ] Handle double-forked processes correctly
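+
+A sketch of the PID validation idea (Linux-only via /proc; mirrors the
+mtime-vs-start-time check used by the chrome plugin elsewhere in this diff):
+
+```python
+import os
+import time
+from pathlib import Path
+
+def pid_file_is_valid(pid_file: Path, tolerance: float = 5.0) -> bool:
+    """True only if the PID is alive AND started when the file was written."""
+    try:
+        pid = int(pid_file.read_text().strip())
+        os.kill(pid, 0)  # signal 0: existence check, sends nothing
+        # /proc/<pid>/stat field 22 is starttime in clock ticks since boot;
+        # fields after the "(comm)" begin at field 3, so it sits at index 19
+        after_comm = Path(f'/proc/{pid}/stat').read_text().rsplit(') ', 1)[1]
+        ticks = int(after_comm.split()[19])
+        uptime = float(Path('/proc/uptime').read_text().split()[0])
+        started = (time.time() - uptime) + ticks / os.sysconf('SC_CLK_TCK')
+    except (OSError, ValueError, IndexError):
+        return False  # dead PID, garbage file, or no /proc (non-Linux)
+    return abs(pid_file.stat().st_mtime - started) <= tolerance
+```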
+
+## Migration Path
+
+### Backward Compatibility
+During migration, support both old and new numbering:
+1. Run hooks numbered 00-99 in step order
+2. Run unnumbered hooks last (step 9) for compatibility
+3. Log warnings for unnumbered hooks
+4. Eventually require all hooks to be numbered
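+
+The compatibility ordering can fall back to step 9 for unnumbered hooks (a
+sketch building on the parse_hook_filename helper from earlier):
+
+```python
+import warnings
+
+def hook_step(path) -> int:
+    """Step 0-9 from the filename prefix, defaulting unnumbered hooks to 9."""
+    parsed = parse_hook_filename(path)  # from the naming-convention sketch
+    if parsed is None:
+        warnings.warn(f'unnumbered hook {path.name} will run last (step 9)')
+        return 9
+    return parsed['step']
+```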
+
+### Renumbering Map
+
+**Current → New:**
+```
+git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py
+media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py
+gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py
+forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py
+papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py
+
+readability/on_Snapshot__52_readability.py → readability/on_Snapshot__56_readability.py
+mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__57_mercury.py
+
+singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py
+screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js
+pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js
+dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js
+title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js
+headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js
+
+wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py
+```
+
+## Testing Strategy
+
+### Unit Tests
+- Test hook ordering (00-99)
+- Test step grouping (first digit)
+- Test .bg vs foreground execution
+- Test timeout enforcement
+- Test JSONL parsing
+- Test failure modes & retry_at logic
+
+### Integration Tests
+- Test full Snapshot.run() with mixed hooks
+- Test .bg scripts running beyond step 99
+- Test zombie process cleanup
+- Test graceful SIGTERM handling
+- Test concurrent .bg script coordination
+
+### Performance Tests
+- Measure overhead of per-hook ArchiveResults
+- Test with 50+ concurrent .bg scripts
+- Test filesystem contention with many hooks
+
+## Open Questions
+
+### Q: Should we provide semaphore utilities?
+**A:** No. Keep plugins decoupled. Let them use simple filesystem coordination if needed.
+
+### Q: What happens if ArchiveResult table gets huge?
+**A:** We can delete old successful ArchiveResults periodically, or archive them to cold storage. The important data is in the filesystem outputs.
+
+### Q: Should naturally-exiting .bg scripts still be .bg?
+**A:** Yes. The .bg suffix means "don't block step progression," not "run until step 99." Natural exit is the best case.
+
+## Examples
+
+### Foreground Hook (Sequential DOM Access)
+```python
+#!/usr/bin/env python3
+# Illustrative Python version of archivebox/plugins/screenshot/on_Snapshot__51_screenshot.js
+
+# Runs at step 5, blocks step progression until complete
+# Gets killed if it exceeds SCREENSHOT_TIMEOUT
+
+import json
+import os
+import subprocess
+import sys
+
+def get_env_int(key, default=None):
+    """Parse an integer env var, falling back to default if unset or invalid."""
+    try:
+        return int(os.environ[key])
+    except (KeyError, ValueError):
+        return default
+
+url = next((a.split('=', 1)[1] for a in sys.argv[1:] if a.startswith('--url=')), '')
+timeout = get_env_int('SCREENSHOT_TIMEOUT') or get_env_int('TIMEOUT', 60)
+cmd = ['node', 'screenshot.js', url]  # illustrative command
+
+try:
+    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
+    if result.returncode == 0:
+        print(json.dumps({
+            "type": "ArchiveResult",
+            "status": "succeeded",
+            "output_str": "screenshot.png"
+        }))
+        sys.exit(0)
+    else:
+        # Temporary failure - will be retried
+        sys.exit(1)
+except subprocess.TimeoutExpired:
+    # Timeout - will be retried
+    sys.exit(1)
+```
+
+### Background Hook (Long-Running Download)
+```python
+#!/usr/bin/env python3
+# archivebox/plugins/media/on_Snapshot__63_media.bg.py
+
+# Runs at step 6, doesn't block step progression
+# Gets full MEDIA_TIMEOUT (e.g., 3600s) regardless of when step 99 completes
+
+import json
+import subprocess
+import sys
+
+# (uses the same get_env_int helper shown in the foreground example above)
+
+url = next((a.split('=', 1)[1] for a in sys.argv[1:] if a.startswith('--url=')), '')
+timeout = get_env_int('YTDLP_TIMEOUT') or get_env_int('MEDIA_TIMEOUT') or get_env_int('TIMEOUT', 3600)
+
+try:
+    result = subprocess.run(['yt-dlp', url], capture_output=True, timeout=timeout)
+    if result.returncode == 0:
+        print(json.dumps({
+            "type": "ArchiveResult",
+            "status": "succeeded",
+            "output_str": "media/"
+        }))
+        sys.exit(0)
+    else:
+        # Soft failure - resource unavailable, record it and don't retry
+        print(json.dumps({
+            "type": "ArchiveResult",
+            "status": "failed",
+            "output_str": "Video unavailable"
+        }))
+        sys.exit(0)  # Exit 0 to record the failure
+except subprocess.TimeoutExpired:
+    # Timeout - will be retried
+    sys.exit(1)
+```
+
+### Background Hook with Natural Exit
+```javascript
+#!/usr/bin/env node
+// archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js
+
+// Sets up a listener, captures SSL info, then exits naturally
+// No SIGTERM handler needed - it already exits when done
+
+const fs = require('fs');
+// connectToChrome() and waitForNavigation() are assumed plugin helpers
+
+async function main() {
+    const page = await connectToChrome();
+
+    // Set up listener
+    page.on('response', async (response) => {
+        const securityDetails = response.securityDetails();
+        if (securityDetails) {
+            fs.writeFileSync('ssl.json', JSON.stringify(securityDetails));
+        }
+    });
+
+    // Wait for navigation (done by another hook)
+    await waitForNavigation();
+
+    // Emit result
+    console.log(JSON.stringify({
+        type: 'ArchiveResult',
+        status: 'succeeded',
+        output_str: 'ssl.json'
+    }));
+
+    process.exit(0); // Natural exit - nothing left to wait on
+}
+
+main().catch(e => {
+    console.error(`ERROR: ${e.message}`);
+    process.exit(1); // Will be retried
+});
+```
+
+## Summary
+
+This plan provides:
+- ✅ Clear execution ordering (10 steps, 00-99 numbering)
+- ✅ Async support (.bg suffix)
+- ✅ Independent timeout control per plugin
+- ✅ Flexible failure handling & retry logic
+- ✅ Streaming JSONL output for DB updates
+- ✅ Simple filesystem-based coordination
+- ✅ Backward compatibility during migration
+
+The main implementation work is refactoring `Snapshot.run()` to enforce step ordering and manage .bg script lifecycles. Plugin renumbering is straightforward mechanical work.
diff --git a/archivebox/api/v1_cli.py b/archivebox/api/v1_cli.py
index 9282acce..3359ca54 100644
--- a/archivebox/api/v1_cli.py
+++ b/archivebox/api/v1_cli.py
@@ -54,7 +54,7 @@ class AddCommandSchema(Schema):
tag: str = ""
depth: int = 0
parser: str = "auto"
- extract: str = ""
+ plugins: str = ""
update: bool = not ARCHIVING_CONFIG.ONLY_NEW # Default to the opposite of ARCHIVING_CONFIG.ONLY_NEW
overwrite: bool = False
index_only: bool = False
@@ -69,7 +69,7 @@ class UpdateCommandSchema(Schema):
status: Optional[StatusChoices] = StatusChoices.unarchived
filter_type: Optional[str] = FilterTypeChoices.substring
filter_patterns: Optional[List[str]] = ['https://example.com']
- extractors: Optional[str] = ""
+ plugins: Optional[str] = ""
class ScheduleCommandSchema(Schema):
import_path: Optional[str] = None
@@ -115,7 +115,7 @@ def cli_add(request, args: AddCommandSchema):
update=args.update,
index_only=args.index_only,
overwrite=args.overwrite,
- plugins=args.extract, # extract in API maps to plugins param
+ plugins=args.plugins,
parser=args.parser,
bg=True, # Always run in background for API calls
)
@@ -143,7 +143,7 @@ def cli_update(request, args: UpdateCommandSchema):
status=args.status,
filter_type=args.filter_type,
filter_patterns=args.filter_patterns,
- extractors=args.extractors,
+ plugins=args.plugins,
)
return {
"success": True,
diff --git a/archivebox/api/v1_core.py b/archivebox/api/v1_core.py
index 7f4f4f37..3d83d710 100644
--- a/archivebox/api/v1_core.py
+++ b/archivebox/api/v1_core.py
@@ -65,7 +65,8 @@ class MinimalArchiveResultSchema(Schema):
created_by_username: str
status: str
retry_at: datetime | None
- extractor: str
+ plugin: str
+ hook_name: str
cmd_version: str | None
cmd: list[str] | None
pwd: str | None
@@ -113,13 +114,14 @@ class ArchiveResultSchema(MinimalArchiveResultSchema):
class ArchiveResultFilterSchema(FilterSchema):
id: Optional[str] = Field(None, q=['id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
- search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'extractor', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
+ search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'plugin', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
snapshot_id: Optional[str] = Field(None, q=['snapshot__id__startswith', 'snapshot__timestamp__startswith'])
snapshot_url: Optional[str] = Field(None, q='snapshot__url__icontains')
snapshot_tag: Optional[str] = Field(None, q='snapshot__tags__name__icontains')
status: Optional[str] = Field(None, q='status')
output_str: Optional[str] = Field(None, q='output_str__icontains')
- extractor: Optional[str] = Field(None, q='extractor__icontains')
+ plugin: Optional[str] = Field(None, q='plugin__icontains')
+ hook_name: Optional[str] = Field(None, q='hook_name__icontains')
cmd: Optional[str] = Field(None, q='cmd__0__icontains')
pwd: Optional[str] = Field(None, q='pwd__icontains')
cmd_version: Optional[str] = Field(None, q='cmd_version')
diff --git a/archivebox/cli/archivebox_add.py b/archivebox/cli/archivebox_add.py
index 4a848d13..f868787d 100644
--- a/archivebox/cli/archivebox_add.py
+++ b/archivebox/cli/archivebox_add.py
@@ -86,7 +86,7 @@ def add(urls: str | list[str],
'ONLY_NEW': not update,
'INDEX_ONLY': index_only,
'OVERWRITE': overwrite,
- 'EXTRACTORS': plugins,
+ 'PLUGINS': plugins,
'DEFAULT_PERSONA': persona or 'Default',
'PARSER': parser,
}
diff --git a/archivebox/cli/archivebox_extract.py b/archivebox/cli/archivebox_extract.py
index 7ebdc385..45eeb331 100644
--- a/archivebox/cli/archivebox_extract.py
+++ b/archivebox/cli/archivebox_extract.py
@@ -40,7 +40,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int:
"""
Run extraction for a single ArchiveResult by ID (used by workers).
- Triggers the ArchiveResult's state machine tick() to run the extractor.
+ Triggers the ArchiveResult's state machine tick() to run the extractor plugin.
"""
from rich import print as rprint
from core.models import ArchiveResult
@@ -51,7 +51,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int:
rprint(f'[red]ArchiveResult {archiveresult_id} not found[/red]', file=sys.stderr)
return 1
- rprint(f'[blue]Extracting {archiveresult.extractor} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr)
+ rprint(f'[blue]Extracting {archiveresult.plugin} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr)
try:
# Trigger state machine tick - this runs the actual extraction
@@ -151,7 +151,7 @@ def run_plugins(
# Only create for specific plugin
result, created = ArchiveResult.objects.get_or_create(
snapshot=snapshot,
- extractor=plugin,
+ plugin=plugin,
defaults={
'status': ArchiveResult.StatusChoices.QUEUED,
'retry_at': timezone.now(),
@@ -193,7 +193,7 @@ def run_plugins(
snapshot = Snapshot.objects.get(id=snapshot_id)
results = snapshot.archiveresult_set.all()
if plugin:
- results = results.filter(extractor=plugin)
+ results = results.filter(plugin=plugin)
for result in results:
if is_tty:
@@ -202,7 +202,7 @@ def run_plugins(
'failed': 'red',
'skipped': 'yellow',
}.get(result.status, 'dim')
- rprint(f' [{status_color}]{result.status}[/{status_color}] {result.extractor} → {result.output_str or ""}', file=sys.stderr)
+ rprint(f' [{status_color}]{result.status}[/{status_color}] {result.plugin} → {result.output_str or ""}', file=sys.stderr)
else:
write_record(archiveresult_to_jsonl(result))
except Snapshot.DoesNotExist:
diff --git a/archivebox/core/admin_archiveresults.py b/archivebox/core/admin_archiveresults.py
index 749170ab..1acaf27a 100644
--- a/archivebox/core/admin_archiveresults.py
+++ b/archivebox/core/admin_archiveresults.py
@@ -13,16 +13,16 @@ from archivebox.config import DATA_DIR
from archivebox.config.common import SERVER_CONFIG
from archivebox.misc.paginators import AccelleratedPaginator
from archivebox.base_models.admin import BaseModelAdmin
-from archivebox.hooks import get_extractor_icon
+from archivebox.hooks import get_plugin_icon
from core.models import ArchiveResult, Snapshot
def render_archiveresults_list(archiveresults_qs, limit=50):
- """Render a nice inline list view of archive results with status, extractor, output, and actions."""
+ """Render a nice inline list view of archive results with status, plugin, output, and actions."""
- results = list(archiveresults_qs.order_by('extractor').select_related('snapshot')[:limit])
+ results = list(archiveresults_qs.order_by('plugin').select_related('snapshot')[:limit])
if not results:
return mark_safe('<i>No Archive Results yet...</i>')
@@ -40,8 +40,8 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
status = result.status or 'queued'
color, bg = status_colors.get(status, ('#6b7280', '#f3f4f6'))
- # Get extractor icon
- icon = get_extractor_icon(result.extractor)
+ # Get plugin icon
+ icon = get_plugin_icon(result.plugin)
# Format timestamp
end_time = result.end_ts.strftime('%Y-%m-%d %H:%M:%S') if result.end_ts else '-'
@@ -79,7 +79,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
font-size: 11px; font-weight: 600; text-transform: uppercase;
color: {color}; background: {bg};">{status}
-
+ |
{icon}
|
@@ -88,7 +88,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
title="View output fullscreen"
onmouseover="this.style.color='#2563eb'; this.style.textDecoration='underline';"
onmouseout="this.style.color='#334155'; this.style.textDecoration='none';">
- {result.extractor}
+ {result.plugin}
|
@@ -162,7 +162,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
<th>ID</th>
<th>Status</th>
<th></th>
- <th>Extractor</th>
+ <th>Plugin</th>
<th>Output</th>
<th>Completed</th>
<th>Version</th>
@@ -185,9 +185,9 @@ class ArchiveResultInline(admin.TabularInline):
parent_model = Snapshot
# fk_name = 'snapshot'
extra = 0
- sort_fields = ('end_ts', 'extractor', 'output_str', 'status', 'cmd_version')
+ sort_fields = ('end_ts', 'plugin', 'output_str', 'status', 'cmd_version')
readonly_fields = ('id', 'result_id', 'completed', 'command', 'version')
- fields = ('start_ts', 'end_ts', *readonly_fields, 'extractor', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str')
+ fields = ('start_ts', 'end_ts', *readonly_fields, 'plugin', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str')
# exclude = ('id',)
ordering = ('end_ts',)
show_change_link = True
@@ -253,9 +253,9 @@ class ArchiveResultInline(admin.TabularInline):
class ArchiveResultAdmin(BaseModelAdmin):
- list_display = ('id', 'created_by', 'created_at', 'snapshot_info', 'tags_str', 'status', 'extractor_with_icon', 'cmd_str', 'output_str')
+ list_display = ('id', 'created_by', 'created_at', 'snapshot_info', 'tags_str', 'status', 'plugin_with_icon', 'cmd_str', 'output_str')
- sort_fields = ('id', 'created_by', 'created_at', 'extractor', 'status')
+ sort_fields = ('id', 'created_by', 'created_at', 'plugin', 'status')
- readonly_fields = ('cmd_str', 'snapshot_info', 'tags_str', 'created_at', 'modified_at', 'output_summary', 'extractor_with_icon', 'iface')
+ readonly_fields = ('cmd_str', 'snapshot_info', 'tags_str', 'created_at', 'modified_at', 'output_summary', 'plugin_with_icon', 'iface')
- search_fields = ('id', 'snapshot__url', 'extractor', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp')
+ search_fields = ('id', 'snapshot__url', 'plugin', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp')
autocomplete_fields = ['snapshot']
fieldsets = (
@@ -263,8 +263,8 @@ class ArchiveResultAdmin(BaseModelAdmin):
'fields': ('snapshot', 'snapshot_info', 'tags_str'),
'classes': ('card', 'wide'),
}),
- ('Extractor', {
- 'fields': ('extractor', 'extractor_with_icon', 'status', 'retry_at', 'iface'),
+ ('Plugin', {
+ 'fields': ('plugin', 'plugin_with_icon', 'status', 'retry_at', 'iface'),
'classes': ('card',),
}),
('Timing', {
@@ -285,7 +285,7 @@ class ArchiveResultAdmin(BaseModelAdmin):
}),
)
- list_filter = ('status', 'extractor', 'start_ts', 'cmd_version')
+ list_filter = ('status', 'plugin', 'start_ts', 'cmd_version')
ordering = ['-start_ts']
list_per_page = SERVER_CONFIG.SNAPSHOTS_PER_PAGE
@@ -321,14 +321,14 @@ class ArchiveResultAdmin(BaseModelAdmin):
def tags_str(self, result):
return result.snapshot.tags_str()
- @admin.display(description='Extractor', ordering='extractor')
- def extractor_with_icon(self, result):
- icon = get_extractor_icon(result.extractor)
+ @admin.display(description='Plugin', ordering='plugin')
+ def plugin_with_icon(self, result):
+ icon = get_plugin_icon(result.plugin)
return format_html(
'<span title="{}">{} {}</span>',
- result.extractor,
+ result.plugin,
icon,
- result.extractor,
+ result.plugin,
)
def cmd_str(self, result):
diff --git a/archivebox/core/forms.py b/archivebox/core/forms.py
index a4390d96..4aa2fb9e 100644
--- a/archivebox/core/forms.py
+++ b/archivebox/core/forms.py
@@ -10,19 +10,19 @@ DEPTH_CHOICES = (
('1', 'depth = 1 (archive these URLs and all URLs one hop away)'),
)
-from archivebox.hooks import get_extractors
+from archivebox.hooks import get_plugins
-def get_archive_methods():
- """Get available archive methods from discovered hooks."""
- return [(name, name) for name in get_extractors()]
+def get_plugin_choices():
+ """Get available extractor plugins from discovered hooks."""
+ return [(name, name) for name in get_plugins()]
class AddLinkForm(forms.Form):
url = forms.RegexField(label="URLs (one per line)", regex=URL_REGEX, min_length='6', strip=True, widget=forms.Textarea, required=True)
tag = forms.CharField(label="Tags (comma separated tag1,tag2,tag3)", strip=True, required=False)
depth = forms.ChoiceField(label="Archive depth", choices=DEPTH_CHOICES, initial='0', widget=forms.RadioSelect(attrs={"class": "depth-selection"}))
- archive_methods = forms.MultipleChoiceField(
- label="Archive methods (select at least 1, otherwise all will be used by default)",
+ plugins = forms.MultipleChoiceField(
+ label="Plugins (select at least 1, otherwise all will be used by default)",
required=False,
widget=forms.SelectMultiple,
choices=[], # populated dynamically in __init__
@@ -30,7 +30,7 @@ class AddLinkForm(forms.Form):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
- self.fields['archive_methods'].choices = get_archive_methods()
+ self.fields['plugins'].choices = get_plugin_choices()
# TODO: hook these up to the view and put them
# in a collapsible UI section labeled "Advanced"
#
diff --git a/archivebox/core/models.py b/archivebox/core/models.py
index 928abf80..673c85a9 100755
--- a/archivebox/core/models.py
+++ b/archivebox/core/models.py
@@ -867,6 +867,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea
ArchiveResult.objects.create(
snapshot=self,
plugin=plugin,
+ hook_name=result_data.get('hook_name', ''),
status=result_data.get('status', 'failed'),
output_str=result_data.get('output', ''),
cmd=result_data.get('cmd', []),
@@ -1162,7 +1163,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea
continue
pid_file = plugin_dir / 'hook.pid'
if pid_file.exists():
- kill_process(pid_file)
+ kill_process(pid_file, validate=True) # Use validation
# Update the ArchiveResult from filesystem
plugin_name = plugin_dir.name
diff --git a/archivebox/core/statemachines.py b/archivebox/core/statemachines.py
index 81c453aa..89dda0c8 100644
--- a/archivebox/core/statemachines.py
+++ b/archivebox/core/statemachines.py
@@ -192,7 +192,7 @@ class ArchiveResultMachine(StateMachine, strict_states=True):
def is_backoff(self) -> bool:
"""Check if we should backoff and retry later."""
- # Backoff if status is still started (extractor didn't complete) and output_str is empty
+ # Backoff if status is still started (plugin didn't complete) and output_str is empty
return (
self.archiveresult.status == ArchiveResult.StatusChoices.STARTED and
not self.archiveresult.output_str
@@ -222,19 +222,19 @@ class ArchiveResultMachine(StateMachine, strict_states=True):
# Suppressed: state transition logs
# Lock the object and mark start time
self.archiveresult.update_for_workers(
- retry_at=timezone.now() + timedelta(seconds=120), # 2 min timeout for extractor
+ retry_at=timezone.now() + timedelta(seconds=120), # 2 min timeout for plugin
status=ArchiveResult.StatusChoices.STARTED,
start_ts=timezone.now(),
iface=NetworkInterface.current(),
)
- # Run the extractor - this updates status, output, timestamps, etc.
+ # Run the plugin - this updates status, output, timestamps, etc.
self.archiveresult.run()
# Save the updated result
self.archiveresult.save()
- # Suppressed: extractor result logs (already logged by worker)
+ # Suppressed: plugin result logs (already logged by worker)
@backoff.enter
def enter_backoff(self):
diff --git a/archivebox/core/templatetags/core_tags.py b/archivebox/core/templatetags/core_tags.py
index 33a620c0..685665a4 100644
--- a/archivebox/core/templatetags/core_tags.py
+++ b/archivebox/core/templatetags/core_tags.py
@@ -5,7 +5,7 @@ from django.utils.safestring import mark_safe
from typing import Union
from archivebox.hooks import (
- get_extractor_icon, get_extractor_template, get_extractor_name,
+ get_plugin_icon, get_plugin_template, get_plugin_name,
)
@@ -51,30 +51,30 @@ def url_replace(context, **kwargs):
@register.simple_tag
-def extractor_icon(extractor: str) -> str:
+def plugin_icon(plugin: str) -> str:
"""
- Render the icon for an extractor.
+ Render the icon for a plugin.
- Usage: {% extractor_icon "screenshot" %}
+ Usage: {% plugin_icon "screenshot" %}
"""
- return mark_safe(get_extractor_icon(extractor))
+ return mark_safe(get_plugin_icon(plugin))
@register.simple_tag(takes_context=True)
-def extractor_thumbnail(context, result) -> str:
+def plugin_thumbnail(context, result) -> str:
"""
Render the thumbnail template for an archive result.
- Usage: {% extractor_thumbnail result %}
+ Usage: {% plugin_thumbnail result %}
Context variables passed to template:
- result: ArchiveResult object
- snapshot: Parent Snapshot object
- output_path: Path to output relative to snapshot dir (from embed_path())
- - extractor: Extractor base name
+ - plugin: Plugin base name
"""
- extractor = get_extractor_name(result.extractor)
- template_str = get_extractor_template(extractor, 'thumbnail')
+ plugin = get_plugin_name(result.plugin)
+ template_str = get_plugin_template(plugin, 'thumbnail')
if not template_str:
return ''
@@ -89,7 +89,7 @@ def extractor_thumbnail(context, result) -> str:
'result': result,
'snapshot': result.snapshot,
'output_path': output_path,
- 'extractor': extractor,
+ 'plugin': plugin,
})
return mark_safe(tpl.render(ctx))
except Exception:
@@ -97,14 +97,14 @@ def extractor_thumbnail(context, result) -> str:
@register.simple_tag(takes_context=True)
-def extractor_embed(context, result) -> str:
+def plugin_embed(context, result) -> str:
"""
Render the embed iframe template for an archive result.
- Usage: {% extractor_embed result %}
+ Usage: {% plugin_embed result %}
"""
- extractor = get_extractor_name(result.extractor)
- template_str = get_extractor_template(extractor, 'embed')
+ plugin = get_plugin_name(result.plugin)
+ template_str = get_plugin_template(plugin, 'embed')
if not template_str:
return ''
@@ -117,7 +117,7 @@ def extractor_embed(context, result) -> str:
'result': result,
'snapshot': result.snapshot,
'output_path': output_path,
- 'extractor': extractor,
+ 'plugin': plugin,
})
return mark_safe(tpl.render(ctx))
except Exception:
@@ -125,14 +125,14 @@ def extractor_embed(context, result) -> str:
@register.simple_tag(takes_context=True)
-def extractor_fullscreen(context, result) -> str:
+def plugin_fullscreen(context, result) -> str:
"""
Render the fullscreen template for an archive result.
- Usage: {% extractor_fullscreen result %}
+ Usage: {% plugin_fullscreen result %}
"""
- extractor = get_extractor_name(result.extractor)
- template_str = get_extractor_template(extractor, 'fullscreen')
+ plugin = get_plugin_name(result.plugin)
+ template_str = get_plugin_template(plugin, 'fullscreen')
if not template_str:
return ''
@@ -145,7 +145,7 @@ def extractor_fullscreen(context, result) -> str:
'result': result,
'snapshot': result.snapshot,
'output_path': output_path,
- 'extractor': extractor,
+ 'plugin': plugin,
})
return mark_safe(tpl.render(ctx))
except Exception:
@@ -153,10 +153,10 @@ def extractor_fullscreen(context, result) -> str:
@register.filter
-def extractor_name(value: str) -> str:
+def plugin_name(value: str) -> str:
"""
- Get the base name of an extractor (strips numeric prefix).
+ Get the base name of a plugin (strips numeric prefix).
- Usage: {{ result.extractor|extractor_name }}
+ Usage: {{ result.plugin|plugin_name }}
"""
- return get_extractor_name(value)
+ return get_plugin_name(value)
diff --git a/archivebox/core/views.py b/archivebox/core/views.py
index 1ffb20b8..df17924a 100644
--- a/archivebox/core/views.py
+++ b/archivebox/core/views.py
@@ -56,9 +56,9 @@ class SnapshotView(View):
def render_live_index(request, snapshot):
TITLE_LOADING_MSG = 'Not yet archived...'
- # Dict of extractor -> ArchiveResult object
+ # Dict of plugin -> ArchiveResult object
archiveresult_objects = {}
- # Dict of extractor -> result info dict (for template compatibility)
+ # Dict of plugin -> result info dict (for template compatibility)
archiveresults = {}
results = snapshot.archiveresult_set.all()
@@ -75,16 +75,16 @@ class SnapshotView(View):
continue
# Store the full ArchiveResult object for template tags
- archiveresult_objects[result.extractor] = result
+ archiveresult_objects[result.plugin] = result
result_info = {
- 'name': result.extractor,
+ 'name': result.plugin,
'path': embed_path,
'ts': ts_to_date_str(result.end_ts),
'size': abs_path.stat().st_size or '?',
'result': result, # Include the full object for template tags
}
- archiveresults[result.extractor] = result_info
+ archiveresults[result.plugin] = result_info
# Use canonical_outputs for intelligent discovery
# This method now scans ArchiveResults and uses smart heuristics
@@ -119,8 +119,8 @@ class SnapshotView(View):
# Get available extractors from hooks (sorted by numeric prefix for ordering)
# Convert to base names for display ordering
- all_extractors = [get_extractor_name(e) for e in get_extractors()]
- preferred_types = tuple(all_extractors)
+ all_plugins = [get_plugin_name(e) for e in get_plugins()]
+ preferred_types = tuple(all_plugins)
all_types = preferred_types + tuple(result_type for result_type in archiveresults.keys() if result_type not in preferred_types)
best_result = {'path': 'None', 'result': None}
@@ -463,7 +463,6 @@ class AddView(UserPassesTestMixin, FormView):
urls_content = sources_file.read_text()
crawl = Crawl.objects.create(
urls=urls_content,
- extractor=parser,
max_depth=depth,
tags_str=tag,
label=f'{self.request.user.username}@{HOSTNAME}{self.request.path} {timestamp}',
@@ -598,12 +597,12 @@ def live_progress_view(request):
ArchiveResult.StatusChoices.SUCCEEDED: 2,
ArchiveResult.StatusChoices.FAILED: 3,
}
- return (status_order.get(ar.status, 4), ar.extractor)
+ return (status_order.get(ar.status, 4), ar.plugin)
- all_extractors = [
+ all_plugins = [
{
'id': str(ar.id),
- 'extractor': ar.extractor,
+ 'plugin': ar.plugin,
'status': ar.status,
}
for ar in sorted(snapshot_results, key=extractor_sort_key)
@@ -619,7 +618,7 @@ def live_progress_view(request):
'completed_extractors': completed_extractors,
'failed_extractors': failed_extractors,
'pending_extractors': pending_extractors,
- 'all_extractors': all_extractors,
+ 'all_plugins': all_plugins,
})
# Check if crawl can start (for debugging stuck crawls)
diff --git a/archivebox/crawls/models.py b/archivebox/crawls/models.py
index 3ce21d99..b143f13f 100755
--- a/archivebox/crawls/models.py
+++ b/archivebox/crawls/models.py
@@ -353,10 +353,18 @@ class Crawl(ModelWithOutputDir, ModelWithConfig, ModelWithHealthStats, ModelWith
import signal
from pathlib import Path
from archivebox.hooks import run_hook, discover_hooks
+ from archivebox.misc.process_utils import validate_pid_file
# Kill any background processes by scanning for all .pid files
if self.OUTPUT_DIR.exists():
for pid_file in self.OUTPUT_DIR.glob('**/*.pid'):
+ # Validate PID before killing to avoid killing unrelated processes
+ cmd_file = pid_file.parent / 'cmd.sh'
+ if not validate_pid_file(pid_file, cmd_file):
+ # PID reused by different process or process dead
+ pid_file.unlink(missing_ok=True)
+ continue
+
try:
pid = int(pid_file.read_text().strip())
try:
diff --git a/archivebox/hooks.py b/archivebox/hooks.py
index e308dc51..aff3ea22 100644
--- a/archivebox/hooks.py
+++ b/archivebox/hooks.py
@@ -292,8 +292,13 @@ def run_hook(
stdout_file = output_dir / 'stdout.log'
stderr_file = output_dir / 'stderr.log'
pid_file = output_dir / 'hook.pid'
+ cmd_file = output_dir / 'cmd.sh'
try:
+ # Write command script for validation
+ from archivebox.misc.process_utils import write_cmd_file
+ write_cmd_file(cmd_file, cmd)
+
# Open log files for writing
with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err:
process = subprocess.Popen(
@@ -304,8 +309,10 @@ def run_hook(
env=env,
)
- # Write PID for all hooks (useful for debugging/cleanup)
- pid_file.write_text(str(process.pid))
+ # Write PID with mtime set to process start time for validation
+ from archivebox.misc.process_utils import write_pid_file_with_mtime
+ process_start_time = time.time()
+ write_pid_file_with_mtime(pid_file, process.pid, process_start_time)
if is_background:
# Background hook - return None immediately, don't wait
@@ -1327,21 +1334,29 @@ def process_is_alive(pid_file: Path) -> bool:
return False
-def kill_process(pid_file: Path, sig: int = signal.SIGTERM):
+def kill_process(pid_file: Path, sig: int = signal.SIGTERM, validate: bool = True):
"""
- Kill process in PID file.
+ Kill process in PID file with optional validation.
Args:
pid_file: Path to hook.pid file
sig: Signal to send (default SIGTERM)
+ validate: If True, validate process identity before killing (default: True)
"""
- if not pid_file.exists():
- return
-
- try:
- pid = int(pid_file.read_text().strip())
- os.kill(pid, sig)
- except (OSError, ValueError):
- pass
+ from archivebox.misc.process_utils import safe_kill_process
+
+ if validate:
+ # Use safe kill with validation
+ cmd_file = pid_file.parent / 'cmd.sh'
+ safe_kill_process(pid_file, cmd_file, signal_num=sig, validate=True)
+ else:
+ # Legacy behavior - kill without validation
+ if not pid_file.exists():
+ return
+ try:
+ pid = int(pid_file.read_text().strip())
+ os.kill(pid, sig)
+ except (OSError, ValueError):
+ pass
diff --git a/archivebox/misc/jsonl.py b/archivebox/misc/jsonl.py
index 50cbd3e5..3e9f6e97 100644
--- a/archivebox/misc/jsonl.py
+++ b/archivebox/misc/jsonl.py
@@ -178,7 +178,8 @@ def archiveresult_to_jsonl(result) -> Dict[str, Any]:
'type': TYPE_ARCHIVERESULT,
'id': str(result.id),
'snapshot_id': str(result.snapshot_id),
- 'extractor': result.extractor,
+ 'plugin': result.plugin,
+ 'hook_name': result.hook_name,
'status': result.status,
'output_str': result.output_str,
'start_ts': result.start_ts.isoformat() if result.start_ts else None,
diff --git a/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js b/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js
index 7ee41eda..3ae9a039 100644
--- a/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js
+++ b/archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js
@@ -29,6 +29,98 @@ const http = require('http');
const EXTRACTOR_NAME = 'chrome_launch';
const OUTPUT_DIR = 'chrome';
+// Helper: Write PID file with mtime set to process start time
+function writePidWithMtime(filePath, pid, startTimeSeconds) {
+ fs.writeFileSync(filePath, String(pid));
+ // Set both atime and mtime to process start time for validation
+ const startTimeMs = startTimeSeconds * 1000;
+ fs.utimesSync(filePath, new Date(startTimeMs), new Date(startTimeMs));
+}
+
+// Helper: Write command script for validation
+function writeCmdScript(filePath, binary, args) {
+ // Single-quote every argument (escaping embedded quotes) so the shell
+ // never expands $, ", or backslashes inside arguments
+ const escapedArgs = args.map(arg => `'${arg.replace(/'/g, "'\\''")}'`);
+ const script = `#!/bin/bash\n${binary} ${escapedArgs.join(' ')}\n`;
+ fs.writeFileSync(filePath, script);
+ fs.chmodSync(filePath, 0o755);
+}
+
+// Helper: Get process start time (cross-platform)
+function getProcessStartTime(pid) {
+ try {
+ const { execSync } = require('child_process');
+ if (process.platform === 'darwin') {
+ // macOS: ps -p PID -o lstart= gives start time
+ const output = execSync(`ps -p ${pid} -o lstart=`, { encoding: 'utf8', timeout: 1000 });
+ return Date.parse(output.trim()) / 1000; // Convert to epoch seconds
+ } else {
+ // Linux: read /proc/PID/stat field 22 (starttime in clock ticks)
+ const stat = fs.readFileSync(`/proc/${pid}/stat`, 'utf8');
+ // Fields after the "(comm)" start at field 3 (state), so starttime
+ // (field 22 overall) sits at index 19 of the remainder
+ const fields = stat.split(') ').pop().split(' ');
+ const startTicks = parseInt(fields[19], 10);
+ if (!isNaN(startTicks)) {
+ // Convert clock ticks to seconds (assuming 100 ticks/sec)
+ const uptimeSeconds = parseFloat(fs.readFileSync('/proc/uptime', 'utf8').split(' ')[0]);
+ const bootTime = Date.now() / 1000 - uptimeSeconds;
+ return bootTime + (startTicks / 100);
+ }
+ }
+ } catch (e) {
+ // Can't get start time
+ return null;
+ }
+ return null;
+}
+
+// Helper: Validate PID using mtime and command
+function validatePid(pid, pidFile, cmdFile) {
+ try {
+ // Check process exists
+ try {
+ process.kill(pid, 0); // Signal 0 = check existence
+ } catch (e) {
+ return false; // Process doesn't exist
+ }
+
+ // Check mtime matches process start time (within 5 sec tolerance)
+ const fileStat = fs.statSync(pidFile);
+ const fileMtime = fileStat.mtimeMs / 1000; // Convert to seconds
+ const procStartTime = getProcessStartTime(pid);
+
+ if (procStartTime === null) {
+ // Can't validate - fall back to basic existence check
+ return true;
+ }
+
+ if (Math.abs(fileMtime - procStartTime) > 5) {
+ // PID was reused by different process
+ return false;
+ }
+
+ // Validate command if available
+ if (fs.existsSync(cmdFile)) {
+ const cmd = fs.readFileSync(cmdFile, 'utf8');
+ // Check for Chrome/Chromium and debug port
+ if (!cmd.includes('chrome') && !cmd.includes('chromium')) {
+ return false;
+ }
+ if (!cmd.includes('--remote-debugging-port')) {
+ return false;
+ }
+ }
+
+ return true;
+ } catch (e) {
+ return false;
+ }
+}
+
// Global state for cleanup
let chromePid = null;
@@ -240,17 +332,17 @@ function killZombieChrome() {
const pid = parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
if (isNaN(pid) || pid <= 0) continue;
- // Check if process exists
- try {
- process.kill(pid, 0);
- } catch (e) {
- // Process dead, remove stale PID file
+ // Validate PID before killing
+ const cmdFile = path.join(chromeDir, 'cmd.sh');
+ if (!validatePid(pid, pidFile, cmdFile)) {
+ // PID reused or validation failed
+ console.error(`[!] PID ${pid} failed validation (reused or wrong process) - cleaning up`);
try { fs.unlinkSync(pidFile); } catch (e) {}
continue;
}
- // Process alive but crawl is stale - zombie!
- console.error(`[!] Found zombie (PID ${pid}) from stale crawl ${crawl.name}`);
+ // Process alive, validated, and crawl is stale - zombie!
+ console.error(`[!] Found validated zombie (PID ${pid}) from stale crawl ${crawl.name}`);
try {
// Kill process group first
@@ -386,15 +478,20 @@ async function launchChrome(binary) {
chromeProcess.unref(); // Don't keep Node.js process running
chromePid = chromeProcess.pid;
+ const chromeStartTime = Date.now() / 1000; // Unix epoch seconds
console.error(`[*] Launched Chrome (PID: ${chromePid}), waiting for debug port...`);
- // Write Chrome PID for backup cleanup (named .pid so Crawl.cleanup() finds it)
- fs.writeFileSync(path.join(OUTPUT_DIR, 'chrome.pid'), String(chromePid));
+ // Write Chrome PID with mtime set to start time for validation
+ writePidWithMtime(path.join(OUTPUT_DIR, 'chrome.pid'), chromePid, chromeStartTime);
+
+ // Write command script for validation
+ writeCmdScript(path.join(OUTPUT_DIR, 'cmd.sh'), binary, chromeArgs);
+
fs.writeFileSync(path.join(OUTPUT_DIR, 'port.txt'), String(debugPort));
- // Write hook's own PID so Crawl.cleanup() can kill this hook process
- // (which will trigger our SIGTERM handler to kill Chrome)
- fs.writeFileSync(path.join(OUTPUT_DIR, 'hook.pid'), String(process.pid));
+ // Write hook's own PID with mtime for validation
+ const hookStartTime = Date.now() / 1000;
+ writePidWithMtime(path.join(OUTPUT_DIR, 'hook.pid'), process.pid, hookStartTime);
try {
// Wait for Chrome to be ready
diff --git a/archivebox/templates/admin/progress_monitor.html b/archivebox/templates/admin/progress_monitor.html
index 10286104..a2be9eda 100644
--- a/archivebox/templates/admin/progress_monitor.html
+++ b/archivebox/templates/admin/progress_monitor.html
@@ -488,7 +488,7 @@
${icon}
- ${extractor.extractor}
+ ${extractor.plugin}
`;
@@ -499,10 +499,10 @@
const adminUrl = `/admin/core/snapshot/${snapshot.id}/change/`;
let extractorHtml = '';
- if (snapshot.all_extractors && snapshot.all_extractors.length > 0) {
- // Sort extractors alphabetically by name to prevent reordering on updates
- const sortedExtractors = [...snapshot.all_extractors].sort((a, b) =>
- a.extractor.localeCompare(b.extractor)
+ if (snapshot.all_plugins && snapshot.all_plugins.length > 0) {
+ // Sort plugins alphabetically by name to prevent reordering on updates
+ const sortedExtractors = [...snapshot.all_plugins].sort((a, b) =>
+ a.plugin.localeCompare(b.plugin)
);
extractorHtml = `