rename extractor to plugin everywhere

Nick Sweeting
2025-12-28 04:43:15 -08:00
parent 50e527ec65
commit bd265c0083
19 changed files with 766 additions and 160 deletions

TODO_hook_concurrency.md Normal file

@@ -0,0 +1,486 @@
# ArchiveBox Hook Script Concurrency & Execution Plan
## Overview
Snapshot.run() should enforce that snapshot hooks are run in **10 discrete, sequential "steps"**: `0*`, `1*`, `2*`, `3*`, `4*`, `5*`, `6*`, `7*`, `8*`, `9*`.
For every discovered hook script, ArchiveBox should create an ArchiveResult in the `queued` state, then manage execution using `retry_at` and inline logic to enforce this ordering.
## Hook Numbering Convention
Hook scripts are numbered `00` to `99` to control:
- **First digit (0-9)**: Which step they are part of
- **Second digit (0-9)**: Order within that step
Hook scripts are launched **strictly sequentially** in alphabetical filename order, one set per step; each step's scripts are run before execution moves on to the next step.
**Naming Format:**
```
on_{ModelName}__{run_order}_{human_readable_description}[.bg].{ext}
```
**Examples:**
```
on_Snapshot__00_this_would_run_first.sh
on_Snapshot__05_start_ytdlp_download.bg.sh
on_Snapshot__10_chrome_tab_opened.js
on_Snapshot__50_screenshot.js
on_Snapshot__53_media.bg.py
```
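The numbering convention above can be parsed mechanically. A minimal sketch (the regex and `parse_hook` helper are illustrative, not ArchiveBox's actual parser):

```python
import re

# Matches e.g. "on_Snapshot__53_media.bg.py" (illustrative, not the real code)
HOOK_RE = re.compile(
    r'^on_(?P<model>\w+)__(?P<order>\d{2})_(?P<name>\w+?)(?P<bg>\.bg)?\.(?P<ext>\w+)$'
)

def parse_hook(filename: str) -> dict:
    m = HOOK_RE.match(filename)
    if not m:
        raise ValueError(f'not a hook script: {filename}')
    return {
        'step': int(m['order'][0]),   # first digit: which step (0-9)
        'order': int(m['order'][1]),  # second digit: order within the step
        'name': m['name'],
        'bg': m['bg'] is not None,    # .bg suffix => background script
    }

# Zero-padded prefixes mean a plain alphabetical sort IS the launch order:
launch_order = sorted([
    'on_Snapshot__50_screenshot.js',
    'on_Snapshot__05_start_ytdlp_download.bg.sh',
    'on_Snapshot__00_this_would_run_first.sh',
])
```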
## Background (.bg) vs Foreground Scripts
### Foreground Scripts (no .bg suffix)
- Run sequentially within their step
- Block step progression until they exit
- Should exit naturally when work is complete
- Get killed with SIGTERM if they exceed their `PLUGINNAME_TIMEOUT`
### Background Scripts (.bg suffix)
- Spawned and allowed to continue running
- Do NOT block step progression
- Run until **their own `PLUGINNAME_TIMEOUT` is reached** (not until step 99)
- Get polite SIGTERM when timeout expires, then SIGKILL 60s later if not exited
- Must implement their own concurrency control using filesystem (semaphore files, locks, etc.)
- Should exit naturally when work is complete (best case)
**Important:** If a .bg script starts at step 05 with `MEDIA_TIMEOUT=3600`, it gets the full 3600 seconds regardless of when step 99 completes. It runs on its own timeline.
## Execution Step Guidelines
These are **naming conventions and guidelines**, not enforced checkpoints. They provide semantic organization for plugin ordering:
### Step 0: Pre-Setup
```
00-09: Initial setup, validation, feature detection
```
### Step 1: Chrome Launch & Tab Creation
```
10-19: Browser/tab lifecycle setup
- Chrome browser launch
- Tab creation and CDP connection
```
### Step 2: Navigation & Settlement
```
20-29: Page loading and settling
- Navigate to URL
- Wait for page load
- Initial response capture (responses, ssl, consolelog as .bg listeners)
```
### Step 3: Page Adjustment
```
30-39: DOM manipulation before archiving
- Hide popups/banners
- Solve captchas
- Expand comments/details sections
- Inject custom CSS/JS
- Accessibility modifications
```
### Step 4: Ready for Archiving
```
40-49: Final pre-archiving checks
- Verify page is fully adjusted
- Wait for any pending modifications
```
### Step 5: DOM Extraction (Sequential, Non-BG)
```
50-59: Extractors that need exclusive DOM access
- singlefile (MUST NOT be .bg)
- screenshot (MUST NOT be .bg)
- pdf (MUST NOT be .bg)
- dom (MUST NOT be .bg)
- title
- headers
- readability
- mercury
These MUST run sequentially as they temporarily modify the DOM
during extraction, then revert it. Running in parallel would corrupt results.
```
### Step 6: Post-DOM Extraction
```
60-69: Extractors that don't need DOM or run on downloaded files
- wget
- git
- media (.bg - can run for hours)
- gallerydl (.bg)
- forumdl (.bg)
- papersdl (.bg)
```
### Step 7: Chrome Cleanup
```
70-79: Browser/tab teardown
- Close tabs
- Cleanup Chrome resources
```
### Step 8: Post-Processing
```
80-89: Reprocess outputs from earlier extractors
- OCR of images
- Audio/video transcription
- URL parsing from downloaded content (rss, html, json, txt, csv, md)
- LLM analysis/summarization of outputs
```
### Step 9: Indexing & Finalization
```
90-99: Save to indexes and finalize
- Index text content to Sonic/SQLite FTS
- Create symlinks
- Generate merkle trees
- Final status updates
```
## Hook Script Interface
### Input: CLI Arguments (NOT stdin)
Hooks receive configuration as CLI flags (CSV or JSON-encoded):
```bash
--url="https://example.com"
--snapshot-id="1234-5678-uuid"
--config='{"some_key": "some_value"}'
--plugins=git,media,favicon,title
--timeout=50
--enable-something
```
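A sketch of how a hook could consume these flags with `argparse` (flag names mirror the examples above; the real CLI surface and defaults are assumptions):

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument('--url', required=True)
parser.add_argument('--snapshot-id', dest='snapshot_id', required=True)
parser.add_argument('--config', type=json.loads, default={})           # JSON-encoded
parser.add_argument('--plugins', type=lambda s: s.split(','), default=[])  # CSV
parser.add_argument('--timeout', type=int, default=60)
parser.add_argument('--enable-something', action='store_true')

args = parser.parse_args([
    '--url=https://example.com',
    '--snapshot-id=1234-5678-uuid',
    '--config={"some_key": "some_value"}',
    '--plugins=git,media,favicon,title',
    '--timeout=50',
])
```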
### Input: Environment Variables
All configuration comes from env vars, defined by the JSONSchema in `plugin_dir/config.json`:
```bash
WGET_BINARY=/usr/bin/wget
WGET_TIMEOUT=60
WGET_USER_AGENT="Mozilla/5.0..."
WGET_EXTRA_ARGS="--no-check-certificate"
SAVE_WGET=True
```
**Required:** Every plugin must support `PLUGINNAME_TIMEOUT` for self-termination.
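For illustration, a wget plugin's `config.json` might declare the env vars above like this (the exact schema layout here is an assumption):

```json
{
  "type": "object",
  "properties": {
    "WGET_BINARY":     {"type": "string",  "default": "wget"},
    "WGET_TIMEOUT":    {"type": "integer", "default": 60},
    "WGET_USER_AGENT": {"type": "string",  "default": "Mozilla/5.0..."},
    "WGET_EXTRA_ARGS": {"type": "string",  "default": ""},
    "SAVE_WGET":       {"type": "boolean", "default": true}
  }
}
```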
### Output: Filesystem (CWD)
Hooks read/write files to:
- `$CWD`: Their own output subdirectory (e.g., `archive/snapshots/{id}/wget/`)
- `$CWD/..`: Parent directory (to read outputs from other hooks)
This allows hooks to:
- Access files created by other hooks
- Keep their outputs separate by default
- Use semaphore files for coordination (if needed)
### Output: JSONL to stdout
Hooks emit one JSONL line per database record they want to create or update:
```jsonl
{"type": "Tag", "name": "sci-fi"}
{"type": "ArchiveResult", "id": "1234-uuid", "status": "succeeded", "output_str": "wget/index.html"}
{"type": "Snapshot", "id": "5678-uuid", "title": "Example Page"}
```
See `archivebox/misc/jsonl.py` and model `from_json()` / `from_jsonl()` methods for full list of supported types and fields.
### Output: stderr for Human Logs
Hooks should emit human-readable output or debug info to **stderr**. There are no guarantees this will be persisted long-term. Use stdout JSONL or filesystem for outputs that matter.
### Cleanup: Delete Cruft
If hooks emit no meaningful long-term outputs, they should delete any temporary files themselves to avoid wasting space. However, the ArchiveResult DB row should be kept so we know:
- It doesn't need to be retried
- It isn't missing
- What happened (status, error message)
### Signal Handling: SIGINT/SIGTERM
Hooks are expected to listen for a polite `SIGINT`/`SIGTERM`, wrap up quickly, and exit cleanly. Beyond that, they may be `SIGKILL`'d at ArchiveBox's discretion.
**If hooks double-fork or spawn long-running processes:** They must output a `.pid` file in their directory so zombies can be swept safely.
## Hook Failure Modes & Retry Logic
Hooks can fail in several ways. ArchiveBox handles each differently:
### 1. Soft Failure (Record & Don't Retry)
**Exit:** `0` (success)
**JSONL:** `{"type": "ArchiveResult", "status": "failed", "output_str": "404 Not Found"}`
This means: "I ran successfully, but the resource wasn't available." Don't retry this.
**Use cases:**
- 404 errors
- Content not available
- Feature not applicable to this URL
### 2. Hard Failure / Temporary Error (Retry Later)
**Exit:** Non-zero (1, 2, etc.)
**JSONL:** None (or incomplete)
This means: "Something went wrong, I couldn't complete." Treat this ArchiveResult as "missing" and set `retry_at` for later.
**Use cases:**
- 500 server errors
- Network timeouts
- Binary not found / crashed
- Transient errors
**Behavior:**
- ArchiveBox sets `retry_at` on the ArchiveResult
- Hook will be retried during next `archivebox update`
### 3. Partial Success (Update & Continue)
**Exit:** Non-zero
**JSONL:** Partial records emitted before crash
**Behavior:**
- Update ArchiveResult with whatever was emitted
- Mark remaining work as "missing" with `retry_at`
### 4. Success (Record & Continue)
**Exit:** `0`
**JSONL:** `{"type": "ArchiveResult", "status": "succeeded", "output_str": "output/file.html"}`
This is the happy path.
### Error Handling Rules
- **DO NOT skip hooks** based on failures
- **Continue to next hook** regardless of foreground or background failures
- **Update ArchiveResults** with whatever information is available
- **Set retry_at** for "missing" or temporarily-failed hooks
- **Let background scripts continue** even if foreground scripts fail
## File Structure
```
archivebox/plugins/{plugin_name}/
├── config.json # JSONSchema: env var config options
├── binaries.jsonl # Runtime dependencies: apt|brew|pip|npm|env
├── on_Snapshot__XX_name.py # Hook script (foreground)
├── on_Snapshot__XX_name.bg.py # Hook script (background)
└── tests/
└── test_name.py
```
## Implementation Checklist
### Phase 1: Renumber Existing Hooks
- [ ] Renumber DOM extractors to 50-59 range
- [ ] Ensure pdf/screenshot are NOT .bg (need sequential access)
- [ ] Ensure media (ytdlp) IS .bg (can run for hours)
- [ ] Add step comments to each plugin for clarity
### Phase 2: Timeout Consistency ✅
- [x] All plugins support `PLUGINNAME_TIMEOUT` env var
- [x] All plugins fall back to generic `TIMEOUT` env var
- [x] Background scripts handle SIGTERM gracefully (or exit naturally)
### Phase 3: Refactor Snapshot.run()
- [ ] Parse hook filenames to extract step number (first digit)
- [ ] Group hooks by step (0-9)
- [ ] Run each step sequentially
- [ ] Within each step:
- [ ] Launch foreground hooks sequentially
- [ ] Launch .bg hooks and track PIDs
- [ ] Wait for foreground hooks to complete before next step
- [ ] Track .bg script timeouts independently
- [ ] Send SIGTERM to .bg scripts when their timeout expires
- [ ] Send SIGKILL 60s after SIGTERM if not exited
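The Phase 3 checklist can be sketched as a step loop. A minimal sketch, assuming each hook is a dict with a `cmd` argv list, a `step` digit, a `bg` flag, and a per-plugin `timeout` in seconds (hook discovery and DB updates are elided):

```python
import subprocess
import time

def run_steps(hooks: list[dict]) -> None:
    background = []  # (Popen, absolute deadline) pairs
    for step in range(10):
        in_step = sorted((h for h in hooks if h['step'] == step),
                         key=lambda h: h['cmd'])
        for hook in in_step:
            if hook['bg']:
                # Spawn and keep going - .bg scripts never block the step
                proc = subprocess.Popen(hook['cmd'])
                background.append((proc, time.monotonic() + hook['timeout']))
            else:
                try:
                    # Foreground: block until exit (note run() SIGKILLs on
                    # timeout; the plan calls for a polite SIGTERM first)
                    subprocess.run(hook['cmd'], timeout=hook['timeout'])
                except subprocess.TimeoutExpired:
                    pass  # record as missing + set retry_at, then continue
    # After step 9, enforce each .bg script's own timeout independently.
    # (A real implementation would poll all deadlines, not wait in order.)
    for proc, deadline in background:
        try:
            proc.wait(timeout=max(0.0, deadline - time.monotonic()))
        except subprocess.TimeoutExpired:
            proc.terminate()              # polite SIGTERM at timeout...
            try:
                proc.wait(timeout=60)     # ...then SIGKILL 60s later
            except subprocess.TimeoutExpired:
                proc.kill()
```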
### Phase 4: ArchiveResult Management
- [ ] Create one ArchiveResult per hook (not per plugin)
- [ ] Set initial state to `queued`
- [ ] Update state based on JSONL output and exit code
- [ ] Set `retry_at` for hooks that exit non-zero with no JSONL
- [ ] Don't retry hooks that emit `{"status": "failed"}`
### Phase 5: JSONL Streaming
- [ ] Parse stdout JSONL line-by-line during hook execution
- [ ] Create/update DB rows as JSONL is emitted (streaming mode)
- [ ] Handle partial JSONL on hook crash
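Streaming JSONL handling can be sketched like so: parse stdout line-by-line while the hook runs, tolerating a trailing partial line if the hook crashes mid-write (`stream_jsonl` is an illustrative name):

```python
import json
import subprocess

def stream_jsonl(cmd: list[str]) -> list[dict]:
    records = []
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:  # yields lines as the hook emits them
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))  # create/update DB row here
            except json.JSONDecodeError:
                pass  # partial/garbled line from a crash - skip it
    return records
```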
### Phase 6: Zombie Process Management
- [ ] Read `.pid` files from hook output directories
- [ ] Sweep zombies on cleanup
- [ ] Handle double-forked processes correctly
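The `.pid` sweep might look like this sketch, assuming double-forking hooks wrote their child PIDs into `.pid` files as described in the signal-handling section (function name and layout are assumptions):

```python
import os
import signal
from pathlib import Path

def sweep_zombies(snapshot_dir: Path) -> None:
    for pid_file in snapshot_dir.glob('*/*.pid'):
        try:
            pid = int(pid_file.read_text().strip())
            os.kill(pid, 0)               # probe: raises if already gone
            os.kill(pid, signal.SIGTERM)  # still alive after cleanup: SIGTERM
        except (ValueError, ProcessLookupError, PermissionError):
            pass  # stale/garbled .pid file, or process already exited
        finally:
            pid_file.unlink(missing_ok=True)
```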
## Migration Path
### Backward Compatibility
During migration, support both old and new numbering:
1. Run hooks numbered 00-99 in step order
2. Run unnumbered hooks last (step 9) for compatibility
3. Log warnings for unnumbered hooks
4. Eventually require all hooks to be numbered
### Renumbering Map
**Current → New:**
```
git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py
media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py
gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py
forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py
papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py
readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py
mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__56_mercury.py
singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py
screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js
pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js
dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js
title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js
headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js
wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py
```
## Testing Strategy
### Unit Tests
- Test hook ordering (00-99)
- Test step grouping (first digit)
- Test .bg vs foreground execution
- Test timeout enforcement
- Test JSONL parsing
- Test failure modes & retry_at logic
### Integration Tests
- Test full Snapshot.run() with mixed hooks
- Test .bg scripts running beyond step 99
- Test zombie process cleanup
- Test graceful SIGTERM handling
- Test concurrent .bg script coordination
### Performance Tests
- Measure overhead of per-hook ArchiveResults
- Test with 50+ concurrent .bg scripts
- Test filesystem contention with many hooks
## Open Questions
### Q: Should we provide semaphore utilities?
**A:** No. Keep plugins decoupled. Let them use simple filesystem coordination if needed.
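The filesystem coordination suggested here can be sketched as a minimal lock-file helper; `O_CREAT | O_EXCL` creation is atomic on POSIX, so the first hook to create the file wins (names are illustrative):

```python
import os

def try_acquire(lock_path: str) -> bool:
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another hook holds the lock
    os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
    os.close(fd)
    return True

def release(lock_path: str) -> None:
    os.unlink(lock_path)
```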
### Q: What happens if ArchiveResult table gets huge?
**A:** We can delete old successful ArchiveResults periodically, or archive them to cold storage. The important data is in the filesystem outputs.
### Q: Should naturally-exiting .bg scripts still be .bg?
**A:** Yes. The .bg suffix means "don't block step progression," not "run until step 99." Natural exit is the best case.
## Examples
### Foreground Hook (Sequential DOM Access)
```python
#!/usr/bin/env python3
# e.g. archivebox/plugins/screenshot/on_Snapshot__51_screenshot.py
# Runs at step 5, blocks step progression until complete
# Gets killed if it exceeds SCREENSHOT_TIMEOUT
import json
import os
import subprocess
import sys

def get_env_int(key, default=None):
    """Read an integer env var, falling back to default if unset."""
    value = os.environ.get(key)
    return int(value) if value else default

timeout = get_env_int('SCREENSHOT_TIMEOUT') or get_env_int('TIMEOUT', 60)
cmd = ['screenshot-tool']  # placeholder for the real screenshot command

try:
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "screenshot.png",
        }))
        sys.exit(0)
    else:
        # Temporary failure - will be retried
        sys.exit(1)
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```
### Background Hook (Long-Running Download)
```python
#!/usr/bin/env python3
# archivebox/plugins/media/on_Snapshot__63_media.bg.py
# Runs at step 6, doesn't block step progression
# Gets full MEDIA_TIMEOUT (e.g., 3600s) regardless of when step 99 completes
import json
import os
import subprocess
import sys

def get_env_int(key, default=None):
    value = os.environ.get(key)
    return int(value) if value else default

# URL is passed via the --url CLI flag (see Hook Script Interface above)
url = next(arg.split('=', 1)[1] for arg in sys.argv if arg.startswith('--url='))
timeout = get_env_int('YTDLP_TIMEOUT') or get_env_int('MEDIA_TIMEOUT') or get_env_int('TIMEOUT', 3600)

try:
    result = subprocess.run(['yt-dlp', url], capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "media/",
        }))
        sys.exit(0)
    else:
        # Soft failure - record it, don't retry
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "failed",
            "output_str": "Video unavailable",
        }))
        sys.exit(0)  # Exit 0 to record the failure
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```
### Background Hook with Natural Exit
```javascript
#!/usr/bin/env node
// archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js
// Sets up listener, captures SSL info, then exits naturally
// No SIGTERM handler needed - already exits when done
const fs = require('fs');
// connectToChrome() and waitForNavigation() are assumed plugin helpers
const { connectToChrome, waitForNavigation } = require('./helpers');

async function main() {
    const page = await connectToChrome();

    // Set up listener
    page.on('response', async (response) => {
        const securityDetails = response.securityDetails();
        if (securityDetails) {
            fs.writeFileSync('ssl.json', JSON.stringify(securityDetails));
        }
    });

    // Wait for navigation (done by other hook)
    await waitForNavigation();

    // Emit result
    console.log(JSON.stringify({
        type: 'ArchiveResult',
        status: 'succeeded',
        output_str: 'ssl.json',
    }));
    process.exit(0); // Natural exit - no need to wait indefinitely
}

main().catch(e => {
    console.error(`ERROR: ${e.message}`);
    process.exit(1); // Will be retried
});
```
## Summary
This plan provides:
- ✅ Clear execution ordering (10 steps, 00-99 numbering)
- ✅ Async support (.bg suffix)
- ✅ Independent timeout control per plugin
- ✅ Flexible failure handling & retry logic
- ✅ Streaming JSONL output for DB updates
- ✅ Simple filesystem-based coordination
- ✅ Backward compatibility during migration
The main implementation work is refactoring `Snapshot.run()` to enforce step ordering and manage .bg script lifecycles. Plugin renumbering is straightforward mechanical work.


@@ -54,7 +54,7 @@ class AddCommandSchema(Schema):
tag: str = ""
depth: int = 0
parser: str = "auto"
extract: str = ""
plugins: str = ""
update: bool = not ARCHIVING_CONFIG.ONLY_NEW # Default to the opposite of ARCHIVING_CONFIG.ONLY_NEW
overwrite: bool = False
index_only: bool = False
@@ -69,7 +69,7 @@ class UpdateCommandSchema(Schema):
status: Optional[StatusChoices] = StatusChoices.unarchived
filter_type: Optional[str] = FilterTypeChoices.substring
filter_patterns: Optional[List[str]] = ['https://example.com']
extractors: Optional[str] = ""
plugins: Optional[str] = ""
class ScheduleCommandSchema(Schema):
import_path: Optional[str] = None
@@ -115,7 +115,7 @@ def cli_add(request, args: AddCommandSchema):
update=args.update,
index_only=args.index_only,
overwrite=args.overwrite,
plugins=args.extract, # extract in API maps to plugins param
plugins=args.plugins,
parser=args.parser,
bg=True, # Always run in background for API calls
)
@@ -143,7 +143,7 @@ def cli_update(request, args: UpdateCommandSchema):
status=args.status,
filter_type=args.filter_type,
filter_patterns=args.filter_patterns,
extractors=args.extractors,
plugins=args.plugins,
)
return {
"success": True,


@@ -65,7 +65,8 @@ class MinimalArchiveResultSchema(Schema):
created_by_username: str
status: str
retry_at: datetime | None
extractor: str
plugin: str
hook_name: str
cmd_version: str | None
cmd: list[str] | None
pwd: str | None
@@ -113,13 +114,14 @@ class ArchiveResultSchema(MinimalArchiveResultSchema):
class ArchiveResultFilterSchema(FilterSchema):
id: Optional[str] = Field(None, q=['id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'extractor', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'plugin', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
snapshot_id: Optional[str] = Field(None, q=['snapshot__id__startswith', 'snapshot__timestamp__startswith'])
snapshot_url: Optional[str] = Field(None, q='snapshot__url__icontains')
snapshot_tag: Optional[str] = Field(None, q='snapshot__tags__name__icontains')
status: Optional[str] = Field(None, q='status')
output_str: Optional[str] = Field(None, q='output_str__icontains')
extractor: Optional[str] = Field(None, q='extractor__icontains')
plugin: Optional[str] = Field(None, q='plugin__icontains')
hook_name: Optional[str] = Field(None, q='hook_name__icontains')
cmd: Optional[str] = Field(None, q='cmd__0__icontains')
pwd: Optional[str] = Field(None, q='pwd__icontains')
cmd_version: Optional[str] = Field(None, q='cmd_version')


@@ -86,7 +86,7 @@ def add(urls: str | list[str],
'ONLY_NEW': not update,
'INDEX_ONLY': index_only,
'OVERWRITE': overwrite,
'EXTRACTORS': plugins,
'PLUGINS': plugins,
'DEFAULT_PERSONA': persona or 'Default',
'PARSER': parser,
}


@@ -40,7 +40,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int:
"""
Run extraction for a single ArchiveResult by ID (used by workers).
Triggers the ArchiveResult's state machine tick() to run the extractor.
Triggers the ArchiveResult's state machine tick() to run the extractor plugin.
"""
from rich import print as rprint
from core.models import ArchiveResult
@@ -51,7 +51,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int:
rprint(f'[red]ArchiveResult {archiveresult_id} not found[/red]', file=sys.stderr)
return 1
rprint(f'[blue]Extracting {archiveresult.extractor} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr)
rprint(f'[blue]Extracting {archiveresult.plugin} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr)
try:
# Trigger state machine tick - this runs the actual extraction
@@ -151,7 +151,7 @@ def run_plugins(
# Only create for specific plugin
result, created = ArchiveResult.objects.get_or_create(
snapshot=snapshot,
extractor=plugin,
plugin=plugin,
defaults={
'status': ArchiveResult.StatusChoices.QUEUED,
'retry_at': timezone.now(),
@@ -193,7 +193,7 @@ def run_plugins(
snapshot = Snapshot.objects.get(id=snapshot_id)
results = snapshot.archiveresult_set.all()
if plugin:
results = results.filter(extractor=plugin)
results = results.filter(plugin=plugin)
for result in results:
if is_tty:
@@ -202,7 +202,7 @@ def run_plugins(
'failed': 'red',
'skipped': 'yellow',
}.get(result.status, 'dim')
rprint(f' [{status_color}]{result.status}[/{status_color}] {result.extractor}{result.output_str or ""}', file=sys.stderr)
rprint(f' [{status_color}]{result.status}[/{status_color}] {result.plugin}{result.output_str or ""}', file=sys.stderr)
else:
write_record(archiveresult_to_jsonl(result))
except Snapshot.DoesNotExist:


@@ -13,16 +13,16 @@ from archivebox.config import DATA_DIR
from archivebox.config.common import SERVER_CONFIG
from archivebox.misc.paginators import AccelleratedPaginator
from archivebox.base_models.admin import BaseModelAdmin
from archivebox.hooks import get_extractor_icon
from archivebox.hooks import get_plugin_icon
from core.models import ArchiveResult, Snapshot
def render_archiveresults_list(archiveresults_qs, limit=50):
"""Render a nice inline list view of archive results with status, extractor, output, and actions."""
"""Render a nice inline list view of archive results with status, plugin, output, and actions."""
results = list(archiveresults_qs.order_by('extractor').select_related('snapshot')[:limit])
results = list(archiveresults_qs.order_by('plugin').select_related('snapshot')[:limit])
if not results:
return mark_safe('<div style="color: #64748b; font-style: italic; padding: 16px 0;">No Archive Results yet...</div>')
@@ -40,8 +40,8 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
status = result.status or 'queued'
color, bg = status_colors.get(status, ('#6b7280', '#f3f4f6'))
# Get extractor icon
icon = get_extractor_icon(result.extractor)
# Get plugin icon
icon = get_plugin_icon(result.plugin)
# Format timestamp
end_time = result.end_ts.strftime('%Y-%m-%d %H:%M:%S') if result.end_ts else '-'
@@ -79,7 +79,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
font-size: 11px; font-weight: 600; text-transform: uppercase;
color: {color}; background: {bg};">{status}</span>
</td>
<td style="padding: 10px 12px; white-space: nowrap; font-size: 20px;" title="{result.extractor}">
<td style="padding: 10px 12px; white-space: nowrap; font-size: 20px;" title="{result.plugin}">
{icon}
</td>
<td style="padding: 10px 12px; font-weight: 500; color: #334155;">
@@ -88,7 +88,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
title="View output fullscreen"
onmouseover="this.style.color='#2563eb'; this.style.textDecoration='underline';"
onmouseout="this.style.color='#334155'; this.style.textDecoration='none';">
{result.extractor}
{result.plugin}
</a>
</td>
<td style="padding: 10px 12px; max-width: 280px;">
@@ -162,7 +162,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">ID</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Status</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; width: 32px;"></th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Extractor</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Plugin</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Output</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Completed</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Version</th>
@@ -185,9 +185,9 @@ class ArchiveResultInline(admin.TabularInline):
parent_model = Snapshot
# fk_name = 'snapshot'
extra = 0
sort_fields = ('end_ts', 'extractor', 'output_str', 'status', 'cmd_version')
sort_fields = ('end_ts', 'plugin', 'output_str', 'status', 'cmd_version')
readonly_fields = ('id', 'result_id', 'completed', 'command', 'version')
fields = ('start_ts', 'end_ts', *readonly_fields, 'extractor', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str')
fields = ('start_ts', 'end_ts', *readonly_fields, 'plugin', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str')
# exclude = ('id',)
ordering = ('end_ts',)
show_change_link = True
@@ -253,9 +253,9 @@ class ArchiveResultInline(admin.TabularInline):
class ArchiveResultAdmin(BaseModelAdmin):
list_display = ('id', 'created_by', 'created_at', 'snapshot_info', 'tags_str', 'status', 'extractor_with_icon', 'cmd_str', 'output_str')
sort_fields = ('id', 'created_by', 'created_at', 'extractor', 'status')
sort_fields = ('id', 'created_by', 'created_at', 'plugin', 'status')
readonly_fields = ('cmd_str', 'snapshot_info', 'tags_str', 'created_at', 'modified_at', 'output_summary', 'extractor_with_icon', 'iface')
search_fields = ('id', 'snapshot__url', 'extractor', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp')
search_fields = ('id', 'snapshot__url', 'plugin', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp')
autocomplete_fields = ['snapshot']
fieldsets = (
@@ -263,8 +263,8 @@ class ArchiveResultAdmin(BaseModelAdmin):
'fields': ('snapshot', 'snapshot_info', 'tags_str'),
'classes': ('card', 'wide'),
}),
('Extractor', {
'fields': ('extractor', 'extractor_with_icon', 'status', 'retry_at', 'iface'),
('Plugin', {
'fields': ('plugin', 'plugin_with_icon', 'status', 'retry_at', 'iface'),
'classes': ('card',),
}),
('Timing', {
@@ -285,7 +285,7 @@ class ArchiveResultAdmin(BaseModelAdmin):
}),
)
list_filter = ('status', 'extractor', 'start_ts', 'cmd_version')
list_filter = ('status', 'plugin', 'start_ts', 'cmd_version')
ordering = ['-start_ts']
list_per_page = SERVER_CONFIG.SNAPSHOTS_PER_PAGE
@@ -321,14 +321,14 @@ class ArchiveResultAdmin(BaseModelAdmin):
def tags_str(self, result):
return result.snapshot.tags_str()
@admin.display(description='Extractor', ordering='extractor')
def extractor_with_icon(self, result):
icon = get_extractor_icon(result.extractor)
@admin.display(description='Plugin', ordering='plugin')
def plugin_with_icon(self, result):
icon = get_plugin_icon(result.plugin)
return format_html(
'<span title="{}">{}</span> {}',
result.extractor,
result.plugin,
icon,
result.extractor,
result.plugin,
)
def cmd_str(self, result):


@@ -10,19 +10,19 @@ DEPTH_CHOICES = (
('1', 'depth = 1 (archive these URLs and all URLs one hop away)'),
)
from archivebox.hooks import get_extractors
from archivebox.hooks import get_plugins
def get_archive_methods():
"""Get available archive methods from discovered hooks."""
return [(name, name) for name in get_extractors()]
def get_plugin_choices():
"""Get available extractor plugins from discovered hooks."""
return [(name, name) for name in get_plugins()]
class AddLinkForm(forms.Form):
url = forms.RegexField(label="URLs (one per line)", regex=URL_REGEX, min_length='6', strip=True, widget=forms.Textarea, required=True)
tag = forms.CharField(label="Tags (comma separated tag1,tag2,tag3)", strip=True, required=False)
depth = forms.ChoiceField(label="Archive depth", choices=DEPTH_CHOICES, initial='0', widget=forms.RadioSelect(attrs={"class": "depth-selection"}))
archive_methods = forms.MultipleChoiceField(
label="Archive methods (select at least 1, otherwise all will be used by default)",
plugins = forms.MultipleChoiceField(
label="Plugins (select at least 1, otherwise all will be used by default)",
required=False,
widget=forms.SelectMultiple,
choices=[], # populated dynamically in __init__
@@ -30,7 +30,7 @@ class AddLinkForm(forms.Form):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.fields['archive_methods'].choices = get_archive_methods()
self.fields['plugins'].choices = get_plugin_choices()
# TODO: hook these up to the view and put them
# in a collapsible UI section labeled "Advanced"
#


@@ -867,6 +867,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea
ArchiveResult.objects.create(
snapshot=self,
plugin=plugin,
hook_name=result_data.get('hook_name', ''),
status=result_data.get('status', 'failed'),
output_str=result_data.get('output', ''),
cmd=result_data.get('cmd', []),
@@ -1162,7 +1163,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea
continue
pid_file = plugin_dir / 'hook.pid'
if pid_file.exists():
kill_process(pid_file)
kill_process(pid_file, validate=True) # Use validation
# Update the ArchiveResult from filesystem
plugin_name = plugin_dir.name


@@ -192,7 +192,7 @@ class ArchiveResultMachine(StateMachine, strict_states=True):
def is_backoff(self) -> bool:
"""Check if we should backoff and retry later."""
# Backoff if status is still started (extractor didn't complete) and output_str is empty
# Backoff if status is still started (plugin didn't complete) and output_str is empty
return (
self.archiveresult.status == ArchiveResult.StatusChoices.STARTED and
not self.archiveresult.output_str
@@ -222,19 +222,19 @@ class ArchiveResultMachine(StateMachine, strict_states=True):
# Suppressed: state transition logs
# Lock the object and mark start time
self.archiveresult.update_for_workers(
retry_at=timezone.now() + timedelta(seconds=120), # 2 min timeout for extractor
retry_at=timezone.now() + timedelta(seconds=120), # 2 min timeout for plugin
status=ArchiveResult.StatusChoices.STARTED,
start_ts=timezone.now(),
iface=NetworkInterface.current(),
)
# Run the extractor - this updates status, output, timestamps, etc.
# Run the plugin - this updates status, output, timestamps, etc.
self.archiveresult.run()
# Save the updated result
self.archiveresult.save()
# Suppressed: extractor result logs (already logged by worker)
# Suppressed: plugin result logs (already logged by worker)
@backoff.enter
def enter_backoff(self):


@@ -5,7 +5,7 @@ from django.utils.safestring import mark_safe
from typing import Union
from archivebox.hooks import (
get_extractor_icon, get_extractor_template, get_extractor_name,
get_plugin_icon, get_plugin_template, get_plugin_name,
)
@@ -51,30 +51,30 @@ def url_replace(context, **kwargs):
@register.simple_tag
def extractor_icon(extractor: str) -> str:
def plugin_icon(plugin: str) -> str:
"""
Render the icon for an extractor.
Render the icon for a plugin.
Usage: {% extractor_icon "screenshot" %}
Usage: {% plugin_icon "screenshot" %}
"""
return mark_safe(get_extractor_icon(extractor))
return mark_safe(get_plugin_icon(plugin))
@register.simple_tag(takes_context=True)
def extractor_thumbnail(context, result) -> str:
def plugin_thumbnail(context, result) -> str:
"""
Render the thumbnail template for an archive result.
Usage: {% extractor_thumbnail result %}
Usage: {% plugin_thumbnail result %}
Context variables passed to template:
- result: ArchiveResult object
- snapshot: Parent Snapshot object
- output_path: Path to output relative to snapshot dir (from embed_path())
- extractor: Extractor base name
- plugin: Plugin base name
"""
extractor = get_extractor_name(result.extractor)
template_str = get_extractor_template(extractor, 'thumbnail')
plugin = get_plugin_name(result.plugin)
template_str = get_plugin_template(plugin, 'thumbnail')
if not template_str:
return ''
@@ -89,7 +89,7 @@ def extractor_thumbnail(context, result) -> str:
'result': result,
'snapshot': result.snapshot,
'output_path': output_path,
'extractor': extractor,
'plugin': plugin,
})
return mark_safe(tpl.render(ctx))
except Exception:
@@ -97,14 +97,14 @@ def extractor_thumbnail(context, result) -> str:
@register.simple_tag(takes_context=True)
def extractor_embed(context, result) -> str:
def plugin_embed(context, result) -> str:
"""
Render the embed iframe template for an archive result.
Usage: {% extractor_embed result %}
Usage: {% plugin_embed result %}
"""
extractor = get_extractor_name(result.extractor)
template_str = get_extractor_template(extractor, 'embed')
plugin = get_plugin_name(result.plugin)
template_str = get_plugin_template(plugin, 'embed')
if not template_str:
return ''
@@ -117,7 +117,7 @@ def extractor_embed(context, result) -> str:
'result': result,
'snapshot': result.snapshot,
'output_path': output_path,
'extractor': extractor,
'plugin': plugin,
})
return mark_safe(tpl.render(ctx))
except Exception:
@@ -125,14 +125,14 @@ def extractor_embed(context, result) -> str:
@register.simple_tag(takes_context=True)
def extractor_fullscreen(context, result) -> str:
def plugin_fullscreen(context, result) -> str:
"""
Render the fullscreen template for an archive result.
Usage: {% extractor_fullscreen result %}
Usage: {% plugin_fullscreen result %}
"""
extractor = get_extractor_name(result.extractor)
template_str = get_extractor_template(extractor, 'fullscreen')
plugin = get_plugin_name(result.plugin)
template_str = get_plugin_template(plugin, 'fullscreen')
if not template_str:
return ''
@@ -145,7 +145,7 @@ def extractor_fullscreen(context, result) -> str:
'result': result,
'snapshot': result.snapshot,
'output_path': output_path,
'extractor': extractor,
'plugin': plugin,
})
return mark_safe(tpl.render(ctx))
except Exception:
@@ -153,10 +153,10 @@ def extractor_fullscreen(context, result) -> str:
@register.filter
def extractor_name(value: str) -> str:
def plugin_name(value: str) -> str:
"""
Get the base name of an extractor (strips numeric prefix).
Get the base name of a plugin (strips numeric prefix).
Usage: {{ result.extractor|extractor_name }}
Usage: {{ result.plugin|plugin_name }}
"""
return get_extractor_name(value)
return get_plugin_name(value)

View File

@@ -56,9 +56,9 @@ class SnapshotView(View):
def render_live_index(request, snapshot):
TITLE_LOADING_MSG = 'Not yet archived...'
# Dict of extractor -> ArchiveResult object
# Dict of plugin -> ArchiveResult object
archiveresult_objects = {}
# Dict of extractor -> result info dict (for template compatibility)
# Dict of plugin -> result info dict (for template compatibility)
archiveresults = {}
results = snapshot.archiveresult_set.all()
@@ -75,16 +75,16 @@ class SnapshotView(View):
continue
# Store the full ArchiveResult object for template tags
archiveresult_objects[result.extractor] = result
archiveresult_objects[result.plugin] = result
result_info = {
'name': result.extractor,
'name': result.plugin,
'path': embed_path,
'ts': ts_to_date_str(result.end_ts),
'size': abs_path.stat().st_size or '?',
'result': result, # Include the full object for template tags
}
archiveresults[result.extractor] = result_info
archiveresults[result.plugin] = result_info
# Use canonical_outputs for intelligent discovery
# This method now scans ArchiveResults and uses smart heuristics
@@ -119,8 +119,8 @@ class SnapshotView(View):
# Get available extractors from hooks (sorted by numeric prefix for ordering)
# Convert to base names for display ordering
all_extractors = [get_extractor_name(e) for e in get_extractors()]
preferred_types = tuple(all_extractors)
all_plugins = [get_plugin_name(e) for e in get_extractors()]
preferred_types = tuple(all_plugins)
all_types = preferred_types + tuple(result_type for result_type in archiveresults.keys() if result_type not in preferred_types)
best_result = {'path': 'None', 'result': None}
@@ -463,7 +463,6 @@ class AddView(UserPassesTestMixin, FormView):
urls_content = sources_file.read_text()
crawl = Crawl.objects.create(
urls=urls_content,
extractor=parser,
max_depth=depth,
tags_str=tag,
label=f'{self.request.user.username}@{HOSTNAME}{self.request.path} {timestamp}',
@@ -598,12 +597,12 @@ def live_progress_view(request):
ArchiveResult.StatusChoices.SUCCEEDED: 2,
ArchiveResult.StatusChoices.FAILED: 3,
}
return (status_order.get(ar.status, 4), ar.extractor)
return (status_order.get(ar.status, 4), ar.plugin)
all_extractors = [
all_plugins = [
{
'id': str(ar.id),
'extractor': ar.extractor,
'plugin': ar.plugin,
'status': ar.status,
}
for ar in sorted(snapshot_results, key=extractor_sort_key)
@@ -619,7 +618,7 @@ def live_progress_view(request):
'completed_extractors': completed_extractors,
'failed_extractors': failed_extractors,
'pending_extractors': pending_extractors,
'all_extractors': all_extractors,
'all_plugins': all_plugins,
})
# Check if crawl can start (for debugging stuck crawls)

View File

@@ -353,10 +353,18 @@ class Crawl(ModelWithOutputDir, ModelWithConfig, ModelWithHealthStats, ModelWith
import signal
from pathlib import Path
from archivebox.hooks import run_hook, discover_hooks
from archivebox.misc.process_utils import validate_pid_file
# Kill any background processes by scanning for all .pid files
if self.OUTPUT_DIR.exists():
for pid_file in self.OUTPUT_DIR.glob('**/*.pid'):
# Validate PID before killing to avoid killing unrelated processes
cmd_file = pid_file.parent / 'cmd.sh'
if not validate_pid_file(pid_file, cmd_file):
# PID reused by different process or process dead
pid_file.unlink(missing_ok=True)
continue
try:
pid = int(pid_file.read_text().strip())
try:

View File

@@ -292,8 +292,13 @@ def run_hook(
stdout_file = output_dir / 'stdout.log'
stderr_file = output_dir / 'stderr.log'
pid_file = output_dir / 'hook.pid'
cmd_file = output_dir / 'cmd.sh'
try:
# Write command script for validation
from archivebox.misc.process_utils import write_cmd_file
write_cmd_file(cmd_file, cmd)
# Open log files for writing
with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err:
process = subprocess.Popen(
@@ -304,8 +309,10 @@ def run_hook(
env=env,
)
# Write PID for all hooks (useful for debugging/cleanup)
pid_file.write_text(str(process.pid))
# Write PID with mtime set to process start time for validation
from archivebox.misc.process_utils import write_pid_file_with_mtime
process_start_time = time.time()
write_pid_file_with_mtime(pid_file, process.pid, process_start_time)
if is_background:
# Background hook - return None immediately, don't wait
@@ -1327,21 +1334,29 @@ def process_is_alive(pid_file: Path) -> bool:
return False
def kill_process(pid_file: Path, sig: int = signal.SIGTERM):
def kill_process(pid_file: Path, sig: int = signal.SIGTERM, validate: bool = True):
"""
Kill process in PID file.
Kill process in PID file with optional validation.
Args:
pid_file: Path to hook.pid file
sig: Signal to send (default SIGTERM)
validate: If True, validate process identity before killing (default: True)
"""
if not pid_file.exists():
return
try:
pid = int(pid_file.read_text().strip())
os.kill(pid, sig)
except (OSError, ValueError):
pass
from archivebox.misc.process_utils import safe_kill_process
if validate:
# Use safe kill with validation
cmd_file = pid_file.parent / 'cmd.sh'
safe_kill_process(pid_file, cmd_file, signal_num=sig, validate=True)
else:
# Legacy behavior - kill without validation
if not pid_file.exists():
return
try:
pid = int(pid_file.read_text().strip())
os.kill(pid, sig)
except (OSError, ValueError):
pass

View File

@@ -178,7 +178,8 @@ def archiveresult_to_jsonl(result) -> Dict[str, Any]:
'type': TYPE_ARCHIVERESULT,
'id': str(result.id),
'snapshot_id': str(result.snapshot_id),
'extractor': result.extractor,
'plugin': result.plugin,
'hook_name': result.hook_name,
'status': result.status,
'output_str': result.output_str,
'start_ts': result.start_ts.isoformat() if result.start_ts else None,

View File

@@ -29,6 +29,98 @@ const http = require('http');
const EXTRACTOR_NAME = 'chrome_launch';
const OUTPUT_DIR = 'chrome';
// Helper: Write PID file with mtime set to process start time
function writePidWithMtime(filePath, pid, startTimeSeconds) {
fs.writeFileSync(filePath, String(pid));
// Set both atime and mtime to process start time for validation
const startTimeMs = startTimeSeconds * 1000;
fs.utimesSync(filePath, new Date(startTimeMs), new Date(startTimeMs));
}
// Helper: Write command script for validation
function writeCmdScript(filePath, binary, args) {
// Shell escape arguments containing whitespace or special characters
const escapedArgs = args.map(arg => {
if (/[\s"$\\]/.test(arg)) {
// Escape backslashes, double quotes, and $ so the quoted form stays literal
return `"${arg.replace(/[\\"$]/g, '\\$&')}"`;
}
return arg;
});
const script = `#!/bin/bash\n${binary} ${escapedArgs.join(' ')}\n`;
fs.writeFileSync(filePath, script);
fs.chmodSync(filePath, 0o755);
}
// Helper: Get process start time (cross-platform)
function getProcessStartTime(pid) {
try {
const { execSync } = require('child_process');
if (process.platform === 'darwin') {
// macOS: ps -p PID -o lstart= gives start time
const output = execSync(`ps -p ${pid} -o lstart=`, { encoding: 'utf8', timeout: 1000 });
return Date.parse(output.trim()) / 1000; // Convert to epoch seconds
} else {
// Linux: read /proc/PID/stat field 22 (starttime in clock ticks)
const stat = fs.readFileSync(`/proc/${pid}/stat`, 'utf8');
// comm (field 2) may contain spaces, so split after the closing paren;
// starttime is field 22 overall = index 19 of the remainder
const fields = stat.substring(stat.lastIndexOf(')') + 2).split(' ');
const startTicks = parseInt(fields[19], 10);
if (!isNaN(startTicks)) {
// Convert clock ticks to seconds (assuming 100 ticks/sec)
const uptimeSeconds = parseFloat(fs.readFileSync('/proc/uptime', 'utf8').split(' ')[0]);
const bootTime = Date.now() / 1000 - uptimeSeconds;
return bootTime + (startTicks / 100);
}
}
} catch (e) {
// Can't get start time
return null;
}
return null;
}
// Helper: Validate PID using mtime and command
function validatePid(pid, pidFile, cmdFile) {
try {
// Check process exists
try {
process.kill(pid, 0); // Signal 0 = check existence
} catch (e) {
return false; // Process doesn't exist
}
// Check mtime matches process start time (within 5 sec tolerance)
const fileStat = fs.statSync(pidFile);
const fileMtime = fileStat.mtimeMs / 1000; // Convert to seconds
const procStartTime = getProcessStartTime(pid);
if (procStartTime === null) {
// Can't validate - fall back to basic existence check
return true;
}
if (Math.abs(fileMtime - procStartTime) > 5) {
// PID was reused by different process
return false;
}
// Validate command if available
if (fs.existsSync(cmdFile)) {
const cmd = fs.readFileSync(cmdFile, 'utf8');
// Check for Chrome/Chromium and debug port
if (!cmd.includes('chrome') && !cmd.includes('chromium')) {
return false;
}
if (!cmd.includes('--remote-debugging-port')) {
return false;
}
}
return true;
} catch (e) {
return false;
}
}
// Global state for cleanup
let chromePid = null;
@@ -240,17 +332,17 @@ function killZombieChrome() {
const pid = parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
if (isNaN(pid) || pid <= 0) continue;
// Check if process exists
try {
process.kill(pid, 0);
} catch (e) {
// Process dead, remove stale PID file
// Validate PID before killing
const cmdFile = path.join(chromeDir, 'cmd.sh');
if (!validatePid(pid, pidFile, cmdFile)) {
// PID reused or validation failed
console.error(`[!] PID ${pid} failed validation (reused or wrong process) - cleaning up`);
try { fs.unlinkSync(pidFile); } catch (e) {}
continue;
}
// Process alive but crawl is stale - zombie!
console.error(`[!] Found zombie (PID ${pid}) from stale crawl ${crawl.name}`);
// Process alive, validated, and crawl is stale - zombie!
console.error(`[!] Found validated zombie (PID ${pid}) from stale crawl ${crawl.name}`);
try {
// Kill process group first
@@ -386,15 +478,20 @@ async function launchChrome(binary) {
chromeProcess.unref(); // Don't keep Node.js process running
chromePid = chromeProcess.pid;
const chromeStartTime = Date.now() / 1000; // Unix epoch seconds
console.error(`[*] Launched Chrome (PID: ${chromePid}), waiting for debug port...`);
// Write Chrome PID for backup cleanup (named .pid so Crawl.cleanup() finds it)
fs.writeFileSync(path.join(OUTPUT_DIR, 'chrome.pid'), String(chromePid));
// Write Chrome PID with mtime set to start time for validation
writePidWithMtime(path.join(OUTPUT_DIR, 'chrome.pid'), chromePid, chromeStartTime);
// Write command script for validation
writeCmdScript(path.join(OUTPUT_DIR, 'cmd.sh'), binary, chromeArgs);
fs.writeFileSync(path.join(OUTPUT_DIR, 'port.txt'), String(debugPort));
// Write hook's own PID so Crawl.cleanup() can kill this hook process
// (which will trigger our SIGTERM handler to kill Chrome)
fs.writeFileSync(path.join(OUTPUT_DIR, 'hook.pid'), String(process.pid));
// Write hook's own PID with mtime for validation
const hookStartTime = Date.now() / 1000;
writePidWithMtime(path.join(OUTPUT_DIR, 'hook.pid'), process.pid, hookStartTime);
try {
// Wait for Chrome to be ready

View File

@@ -488,7 +488,7 @@
<span class="progress-fill"></span>
<span class="badge-content">
<span class="badge-icon">${icon}</span>
<span>${extractor.extractor}</span>
<span>${extractor.plugin}</span>
</span>
</span>
`;
@@ -499,10 +499,10 @@
const adminUrl = `/admin/core/snapshot/${snapshot.id}/change/`;
let extractorHtml = '';
if (snapshot.all_extractors && snapshot.all_extractors.length > 0) {
// Sort extractors alphabetically by name to prevent reordering on updates
const sortedExtractors = [...snapshot.all_extractors].sort((a, b) =>
a.extractor.localeCompare(b.extractor)
if (snapshot.all_plugins && snapshot.all_plugins.length > 0) {
// Sort plugins alphabetically by name to prevent reordering on updates
const sortedExtractors = [...snapshot.all_plugins].sort((a, b) =>
a.plugin.localeCompare(b.plugin)
);
extractorHtml = `
<div class="extractor-list">

View File

@@ -158,7 +158,7 @@ CREATE TABLE IF NOT EXISTS core_snapshot_tags (
CREATE TABLE IF NOT EXISTS core_archiveresult (
id INTEGER PRIMARY KEY AUTOINCREMENT,
snapshot_id CHAR(32) NOT NULL REFERENCES core_snapshot(id),
extractor VARCHAR(32) NOT NULL,
plugin VARCHAR(32) NOT NULL,
cmd TEXT,
pwd VARCHAR(256),
cmd_version VARCHAR(128),
@@ -379,7 +379,7 @@ CREATE TABLE IF NOT EXISTS crawls_seed (
created_by_id INTEGER NOT NULL REFERENCES auth_user(id),
modified_at DATETIME,
uri VARCHAR(2048) NOT NULL,
extractor VARCHAR(32) NOT NULL DEFAULT 'auto',
plugin VARCHAR(32) NOT NULL DEFAULT 'auto',
tags_str VARCHAR(255) NOT NULL DEFAULT '',
label VARCHAR(255) NOT NULL DEFAULT '',
config TEXT DEFAULT '{}',
@@ -465,7 +465,7 @@ CREATE TABLE IF NOT EXISTS core_archiveresult (
created_at DATETIME NOT NULL,
modified_at DATETIME,
snapshot_id CHAR(36) NOT NULL REFERENCES core_snapshot(id),
extractor VARCHAR(32) NOT NULL,
plugin VARCHAR(32) NOT NULL,
pwd VARCHAR(256),
cmd TEXT,
cmd_version VARCHAR(128),

View File

@@ -227,10 +227,10 @@ class Worker:
urls = obj.get_urls_list()
url = urls[0] if urls else None
extractor = None
if hasattr(obj, 'extractor'):
plugin = None
if hasattr(obj, 'plugin'):
# ArchiveResultWorker, Crawl
extractor = obj.extractor
plugin = obj.plugin
log_worker_event(
worker_type=worker_type_name,
@@ -239,7 +239,7 @@ class Worker:
pid=self.pid,
worker_id=str(self.worker_id),
url=url,
extractor=extractor,
plugin=plugin,
metadata=start_metadata if start_metadata else None,
)
@@ -262,7 +262,7 @@ class Worker:
pid=self.pid,
worker_id=str(self.worker_id),
url=url,
extractor=extractor,
plugin=plugin,
metadata=complete_metadata,
)
else:
@@ -345,9 +345,9 @@ class ArchiveResultWorker(Worker):
name: ClassVar[str] = 'archiveresult'
MAX_TICK_TIME: ClassVar[int] = 120
def __init__(self, extractor: str | None = None, **kwargs: Any):
def __init__(self, plugin: str | None = None, **kwargs: Any):
super().__init__(**kwargs)
self.extractor = extractor
self.plugin = plugin
def get_model(self):
from core.models import ArchiveResult
@@ -359,16 +359,16 @@ class ArchiveResultWorker(Worker):
qs = super().get_queue()
if self.extractor:
qs = qs.filter(extractor=self.extractor)
if self.plugin:
qs = qs.filter(plugin=self.plugin)
# Note: Removed blocking logic since plugins have separate output directories
# and don't interfere with each other. Each plugin (extractor) runs independently.
# and don't interfere with each other. Each plugin runs independently.
return qs
def process_item(self, obj) -> bool:
"""Process an ArchiveResult by running its extractor."""
"""Process an ArchiveResult by running its plugin."""
try:
obj.sm.tick()
return True
@@ -378,8 +378,8 @@ class ArchiveResultWorker(Worker):
return False
@classmethod
def start(cls, worker_id: int | None = None, daemon: bool = False, extractor: str | None = None, **kwargs: Any) -> int:
"""Fork a new worker as subprocess with optional extractor filter."""
def start(cls, worker_id: int | None = None, daemon: bool = False, plugin: str | None = None, **kwargs: Any) -> int:
"""Fork a new worker as subprocess with optional plugin filter."""
if worker_id is None:
worker_id = get_next_worker_id(cls.name)
@@ -387,7 +387,7 @@ class ArchiveResultWorker(Worker):
proc = Process(
target=_run_worker,
args=(cls.name, worker_id, daemon),
kwargs={'extractor': extractor, **kwargs},
kwargs={'plugin': plugin, **kwargs},
name=f'{cls.name}_worker_{worker_id}',
)
proc.start()

View File

@@ -74,17 +74,17 @@ def test_background_hooks_dont_block_parser_extractors(tmp_path, process):
# Check that background hooks are running
# Background hooks: consolelog, ssl, responses, redirects, staticfile
bg_hooks = c.execute(
"SELECT extractor, status FROM core_archiveresult WHERE extractor IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') ORDER BY extractor"
"SELECT plugin, status FROM core_archiveresult WHERE plugin IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') ORDER BY plugin"
).fetchall()
# Check that parser extractors have run (not stuck in queued)
parser_extractors = c.execute(
"SELECT extractor, status FROM core_archiveresult WHERE extractor LIKE 'parse_%_urls' ORDER BY extractor"
"SELECT plugin, status FROM core_archiveresult WHERE plugin LIKE 'parse_%_urls' ORDER BY plugin"
).fetchall()
# Check all extractors to see what's happening
all_extractors = c.execute(
"SELECT extractor, status FROM core_archiveresult ORDER BY extractor"
"SELECT plugin, status FROM core_archiveresult ORDER BY plugin"
).fetchall()
conn.close()
@@ -160,7 +160,7 @@ def test_parser_extractors_emit_snapshot_jsonl(tmp_path, process):
# Check that parse_html_urls ran
parse_html = c.execute(
"SELECT id, status, output_str FROM core_archiveresult WHERE extractor = '60_parse_html_urls'"
"SELECT id, status, output_str FROM core_archiveresult WHERE plugin = '60_parse_html_urls'"
).fetchone()
conn.close()
@@ -171,7 +171,7 @@ def test_parser_extractors_emit_snapshot_jsonl(tmp_path, process):
# Parser should have run
assert status in ['started', 'succeeded', 'failed'], \
f"parse_html_urls should have run, got status: {status}"
f"60_parse_html_urls should have run, got status: {status}"
# If it succeeded and found links, output should contain JSON
if status == 'succeeded' and output:
@@ -185,39 +185,37 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process):
"""Test that recursive crawling creates child snapshots with proper depth and parent_snapshot_id."""
os.chdir(tmp_path)
# Disable most extractors to speed up test, but keep wget for HTML content
# Create a test HTML file with links
test_html = tmp_path / 'test.html'
test_html.write_text('''
<html>
<body>
<h1>Test Page</h1>
<a href="https://monadical.com/about">About</a>
<a href="https://monadical.com/blog">Blog</a>
<a href="https://monadical.com/contact">Contact</a>
</body>
</html>
''')
# Minimal env for fast testing
env = os.environ.copy()
env.update({
"USE_WGET": "true", # Need wget to fetch HTML for parsers
"USE_SINGLEFILE": "false",
"USE_READABILITY": "false",
"USE_MERCURY": "false",
"SAVE_HTMLTOTEXT": "false",
"SAVE_PDF": "false",
"SAVE_SCREENSHOT": "false",
"SAVE_DOM": "false",
"SAVE_HEADERS": "false",
"USE_GIT": "false",
"SAVE_MEDIA": "false",
"SAVE_ARCHIVE_DOT_ORG": "false",
"SAVE_TITLE": "false",
"SAVE_FAVICON": "false",
"USE_CHROME": "false",
"URL_ALLOWLIST": r"monadical\.com/.*", # Only crawl same domain
})
# Start a crawl with depth=1 (just one hop to test recursive crawling)
# Use file:// URL so it's instant, no network fetch needed
proc = subprocess.Popen(
['archivebox', 'add', '--depth=1', 'https://monadical.com'],
['archivebox', 'add', '--depth=1', f'file://{test_html}'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env,
)
# Give orchestrator time to process - parser extractors should emit child snapshots within 60s
# Even if root snapshot is still processing, child snapshots can start in parallel
time.sleep(60)
# Give orchestrator time to process - file:// is fast, should complete in 20s
time.sleep(20)
# Kill the process
proc.kill()
@@ -231,8 +229,7 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process):
# Check root snapshot (depth=0)
root_snapshot = c.execute(
"SELECT id, url, depth, parent_snapshot_id FROM core_snapshot WHERE url = ? AND depth = 0",
('https://monadical.com',)
"SELECT id, url, depth, parent_snapshot_id FROM core_snapshot WHERE depth = 0 ORDER BY created_at LIMIT 1"
).fetchone()
# Check if any child snapshots were created (depth=1)
@@ -247,13 +244,13 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process):
# Check parser extractor status
parser_status = c.execute(
"SELECT extractor, status FROM core_archiveresult WHERE snapshot_id = ? AND extractor LIKE 'parse_%_urls'",
"SELECT plugin, status FROM core_archiveresult WHERE snapshot_id = ? AND plugin LIKE 'parse_%_urls'",
(root_snapshot[0] if root_snapshot else '',)
).fetchall()
# Check for started extractors that might be blocking
started_extractors = c.execute(
"SELECT extractor, status FROM core_archiveresult WHERE snapshot_id = ? AND status = 'started'",
"SELECT plugin, status FROM core_archiveresult WHERE snapshot_id = ? AND status = 'started'",
(root_snapshot[0] if root_snapshot else '',)
).fetchall()
@@ -417,12 +414,12 @@ def test_archiveresult_worker_queue_filters_by_foreground_extractors(tmp_path, p
# Get background hooks that are started
bg_started = c.execute(
"SELECT extractor FROM core_archiveresult WHERE extractor IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') AND status = 'started'"
"SELECT plugin FROM core_archiveresult WHERE plugin IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') AND status = 'started'"
).fetchall()
# Get parser extractors that should be queued or better
parser_status = c.execute(
"SELECT extractor, status FROM core_archiveresult WHERE extractor LIKE 'parse_%_urls'"
"SELECT plugin, status FROM core_archiveresult WHERE plugin LIKE 'parse_%_urls'"
).fetchall()
conn.close()