rename extractor to plugin everywhere

TODO_hook_concurrency.md (new file, 486 lines)

@@ -0,0 +1,486 @@
# ArchiveBox Hook Script Concurrency & Execution Plan

## Overview

Snapshot.run() should enforce that snapshot hooks are run in **10 discrete, sequential "steps"**: `0*`, `1*`, `2*`, `3*`, `4*`, `5*`, `6*`, `7*`, `8*`, `9*`.

For every discovered hook script, ArchiveBox should create an ArchiveResult in `queued` state, then manage running them using `retry_at` and inline logic to enforce this ordering.

## Hook Numbering Convention

Hook scripts are numbered `00` to `99` to control:
- **First digit (0-9)**: Which step they are part of
- **Second digit (0-9)**: Order within that step

Hook scripts are launched **strictly sequentially** in alphabetical filename order, and run in sets of several per step before moving on to the next step.

**Naming Format:**
```
on_{ModelName}__{run_order}_{human_readable_description}[.bg].{ext}
```

**Examples:**
```
on_Snapshot__00_this_would_run_first.sh
on_Snapshot__05_start_ytdlp_download.bg.sh
on_Snapshot__10_chrome_tab_opened.js
on_Snapshot__50_screenshot.js
on_Snapshot__53_media.bg.py
```
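
A minimal sketch of how `Snapshot.run()` could parse these filenames (the regex and the `HookInfo`/`parse_hook_filename` names are illustrative, not the final API; unnumbered hooks are handled separately, see Migration Path below):

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper: parse "on_Snapshot__53_media.bg.py" into its parts.
HOOK_RE = re.compile(
    r'^on_(?P<model>\w+)__(?P<order>\d{2})_(?P<desc>[a-z0-9_]+?)(?P<bg>\.bg)?\.(?P<ext>\w+)$'
)

class HookInfo(NamedTuple):
    model: str
    step: int           # first digit: which step (0-9)
    order: int          # full 00-99 ordinal, used for sorting within the run
    description: str
    is_background: bool
    ext: str

def parse_hook_filename(filename: str) -> Optional[HookInfo]:
    match = HOOK_RE.match(filename)
    if not match:
        return None  # unnumbered/legacy hook
    order = int(match['order'])
    return HookInfo(
        model=match['model'],
        step=order // 10,
        order=order,
        description=match['desc'],
        is_background=bool(match['bg']),
        ext=match['ext'],
    )

assert parse_hook_filename('on_Snapshot__53_media.bg.py') == HookInfo(
    'Snapshot', 5, 53, 'media', True, 'py',
)
```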

## Background (.bg) vs Foreground Scripts

### Foreground Scripts (no .bg suffix)
- Run sequentially within their step
- Block step progression until they exit
- Should exit naturally when work is complete
- Get killed with SIGTERM if they exceed their `PLUGINNAME_TIMEOUT`

### Background Scripts (.bg suffix)
- Spawned and allowed to continue running
- Do NOT block step progression
- Run until **their own `PLUGINNAME_TIMEOUT` is reached** (not until step 99)
- Get a polite SIGTERM when their timeout expires, then SIGKILL 60s later if they haven't exited
- Must implement their own concurrency control using the filesystem (semaphore files, locks, etc.; see the sketch below)
- Should exit naturally when work is complete (best case)

**Important:** If a .bg script starts at hook `05` (step 0) with `MEDIA_TIMEOUT=3600`, it gets the full 3600s regardless of when step 99 completes. It runs on its own timeline.
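
For example, a .bg hook that must ensure only one downloader runs per snapshot could take an advisory lock on a file in the shared parent directory (a POSIX-only sketch; the lock filename is arbitrary):

```python
import fcntl
import sys
from pathlib import Path

# Advisory lock in $CWD/.. so sibling hooks of the same snapshot can coordinate.
lock_path = Path('..') / 'media_download.lock'

with open(lock_path, 'w') as lock_file:
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another hook holds the lock; exit politely and let retry_at reschedule us.
        print('another downloader is already running, exiting', file=sys.stderr)
        sys.exit(1)
    # ... do the long-running download while holding the lock ...
```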

## Execution Step Guidelines

These are **naming conventions and guidelines**, not enforced checkpoints. They provide semantic organization for plugin ordering:

### Step 0: Pre-Setup
```
00-09: Initial setup, validation, feature detection
```

### Step 1: Chrome Launch & Tab Creation
```
10-19: Browser/tab lifecycle setup
- Chrome browser launch
- Tab creation and CDP connection
```

### Step 2: Navigation & Settlement
```
20-29: Page loading and settling
- Navigate to URL
- Wait for page load
- Initial response capture (responses, ssl, consolelog as .bg listeners)
```

### Step 3: Page Adjustment
```
30-39: DOM manipulation before archiving
- Hide popups/banners
- Solve captchas
- Expand comments/details sections
- Inject custom CSS/JS
- Accessibility modifications
```

### Step 4: Ready for Archiving
```
40-49: Final pre-archiving checks
- Verify page is fully adjusted
- Wait for any pending modifications
```

### Step 5: DOM Extraction (Sequential, Non-BG)
```
50-59: Extractors that need exclusive DOM access
- singlefile (MUST NOT be .bg)
- screenshot (MUST NOT be .bg)
- pdf (MUST NOT be .bg)
- dom (MUST NOT be .bg)
- title
- headers
- readability
- mercury

These MUST run sequentially as they temporarily modify the DOM
during extraction, then revert it. Running in parallel would corrupt results.
```

### Step 6: Post-DOM Extraction
```
60-69: Extractors that don't need DOM or run on downloaded files
- wget
- git
- media (.bg - can run for hours)
- gallerydl (.bg)
- forumdl (.bg)
- papersdl (.bg)
```

### Step 7: Chrome Cleanup
```
70-79: Browser/tab teardown
- Close tabs
- Cleanup Chrome resources
```

### Step 8: Post-Processing
```
80-89: Reprocess outputs from earlier extractors
- OCR of images
- Audio/video transcription
- URL parsing from downloaded content (rss, html, json, txt, csv, md)
- LLM analysis/summarization of outputs
```

### Step 9: Indexing & Finalization
```
90-99: Save to indexes and finalize
- Index text content to Sonic/SQLite FTS
- Create symlinks
- Generate merkle trees
- Final status updates
```

## Hook Script Interface

### Input: CLI Arguments (NOT stdin)
Hooks receive configuration as CLI flags (CSV or JSON-encoded):

```bash
--url="https://example.com"
--snapshot-id="1234-5678-uuid"
--config='{"some_key": "some_value"}'
--plugins=git,media,favicon,title
--timeout=50
--enable-something
```

### Input: Environment Variables
All configuration comes from env vars, defined in the `plugin_dir/config.json` JSONSchema:

```bash
WGET_BINARY=/usr/bin/wget
WGET_TIMEOUT=60
WGET_USER_AGENT="Mozilla/5.0..."
WGET_EXTRA_ARGS="--no-check-certificate"
SAVE_WGET=True
```
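
A hypothetical `config.json` for the wget plugin might declare those options like this (a sketch using standard JSONSchema keywords; the exact schema layout is an assumption, not the final format):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "wget plugin config",
  "type": "object",
  "properties": {
    "WGET_BINARY":     {"type": "string",  "default": "wget"},
    "WGET_TIMEOUT":    {"type": "integer", "default": 60},
    "WGET_USER_AGENT": {"type": "string",  "default": "Mozilla/5.0..."},
    "WGET_EXTRA_ARGS": {"type": "string",  "default": ""},
    "SAVE_WGET":       {"type": "boolean", "default": true}
  }
}
```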

### Output: Filesystem (CWD)
Hooks read/write files to:
- `$CWD`: Their own output subdirectory (e.g., `archive/snapshots/{id}/wget/`)
- `$CWD/..`: Parent directory (to read outputs from other hooks)

This allows hooks to:
- Access files created by other hooks
- Keep their outputs separate by default
- Use semaphore files for coordination (if needed)

### Output: JSONL to stdout
Hooks emit one JSONL line per database record they want to create or update:

```jsonl
{"type": "Tag", "name": "sci-fi"}
{"type": "ArchiveResult", "id": "1234-uuid", "status": "succeeded", "output_str": "wget/index.html"}
{"type": "Snapshot", "id": "5678-uuid", "title": "Example Page"}
```

See `archivebox/misc/jsonl.py` and the model `from_json()` / `from_jsonl()` methods for the full list of supported types and fields.

### Output: stderr for Human Logs
Hooks should emit human-readable output or debug info to **stderr**. There are no guarantees this will be persisted long-term. Use stdout JSONL or the filesystem for outputs that matter.

### Cleanup: Delete Cruft
If hooks emit no meaningful long-term outputs, they should delete any temporary files themselves to avoid wasting space. However, the ArchiveResult DB row should be kept so we know:
- It doesn't need to be retried
- It isn't missing
- What happened (status, error message)

### Signal Handling: SIGINT/SIGTERM
Hooks are expected to listen for a polite `SIGINT`/`SIGTERM`, wrap up quickly, and exit cleanly. Beyond that, they may be SIGKILLed at ArchiveBox's discretion.

**If hooks double-fork or spawn long-running processes:** They must output a `.pid` file in their directory so zombies can be swept safely.
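
A minimal sketch of a well-behaved hook's signal handling (the `finalize` helper is hypothetical):

```python
import os
import signal
import sys
from pathlib import Path

# Record our own PID so ArchiveBox can sweep us if we become a zombie.
Path('hook.pid').write_text(str(os.getpid()))

def finalize():
    """Hypothetical: flush any pending JSONL records / partial output files."""
    pass

def handle_term(signum, frame):
    # Wrap up quickly and exit cleanly before the follow-up SIGKILL arrives.
    finalize()
    sys.exit(0)

signal.signal(signal.SIGINT, handle_term)
signal.signal(signal.SIGTERM, handle_term)
```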

## Hook Failure Modes & Retry Logic

Hooks can fail in several ways. ArchiveBox handles each differently:

### 1. Soft Failure (Record & Don't Retry)
**Exit:** `0` (success)
**JSONL:** `{"type": "ArchiveResult", "status": "failed", "output_str": "404 Not Found"}`

This means: "I ran successfully, but the resource wasn't available." Don't retry this.

**Use cases:**
- 404 errors
- Content not available
- Feature not applicable to this URL

### 2. Hard Failure / Temporary Error (Retry Later)
**Exit:** Non-zero (1, 2, etc.)
**JSONL:** None (or incomplete)

This means: "Something went wrong, I couldn't complete." Treat this ArchiveResult as "missing" and set `retry_at` for later.

**Use cases:**
- 500 server errors
- Network timeouts
- Binary not found / crashed
- Transient errors

**Behavior:**
- ArchiveBox sets `retry_at` on the ArchiveResult
- Hook will be retried during the next `archivebox update`

### 3. Partial Success (Update & Continue)
**Exit:** Non-zero
**JSONL:** Partial records emitted before crash

**Behavior:**
- Update ArchiveResult with whatever was emitted
- Mark remaining work as "missing" with `retry_at`

### 4. Success (Record & Continue)
**Exit:** `0`
**JSONL:** `{"type": "ArchiveResult", "status": "succeeded", "output_str": "output/file.html"}`

This is the happy path.

### Error Handling Rules

- **DO NOT skip hooks** based on failures
- **Continue to next hook** regardless of foreground or background failures
- **Update ArchiveResults** with whatever information is available
- **Set retry_at** for "missing" or temporarily-failed hooks (decision logic sketched below)
- **Let background scripts continue** even if foreground scripts fail
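
A sketch of the decision logic, assuming the exit code and the last ArchiveResult JSONL record are both available once the hook exits (names and the backoff interval are illustrative):

```python
from datetime import datetime, timedelta

def resolve_archiveresult(exit_code: int, final_record: dict | None) -> dict:
    """Map (exit code, last emitted ArchiveResult JSONL) -> DB field updates."""
    now = datetime.now()
    if final_record and final_record.get('status') in ('succeeded', 'failed', 'skipped'):
        # Cases 1 & 4: the hook reported a terminal status itself -> never retry.
        return {**final_record, 'retry_at': None}
    if exit_code != 0:
        # Cases 2 & 3: crash/timeout -> keep whatever was emitted, retry later.
        updates = dict(final_record or {})
        updates['status'] = 'queued'
        updates['retry_at'] = now + timedelta(hours=1)  # backoff policy TBD
        return updates
    # Exit 0 with no terminal record: treat as succeeded with no output.
    return {'status': 'succeeded', 'retry_at': None}
```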

## File Structure

```
archivebox/plugins/{plugin_name}/
├── config.json                  # JSONSchema: env var config options
├── binaries.jsonl               # Runtime dependencies: apt|brew|pip|npm|env
├── on_Snapshot__XX_name.py      # Hook script (foreground)
├── on_Snapshot__XX_name.bg.py   # Hook script (background)
└── tests/
    └── test_name.py
```

## Implementation Checklist

### Phase 1: Renumber Existing Hooks ✅
- [ ] Renumber DOM extractors to 50-59 range
- [ ] Ensure pdf/screenshot are NOT .bg (need sequential access)
- [ ] Ensure media (ytdlp) IS .bg (can run for hours)
- [ ] Add step comments to each plugin for clarity

### Phase 2: Timeout Consistency ✅
- [x] All plugins support `PLUGINNAME_TIMEOUT` env var
- [x] All plugins fall back to generic `TIMEOUT` env var
- [x] Background scripts handle SIGTERM gracefully (or exit naturally)

### Phase 3: Refactor Snapshot.run() (sketched below)
- [ ] Parse hook filenames to extract step number (first digit)
- [ ] Group hooks by step (0-9)
- [ ] Run each step sequentially
- [ ] Within each step:
  - [ ] Launch foreground hooks sequentially
  - [ ] Launch .bg hooks and track PIDs
  - [ ] Wait for foreground hooks to complete before next step
- [ ] Track .bg script timeouts independently
- [ ] Send SIGTERM to .bg scripts when their timeout expires
- [ ] Send SIGKILL 60s after SIGTERM if not exited
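
A rough sketch of the refactored loop, reusing `parse_hook_filename` from above (`spawn_background`, `run_foreground`, and the `hook.timeout` attribute are hypothetical helpers, not existing APIs):

```python
import signal
import subprocess
import time

def run_snapshot_hooks(snapshot, hooks):
    """Run hooks in 10 sequential steps; .bg hooks don't block step progression."""
    background = []  # (Popen, deadline) pairs for .bg hooks still running

    for step in range(10):
        step_hooks = sorted((h for h in hooks if h.step == step), key=lambda h: h.order)
        for hook in step_hooks:
            if hook.is_background:
                proc = spawn_background(snapshot, hook)   # hypothetical helper
                background.append((proc, time.monotonic() + hook.timeout))
            else:
                run_foreground(snapshot, hook)            # hypothetical helper, blocks

    # .bg hooks keep running on their own timelines, even past step 9:
    # SIGTERM at their own deadline, SIGKILL 60s later if still alive.
    for proc, deadline in background:
        try:
            proc.wait(timeout=max(0, deadline - time.monotonic()))
        except subprocess.TimeoutExpired:
            proc.send_signal(signal.SIGTERM)
            try:
                proc.wait(timeout=60)
            except subprocess.TimeoutExpired:
                proc.kill()
```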

### Phase 4: ArchiveResult Management
- [ ] Create one ArchiveResult per hook (not per plugin)
- [ ] Set initial state to `queued`
- [ ] Update state based on JSONL output and exit code
- [ ] Set `retry_at` for hooks that exit non-zero with no JSONL
- [ ] Don't retry hooks that emit `{"status": "failed"}`

### Phase 5: JSONL Streaming (sketched below)
- [ ] Parse stdout JSONL line-by-line during hook execution
- [ ] Create/update DB rows as JSONL is emitted (streaming mode)
- [ ] Handle partial JSONL on hook crash
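
A sketch of the streaming parser, assuming a `dispatch_record` helper that routes each dict to the right model's `from_jsonl()` (helper name is illustrative):

```python
import json

def stream_jsonl(process):
    """Apply each JSONL record to the DB as the hook emits it, tolerating a
    truncated final line if the hook crashes mid-write (Partial Success case)."""
    for line in process.stdout:          # line-buffered text stream from the hook
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # partial/garbled line from a crash - keep what's already applied
        dispatch_record(record)          # hypothetical: Tag/Snapshot/ArchiveResult.from_jsonl()
```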

### Phase 6: Zombie Process Management
- [ ] Read `.pid` files from hook output directories
- [ ] Sweep zombies on cleanup
- [ ] Handle double-forked processes correctly

## Migration Path

### Backward Compatibility
During migration, support both old and new numbering:
1. Run hooks numbered 00-99 in step order
2. Run unnumbered hooks last (step 9) for compatibility
3. Log warnings for unnumbered hooks
4. Eventually require all hooks to be numbered

### Renumbering Map

**Current → New:**
```
git/on_Snapshot__12_git.py                 → git/on_Snapshot__62_git.py
media/on_Snapshot__51_media.py             → media/on_Snapshot__63_media.bg.py
gallerydl/on_Snapshot__52_gallerydl.py     → gallerydl/on_Snapshot__64_gallerydl.bg.py
forumdl/on_Snapshot__53_forumdl.py         → forumdl/on_Snapshot__65_forumdl.bg.py
papersdl/on_Snapshot__54_papersdl.py       → papersdl/on_Snapshot__66_papersdl.bg.py

readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py
mercury/on_Snapshot__53_mercury.py         → mercury/on_Snapshot__56_mercury.py

singlefile/on_Snapshot__37_singlefile.py   → singlefile/on_Snapshot__50_singlefile.py
screenshot/on_Snapshot__34_screenshot.js   → screenshot/on_Snapshot__51_screenshot.js
pdf/on_Snapshot__35_pdf.js                 → pdf/on_Snapshot__52_pdf.js
dom/on_Snapshot__36_dom.js                 → dom/on_Snapshot__53_dom.js
title/on_Snapshot__32_title.js             → title/on_Snapshot__54_title.js
headers/on_Snapshot__33_headers.js         → headers/on_Snapshot__55_headers.js

wget/on_Snapshot__50_wget.py               → wget/on_Snapshot__61_wget.py
```

## Testing Strategy

### Unit Tests
- Test hook ordering (00-99)
- Test step grouping (first digit)
- Test .bg vs foreground execution
- Test timeout enforcement
- Test JSONL parsing
- Test failure modes & retry_at logic

### Integration Tests
- Test full Snapshot.run() with mixed hooks
- Test .bg scripts running beyond step 99
- Test zombie process cleanup
- Test graceful SIGTERM handling
- Test concurrent .bg script coordination

### Performance Tests
- Measure overhead of per-hook ArchiveResults
- Test with 50+ concurrent .bg scripts
- Test filesystem contention with many hooks

## Open Questions

### Q: Should we provide semaphore utilities?
**A:** No. Keep plugins decoupled. Let them use simple filesystem coordination if needed.

### Q: What happens if the ArchiveResult table gets huge?
**A:** We can delete old successful ArchiveResults periodically, or archive them to cold storage. The important data is in the filesystem outputs.

### Q: Should naturally-exiting .bg scripts still be .bg?
**A:** Yes. The .bg suffix means "don't block step progression," not "run until step 99." Natural exit is the best case.

## Examples

### Foreground Hook (Sequential DOM Access)
```python
#!/usr/bin/env python3
# Illustrative Python sketch of a step-5 foreground hook
# (the real screenshot hook is JS: archivebox/plugins/screenshot/on_Snapshot__51_screenshot.js)
import argparse
import json
import os
import subprocess
import sys

# Runs at step 5, blocks step progression until complete
# Gets killed if it exceeds SCREENSHOT_TIMEOUT

def get_env_int(key, default=None):
    """Read an integer env var, falling back to default if unset/invalid."""
    try:
        return int(os.environ[key])
    except (KeyError, ValueError):
        return default

parser = argparse.ArgumentParser()
parser.add_argument('--url', default='')
args, _ = parser.parse_known_args()

timeout = get_env_int('SCREENSHOT_TIMEOUT') or get_env_int('TIMEOUT', 60)
cmd = ['screenshot-helper', args.url]  # placeholder command

try:
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "screenshot.png",
        }))
        sys.exit(0)
    else:
        # Temporary failure - will be retried
        sys.exit(1)
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```

### Background Hook (Long-Running Download)
```python
#!/usr/bin/env python3
# archivebox/plugins/media/on_Snapshot__63_media.bg.py
import argparse
import json
import os
import subprocess
import sys

# Runs at step 6, doesn't block step progression
# Gets the full MEDIA_TIMEOUT (e.g., 3600s) regardless of when step 99 completes

def get_env_int(key, default=None):
    try:
        return int(os.environ[key])
    except (KeyError, ValueError):
        return default

parser = argparse.ArgumentParser()
parser.add_argument('--url', default='')
args, _ = parser.parse_known_args()

timeout = get_env_int('YTDLP_TIMEOUT') or get_env_int('MEDIA_TIMEOUT') or get_env_int('TIMEOUT', 3600)

try:
    result = subprocess.run(['yt-dlp', args.url], capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "media/",
        }))
        sys.exit(0)
    else:
        # Soft failure - record it, don't retry
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "failed",
            "output_str": "Video unavailable",
        }))
        sys.exit(0)  # Exit 0 to record the failure
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```

### Background Hook with Natural Exit
```javascript
#!/usr/bin/env node
// archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js
const fs = require('fs');
// connectToChrome() and waitForNavigation() are assumed helpers, elided here

// Sets up a listener, captures SSL info, then exits naturally
// No SIGTERM handler needed - already exits when done

async function main() {
    const page = await connectToChrome();

    // Set up listener
    page.on('response', async (response) => {
        const securityDetails = response.securityDetails();
        if (securityDetails) {
            fs.writeFileSync('ssl.json', JSON.stringify(securityDetails));
        }
    });

    // Wait for navigation (done by other hook)
    await waitForNavigation();

    // Emit result
    console.log(JSON.stringify({
        type: 'ArchiveResult',
        status: 'succeeded',
        output_str: 'ssl.json',
    }));

    process.exit(0); // Natural exit - don't hang around waiting indefinitely
}

main().catch(e => {
    console.error(`ERROR: ${e.message}`);
    process.exit(1); // Will be retried
});
```

## Summary

This plan provides:
- ✅ Clear execution ordering (10 steps, 00-99 numbering)
- ✅ Async support (.bg suffix)
- ✅ Independent timeout control per plugin
- ✅ Flexible failure handling & retry logic
- ✅ Streaming JSONL output for DB updates
- ✅ Simple filesystem-based coordination
- ✅ Backward compatibility during migration

The main implementation work is refactoring `Snapshot.run()` to enforce step ordering and manage .bg script lifecycles. Plugin renumbering is straightforward mechanical work.

@@ -54,7 +54,7 @@ class AddCommandSchema(Schema):
     tag: str = ""
     depth: int = 0
     parser: str = "auto"
-    extract: str = ""
+    plugins: str = ""
     update: bool = not ARCHIVING_CONFIG.ONLY_NEW  # Default to the opposite of ARCHIVING_CONFIG.ONLY_NEW
     overwrite: bool = False
     index_only: bool = False

@@ -69,7 +69,7 @@ class UpdateCommandSchema(Schema):
     status: Optional[StatusChoices] = StatusChoices.unarchived
     filter_type: Optional[str] = FilterTypeChoices.substring
     filter_patterns: Optional[List[str]] = ['https://example.com']
-    extractors: Optional[str] = ""
+    plugins: Optional[str] = ""

 class ScheduleCommandSchema(Schema):
     import_path: Optional[str] = None

@@ -115,7 +115,7 @@ def cli_add(request, args: AddCommandSchema):
         update=args.update,
         index_only=args.index_only,
         overwrite=args.overwrite,
-        plugins=args.extract,  # extract in API maps to plugins param
+        plugins=args.plugins,
         parser=args.parser,
         bg=True,  # Always run in background for API calls
     )

@@ -143,7 +143,7 @@ def cli_update(request, args: UpdateCommandSchema):
         status=args.status,
         filter_type=args.filter_type,
         filter_patterns=args.filter_patterns,
-        extractors=args.extractors,
+        plugins=args.plugins,
     )
     return {
         "success": True,

@@ -65,7 +65,8 @@ class MinimalArchiveResultSchema(Schema):
     created_by_username: str
     status: str
     retry_at: datetime | None
-    extractor: str
+    plugin: str
+    hook_name: str
     cmd_version: str | None
     cmd: list[str] | None
     pwd: str | None

@@ -113,13 +114,14 @@ class ArchiveResultSchema(MinimalArchiveResultSchema):

 class ArchiveResultFilterSchema(FilterSchema):
     id: Optional[str] = Field(None, q=['id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
-    search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'extractor', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
+    search: Optional[str] = Field(None, q=['snapshot__url__icontains', 'snapshot__title__icontains', 'snapshot__tags__name__icontains', 'plugin', 'output_str__icontains', 'id__startswith', 'snapshot__id__startswith', 'snapshot__timestamp__startswith'])
     snapshot_id: Optional[str] = Field(None, q=['snapshot__id__startswith', 'snapshot__timestamp__startswith'])
     snapshot_url: Optional[str] = Field(None, q='snapshot__url__icontains')
     snapshot_tag: Optional[str] = Field(None, q='snapshot__tags__name__icontains')
     status: Optional[str] = Field(None, q='status')
     output_str: Optional[str] = Field(None, q='output_str__icontains')
-    extractor: Optional[str] = Field(None, q='extractor__icontains')
+    plugin: Optional[str] = Field(None, q='plugin__icontains')
+    hook_name: Optional[str] = Field(None, q='hook_name__icontains')
     cmd: Optional[str] = Field(None, q='cmd__0__icontains')
     pwd: Optional[str] = Field(None, q='pwd__icontains')
     cmd_version: Optional[str] = Field(None, q='cmd_version')

@@ -86,7 +86,7 @@ def add(urls: str | list[str],
         'ONLY_NEW': not update,
         'INDEX_ONLY': index_only,
         'OVERWRITE': overwrite,
-        'EXTRACTORS': plugins,
+        'PLUGINS': plugins,
         'DEFAULT_PERSONA': persona or 'Default',
         'PARSER': parser,
     }

@@ -40,7 +40,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int:
     """
     Run extraction for a single ArchiveResult by ID (used by workers).

-    Triggers the ArchiveResult's state machine tick() to run the extractor.
+    Triggers the ArchiveResult's state machine tick() to run the extractor plugin.
     """
     from rich import print as rprint
     from core.models import ArchiveResult

@@ -51,7 +51,7 @@ def process_archiveresult_by_id(archiveresult_id: str) -> int:
         rprint(f'[red]ArchiveResult {archiveresult_id} not found[/red]', file=sys.stderr)
         return 1

-    rprint(f'[blue]Extracting {archiveresult.extractor} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr)
+    rprint(f'[blue]Extracting {archiveresult.plugin} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr)

     try:
         # Trigger state machine tick - this runs the actual extraction

@@ -151,7 +151,7 @@ def run_plugins(
         # Only create for specific plugin
         result, created = ArchiveResult.objects.get_or_create(
             snapshot=snapshot,
-            extractor=plugin,
+            plugin=plugin,
             defaults={
                 'status': ArchiveResult.StatusChoices.QUEUED,
                 'retry_at': timezone.now(),

@@ -193,7 +193,7 @@ def run_plugins(
         snapshot = Snapshot.objects.get(id=snapshot_id)
         results = snapshot.archiveresult_set.all()
         if plugin:
-            results = results.filter(extractor=plugin)
+            results = results.filter(plugin=plugin)

         for result in results:
             if is_tty:

@@ -202,7 +202,7 @@ def run_plugins(
                     'failed': 'red',
                     'skipped': 'yellow',
                 }.get(result.status, 'dim')
-                rprint(f'  [{status_color}]{result.status}[/{status_color}] {result.extractor} → {result.output_str or ""}', file=sys.stderr)
+                rprint(f'  [{status_color}]{result.status}[/{status_color}] {result.plugin} → {result.output_str or ""}', file=sys.stderr)
             else:
                 write_record(archiveresult_to_jsonl(result))
     except Snapshot.DoesNotExist:

@@ -13,16 +13,16 @@ from archivebox.config import DATA_DIR
 from archivebox.config.common import SERVER_CONFIG
 from archivebox.misc.paginators import AccelleratedPaginator
 from archivebox.base_models.admin import BaseModelAdmin
-from archivebox.hooks import get_extractor_icon
+from archivebox.hooks import get_plugin_icon


 from core.models import ArchiveResult, Snapshot


 def render_archiveresults_list(archiveresults_qs, limit=50):
-    """Render a nice inline list view of archive results with status, extractor, output, and actions."""
+    """Render a nice inline list view of archive results with status, plugin, output, and actions."""

-    results = list(archiveresults_qs.order_by('extractor').select_related('snapshot')[:limit])
+    results = list(archiveresults_qs.order_by('plugin').select_related('snapshot')[:limit])

     if not results:
         return mark_safe('<div style="color: #64748b; font-style: italic; padding: 16px 0;">No Archive Results yet...</div>')

@@ -40,8 +40,8 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
         status = result.status or 'queued'
         color, bg = status_colors.get(status, ('#6b7280', '#f3f4f6'))

-        # Get extractor icon
-        icon = get_extractor_icon(result.extractor)
+        # Get plugin icon
+        icon = get_plugin_icon(result.plugin)

         # Format timestamp
         end_time = result.end_ts.strftime('%Y-%m-%d %H:%M:%S') if result.end_ts else '-'

@@ -79,7 +79,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
                 font-size: 11px; font-weight: 600; text-transform: uppercase;
                 color: {color}; background: {bg};">{status}</span>
             </td>
-            <td style="padding: 10px 12px; white-space: nowrap; font-size: 20px;" title="{result.extractor}">
+            <td style="padding: 10px 12px; white-space: nowrap; font-size: 20px;" title="{result.plugin}">
                 {icon}
             </td>
             <td style="padding: 10px 12px; font-weight: 500; color: #334155;">

@@ -88,7 +88,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
                    title="View output fullscreen"
                    onmouseover="this.style.color='#2563eb'; this.style.textDecoration='underline';"
                    onmouseout="this.style.color='#334155'; this.style.textDecoration='none';">
-                    {result.extractor}
+                    {result.plugin}
                 </a>
             </td>
             <td style="padding: 10px 12px; max-width: 280px;">

@@ -162,7 +162,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
             <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">ID</th>
             <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Status</th>
             <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; width: 32px;"></th>
-            <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Extractor</th>
+            <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Plugin</th>
             <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Output</th>
             <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Completed</th>
             <th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Version</th>

@@ -185,9 +185,9 @@ class ArchiveResultInline(admin.TabularInline):
     parent_model = Snapshot
     # fk_name = 'snapshot'
     extra = 0
-    sort_fields = ('end_ts', 'extractor', 'output_str', 'status', 'cmd_version')
+    sort_fields = ('end_ts', 'plugin', 'output_str', 'status', 'cmd_version')
     readonly_fields = ('id', 'result_id', 'completed', 'command', 'version')
-    fields = ('start_ts', 'end_ts', *readonly_fields, 'extractor', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str')
+    fields = ('start_ts', 'end_ts', *readonly_fields, 'plugin', 'cmd', 'cmd_version', 'pwd', 'created_by', 'status', 'retry_at', 'output_str')
     # exclude = ('id',)
     ordering = ('end_ts',)
     show_change_link = True

@@ -253,9 +253,9 @@ class ArchiveResultInline(admin.TabularInline):

 class ArchiveResultAdmin(BaseModelAdmin):
     list_display = ('id', 'created_by', 'created_at', 'snapshot_info', 'tags_str', 'status', 'extractor_with_icon', 'cmd_str', 'output_str')
-    sort_fields = ('id', 'created_by', 'created_at', 'extractor', 'status')
+    sort_fields = ('id', 'created_by', 'created_at', 'plugin', 'status')
     readonly_fields = ('cmd_str', 'snapshot_info', 'tags_str', 'created_at', 'modified_at', 'output_summary', 'extractor_with_icon', 'iface')
-    search_fields = ('id', 'snapshot__url', 'extractor', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp')
+    search_fields = ('id', 'snapshot__url', 'plugin', 'output_str', 'cmd_version', 'cmd', 'snapshot__timestamp')
     autocomplete_fields = ['snapshot']

     fieldsets = (

@@ -263,8 +263,8 @@ class ArchiveResultAdmin(BaseModelAdmin):
             'fields': ('snapshot', 'snapshot_info', 'tags_str'),
             'classes': ('card', 'wide'),
         }),
-        ('Extractor', {
-            'fields': ('extractor', 'extractor_with_icon', 'status', 'retry_at', 'iface'),
+        ('Plugin', {
+            'fields': ('plugin', 'plugin_with_icon', 'status', 'retry_at', 'iface'),
             'classes': ('card',),
         }),
         ('Timing', {

@@ -285,7 +285,7 @@ class ArchiveResultAdmin(BaseModelAdmin):
         }),
     )

-    list_filter = ('status', 'extractor', 'start_ts', 'cmd_version')
+    list_filter = ('status', 'plugin', 'start_ts', 'cmd_version')
     ordering = ['-start_ts']
     list_per_page = SERVER_CONFIG.SNAPSHOTS_PER_PAGE

@@ -321,14 +321,14 @@ class ArchiveResultAdmin(BaseModelAdmin):
     def tags_str(self, result):
         return result.snapshot.tags_str()

-    @admin.display(description='Extractor', ordering='extractor')
-    def extractor_with_icon(self, result):
-        icon = get_extractor_icon(result.extractor)
+    @admin.display(description='Plugin', ordering='plugin')
+    def plugin_with_icon(self, result):
+        icon = get_plugin_icon(result.plugin)
         return format_html(
             '<span title="{}">{}</span> {}',
-            result.extractor,
+            result.plugin,
             icon,
-            result.extractor,
+            result.plugin,
         )

     def cmd_str(self, result):

@@ -10,19 +10,19 @@ DEPTH_CHOICES = (
     ('1', 'depth = 1 (archive these URLs and all URLs one hop away)'),
 )

-from archivebox.hooks import get_extractors
+from archivebox.hooks import get_plugins

-def get_archive_methods():
-    """Get available archive methods from discovered hooks."""
-    return [(name, name) for name in get_extractors()]
+def get_plugin_choices():
+    """Get available extractor plugins from discovered hooks."""
+    return [(name, name) for name in get_plugins()]


 class AddLinkForm(forms.Form):
     url = forms.RegexField(label="URLs (one per line)", regex=URL_REGEX, min_length='6', strip=True, widget=forms.Textarea, required=True)
     tag = forms.CharField(label="Tags (comma separated tag1,tag2,tag3)", strip=True, required=False)
     depth = forms.ChoiceField(label="Archive depth", choices=DEPTH_CHOICES, initial='0', widget=forms.RadioSelect(attrs={"class": "depth-selection"}))
-    archive_methods = forms.MultipleChoiceField(
-        label="Archive methods (select at least 1, otherwise all will be used by default)",
+    plugins = forms.MultipleChoiceField(
+        label="Plugins (select at least 1, otherwise all will be used by default)",
         required=False,
         widget=forms.SelectMultiple,
         choices=[],  # populated dynamically in __init__

@@ -30,7 +30,7 @@ class AddLinkForm(forms.Form):

     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
-        self.fields['archive_methods'].choices = get_archive_methods()
+        self.fields['plugins'].choices = get_plugin_choices()
         # TODO: hook these up to the view and put them
         # in a collapsible UI section labeled "Advanced"
         #

@@ -867,6 +867,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea
             ArchiveResult.objects.create(
                 snapshot=self,
+                plugin=plugin,
                 hook_name=result_data.get('hook_name', ''),
                 status=result_data.get('status', 'failed'),
                 output_str=result_data.get('output', ''),
                 cmd=result_data.get('cmd', []),

@@ -1162,7 +1163,7 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea
                 continue
             pid_file = plugin_dir / 'hook.pid'
             if pid_file.exists():
-                kill_process(pid_file)
+                kill_process(pid_file, validate=True)  # Use validation

             # Update the ArchiveResult from filesystem
             plugin_name = plugin_dir.name

@@ -192,7 +192,7 @@ class ArchiveResultMachine(StateMachine, strict_states=True):

     def is_backoff(self) -> bool:
         """Check if we should backoff and retry later."""
-        # Backoff if status is still started (extractor didn't complete) and output_str is empty
+        # Backoff if status is still started (plugin didn't complete) and output_str is empty
         return (
             self.archiveresult.status == ArchiveResult.StatusChoices.STARTED and
             not self.archiveresult.output_str

@@ -222,19 +222,19 @@ class ArchiveResultMachine(StateMachine, strict_states=True):
         # Suppressed: state transition logs
         # Lock the object and mark start time
         self.archiveresult.update_for_workers(
-            retry_at=timezone.now() + timedelta(seconds=120),  # 2 min timeout for extractor
+            retry_at=timezone.now() + timedelta(seconds=120),  # 2 min timeout for plugin
             status=ArchiveResult.StatusChoices.STARTED,
             start_ts=timezone.now(),
             iface=NetworkInterface.current(),
         )

-        # Run the extractor - this updates status, output, timestamps, etc.
+        # Run the plugin - this updates status, output, timestamps, etc.
         self.archiveresult.run()

         # Save the updated result
         self.archiveresult.save()

-        # Suppressed: extractor result logs (already logged by worker)
+        # Suppressed: plugin result logs (already logged by worker)

     @backoff.enter
     def enter_backoff(self):

@@ -5,7 +5,7 @@ from django.utils.safestring import mark_safe
 from typing import Union

 from archivebox.hooks import (
-    get_extractor_icon, get_extractor_template, get_extractor_name,
+    get_plugin_icon, get_plugin_template, get_plugin_name,
 )


@@ -51,30 +51,30 @@ def url_replace(context, **kwargs):


 @register.simple_tag
-def extractor_icon(extractor: str) -> str:
+def plugin_icon(plugin: str) -> str:
     """
-    Render the icon for an extractor.
+    Render the icon for a plugin.

-    Usage: {% extractor_icon "screenshot" %}
+    Usage: {% plugin_icon "screenshot" %}
     """
-    return mark_safe(get_extractor_icon(extractor))
+    return mark_safe(get_plugin_icon(plugin))


 @register.simple_tag(takes_context=True)
-def extractor_thumbnail(context, result) -> str:
+def plugin_thumbnail(context, result) -> str:
     """
     Render the thumbnail template for an archive result.

-    Usage: {% extractor_thumbnail result %}
+    Usage: {% plugin_thumbnail result %}

     Context variables passed to template:
     - result: ArchiveResult object
     - snapshot: Parent Snapshot object
     - output_path: Path to output relative to snapshot dir (from embed_path())
-    - extractor: Extractor base name
+    - plugin: Plugin base name
     """
-    extractor = get_extractor_name(result.extractor)
-    template_str = get_extractor_template(extractor, 'thumbnail')
+    plugin = get_plugin_name(result.plugin)
+    template_str = get_plugin_template(plugin, 'thumbnail')

     if not template_str:
         return ''

@@ -89,7 +89,7 @@ def extractor_thumbnail(context, result) -> str:
             'result': result,
             'snapshot': result.snapshot,
             'output_path': output_path,
-            'extractor': extractor,
+            'plugin': plugin,
         })
         return mark_safe(tpl.render(ctx))
     except Exception:

@@ -97,14 +97,14 @@ def extractor_thumbnail(context, result) -> str:


 @register.simple_tag(takes_context=True)
-def extractor_embed(context, result) -> str:
+def plugin_embed(context, result) -> str:
     """
     Render the embed iframe template for an archive result.

-    Usage: {% extractor_embed result %}
+    Usage: {% plugin_embed result %}
     """
-    extractor = get_extractor_name(result.extractor)
-    template_str = get_extractor_template(extractor, 'embed')
+    plugin = get_plugin_name(result.plugin)
+    template_str = get_plugin_template(plugin, 'embed')

     if not template_str:
         return ''

@@ -117,7 +117,7 @@ def extractor_embed(context, result) -> str:
             'result': result,
             'snapshot': result.snapshot,
             'output_path': output_path,
-            'extractor': extractor,
+            'plugin': plugin,
         })
         return mark_safe(tpl.render(ctx))
     except Exception:

@@ -125,14 +125,14 @@ def extractor_embed(context, result) -> str:


 @register.simple_tag(takes_context=True)
-def extractor_fullscreen(context, result) -> str:
+def plugin_fullscreen(context, result) -> str:
     """
     Render the fullscreen template for an archive result.

-    Usage: {% extractor_fullscreen result %}
+    Usage: {% plugin_fullscreen result %}
     """
-    extractor = get_extractor_name(result.extractor)
-    template_str = get_extractor_template(extractor, 'fullscreen')
+    plugin = get_plugin_name(result.plugin)
+    template_str = get_plugin_template(plugin, 'fullscreen')

     if not template_str:
         return ''

@@ -145,7 +145,7 @@ def extractor_fullscreen(context, result) -> str:
             'result': result,
             'snapshot': result.snapshot,
             'output_path': output_path,
-            'extractor': extractor,
+            'plugin': plugin,
         })
         return mark_safe(tpl.render(ctx))
     except Exception:

@@ -153,10 +153,10 @@ def extractor_fullscreen(context, result) -> str:


 @register.filter
-def extractor_name(value: str) -> str:
+def plugin_name(value: str) -> str:
     """
-    Get the base name of an extractor (strips numeric prefix).
+    Get the base name of a plugin (strips numeric prefix).

-    Usage: {{ result.extractor|extractor_name }}
+    Usage: {{ result.plugin|plugin_name }}
     """
-    return get_extractor_name(value)
+    return get_plugin_name(value)

@@ -56,9 +56,9 @@ class SnapshotView(View):
     def render_live_index(request, snapshot):
         TITLE_LOADING_MSG = 'Not yet archived...'

-        # Dict of extractor -> ArchiveResult object
+        # Dict of plugin -> ArchiveResult object
         archiveresult_objects = {}
-        # Dict of extractor -> result info dict (for template compatibility)
+        # Dict of plugin -> result info dict (for template compatibility)
         archiveresults = {}

         results = snapshot.archiveresult_set.all()

@@ -75,16 +75,16 @@ class SnapshotView(View):
                 continue

             # Store the full ArchiveResult object for template tags
-            archiveresult_objects[result.extractor] = result
+            archiveresult_objects[result.plugin] = result

             result_info = {
-                'name': result.extractor,
+                'name': result.plugin,
                 'path': embed_path,
                 'ts': ts_to_date_str(result.end_ts),
                 'size': abs_path.stat().st_size or '?',
                 'result': result,  # Include the full object for template tags
             }
-            archiveresults[result.extractor] = result_info
+            archiveresults[result.plugin] = result_info

         # Use canonical_outputs for intelligent discovery
         # This method now scans ArchiveResults and uses smart heuristics

@@ -119,8 +119,8 @@ class SnapshotView(View):

         # Get available extractors from hooks (sorted by numeric prefix for ordering)
         # Convert to base names for display ordering
-        all_extractors = [get_extractor_name(e) for e in get_extractors()]
-        preferred_types = tuple(all_extractors)
+        all_plugins = [get_extractor_name(e) for e in get_extractors()]
+        preferred_types = tuple(all_plugins)
         all_types = preferred_types + tuple(result_type for result_type in archiveresults.keys() if result_type not in preferred_types)

         best_result = {'path': 'None', 'result': None}

@@ -463,7 +463,6 @@ class AddView(UserPassesTestMixin, FormView):
         urls_content = sources_file.read_text()
         crawl = Crawl.objects.create(
             urls=urls_content,
-            extractor=parser,
             max_depth=depth,
             tags_str=tag,
             label=f'{self.request.user.username}@{HOSTNAME}{self.request.path} {timestamp}',

@@ -598,12 +597,12 @@ def live_progress_view(request):
                 ArchiveResult.StatusChoices.SUCCEEDED: 2,
                 ArchiveResult.StatusChoices.FAILED: 3,
             }
-            return (status_order.get(ar.status, 4), ar.extractor)
+            return (status_order.get(ar.status, 4), ar.plugin)

-        all_extractors = [
+        all_plugins = [
             {
                 'id': str(ar.id),
-                'extractor': ar.extractor,
+                'plugin': ar.plugin,
                 'status': ar.status,
             }
             for ar in sorted(snapshot_results, key=extractor_sort_key)

@@ -619,7 +618,7 @@ def live_progress_view(request):
             'completed_extractors': completed_extractors,
             'failed_extractors': failed_extractors,
             'pending_extractors': pending_extractors,
-            'all_extractors': all_extractors,
+            'all_plugins': all_plugins,
         })

     # Check if crawl can start (for debugging stuck crawls)

@@ -353,10 +353,18 @@ class Crawl(ModelWithOutputDir, ModelWithConfig, ModelWithHealthStats, ModelWith
         import signal
         from pathlib import Path
         from archivebox.hooks import run_hook, discover_hooks
+        from archivebox.misc.process_utils import validate_pid_file

         # Kill any background processes by scanning for all .pid files
         if self.OUTPUT_DIR.exists():
             for pid_file in self.OUTPUT_DIR.glob('**/*.pid'):
+                # Validate PID before killing to avoid killing unrelated processes
+                cmd_file = pid_file.parent / 'cmd.sh'
+                if not validate_pid_file(pid_file, cmd_file):
+                    # PID reused by different process or process dead
+                    pid_file.unlink(missing_ok=True)
+                    continue
+
                 try:
                     pid = int(pid_file.read_text().strip())
                     try:

@@ -292,8 +292,13 @@ def run_hook(
     stdout_file = output_dir / 'stdout.log'
     stderr_file = output_dir / 'stderr.log'
     pid_file = output_dir / 'hook.pid'
+    cmd_file = output_dir / 'cmd.sh'

     try:
+        # Write command script for validation
+        from archivebox.misc.process_utils import write_cmd_file
+        write_cmd_file(cmd_file, cmd)
+
         # Open log files for writing
         with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err:
             process = subprocess.Popen(

@@ -304,8 +309,10 @@ def run_hook(
                 env=env,
             )

-            # Write PID for all hooks (useful for debugging/cleanup)
-            pid_file.write_text(str(process.pid))
+            # Write PID with mtime set to process start time for validation
+            from archivebox.misc.process_utils import write_pid_file_with_mtime
+            process_start_time = time.time()
+            write_pid_file_with_mtime(pid_file, process.pid, process_start_time)

             if is_background:
                 # Background hook - return None immediately, don't wait

@@ -1327,21 +1334,29 @@ def process_is_alive(pid_file: Path) -> bool:
     return False


-def kill_process(pid_file: Path, sig: int = signal.SIGTERM):
+def kill_process(pid_file: Path, sig: int = signal.SIGTERM, validate: bool = True):
     """
-    Kill process in PID file.
+    Kill process in PID file with optional validation.

     Args:
         pid_file: Path to hook.pid file
         sig: Signal to send (default SIGTERM)
+        validate: If True, validate process identity before killing (default: True)
     """
-    if not pid_file.exists():
-        return
-
-    try:
-        pid = int(pid_file.read_text().strip())
-        os.kill(pid, sig)
-    except (OSError, ValueError):
-        pass
+    from archivebox.misc.process_utils import safe_kill_process
+
+    if validate:
+        # Use safe kill with validation
+        cmd_file = pid_file.parent / 'cmd.sh'
+        safe_kill_process(pid_file, cmd_file, signal_num=sig, validate=True)
+    else:
+        # Legacy behavior - kill without validation
+        if not pid_file.exists():
+            return
+        try:
+            pid = int(pid_file.read_text().strip())
+            os.kill(pid, sig)
+        except (OSError, ValueError):
+            pass

@@ -178,7 +178,8 @@ def archiveresult_to_jsonl(result) -> Dict[str, Any]:
         'type': TYPE_ARCHIVERESULT,
         'id': str(result.id),
         'snapshot_id': str(result.snapshot_id),
-        'extractor': result.extractor,
+        'plugin': result.plugin,
+        'hook_name': result.hook_name,
         'status': result.status,
         'output_str': result.output_str,
         'start_ts': result.start_ts.isoformat() if result.start_ts else None,

@@ -29,6 +29,98 @@ const http = require('http');
 const EXTRACTOR_NAME = 'chrome_launch';
 const OUTPUT_DIR = 'chrome';

+// Helper: Write PID file with mtime set to process start time
+function writePidWithMtime(filePath, pid, startTimeSeconds) {
+    fs.writeFileSync(filePath, String(pid));
+    // Set both atime and mtime to process start time for validation
+    const startTimeMs = startTimeSeconds * 1000;
+    fs.utimesSync(filePath, new Date(startTimeMs), new Date(startTimeMs));
+}
+
+// Helper: Write command script for validation
+function writeCmdScript(filePath, binary, args) {
+    // Shell escape arguments containing spaces or special characters
+    const escapedArgs = args.map(arg => {
+        if (arg.includes(' ') || arg.includes('"') || arg.includes('$')) {
+            return `"${arg.replace(/"/g, '\\"')}"`;
+        }
+        return arg;
+    });
+    const script = `#!/bin/bash\n${binary} ${escapedArgs.join(' ')}\n`;
+    fs.writeFileSync(filePath, script);
+    fs.chmodSync(filePath, 0o755);
+}
+
+// Helper: Get process start time (cross-platform)
+function getProcessStartTime(pid) {
+    try {
+        const { execSync } = require('child_process');
+        if (process.platform === 'darwin') {
+            // macOS: ps -p PID -o lstart= gives start time
+            const output = execSync(`ps -p ${pid} -o lstart=`, { encoding: 'utf8', timeout: 1000 });
+            return Date.parse(output.trim()) / 1000; // Convert to epoch seconds
+        } else {
+            // Linux: read /proc/PID/stat field 22 (starttime in clock ticks)
+            const stat = fs.readFileSync(`/proc/${pid}/stat`, 'utf8');
+            const match = stat.match(/\) \w+ (\d+)/);
+            if (match) {
+                const startTicks = parseInt(match[1], 10);
+                // Convert clock ticks to seconds (assuming 100 ticks/sec)
+                const uptimeSeconds = parseFloat(fs.readFileSync('/proc/uptime', 'utf8').split(' ')[0]);
+                const bootTime = Date.now() / 1000 - uptimeSeconds;
+                return bootTime + (startTicks / 100);
+            }
+        }
+    } catch (e) {
+        // Can't get start time
+        return null;
+    }
+    return null;
+}
+
+// Helper: Validate PID using mtime and command
+function validatePid(pid, pidFile, cmdFile) {
+    try {
+        // Check process exists
+        try {
+            process.kill(pid, 0); // Signal 0 = check existence
+        } catch (e) {
+            return false; // Process doesn't exist
+        }
+
+        // Check mtime matches process start time (within 5 sec tolerance)
+        const fileStat = fs.statSync(pidFile);
+        const fileMtime = fileStat.mtimeMs / 1000; // Convert to seconds
+        const procStartTime = getProcessStartTime(pid);
+
+        if (procStartTime === null) {
+            // Can't validate - fall back to basic existence check
+            return true;
+        }
+
+        if (Math.abs(fileMtime - procStartTime) > 5) {
+            // PID was reused by different process
+            return false;
+        }
+
+        // Validate command if available
+        if (fs.existsSync(cmdFile)) {
+            const cmd = fs.readFileSync(cmdFile, 'utf8');
+            // Check for Chrome/Chromium and debug port
+            if (!cmd.includes('chrome') && !cmd.includes('chromium')) {
+                return false;
+            }
+            if (!cmd.includes('--remote-debugging-port')) {
+                return false;
+            }
+        }
+
+        return true;
+    } catch (e) {
+        return false;
+    }
+}
+
 // Global state for cleanup
 let chromePid = null;

@@ -240,17 +332,17 @@ function killZombieChrome() {
             const pid = parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
             if (isNaN(pid) || pid <= 0) continue;

-            // Check if process exists
-            try {
-                process.kill(pid, 0);
-            } catch (e) {
-                // Process dead, remove stale PID file
+            // Validate PID before killing
+            const cmdFile = path.join(chromeDir, 'cmd.sh');
+            if (!validatePid(pid, pidFile, cmdFile)) {
+                // PID reused or validation failed
+                console.error(`[!] PID ${pid} failed validation (reused or wrong process) - cleaning up`);
                 try { fs.unlinkSync(pidFile); } catch (e) {}
                 continue;
             }

-            // Process alive but crawl is stale - zombie!
-            console.error(`[!] Found zombie (PID ${pid}) from stale crawl ${crawl.name}`);
+            // Process alive, validated, and crawl is stale - zombie!
+            console.error(`[!] Found validated zombie (PID ${pid}) from stale crawl ${crawl.name}`);

             try {
                 // Kill process group first

@@ -386,15 +478,20 @@ async function launchChrome(binary) {
     chromeProcess.unref(); // Don't keep Node.js process running

     chromePid = chromeProcess.pid;
+    const chromeStartTime = Date.now() / 1000; // Unix epoch seconds
     console.error(`[*] Launched Chrome (PID: ${chromePid}), waiting for debug port...`);

-    // Write Chrome PID for backup cleanup (named .pid so Crawl.cleanup() finds it)
-    fs.writeFileSync(path.join(OUTPUT_DIR, 'chrome.pid'), String(chromePid));
+    // Write Chrome PID with mtime set to start time for validation
+    writePidWithMtime(path.join(OUTPUT_DIR, 'chrome.pid'), chromePid, chromeStartTime);
+
+    // Write command script for validation
+    writeCmdScript(path.join(OUTPUT_DIR, 'cmd.sh'), binary, chromeArgs);

     fs.writeFileSync(path.join(OUTPUT_DIR, 'port.txt'), String(debugPort));

     // Write hook's own PID so Crawl.cleanup() can kill this hook process
     // (which will trigger our SIGTERM handler to kill Chrome)
-    fs.writeFileSync(path.join(OUTPUT_DIR, 'hook.pid'), String(process.pid));
+    // Write hook's own PID with mtime for validation
+    const hookStartTime = Date.now() / 1000;
+    writePidWithMtime(path.join(OUTPUT_DIR, 'hook.pid'), process.pid, hookStartTime);

     try {
         // Wait for Chrome to be ready

@@ -488,7 +488,7 @@
                 <span class="progress-fill"></span>
                 <span class="badge-content">
                     <span class="badge-icon">${icon}</span>
-                    <span>${extractor.extractor}</span>
+                    <span>${extractor.plugin}</span>
                 </span>
             </span>
         `;

@@ -499,10 +499,10 @@
         const adminUrl = `/admin/core/snapshot/${snapshot.id}/change/`;

         let extractorHtml = '';
-        if (snapshot.all_extractors && snapshot.all_extractors.length > 0) {
-            // Sort extractors alphabetically by name to prevent reordering on updates
-            const sortedExtractors = [...snapshot.all_extractors].sort((a, b) =>
-                a.extractor.localeCompare(b.extractor)
+        if (snapshot.all_plugins && snapshot.all_plugins.length > 0) {
+            // Sort plugins alphabetically by name to prevent reordering on updates
+            const sortedExtractors = [...snapshot.all_plugins].sort((a, b) =>
+                a.plugin.localeCompare(b.plugin)
             );
             extractorHtml = `
                 <div class="extractor-list">

@@ -158,7 +158,7 @@ CREATE TABLE IF NOT EXISTS core_snapshot_tags (
 CREATE TABLE IF NOT EXISTS core_archiveresult (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     snapshot_id CHAR(32) NOT NULL REFERENCES core_snapshot(id),
-    extractor VARCHAR(32) NOT NULL,
+    plugin VARCHAR(32) NOT NULL,
     cmd TEXT,
     pwd VARCHAR(256),
     cmd_version VARCHAR(128),

@@ -379,7 +379,7 @@ CREATE TABLE IF NOT EXISTS crawls_seed (
    created_by_id INTEGER NOT NULL REFERENCES auth_user(id),
    modified_at DATETIME,
    uri VARCHAR(2048) NOT NULL,
-   extractor VARCHAR(32) NOT NULL DEFAULT 'auto',
+   plugin VARCHAR(32) NOT NULL DEFAULT 'auto',
    tags_str VARCHAR(255) NOT NULL DEFAULT '',
    label VARCHAR(255) NOT NULL DEFAULT '',
    config TEXT DEFAULT '{}',

@@ -465,7 +465,7 @@ CREATE TABLE IF NOT EXISTS core_archiveresult (
    created_at DATETIME NOT NULL,
    modified_at DATETIME,
    snapshot_id CHAR(36) NOT NULL REFERENCES core_snapshot(id),
-   extractor VARCHAR(32) NOT NULL,
+   plugin VARCHAR(32) NOT NULL,
    pwd VARCHAR(256),
    cmd TEXT,
    cmd_version VARCHAR(128),
@@ -227,10 +227,10 @@ class Worker:
         urls = obj.get_urls_list()
         url = urls[0] if urls else None

-        extractor = None
-        if hasattr(obj, 'extractor'):
+        plugin = None
+        if hasattr(obj, 'plugin'):
             # ArchiveResultWorker, Crawl
-            extractor = obj.extractor
+            plugin = obj.plugin

         log_worker_event(
             worker_type=worker_type_name,
@@ -239,7 +239,7 @@ class Worker:
             pid=self.pid,
             worker_id=str(self.worker_id),
             url=url,
-            extractor=extractor,
+            plugin=plugin,
             metadata=start_metadata if start_metadata else None,
         )

@@ -262,7 +262,7 @@ class Worker:
                 pid=self.pid,
                 worker_id=str(self.worker_id),
                 url=url,
-                extractor=extractor,
+                plugin=plugin,
                 metadata=complete_metadata,
             )
         else:

@@ -345,9 +345,9 @@ class ArchiveResultWorker(Worker):
     name: ClassVar[str] = 'archiveresult'
     MAX_TICK_TIME: ClassVar[int] = 120

-    def __init__(self, extractor: str | None = None, **kwargs: Any):
+    def __init__(self, plugin: str | None = None, **kwargs: Any):
         super().__init__(**kwargs)
-        self.extractor = extractor
+        self.plugin = plugin

     def get_model(self):
         from core.models import ArchiveResult
@@ -359,16 +359,16 @@ class ArchiveResultWorker(Worker):

         qs = super().get_queue()

-        if self.extractor:
-            qs = qs.filter(extractor=self.extractor)
+        if self.plugin:
+            qs = qs.filter(plugin=self.plugin)

         # Note: Removed blocking logic since plugins have separate output directories
-        # and don't interfere with each other. Each plugin (extractor) runs independently.
+        # and don't interfere with each other. Each plugin runs independently.

         return qs

     def process_item(self, obj) -> bool:
-        """Process an ArchiveResult by running its extractor."""
+        """Process an ArchiveResult by running its plugin."""
         try:
             obj.sm.tick()
             return True
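The `get_queue()` filter above is what makes per-plugin workers safe to run concurrently: each worker only ever sees its own plugin's rows, and since plugins write to separate output directories there is nothing left to lock. A self-contained toy model of that filter pattern (names here are stand-ins, not the real Django models):

```python
# Toy model of the per-plugin queue filter -- illustrative names only.
from dataclasses import dataclass

@dataclass
class FakeArchiveResult:
    plugin: str
    status: str = 'queued'

QUEUE = [
    FakeArchiveResult('wget'),
    FakeArchiveResult('screenshot'),
    FakeArchiveResult('wget'),
]

def get_queue(plugin: str | None = None) -> list[FakeArchiveResult]:
    """Mirrors the shape of ArchiveResultWorker.get_queue(): optional plugin filter."""
    qs = [ar for ar in QUEUE if ar.status == 'queued']
    if plugin:
        qs = [ar for ar in qs if ar.plugin == plugin]
    return qs

assert len(get_queue('wget')) == 2  # a wget worker sees only wget rows
assert len(get_queue()) == 3        # an unfiltered worker drains everything
```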
@@ -378,8 +378,8 @@ class ArchiveResultWorker(Worker):
             return False

     @classmethod
-    def start(cls, worker_id: int | None = None, daemon: bool = False, extractor: str | None = None, **kwargs: Any) -> int:
-        """Fork a new worker as subprocess with optional extractor filter."""
+    def start(cls, worker_id: int | None = None, daemon: bool = False, plugin: str | None = None, **kwargs: Any) -> int:
+        """Fork a new worker as subprocess with optional plugin filter."""
         if worker_id is None:
             worker_id = get_next_worker_id(cls.name)

@@ -387,7 +387,7 @@ class ArchiveResultWorker(Worker):
         proc = Process(
             target=_run_worker,
             args=(cls.name, worker_id, daemon),
-            kwargs={'extractor': extractor, **kwargs},
+            kwargs={'plugin': plugin, **kwargs},
             name=f'{cls.name}_worker_{worker_id}',
         )
         proc.start()
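The `start()` change is purely mechanical, but note that the renamed kwarg has to travel through `multiprocessing.Process` into the child. A runnable approximation with stub helpers (the real `_run_worker` does much more than print):

```python
# Stub approximation of the fork in ArchiveResultWorker.start() -- assumptions only.
from multiprocessing import Process

def _run_worker(name: str, worker_id: int, daemon: bool, plugin: str | None = None) -> None:
    # Stand-in for the real worker loop; just proves the kwarg arrives intact.
    print(f'{name}_worker_{worker_id} filtering on plugin={plugin!r}')

if __name__ == '__main__':
    proc = Process(
        target=_run_worker,
        args=('archiveresult', 1, False),
        kwargs={'plugin': 'wget'},  # the renamed filter kwarg
        name='archiveresult_worker_1',
    )
    proc.start()
    proc.join()
```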
@@ -74,17 +74,17 @@ def test_background_hooks_dont_block_parser_extractors(tmp_path, process):
     # Check that background hooks are running
     # Background hooks: consolelog, ssl, responses, redirects, staticfile
     bg_hooks = c.execute(
-        "SELECT extractor, status FROM core_archiveresult WHERE extractor IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') ORDER BY extractor"
+        "SELECT plugin, status FROM core_archiveresult WHERE plugin IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') ORDER BY plugin"
     ).fetchall()

     # Check that parser extractors have run (not stuck in queued)
     parser_extractors = c.execute(
-        "SELECT extractor, status FROM core_archiveresult WHERE extractor LIKE 'parse_%_urls' ORDER BY extractor"
+        "SELECT plugin, status FROM core_archiveresult WHERE plugin LIKE 'parse_%_urls' ORDER BY plugin"
     ).fetchall()

     # Check all extractors to see what's happening
     all_extractors = c.execute(
-        "SELECT extractor, status FROM core_archiveresult ORDER BY extractor"
+        "SELECT plugin, status FROM core_archiveresult ORDER BY plugin"
     ).fetchall()

     conn.close()
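The three queries in this test differ only in their `WHERE` clause, and the same shape recurs in the tests below. If that keeps spreading, a tiny helper could collapse the duplication; a sketch, assuming the tests stay on raw `sqlite3` (the helper name and db path are made up):

```python
# Hypothetical test helper -- not part of the codebase.
import sqlite3

def plugin_statuses(db_path: str, where: str = '1=1', params: tuple = ()) -> list[tuple]:
    """Return (plugin, status) rows from core_archiveresult, ordered by plugin."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            f"SELECT plugin, status FROM core_archiveresult WHERE {where} ORDER BY plugin",
            params,
        ).fetchall()
    finally:
        conn.close()

# e.g.: plugin_statuses('index.sqlite3', "plugin LIKE 'parse_%_urls'")
```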
@@ -160,7 +160,7 @@ def test_parser_extractors_emit_snapshot_jsonl(tmp_path, process):

     # Check that parse_html_urls ran
     parse_html = c.execute(
-        "SELECT id, status, output_str FROM core_archiveresult WHERE extractor = '60_parse_html_urls'"
+        "SELECT id, status, output_str FROM core_archiveresult WHERE plugin = '60_parse_html_urls'"
     ).fetchone()

     conn.close()

@@ -171,7 +171,7 @@ def test_parser_extractors_emit_snapshot_jsonl(tmp_path, process):

     # Parser should have run
     assert status in ['started', 'succeeded', 'failed'], \
-        f"parse_html_urls should have run, got status: {status}"
+        f"60_parse_html_urls should have run, got status: {status}"

     # If it succeeded and found links, output should contain JSON
     if status == 'succeeded' and output:
@@ -185,39 +185,37 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process):
     """Test that recursive crawling creates child snapshots with proper depth and parent_snapshot_id."""
     os.chdir(tmp_path)

-    # Disable most extractors to speed up test, but keep wget for HTML content
+    # Create a test HTML file with links
+    test_html = tmp_path / 'test.html'
+    test_html.write_text('''
+    <html>
+    <body>
+    <h1>Test Page</h1>
+    <a href="https://monadical.com/about">About</a>
+    <a href="https://monadical.com/blog">Blog</a>
+    <a href="https://monadical.com/contact">Contact</a>
+    </body>
+    </html>
+    ''')

     # Minimal env for fast testing
     env = os.environ.copy()
     env.update({
         "USE_WGET": "true",  # Need wget to fetch HTML for parsers
         "USE_SINGLEFILE": "false",
         "USE_READABILITY": "false",
         "USE_MERCURY": "false",
         "SAVE_HTMLTOTEXT": "false",
         "SAVE_PDF": "false",
         "SAVE_SCREENSHOT": "false",
         "SAVE_DOM": "false",
         "SAVE_HEADERS": "false",
         "USE_GIT": "false",
         "SAVE_MEDIA": "false",
         "SAVE_ARCHIVE_DOT_ORG": "false",
         "SAVE_TITLE": "false",
         "SAVE_FAVICON": "false",
         "USE_CHROME": "false",
         "URL_ALLOWLIST": r"monadical\.com/.*",  # Only crawl same domain
     })

     # Start a crawl with depth=1 (just one hop to test recursive crawling)
+    # Use file:// URL so it's instant, no network fetch needed
     proc = subprocess.Popen(
-        ['archivebox', 'add', '--depth=1', 'https://monadical.com'],
+        ['archivebox', 'add', '--depth=1', f'file://{test_html}'],
         stdout=subprocess.PIPE,
         stderr=subprocess.PIPE,
         text=True,
         env=env,
     )

-    # Give orchestrator time to process - parser extractors should emit child snapshots within 60s
-    # Even if root snapshot is still processing, child snapshots can start in parallel
-    time.sleep(60)
+    # Give orchestrator time to process - file:// is fast, should complete in 20s
+    time.sleep(20)

     # Kill the process
     proc.kill()

@@ -231,8 +229,7 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process):

     # Check root snapshot (depth=0)
     root_snapshot = c.execute(
-        "SELECT id, url, depth, parent_snapshot_id FROM core_snapshot WHERE url = ? AND depth = 0",
-        ('https://monadical.com',)
+        "SELECT id, url, depth, parent_snapshot_id FROM core_snapshot WHERE depth = 0 ORDER BY created_at LIMIT 1"
     ).fetchone()

     # Check if any child snapshots were created (depth=1)

@@ -247,13 +244,13 @@ def test_recursive_crawl_creates_child_snapshots(tmp_path, process):

     # Check parser extractor status
     parser_status = c.execute(
-        "SELECT extractor, status FROM core_archiveresult WHERE snapshot_id = ? AND extractor LIKE 'parse_%_urls'",
+        "SELECT plugin, status FROM core_archiveresult WHERE snapshot_id = ? AND plugin LIKE 'parse_%_urls'",
         (root_snapshot[0] if root_snapshot else '',)
     ).fetchall()

     # Check for started extractors that might be blocking
     started_extractors = c.execute(
-        "SELECT extractor, status FROM core_archiveresult WHERE snapshot_id = ? AND status = 'started'",
+        "SELECT plugin, status FROM core_archiveresult WHERE snapshot_id = ? AND status = 'started'",
         (root_snapshot[0] if root_snapshot else '',)
     ).fetchall()

@@ -417,12 +414,12 @@ def test_archiveresult_worker_queue_filters_by_foreground_extractors(tmp_path, p

     # Get background hooks that are started
     bg_started = c.execute(
-        "SELECT extractor FROM core_archiveresult WHERE extractor IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') AND status = 'started'"
+        "SELECT plugin FROM core_archiveresult WHERE plugin IN ('consolelog', 'ssl', 'responses', 'redirects', 'staticfile') AND status = 'started'"
     ).fetchall()

     # Get parser extractors that should be queued or better
     parser_status = c.execute(
-        "SELECT extractor, status FROM core_archiveresult WHERE extractor LIKE 'parse_%_urls'"
+        "SELECT plugin, status FROM core_archiveresult WHERE plugin LIKE 'parse_%_urls'"
     ).fetchall()

     conn.close()