mirror of https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-06 07:47:53 +10:00

commit: move tests into subfolder, add missing install hooks

old/Architecture.md (new file, +172 lines):

# ArchiveBox UI

## Page: Getting Started

### What do you want to capture?

- Save some URLs now -> [Add page]
    - Paste some URLs to archive now
    - Upload a file containing URLs (bookmarks.html export, RSS .xml feed, Markdown file, Word doc, PDF, etc.)
    - Pull in URLs to archive from a remote location (e.g. an RSS feed URL, remote TXT file, JSON file, etc.)

- Import URLs from a browser -> [Import page]
    - Desktop: get the ArchiveBox Chrome/Firefox extension
    - Mobile: get the ArchiveBox iOS App / Android App
    - Upload a bookmarks.html export file
    - Upload a browser_history.sqlite3 export file

- Import URLs from a 3rd-party bookmarking service -> [Sync page]
    - Pocket
    - Pinboard
    - Instapaper
    - Wallabag
    - Zapier, N8N, IFTTT, etc.
    - Upload a bookmarks.html export, bookmarks.json, RSS, etc. file

- Archive URLs on a schedule -> [Schedule page]

- Archive an entire website -> [Crawl page]
    - What starting URL/domain?
    - How deep?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?

- Crawl for URLs with a search engine and save automatically
- Save some URLs on a schedule
    - Save an entire website (e.g. `https://example.com`)
    - Save results matching a search query (e.g. "site:example.com")
    - Save a social media feed (e.g. `https://x.com/user/1234567890`)

--------------------------------------------------------------------------------

### Crawls App

- Archive an entire website -> [Crawl page]
    - What are the starting URLs?
    - How many hops to follow?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?

--------------------------------------------------------------------------------

### Scheduler App

- Archive URLs on a schedule -> [Schedule page]
    - What URL(s)?
    - How often?
    - Do you want to discard old snapshots after x amount of time?
    - Any filter rules?
    - Want to be notified when changes are detected? -> redirect to [Alerts app: create new alert(crawl=self)]

* Choose Schedule to check for new URLs: `Schedule.objects.get(pk=xyz)`
    - 1 minute
    - 5 minutes
    - 1 hour
    - 1 day

* Choose destination Crawl to archive URLs with: `Crawl.objects.get(pk=xyz)`
    - Tags
    - Persona
    - Created By ID
    - Config
    - Filters
        - URL patterns to include
        - URL patterns to exclude
    - ONLY_NEW= ignore URLs if already saved once / save URL each time it appears / only save if last save > x time ago

--------------------------------------------------------------------------------

### Sources App (for managing sources that ArchiveBox pulls URLs in from)

- Add a new source to pull URLs in from (WIZARD)
    - Choose URI:
        - [x] Web UI
        - [x] CLI
        - Local filesystem path (directory to monitor for new files containing URLs)
        - Remote URL (RSS/JSON/XML feed)
        - Chrome browser profile sync (log in using Gmail to pull bookmarks/history)
        - Pocket, Pinboard, Instapaper, Wallabag, etc.
        - Zapier, N8N, IFTTT, etc.
        - Local server filesystem path (directory to monitor for new files containing URLs)
        - Google Drive (directory to monitor for new files containing URLs)
        - Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)
        - AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)
        - XBrowserSync (log in to pull bookmarks)
    - Choose extractor:
        - auto
        - RSS
        - Pocket
        - etc.
    - Specify extra Config, e.g.
        - credentials
        - extractor tuning options (e.g. verify_ssl, cookies, etc.)

- Provide credentials for the source
    - API Key
    - Username / Password
    - OAuth

--------------------------------------------------------------------------------

### Alerts App

- Create a new alert, choose condition:
    - Get notified when a site goes down (<x% success ratio for Snapshots)
    - Get notified when a site changes visually more than x% (screenshot diff)
    - Get notified when a site's text content changes more than x% (text diff)
    - Get notified when a keyword appears
    - Get notified when a keyword disappears
    - Get notified when an AI prompt returns some result

- Choose alert threshold:
    - any condition is met
    - all conditions are met
    - condition is met for x% of URLs
    - condition is met for x% of time

- Choose how to notify: (List[AlertDestination])
    - maximum alert frequency
    - destination type: email / Slack / Webhook / Google Sheet / logfile
    - destination info:
        - email address(es)
        - Slack channel
        - Webhook URL

- Choose scope:
    - Choose ArchiveResult scope (extractors): (a query that returns an ArchiveResult.objects QuerySet)
        - All extractors
        - Only screenshots
        - Only readability / mercury text
        - Only video
        - Only html
        - Only headers
    - Choose Snapshot scope (URL): (a query that returns a Snapshot.objects QuerySet)
        - All domains
        - Specific domain
        - All domains in a tag
        - All domains in a tag category
        - All URLs matching a certain regex pattern
    - Choose Crawl scope: (a query that returns a Crawl.objects QuerySet)
        - All crawls
        - Specific crawls
        - crawls by a certain user
        - crawls using a certain persona
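
The first condition ("site goes down") could be evaluated as a simple ratio check. A hedged sketch; the function name and threshold default are assumptions, not the real implementation:

```python
def site_is_down(success_count: int, total_count: int, threshold_pct: float = 50.0) -> bool:
    """Hypothetical check for the 'site goes down' condition: fires when the
    Snapshot success ratio drops below threshold_pct."""
    if total_count == 0:
        return False  # no snapshots yet -> nothing to alert on
    return (success_count / total_count) * 100 < threshold_pct

assert site_is_down(2, 10) is True    # 20% success ratio -> alert
assert site_is_down(9, 10) is False   # 90% success ratio -> healthy
```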

    class AlertDestination(models.Model):
        destination_type: [email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.]
        maximum_frequency
        filter_rules
        credentials
        alert_template: Jinja2 JSON/text template that gets populated with alert contents
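
The pseudocode above can be fleshed out as a plain-Python sketch (a real Django model would use `CharField`/`DurationField`/`JSONField`; the field names mirror the pseudocode but all types and defaults are assumptions):

```python
from dataclasses import dataclass, field
from datetime import timedelta

DESTINATION_TYPES = ('email', 'slack', 'webhook', 'google_sheet', 'logfile', 'bucket')  # b2/s3/gcp etc.

@dataclass
class AlertDestinationSketch:
    """Plain-Python stand-in for the AlertDestination model sketched above."""
    destination_type: str
    maximum_frequency: timedelta = timedelta(hours=1)   # rate limit: at most one alert per interval
    filter_rules: dict = field(default_factory=dict)
    credentials: dict = field(default_factory=dict)
    alert_template: str = '{{ alert.condition }}: {{ alert.url }}'  # Jinja2 template

dest = AlertDestinationSketch(destination_type='slack', credentials={'webhook_url': '...'})
assert dest.destination_type in DESTINATION_TYPES
```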
--------------------------------------------------------------------------------

old/TODO_archivebox_jsonl_cli.md (new file, +716 lines):

# ArchiveBox CLI Pipeline Architecture

## Overview

This plan implements a JSONL-based CLI pipeline for ArchiveBox, enabling Unix-style piping between commands:

```bash
archivebox crawl create URL | archivebox snapshot create | archivebox archiveresult create | archivebox run
```

## Design Principles

1. **Maximize model method reuse**: Use `.to_json()`, `.from_json()`, `.to_jsonl()`, `.from_jsonl()` everywhere
2. **Pass-through behavior**: All commands output input records + newly created records (accumulating pipeline)
3. **Create-or-update**: Commands create records if they don't exist, update if an ID matches an existing record
4. **Auto-cascade**: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots
5. **Generic filtering**: Implement filters as functions that take a queryset → return a queryset
6. **Minimal code**: Extract the duplicated `apply_filters()` to a shared module

---

## Real-World Use Cases

These examples demonstrate the JSONL piping architecture. Key points:

- `archivebox run` auto-cascades (Crawl → Snapshots → ArchiveResults)
- `archivebox run` **emits JSONL** of everything it creates, enabling chained processing
- Use CLI args (`--status=`, `--plugin=`) for efficient DB filtering; use jq for transforms

### 1. Basic Archive

```bash
# Simple URL archive (run auto-creates snapshots and archive results)
archivebox crawl create https://example.com | archivebox run

# Multiple URLs from a file
archivebox crawl create < urls.txt | archivebox run

# With depth crawling (follow links)
archivebox crawl create --depth=2 https://docs.python.org | archivebox run
```

### 2. Retry Failed Extractions

```bash
# Retry all failed extractions
archivebox archiveresult list --status=failed | archivebox run

# Retry only failed PDFs from a specific domain
archivebox archiveresult list --status=failed --plugin=pdf --url__icontains=nytimes.com \
  | archivebox run
```

### 3. Import Bookmarks from Pinboard (jq transform)

```bash
# Fetch Pinboard API, transform fields to match the ArchiveBox schema, archive
curl -s "https://api.pinboard.in/v1/posts/all?format=json&auth_token=$TOKEN" \
  | jq -c '.[] | {url: .href, tags_str: .tags, title: .description}' \
  | archivebox crawl create \
  | archivebox run
```

### 4. Retry Failed with a Different Binary (jq transform + re-run)

```bash
# Get failed wget results, transform to use the wget2 binary instead, re-queue as new attempts
archivebox archiveresult list --status=failed --plugin=wget \
  | jq -c '{snapshot_id, plugin, status: "queued", overrides: {WGET_BINARY: "wget2"}}' \
  | archivebox archiveresult create \
  | archivebox run

# Chain processing: archive, then re-run any failures with an increased timeout
archivebox crawl create https://slow-site.com \
  | archivebox run \
  | jq -c 'select(.type == "ArchiveResult" and .status == "failed")
           | del(.id) | .status = "queued" | .overrides.TIMEOUT = "120"' \
  | archivebox archiveresult create \
  | archivebox run
```

### 5. Selective Extraction

```bash
# Create only screenshot extractions for queued snapshots
archivebox snapshot list --status=queued \
  | archivebox archiveresult create --plugin=screenshot \
  | archivebox run

# Re-run singlefile on everything that was skipped
archivebox archiveresult list --plugin=singlefile --status=skipped \
  | archivebox archiveresult update --status=queued \
  | archivebox run
```

### 6. Bulk Tag Management

```bash
# Tag all Twitter/X URLs (efficient DB filter, no jq needed)
archivebox snapshot list --url__icontains=twitter.com \
  | archivebox snapshot update --tag=twitter

# Tag snapshots based on computed criteria (jq for logic the DB can't do)
archivebox snapshot list --status=sealed \
  | jq -c 'select(.archiveresult_count > 5) | . + {tags_str: (.tags_str + ",well-archived")}' \
  | archivebox snapshot update
```

### 7. RSS Feed Monitoring

```bash
# Archive all items from an RSS feed
curl -s "https://hnrss.org/frontpage" \
  | xq -r '.rss.channel.item[].link' \
  | archivebox crawl create --tag=hackernews-$(date +%Y%m%d) \
  | archivebox run
```

### 8. Recursive Link Following (run output → filter → re-run)

```bash
# Archive a page, then archive all PDFs it links to
archivebox crawl create https://research-papers.org/index.html \
  | archivebox run \
  | jq -c 'select(.type == "Snapshot") | .discovered_urls[]?
           | select(endswith(".pdf")) | {url: .}' \
  | archivebox crawl create --tag=linked-pdfs \
  | archivebox run

# Depth crawl with custom handling: retry timeouts with a longer timeout
archivebox crawl create --depth=1 https://example.com \
  | archivebox run \
  | jq -c 'select(.type == "ArchiveResult" and .status == "failed" and (.error | contains("timeout")))
           | del(.id) | .overrides.TIMEOUT = "300"' \
  | archivebox archiveresult create \
  | archivebox run
```

### Composability Summary

| Pattern | Example |
|---------|---------|
| **Filter → Process** | `list --status=failed --plugin=pdf \| run` |
| **Transform → Archive** | `curl API \| jq '{url, tags_str}' \| crawl create \| run` |
| **Retry w/ Changes** | `run \| jq 'select(.status=="failed") \| del(.id)' \| create \| run` |
| **Selective Extract** | `snapshot list \| archiveresult create --plugin=screenshot` |
| **Bulk Update** | `list --url__icontains=X \| update --tag=Y` |
| **Chain Processing** | `crawl \| run \| jq transform \| create \| run` |

The key insight: **`archivebox run` emits JSONL of everything it creates**, enabling:
- Retrying failed items with different settings (timeouts, binaries, etc.)
- Recursive crawling (archive page → extract links → archive those)
- Chained transforms (filter failures, modify config, re-queue)

---

## Code Reuse Findings

### Existing Model Methods (USE THESE)
- `Crawl.to_json()`, `Crawl.from_json()`, `Crawl.to_jsonl()`, `Crawl.from_jsonl()`
- `Snapshot.to_json()`, `Snapshot.from_json()`, `Snapshot.to_jsonl()`, `Snapshot.from_jsonl()`
- `Tag.to_json()`, `Tag.from_json()`, `Tag.to_jsonl()`, `Tag.from_jsonl()`

### Missing Model Methods (MUST IMPLEMENT)
- **`ArchiveResult.from_json()`** - does not exist, must be added
- **`ArchiveResult.from_jsonl()`** - does not exist, must be added

### Existing Utilities (USE THESE)
- `archivebox/misc/jsonl.py`: `read_stdin()`, `read_args_or_stdin()`, `write_record()`, `parse_line()`
- Type constants: `TYPE_CRAWL`, `TYPE_SNAPSHOT`, `TYPE_ARCHIVERESULT`, etc.

### Duplicated Code (EXTRACT)
- `apply_filters()` is duplicated in 7 CLI files → extract to `archivebox/cli/cli_utils.py`

### Supervisord Config (UPDATE)
- `archivebox/workers/supervisord_util.py` line ~35: `"command": "archivebox manage orchestrator"` → `"command": "archivebox run"`

### Field Name Standardization (FIX)
- **Issue**: `Crawl.to_json()` outputs `tags_str`, but `Snapshot.to_json()` outputs `tags`
- **Fix**: Standardize all models to use `tags_str` in JSONL output (matches the model property names)

---

## Implementation Order

### Phase 1: Model Prerequisites
1. **Implement `ArchiveResult.from_json()`** in `archivebox/core/models.py`
    - Pattern: match the `Snapshot.from_json()` and `Crawl.from_json()` style
    - Handle: ID lookup (update existing) or create new
    - Required fields: `snapshot_id`, `plugin`
    - Optional fields: `status`, `hook_name`, etc.

2. **Implement `ArchiveResult.from_jsonl()`** in `archivebox/core/models.py`
    - Filter records by `type='ArchiveResult'`
    - Call `from_json()` for each matching record

3. **Fix the `Snapshot.to_json()` field name**
    - Change `'tags': self.tags_str()` → `'tags_str': self.tags_str()`
    - Update any code that depends on the `tags` key in Snapshot JSONL

### Phase 2: Shared Utilities
4. **Extract `apply_filters()` to `archivebox/cli/cli_utils.py`**
    - Generic queryset filtering from CLI kwargs
    - Support `--id__in=[csv]`, `--url__icontains=str`, etc.
    - Remove the duplicates from 7 CLI files

### Phase 3: Pass-Through Behavior (NEW FEATURE)
5. **Add pass-through to `archivebox crawl create`**
    - Output non-Crawl input records unchanged
    - Output created Crawl records

6. **Add pass-through to `archivebox snapshot create`**
    - Output non-Snapshot/non-Crawl input records unchanged
    - Process Crawl records → create Snapshots
    - Output both the original Crawls and the created Snapshots

7. **Add pass-through to `archivebox archiveresult create`**
    - Output non-Snapshot/non-ArchiveResult input records unchanged
    - Process Snapshot records → create ArchiveResults
    - Output both the original Snapshots and the created ArchiveResults

8. **Add create-or-update to `archivebox run`**
    - Records WITH an id: look up and queue the existing record
    - Records WITHOUT an id: create via `Model.from_json()`, then queue
    - Pass-through output of all processed records

### Phase 4: Test Infrastructure
9. **Create `archivebox/tests/conftest.py`** with pytest-django
    - Use `pytest-django` for proper test database handling
    - Isolated DATA_DIR per test via the `tmp_path` fixture
    - `run_archivebox_cmd()` helper for subprocess testing

### Phase 5: Unit Tests
10. **Create `archivebox/tests/test_cli_crawl.py`** - crawl create/list/pass-through tests
11. **Create `archivebox/tests/test_cli_snapshot.py`** - snapshot create/list/pass-through tests
12. **Create `archivebox/tests/test_cli_archiveresult.py`** - archiveresult create/list/pass-through tests
13. **Create `archivebox/tests/test_cli_run.py`** - run command create-or-update tests

### Phase 6: Integration & Config
14. **Extend `archivebox/cli/tests_piping.py`** - add pass-through integration tests
15. **Update the supervisord config** - `orchestrator` → `run`

---

## Future Work (Deferred)

### Commands to Defer
- `archivebox tag create|list|update|delete` - already works, defer improvements
- `archivebox binary create|list|update|delete` - lower priority
- `archivebox process list` - lower priority
- `archivebox apikey create|list|update|delete` - lower priority

### `archivebox add` Relationship
- **Current**: `archivebox add` is the primary user-facing command and stays as-is
- **Future**: Refactor `add` to internally use the `crawl create | snapshot create | run` pipeline
- **Note**: This refactor is deferred; `add` continues to work independently for now

---

## Key Files

| File | Action | Phase |
|------|--------|-------|
| `archivebox/core/models.py` | Add `ArchiveResult.from_json()`, `from_jsonl()` | 1 |
| `archivebox/core/models.py` | Fix `Snapshot.to_json()` → `tags_str` | 1 |
| `archivebox/cli/cli_utils.py` | NEW - shared `apply_filters()` | 2 |
| `archivebox/cli/archivebox_crawl.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_snapshot.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_archiveresult.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_run.py` | Add create-or-update, pass-through | 3 |
| `archivebox/tests/conftest.py` | NEW - pytest fixtures | 4 |
| `archivebox/tests/test_cli_crawl.py` | NEW - crawl unit tests | 5 |
| `archivebox/tests/test_cli_snapshot.py` | NEW - snapshot unit tests | 5 |
| `archivebox/tests/test_cli_archiveresult.py` | NEW - archiveresult unit tests | 5 |
| `archivebox/tests/test_cli_run.py` | NEW - run unit tests | 5 |
| `archivebox/cli/tests_piping.py` | Extend with pass-through tests | 6 |
| `archivebox/workers/supervisord_util.py` | Update `orchestrator` → `run` | 6 |

---

## Implementation Details

### ArchiveResult.from_json() Design

```python
@staticmethod
def from_json(record: Dict[str, Any], overrides: Optional[Dict[str, Any]] = None) -> 'ArchiveResult | None':
    """
    Create or update a single ArchiveResult from a JSON record dict.

    Args:
        record: Dict with 'snapshot_id' and 'plugin' (required for create),
                or 'id' (for update)
        overrides: Dict of field overrides

    Returns:
        ArchiveResult instance, or None if the record is invalid
    """
    from django.utils import timezone

    overrides = overrides or {}

    # If 'id' is provided, look up and update the existing record
    result_id = record.get('id')
    if result_id:
        try:
            result = ArchiveResult.objects.get(id=result_id)
            # Update fields from the record
            if record.get('status'):
                result.status = record['status']
                result.retry_at = timezone.now()
            result.save()
            return result
        except ArchiveResult.DoesNotExist:
            pass  # Fall through to create

    # Required fields for creation
    snapshot_id = record.get('snapshot_id')
    plugin = record.get('plugin')

    if not snapshot_id or not plugin:
        return None

    try:
        snapshot = Snapshot.objects.get(id=snapshot_id)
    except Snapshot.DoesNotExist:
        return None

    # Create, or get the existing result for this (snapshot, plugin) pair
    result, created = ArchiveResult.objects.get_or_create(
        snapshot=snapshot,
        plugin=plugin,
        defaults={
            'status': record.get('status', ArchiveResult.StatusChoices.QUEUED),
            'retry_at': timezone.now(),
            'hook_name': record.get('hook_name', ''),
            **overrides,
        }
    )

    # If not created, optionally reset it for retry
    if not created and record.get('status'):
        result.status = record['status']
        result.retry_at = timezone.now()
        result.save()

    return result
```

### Pass-Through Pattern

All `create` commands follow this pattern:

```python
def create_X(args, ...):
    is_tty = sys.stdout.isatty()
    records = list(read_args_or_stdin(args))

    for record in records:
        record_type = record.get('type')

        # Pass-through: output records we don't handle
        if record_type not in HANDLED_TYPES:
            if not is_tty:
                write_record(record)
            continue

        # Handle our type: create via Model.from_json()
        obj = Model.from_json(record, overrides={...})

        # Output the created record (hydrated with its db id)
        if obj and not is_tty:
            write_record(obj.to_json())
```

### Pass-Through Semantics Example

```
Input:
  {"type": "Crawl", "id": "abc", "urls": "https://example.com", ...}
  {"type": "Tag", "name": "important"}

archivebox snapshot create output:
  {"type": "Crawl", "id": "abc", ...}      # pass-through (not our type)
  {"type": "Tag", "name": "important"}     # pass-through (not our type)
  {"type": "Snapshot", "id": "xyz", ...}   # created from the Crawl's URLs
```

### Create-or-Update Pattern for `archivebox run`

```python
def process_stdin_records() -> int:
    records = list(read_stdin())
    is_tty = sys.stdout.isatty()

    for record in records:
        record_type = record.get('type')
        record_id = record.get('id')

        # Create-or-update based on whether an ID exists
        if record_type == TYPE_CRAWL:
            if record_id:
                try:
                    obj = Crawl.objects.get(id=record_id)
                except Crawl.DoesNotExist:
                    obj = Crawl.from_json(record)
            else:
                obj = Crawl.from_json(record)

            if obj:
                obj.retry_at = timezone.now()
                obj.save()
                if not is_tty:
                    write_record(obj.to_json())

        # Similar for Snapshot, ArchiveResult...
```

### Shared apply_filters() Design

Extract to `archivebox/cli/cli_utils.py`:

```python
"""Shared CLI utilities for ArchiveBox commands."""

from typing import Optional


def apply_filters(queryset, filter_kwargs: dict, limit: Optional[int] = None):
    """
    Apply Django-style filters from CLI kwargs to a QuerySet.

    Supports: --status=queued, --url__icontains=example, --id__in=uuid1,uuid2

    Args:
        queryset: Django QuerySet to filter
        filter_kwargs: Dict of filter key-value pairs from the CLI
        limit: Optional limit on results

    Returns:
        Filtered QuerySet
    """
    filters = {}
    for key, value in filter_kwargs.items():
        if value is None or key in ('limit', 'offset'):
            continue
        # Handle CSV lists for __in filters
        if key.endswith('__in') and isinstance(value, str):
            value = [v.strip() for v in value.split(',')]
        filters[key] = value

    if filters:
        queryset = queryset.filter(**filters)
    if limit:
        queryset = queryset[:limit]

    return queryset
```

---
|
||||
|
||||
## conftest.py Design (pytest-django)
|
||||
|
||||
```python
|
||||
"""archivebox/tests/conftest.py - Pytest fixtures for CLI tests."""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Any, Optional, Tuple
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Fixtures
|
||||
# =============================================================================
|
||||
|
||||
@pytest.fixture
|
||||
def isolated_data_dir(tmp_path, settings):
|
||||
"""
|
||||
Create isolated DATA_DIR for each test.
|
||||
|
||||
Uses tmp_path for isolation, configures Django settings.
|
||||
"""
|
||||
data_dir = tmp_path / 'archivebox_data'
|
||||
data_dir.mkdir()
|
||||
|
||||
# Set environment for subprocess calls
|
||||
os.environ['DATA_DIR'] = str(data_dir)
|
||||
|
||||
# Update Django settings
|
||||
settings.DATA_DIR = data_dir
|
||||
|
||||
yield data_dir
|
||||
|
||||
# Cleanup handled by tmp_path fixture
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def initialized_archive(isolated_data_dir):
|
||||
"""
|
||||
Initialize ArchiveBox archive in isolated directory.
|
||||
|
||||
Runs `archivebox init` to set up database and directories.
|
||||
"""
|
||||
from archivebox.cli.archivebox_init import init
|
||||
init(setup=True, quick=True)
|
||||
return isolated_data_dir
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def cli_env(initialized_archive):
|
||||
"""
|
||||
Environment dict for CLI subprocess calls.
|
||||
|
||||
Includes DATA_DIR and disables slow extractors.
|
||||
"""
|
||||
return {
|
||||
**os.environ,
|
||||
'DATA_DIR': str(initialized_archive),
|
||||
'USE_COLOR': 'False',
|
||||
'SHOW_PROGRESS': 'False',
|
||||
'SAVE_TITLE': 'True',
|
||||
'SAVE_FAVICON': 'False',
|
||||
'SAVE_WGET': 'False',
|
||||
'SAVE_WARC': 'False',
|
||||
'SAVE_PDF': 'False',
|
||||
'SAVE_SCREENSHOT': 'False',
|
||||
'SAVE_DOM': 'False',
|
||||
'SAVE_SINGLEFILE': 'False',
|
||||
'SAVE_READABILITY': 'False',
|
||||
'SAVE_MERCURY': 'False',
|
||||
'SAVE_GIT': 'False',
|
||||
'SAVE_YTDLP': 'False',
|
||||
'SAVE_HEADERS': 'False',
|
||||
}
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# CLI Helpers
|
||||
# =============================================================================
|
||||
|
||||
def run_archivebox_cmd(
|
||||
args: List[str],
|
||||
stdin: Optional[str] = None,
|
||||
cwd: Optional[Path] = None,
|
||||
env: Optional[Dict[str, str]] = None,
|
||||
timeout: int = 60,
|
||||
) -> Tuple[str, str, int]:
|
||||
"""
|
||||
Run archivebox command, return (stdout, stderr, returncode).
|
||||
|
||||
Args:
|
||||
args: Command arguments (e.g., ['crawl', 'create', 'https://example.com'])
|
||||
stdin: Optional string to pipe to stdin
|
||||
cwd: Working directory (defaults to DATA_DIR from env)
|
||||
env: Environment variables (defaults to os.environ with DATA_DIR)
|
||||
timeout: Command timeout in seconds
|
||||
|
||||
Returns:
|
||||
Tuple of (stdout, stderr, returncode)
|
||||
"""
|
||||
cmd = [sys.executable, '-m', 'archivebox'] + args
|
||||
|
||||
env = env or {**os.environ}
|
||||
cwd = cwd or Path(env.get('DATA_DIR', '.'))
|
||||
|
||||
result = subprocess.run(
|
||||
cmd,
|
||||
input=stdin,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
cwd=cwd,
|
||||
env=env,
|
||||
timeout=timeout,
|
||||
)
|
||||
|
||||
return result.stdout, result.stderr, result.returncode
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Output Assertions
|
||||
# =============================================================================
|
||||
|
||||
def parse_jsonl_output(stdout: str) -> List[Dict[str, Any]]:
|
||||
"""Parse JSONL output into list of dicts."""
|
||||
records = []
|
||||
for line in stdout.strip().split('\n'):
|
||||
line = line.strip()
|
||||
if line and line.startswith('{'):
|
||||
try:
|
||||
records.append(json.loads(line))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
return records
|
||||
|
||||
|
||||
def assert_jsonl_contains_type(stdout: str, record_type: str, min_count: int = 1):
|
||||
"""Assert output contains at least min_count records of type."""
|
||||
records = parse_jsonl_output(stdout)
|
||||
matching = [r for r in records if r.get('type') == record_type]
|
||||
assert len(matching) >= min_count, \
|
||||
f"Expected >= {min_count} {record_type}, got {len(matching)}"
|
||||
return matching
|
||||
|
||||
|
||||
def assert_jsonl_pass_through(stdout: str, input_records: List[Dict[str, Any]]):
|
||||
"""Assert that input records appear in output (pass-through behavior)."""
|
||||
output_records = parse_jsonl_output(stdout)
|
||||
output_ids = {r.get('id') for r in output_records if r.get('id')}
|
||||
|
||||
for input_rec in input_records:
|
||||
input_id = input_rec.get('id')
|
||||
if input_id:
|
||||
assert input_id in output_ids, \
|
||||
f"Input record {input_id} not found in output (pass-through failed)"
|
||||
|
||||
|
||||
def assert_record_has_fields(record: Dict[str, Any], required_fields: List[str]):
|
||||
"""Assert record has all required fields with non-None values."""
|
||||
for field in required_fields:
|
||||
assert field in record, f"Record missing field: {field}"
|
||||
assert record[field] is not None, f"Record field is None: {field}"
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Database Assertions
|
||||
# =============================================================================
|
||||
|
||||
def assert_db_count(model_class, filters: Dict[str, Any], expected: int):
|
||||
"""Assert database count matches expected."""
|
||||
actual = model_class.objects.filter(**filters).count()
|
||||
assert actual == expected, \
|
||||
f"Expected {expected} {model_class.__name__}, got {actual}"
|
||||
|
||||
|
||||
def assert_db_exists(model_class, **filters):
|
||||
"""Assert at least one record exists matching filters."""
|
||||
assert model_class.objects.filter(**filters).exists(), \
|
||||
f"No {model_class.__name__} found matching {filters}"
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test Data Factories
|
||||
# =============================================================================
|
||||
|
||||
def create_test_url(domain: str = 'example.com', path: str = None) -> str:
|
||||
"""Generate unique test URL."""
|
||||
import uuid
|
||||
path = path or uuid.uuid4().hex[:8]
|
||||
return f'https://{domain}/{path}'
|
||||
|
||||
|
||||
def create_test_crawl_json(urls: List[str] | None = None, **kwargs) -> Dict[str, Any]:
    """Create Crawl JSONL record for testing."""
    from archivebox.misc.jsonl import TYPE_CRAWL

    urls = urls or [create_test_url()]
    return {
        'type': TYPE_CRAWL,
        'urls': '\n'.join(urls),
        'max_depth': kwargs.get('max_depth', 0),
        'tags_str': kwargs.get('tags_str', ''),
        'status': kwargs.get('status', 'queued'),
        **{k: v for k, v in kwargs.items() if k not in ('max_depth', 'tags_str', 'status')},
    }

def create_test_snapshot_json(url: str | None = None, **kwargs) -> Dict[str, Any]:
    """Create Snapshot JSONL record for testing."""
    from archivebox.misc.jsonl import TYPE_SNAPSHOT

    return {
        'type': TYPE_SNAPSHOT,
        'url': url or create_test_url(),
        'tags_str': kwargs.get('tags_str', ''),
        'status': kwargs.get('status', 'queued'),
        **{k: v for k, v in kwargs.items() if k not in ('tags_str', 'status')},
    }
```

---

## Test Rules

- **NO SKIPPING** - Every test runs
- **NO MOCKING** - Real subprocess calls, real database
- **NO DISABLING** - Failing tests identify real problems
- **MINIMAL CODE** - Import helpers from conftest.py
- **ISOLATED** - Each test gets its own DATA_DIR via `tmp_path`

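Tests following these rules shell out to the real CLI and assert on its JSONL stdout. A minimal sketch of the stdout-parsing step, assuming a hypothetical `run_cli()` subprocess wrapper in conftest.py (a thin `subprocess.run()` against an isolated `tmp_path` DATA_DIR, not shown here):

```python
# Hypothetical test helper: parse the JSONL a real CLI invocation printed,
# so tests can make plain assertions on the records (no mocking involved).
import json

def parse_jsonl_stdout(stdout: str) -> list[dict]:
    """Parse the JSONL lines a CLI command printed to stdout."""
    return [json.loads(line) for line in stdout.splitlines() if line.strip()]
```

Usage would look like `records = parse_jsonl_stdout(run_cli(tmp_path, 'snapshot', 'list').stdout)`.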
---

## Task Checklist

### Phase 1: Model Prerequisites
- [x] Implement `ArchiveResult.from_json()` in `archivebox/core/models.py`
- [x] Implement `ArchiveResult.from_jsonl()` in `archivebox/core/models.py`
- [x] Fix `Snapshot.to_json()` to use `tags_str` instead of `tags`

### Phase 2: Shared Utilities
- [x] Create `archivebox/cli/cli_utils.py` with shared `apply_filters()`
- [x] Update 7 CLI files to import from `cli_utils.py`

### Phase 3: Pass-Through Behavior
- [x] Add pass-through to `archivebox_crawl.py` create
- [x] Add pass-through to `archivebox_snapshot.py` create
- [x] Add pass-through to `archivebox_archiveresult.py` create
- [x] Add create-or-update to `archivebox_run.py`
- [x] Add pass-through output to `archivebox_run.py`

### Phase 4: Test Infrastructure
- [x] Create `archivebox/tests/conftest.py` with pytest-django fixtures

### Phase 5: Unit Tests
- [x] Create `archivebox/tests/test_cli_crawl.py`
- [x] Create `archivebox/tests/test_cli_snapshot.py`
- [x] Create `archivebox/tests/test_cli_archiveresult.py`
- [x] Create `archivebox/tests/test_cli_run.py`

### Phase 6: Integration & Config
- [x] Extend `archivebox/cli/tests_piping.py` with pass-through tests
- [x] Update `archivebox/workers/supervisord_util.py`: orchestrator→run
131
old/TODO_cli_refactor.md
Normal file
@@ -0,0 +1,131 @@
# ArchiveBox CLI Refactor TODO

## Design Decisions

1. **Keep `archivebox add`** as high-level convenience command
2. **Unified `archivebox run`** for processing (replaces per-model `run` and `orchestrator`)
3. **Expose all models** including binary, process, machine
4. **Clean break** from old command structure (no backward compatibility aliases)

## Final Architecture

```
archivebox <model> <action> [args...] [--filters]
archivebox run [stdin JSONL]
```

### Actions (4 per model):
- `create` - Create records (from args, stdin, or JSONL), dedupes by indexed fields
- `list` - Query records (with filters, returns JSONL)
- `update` - Modify records (from stdin JSONL, PATCH semantics)
- `delete` - Remove records (from stdin JSONL, requires --yes)

### Unified Run Command:
- `archivebox run` - Process queued work
- With stdin JSONL: Process piped records, exit when complete
- Without stdin (TTY): Run orchestrator in foreground until killed

### Models (7 total):
- `crawl` - Crawl jobs
- `snapshot` - Individual archived pages
- `archiveresult` - Plugin extraction results
- `tag` - Tags/labels
- `binary` - Detected binaries (chrome, wget, etc.)
- `process` - Process execution records (read-only)
- `machine` - Machine/host records (read-only)

---

## Implementation Checklist

### Phase 1: Unified Run Command
- [x] Create `archivebox/cli/archivebox_run.py` - unified processing command

### Phase 2: Core Model Commands
- [x] Refactor `archivebox/cli/archivebox_snapshot.py` to Click group with create|list|update|delete
- [x] Refactor `archivebox/cli/archivebox_crawl.py` to Click group with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_archiveresult.py` with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_tag.py` with create|list|update|delete

### Phase 3: System Model Commands
- [x] Create `archivebox/cli/archivebox_binary.py` with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_process.py` with list only (read-only)
- [x] Create `archivebox/cli/archivebox_machine.py` with list only (read-only)

### Phase 4: Registry & Cleanup
- [x] Update `archivebox/cli/__init__.py` command registry
- [x] Delete `archivebox/cli/archivebox_extract.py`
- [x] Delete `archivebox/cli/archivebox_remove.py`
- [x] Delete `archivebox/cli/archivebox_search.py`
- [x] Delete `archivebox/cli/archivebox_orchestrator.py`
- [x] Update `archivebox/cli/archivebox_add.py` internals (no changes needed - uses models directly)
- [x] Update `archivebox/cli/tests_piping.py`

### Phase 5: Tests for New Commands
- [ ] Add tests for `archivebox run` command
- [ ] Add tests for `archivebox crawl create|list|update|delete`
- [ ] Add tests for `archivebox snapshot create|list|update|delete`
- [ ] Add tests for `archivebox archiveresult create|list|update|delete`
- [ ] Add tests for `archivebox tag create|list|update|delete`
- [ ] Add tests for `archivebox binary create|list|update|delete`
- [ ] Add tests for `archivebox process list`
- [ ] Add tests for `archivebox machine list`

---

## Usage Examples

### Basic CRUD
```bash
# Create
archivebox crawl create https://example.com https://foo.com --depth=1
archivebox snapshot create https://example.com --tag=news

# List with filters
archivebox crawl list --status=queued
archivebox snapshot list --url__icontains=example.com
archivebox archiveresult list --status=failed --plugin=screenshot

# Update (reads JSONL from stdin, applies changes)
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new

# Delete (requires --yes)
archivebox crawl list --url__icontains=example.com | archivebox crawl delete --yes
```

### Unified Run Command
```bash
# Run orchestrator in foreground (replaces `archivebox orchestrator`)
archivebox run

# Process specific records (pipe any JSONL type, exits when done)
archivebox snapshot list --status=queued | archivebox run
archivebox archiveresult list --status=failed | archivebox run
archivebox crawl list --status=queued | archivebox run

# Mixed types work too - run handles any JSONL
cat mixed_records.jsonl | archivebox run
```

### Composed Workflows
```bash
# Full pipeline (replaces old `archivebox add`)
archivebox crawl create https://example.com --status=queued \
  | archivebox snapshot create --status=queued \
  | archivebox archiveresult create --status=queued \
  | archivebox run

# Re-run failed extractions
archivebox archiveresult list --status=failed | archivebox run

# Delete all snapshots for a domain
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes
```

### Keep `archivebox add` as convenience
```bash
# This remains the simple user-friendly interface:
archivebox add https://example.com --depth=1 --tag=news

# Internally equivalent to the composed pipeline above
```

532
old/TODO_hook_concurrency.md
Normal file
@@ -0,0 +1,532 @@
# ArchiveBox Hook Script Concurrency & Execution Plan

## Overview

Snapshot.run() should enforce that snapshot hooks are run in **10 discrete, sequential "steps"**: `0*`, `1*`, `2*`, `3*`, `4*`, `5*`, `6*`, `7*`, `8*`, `9*`.

For every discovered hook script, ArchiveBox should create an ArchiveResult in `queued` state, then manage running them using `retry_at` and inline logic to enforce this ordering.

## Design Decisions

### ArchiveResult Schema
- Add `ArchiveResult.hook_name` (CharField, nullable) - just filename, e.g., `'on_Snapshot__20_chrome_tab.bg.js'`
- Keep `ArchiveResult.plugin` - still important (plugin directory name)
- Step number derived on-the-fly from `hook_name` via `extract_step(hook_name)` - not stored

### Snapshot Schema
- Add `Snapshot.current_step` (IntegerField 0-9, default=0)
- Integrate with `SnapshotMachine` state transitions for step advancement

### Hook Discovery & Execution
- `Snapshot.run()` discovers all hooks upfront, creates one AR per hook with `hook_name` set
- All ARs for a given step can be claimed and executed in parallel by workers
- Workers claim ARs where `extract_step(ar.hook_name) <= snapshot.current_step`
- `Snapshot.advance_step_if_ready()` increments `current_step` when:
  - All **foreground** hooks in current step are finished (SUCCEEDED/FAILED/SKIPPED)
  - Background hooks don't block advancement (they continue running)
- Called from `SnapshotMachine` state transitions

### ArchiveResult.run() Behavior
- If `self.hook_name` is set: run that single hook
- If `self.hook_name` is None: discover all hooks for `self.plugin` and run sequentially
- Background hooks detected by `.bg.` in filename (e.g., `on_Snapshot__20_chrome_tab.bg.js`)
- Background hooks return immediately (ArchiveResult stays in STARTED state)
- Foreground hooks wait for completion, update status from JSONL output

### Hook Execution Flow
1. **Within a step**: Workers claim all ARs for current step in parallel
2. **Foreground hooks** (no .bg): ArchiveResult waits for completion, transitions to SUCCEEDED/FAILED/SKIPPED
3. **Background hooks** (.bg): ArchiveResult transitions to STARTED, hook continues running
4. **Step advancement**: `Snapshot.advance_step_if_ready()` checks:
   - Are all foreground ARs in current step finished? (SUCCEEDED/FAILED/SKIPPED)
   - Ignore ARs still in STARTED (background hooks)
   - If yes, increment `current_step`
5. **Snapshot sealing**: When `current_step=9` and all foreground hooks done, kill background hooks via `Snapshot.cleanup()`

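The advancement rule in step 4 can be sketched as a pure function. Each AR is modeled here as a `(step, is_background, status)` tuple for illustration; the real `Snapshot.advance_step_if_ready()` queries the ArchiveResult table instead:

```python
# Hedged sketch of the step-advancement check: advance past the current step
# only once every *foreground* hook in it has finished; background hooks
# never block, and step 9 is the final step.
FINISHED = {'succeeded', 'failed', 'skipped'}

def advance_step_if_ready(current_step: int, archiveresults) -> int:
    """Return the snapshot's next step given its ARs as (step, is_bg, status)."""
    foreground = [
        status for step, is_bg, status in archiveresults
        if step == current_step and not is_bg
    ]
    if all(status in FINISHED for status in foreground) and current_step < 9:
        return current_step + 1
    return current_step
```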
### Unnumbered Hooks
- Extract step via `re.search(r'__(\d{2})_', hook_name)`, default to 9 if no match
- Log warning for unnumbered hooks
- Purely runtime derivation - no stored field

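The two derivation helpers described above can be sketched directly from that regex (the plan puts them in `archivebox/hooks.py`; the exact signatures there may differ):

```python
# Minimal sketch of the runtime step/background derivation for hook filenames.
import re

def extract_step(hook_name: str) -> int:
    """Return the step (first digit of the __XX_ run order), defaulting to 9."""
    match = re.search(r'__(\d{2})_', hook_name)
    return int(match.group(1)[0]) if match else 9

def is_background_hook(hook_name: str) -> bool:
    """Background hooks are marked with .bg. in the filename."""
    return '.bg.' in hook_name
```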
## Hook Numbering Convention

Hook scripts are numbered `00` to `99` to control:
- **First digit (0-9)**: Which step they are part of
- **Second digit (0-9)**: Order within that step

Hook scripts are launched in filename (alphabetical) order, in sets of several per step: all hooks in a step are launched together, and execution only moves on once the step is done.

**Naming Format:**
```
on_{ModelName}__{run_order}_{human_readable_description}[.bg].{ext}
```

**Examples:**
```
on_Snapshot__00_this_would_run_first.sh
on_Snapshot__05_start_ytdlp_download.bg.sh
on_Snapshot__10_chrome_tab_opened.js
on_Snapshot__50_screenshot.js
on_Snapshot__53_media.bg.py
```

## Background (.bg) vs Foreground Scripts

### Foreground Scripts (no .bg suffix)
- Launch in parallel with other hooks in their step
- Step waits for all foreground hooks to complete or timeout
- Get killed with SIGTERM if they exceed their `PLUGINNAME_TIMEOUT`
- Step advances when all foreground hooks finish

### Background Scripts (.bg suffix)
- Launch in parallel with other hooks in their step
- Do NOT block step progression - step can advance while they run
- Continue running across step boundaries until complete or timeout
- Get killed with SIGTERM when Snapshot transitions to SEALED (via `Snapshot.cleanup()`)
- Should exit naturally when work is complete (best case)

**Important:** A .bg script started in step 2 can keep running through steps 3, 4, 5... until the Snapshot seals or the hook exits naturally.

## Execution Step Guidelines

These are **naming conventions and guidelines**, not enforced checkpoints. They provide semantic organization for plugin ordering:

### Step 0: Pre-Setup
```
00-09: Initial setup, validation, feature detection
```

### Step 1: Chrome Launch & Tab Creation
```
10-19: Browser/tab lifecycle setup
- Chrome browser launch
- Tab creation and CDP connection
```

### Step 2: Navigation & Settlement
```
20-29: Page loading and settling
- Navigate to URL
- Wait for page load
- Initial response capture (responses, ssl, consolelog as .bg listeners)
```

### Step 3: Page Adjustment
```
30-39: DOM manipulation before archiving
- Hide popups/banners
- Solve captchas
- Expand comments/details sections
- Inject custom CSS/JS
- Accessibility modifications
```

### Step 4: Ready for Archiving
```
40-49: Final pre-archiving checks
- Verify page is fully adjusted
- Wait for any pending modifications
```

### Step 5: DOM Extraction (Sequential, Non-BG)
```
50-59: Extractors that need exclusive DOM access
- singlefile (MUST NOT be .bg)
- screenshot (MUST NOT be .bg)
- pdf (MUST NOT be .bg)
- dom (MUST NOT be .bg)
- title
- headers
- readability
- mercury

These MUST run sequentially as they temporarily modify the DOM
during extraction, then revert it. Running in parallel would corrupt results.
```

### Step 6: Post-DOM Extraction
```
60-69: Extractors that don't need DOM or run on downloaded files
- wget
- git
- media (.bg - can run for hours)
- gallerydl (.bg)
- forumdl (.bg)
- papersdl (.bg)
```

### Step 7: Chrome Cleanup
```
70-79: Browser/tab teardown
- Close tabs
- Cleanup Chrome resources
```

### Step 8: Post-Processing
```
80-89: Reprocess outputs from earlier extractors
- OCR of images
- Audio/video transcription
- URL parsing from downloaded content (rss, html, json, txt, csv, md)
- LLM analysis/summarization of outputs
```

### Step 9: Indexing & Finalization
```
90-99: Save to indexes and finalize
- Index text content to Sonic/SQLite FTS
- Create symlinks
- Generate merkle trees
- Final status updates
```

## Hook Script Interface

### Input: CLI Arguments (NOT stdin)
Hooks receive configuration as CLI flags (CSV or JSON-encoded):

```bash
--url="https://example.com"
--snapshot-id="1234-5678-uuid"
--config='{"some_key": "some_value"}'
--plugins=git,media,favicon,title
--timeout=50
--enable-something
```

### Input: Environment Variables
All configuration comes from env vars, defined in `plugin_dir/config.json` JSONSchema:

```bash
WGET_BINARY=/usr/bin/wget
WGET_TIMEOUT=60
WGET_USER_AGENT="Mozilla/5.0..."
WGET_EXTRA_ARGS="--no-check-certificate"
SAVE_WGET=True
```

**Required:** Every plugin must support `PLUGINNAME_TIMEOUT` for self-termination.

### Output: Filesystem (CWD)
Hooks read/write files to:
- `$CWD`: Their own output subdirectory (e.g., `archive/snapshots/{id}/wget/`)
- `$CWD/..`: Parent directory (to read outputs from other hooks)

This allows hooks to:
- Access files created by other hooks
- Keep their outputs separate by default
- Use semaphore files for coordination (if needed)

### Output: JSONL to stdout
Hooks emit one JSONL line per database record they want to create or update:

```jsonl
{"type": "Tag", "name": "sci-fi"}
{"type": "ArchiveResult", "id": "1234-uuid", "status": "succeeded", "output_str": "wget/index.html"}
{"type": "Snapshot", "id": "5678-uuid", "title": "Example Page"}
```

See `archivebox/misc/jsonl.py` and model `from_json()` / `from_jsonl()` methods for full list of supported types and fields.

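The ingestion side of this contract can be sketched as a small parser that groups a hook's stdout records by their `"type"` field before dispatching each group to the matching model's `from_json()`. The grouping function below is illustrative, not the actual `archivebox.misc.jsonl` API:

```python
# Hedged sketch of parsing a hook's stdout: one JSON record per line,
# grouped by record type so each group can be handed to its model.
import json

def parse_hook_output(stdout: str) -> dict[str, list[dict]]:
    """Group the JSONL records a hook emitted, keyed by record type."""
    records: dict[str, list[dict]] = {}
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        records.setdefault(record.get('type', 'unknown'), []).append(record)
    return records
```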
### Output: stderr for Human Logs
Hooks should emit human-readable output and debug info to **stderr**. There is no guarantee it will be persisted long-term; use stdout JSONL or the filesystem for outputs that matter.

### Cleanup: Delete Cruft
If a hook emits no meaningful long-term outputs, it should delete its own temporary files to avoid wasting space. However, the ArchiveResult DB row should be kept so we know:
- It doesn't need to be retried
- It isn't missing
- What happened (status, error message)

### Signal Handling: SIGINT/SIGTERM
Hooks are expected to listen for a polite `SIGINT`/`SIGTERM`, finish hastily, and exit cleanly. Beyond that, they may be `SIGKILL`'d at ArchiveBox's discretion.

**If hooks double-fork or spawn long-running processes:** They must write a `.pid` file in their directory so zombies can be swept safely.

## Hook Failure Modes & Retry Logic

Hooks can fail in several ways. ArchiveBox handles each differently:

### 1. Soft Failure (Record & Don't Retry)
**Exit:** `0` (success)
**JSONL:** `{"type": "ArchiveResult", "status": "failed", "output_str": "404 Not Found"}`

This means: "I ran successfully, but the resource wasn't available." Don't retry this.

**Use cases:**
- 404 errors
- Content not available
- Feature not applicable to this URL

### 2. Hard Failure / Temporary Error (Retry Later)
**Exit:** Non-zero (1, 2, etc.)
**JSONL:** None (or incomplete)

This means: "Something went wrong, I couldn't complete." Treat this ArchiveResult as "missing" and set `retry_at` for later.

**Use cases:**
- 500 server errors
- Network timeouts
- Binary not found / crashed
- Transient errors

**Behavior:**
- ArchiveBox sets `retry_at` on the ArchiveResult
- Hook will be retried during next `archivebox update`

### 3. Partial Success (Update & Continue)
**Exit:** Non-zero
**JSONL:** Partial records emitted before crash

**Behavior:**
- Update ArchiveResult with whatever was emitted
- Mark remaining work as "missing" with `retry_at`

### 4. Success (Record & Continue)
**Exit:** `0`
**JSONL:** `{"type": "ArchiveResult", "status": "succeeded", "output_str": "output/file.html"}`

This is the happy path.

### Error Handling Rules

- **DO NOT skip hooks** based on failures
- **Continue to next hook** regardless of foreground or background failures
- **Update ArchiveResults** with whatever information is available
- **Set retry_at** for "missing" or temporarily-failed hooks
- **Let background scripts continue** even if foreground scripts fail

## File Structure

```
archivebox/plugins/{plugin_name}/
├── config.json                  # JSONSchema: env var config options
├── binaries.jsonl               # Runtime dependencies: apt|brew|pip|npm|env
├── on_Snapshot__XX_name.py      # Hook script (foreground)
├── on_Snapshot__XX_name.bg.py   # Hook script (background)
└── tests/
    └── test_name.py
```

## Implementation Checklist

### Phase 1: Schema Migration ✅
- [x] Add `Snapshot.current_step` (IntegerField 0-9, default=0)
- [x] Add `ArchiveResult.hook_name` (CharField, nullable) - just filename
- [x] Create migration: `0034_snapshot_current_step.py`

### Phase 2: Core Logic Updates ✅
- [x] Add `extract_step(hook_name)` utility in `archivebox/hooks.py`
  - Extract first digit from `__XX_` pattern
  - Default to 9 for unnumbered hooks
- [x] Add `is_background_hook(hook_name)` utility in `archivebox/hooks.py`
  - Check for `.bg.` in filename
- [x] Update `Snapshot.create_pending_archiveresults()` in `archivebox/core/models.py`:
  - Discover all hooks (not plugins)
  - Create one AR per hook with `hook_name` set
- [x] Update `ArchiveResult.run()` in `archivebox/core/models.py`:
  - If `hook_name` set: run single hook
  - If `hook_name` None: discover all plugin hooks (existing behavior)
- [x] Add `Snapshot.advance_step_if_ready()` method:
  - Check if all foreground ARs in current step finished
  - Increment `current_step` if ready
  - Ignore background hooks (.bg) in completion check
- [x] Integrate with `SnapshotMachine.is_finished()` in `archivebox/core/statemachines.py`:
  - Call `advance_step_if_ready()` before checking if done

### Phase 3: Worker Coordination ✅
- [x] Update worker AR claiming query in `archivebox/workers/worker.py`:
  - Filter: `extract_step(ar.hook_name) <= snapshot.current_step`
  - Claims ARs in QUEUED state, checks step in Python before processing
  - Orders by hook_name for deterministic execution within step

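The claim filter in Phase 3 can be sketched as a pure function over AR dicts (the real query runs against the ORM in `archivebox/workers/worker.py`), reusing the regex-based step derivation described earlier in this plan:

```python
# Hedged sketch of the worker claim rule: queued ARs whose hook step has
# been reached, ordered by hook_name for deterministic execution.
import re

def extract_step(hook_name: str) -> int:
    """First digit of the __XX_ run order, defaulting to 9."""
    match = re.search(r'__(\d{2})_', hook_name)
    return int(match.group(1)[0]) if match else 9

def claimable(archiveresults: list[dict], current_step: int) -> list[dict]:
    """Return the ARs a worker may claim for a snapshot at current_step."""
    ready = [
        ar for ar in archiveresults
        if ar['status'] == 'queued' and extract_step(ar['hook_name']) <= current_step
    ]
    return sorted(ready, key=lambda ar: ar['hook_name'])
```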
### Phase 4: Hook Renumbering ✅
- [x] Renumber hooks per renumbering map below
- [x] Add `.bg` suffix to long-running hooks (media, gallerydl, forumdl, papersdl)
- [x] Move parse_* hooks to step 7 (70-79)
- [x] Test all hooks still work after renumbering

## Migration Path

### Natural Compatibility
No special migration needed:
1. Existing ARs with `hook_name=None` continue to work (discover all plugin hooks at runtime)
2. New ARs get `hook_name` set (single hook per AR)
3. `ArchiveResult.run()` handles both cases naturally
4. Unnumbered hooks default to step 9 (log warning)

### Renumbering Map

**Completed Renames:**
```
# Step 5: DOM Extraction (sequential, non-background)
singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py ✅
screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js ✅
pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js ✅
dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js ✅
title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js ✅
readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py ✅
headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js ✅
mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__56_mercury.py ✅
htmltotext/on_Snapshot__54_htmltotext.py → htmltotext/on_Snapshot__57_htmltotext.py ✅

# Step 6: Post-DOM Extraction (background for long-running)
wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py ✅
git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py ✅
media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py ✅
gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py ✅
forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py ✅
papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py ✅

# Step 7: URL Extraction (parse_* hooks moved from step 6)
parse_html_urls/on_Snapshot__60_parse_html_urls.py → parse_html_urls/on_Snapshot__70_parse_html_urls.py ✅
parse_txt_urls/on_Snapshot__62_parse_txt_urls.py → parse_txt_urls/on_Snapshot__71_parse_txt_urls.py ✅
parse_rss_urls/on_Snapshot__61_parse_rss_urls.py → parse_rss_urls/on_Snapshot__72_parse_rss_urls.py ✅
parse_netscape_urls/on_Snapshot__63_parse_netscape_urls.py → parse_netscape_urls/on_Snapshot__73_parse_netscape_urls.py ✅
parse_jsonl_urls/on_Snapshot__64_parse_jsonl_urls.py → parse_jsonl_urls/on_Snapshot__74_parse_jsonl_urls.py ✅
parse_dom_outlinks/on_Snapshot__40_parse_dom_outlinks.js → parse_dom_outlinks/on_Snapshot__75_parse_dom_outlinks.js ✅
```

## Testing Strategy

### Unit Tests
- Test hook ordering (00-99)
- Test step grouping (first digit)
- Test .bg vs foreground execution
- Test timeout enforcement
- Test JSONL parsing
- Test failure modes & retry_at logic

### Integration Tests
- Test full Snapshot.run() with mixed hooks
- Test .bg scripts running beyond step 99
- Test zombie process cleanup
- Test graceful SIGTERM handling
- Test concurrent .bg script coordination

### Performance Tests
- Measure overhead of per-hook ArchiveResults
- Test with 50+ concurrent .bg scripts
- Test filesystem contention with many hooks

## Open Questions

### Q: Should we provide semaphore utilities?
**A:** No. Keep plugins decoupled. Let them use simple filesystem coordination if needed.

### Q: What happens if the ArchiveResult table gets huge?
**A:** We can delete old successful ArchiveResults periodically, or archive them to cold storage. The important data is in the filesystem outputs.

### Q: Should naturally-exiting .bg scripts still be .bg?
**A:** Yes. The .bg suffix means "don't block step progression," not "run until step 99." Natural exit is the best case.

## Examples

### Foreground Hook (Sequential DOM Access)
```python
#!/usr/bin/env python3
# archivebox/plugins/screenshot/on_Snapshot__51_screenshot.py
# (illustrative Python sketch; the actual screenshot hook is a .js file)

# Runs at step 5, blocks step progression until complete.
# Gets killed if it exceeds SCREENSHOT_TIMEOUT.
import json
import subprocess
import sys

# get_env_int() and cmd are assumed to be provided by the plugin's helpers
timeout = get_env_int('SCREENSHOT_TIMEOUT') or get_env_int('TIMEOUT', 60)

try:
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "screenshot.png",
        }))
        sys.exit(0)
    else:
        # Temporary failure - will be retried
        sys.exit(1)
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```

### Background Hook (Long-Running Download)
```python
#!/usr/bin/env python3
# archivebox/plugins/ytdlp/on_Snapshot__63_ytdlp.bg.py

# Runs at step 6, doesn't block step progression.
# Gets the full YTDLP_TIMEOUT (e.g., 3600s) regardless of when step 99 completes.
import json
import subprocess
import sys

# get_env_int() and url (parsed from the --url flag) are assumed helpers
timeout = get_env_int('YTDLP_TIMEOUT') or get_env_int('TIMEOUT', 3600)

try:
    result = subprocess.run(['yt-dlp', url], capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "media/",
        }))
        sys.exit(0)
    else:
        # Soft failure - record it, don't retry
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "failed",
            "output_str": "Video unavailable",
        }))
        sys.exit(0)  # Exit 0 to record the failure
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```

### Background Hook with Natural Exit
```javascript
#!/usr/bin/env node
// archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js

// Sets up a listener, captures SSL info, then exits naturally.
// No SIGTERM handler needed - it already exits when done.
const fs = require('fs');

// connectToChrome() and waitForNavigation() are assumed plugin helpers
async function main() {
    const page = await connectToChrome();

    // Set up listener
    page.on('response', async (response) => {
        const securityDetails = response.securityDetails();
        if (securityDetails) {
            fs.writeFileSync('ssl.json', JSON.stringify(securityDetails));
        }
    });

    // Wait for navigation (done by another hook)
    await waitForNavigation();

    // Emit result
    console.log(JSON.stringify({
        type: 'ArchiveResult',
        status: 'succeeded',
        output_str: 'ssl.json',
    }));

    process.exit(0);  // Natural exit - no indefinite await
}

main().catch(e => {
    console.error(`ERROR: ${e.message}`);
    process.exit(1);  // Will be retried
});
```

## Summary

This plan provides:
- ✅ Clear execution ordering (10 steps, 00-99 numbering)
- ✅ Async support (.bg suffix)
- ✅ Independent timeout control per plugin
- ✅ Flexible failure handling & retry logic
- ✅ Streaming JSONL output for DB updates
- ✅ Simple filesystem-based coordination
- ✅ Backward compatibility during migration

The main implementation work is refactoring `Snapshot.run()` to enforce step ordering and manage .bg script lifecycles. Plugin renumbering is straightforward mechanical work.
1947
old/TODO_process_tracking.md
Normal file
File diff suppressed because it is too large
6108
old/archivebox.ts
Normal file
File diff suppressed because it is too large