mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-01-03 09:25:42 +10:00)

Add real-world use cases to CLI pipeline plan
Added 10 practical examples demonstrating the JSONL piping architecture:

1. Basic archive with auto-cascade
2. Retry failed extractions (by status, plugin, domain)
3. Pinboard bookmark import with jq
4. GitHub repo filtering with jq regex
5. Selective extraction (screenshots only)
6. Bulk tag management
7. Deep documentation crawling
8. RSS feed monitoring
9. Archive audit with jq aggregation
10. Incremental backup with diff

Also added auto-cascade principle: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots, so intermediate commands are only needed for customization.

@@ -13,8 +13,162 @@ archivebox crawl create URL | archivebox snapshot create | archivebox archiveres

1. **Maximize model method reuse**: Use `.to_json()`, `.from_json()`, `.to_jsonl()`, `.from_jsonl()` everywhere
2. **Pass-through behavior**: All commands output input records + newly created records (accumulating pipeline)
3. **Create-or-update**: Commands create records if they don't exist, update if ID matches existing
4. **Auto-cascade**: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots (see the sketch below this list)
5. **Generic filtering**: Implement filters as functions that take queryset → return queryset
6. **Minimal code**: Extract duplicated `apply_filters()` to shared module
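
A minimal sketch of principles 2 and 4 in action. The record fields and status values shown here are illustrative placeholders, not a final schema:

```bash
# Pass-through + auto-cascade: `run` re-emits the Crawl it was given plus everything it created
archivebox crawl create https://example.com | archivebox run
# stdout is JSONL, one record per line, roughly:
#   {"type": "Crawl", "id": "...", "urls": "https://example.com", "status": "sealed"}
#   {"type": "Snapshot", "id": "...", "url": "https://example.com", "status": "sealed"}
#   {"type": "ArchiveResult", "id": "...", "plugin": "screenshot", "status": "succeeded"}
```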
---

## Real-World Use Cases

These examples demonstrate the power of the JSONL piping architecture. Note: `archivebox run`
auto-cascades (Crawl → Snapshots → ArchiveResults), so intermediate commands are only needed
when you want to customize behavior at that stage.
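
For example, these two pipelines are meant to end up with the same archive; the explicit form just exposes each stage as a hook point for extra flags (a sketch against the planned CLI, so treat the exact commands as illustrative):

```bash
# Terse: let `run` cascade Crawl → Snapshots → ArchiveResults on its own
archivebox crawl create https://example.com | archivebox run

# Explicit: same outcome, but each stage can be customized
# (e.g. limiting extraction to a single plugin, as in use case 5 below)
archivebox crawl create https://example.com \
  | archivebox snapshot create \
  | archivebox archiveresult create \
  | archivebox run
```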

### 1. Basic Archive

```bash
# Simple URL archive (run auto-creates snapshots and archive results)
archivebox crawl create https://example.com | archivebox run

# Multiple URLs from a file
archivebox crawl create < urls.txt | archivebox run

# With depth crawling (follow links)
archivebox crawl create --depth=2 https://docs.python.org | archivebox run
```

### 2. Retry Failed Extractions

```bash
# Retry all failed extractions
archivebox archiveresult list --status=failed | archivebox run

# Retry only failed PDFs
archivebox archiveresult list --status=failed --plugin=pdf | archivebox run

# Retry failed items from a specific domain (jq filter)
archivebox snapshot list --status=failed \
  | jq -c 'select(.url | contains("nytimes.com"))' \
  | archivebox run
```

### 3. Import Bookmarks from Pinboard (jq)

```bash
# Fetch Pinboard bookmarks and archive them
curl -s "https://api.pinboard.in/v1/posts/all?format=json&auth_token=$TOKEN" \
  | jq -c '.[] | {url: .href, tags_str: .tags, title: .description}' \
  | archivebox crawl create \
  | archivebox run
```

### 4. Filter and Process with jq

```bash
# Archive only GitHub repository root pages (not issues, PRs, etc.)
archivebox snapshot list \
  | jq -c 'select(.url | test("github\\.com/[^/]+/[^/]+/?$"))' \
  | archivebox run

# Find snapshots with a specific tag pattern
archivebox snapshot list \
  | jq -c 'select(.tags_str | contains("research"))' \
  | archivebox run
```

### 5. Selective Extraction (Screenshots Only)

```bash
# Create only screenshot extractions for queued snapshots
archivebox snapshot list --status=queued \
  | archivebox archiveresult create --plugin=screenshot \
  | archivebox run

# Re-run singlefile on everything that was skipped
archivebox archiveresult list --plugin=singlefile --status=skipped \
  | archivebox archiveresult update --status=queued \
  | archivebox run
```

### 6. Bulk Tag Management

```bash
# Tag all Twitter/X URLs
archivebox snapshot list --url__icontains=twitter.com \
  | archivebox snapshot update --tag=twitter

# Tag all URLs from today's crawl
archivebox crawl list --created_at__gte=$(date +%Y-%m-%d) \
  | archivebox snapshot list \
  | archivebox snapshot update --tag=daily-$(date +%Y%m%d)
```

### 7. Deep Documentation Crawl

```bash
# Mirror documentation site (depth=3 follows links 3 levels deep)
archivebox crawl create --depth=3 https://docs.djangoproject.com/en/4.2/ \
  | archivebox run

# Crawl with custom tag
archivebox crawl create --depth=2 --tag=python-docs https://docs.python.org/3/ \
  | archivebox run
```

### 8. RSS Feed Monitoring

```bash
# Archive all items from an RSS feed
curl -s "https://hnrss.org/frontpage" \
  | grep -oP '<link>\K[^<]+' \
  | archivebox crawl create --tag=hackernews \
  | archivebox run

# Or with proper XML parsing
curl -s "https://example.com/feed.xml" \
  | xq -r '.rss.channel.item[].link' \
  | archivebox crawl create \
  | archivebox run
```

### 9. Archive Audit with jq

```bash
# Count snapshots by status
archivebox snapshot list | jq -s 'group_by(.status) | map({status: .[0].status, count: length})'

# Find large archive results (over 50MB)
archivebox archiveresult list \
  | jq 'select(.output_size > 52428800) | {id, plugin, size_mb: (.output_size/1048576)}'

# Export summary of archive
archivebox snapshot list \
  | jq -s '{total: length, by_status: (group_by(.status) | map({(.[0].status): length}) | add)}'
```

### 10. Incremental Backup

```bash
# Archive URLs not already in the archive
# (comm -23 keeps only lines unique to its first input, i.e. URLs with no snapshot yet)
comm -23 \
  <(sort new_urls.txt) \
  <(archivebox snapshot list | jq -r '.url' | sort) \
  | archivebox crawl create \
  | archivebox run

# Re-archive anything older than 30 days
archivebox snapshot list \
  | jq -c "select(.created_at < \"$(date -d '30 days ago' --iso-8601)\")" \
  | archivebox archiveresult create \
  | archivebox run
```

### Composability Summary

| Pattern | Example |
|---------|---------|
| **Filter → Process** | `list --status=failed \| run` |
| **Transform → Archive** | `curl RSS \| jq \| crawl create \| run` |
| **Bulk Tag** | `list --url__icontains=X \| update --tag=Y` |
| **Selective Extract** | `snapshot list \| archiveresult create --plugin=pdf` |
| **Chain Depth** | `crawl create --depth=2 \| run` |
| **Export/Audit** | `list \| jq -s 'group_by(.status)'` |
| **Compose with Unix** | `\| jq \| grep \| sort \| uniq \| parallel` |

The key insight: **every intermediate step produces valid JSONL** that can be saved, filtered,
transformed, or resumed later. This makes archiving workflows debuggable, repeatable, and
composable with the entire Unix ecosystem.
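
For example, because every stage is plain JSONL, a pipeline can be checkpointed to a file and resumed or fanned out later. A rough sketch against the planned CLI (`pending.jsonl` and the `parallel` fan-out are illustrative, not prescribed):

```bash
# Checkpoint: create the Crawl records now, archive them later
archivebox crawl create < urls.txt > pending.jsonl

# Resume later, optionally filtering the saved records first with jq
jq -c 'select(.url | startswith("https://"))' pending.jsonl | archivebox run

# Or fan the saved records out across workers with GNU parallel (one record per job)
cat pending.jsonl | parallel --pipe -N1 -j4 archivebox run
```

Nothing here is special-cased: the checkpoint file is just the same JSONL the next command would otherwise have read from the pipe.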

---