From 0f46d8a22ec90e81262514bb6761b4a15c022c13 Mon Sep 17 00:00:00 2001
From: Claude
Date: Wed, 31 Dec 2025 09:20:25 +0000
Subject: [PATCH 1/2] Add real-world use cases to CLI pipeline plan

Added 10 practical examples demonstrating the JSONL piping architecture:

1. Basic archive with auto-cascade
2. Retry failed extractions (by status, plugin, domain)
3. Pinboard bookmark import with jq
4. GitHub repo filtering with jq regex
5. Selective extraction (screenshots only)
6. Bulk tag management
7. Deep documentation crawling
8. RSS feed monitoring
9. Archive audit with jq aggregation
10. Incremental backup with diff

Also added auto-cascade principle: `archivebox run` automatically creates
Snapshots from Crawls and ArchiveResults from Snapshots, so intermediate
commands are only needed for customization.
---
 TODO_archivebox_jsonl_cli.md | 158 ++++++++++++++++++++++++++++++++++-
 1 file changed, 156 insertions(+), 2 deletions(-)

diff --git a/TODO_archivebox_jsonl_cli.md b/TODO_archivebox_jsonl_cli.md
index ba0c2de7..40c17fe7 100644
--- a/TODO_archivebox_jsonl_cli.md
+++ b/TODO_archivebox_jsonl_cli.md
@@ -13,8 +13,162 @@ archivebox crawl create URL | archivebox snapshot create | archivebox archiveres
 1. **Maximize model method reuse**: Use `.to_json()`, `.from_json()`, `.to_jsonl()`, `.from_jsonl()` everywhere
 2. **Pass-through behavior**: All commands output input records + newly created records (accumulating pipeline)
 3. **Create-or-update**: Commands create records if they don't exist, update if ID matches existing
-4. **Generic filtering**: Implement filters as functions that take queryset → return queryset
-5. **Minimal code**: Extract duplicated `apply_filters()` to shared module
+4. **Auto-cascade**: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots
+5. **Generic filtering**: Implement filters as functions that take queryset → return queryset
+6. **Minimal code**: Extract duplicated `apply_filters()` to shared module
+
+---
+
+## Real-World Use Cases
+
+These examples demonstrate the power of the JSONL piping architecture. Note: `archivebox run`
+auto-cascades (Crawl → Snapshots → ArchiveResults), so intermediate commands are only needed
+when you want to customize behavior at that stage.
+
+### 1. Basic Archive
+```bash
+# Simple URL archive (run auto-creates snapshots and archive results)
+archivebox crawl create https://example.com | archivebox run
+
+# Multiple URLs from a file
+archivebox crawl create < urls.txt | archivebox run
+
+# With depth crawling (follow links)
+archivebox crawl create --depth=2 https://docs.python.org | archivebox run
+```
+
+### 2. Retry Failed Extractions
+```bash
+# Retry all failed extractions
+archivebox archiveresult list --status=failed | archivebox run
+
+# Retry only failed PDFs
+archivebox archiveresult list --status=failed --plugin=pdf | archivebox run
+
+# Retry failed items from a specific domain (jq filter)
+archivebox snapshot list --status=queued \
+  | jq 'select(.url | contains("nytimes.com"))' \
+  | archivebox run
+```
+
+### 3. Import Bookmarks from Pinboard (jq)
+```bash
+# Fetch Pinboard bookmarks and archive them
+curl -s "https://api.pinboard.in/v1/posts/all?format=json&auth_token=$TOKEN" \
+  | jq -c '.[] | {url: .href, tags_str: .tags, title: .description}' \
+  | archivebox crawl create \
+  | archivebox run
+```
+
+### 4. Filter and Process with jq
+```bash
+# Archive only GitHub repository root pages (not issues, PRs, etc.)
+archivebox snapshot list \
+  | jq 'select(.url | test("github\\.com/[^/]+/[^/]+/?$"))' \
+  | archivebox run
+
+# Find snapshots with specific tag pattern
+archivebox snapshot list \
+  | jq 'select(.tags_str | contains("research"))' \
+  | archivebox run
+```
+
+### 5. Selective Extraction (Screenshots Only)
+```bash
+# Create only screenshot extractions for queued snapshots
+archivebox snapshot list --status=queued \
+  | archivebox archiveresult create --plugin=screenshot \
+  | archivebox run
+
+# Re-run singlefile on everything that was skipped
+archivebox archiveresult list --plugin=singlefile --status=skipped \
+  | archivebox archiveresult update --status=queued \
+  | archivebox run
+```
+
+### 6. Bulk Tag Management
+```bash
+# Tag all Twitter/X URLs
+archivebox snapshot list --url__icontains=twitter.com \
+  | archivebox snapshot update --tag=twitter
+
+# Tag all URLs from today's crawl
+archivebox crawl list --created_at__gte=$(date +%Y-%m-%d) \
+  | archivebox snapshot list \
+  | archivebox snapshot update --tag=daily-$(date +%Y%m%d)
+```
+
+### 7. Deep Documentation Crawl
+```bash
+# Mirror documentation site (depth=3 follows links 3 levels deep)
+archivebox crawl create --depth=3 https://docs.djangoproject.com/en/4.2/ \
+  | archivebox run
+
+# Crawl with custom tag
+archivebox crawl create --depth=2 --tag=python-docs https://docs.python.org/3/ \
+  | archivebox run
+```
+
+### 8. RSS Feed Monitoring
+```bash
+# Archive all items from an RSS feed
+curl -s "https://hnrss.org/frontpage" \
+  | grep -oP '<link>\K[^<]+' \
+  | archivebox crawl create --tag=hackernews \
+  | archivebox run
+
+# Or with proper XML parsing
+curl -s "https://example.com/feed.xml" \
+  | xq -r '.rss.channel.item[].link' \
+  | archivebox crawl create \
+  | archivebox run
+```
+
+### 9. Archive Audit with jq
+```bash
+# Count snapshots by status
+archivebox snapshot list | jq -s 'group_by(.status) | map({status: .[0].status, count: length})'
+
+# Find large archive results (over 50MB)
+archivebox archiveresult list \
+  | jq 'select(.output_size > 52428800) | {id, plugin, size_mb: (.output_size/1048576)}'
+
+# Export summary of archive
+archivebox snapshot list \
+  | jq -s '{total: length, by_status: (group_by(.status) | map({(.[0].status): length}) | add)}'
+```
+
+### 10. Incremental Backup
+```bash
+# Archive URLs not already in archive
+comm -23 \
+  <(sort new_urls.txt) \
+  <(archivebox snapshot list | jq -r '.url' | sort) \
+  | archivebox crawl create \
+  | archivebox run
+
+# Re-archive anything older than 30 days
+archivebox snapshot list \
+  | jq "select(.created_at < \"$(date -d '30 days ago' --iso-8601)\")" \
+  | archivebox archiveresult create \
+  | archivebox run
+```
+
+### Composability Summary
+
+| Pattern | Example |
+|---------|---------|
+| **Filter → Process** | `list --status=failed \| run` |
+| **Transform → Archive** | `curl RSS \| jq \| crawl create \| run` |
+| **Bulk Tag** | `list --url__icontains=X \| update --tag=Y` |
+| **Selective Extract** | `snapshot list \| archiveresult create --plugin=pdf` |
+| **Chain Depth** | `crawl create --depth=2 \| run` |
+| **Export/Audit** | `list \| jq -s 'group_by(.status)'` |
+| **Compose with Unix** | `\| jq \| grep \| sort \| uniq \| parallel` |
+
+The key insight: **every intermediate step produces valid JSONL** that can be saved, filtered,
+transformed, or resumed later. This makes archiving workflows debuggable, repeatable, and
+composable with the entire Unix ecosystem.
 
 ---
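Core Principles 5 and 6 in the first patch above (generic filters as queryset-in/queryset-out functions, shared via `apply_filters()`) are described only in prose. The following is a minimal illustrative sketch of that idea, not code from either patch: it assumes Django-style field lookups, and `parse_filter_args` and `FakeQuerySet` are hypothetical names used only to make the example self-contained and runnable.

```python
# Illustrative sketch only -- not part of the patches above.
# apply_filters() and Django-style lookups come from the plan; the rest is hypothetical glue.

def parse_filter_args(argv):
    """Turn CLI args like --status=failed --url__icontains=nytimes.com into a
    dict of Django field lookups: {'status': 'failed', 'url__icontains': 'nytimes.com'}."""
    filters = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            filters[key] = value
    return filters


def apply_filters(queryset, filters):
    """Shared helper (Core Principle 6): queryset in, queryset out."""
    return queryset.filter(**filters) if filters else queryset


class FakeQuerySet:
    """Stand-in for a Django QuerySet so the sketch runs without a database."""

    def filter(self, **kwargs):
        print(f"would run .filter({kwargs!r}) against the real model")
        return self


if __name__ == "__main__":
    args = ["--status=failed", "--plugin=pdf", "--url__icontains=nytimes.com"]
    apply_filters(FakeQuerySet(), parse_filter_args(args))
```

Running the sketch just prints the single `.filter()` call that `--status=failed --plugin=pdf --url__icontains=nytimes.com` would translate into, which is why the same helper can back `crawl list`, `snapshot list`, and `archiveresult list`.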
From 1c85b4daa35f55c9dd2de8bf27ab3e29c7629045 Mon Sep 17 00:00:00 2001
From: Claude
Date: Wed, 31 Dec 2025 09:26:23 +0000
Subject: [PATCH 2/2] Refine use cases: 8 examples with efficient patterns
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Trimmed from 10 to 8 focused examples
- Emphasize CLI args for DB filtering (efficient), jq for transforms
- Added key examples showing `run` emits JSONL enabling chained processing:
  - #4: Retry failed with different binary/timeout via jq transform
  - #8: Recursive link following (run → jq filter → crawl → run)
- Removed redundant jq domain filtering (use --url__icontains instead)
- Updated summary table with "Retry w/ Changes" and "Chain Processing" patterns
---
 TODO_archivebox_jsonl_cli.md | 125 ++++++++++++++---------------------
 1 file changed, 49 insertions(+), 76 deletions(-)

diff --git a/TODO_archivebox_jsonl_cli.md b/TODO_archivebox_jsonl_cli.md
index 40c17fe7..fb7bf9fd 100644
--- a/TODO_archivebox_jsonl_cli.md
+++ b/TODO_archivebox_jsonl_cli.md
@@ -21,9 +21,10 @@ archivebox crawl create URL | archivebox snapshot create | archivebox archiveres
 
 ## Real-World Use Cases
 
-These examples demonstrate the power of the JSONL piping architecture. Note: `archivebox run`
-auto-cascades (Crawl → Snapshots → ArchiveResults), so intermediate commands are only needed
-when you want to customize behavior at that stage.
+These examples demonstrate the JSONL piping architecture. Key points:
+- `archivebox run` auto-cascades (Crawl → Snapshots → ArchiveResults)
+- `archivebox run` **emits JSONL** of everything it creates, enabling chained processing
+- Use CLI args (`--status=`, `--plugin=`) for efficient DB filtering; use jq for transforms
 
 ### 1. Basic Archive
 ```bash
@@ -42,38 +43,38 @@ archivebox crawl create --depth=2 https://docs.python.org | archivebox run
 # Retry all failed extractions
 archivebox archiveresult list --status=failed | archivebox run
 
-# Retry only failed PDFs
-archivebox archiveresult list --status=failed --plugin=pdf | archivebox run
-
-# Retry failed items from a specific domain (jq filter)
-archivebox snapshot list --status=queued \
-  | jq 'select(.url | contains("nytimes.com"))' \
+# Retry only failed PDFs from a specific domain
+archivebox archiveresult list --status=failed --plugin=pdf --url__icontains=nytimes.com \
   | archivebox run
 ```
 
-### 3. Import Bookmarks from Pinboard (jq)
+### 3. Import Bookmarks from Pinboard (jq transform)
 ```bash
-# Fetch Pinboard bookmarks and archive them
+# Fetch Pinboard API, transform fields to match ArchiveBox schema, archive
 curl -s "https://api.pinboard.in/v1/posts/all?format=json&auth_token=$TOKEN" \
   | jq -c '.[] | {url: .href, tags_str: .tags, title: .description}' \
   | archivebox crawl create \
   | archivebox run
 ```
 
-### 4. Filter and Process with jq
+### 4. Retry Failed with Different Binary (jq transform + re-run)
 ```bash
-# Archive only GitHub repository root pages (not issues, PRs, etc.)
-archivebox snapshot list \
-  | jq 'select(.url | test("github\\.com/[^/]+/[^/]+/?$"))' \
+# Get failed wget results, transform to use wget2 binary instead, re-queue as new attempts
+archivebox archiveresult list --status=failed --plugin=wget \
+  | jq -c '{snapshot_id, plugin, status: "queued", overrides: {WGET_BINARY: "wget2"}}' \
+  | archivebox archiveresult create \
   | archivebox run
 
-# Find snapshots with specific tag pattern
-archivebox snapshot list \
-  | jq 'select(.tags_str | contains("research"))' \
+# Chain processing: archive, then re-run any failures with increased timeout
+archivebox crawl create https://slow-site.com \
+  | archivebox run \
+  | jq -c 'select(.type == "ArchiveResult" and .status == "failed")
+      | del(.id) | .status = "queued" | .overrides.TIMEOUT = "120"' \
+  | archivebox archiveresult create \
   | archivebox run
 ```
 
-### 5. Selective Extraction (Screenshots Only)
+### 5. Selective Extraction
 ```bash
 # Create only screenshot extractions for queued snapshots
 archivebox snapshot list --status=queued \
@@ -88,68 +89,40 @@ archivebox archiveresult list --plugin=singlefile --status=skipped \
 
 ### 6. Bulk Tag Management
 ```bash
-# Tag all Twitter/X URLs
+# Tag all Twitter/X URLs (efficient DB filter, no jq needed)
 archivebox snapshot list --url__icontains=twitter.com \
   | archivebox snapshot update --tag=twitter
 
-# Tag all URLs from today's crawl
-archivebox crawl list --created_at__gte=$(date +%Y-%m-%d) \
-  | archivebox snapshot list \
-  | archivebox snapshot update --tag=daily-$(date +%Y%m%d)
+# Tag snapshots based on computed criteria (jq for logic DB can't do)
+archivebox snapshot list --status=sealed \
+  | jq -c 'select(.archiveresult_count > 5) | . + {tags_str: (.tags_str + ",well-archived")}' \
+  | archivebox snapshot update
 ```
 
-### 7. Deep Documentation Crawl
-```bash
-# Mirror documentation site (depth=3 follows links 3 levels deep)
-archivebox crawl create --depth=3 https://docs.djangoproject.com/en/4.2/ \
-  | archivebox run
-
-# Crawl with custom tag
-archivebox crawl create --depth=2 --tag=python-docs https://docs.python.org/3/ \
-  | archivebox run
-```
-
-### 8. RSS Feed Monitoring
+### 7. RSS Feed Monitoring
 ```bash
 # Archive all items from an RSS feed
 curl -s "https://hnrss.org/frontpage" \
-  | grep -oP '<link>\K[^<]+' \
-  | archivebox crawl create --tag=hackernews \
-  | archivebox run
-
-# Or with proper XML parsing
-curl -s "https://example.com/feed.xml" \
   | xq -r '.rss.channel.item[].link' \
-  | archivebox crawl create \
+  | archivebox crawl create --tag=hackernews-$(date +%Y%m%d) \
   | archivebox run
 ```
 
-### 9. Archive Audit with jq
+### 8. Recursive Link Following (run output → filter → re-run)
 ```bash
-# Count snapshots by status
-archivebox snapshot list | jq -s 'group_by(.status) | map({status: .[0].status, count: length})'
-
-# Find large archive results (over 50MB)
-archivebox archiveresult list \
-  | jq 'select(.output_size > 52428800) | {id, plugin, size_mb: (.output_size/1048576)}'
-
-# Export summary of archive
-archivebox snapshot list \
-  | jq -s '{total: length, by_status: (group_by(.status) | map({(.[0].status): length}) | add)}'
-```
-
-### 10. Incremental Backup
-```bash
-# Archive URLs not already in archive
-comm -23 \
-  <(sort new_urls.txt) \
-  <(archivebox snapshot list | jq -r '.url' | sort) \
-  | archivebox crawl create \
+# Archive a page, then archive all PDFs it links to
+archivebox crawl create https://research-papers.org/index.html \
+  | archivebox run \
+  | jq -c 'select(.type == "Snapshot") | .discovered_urls[]?
+      | select(endswith(".pdf")) | {url: .}' \
+  | archivebox crawl create --tag=linked-pdfs \
   | archivebox run
 
-# Re-archive anything older than 30 days
-archivebox snapshot list \
-  | jq "select(.created_at < \"$(date -d '30 days ago' --iso-8601)\")" \
+# Depth crawl with custom handling: retry timeouts with longer timeout
+archivebox crawl create --depth=1 https://example.com \
+  | archivebox run \
+  | jq -c 'select(.type == "ArchiveResult" and .status == "failed" and (.error | contains("timeout")))
+      | del(.id) | .overrides.TIMEOUT = "300"' \
   | archivebox archiveresult create \
   | archivebox run
 ```
@@ -158,17 +131,17 @@ archivebox snapshot list \
 
 | Pattern | Example |
 |---------|---------|
-| **Filter → Process** | `list --status=failed \| run` |
-| **Transform → Archive** | `curl RSS \| jq \| crawl create \| run` |
-| **Bulk Tag** | `list --url__icontains=X \| update --tag=Y` |
-| **Selective Extract** | `snapshot list \| archiveresult create --plugin=pdf` |
-| **Chain Depth** | `crawl create --depth=2 \| run` |
-| **Export/Audit** | `list \| jq -s 'group_by(.status)'` |
-| **Compose with Unix** | `\| jq \| grep \| sort \| uniq \| parallel` |
+| **Filter → Process** | `list --status=failed --plugin=pdf \| run` |
+| **Transform → Archive** | `curl API \| jq '{url, tags_str}' \| crawl create \| run` |
+| **Retry w/ Changes** | `run \| jq 'select(.status=="failed") \| del(.id)' \| create \| run` |
+| **Selective Extract** | `snapshot list \| archiveresult create --plugin=screenshot` |
+| **Bulk Update** | `list --url__icontains=X \| update --tag=Y` |
+| **Chain Processing** | `crawl \| run \| jq transform \| create \| run` |
 
-The key insight: **every intermediate step produces valid JSONL** that can be saved, filtered,
-transformed, or resumed later. This makes archiving workflows debuggable, repeatable, and
-composable with the entire Unix ecosystem.
+The key insight: **`archivebox run` emits JSONL of everything it creates**, enabling:
+- Retry failed items with different settings (timeouts, binaries, etc.)
+- Recursive crawling (archive page → extract links → archive those)
+- Chained transforms (filter failures, modify config, re-queue)
 
 ---
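The pass-through behavior both patches rely on (Core Principle 2: every command re-emits its input records plus whatever it creates, and `run` emits JSONL for everything it makes) can be sketched as a small stdin-to-stdout filter. This is an illustrative stand-in, not the archivebox implementation; the `type`, `status`, and `snapshot_id` fields simply mirror the examples above and are assumptions about the record schema.

```python
#!/usr/bin/env python3
# Illustrative sketch of the pass-through JSONL pattern -- not the real archivebox CLI.
# Every input record is re-emitted, then any newly "created" records are appended,
# so downstream pipeline stages see the accumulated stream (Core Principle 2).
import json
import sys
import uuid


def read_jsonl(stream):
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def emit(record):
    sys.stdout.write(json.dumps(record) + "\n")


def main():
    for record in read_jsonl(sys.stdin):
        emit(record)  # pass-through: forward every input record unchanged
        if record.get("type") == "Snapshot":
            # pretend to create one ArchiveResult per incoming Snapshot
            emit({
                "type": "ArchiveResult",
                "id": str(uuid.uuid4()),
                "snapshot_id": record.get("id"),
                "status": "queued",
            })


if __name__ == "__main__":
    main()
```

Piping a line such as `{"type": "Snapshot", "id": "abc", "url": "https://example.com"}` through this sketch prints the original Snapshot followed by a newly queued ArchiveResult record, which is the accumulating behavior the chained `run | jq | create | run` examples above depend on.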