mirror of https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-06 07:47:53 +10:00

commit: move tests into subfolder, add missing install hooks

old/Architecture.md (new file, +172 lines):

# ArchiveBox UI

## Page: Getting Started

### What do you want to capture?

- Save some URLs now -> [Add page]
    - Paste some URLs to archive now
    - Upload a file containing URLs (bookmarks.html export, RSS .xml feed, Markdown file, Word doc, PDF, etc.)
    - Pull in URLs to archive from a remote location (e.g. an RSS feed URL, remote TXT file, JSON file, etc.)

- Import URLs from a browser -> [Import page]
    - Desktop: get the ArchiveBox Chrome/Firefox extension
    - Mobile: get the ArchiveBox iOS App / Android App
    - Upload a bookmarks.html export file
    - Upload a browser_history.sqlite3 export file

- Import URLs from a 3rd-party bookmarking service -> [Sync page]
    - Pocket
    - Pinboard
    - Instapaper
    - Wallabag
    - Zapier, N8N, IFTTT, etc.
    - Upload a bookmarks.html export, bookmarks.json, RSS, etc. file

- Archive URLs on a schedule -> [Schedule page]

- Archive an entire website -> [Crawl page]
    - What starting URL/domain?
    - How deep?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?

- Crawl for URLs with a search engine and save automatically
- Save some URLs on a schedule
    - Save an entire website (e.g. `https://example.com`)
    - Save results matching a search query (e.g. "site:example.com")
    - Save a social media feed (e.g. `https://x.com/user/1234567890`)

--------------------------------------------------------------------------------

### Crawls App

- Archive an entire website -> [Crawl page]
    - What are the starting URLs?
    - How many hops to follow?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?

--------------------------------------------------------------------------------

### Scheduler App

- Archive URLs on a schedule -> [Schedule page]
    - What URL(s)?
    - How often?
    - Do you want to discard old snapshots after x amount of time?
    - Any filter rules?
    - Want to be notified when changes are detected? -> redirect to [Alerts app: create new alert(crawl=self)]

* Choose Schedule to check for new URLs: `Schedule.objects.get(pk=xyz)`
    - 1 minute
    - 5 minutes
    - 1 hour
    - 1 day

* Choose destination Crawl to archive URLs with: `Crawl.objects.get(pk=xyz)`
    - Tags
    - Persona
    - Created By ID
    - Config
    - Filters
        - URL patterns to include
        - URL patterns to exclude
    - ONLY_NEW= ignore URLs if already saved once / save URL each time it appears / only save if last save > x time ago

--------------------------------------------------------------------------------

### Sources App (for managing sources that ArchiveBox pulls URLs in from)

- Add a new source to pull URLs in from (WIZARD)
    - Choose URI:
        - [x] Web UI
        - [x] CLI
        - Local filesystem path (directory to monitor for new files containing URLs)
        - Remote URL (RSS/JSON/XML feed)
        - Chrome browser profile sync (log in using Gmail to pull bookmarks/history)
        - Pocket, Pinboard, Instapaper, Wallabag, etc.
        - Zapier, N8N, IFTTT, etc.
        - Local server filesystem path (directory to monitor for new files containing URLs)
        - Google Drive (directory to monitor for new files containing URLs)
        - Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)
        - AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)
        - XBrowserSync (log in to pull bookmarks)
    - Choose extractor:
        - auto
        - RSS
        - Pocket
        - etc.
    - Specify extra Config, e.g.
        - credentials
        - extractor tuning options (e.g. verify_ssl, cookies, etc.)

- Provide credentials for the source
    - API Key
    - Username / Password
    - OAuth

--------------------------------------------------------------------------------

### Alerts App

- Create a new alert, choose condition:
    - Get notified when a site goes down (<x% success ratio for Snapshots)
    - Get notified when a site changes visually more than x% (screenshot diff)
    - Get notified when a site's text content changes more than x% (text diff)
    - Get notified when a keyword appears
    - Get notified when a keyword disappears
    - Get notified when an AI prompt returns some result

- Choose alert threshold:
    - any condition is met
    - all conditions are met
    - condition is met for x% of URLs
    - condition is met for x% of time

- Choose how to notify: (List[AlertDestination])
    - maximum alert frequency
    - destination type: email / Slack / Webhook / Google Sheet / logfile
    - destination info:
        - email address(es)
        - Slack channel
        - Webhook URL

- Choose scope:
    - Choose ArchiveResult scope (extractors): (a query that returns an ArchiveResult.objects QuerySet)
        - All extractors
        - Only screenshots
        - Only readability / mercury text
        - Only video
        - Only html
        - Only headers
    - Choose Snapshot scope (URL): (a query that returns a Snapshot.objects QuerySet)
        - All domains
        - Specific domain
        - All domains in a tag
        - All domains in a tag category
        - All URLs matching a certain regex pattern
    - Choose Crawl scope: (a query that returns a Crawl.objects QuerySet)
        - All crawls
        - Specific crawls
        - crawls by a certain user
        - crawls using a certain persona
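
The first condition ("site goes down") could be evaluated as a simple ratio check. A hedged sketch; the function name and threshold default are assumptions, not the real implementation:

```python
def site_is_down(success_count: int, total_count: int, threshold_pct: float = 50.0) -> bool:
    """Hypothetical check for the 'site goes down' condition: fires when the
    Snapshot success ratio drops below threshold_pct."""
    if total_count == 0:
        return False  # no snapshots yet -> nothing to alert on
    return (success_count / total_count) * 100 < threshold_pct

assert site_is_down(2, 10) is True    # 20% success ratio -> alert
assert site_is_down(9, 10) is False   # 90% success ratio -> healthy
```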

    class AlertDestination(models.Model):
        destination_type: [email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.]
        maximum_frequency
        filter_rules
        credentials
        alert_template: Jinja2 JSON/text template that gets populated with alert contents
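
The pseudocode above can be fleshed out as a plain-Python sketch (a real Django model would use `CharField`/`DurationField`/`JSONField`; the field names mirror the pseudocode but all types and defaults are assumptions):

```python
from dataclasses import dataclass, field
from datetime import timedelta

DESTINATION_TYPES = ('email', 'slack', 'webhook', 'google_sheet', 'logfile', 'bucket')  # b2/s3/gcp etc.

@dataclass
class AlertDestinationSketch:
    """Plain-Python stand-in for the AlertDestination model sketched above."""
    destination_type: str
    maximum_frequency: timedelta = timedelta(hours=1)   # rate limit: at most one alert per interval
    filter_rules: dict = field(default_factory=dict)
    credentials: dict = field(default_factory=dict)
    alert_template: str = '{{ alert.condition }}: {{ alert.url }}'  # Jinja2 template

dest = AlertDestinationSketch(destination_type='slack', credentials={'webhook_url': '...'})
assert dest.destination_type in DESTINATION_TYPES
```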
--------------------------------------------------------------------------------

old/TODO_archivebox_jsonl_cli.md (new file, +716 lines):

# ArchiveBox CLI Pipeline Architecture

## Overview

This plan implements a JSONL-based CLI pipeline for ArchiveBox, enabling Unix-style piping between commands:

```bash
archivebox crawl create URL | archivebox snapshot create | archivebox archiveresult create | archivebox run
```

## Design Principles

1. **Maximize model method reuse**: Use `.to_json()`, `.from_json()`, `.to_jsonl()`, `.from_jsonl()` everywhere
2. **Pass-through behavior**: All commands output input records + newly created records (accumulating pipeline)
3. **Create-or-update**: Commands create records if they don't exist, update if an ID matches an existing record
4. **Auto-cascade**: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots
5. **Generic filtering**: Implement filters as functions that take a queryset → return a queryset
6. **Minimal code**: Extract the duplicated `apply_filters()` to a shared module

---

## Real-World Use Cases

These examples demonstrate the JSONL piping architecture. Key points:

- `archivebox run` auto-cascades (Crawl → Snapshots → ArchiveResults)
- `archivebox run` **emits JSONL** of everything it creates, enabling chained processing
- Use CLI args (`--status=`, `--plugin=`) for efficient DB filtering; use jq for transforms

### 1. Basic Archive

```bash
# Simple URL archive (run auto-creates snapshots and archive results)
archivebox crawl create https://example.com | archivebox run

# Multiple URLs from a file
archivebox crawl create < urls.txt | archivebox run

# With depth crawling (follow links)
archivebox crawl create --depth=2 https://docs.python.org | archivebox run
```

### 2. Retry Failed Extractions

```bash
# Retry all failed extractions
archivebox archiveresult list --status=failed | archivebox run

# Retry only failed PDFs from a specific domain
archivebox archiveresult list --status=failed --plugin=pdf --url__icontains=nytimes.com \
  | archivebox run
```

### 3. Import Bookmarks from Pinboard (jq transform)

```bash
# Fetch Pinboard API, transform fields to match the ArchiveBox schema, archive
curl -s "https://api.pinboard.in/v1/posts/all?format=json&auth_token=$TOKEN" \
  | jq -c '.[] | {url: .href, tags_str: .tags, title: .description}' \
  | archivebox crawl create \
  | archivebox run
```

### 4. Retry Failed with a Different Binary (jq transform + re-run)

```bash
# Get failed wget results, transform to use the wget2 binary instead, re-queue as new attempts
archivebox archiveresult list --status=failed --plugin=wget \
  | jq -c '{snapshot_id, plugin, status: "queued", overrides: {WGET_BINARY: "wget2"}}' \
  | archivebox archiveresult create \
  | archivebox run

# Chain processing: archive, then re-run any failures with an increased timeout
archivebox crawl create https://slow-site.com \
  | archivebox run \
  | jq -c 'select(.type == "ArchiveResult" and .status == "failed")
           | del(.id) | .status = "queued" | .overrides.TIMEOUT = "120"' \
  | archivebox archiveresult create \
  | archivebox run
```

### 5. Selective Extraction

```bash
# Create only screenshot extractions for queued snapshots
archivebox snapshot list --status=queued \
  | archivebox archiveresult create --plugin=screenshot \
  | archivebox run

# Re-run singlefile on everything that was skipped
archivebox archiveresult list --plugin=singlefile --status=skipped \
  | archivebox archiveresult update --status=queued \
  | archivebox run
```

### 6. Bulk Tag Management

```bash
# Tag all Twitter/X URLs (efficient DB filter, no jq needed)
archivebox snapshot list --url__icontains=twitter.com \
  | archivebox snapshot update --tag=twitter

# Tag snapshots based on computed criteria (jq for logic the DB can't do)
archivebox snapshot list --status=sealed \
  | jq -c 'select(.archiveresult_count > 5) | . + {tags_str: (.tags_str + ",well-archived")}' \
  | archivebox snapshot update
```

### 7. RSS Feed Monitoring

```bash
# Archive all items from an RSS feed
curl -s "https://hnrss.org/frontpage" \
  | xq -r '.rss.channel.item[].link' \
  | archivebox crawl create --tag=hackernews-$(date +%Y%m%d) \
  | archivebox run
```

### 8. Recursive Link Following (run output → filter → re-run)

```bash
# Archive a page, then archive all PDFs it links to
archivebox crawl create https://research-papers.org/index.html \
  | archivebox run \
  | jq -c 'select(.type == "Snapshot") | .discovered_urls[]?
           | select(endswith(".pdf")) | {url: .}' \
  | archivebox crawl create --tag=linked-pdfs \
  | archivebox run

# Depth crawl with custom handling: retry timeouts with a longer timeout
archivebox crawl create --depth=1 https://example.com \
  | archivebox run \
  | jq -c 'select(.type == "ArchiveResult" and .status == "failed" and (.error | contains("timeout")))
           | del(.id) | .overrides.TIMEOUT = "300"' \
  | archivebox archiveresult create \
  | archivebox run
```

### Composability Summary

| Pattern | Example |
|---------|---------|
| **Filter → Process** | `list --status=failed --plugin=pdf \| run` |
| **Transform → Archive** | `curl API \| jq '{url, tags_str}' \| crawl create \| run` |
| **Retry w/ Changes** | `run \| jq 'select(.status=="failed") \| del(.id)' \| create \| run` |
| **Selective Extract** | `snapshot list \| archiveresult create --plugin=screenshot` |
| **Bulk Update** | `list --url__icontains=X \| update --tag=Y` |
| **Chain Processing** | `crawl \| run \| jq transform \| create \| run` |

The key insight: **`archivebox run` emits JSONL of everything it creates**, enabling:
- Retrying failed items with different settings (timeouts, binaries, etc.)
- Recursive crawling (archive page → extract links → archive those)
- Chained transforms (filter failures, modify config, re-queue)

---

## Code Reuse Findings

### Existing Model Methods (USE THESE)
- `Crawl.to_json()`, `Crawl.from_json()`, `Crawl.to_jsonl()`, `Crawl.from_jsonl()`
- `Snapshot.to_json()`, `Snapshot.from_json()`, `Snapshot.to_jsonl()`, `Snapshot.from_jsonl()`
- `Tag.to_json()`, `Tag.from_json()`, `Tag.to_jsonl()`, `Tag.from_jsonl()`

### Missing Model Methods (MUST IMPLEMENT)
- **`ArchiveResult.from_json()`** - does not exist, must be added
- **`ArchiveResult.from_jsonl()`** - does not exist, must be added

### Existing Utilities (USE THESE)
- `archivebox/misc/jsonl.py`: `read_stdin()`, `read_args_or_stdin()`, `write_record()`, `parse_line()`
- Type constants: `TYPE_CRAWL`, `TYPE_SNAPSHOT`, `TYPE_ARCHIVERESULT`, etc.

### Duplicated Code (EXTRACT)
- `apply_filters()` is duplicated in 7 CLI files → extract to `archivebox/cli/cli_utils.py`

### Supervisord Config (UPDATE)
- `archivebox/workers/supervisord_util.py` line ~35: `"command": "archivebox manage orchestrator"` → `"command": "archivebox run"`

### Field Name Standardization (FIX)
- **Issue**: `Crawl.to_json()` outputs `tags_str`, but `Snapshot.to_json()` outputs `tags`
- **Fix**: Standardize all models to use `tags_str` in JSONL output (matches the model property names)

---

## Implementation Order

### Phase 1: Model Prerequisites
1. **Implement `ArchiveResult.from_json()`** in `archivebox/core/models.py`
    - Pattern: match the `Snapshot.from_json()` and `Crawl.from_json()` style
    - Handle: ID lookup (update existing) or create new
    - Required fields: `snapshot_id`, `plugin`
    - Optional fields: `status`, `hook_name`, etc.

2. **Implement `ArchiveResult.from_jsonl()`** in `archivebox/core/models.py`
    - Filter records by `type='ArchiveResult'`
    - Call `from_json()` for each matching record

3. **Fix the `Snapshot.to_json()` field name**
    - Change `'tags': self.tags_str()` → `'tags_str': self.tags_str()`
    - Update any code that depends on the `tags` key in Snapshot JSONL

### Phase 2: Shared Utilities
4. **Extract `apply_filters()` to `archivebox/cli/cli_utils.py`**
    - Generic queryset filtering from CLI kwargs
    - Support `--id__in=[csv]`, `--url__icontains=str`, etc.
    - Remove the duplicates from 7 CLI files

### Phase 3: Pass-Through Behavior (NEW FEATURE)
5. **Add pass-through to `archivebox crawl create`**
    - Output non-Crawl input records unchanged
    - Output created Crawl records

6. **Add pass-through to `archivebox snapshot create`**
    - Output non-Snapshot/non-Crawl input records unchanged
    - Process Crawl records → create Snapshots
    - Output both the original Crawls and the created Snapshots

7. **Add pass-through to `archivebox archiveresult create`**
    - Output non-Snapshot/non-ArchiveResult input records unchanged
    - Process Snapshot records → create ArchiveResults
    - Output both the original Snapshots and the created ArchiveResults

8. **Add create-or-update to `archivebox run`**
    - Records WITH an id: look up and queue the existing record
    - Records WITHOUT an id: create via `Model.from_json()`, then queue
    - Pass-through output of all processed records

### Phase 4: Test Infrastructure
9. **Create `archivebox/tests/conftest.py`** with pytest-django
    - Use `pytest-django` for proper test database handling
    - Isolated DATA_DIR per test via the `tmp_path` fixture
    - `run_archivebox_cmd()` helper for subprocess testing

### Phase 5: Unit Tests
10. **Create `archivebox/tests/test_cli_crawl.py`** - crawl create/list/pass-through tests
11. **Create `archivebox/tests/test_cli_snapshot.py`** - snapshot create/list/pass-through tests
12. **Create `archivebox/tests/test_cli_archiveresult.py`** - archiveresult create/list/pass-through tests
13. **Create `archivebox/tests/test_cli_run.py`** - run command create-or-update tests

### Phase 6: Integration & Config
14. **Extend `archivebox/cli/tests_piping.py`** - add pass-through integration tests
15. **Update the supervisord config** - `orchestrator` → `run`

---

## Future Work (Deferred)

### Commands to Defer
- `archivebox tag create|list|update|delete` - already works, defer improvements
- `archivebox binary create|list|update|delete` - lower priority
- `archivebox process list` - lower priority
- `archivebox apikey create|list|update|delete` - lower priority

### `archivebox add` Relationship
- **Current**: `archivebox add` is the primary user-facing command and stays as-is
- **Future**: Refactor `add` to internally use the `crawl create | snapshot create | run` pipeline
- **Note**: This refactor is deferred; `add` continues to work independently for now

---

## Key Files

| File | Action | Phase |
|------|--------|-------|
| `archivebox/core/models.py` | Add `ArchiveResult.from_json()`, `from_jsonl()` | 1 |
| `archivebox/core/models.py` | Fix `Snapshot.to_json()` → `tags_str` | 1 |
| `archivebox/cli/cli_utils.py` | NEW - shared `apply_filters()` | 2 |
| `archivebox/cli/archivebox_crawl.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_snapshot.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_archiveresult.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_run.py` | Add create-or-update, pass-through | 3 |
| `archivebox/tests/conftest.py` | NEW - pytest fixtures | 4 |
| `archivebox/tests/test_cli_crawl.py` | NEW - crawl unit tests | 5 |
| `archivebox/tests/test_cli_snapshot.py` | NEW - snapshot unit tests | 5 |
| `archivebox/tests/test_cli_archiveresult.py` | NEW - archiveresult unit tests | 5 |
| `archivebox/tests/test_cli_run.py` | NEW - run unit tests | 5 |
| `archivebox/cli/tests_piping.py` | Extend with pass-through tests | 6 |
| `archivebox/workers/supervisord_util.py` | Update `orchestrator` → `run` | 6 |

---

## Implementation Details

### ArchiveResult.from_json() Design

```python
@staticmethod
def from_json(record: Dict[str, Any], overrides: Optional[Dict[str, Any]] = None) -> 'ArchiveResult | None':
    """
    Create or update a single ArchiveResult from a JSON record dict.

    Args:
        record: Dict with 'snapshot_id' and 'plugin' (required for create),
                or 'id' (for update)
        overrides: Dict of field overrides

    Returns:
        ArchiveResult instance, or None if the record is invalid
    """
    from django.utils import timezone

    overrides = overrides or {}

    # If 'id' is provided, look up and update the existing record
    result_id = record.get('id')
    if result_id:
        try:
            result = ArchiveResult.objects.get(id=result_id)
            # Update fields from the record
            if record.get('status'):
                result.status = record['status']
                result.retry_at = timezone.now()
            result.save()
            return result
        except ArchiveResult.DoesNotExist:
            pass  # Fall through to create

    # Required fields for creation
    snapshot_id = record.get('snapshot_id')
    plugin = record.get('plugin')

    if not snapshot_id or not plugin:
        return None

    try:
        snapshot = Snapshot.objects.get(id=snapshot_id)
    except Snapshot.DoesNotExist:
        return None

    # Create, or get the existing result for this (snapshot, plugin) pair
    result, created = ArchiveResult.objects.get_or_create(
        snapshot=snapshot,
        plugin=plugin,
        defaults={
            'status': record.get('status', ArchiveResult.StatusChoices.QUEUED),
            'retry_at': timezone.now(),
            'hook_name': record.get('hook_name', ''),
            **overrides,
        }
    )

    # If not created, optionally reset it for retry
    if not created and record.get('status'):
        result.status = record['status']
        result.retry_at = timezone.now()
        result.save()

    return result
```

### Pass-Through Pattern

All `create` commands follow this pattern:

```python
def create_X(args, ...):
    is_tty = sys.stdout.isatty()
    records = list(read_args_or_stdin(args))

    for record in records:
        record_type = record.get('type')

        # Pass-through: output records we don't handle
        if record_type not in HANDLED_TYPES:
            if not is_tty:
                write_record(record)
            continue

        # Handle our type: create via Model.from_json()
        obj = Model.from_json(record, overrides={...})

        # Output the created record (hydrated with its db id)
        if obj and not is_tty:
            write_record(obj.to_json())
```

### Pass-Through Semantics Example

```
Input:
  {"type": "Crawl", "id": "abc", "urls": "https://example.com", ...}
  {"type": "Tag", "name": "important"}

archivebox snapshot create output:
  {"type": "Crawl", "id": "abc", ...}      # pass-through (not our type)
  {"type": "Tag", "name": "important"}     # pass-through (not our type)
  {"type": "Snapshot", "id": "xyz", ...}   # created from the Crawl's URLs
```

### Create-or-Update Pattern for `archivebox run`

```python
def process_stdin_records() -> int:
    records = list(read_stdin())
    is_tty = sys.stdout.isatty()

    for record in records:
        record_type = record.get('type')
        record_id = record.get('id')

        # Create-or-update based on whether an ID exists
        if record_type == TYPE_CRAWL:
            if record_id:
                try:
                    obj = Crawl.objects.get(id=record_id)
                except Crawl.DoesNotExist:
                    obj = Crawl.from_json(record)
            else:
                obj = Crawl.from_json(record)

            if obj:
                obj.retry_at = timezone.now()
                obj.save()
                if not is_tty:
                    write_record(obj.to_json())

        # Similar for Snapshot, ArchiveResult...
```

### Shared apply_filters() Design

Extract to `archivebox/cli/cli_utils.py`:

```python
"""Shared CLI utilities for ArchiveBox commands."""

from typing import Optional


def apply_filters(queryset, filter_kwargs: dict, limit: Optional[int] = None):
    """
    Apply Django-style filters from CLI kwargs to a QuerySet.

    Supports: --status=queued, --url__icontains=example, --id__in=uuid1,uuid2

    Args:
        queryset: Django QuerySet to filter
        filter_kwargs: Dict of filter key-value pairs from the CLI
        limit: Optional limit on results

    Returns:
        Filtered QuerySet
    """
    filters = {}
    for key, value in filter_kwargs.items():
        if value is None or key in ('limit', 'offset'):
            continue
        # Handle CSV lists for __in filters
        if key.endswith('__in') and isinstance(value, str):
            value = [v.strip() for v in value.split(',')]
        filters[key] = value

    if filters:
        queryset = queryset.filter(**filters)
    if limit:
        queryset = queryset[:limit]

    return queryset
```

---
|
||||
|
||||
## conftest.py Design (pytest-django)
|
||||
|
||||
```python
|
||||
"""archivebox/tests/conftest.py - Pytest fixtures for CLI tests."""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Any, Optional, Tuple
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Fixtures
|
||||
# =============================================================================
|
||||
|
||||
@pytest.fixture
|
||||
def isolated_data_dir(tmp_path, settings):
|
||||
"""
|
||||
Create isolated DATA_DIR for each test.
|
||||
|
||||
Uses tmp_path for isolation, configures Django settings.
|
||||
"""
|
||||
data_dir = tmp_path / 'archivebox_data'
|
||||
data_dir.mkdir()
|
||||
|
||||
# Set environment for subprocess calls
|
||||
os.environ['DATA_DIR'] = str(data_dir)
|
||||
|
||||
# Update Django settings
|
||||
settings.DATA_DIR = data_dir
|
||||
|
||||
yield data_dir
|
||||
|
||||
# Cleanup handled by tmp_path fixture
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def initialized_archive(isolated_data_dir):
|
||||
"""
|
||||
Initialize ArchiveBox archive in isolated directory.
|
||||
|
||||
Runs `archivebox init` to set up database and directories.
|
||||
"""
|
||||
from archivebox.cli.archivebox_init import init
|
||||
init(setup=True, quick=True)
|
||||
return isolated_data_dir
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def cli_env(initialized_archive):
|
||||
"""
|
||||
Environment dict for CLI subprocess calls.
|
||||
|
||||
Includes DATA_DIR and disables slow extractors.
|
||||
"""
|
||||
return {
|
||||
**os.environ,
|
||||
'DATA_DIR': str(initialized_archive),
|
||||
'USE_COLOR': 'False',
|
||||
'SHOW_PROGRESS': 'False',
|
||||
'SAVE_TITLE': 'True',
|
||||
'SAVE_FAVICON': 'False',
|
||||
'SAVE_WGET': 'False',
|
||||
'SAVE_WARC': 'False',
|
||||
'SAVE_PDF': 'False',
|
||||
'SAVE_SCREENSHOT': 'False',
|
||||
'SAVE_DOM': 'False',
|
||||
'SAVE_SINGLEFILE': 'False',
|
||||
'SAVE_READABILITY': 'False',
|
||||
'SAVE_MERCURY': 'False',
|
||||
'SAVE_GIT': 'False',
|
||||
'SAVE_YTDLP': 'False',
|
||||
'SAVE_HEADERS': 'False',
|
||||
}
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# CLI Helpers
|
||||
# =============================================================================
|
||||
|
||||
def run_archivebox_cmd(
|
||||
args: List[str],
|
||||
stdin: Optional[str] = None,
|
||||
cwd: Optional[Path] = None,
|
||||
env: Optional[Dict[str, str]] = None,
|
||||
timeout: int = 60,
|
||||
) -> Tuple[str, str, int]:
|
||||
"""
|
||||
Run archivebox command, return (stdout, stderr, returncode).
|
||||
|
||||
Args:
|
||||
args: Command arguments (e.g., ['crawl', 'create', 'https://example.com'])
|
||||
stdin: Optional string to pipe to stdin
|
||||
cwd: Working directory (defaults to DATA_DIR from env)
|
||||
env: Environment variables (defaults to os.environ with DATA_DIR)
|
||||
timeout: Command timeout in seconds
|
||||
|
||||
Returns:
|
||||
Tuple of (stdout, stderr, returncode)
|
||||
"""
|
||||
cmd = [sys.executable, '-m', 'archivebox'] + args
|
||||
|
||||
env = env or {**os.environ}
|
||||
cwd = cwd or Path(env.get('DATA_DIR', '.'))
|
||||
|
||||
result = subprocess.run(
|
||||
cmd,
|
||||
input=stdin,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
cwd=cwd,
|
||||
env=env,
|
||||
timeout=timeout,
|
||||
)
|
||||
|
||||
return result.stdout, result.stderr, result.returncode
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Output Assertions
|
||||
# =============================================================================
|
||||
|
||||
def parse_jsonl_output(stdout: str) -> List[Dict[str, Any]]:
|
||||
"""Parse JSONL output into list of dicts."""
|
||||
records = []
|
||||
for line in stdout.strip().split('\n'):
|
||||
line = line.strip()
|
||||
if line and line.startswith('{'):
|
||||
try:
|
||||
records.append(json.loads(line))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
return records
|
||||
|
||||
|
||||
def assert_jsonl_contains_type(stdout: str, record_type: str, min_count: int = 1):
|
||||
"""Assert output contains at least min_count records of type."""
|
||||
records = parse_jsonl_output(stdout)
|
||||
matching = [r for r in records if r.get('type') == record_type]
|
||||
assert len(matching) >= min_count, \
|
||||
f"Expected >= {min_count} {record_type}, got {len(matching)}"
|
||||
return matching
|
||||
|
||||
|
||||
def assert_jsonl_pass_through(stdout: str, input_records: List[Dict[str, Any]]):
|
||||
"""Assert that input records appear in output (pass-through behavior)."""
|
||||
output_records = parse_jsonl_output(stdout)
|
||||
output_ids = {r.get('id') for r in output_records if r.get('id')}
|
||||
|
||||
for input_rec in input_records:
|
||||
input_id = input_rec.get('id')
|
||||
if input_id:
|
||||
assert input_id in output_ids, \
|
||||
f"Input record {input_id} not found in output (pass-through failed)"
|
||||
|
||||
|
||||
def assert_record_has_fields(record: Dict[str, Any], required_fields: List[str]):
|
||||
"""Assert record has all required fields with non-None values."""
|
||||
for field in required_fields:
|
||||
assert field in record, f"Record missing field: {field}"
|
||||
assert record[field] is not None, f"Record field is None: {field}"
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Database Assertions
|
||||
# =============================================================================
|
||||
|
||||
def assert_db_count(model_class, filters: Dict[str, Any], expected: int):
|
||||
"""Assert database count matches expected."""
|
||||
actual = model_class.objects.filter(**filters).count()
|
||||
assert actual == expected, \
|
||||
f"Expected {expected} {model_class.__name__}, got {actual}"
|
||||
|
||||
|
||||
def assert_db_exists(model_class, **filters):
|
||||
"""Assert at least one record exists matching filters."""
|
||||
assert model_class.objects.filter(**filters).exists(), \
|
||||
f"No {model_class.__name__} found matching {filters}"
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test Data Factories
|
||||
# =============================================================================
|
||||
|
||||
def create_test_url(domain: str = 'example.com', path: str = None) -> str:
|
||||
"""Generate unique test URL."""
|
||||
import uuid
|
||||
path = path or uuid.uuid4().hex[:8]
|
||||
return f'https://{domain}/{path}'
|
||||
|
||||
|
||||
def create_test_crawl_json(urls: List[str] | None = None, **kwargs) -> Dict[str, Any]:
    """Create Crawl JSONL record for testing."""
    from archivebox.misc.jsonl import TYPE_CRAWL

    urls = urls or [create_test_url()]
    return {
        'type': TYPE_CRAWL,
        'urls': '\n'.join(urls),
        'max_depth': kwargs.get('max_depth', 0),
        'tags_str': kwargs.get('tags_str', ''),
        'status': kwargs.get('status', 'queued'),
        **{k: v for k, v in kwargs.items() if k not in ('max_depth', 'tags_str', 'status')},
    }

def create_test_snapshot_json(url: str | None = None, **kwargs) -> Dict[str, Any]:
    """Create Snapshot JSONL record for testing."""
    from archivebox.misc.jsonl import TYPE_SNAPSHOT

    return {
        'type': TYPE_SNAPSHOT,
        'url': url or create_test_url(),
        'tags_str': kwargs.get('tags_str', ''),
        'status': kwargs.get('status', 'queued'),
        **{k: v for k, v in kwargs.items() if k not in ('tags_str', 'status')},
    }
```

---

## Test Rules

- **NO SKIPPING** - Every test runs
- **NO MOCKING** - Real subprocess calls, real database
- **NO DISABLING** - Failing tests identify real problems
- **MINIMAL CODE** - Import helpers from conftest.py
- **ISOLATED** - Each test gets its own DATA_DIR via `tmp_path`

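Tests following these rules shell out to the real CLI and assert on its JSONL stdout. A minimal sketch of the stdout-parsing step, assuming a hypothetical `run_cli()` subprocess wrapper in conftest.py (a thin `subprocess.run()` against an isolated `tmp_path` DATA_DIR, not shown here):

```python
# Hypothetical test helper: parse the JSONL a real CLI invocation printed,
# so tests can make plain assertions on the records (no mocking involved).
import json

def parse_jsonl_stdout(stdout: str) -> list[dict]:
    """Parse the JSONL lines a CLI command printed to stdout."""
    return [json.loads(line) for line in stdout.splitlines() if line.strip()]
```

Usage would look like `records = parse_jsonl_stdout(run_cli(tmp_path, 'snapshot', 'list').stdout)`.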
---

## Task Checklist

### Phase 1: Model Prerequisites
- [x] Implement `ArchiveResult.from_json()` in `archivebox/core/models.py`
- [x] Implement `ArchiveResult.from_jsonl()` in `archivebox/core/models.py`
- [x] Fix `Snapshot.to_json()` to use `tags_str` instead of `tags`

### Phase 2: Shared Utilities
- [x] Create `archivebox/cli/cli_utils.py` with shared `apply_filters()`
- [x] Update 7 CLI files to import from `cli_utils.py`

### Phase 3: Pass-Through Behavior
- [x] Add pass-through to `archivebox_crawl.py` create
- [x] Add pass-through to `archivebox_snapshot.py` create
- [x] Add pass-through to `archivebox_archiveresult.py` create
- [x] Add create-or-update to `archivebox_run.py`
- [x] Add pass-through output to `archivebox_run.py`

### Phase 4: Test Infrastructure
- [x] Create `archivebox/tests/conftest.py` with pytest-django fixtures

### Phase 5: Unit Tests
- [x] Create `archivebox/tests/test_cli_crawl.py`
- [x] Create `archivebox/tests/test_cli_snapshot.py`
- [x] Create `archivebox/tests/test_cli_archiveresult.py`
- [x] Create `archivebox/tests/test_cli_run.py`

### Phase 6: Integration & Config
- [x] Extend `archivebox/cli/tests_piping.py` with pass-through tests
- [x] Update `archivebox/workers/supervisord_util.py`: orchestrator→run
131
old/TODO_cli_refactor.md
Normal file
@@ -0,0 +1,131 @@
# ArchiveBox CLI Refactor TODO

## Design Decisions

1. **Keep `archivebox add`** as high-level convenience command
2. **Unified `archivebox run`** for processing (replaces per-model `run` and `orchestrator`)
3. **Expose all models** including binary, process, machine
4. **Clean break** from old command structure (no backward compatibility aliases)

## Final Architecture

```
archivebox <model> <action> [args...] [--filters]
archivebox run [stdin JSONL]
```

### Actions (4 per model):
- `create` - Create records (from args, stdin, or JSONL), dedupes by indexed fields
- `list` - Query records (with filters, returns JSONL)
- `update` - Modify records (from stdin JSONL, PATCH semantics)
- `delete` - Remove records (from stdin JSONL, requires --yes)

### Unified Run Command:
- `archivebox run` - Process queued work
- With stdin JSONL: Process piped records, exit when complete
- Without stdin (TTY): Run orchestrator in foreground until killed

### Models (7 total):
- `crawl` - Crawl jobs
- `snapshot` - Individual archived pages
- `archiveresult` - Plugin extraction results
- `tag` - Tags/labels
- `binary` - Detected binaries (chrome, wget, etc.)
- `process` - Process execution records (read-only)
- `machine` - Machine/host records (read-only)

---

## Implementation Checklist

### Phase 1: Unified Run Command
- [x] Create `archivebox/cli/archivebox_run.py` - unified processing command

### Phase 2: Core Model Commands
- [x] Refactor `archivebox/cli/archivebox_snapshot.py` to Click group with create|list|update|delete
- [x] Refactor `archivebox/cli/archivebox_crawl.py` to Click group with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_archiveresult.py` with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_tag.py` with create|list|update|delete

### Phase 3: System Model Commands
- [x] Create `archivebox/cli/archivebox_binary.py` with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_process.py` with list only (read-only)
- [x] Create `archivebox/cli/archivebox_machine.py` with list only (read-only)

### Phase 4: Registry & Cleanup
- [x] Update `archivebox/cli/__init__.py` command registry
- [x] Delete `archivebox/cli/archivebox_extract.py`
- [x] Delete `archivebox/cli/archivebox_remove.py`
- [x] Delete `archivebox/cli/archivebox_search.py`
- [x] Delete `archivebox/cli/archivebox_orchestrator.py`
- [x] Update `archivebox/cli/archivebox_add.py` internals (no changes needed - uses models directly)
- [x] Update `archivebox/cli/tests_piping.py`

### Phase 5: Tests for New Commands
- [ ] Add tests for `archivebox run` command
- [ ] Add tests for `archivebox crawl create|list|update|delete`
- [ ] Add tests for `archivebox snapshot create|list|update|delete`
- [ ] Add tests for `archivebox archiveresult create|list|update|delete`
- [ ] Add tests for `archivebox tag create|list|update|delete`
- [ ] Add tests for `archivebox binary create|list|update|delete`
- [ ] Add tests for `archivebox process list`
- [ ] Add tests for `archivebox machine list`

---

## Usage Examples

### Basic CRUD
```bash
# Create
archivebox crawl create https://example.com https://foo.com --depth=1
archivebox snapshot create https://example.com --tag=news

# List with filters
archivebox crawl list --status=queued
archivebox snapshot list --url__icontains=example.com
archivebox archiveresult list --status=failed --plugin=screenshot

# Update (reads JSONL from stdin, applies changes)
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new

# Delete (requires --yes)
archivebox crawl list --url__icontains=example.com | archivebox crawl delete --yes
```

### Unified Run Command
```bash
# Run orchestrator in foreground (replaces `archivebox orchestrator`)
archivebox run

# Process specific records (pipe any JSONL type, exits when done)
archivebox snapshot list --status=queued | archivebox run
archivebox archiveresult list --status=failed | archivebox run
archivebox crawl list --status=queued | archivebox run

# Mixed types work too - run handles any JSONL
cat mixed_records.jsonl | archivebox run
```

### Composed Workflows
```bash
# Full pipeline (replaces old `archivebox add`)
archivebox crawl create https://example.com --status=queued \
  | archivebox snapshot create --status=queued \
  | archivebox archiveresult create --status=queued \
  | archivebox run

# Re-run failed extractions
archivebox archiveresult list --status=failed | archivebox run

# Delete all snapshots for a domain
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes
```

### Keep `archivebox add` as convenience
```bash
# This remains the simple user-friendly interface:
archivebox add https://example.com --depth=1 --tag=news

# Internally equivalent to the composed pipeline above
```

532
old/TODO_hook_concurrency.md
Normal file
@@ -0,0 +1,532 @@
# ArchiveBox Hook Script Concurrency & Execution Plan

## Overview

Snapshot.run() should enforce that snapshot hooks are run in **10 discrete, sequential "steps"**: `0*`, `1*`, `2*`, `3*`, `4*`, `5*`, `6*`, `7*`, `8*`, `9*`.

For every discovered hook script, ArchiveBox should create an ArchiveResult in `queued` state, then manage running them using `retry_at` and inline logic to enforce this ordering.

## Design Decisions

### ArchiveResult Schema
- Add `ArchiveResult.hook_name` (CharField, nullable) - just filename, e.g., `'on_Snapshot__20_chrome_tab.bg.js'`
- Keep `ArchiveResult.plugin` - still important (plugin directory name)
- Step number derived on-the-fly from `hook_name` via `extract_step(hook_name)` - not stored

### Snapshot Schema
- Add `Snapshot.current_step` (IntegerField 0-9, default=0)
- Integrate with `SnapshotMachine` state transitions for step advancement

### Hook Discovery & Execution
- `Snapshot.run()` discovers all hooks upfront, creates one AR per hook with `hook_name` set
- All ARs for a given step can be claimed and executed in parallel by workers
- Workers claim ARs where `extract_step(ar.hook_name) <= snapshot.current_step`
- `Snapshot.advance_step_if_ready()` increments `current_step` when:
  - All **foreground** hooks in current step are finished (SUCCEEDED/FAILED/SKIPPED)
  - Background hooks don't block advancement (they continue running)
- Called from `SnapshotMachine` state transitions

### ArchiveResult.run() Behavior
- If `self.hook_name` is set: run that single hook
- If `self.hook_name` is None: discover all hooks for `self.plugin` and run sequentially
- Background hooks detected by `.bg.` in filename (e.g., `on_Snapshot__20_chrome_tab.bg.js`)
- Background hooks return immediately (ArchiveResult stays in STARTED state)
- Foreground hooks wait for completion, update status from JSONL output

### Hook Execution Flow
1. **Within a step**: Workers claim all ARs for current step in parallel
2. **Foreground hooks** (no .bg): ArchiveResult waits for completion, transitions to SUCCEEDED/FAILED/SKIPPED
3. **Background hooks** (.bg): ArchiveResult transitions to STARTED, hook continues running
4. **Step advancement**: `Snapshot.advance_step_if_ready()` checks:
   - Are all foreground ARs in current step finished? (SUCCEEDED/FAILED/SKIPPED)
   - Ignore ARs still in STARTED (background hooks)
   - If yes, increment `current_step`
5. **Snapshot sealing**: When `current_step=9` and all foreground hooks done, kill background hooks via `Snapshot.cleanup()`

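The advancement rule in step 4 can be sketched as a pure function. Each AR is modeled here as a `(step, is_background, status)` tuple for illustration; the real `Snapshot.advance_step_if_ready()` queries the ArchiveResult table instead:

```python
# Hedged sketch of the step-advancement check: advance past the current step
# only once every *foreground* hook in it has finished; background hooks
# never block, and step 9 is the final step.
FINISHED = {'succeeded', 'failed', 'skipped'}

def advance_step_if_ready(current_step: int, archiveresults) -> int:
    """Return the snapshot's next step given its ARs as (step, is_bg, status)."""
    foreground = [
        status for step, is_bg, status in archiveresults
        if step == current_step and not is_bg
    ]
    if all(status in FINISHED for status in foreground) and current_step < 9:
        return current_step + 1
    return current_step
```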
### Unnumbered Hooks
- Extract step via `re.search(r'__(\d{2})_', hook_name)`, default to 9 if no match
- Log warning for unnumbered hooks
- Purely runtime derivation - no stored field

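The two derivation helpers described above can be sketched directly from that regex (the plan puts them in `archivebox/hooks.py`; the exact signatures there may differ):

```python
# Minimal sketch of the runtime step/background derivation for hook filenames.
import re

def extract_step(hook_name: str) -> int:
    """Return the step (first digit of the __XX_ run order), defaulting to 9."""
    match = re.search(r'__(\d{2})_', hook_name)
    return int(match.group(1)[0]) if match else 9

def is_background_hook(hook_name: str) -> bool:
    """Background hooks are marked with .bg. in the filename."""
    return '.bg.' in hook_name
```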
## Hook Numbering Convention

Hook scripts are numbered `00` to `99` to control:
- **First digit (0-9)**: Which step they are part of
- **Second digit (0-9)**: Order within that step

Hook scripts are launched in filename (alphabetical) order, in sets of several per step: all hooks in a step are launched together, and execution only moves on once the step is done.

**Naming Format:**
```
on_{ModelName}__{run_order}_{human_readable_description}[.bg].{ext}
```

**Examples:**
```
on_Snapshot__00_this_would_run_first.sh
on_Snapshot__05_start_ytdlp_download.bg.sh
on_Snapshot__10_chrome_tab_opened.js
on_Snapshot__50_screenshot.js
on_Snapshot__53_media.bg.py
```

## Background (.bg) vs Foreground Scripts

### Foreground Scripts (no .bg suffix)
- Launch in parallel with other hooks in their step
- Step waits for all foreground hooks to complete or timeout
- Get killed with SIGTERM if they exceed their `PLUGINNAME_TIMEOUT`
- Step advances when all foreground hooks finish

### Background Scripts (.bg suffix)
- Launch in parallel with other hooks in their step
- Do NOT block step progression - step can advance while they run
- Continue running across step boundaries until complete or timeout
- Get killed with SIGTERM when Snapshot transitions to SEALED (via `Snapshot.cleanup()`)
- Should exit naturally when work is complete (best case)

**Important:** A .bg script started in step 2 can keep running through steps 3, 4, 5... until the Snapshot seals or the hook exits naturally.

## Execution Step Guidelines

These are **naming conventions and guidelines**, not enforced checkpoints. They provide semantic organization for plugin ordering:

### Step 0: Pre-Setup
```
00-09: Initial setup, validation, feature detection
```

### Step 1: Chrome Launch & Tab Creation
```
10-19: Browser/tab lifecycle setup
- Chrome browser launch
- Tab creation and CDP connection
```

### Step 2: Navigation & Settlement
```
20-29: Page loading and settling
- Navigate to URL
- Wait for page load
- Initial response capture (responses, ssl, consolelog as .bg listeners)
```

### Step 3: Page Adjustment
```
30-39: DOM manipulation before archiving
- Hide popups/banners
- Solve captchas
- Expand comments/details sections
- Inject custom CSS/JS
- Accessibility modifications
```

### Step 4: Ready for Archiving
```
40-49: Final pre-archiving checks
- Verify page is fully adjusted
- Wait for any pending modifications
```

### Step 5: DOM Extraction (Sequential, Non-BG)
```
50-59: Extractors that need exclusive DOM access
- singlefile (MUST NOT be .bg)
- screenshot (MUST NOT be .bg)
- pdf (MUST NOT be .bg)
- dom (MUST NOT be .bg)
- title
- headers
- readability
- mercury

These MUST run sequentially as they temporarily modify the DOM
during extraction, then revert it. Running in parallel would corrupt results.
```

### Step 6: Post-DOM Extraction
```
60-69: Extractors that don't need DOM or run on downloaded files
- wget
- git
- media (.bg - can run for hours)
- gallerydl (.bg)
- forumdl (.bg)
- papersdl (.bg)
```

### Step 7: Chrome Cleanup
```
70-79: Browser/tab teardown
- Close tabs
- Cleanup Chrome resources
```

### Step 8: Post-Processing
```
80-89: Reprocess outputs from earlier extractors
- OCR of images
- Audio/video transcription
- URL parsing from downloaded content (rss, html, json, txt, csv, md)
- LLM analysis/summarization of outputs
```

### Step 9: Indexing & Finalization
```
90-99: Save to indexes and finalize
- Index text content to Sonic/SQLite FTS
- Create symlinks
- Generate merkle trees
- Final status updates
```

## Hook Script Interface

### Input: CLI Arguments (NOT stdin)
Hooks receive configuration as CLI flags (CSV or JSON-encoded):

```bash
--url="https://example.com"
--snapshot-id="1234-5678-uuid"
--config='{"some_key": "some_value"}'
--plugins=git,media,favicon,title
--timeout=50
--enable-something
```

### Input: Environment Variables
All configuration comes from env vars, defined in `plugin_dir/config.json` JSONSchema:

```bash
WGET_BINARY=/usr/bin/wget
WGET_TIMEOUT=60
WGET_USER_AGENT="Mozilla/5.0..."
WGET_EXTRA_ARGS="--no-check-certificate"
SAVE_WGET=True
```

**Required:** Every plugin must support `PLUGINNAME_TIMEOUT` for self-termination.

### Output: Filesystem (CWD)
Hooks read/write files to:
- `$CWD`: Their own output subdirectory (e.g., `archive/snapshots/{id}/wget/`)
- `$CWD/..`: Parent directory (to read outputs from other hooks)

This allows hooks to:
- Access files created by other hooks
- Keep their outputs separate by default
- Use semaphore files for coordination (if needed)

### Output: JSONL to stdout
Hooks emit one JSONL line per database record they want to create or update:

```jsonl
{"type": "Tag", "name": "sci-fi"}
{"type": "ArchiveResult", "id": "1234-uuid", "status": "succeeded", "output_str": "wget/index.html"}
{"type": "Snapshot", "id": "5678-uuid", "title": "Example Page"}
```

See `archivebox/misc/jsonl.py` and model `from_json()` / `from_jsonl()` methods for full list of supported types and fields.

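The ingestion side of this contract can be sketched as a small parser that groups a hook's stdout records by their `"type"` field before dispatching each group to the matching model's `from_json()`. The grouping function below is illustrative, not the actual `archivebox.misc.jsonl` API:

```python
# Hedged sketch of parsing a hook's stdout: one JSON record per line,
# grouped by record type so each group can be handed to its model.
import json

def parse_hook_output(stdout: str) -> dict[str, list[dict]]:
    """Group the JSONL records a hook emitted, keyed by record type."""
    records: dict[str, list[dict]] = {}
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        records.setdefault(record.get('type', 'unknown'), []).append(record)
    return records
```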
### Output: stderr for Human Logs
Hooks should emit human-readable output and debug info to **stderr**. There is no guarantee it will be persisted long-term; use stdout JSONL or the filesystem for outputs that matter.

### Cleanup: Delete Cruft
If a hook emits no meaningful long-term outputs, it should delete its own temporary files to avoid wasting space. However, the ArchiveResult DB row should be kept so we know:
- It doesn't need to be retried
- It isn't missing
- What happened (status, error message)

### Signal Handling: SIGINT/SIGTERM
Hooks are expected to listen for a polite `SIGINT`/`SIGTERM`, finish hastily, and exit cleanly. Beyond that, they may be `SIGKILL`'d at ArchiveBox's discretion.

**If hooks double-fork or spawn long-running processes:** They must write a `.pid` file in their directory so zombies can be swept safely.

## Hook Failure Modes & Retry Logic

Hooks can fail in several ways. ArchiveBox handles each differently:

### 1. Soft Failure (Record & Don't Retry)
**Exit:** `0` (success)
**JSONL:** `{"type": "ArchiveResult", "status": "failed", "output_str": "404 Not Found"}`

This means: "I ran successfully, but the resource wasn't available." Don't retry this.

**Use cases:**
- 404 errors
- Content not available
- Feature not applicable to this URL

### 2. Hard Failure / Temporary Error (Retry Later)
**Exit:** Non-zero (1, 2, etc.)
**JSONL:** None (or incomplete)

This means: "Something went wrong, I couldn't complete." Treat this ArchiveResult as "missing" and set `retry_at` for later.

**Use cases:**
- 500 server errors
- Network timeouts
- Binary not found / crashed
- Transient errors

**Behavior:**
- ArchiveBox sets `retry_at` on the ArchiveResult
- Hook will be retried during next `archivebox update`

### 3. Partial Success (Update & Continue)
**Exit:** Non-zero
**JSONL:** Partial records emitted before crash

**Behavior:**
- Update ArchiveResult with whatever was emitted
- Mark remaining work as "missing" with `retry_at`

### 4. Success (Record & Continue)
**Exit:** `0`
**JSONL:** `{"type": "ArchiveResult", "status": "succeeded", "output_str": "output/file.html"}`

This is the happy path.

### Error Handling Rules

- **DO NOT skip hooks** based on failures
- **Continue to next hook** regardless of foreground or background failures
- **Update ArchiveResults** with whatever information is available
- **Set retry_at** for "missing" or temporarily-failed hooks
- **Let background scripts continue** even if foreground scripts fail

## File Structure

```
archivebox/plugins/{plugin_name}/
├── config.json                  # JSONSchema: env var config options
├── binaries.jsonl               # Runtime dependencies: apt|brew|pip|npm|env
├── on_Snapshot__XX_name.py      # Hook script (foreground)
├── on_Snapshot__XX_name.bg.py   # Hook script (background)
└── tests/
    └── test_name.py
```

## Implementation Checklist

### Phase 1: Schema Migration ✅
- [x] Add `Snapshot.current_step` (IntegerField 0-9, default=0)
- [x] Add `ArchiveResult.hook_name` (CharField, nullable) - just filename
- [x] Create migration: `0034_snapshot_current_step.py`

### Phase 2: Core Logic Updates ✅
- [x] Add `extract_step(hook_name)` utility in `archivebox/hooks.py`
  - Extract first digit from `__XX_` pattern
  - Default to 9 for unnumbered hooks
- [x] Add `is_background_hook(hook_name)` utility in `archivebox/hooks.py`
  - Check for `.bg.` in filename
- [x] Update `Snapshot.create_pending_archiveresults()` in `archivebox/core/models.py`:
  - Discover all hooks (not plugins)
  - Create one AR per hook with `hook_name` set
- [x] Update `ArchiveResult.run()` in `archivebox/core/models.py`:
  - If `hook_name` set: run single hook
  - If `hook_name` None: discover all plugin hooks (existing behavior)
- [x] Add `Snapshot.advance_step_if_ready()` method:
  - Check if all foreground ARs in current step finished
  - Increment `current_step` if ready
  - Ignore background hooks (.bg) in completion check
- [x] Integrate with `SnapshotMachine.is_finished()` in `archivebox/core/statemachines.py`:
  - Call `advance_step_if_ready()` before checking if done

### Phase 3: Worker Coordination ✅
- [x] Update worker AR claiming query in `archivebox/workers/worker.py`:
  - Filter: `extract_step(ar.hook_name) <= snapshot.current_step`
  - Claims ARs in QUEUED state, checks step in Python before processing
  - Orders by hook_name for deterministic execution within step

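The claim filter in Phase 3 can be sketched as a pure function over AR dicts (the real query runs against the ORM in `archivebox/workers/worker.py`), reusing the regex-based step derivation described earlier in this plan:

```python
# Hedged sketch of the worker claim rule: queued ARs whose hook step has
# been reached, ordered by hook_name for deterministic execution.
import re

def extract_step(hook_name: str) -> int:
    """First digit of the __XX_ run order, defaulting to 9."""
    match = re.search(r'__(\d{2})_', hook_name)
    return int(match.group(1)[0]) if match else 9

def claimable(archiveresults: list[dict], current_step: int) -> list[dict]:
    """Return the ARs a worker may claim for a snapshot at current_step."""
    ready = [
        ar for ar in archiveresults
        if ar['status'] == 'queued' and extract_step(ar['hook_name']) <= current_step
    ]
    return sorted(ready, key=lambda ar: ar['hook_name'])
```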
### Phase 4: Hook Renumbering ✅
- [x] Renumber hooks per renumbering map below
- [x] Add `.bg` suffix to long-running hooks (media, gallerydl, forumdl, papersdl)
- [x] Move parse_* hooks to step 7 (70-79)
- [x] Test all hooks still work after renumbering

## Migration Path

### Natural Compatibility
No special migration needed:
1. Existing ARs with `hook_name=None` continue to work (discover all plugin hooks at runtime)
2. New ARs get `hook_name` set (single hook per AR)
3. `ArchiveResult.run()` handles both cases naturally
4. Unnumbered hooks default to step 9 (log warning)

### Renumbering Map

**Completed Renames:**
```
# Step 5: DOM Extraction (sequential, non-background)
singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py ✅
screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js ✅
pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js ✅
dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js ✅
title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js ✅
readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py ✅
headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js ✅
mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__56_mercury.py ✅
htmltotext/on_Snapshot__54_htmltotext.py → htmltotext/on_Snapshot__57_htmltotext.py ✅

# Step 6: Post-DOM Extraction (background for long-running)
wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py ✅
git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py ✅
media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py ✅
gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py ✅
forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py ✅
papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py ✅

# Step 7: URL Extraction (parse_* hooks moved from step 6)
parse_html_urls/on_Snapshot__60_parse_html_urls.py → parse_html_urls/on_Snapshot__70_parse_html_urls.py ✅
parse_txt_urls/on_Snapshot__62_parse_txt_urls.py → parse_txt_urls/on_Snapshot__71_parse_txt_urls.py ✅
parse_rss_urls/on_Snapshot__61_parse_rss_urls.py → parse_rss_urls/on_Snapshot__72_parse_rss_urls.py ✅
parse_netscape_urls/on_Snapshot__63_parse_netscape_urls.py → parse_netscape_urls/on_Snapshot__73_parse_netscape_urls.py ✅
parse_jsonl_urls/on_Snapshot__64_parse_jsonl_urls.py → parse_jsonl_urls/on_Snapshot__74_parse_jsonl_urls.py ✅
parse_dom_outlinks/on_Snapshot__40_parse_dom_outlinks.js → parse_dom_outlinks/on_Snapshot__75_parse_dom_outlinks.js ✅
```

## Testing Strategy

### Unit Tests
- Test hook ordering (00-99)
- Test step grouping (first digit)
- Test .bg vs foreground execution
- Test timeout enforcement
- Test JSONL parsing
- Test failure modes & retry_at logic

### Integration Tests
- Test full Snapshot.run() with mixed hooks
- Test .bg scripts running beyond step 99
- Test zombie process cleanup
- Test graceful SIGTERM handling
- Test concurrent .bg script coordination

### Performance Tests
- Measure overhead of per-hook ArchiveResults
- Test with 50+ concurrent .bg scripts
- Test filesystem contention with many hooks

## Open Questions

### Q: Should we provide semaphore utilities?
**A:** No. Keep plugins decoupled. Let them use simple filesystem coordination if needed.

### Q: What happens if the ArchiveResult table gets huge?
**A:** We can delete old successful ArchiveResults periodically, or archive them to cold storage. The important data is in the filesystem outputs.

### Q: Should naturally-exiting .bg scripts still be .bg?
**A:** Yes. The .bg suffix means "don't block step progression," not "run until step 99." Natural exit is the best case.

## Examples

### Foreground Hook (Sequential DOM Access)
```python
#!/usr/bin/env python3
# archivebox/plugins/screenshot/on_Snapshot__51_screenshot.py
# (illustrative Python sketch; the actual screenshot hook is a .js file)

# Runs at step 5, blocks step progression until complete.
# Gets killed if it exceeds SCREENSHOT_TIMEOUT.
import json
import subprocess
import sys

# get_env_int() and cmd are assumed to be provided by the plugin's helpers
timeout = get_env_int('SCREENSHOT_TIMEOUT') or get_env_int('TIMEOUT', 60)

try:
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "screenshot.png",
        }))
        sys.exit(0)
    else:
        # Temporary failure - will be retried
        sys.exit(1)
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```

### Background Hook (Long-Running Download)
```python
#!/usr/bin/env python3
# archivebox/plugins/ytdlp/on_Snapshot__63_ytdlp.bg.py

# Runs at step 6, doesn't block step progression.
# Gets the full YTDLP_TIMEOUT (e.g., 3600s) regardless of when step 99 completes.
import json
import subprocess
import sys

# get_env_int() and url (parsed from the --url flag) are assumed helpers
timeout = get_env_int('YTDLP_TIMEOUT') or get_env_int('TIMEOUT', 3600)

try:
    result = subprocess.run(['yt-dlp', url], capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "media/",
        }))
        sys.exit(0)
    else:
        # Soft failure - record it, don't retry
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "failed",
            "output_str": "Video unavailable",
        }))
        sys.exit(0)  # Exit 0 to record the failure
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
```

### Background Hook with Natural Exit
```javascript
#!/usr/bin/env node
// archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js

// Sets up a listener, captures SSL info, then exits naturally.
// No SIGTERM handler needed - it already exits when done.
const fs = require('fs');

// connectToChrome() and waitForNavigation() are assumed plugin helpers
async function main() {
    const page = await connectToChrome();

    // Set up listener
    page.on('response', async (response) => {
        const securityDetails = response.securityDetails();
        if (securityDetails) {
            fs.writeFileSync('ssl.json', JSON.stringify(securityDetails));
        }
    });

    // Wait for navigation (done by another hook)
    await waitForNavigation();

    // Emit result
    console.log(JSON.stringify({
        type: 'ArchiveResult',
        status: 'succeeded',
        output_str: 'ssl.json',
    }));

    process.exit(0);  // Natural exit - no indefinite await
}

main().catch(e => {
    console.error(`ERROR: ${e.message}`);
    process.exit(1);  // Will be retried
});
```

## Summary

This plan provides:
- ✅ Clear execution ordering (10 steps, 00-99 numbering)
- ✅ Async support (.bg suffix)
- ✅ Independent timeout control per plugin
- ✅ Flexible failure handling & retry logic
- ✅ Streaming JSONL output for DB updates
- ✅ Simple filesystem-based coordination
- ✅ Backward compatibility during migration

The main implementation work is refactoring `Snapshot.run()` to enforce step ordering and manage .bg script lifecycles. Plugin renumbering is straightforward mechanical work.
1947
old/TODO_process_tracking.md
Normal file
File diff suppressed because it is too large
6108
old/archivebox.ts
Normal file
File diff suppressed because it is too large