Merge branch 'dev' into claude/refactor-process-management-WcQyZ

Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
This commit is contained in:
claude[bot]
2025-12-31 11:28:47 +00:00
68 changed files with 9076 additions and 1447 deletions


@@ -27,6 +27,17 @@ uv sync --dev --all-extras # Always use uv, never pip directly
source .venv/bin/activate
```
### Generate and Apply Migrations
```bash
# Generate migrations (run from archivebox subdirectory)
cd archivebox
./manage.py makemigrations
# Apply migrations to test database
cd data/
archivebox init
```
## Running Tests
### CRITICAL: Never Run as Root


@@ -0,0 +1,716 @@
# ArchiveBox CLI Pipeline Architecture
## Overview
This plan implements a JSONL-based CLI pipeline for ArchiveBox, enabling Unix-style piping between commands:
```bash
archivebox crawl create URL | archivebox snapshot create | archivebox archiveresult create | archivebox run
```
## Design Principles
1. **Maximize model method reuse**: Use `.to_json()`, `.from_json()`, `.to_jsonl()`, `.from_jsonl()` everywhere
2. **Pass-through behavior**: All commands output input records + newly created records (accumulating pipeline)
3. **Create-or-update**: Commands create records if they don't exist, update if ID matches existing
4. **Auto-cascade**: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots
5. **Generic filtering**: Implement filters as functions that take queryset → return queryset
6. **Minimal code**: Extract duplicated `apply_filters()` to shared module
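The principles above all lean on a shared JSONL record shape: every line is one JSON object carrying a `type` discriminator plus model fields. A minimal sketch of that convention (the helper names here are illustrative, not the real model methods):

```python
import json

def to_jsonl_record(record_type: str, fields: dict) -> str:
    """Serialize one record as a single JSONL line with a 'type' discriminator."""
    return json.dumps({"type": record_type, **fields})

def from_jsonl_record(line: str) -> dict:
    """Parse one JSONL line back into a dict (inverse of to_jsonl_record)."""
    return json.loads(line)

line = to_jsonl_record("Snapshot", {"url": "https://example.com", "status": "queued"})
record = from_jsonl_record(line)
```

Because every record is self-describing via `type`, any command can safely pass through records it does not handle.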
---
## Real-World Use Cases
These examples demonstrate the JSONL piping architecture. Key points:
- `archivebox run` auto-cascades (Crawl → Snapshots → ArchiveResults)
- `archivebox run` **emits JSONL** of everything it creates, enabling chained processing
- Use CLI args (`--status=`, `--plugin=`) for efficient DB filtering; use jq for transforms
### 1. Basic Archive
```bash
# Simple URL archive (run auto-creates snapshots and archive results)
archivebox crawl create https://example.com | archivebox run
# Multiple URLs from a file
archivebox crawl create < urls.txt | archivebox run
# With depth crawling (follow links)
archivebox crawl create --depth=2 https://docs.python.org | archivebox run
```
### 2. Retry Failed Extractions
```bash
# Retry all failed extractions
archivebox archiveresult list --status=failed | archivebox run
# Retry only failed PDFs from a specific domain
archivebox archiveresult list --status=failed --plugin=pdf --url__icontains=nytimes.com \
| archivebox run
```
### 3. Import Bookmarks from Pinboard (jq transform)
```bash
# Fetch Pinboard API, transform fields to match ArchiveBox schema, archive
curl -s "https://api.pinboard.in/v1/posts/all?format=json&auth_token=$TOKEN" \
| jq -c '.[] | {url: .href, tags_str: .tags, title: .description}' \
| archivebox crawl create \
| archivebox run
```
### 4. Retry Failed with Different Binary (jq transform + re-run)
```bash
# Get failed wget results, transform to use wget2 binary instead, re-queue as new attempts
archivebox archiveresult list --status=failed --plugin=wget \
| jq -c '{snapshot_id, plugin, status: "queued", overrides: {WGET_BINARY: "wget2"}}' \
| archivebox archiveresult create \
| archivebox run
# Chain processing: archive, then re-run any failures with increased timeout
archivebox crawl create https://slow-site.com \
| archivebox run \
| jq -c 'select(.type == "ArchiveResult" and .status == "failed")
| del(.id) | .status = "queued" | .overrides.TIMEOUT = "120"' \
| archivebox archiveresult create \
| archivebox run
```
### 5. Selective Extraction
```bash
# Create only screenshot extractions for queued snapshots
archivebox snapshot list --status=queued \
| archivebox archiveresult create --plugin=screenshot \
| archivebox run
# Re-run singlefile on everything that was skipped
archivebox archiveresult list --plugin=singlefile --status=skipped \
| archivebox archiveresult update --status=queued \
| archivebox run
```
### 6. Bulk Tag Management
```bash
# Tag all Twitter/X URLs (efficient DB filter, no jq needed)
archivebox snapshot list --url__icontains=twitter.com \
| archivebox snapshot update --tag=twitter
# Tag snapshots based on computed criteria (jq for logic DB can't do)
archivebox snapshot list --status=sealed \
| jq -c 'select(.archiveresult_count > 5) | . + {tags_str: (.tags_str + ",well-archived")}' \
| archivebox snapshot update
```
### 7. RSS Feed Monitoring
```bash
# Archive all items from an RSS feed
curl -s "https://hnrss.org/frontpage" \
| xq -r '.rss.channel.item[].link' \
| archivebox crawl create --tag=hackernews-$(date +%Y%m%d) \
| archivebox run
```
### 8. Recursive Link Following (run output → filter → re-run)
```bash
# Archive a page, then archive all PDFs it links to
archivebox crawl create https://research-papers.org/index.html \
| archivebox run \
| jq -c 'select(.type == "Snapshot") | .discovered_urls[]?
| select(endswith(".pdf")) | {url: .}' \
| archivebox crawl create --tag=linked-pdfs \
| archivebox run
# Depth crawl with custom handling: retry timeouts with longer timeout
archivebox crawl create --depth=1 https://example.com \
| archivebox run \
| jq -c 'select(.type == "ArchiveResult" and .status == "failed" and (.error | contains("timeout")))
| del(.id) | .overrides.TIMEOUT = "300"' \
| archivebox archiveresult create \
| archivebox run
```
### Composability Summary
| Pattern | Example |
|---------|---------|
| **Filter → Process** | `list --status=failed --plugin=pdf \| run` |
| **Transform → Archive** | `curl API \| jq '{url, tags_str}' \| crawl create \| run` |
| **Retry w/ Changes** | `run \| jq 'select(.status=="failed") \| del(.id)' \| create \| run` |
| **Selective Extract** | `snapshot list \| archiveresult create --plugin=screenshot` |
| **Bulk Update** | `list --url__icontains=X \| update --tag=Y` |
| **Chain Processing** | `crawl \| run \| jq transform \| create \| run` |
The key insight: **`archivebox run` emits JSONL of everything it creates**, enabling:
- Retry failed items with different settings (timeouts, binaries, etc.)
- Recursive crawling (archive page → extract links → archive those)
- Chained transforms (filter failures, modify config, re-queue)
---
## Code Reuse Findings
### Existing Model Methods (USE THESE)
- `Crawl.to_json()`, `Crawl.from_json()`, `Crawl.to_jsonl()`, `Crawl.from_jsonl()`
- `Snapshot.to_json()`, `Snapshot.from_json()`, `Snapshot.to_jsonl()`, `Snapshot.from_jsonl()`
- `Tag.to_json()`, `Tag.from_json()`, `Tag.to_jsonl()`, `Tag.from_jsonl()`
### Missing Model Methods (MUST IMPLEMENT)
- **`ArchiveResult.from_json()`** - Does not exist, must be added
- **`ArchiveResult.from_jsonl()`** - Does not exist, must be added
### Existing Utilities (USE THESE)
- `archivebox/misc/jsonl.py`: `read_stdin()`, `read_args_or_stdin()`, `write_record()`, `parse_line()`
- Type constants: `TYPE_CRAWL`, `TYPE_SNAPSHOT`, `TYPE_ARCHIVERESULT`, etc.
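For orientation, here are hypothetical minimal implementations consistent with the helper names above; the real signatures and constant values in `archivebox/misc/jsonl.py` may differ:

```python
import json
import sys
from typing import Optional

# Assumed constant values; the real module defines these
TYPE_CRAWL, TYPE_SNAPSHOT, TYPE_ARCHIVERESULT = "Crawl", "Snapshot", "ArchiveResult"

def parse_line(line: str) -> Optional[dict]:
    """Parse one JSONL line, returning None for blank or non-JSON lines."""
    line = line.strip()
    if not line.startswith("{"):
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

def write_record(record: dict, file=sys.stdout) -> None:
    """Write one record as a single compact JSONL line."""
    file.write(json.dumps(record, separators=(",", ":")) + "\n")
```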
### Duplicated Code (EXTRACT)
- `apply_filters()` duplicated in 7 CLI files → extract to `archivebox/cli/cli_utils.py`
### Supervisord Config (UPDATE)
- `archivebox/workers/supervisord_util.py` line ~35: `"command": "archivebox manage orchestrator"` → `"command": "archivebox run"`
### Field Name Standardization (FIX)
- **Issue**: `Crawl.to_json()` outputs `tags_str`, but `Snapshot.to_json()` outputs `tags`
- **Fix**: Standardize all models to use `tags_str` in JSONL output (matches model property names)
---
## Implementation Order
### Phase 1: Model Prerequisites
1. **Implement `ArchiveResult.from_json()`** in `archivebox/core/models.py`
- Pattern: Match `Snapshot.from_json()` and `Crawl.from_json()` style
- Handle: ID lookup (update existing) or create new
- Required fields: `snapshot_id`, `plugin`
- Optional fields: `status`, `hook_name`, etc.
2. **Implement `ArchiveResult.from_jsonl()`** in `archivebox/core/models.py`
- Filter records by `type='ArchiveResult'`
- Call `from_json()` for each matching record
3. **Fix `Snapshot.to_json()` field name**
- Change `'tags': self.tags_str()` → `'tags_str': self.tags_str()`
- Update any code that depends on `tags` key in Snapshot JSONL
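Step 2 above reduces to type-filtering plus delegation. The real `from_jsonl()` would call `ArchiveResult.from_json()` on each match; this pure-Python sketch shows just the filtering step:

```python
import json
from typing import Iterable, Iterator

def filter_records(lines: Iterable[str], record_type: str) -> Iterator[dict]:
    """Yield parsed records whose 'type' matches record_type; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") == record_type:
            yield record  # from_jsonl() would call from_json(record) here

lines = [
    '{"type": "Snapshot", "id": "s1"}',
    '{"type": "ArchiveResult", "id": "a1", "plugin": "wget"}',
]
results = list(filter_records(lines, "ArchiveResult"))
```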
### Phase 2: Shared Utilities
4. **Extract `apply_filters()` to `archivebox/cli/cli_utils.py`**
- Generic queryset filtering from CLI kwargs
- Support `--id__in=[csv]`, `--url__icontains=str`, etc.
- Remove duplicates from 7 CLI files
### Phase 3: Pass-Through Behavior (NEW FEATURE)
5. **Add pass-through to `archivebox crawl create`**
- Output non-Crawl input records unchanged
- Output created Crawl records
6. **Add pass-through to `archivebox snapshot create`**
- Output non-Snapshot/non-Crawl input records unchanged
- Process Crawl records → create Snapshots
- Output both original Crawl and created Snapshots
7. **Add pass-through to `archivebox archiveresult create`**
- Output non-Snapshot/non-ArchiveResult input records unchanged
- Process Snapshot records → create ArchiveResults
- Output both original Snapshots and created ArchiveResults
8. **Add create-or-update to `archivebox run`**
- Records WITH id: lookup and queue existing
- Records WITHOUT id: create via `Model.from_json()`, then queue
- Pass-through output of all processed records
### Phase 4: Test Infrastructure
9. **Create `archivebox/tests/conftest.py`** with pytest-django
- Use `pytest-django` for proper test database handling
- Isolated DATA_DIR per test via `tmp_path` fixture
- `run_archivebox_cmd()` helper for subprocess testing
### Phase 5: Unit Tests
10. **Create `archivebox/tests/test_cli_crawl.py`** - crawl create/list/pass-through tests
11. **Create `archivebox/tests/test_cli_snapshot.py`** - snapshot create/list/pass-through tests
12. **Create `archivebox/tests/test_cli_archiveresult.py`** - archiveresult create/list/pass-through tests
13. **Create `archivebox/tests/test_cli_run.py`** - run command create-or-update tests
### Phase 6: Integration & Config
14. **Extend `archivebox/cli/tests_piping.py`** - Add pass-through integration tests
15. **Update supervisord config** - `orchestrator` → `run`
---
## Future Work (Deferred)
### Commands to Defer
- `archivebox tag create|list|update|delete` - Already works, defer improvements
- `archivebox binary create|list|update|delete` - Lower priority
- `archivebox process list` - Lower priority
- `archivebox apikey create|list|update|delete` - Lower priority
### `archivebox add` Relationship
- **Current**: `archivebox add` is the primary user-facing command, stays as-is
- **Future**: Refactor `add` to internally use `crawl create | snapshot create | run` pipeline
- **Note**: This refactor is deferred; `add` continues to work independently for now
---
## Key Files
| File | Action | Phase |
|------|--------|-------|
| `archivebox/core/models.py` | Add `ArchiveResult.from_json()`, `from_jsonl()` | 1 |
| `archivebox/core/models.py` | Fix `Snapshot.to_json()` to output `tags_str` | 1 |
| `archivebox/cli/cli_utils.py` | NEW - shared `apply_filters()` | 2 |
| `archivebox/cli/archivebox_crawl.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_snapshot.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_archiveresult.py` | Add pass-through to create | 3 |
| `archivebox/cli/archivebox_run.py` | Add create-or-update, pass-through | 3 |
| `archivebox/tests/conftest.py` | NEW - pytest fixtures | 4 |
| `archivebox/tests/test_cli_crawl.py` | NEW - crawl unit tests | 5 |
| `archivebox/tests/test_cli_snapshot.py` | NEW - snapshot unit tests | 5 |
| `archivebox/tests/test_cli_archiveresult.py` | NEW - archiveresult unit tests | 5 |
| `archivebox/tests/test_cli_run.py` | NEW - run unit tests | 5 |
| `archivebox/cli/tests_piping.py` | Extend with pass-through tests | 6 |
| `archivebox/workers/supervisord_util.py` | Update `orchestrator` → `run` | 6 |
---
## Implementation Details
### ArchiveResult.from_json() Design
```python
@staticmethod
def from_json(record: Dict[str, Any], overrides: Optional[Dict[str, Any]] = None) -> 'ArchiveResult | None':
    """
    Create or update a single ArchiveResult from a JSON record dict.

    Args:
        record: Dict with 'snapshot_id' and 'plugin' (required for create),
            or 'id' (for update)
        overrides: Dict of field overrides

    Returns:
        ArchiveResult instance or None if invalid
    """
    from django.utils import timezone

    overrides = overrides or {}

    # If 'id' is provided, look up and update the existing record
    result_id = record.get('id')
    if result_id:
        try:
            result = ArchiveResult.objects.get(id=result_id)
            # Update fields from record
            if record.get('status'):
                result.status = record['status']
            result.retry_at = timezone.now()
            result.save()
            return result
        except ArchiveResult.DoesNotExist:
            pass  # Fall through to create

    # Required fields for creation
    snapshot_id = record.get('snapshot_id')
    plugin = record.get('plugin')
    if not snapshot_id or not plugin:
        return None

    try:
        snapshot = Snapshot.objects.get(id=snapshot_id)
    except Snapshot.DoesNotExist:
        return None

    # Create or get existing result
    result, created = ArchiveResult.objects.get_or_create(
        snapshot=snapshot,
        plugin=plugin,
        defaults={
            'status': record.get('status', ArchiveResult.StatusChoices.QUEUED),
            'retry_at': timezone.now(),
            'hook_name': record.get('hook_name', ''),
            **overrides,
        },
    )

    # If not created, optionally reset for retry
    if not created and record.get('status'):
        result.status = record['status']
        result.retry_at = timezone.now()
        result.save()

    return result
```
### Pass-Through Pattern
All `create` commands follow this pattern:
```python
def create_X(args, ...):
    is_tty = sys.stdout.isatty()
    records = list(read_args_or_stdin(args))

    for record in records:
        record_type = record.get('type')

        # Pass-through: output records we don't handle
        if record_type not in HANDLED_TYPES:
            if not is_tty:
                write_record(record)
            continue

        # Handle our type: create via Model.from_json()
        obj = Model.from_json(record, overrides={...})

        # Output created record (hydrated with db id)
        if obj and not is_tty:
            write_record(obj.to_json())
```
### Pass-Through Semantics Example
```
Input:
{"type": "Crawl", "id": "abc", "urls": "https://example.com", ...}
{"type": "Tag", "name": "important"}
archivebox snapshot create output:
{"type": "Crawl", "id": "abc", ...} # pass-through (not our type)
{"type": "Tag", "name": "important"} # pass-through (not our type)
{"type": "Snapshot", "id": "xyz", ...} # created from Crawl URLs
```
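These semantics can be simulated in plain Python, independent of Django. A sketch (the `create_fn` callback stands in for `Model.from_json()` plus `to_json()`):

```python
def pass_through_create(records, handled_types, create_fn):
    """Emit unhandled records unchanged; for handled ones, emit input plus created records."""
    out = []
    for record in records:
        if record.get("type") not in handled_types:
            out.append(record)         # pass-through: not our type
            continue
        out.append(record)             # echo the input record
        out.extend(create_fn(record))  # plus everything created from it
    return out

records = [
    {"type": "Crawl", "id": "abc", "urls": "https://example.com"},
    {"type": "Tag", "name": "important"},
]
output = pass_through_create(
    records,
    handled_types={"Crawl"},
    create_fn=lambda crawl: [{"type": "Snapshot", "url": u} for u in crawl["urls"].split("\n")],
)
```

The accumulating output is what makes the pipeline composable: each stage sees everything upstream produced.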
### Create-or-Update Pattern for `archivebox run`
```python
def process_stdin_records() -> int:
    records = list(read_stdin())
    is_tty = sys.stdout.isatty()

    for record in records:
        record_type = record.get('type')
        record_id = record.get('id')

        # Create-or-update based on whether ID exists
        if record_type == TYPE_CRAWL:
            if record_id:
                try:
                    obj = Crawl.objects.get(id=record_id)
                except Crawl.DoesNotExist:
                    obj = Crawl.from_json(record)
            else:
                obj = Crawl.from_json(record)

            if obj:
                obj.retry_at = timezone.now()
                obj.save()
                if not is_tty:
                    write_record(obj.to_json())

        # Similar for Snapshot, ArchiveResult...
```
### Shared apply_filters() Design
Extract to `archivebox/cli/cli_utils.py`:
```python
"""Shared CLI utilities for ArchiveBox commands."""
from typing import Optional
def apply_filters(queryset, filter_kwargs: dict, limit: Optional[int] = None):
"""
Apply Django-style filters from CLI kwargs to a QuerySet.
Supports: --status=queued, --url__icontains=example, --id__in=uuid1,uuid2
Args:
queryset: Django QuerySet to filter
filter_kwargs: Dict of filter key-value pairs from CLI
limit: Optional limit on results
Returns:
Filtered QuerySet
"""
filters = {}
for key, value in filter_kwargs.items():
if value is None or key in ('limit', 'offset'):
continue
# Handle CSV lists for __in filters
if key.endswith('__in') and isinstance(value, str):
value = [v.strip() for v in value.split(',')]
filters[key] = value
if filters:
queryset = queryset.filter(**filters)
if limit:
queryset = queryset[:limit]
return queryset
```
---
## conftest.py Design (pytest-django)
```python
"""archivebox/tests/conftest.py - Pytest fixtures for CLI tests."""
import os
import sys
import json
import subprocess
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple
import pytest
# =============================================================================
# Fixtures
# =============================================================================
@pytest.fixture
def isolated_data_dir(tmp_path, settings):
"""
Create isolated DATA_DIR for each test.
Uses tmp_path for isolation, configures Django settings.
"""
data_dir = tmp_path / 'archivebox_data'
data_dir.mkdir()
# Set environment for subprocess calls
os.environ['DATA_DIR'] = str(data_dir)
# Update Django settings
settings.DATA_DIR = data_dir
yield data_dir
# Cleanup handled by tmp_path fixture
@pytest.fixture
def initialized_archive(isolated_data_dir):
"""
Initialize ArchiveBox archive in isolated directory.
Runs `archivebox init` to set up database and directories.
"""
from archivebox.cli.archivebox_init import init
init(setup=True, quick=True)
return isolated_data_dir
@pytest.fixture
def cli_env(initialized_archive):
"""
Environment dict for CLI subprocess calls.
Includes DATA_DIR and disables slow extractors.
"""
return {
**os.environ,
'DATA_DIR': str(initialized_archive),
'USE_COLOR': 'False',
'SHOW_PROGRESS': 'False',
'SAVE_TITLE': 'True',
'SAVE_FAVICON': 'False',
'SAVE_WGET': 'False',
'SAVE_WARC': 'False',
'SAVE_PDF': 'False',
'SAVE_SCREENSHOT': 'False',
'SAVE_DOM': 'False',
'SAVE_SINGLEFILE': 'False',
'SAVE_READABILITY': 'False',
'SAVE_MERCURY': 'False',
'SAVE_GIT': 'False',
'SAVE_YTDLP': 'False',
'SAVE_HEADERS': 'False',
}
# =============================================================================
# CLI Helpers
# =============================================================================

def run_archivebox_cmd(
    args: List[str],
    stdin: Optional[str] = None,
    cwd: Optional[Path] = None,
    env: Optional[Dict[str, str]] = None,
    timeout: int = 60,
) -> Tuple[str, str, int]:
    """
    Run archivebox command, return (stdout, stderr, returncode).

    Args:
        args: Command arguments (e.g., ['crawl', 'create', 'https://example.com'])
        stdin: Optional string to pipe to stdin
        cwd: Working directory (defaults to DATA_DIR from env)
        env: Environment variables (defaults to os.environ with DATA_DIR)
        timeout: Command timeout in seconds

    Returns:
        Tuple of (stdout, stderr, returncode)
    """
    cmd = [sys.executable, '-m', 'archivebox'] + args
    env = env or {**os.environ}
    cwd = cwd or Path(env.get('DATA_DIR', '.'))

    result = subprocess.run(
        cmd,
        input=stdin,
        capture_output=True,
        text=True,
        cwd=cwd,
        env=env,
        timeout=timeout,
    )
    return result.stdout, result.stderr, result.returncode
# =============================================================================
# Output Assertions
# =============================================================================

def parse_jsonl_output(stdout: str) -> List[Dict[str, Any]]:
    """Parse JSONL output into list of dicts."""
    records = []
    for line in stdout.strip().split('\n'):
        line = line.strip()
        if line and line.startswith('{'):
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                pass
    return records


def assert_jsonl_contains_type(stdout: str, record_type: str, min_count: int = 1):
    """Assert output contains at least min_count records of type."""
    records = parse_jsonl_output(stdout)
    matching = [r for r in records if r.get('type') == record_type]
    assert len(matching) >= min_count, \
        f"Expected >= {min_count} {record_type}, got {len(matching)}"
    return matching


def assert_jsonl_pass_through(stdout: str, input_records: List[Dict[str, Any]]):
    """Assert that input records appear in output (pass-through behavior)."""
    output_records = parse_jsonl_output(stdout)
    output_ids = {r.get('id') for r in output_records if r.get('id')}
    for input_rec in input_records:
        input_id = input_rec.get('id')
        if input_id:
            assert input_id in output_ids, \
                f"Input record {input_id} not found in output (pass-through failed)"


def assert_record_has_fields(record: Dict[str, Any], required_fields: List[str]):
    """Assert record has all required fields with non-None values."""
    for field in required_fields:
        assert field in record, f"Record missing field: {field}"
        assert record[field] is not None, f"Record field is None: {field}"
# =============================================================================
# Database Assertions
# =============================================================================

def assert_db_count(model_class, filters: Dict[str, Any], expected: int):
    """Assert database count matches expected."""
    actual = model_class.objects.filter(**filters).count()
    assert actual == expected, \
        f"Expected {expected} {model_class.__name__}, got {actual}"


def assert_db_exists(model_class, **filters):
    """Assert at least one record exists matching filters."""
    assert model_class.objects.filter(**filters).exists(), \
        f"No {model_class.__name__} found matching {filters}"
# =============================================================================
# Test Data Factories
# =============================================================================

def create_test_url(domain: str = 'example.com', path: Optional[str] = None) -> str:
    """Generate unique test URL."""
    import uuid
    path = path or uuid.uuid4().hex[:8]
    return f'https://{domain}/{path}'


def create_test_crawl_json(urls: Optional[List[str]] = None, **kwargs) -> Dict[str, Any]:
    """Create Crawl JSONL record for testing."""
    from archivebox.misc.jsonl import TYPE_CRAWL
    urls = urls or [create_test_url()]
    return {
        'type': TYPE_CRAWL,
        'urls': '\n'.join(urls),
        'max_depth': kwargs.get('max_depth', 0),
        'tags_str': kwargs.get('tags_str', ''),
        'status': kwargs.get('status', 'queued'),
        **{k: v for k, v in kwargs.items() if k not in ('max_depth', 'tags_str', 'status')},
    }


def create_test_snapshot_json(url: Optional[str] = None, **kwargs) -> Dict[str, Any]:
    """Create Snapshot JSONL record for testing."""
    from archivebox.misc.jsonl import TYPE_SNAPSHOT
    return {
        'type': TYPE_SNAPSHOT,
        'url': url or create_test_url(),
        'tags_str': kwargs.get('tags_str', ''),
        'status': kwargs.get('status', 'queued'),
        **{k: v for k, v in kwargs.items() if k not in ('tags_str', 'status')},
    }
```
---
## Test Rules
- **NO SKIPPING** - Every test runs
- **NO MOCKING** - Real subprocess calls, real database
- **NO DISABLING** - Failing tests identify real problems
- **MINIMAL CODE** - Import helpers from conftest.py
- **ISOLATED** - Each test gets its own DATA_DIR via `tmp_path`
---
## Task Checklist
### Phase 1: Model Prerequisites
- [x] Implement `ArchiveResult.from_json()` in `archivebox/core/models.py`
- [x] Implement `ArchiveResult.from_jsonl()` in `archivebox/core/models.py`
- [x] Fix `Snapshot.to_json()` to use `tags_str` instead of `tags`
### Phase 2: Shared Utilities
- [x] Create `archivebox/cli/cli_utils.py` with shared `apply_filters()`
- [x] Update 7 CLI files to import from `cli_utils.py`
### Phase 3: Pass-Through Behavior
- [x] Add pass-through to `archivebox_crawl.py` create
- [x] Add pass-through to `archivebox_snapshot.py` create
- [x] Add pass-through to `archivebox_archiveresult.py` create
- [x] Add create-or-update to `archivebox_run.py`
- [x] Add pass-through output to `archivebox_run.py`
### Phase 4: Test Infrastructure
- [x] Create `archivebox/tests/conftest.py` with pytest-django fixtures
### Phase 5: Unit Tests
- [x] Create `archivebox/tests/test_cli_crawl.py`
- [x] Create `archivebox/tests/test_cli_snapshot.py`
- [x] Create `archivebox/tests/test_cli_archiveresult.py`
- [x] Create `archivebox/tests/test_cli_run.py`
### Phase 6: Integration & Config
- [x] Extend `archivebox/cli/tests_piping.py` with pass-through tests
- [x] Update `archivebox/workers/supervisord_util.py`: orchestrator→run

TODO_cli_refactor.md Normal file

@@ -0,0 +1,131 @@
# ArchiveBox CLI Refactor TODO
## Design Decisions
1. **Keep `archivebox add`** as high-level convenience command
2. **Unified `archivebox run`** for processing (replaces per-model `run` and `orchestrator`)
3. **Expose all models** including binary, process, machine
4. **Clean break** from old command structure (no backward compatibility aliases)
## Final Architecture
```
archivebox <model> <action> [args...] [--filters]
archivebox run [stdin JSONL]
```
### Actions (4 per model):
- `create` - Create records (from args, stdin, or JSONL), dedupes by indexed fields
- `list` - Query records (with filters, returns JSONL)
- `update` - Modify records (from stdin JSONL, PATCH semantics)
- `delete` - Remove records (from stdin JSONL, requires --yes)
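The dedupe-by-indexed-fields rule for `create` would be enforced in the database (e.g. via `get_or_create` on a unique field), but the semantics can be shown in memory. A hedged sketch deduplicating Snapshots by `url`:

```python
def dedupe_by(records, key):
    """Keep only the first record for each value of an indexed field (e.g. 'url')."""
    seen, out = set(), []
    for record in records:
        value = record.get(key)
        if value in seen:
            continue
        seen.add(value)
        out.append(record)
    return out

snapshots = [
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "Snapshot", "url": "https://foo.com"},
]
unique = dedupe_by(snapshots, "url")
```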
### Unified Run Command:
- `archivebox run` - Process queued work
- With stdin JSONL: Process piped records, exit when complete
- Without stdin (TTY): Run orchestrator in foreground until killed
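The mode split above comes down to checking whether stdin is a TTY. A minimal sketch, assuming the real command dispatches on `sys.stdin.isatty()`:

```python
def choose_run_mode(stdin_is_tty: bool) -> str:
    """Decide how `archivebox run` behaves based on whether stdin is a TTY or a pipe."""
    # TTY: no piped records, so run the long-lived orchestrator in the foreground.
    # Pipe: process the piped JSONL records, then exit when the queue drains.
    return "orchestrator" if stdin_is_tty else "process_stdin"

# In the real command this would be: mode = choose_run_mode(sys.stdin.isatty())
```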
### Models (7 total):
- `crawl` - Crawl jobs
- `snapshot` - Individual archived pages
- `archiveresult` - Plugin extraction results
- `tag` - Tags/labels
- `binary` - Detected binaries (chrome, wget, etc.)
- `process` - Process execution records (read-only)
- `machine` - Machine/host records (read-only)
---
## Implementation Checklist
### Phase 1: Unified Run Command
- [x] Create `archivebox/cli/archivebox_run.py` - unified processing command
### Phase 2: Core Model Commands
- [x] Refactor `archivebox/cli/archivebox_snapshot.py` to Click group with create|list|update|delete
- [x] Refactor `archivebox/cli/archivebox_crawl.py` to Click group with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_archiveresult.py` with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_tag.py` with create|list|update|delete
### Phase 3: System Model Commands
- [x] Create `archivebox/cli/archivebox_binary.py` with create|list|update|delete
- [x] Create `archivebox/cli/archivebox_process.py` with list only (read-only)
- [x] Create `archivebox/cli/archivebox_machine.py` with list only (read-only)
### Phase 4: Registry & Cleanup
- [x] Update `archivebox/cli/__init__.py` command registry
- [x] Delete `archivebox/cli/archivebox_extract.py`
- [x] Delete `archivebox/cli/archivebox_remove.py`
- [x] Delete `archivebox/cli/archivebox_search.py`
- [x] Delete `archivebox/cli/archivebox_orchestrator.py`
- [x] Update `archivebox/cli/archivebox_add.py` internals (no changes needed - uses models directly)
- [x] Update `archivebox/cli/tests_piping.py`
### Phase 5: Tests for New Commands
- [ ] Add tests for `archivebox run` command
- [ ] Add tests for `archivebox crawl create|list|update|delete`
- [ ] Add tests for `archivebox snapshot create|list|update|delete`
- [ ] Add tests for `archivebox archiveresult create|list|update|delete`
- [ ] Add tests for `archivebox tag create|list|update|delete`
- [ ] Add tests for `archivebox binary create|list|update|delete`
- [ ] Add tests for `archivebox process list`
- [ ] Add tests for `archivebox machine list`
---
## Usage Examples
### Basic CRUD
```bash
# Create
archivebox crawl create https://example.com https://foo.com --depth=1
archivebox snapshot create https://example.com --tag=news
# List with filters
archivebox crawl list --status=queued
archivebox snapshot list --url__icontains=example.com
archivebox archiveresult list --status=failed --plugin=screenshot
# Update (reads JSONL from stdin, applies changes)
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new
# Delete (requires --yes)
archivebox crawl list --url__icontains=example.com | archivebox crawl delete --yes
```
### Unified Run Command
```bash
# Run orchestrator in foreground (replaces `archivebox orchestrator`)
archivebox run
# Process specific records (pipe any JSONL type, exits when done)
archivebox snapshot list --status=queued | archivebox run
archivebox archiveresult list --status=failed | archivebox run
archivebox crawl list --status=queued | archivebox run
# Mixed types work too - run handles any JSONL
cat mixed_records.jsonl | archivebox run
```
### Composed Workflows
```bash
# Full pipeline (replaces old `archivebox add`)
archivebox crawl create https://example.com --status=queued \
| archivebox snapshot create --status=queued \
| archivebox archiveresult create --status=queued \
| archivebox run
# Re-run failed extractions
archivebox archiveresult list --status=failed | archivebox run
# Delete all snapshots for a domain
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes
```
### Keep `archivebox add` as convenience
```bash
# This remains the simple user-friendly interface:
archivebox add https://example.com --depth=1 --tag=news
# Internally equivalent to the composed pipeline above
```


@@ -478,7 +478,7 @@ interface LoadedChromeExtension extends ChromeExtension {
const CHROME_EXTENSIONS: LoadedChromeExtension[] = [
// Content access / unblocking / blocking plugins
{webstore_id: 'ifibfemgeogfhoebkmokieepdoobkbpo', name: 'captcha2'}, // https://2captcha.com/blog/how-to-use-2captcha-solver-extension-in-puppeteer
{webstore_id: 'ifibfemgeogfhoebkmokieepdoobkbpo', name: 'twocaptcha'}, // https://2captcha.com/blog/how-to-use-2captcha-solver-extension-in-puppeteer
{webstore_id: 'edibdbjcniadpccecjdfdjjppcpchdlm', name: 'istilldontcareaboutcookies'},
{webstore_id: 'cjpalhdlnbpafiamejdnhcphjbkeiagm', name: 'ublock'},
// {webstore_id: 'mlomiejdfkolichcflejclcbmpeaniij', name: 'ghostery'},
@@ -1123,7 +1123,7 @@ async function setup2CaptchaExtension({browser, extensions}) {
try {
// open a new tab to finish setting up the 2captcha extension manually using its extension options page
page = await browser.newPage()
const { options_url } = extensions.filter(ext => ext.name === 'captcha2')[0]
const { options_url } = extensions.filter(ext => ext.name === 'twocaptcha')[0]
await page.goto(options_url)
await wait(2_500)
await page.bringToFront()


@@ -27,36 +27,45 @@ class ArchiveBoxGroup(click.Group):
'init': 'archivebox.cli.archivebox_init.main',
'install': 'archivebox.cli.archivebox_install.main',
}
# Model commands (CRUD operations via subcommands)
model_commands = {
'crawl': 'archivebox.cli.archivebox_crawl.main',
'snapshot': 'archivebox.cli.archivebox_snapshot.main',
'archiveresult': 'archivebox.cli.archivebox_archiveresult.main',
'tag': 'archivebox.cli.archivebox_tag.main',
'binary': 'archivebox.cli.archivebox_binary.main',
'process': 'archivebox.cli.archivebox_process.main',
'machine': 'archivebox.cli.archivebox_machine.main',
}
archive_commands = {
# High-level commands
'add': 'archivebox.cli.archivebox_add.main',
'remove': 'archivebox.cli.archivebox_remove.main',
'run': 'archivebox.cli.archivebox_run.main',
'update': 'archivebox.cli.archivebox_update.main',
'search': 'archivebox.cli.archivebox_search.main',
'status': 'archivebox.cli.archivebox_status.main',
'config': 'archivebox.cli.archivebox_config.main',
'schedule': 'archivebox.cli.archivebox_schedule.main',
'server': 'archivebox.cli.archivebox_server.main',
'shell': 'archivebox.cli.archivebox_shell.main',
'manage': 'archivebox.cli.archivebox_manage.main',
# Worker/orchestrator commands
'orchestrator': 'archivebox.cli.archivebox_orchestrator.main',
# Introspection commands
'pluginmap': 'archivebox.cli.archivebox_pluginmap.main',
# Worker command
'worker': 'archivebox.cli.archivebox_worker.main',
# Task commands (called by workers as subprocesses)
'crawl': 'archivebox.cli.archivebox_crawl.main',
'snapshot': 'archivebox.cli.archivebox_snapshot.main',
'extract': 'archivebox.cli.archivebox_extract.main',
}
all_subcommands = {
**meta_commands,
**setup_commands,
**model_commands,
**archive_commands,
}
renamed_commands = {
'setup': 'install',
'list': 'search',
'import': 'add',
'archive': 'add',
'export': 'search',
# Old commands replaced by new model commands
'orchestrator': 'run',
'extract': 'archiveresult',
}
@classmethod
@@ -110,9 +119,9 @@ def cli(ctx, help=False):
if help or ctx.invoked_subcommand is None:
ctx.invoke(ctx.command.get_command(ctx, 'help'))
# if the subcommand is in the archive_commands dict and is not 'manage',
# if the subcommand is in archive_commands or model_commands,
# then we need to set up the django environment and check that we're in a valid data folder
- if subcommand in ArchiveBoxGroup.archive_commands:
if subcommand in ArchiveBoxGroup.archive_commands or subcommand in ArchiveBoxGroup.model_commands:
# print('SETUP DJANGO AND CHECK DATA FOLDER')
try:
from archivebox.config.django import setup_django

View File: archivebox/cli/archivebox_archiveresult.py (new file)

@@ -0,0 +1,380 @@
#!/usr/bin/env python3
"""
archivebox archiveresult <action> [args...] [--filters]
Manage ArchiveResult records (plugin extraction results).
Actions:
create - Create ArchiveResults for Snapshots (queue extractions)
list - List ArchiveResults as JSONL (with optional filters)
update - Update ArchiveResults from stdin JSONL
delete - Delete ArchiveResults from stdin JSONL
Examples:
# Create ArchiveResults for snapshots (queue for extraction)
archivebox snapshot list --status=queued | archivebox archiveresult create
archivebox archiveresult create --plugin=screenshot --snapshot-id=<uuid>
# List with filters
archivebox archiveresult list --status=failed
archivebox archiveresult list --plugin=screenshot --status=succeeded
# Update (reset failed extractions to queued)
archivebox archiveresult list --status=failed | archivebox archiveresult update --status=queued
# Delete
archivebox archiveresult list --plugin=singlefile | archivebox archiveresult delete --yes
# Re-run failed extractions
archivebox archiveresult list --status=failed | archivebox run
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox archiveresult'
import sys
from typing import Optional
import rich_click as click
from rich import print as rprint
from archivebox.cli.cli_utils import apply_filters
# =============================================================================
# CREATE
# =============================================================================
def create_archiveresults(
snapshot_id: Optional[str] = None,
plugin: Optional[str] = None,
status: str = 'queued',
) -> int:
"""
Create ArchiveResults for Snapshots.
Reads Snapshot records from stdin and creates ArchiveResult entries.
Pass-through: Non-Snapshot/ArchiveResult records are output unchanged.
If --plugin is specified, only creates results for that plugin.
Otherwise, creates results for all pending plugins.
Exit codes:
0: Success
1: Failure
"""
from django.utils import timezone
from archivebox.misc.jsonl import read_stdin, write_record, TYPE_SNAPSHOT, TYPE_ARCHIVERESULT
from archivebox.core.models import Snapshot, ArchiveResult
is_tty = sys.stdout.isatty()
# If snapshot_id provided directly, use that
if snapshot_id:
try:
snapshots = [Snapshot.objects.get(id=snapshot_id)]
pass_through_records = []
except Snapshot.DoesNotExist:
rprint(f'[red]Snapshot not found: {snapshot_id}[/red]', file=sys.stderr)
return 1
else:
# Read from stdin
records = list(read_stdin())
if not records:
rprint('[yellow]No Snapshot records provided via stdin[/yellow]', file=sys.stderr)
return 1
# Separate snapshot records from pass-through records
snapshot_ids = []
pass_through_records = []
for record in records:
record_type = record.get('type', '')
if record_type == TYPE_SNAPSHOT:
# Pass through the Snapshot record itself
pass_through_records.append(record)
if record.get('id'):
snapshot_ids.append(record['id'])
elif record_type == TYPE_ARCHIVERESULT:
# ArchiveResult records: pass through unchanged
# (records without an id could be created here in the future)
pass_through_records.append(record)
elif record_type:
# Other typed records (Crawl, Tag, etc): pass through
pass_through_records.append(record)
elif record.get('id'):
# Untyped record with id - assume it's a snapshot ID
snapshot_ids.append(record['id'])
# Output pass-through records first
if not is_tty:
for record in pass_through_records:
write_record(record)
if not snapshot_ids:
if pass_through_records:
rprint(f'[dim]Passed through {len(pass_through_records)} records, no new snapshots to process[/dim]', file=sys.stderr)
return 0
rprint('[yellow]No valid Snapshot IDs in input[/yellow]', file=sys.stderr)
return 1
snapshots = list(Snapshot.objects.filter(id__in=snapshot_ids))
if not snapshots:
rprint('[yellow]No matching snapshots found[/yellow]', file=sys.stderr)
return 0 if pass_through_records else 1
created_count = 0
for snapshot in snapshots:
if plugin:
# Create for specific plugin only
result, created = ArchiveResult.objects.get_or_create(
snapshot=snapshot,
plugin=plugin,
defaults={
'status': status,
'retry_at': timezone.now(),
}
)
if not created and result.status in [ArchiveResult.StatusChoices.FAILED, ArchiveResult.StatusChoices.SKIPPED]:
# Reset for retry
result.status = status
result.retry_at = timezone.now()
result.save()
if not is_tty:
write_record(result.to_json())
created_count += 1
else:
# Create all pending plugins
snapshot.create_pending_archiveresults()
for result in snapshot.archiveresult_set.filter(status=ArchiveResult.StatusChoices.QUEUED):
if not is_tty:
write_record(result.to_json())
created_count += 1
rprint(f'[green]Created/queued {created_count} archive results[/green]', file=sys.stderr)
return 0
# =============================================================================
# LIST
# =============================================================================
def list_archiveresults(
status: Optional[str] = None,
plugin: Optional[str] = None,
snapshot_id: Optional[str] = None,
limit: Optional[int] = None,
) -> int:
"""
List ArchiveResults as JSONL with optional filters.
Exit codes:
0: Success (even if no results)
"""
from archivebox.misc.jsonl import write_record
from archivebox.core.models import ArchiveResult
is_tty = sys.stdout.isatty()
queryset = ArchiveResult.objects.all().order_by('-start_ts')
# Apply filters
filter_kwargs = {
'status': status,
'plugin': plugin,
'snapshot_id': snapshot_id,
}
queryset = apply_filters(queryset, filter_kwargs, limit=limit)
count = 0
for result in queryset:
if is_tty:
status_color = {
'queued': 'yellow',
'started': 'blue',
'succeeded': 'green',
'failed': 'red',
'skipped': 'dim',
'backoff': 'magenta',
}.get(result.status, 'dim')
rprint(f'[{status_color}]{result.status:10}[/{status_color}] {result.plugin:15} [dim]{result.id}[/dim] {result.snapshot.url[:40]}')
else:
write_record(result.to_json())
count += 1
rprint(f'[dim]Listed {count} archive results[/dim]', file=sys.stderr)
return 0
# =============================================================================
# UPDATE
# =============================================================================
def update_archiveresults(
status: Optional[str] = None,
) -> int:
"""
Update ArchiveResults from stdin JSONL.
Reads ArchiveResult records from stdin and applies updates.
Uses PATCH semantics - only specified fields are updated.
Exit codes:
0: Success
1: No input or error
"""
from django.utils import timezone
from archivebox.misc.jsonl import read_stdin, write_record
from archivebox.core.models import ArchiveResult
is_tty = sys.stdout.isatty()
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
updated_count = 0
for record in records:
result_id = record.get('id')
if not result_id:
continue
try:
result = ArchiveResult.objects.get(id=result_id)
# Apply updates from CLI flags
if status:
result.status = status
result.retry_at = timezone.now()
result.save()
updated_count += 1
if not is_tty:
write_record(result.to_json())
except ArchiveResult.DoesNotExist:
rprint(f'[yellow]ArchiveResult not found: {result_id}[/yellow]', file=sys.stderr)
continue
rprint(f'[green]Updated {updated_count} archive results[/green]', file=sys.stderr)
return 0
# =============================================================================
# DELETE
# =============================================================================
def delete_archiveresults(yes: bool = False, dry_run: bool = False) -> int:
"""
Delete ArchiveResults from stdin JSONL.
Requires --yes flag to confirm deletion.
Exit codes:
0: Success
1: No input or missing --yes flag
"""
from archivebox.misc.jsonl import read_stdin
from archivebox.core.models import ArchiveResult
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
result_ids = [r.get('id') for r in records if r.get('id')]
if not result_ids:
rprint('[yellow]No valid archive result IDs in input[/yellow]', file=sys.stderr)
return 1
results = ArchiveResult.objects.filter(id__in=result_ids)
count = results.count()
if count == 0:
rprint('[yellow]No matching archive results found[/yellow]', file=sys.stderr)
return 0
if dry_run:
rprint(f'[yellow]Would delete {count} archive results (dry run)[/yellow]', file=sys.stderr)
for result in results[:10]:
rprint(f' [dim]{result.id}[/dim] {result.plugin} {result.snapshot.url[:40]}', file=sys.stderr)
if count > 10:
rprint(f' ... and {count - 10} more', file=sys.stderr)
return 0
if not yes:
rprint('[red]Use --yes to confirm deletion[/red]', file=sys.stderr)
return 1
# Perform deletion
deleted_count, _ = results.delete()
rprint(f'[green]Deleted {deleted_count} archive results[/green]', file=sys.stderr)
return 0
# =============================================================================
# CLI Commands
# =============================================================================
@click.group()
def main():
"""Manage ArchiveResult records (plugin extraction results)."""
pass
@main.command('create')
@click.option('--snapshot-id', help='Snapshot ID to create results for')
@click.option('--plugin', '-p', help='Plugin name (e.g., screenshot, singlefile)')
@click.option('--status', '-s', default='queued', help='Initial status (default: queued)')
def create_cmd(snapshot_id: Optional[str], plugin: Optional[str], status: str):
"""Create ArchiveResults for Snapshots from stdin JSONL."""
sys.exit(create_archiveresults(snapshot_id=snapshot_id, plugin=plugin, status=status))
@main.command('list')
@click.option('--status', '-s', help='Filter by status (queued, started, succeeded, failed, skipped)')
@click.option('--plugin', '-p', help='Filter by plugin name')
@click.option('--snapshot-id', help='Filter by snapshot ID')
@click.option('--limit', '-n', type=int, help='Limit number of results')
def list_cmd(status: Optional[str], plugin: Optional[str],
snapshot_id: Optional[str], limit: Optional[int]):
"""List ArchiveResults as JSONL."""
sys.exit(list_archiveresults(
status=status,
plugin=plugin,
snapshot_id=snapshot_id,
limit=limit,
))
@main.command('update')
@click.option('--status', '-s', help='Set status')
def update_cmd(status: Optional[str]):
"""Update ArchiveResults from stdin JSONL."""
sys.exit(update_archiveresults(status=status))
@main.command('delete')
@click.option('--yes', '-y', is_flag=True, help='Confirm deletion')
@click.option('--dry-run', is_flag=True, help='Show what would be deleted')
def delete_cmd(yes: bool, dry_run: bool):
"""Delete ArchiveResults from stdin JSONL."""
sys.exit(delete_archiveresults(yes=yes, dry_run=dry_run))
if __name__ == '__main__':
main()

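Every `list` command above funnels its CLI flags through `apply_filters` from `archivebox.cli.cli_utils`, which this diff does not show. Under the stated design principle (filters are functions that take a queryset and return a queryset), a minimal sketch could look like the following; the name and call signature come from the call sites above, but the body is an assumption, not necessarily the PR's actual implementation:

```python
# Hypothetical sketch of apply_filters() from archivebox/cli/cli_utils.py;
# the real implementation in this PR may differ.
from typing import Any, Optional


def apply_filters(queryset: Any, filter_kwargs: dict, limit: Optional[int] = None) -> Any:
    """Apply all non-None Django ORM filters to a queryset, then an optional limit."""
    # Drop unset filters so callers can pass every CLI flag unconditionally
    active = {key: value for key, value in filter_kwargs.items() if value is not None}
    if active:
        queryset = queryset.filter(**active)
    if limit:
        queryset = queryset[:limit]
    return queryset
```

This keeps the command functions trivial: they build a dict of every possible flag and let the helper ignore the ones the user didn't pass.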
View File: archivebox/cli/archivebox_binary.py (new file)

@@ -0,0 +1,290 @@
#!/usr/bin/env python3
"""
archivebox binary <action> [args...] [--filters]
Manage Binary records (detected executables like chrome, wget, etc.).
Actions:
create - Create/register a Binary
list - List Binaries as JSONL (with optional filters)
update - Update Binaries from stdin JSONL
delete - Delete Binaries from stdin JSONL
Examples:
# List all binaries
archivebox binary list
# List specific binary
archivebox binary list --name=chrome
# List binaries with specific version
archivebox binary list --version__icontains=120
# Delete old binary entries
archivebox binary list --name=chrome | archivebox binary delete --yes
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox binary'
import sys
from typing import Optional
import rich_click as click
from rich import print as rprint
from archivebox.cli.cli_utils import apply_filters
# =============================================================================
# CREATE
# =============================================================================
def create_binary(
name: str,
abspath: str,
version: str = '',
) -> int:
"""
Create/register a Binary.
Exit codes:
0: Success
1: Failure
"""
from archivebox.misc.jsonl import write_record
from archivebox.machine.models import Binary
is_tty = sys.stdout.isatty()
if not name or not abspath:
rprint('[red]Both --name and --abspath are required[/red]', file=sys.stderr)
return 1
try:
binary, created = Binary.objects.get_or_create(
name=name,
abspath=abspath,
defaults={'version': version}
)
if not is_tty:
write_record(binary.to_json())
if created:
rprint(f'[green]Created binary: {name} at {abspath}[/green]', file=sys.stderr)
else:
rprint(f'[dim]Binary already exists: {name} at {abspath}[/dim]', file=sys.stderr)
return 0
except Exception as e:
rprint(f'[red]Error creating binary: {e}[/red]', file=sys.stderr)
return 1
# =============================================================================
# LIST
# =============================================================================
def list_binaries(
name: Optional[str] = None,
abspath__icontains: Optional[str] = None,
version__icontains: Optional[str] = None,
limit: Optional[int] = None,
) -> int:
"""
List Binaries as JSONL with optional filters.
Exit codes:
0: Success (even if no results)
"""
from archivebox.misc.jsonl import write_record
from archivebox.machine.models import Binary
is_tty = sys.stdout.isatty()
queryset = Binary.objects.all().order_by('name', '-loaded_at')
# Apply filters
filter_kwargs = {
'name': name,
'abspath__icontains': abspath__icontains,
'version__icontains': version__icontains,
}
queryset = apply_filters(queryset, filter_kwargs, limit=limit)
count = 0
for binary in queryset:
if is_tty:
rprint(f'[cyan]{binary.name:20}[/cyan] [dim]{binary.version:15}[/dim] {binary.abspath}')
else:
write_record(binary.to_json())
count += 1
rprint(f'[dim]Listed {count} binaries[/dim]', file=sys.stderr)
return 0
# =============================================================================
# UPDATE
# =============================================================================
def update_binaries(
version: Optional[str] = None,
abspath: Optional[str] = None,
) -> int:
"""
Update Binaries from stdin JSONL.
Reads Binary records from stdin and applies updates.
Uses PATCH semantics - only specified fields are updated.
Exit codes:
0: Success
1: No input or error
"""
from archivebox.misc.jsonl import read_stdin, write_record
from archivebox.machine.models import Binary
is_tty = sys.stdout.isatty()
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
updated_count = 0
for record in records:
binary_id = record.get('id')
if not binary_id:
continue
try:
binary = Binary.objects.get(id=binary_id)
# Apply updates from CLI flags
if version:
binary.version = version
if abspath:
binary.abspath = abspath
binary.save()
updated_count += 1
if not is_tty:
write_record(binary.to_json())
except Binary.DoesNotExist:
rprint(f'[yellow]Binary not found: {binary_id}[/yellow]', file=sys.stderr)
continue
rprint(f'[green]Updated {updated_count} binaries[/green]', file=sys.stderr)
return 0
# =============================================================================
# DELETE
# =============================================================================
def delete_binaries(yes: bool = False, dry_run: bool = False) -> int:
"""
Delete Binaries from stdin JSONL.
Requires --yes flag to confirm deletion.
Exit codes:
0: Success
1: No input or missing --yes flag
"""
from archivebox.misc.jsonl import read_stdin
from archivebox.machine.models import Binary
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
binary_ids = [r.get('id') for r in records if r.get('id')]
if not binary_ids:
rprint('[yellow]No valid binary IDs in input[/yellow]', file=sys.stderr)
return 1
binaries = Binary.objects.filter(id__in=binary_ids)
count = binaries.count()
if count == 0:
rprint('[yellow]No matching binaries found[/yellow]', file=sys.stderr)
return 0
if dry_run:
rprint(f'[yellow]Would delete {count} binaries (dry run)[/yellow]', file=sys.stderr)
for binary in binaries:
rprint(f' {binary.name} {binary.abspath}', file=sys.stderr)
return 0
if not yes:
rprint('[red]Use --yes to confirm deletion[/red]', file=sys.stderr)
return 1
# Perform deletion
deleted_count, _ = binaries.delete()
rprint(f'[green]Deleted {deleted_count} binaries[/green]', file=sys.stderr)
return 0
# =============================================================================
# CLI Commands
# =============================================================================
@click.group()
def main():
"""Manage Binary records (detected executables)."""
pass
@main.command('create')
@click.option('--name', '-n', required=True, help='Binary name (e.g., chrome, wget)')
@click.option('--abspath', '-p', required=True, help='Absolute path to binary')
@click.option('--version', '-v', default='', help='Binary version')
def create_cmd(name: str, abspath: str, version: str):
"""Create/register a Binary."""
sys.exit(create_binary(name=name, abspath=abspath, version=version))
@main.command('list')
@click.option('--name', '-n', help='Filter by name')
@click.option('--abspath__icontains', help='Filter by path contains')
@click.option('--version__icontains', help='Filter by version contains')
@click.option('--limit', type=int, help='Limit number of results')
def list_cmd(name: Optional[str], abspath__icontains: Optional[str],
version__icontains: Optional[str], limit: Optional[int]):
"""List Binaries as JSONL."""
sys.exit(list_binaries(
name=name,
abspath__icontains=abspath__icontains,
version__icontains=version__icontains,
limit=limit,
))
@main.command('update')
@click.option('--version', '-v', help='Set version')
@click.option('--abspath', '-p', help='Set path')
def update_cmd(version: Optional[str], abspath: Optional[str]):
"""Update Binaries from stdin JSONL."""
sys.exit(update_binaries(version=version, abspath=abspath))
@main.command('delete')
@click.option('--yes', '-y', is_flag=True, help='Confirm deletion')
@click.option('--dry-run', is_flag=True, help='Show what would be deleted')
def delete_cmd(yes: bool, dry_run: bool):
"""Delete Binaries from stdin JSONL."""
sys.exit(delete_binaries(yes=yes, dry_run=dry_run))
if __name__ == '__main__':
main()

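All of these commands lean on `archivebox.misc.jsonl` for `read_stdin()` and `write_record()`, whose bodies are not part of this diff. A self-contained sketch of what they plausibly do (one JSON object per line out; JSON lines or bare URLs in) follows; the `stream` parameter here is added purely so the sketch is testable and is an assumption about the real module:

```python
# Hypothetical sketch of read_stdin()/write_record() from archivebox/misc/jsonl.py;
# the actual module in this PR may differ.
import json
import sys
from typing import Iterator


def write_record(record: dict) -> None:
    """Emit one JSONL record to stdout (one JSON object per line)."""
    sys.stdout.write(json.dumps(record) + '\n')


def read_stdin(stream=None) -> Iterator[dict]:
    """Yield one dict per non-empty input line; bare URL lines become {'url': ...}."""
    stream = stream or sys.stdin
    if stream.isatty():
        return  # nothing piped in
    for line in stream:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and comments
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            yield {'url': line}  # treat plain lines as URLs
```

The bare-URL fallback is what lets `echo https://example.com | archivebox crawl create` work with the same code path as a full JSONL pipeline.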
View File: archivebox/cli/archivebox_crawl.py

@@ -1,108 +1,153 @@
#!/usr/bin/env python3
"""
- archivebox crawl [urls...] [--depth=N] [--tag=TAG]
archivebox crawl <action> [args...] [--filters]
- Create Crawl jobs from URLs. Accepts URLs as arguments, from stdin, or via JSONL.
- Does NOT immediately start the crawl - pipe to `archivebox snapshot` to process.
Manage Crawl records.
- Input formats:
- - Plain URLs (one per line)
- - JSONL: {"url": "...", "depth": 1, "tags": "..."}
- Output (JSONL):
- {"type": "Crawl", "id": "...", "urls": "...", "status": "queued", ...}
Actions:
create - Create Crawl jobs from URLs
list - List Crawls as JSONL (with optional filters)
update - Update Crawls from stdin JSONL
delete - Delete Crawls from stdin JSONL
Examples:
- # Create a crawl job
- archivebox crawl https://example.com
# Create
archivebox crawl create https://example.com https://foo.com --depth=1
archivebox crawl create --tag=news https://example.com
- # Create crawl with depth
- archivebox crawl --depth=1 https://example.com
# List with filters
archivebox crawl list --status=queued
archivebox crawl list --urls__icontains=example.com
- # Full pipeline: create crawl, create snapshots, run extractors
- archivebox crawl https://example.com | archivebox snapshot | archivebox extract
# Update
archivebox crawl list --status=started | archivebox crawl update --status=queued
- # Process existing Crawl by ID (runs the crawl state machine)
- archivebox crawl 01234567-89ab-cdef-0123-456789abcdef
# Delete
archivebox crawl list --urls__icontains=spam.com | archivebox crawl delete --yes
# Full pipeline
archivebox crawl create https://example.com | archivebox snapshot create | archivebox run
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox crawl'
import sys
- from typing import Optional
from typing import Optional, Iterable
import rich_click as click
from rich import print as rprint
from archivebox.cli.cli_utils import apply_filters
- def create_crawls(
- records: list,
# =============================================================================
# CREATE
# =============================================================================
def create_crawl(
urls: Iterable[str],
depth: int = 0,
tag: str = '',
status: str = 'queued',
created_by_id: Optional[int] = None,
) -> int:
"""
- Create a single Crawl job from all input URLs.
Create a Crawl job from URLs.
- Takes pre-read records, creates one Crawl with all URLs, outputs JSONL.
Does NOT start the crawl - just creates the job in QUEUED state.
Takes URLs as args or stdin, creates one Crawl with all URLs, outputs JSONL.
Pass-through: Records that are not URLs are output unchanged (for piping).
Exit codes:
0: Success
1: Failure
"""
- from rich import print as rprint
- from archivebox.misc.jsonl import write_record
from archivebox.misc.jsonl import read_args_or_stdin, write_record, TYPE_CRAWL
from archivebox.base_models.models import get_or_create_system_user_pk
from archivebox.crawls.models import Crawl
created_by_id = created_by_id or get_or_create_system_user_pk()
is_tty = sys.stdout.isatty()
# Collect all input records
records = list(read_args_or_stdin(urls))
if not records:
rprint('[yellow]No URLs provided. Pass URLs as arguments or via stdin.[/yellow]', file=sys.stderr)
return 1
- # Collect all URLs into a single newline-separated string
- urls = []
# Separate pass-through records from URL records
url_list = []
pass_through_records = []
for record in records:
record_type = record.get('type', '')
# Pass-through: output records that aren't URL/Crawl types
if record_type and record_type != TYPE_CRAWL and not record.get('url') and not record.get('urls'):
pass_through_records.append(record)
continue
# Handle existing Crawl records (just pass through with id)
if record_type == TYPE_CRAWL and record.get('id'):
pass_through_records.append(record)
continue
# Collect URLs
url = record.get('url')
if url:
- urls.append(url)
url_list.append(url)
- if not urls:
# Handle 'urls' field (newline-separated)
urls_field = record.get('urls')
if urls_field:
for line in urls_field.split('\n'):
line = line.strip()
if line and not line.startswith('#'):
url_list.append(line)
# Output pass-through records first
if not is_tty:
for record in pass_through_records:
write_record(record)
if not url_list:
if pass_through_records:
# If we had pass-through records but no URLs, that's OK
rprint(f'[dim]Passed through {len(pass_through_records)} records, no new URLs[/dim]', file=sys.stderr)
return 0
rprint('[red]No valid URLs found[/red]', file=sys.stderr)
return 1
try:
# Build crawl record with all URLs as newline-separated string
crawl_record = {
- 'urls': '\n'.join(urls),
'urls': '\n'.join(url_list),
'max_depth': depth,
'tags_str': tag,
'status': status,
'label': '',
}
- crawl = Crawl.from_jsonl(crawl_record, overrides={'created_by_id': created_by_id})
crawl = Crawl.from_json(crawl_record, overrides={'created_by_id': created_by_id})
if not crawl:
rprint('[red]Failed to create crawl[/red]', file=sys.stderr)
return 1
# Output JSONL record (only when piped)
if not is_tty:
- write_record(crawl.to_jsonl())
write_record(crawl.to_json())
- rprint(f'[green]Created crawl with {len(urls)} URLs[/green]', file=sys.stderr)
rprint(f'[green]Created crawl with {len(url_list)} URLs[/green]', file=sys.stderr)
# If TTY, show human-readable output
if is_tty:
rprint(f' [dim]{crawl.id}[/dim]', file=sys.stderr)
- for url in urls[:5]: # Show first 5 URLs
for url in url_list[:5]: # Show first 5 URLs
rprint(f' {url[:70]}', file=sys.stderr)
- if len(urls) > 5:
- rprint(f' ... and {len(urls) - 5} more', file=sys.stderr)
if len(url_list) > 5:
rprint(f' ... and {len(url_list) - 5} more', file=sys.stderr)
return 0
@@ -111,81 +156,217 @@ def create_crawls(
return 1
- def process_crawl_by_id(crawl_id: str) -> int:
- """
- Process a single Crawl by ID (used by workers).
# =============================================================================
# LIST
# =============================================================================
- Triggers the Crawl's state machine tick() which will:
- - Transition from queued -> started (creates root snapshot)
- - Transition from started -> sealed (when all snapshots done)
def list_crawls(
status: Optional[str] = None,
urls__icontains: Optional[str] = None,
max_depth: Optional[int] = None,
limit: Optional[int] = None,
) -> int:
"""
- from rich import print as rprint
List Crawls as JSONL with optional filters.
Exit codes:
0: Success (even if no results)
"""
from archivebox.misc.jsonl import write_record
from archivebox.crawls.models import Crawl
- try:
- crawl = Crawl.objects.get(id=crawl_id)
- except Crawl.DoesNotExist:
- rprint(f'[red]Crawl {crawl_id} not found[/red]', file=sys.stderr)
- return 1
is_tty = sys.stdout.isatty()
- rprint(f'[blue]Processing Crawl {crawl.id} (status={crawl.status})[/blue]', file=sys.stderr)
queryset = Crawl.objects.all().order_by('-created_at')
- try:
- crawl.sm.tick()
- crawl.refresh_from_db()
- rprint(f'[green]Crawl complete (status={crawl.status})[/green]', file=sys.stderr)
- return 0
- except Exception as e:
- rprint(f'[red]Crawl error: {type(e).__name__}: {e}[/red]', file=sys.stderr)
- return 1
# Apply filters
filter_kwargs = {
'status': status,
'urls__icontains': urls__icontains,
'max_depth': max_depth,
}
queryset = apply_filters(queryset, filter_kwargs, limit=limit)
count = 0
for crawl in queryset:
if is_tty:
status_color = {
'queued': 'yellow',
'started': 'blue',
'sealed': 'green',
}.get(crawl.status, 'dim')
url_preview = crawl.urls[:50].replace('\n', ' ')
rprint(f'[{status_color}]{crawl.status:8}[/{status_color}] [dim]{crawl.id}[/dim] {url_preview}...')
else:
write_record(crawl.to_json())
count += 1
rprint(f'[dim]Listed {count} crawls[/dim]', file=sys.stderr)
return 0
- def is_crawl_id(value: str) -> bool:
- """Check if value looks like a Crawl UUID."""
- import re
- uuid_pattern = re.compile(r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)
- if not uuid_pattern.match(value):
- return False
- # Verify it's actually a Crawl (not a Snapshot or other object)
# =============================================================================
# UPDATE
# =============================================================================
def update_crawls(
status: Optional[str] = None,
max_depth: Optional[int] = None,
) -> int:
"""
Update Crawls from stdin JSONL.
Reads Crawl records from stdin and applies updates.
Uses PATCH semantics - only specified fields are updated.
Exit codes:
0: Success
1: No input or error
"""
from django.utils import timezone
from archivebox.misc.jsonl import read_stdin, write_record
from archivebox.crawls.models import Crawl
- return Crawl.objects.filter(id=value).exists()
is_tty = sys.stdout.isatty()
- @click.command()
- @click.option('--depth', '-d', type=int, default=0, help='Max depth for recursive crawling (default: 0, no recursion)')
- @click.option('--tag', '-t', default='', help='Comma-separated tags to add to snapshots')
- @click.argument('args', nargs=-1)
- def main(depth: int, tag: str, args: tuple):
- """Create Crawl jobs from URLs, or process existing Crawls by ID"""
- from archivebox.misc.jsonl import read_args_or_stdin
- # Read all input
- records = list(read_args_or_stdin(args))
records = list(read_stdin())
if not records:
- from rich import print as rprint
- rprint('[yellow]No URLs or Crawl IDs provided. Pass as arguments or via stdin.[/yellow]', file=sys.stderr)
- sys.exit(1)
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
- # Check if input looks like existing Crawl IDs to process
- # If ALL inputs are Crawl UUIDs, process them
- all_are_crawl_ids = all(
- is_crawl_id(r.get('id') or r.get('url', ''))
- for r in records
- )
updated_count = 0
for record in records:
crawl_id = record.get('id')
if not crawl_id:
continue
- if all_are_crawl_ids:
- # Process existing Crawls by ID
- exit_code = 0
- for record in records:
- crawl_id = record.get('id') or record.get('url')
- result = process_crawl_by_id(crawl_id)
- if result != 0:
- exit_code = result
- sys.exit(exit_code)
- else:
- # Default behavior: create Crawl jobs from URLs
- sys.exit(create_crawls(records, depth=depth, tag=tag))
try:
crawl = Crawl.objects.get(id=crawl_id)
# Apply updates from CLI flags
if status:
crawl.status = status
crawl.retry_at = timezone.now()
if max_depth is not None:
crawl.max_depth = max_depth
crawl.save()
updated_count += 1
if not is_tty:
write_record(crawl.to_json())
except Crawl.DoesNotExist:
rprint(f'[yellow]Crawl not found: {crawl_id}[/yellow]', file=sys.stderr)
continue
rprint(f'[green]Updated {updated_count} crawls[/green]', file=sys.stderr)
return 0
# =============================================================================
# DELETE
# =============================================================================
def delete_crawls(yes: bool = False, dry_run: bool = False) -> int:
"""
Delete Crawls from stdin JSONL.
Requires --yes flag to confirm deletion.
Exit codes:
0: Success
1: No input or missing --yes flag
"""
from archivebox.misc.jsonl import read_stdin
from archivebox.crawls.models import Crawl
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
crawl_ids = [r.get('id') for r in records if r.get('id')]
if not crawl_ids:
rprint('[yellow]No valid crawl IDs in input[/yellow]', file=sys.stderr)
return 1
crawls = Crawl.objects.filter(id__in=crawl_ids)
count = crawls.count()
if count == 0:
rprint('[yellow]No matching crawls found[/yellow]', file=sys.stderr)
return 0
if dry_run:
rprint(f'[yellow]Would delete {count} crawls (dry run)[/yellow]', file=sys.stderr)
for crawl in crawls:
url_preview = crawl.urls[:50].replace('\n', ' ')
rprint(f' [dim]{crawl.id}[/dim] {url_preview}...', file=sys.stderr)
return 0
if not yes:
rprint('[red]Use --yes to confirm deletion[/red]', file=sys.stderr)
return 1
# Perform deletion
deleted_count, _ = crawls.delete()
rprint(f'[green]Deleted {deleted_count} crawls[/green]', file=sys.stderr)
return 0
# =============================================================================
# CLI Commands
# =============================================================================
@click.group()
def main():
"""Manage Crawl records."""
pass
@main.command('create')
@click.argument('urls', nargs=-1)
@click.option('--depth', '-d', type=int, default=0, help='Max crawl depth (default: 0)')
@click.option('--tag', '-t', default='', help='Comma-separated tags to add')
@click.option('--status', '-s', default='queued', help='Initial status (default: queued)')
def create_cmd(urls: tuple, depth: int, tag: str, status: str):
"""Create a Crawl job from URLs or stdin."""
sys.exit(create_crawl(urls, depth=depth, tag=tag, status=status))
@main.command('list')
@click.option('--status', '-s', help='Filter by status (queued, started, sealed)')
@click.option('--urls__icontains', help='Filter by URLs contains')
@click.option('--max-depth', type=int, help='Filter by max depth')
@click.option('--limit', '-n', type=int, help='Limit number of results')
def list_cmd(status: Optional[str], urls__icontains: Optional[str],
max_depth: Optional[int], limit: Optional[int]):
"""List Crawls as JSONL."""
sys.exit(list_crawls(
status=status,
urls__icontains=urls__icontains,
max_depth=max_depth,
limit=limit,
))
@main.command('update')
@click.option('--status', '-s', help='Set status')
@click.option('--max-depth', type=int, help='Set max depth')
def update_cmd(status: Optional[str], max_depth: Optional[int]):
"""Update Crawls from stdin JSONL."""
sys.exit(update_crawls(status=status, max_depth=max_depth))
@main.command('delete')
@click.option('--yes', '-y', is_flag=True, help='Confirm deletion')
@click.option('--dry-run', is_flag=True, help='Show what would be deleted')
def delete_cmd(yes: bool, dry_run: bool):
"""Delete Crawls from stdin JSONL."""
sys.exit(delete_crawls(yes=yes, dry_run=dry_run))
if __name__ == '__main__':
    main()

View File

@@ -127,7 +127,7 @@ def init(force: bool=False, quick: bool=False, install: bool=False) -> None:
if pending_links:
for link_dict in pending_links.values():
Snapshot.from_jsonl(link_dict)
Snapshot.from_json(link_dict)
# Hint for orphaned snapshot directories
print()

View File

@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""
archivebox machine <action> [--filters]
Manage Machine records (system-managed, mostly read-only).
Machine records track the host machines where ArchiveBox runs.
They are created automatically by the system and are primarily for debugging.
Actions:
list - List Machines as JSONL (with optional filters)
Examples:
# List all machines
archivebox machine list
# List machines by hostname
archivebox machine list --hostname__icontains=myserver
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox machine'
import sys
from typing import Optional
import rich_click as click
from rich import print as rprint
from archivebox.cli.cli_utils import apply_filters
# =============================================================================
# LIST
# =============================================================================
def list_machines(
hostname__icontains: Optional[str] = None,
os_platform: Optional[str] = None,
limit: Optional[int] = None,
) -> int:
"""
List Machines as JSONL with optional filters.
Exit codes:
0: Success (even if no results)
"""
from archivebox.misc.jsonl import write_record
from archivebox.machine.models import Machine
is_tty = sys.stdout.isatty()
queryset = Machine.objects.all().order_by('-created_at')
# Apply filters
filter_kwargs = {
'hostname__icontains': hostname__icontains,
'os_platform': os_platform,
}
queryset = apply_filters(queryset, filter_kwargs, limit=limit)
count = 0
for machine in queryset:
if is_tty:
rprint(f'[cyan]{machine.hostname:30}[/cyan] [dim]{machine.os_platform:10}[/dim] {machine.id}')
else:
write_record(machine.to_json())
count += 1
rprint(f'[dim]Listed {count} machines[/dim]', file=sys.stderr)
return 0
# =============================================================================
# CLI Commands
# =============================================================================
@click.group()
def main():
"""Manage Machine records (read-only, system-managed)."""
pass
@main.command('list')
@click.option('--hostname__icontains', help='Filter by hostname contains')
@click.option('--os-platform', help='Filter by OS platform')
@click.option('--limit', '-n', type=int, help='Limit number of results')
def list_cmd(hostname__icontains: Optional[str], os_platform: Optional[str], limit: Optional[int]):
"""List Machines as JSONL."""
sys.exit(list_machines(
hostname__icontains=hostname__icontains,
os_platform=os_platform,
limit=limit,
))
if __name__ == '__main__':
main()

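The machine, process, and snapshot commands all route their flag values through the shared `apply_filters` helper imported from `archivebox.cli.cli_utils`, whose implementation is not included in this diff. A minimal sketch consistent with the call sites (entries whose value is `None` are dropped, then `limit` slices the queryset) could be:

```python
from typing import Optional

def apply_filters(queryset, filter_kwargs: dict, limit: Optional[int] = None):
    """Apply only the non-None Django ORM filters, then optionally slice."""
    active = {key: value for key, value in filter_kwargs.items() if value is not None}
    if active:
        queryset = queryset.filter(**active)
    if limit:
        queryset = queryset[:limit]
    return queryset
```

This keeps each `list_*` function free of per-flag `if` chains: every command builds one dict of all possible lookups and lets the helper discard the unused ones.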
View File

@@ -0,0 +1,356 @@
#!/usr/bin/env python3
__package__ = 'archivebox.cli'
from typing import Optional
from pathlib import Path
import rich_click as click
from archivebox.misc.util import docstring, enforce_types
# State Machine ASCII Art Diagrams
CRAWL_MACHINE_DIAGRAM = """
┌─────────────────────────────────────────────────────────────────────────────┐
│ CrawlMachine │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ QUEUED │◄────────────────┐ │
│ │ (initial) │ │ │
│ └──────┬──────┘ │ │
│ │ │ tick() unless can_start() │
│ │ tick() when │ │
│ │ can_start() │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ STARTED │─────────────────┘ │
│ │ │◄────────────────┐ │
│ │ enter: │ │ │
│ │ crawl.run()│ │ tick() unless is_finished() │
│ │ (discover │ │ │
│ │ Crawl │─────────────────┘ │
│ │ hooks) │ │
│ └──────┬──────┘ │
│ │ │
│ │ tick() when is_finished() │
│ ▼ │
│ ┌─────────────┐ │
│ │ SEALED │ │
│ │ (final) │ │
│ │ │ │
│ │ enter: │ │
│ │ cleanup() │ │
│ └─────────────┘ │
│ │
│ Hooks triggered: on_Crawl__* (during STARTED.enter via crawl.run()) │
│ on_CrawlEnd__* (during SEALED.enter via cleanup()) │
└─────────────────────────────────────────────────────────────────────────────┘
"""
SNAPSHOT_MACHINE_DIAGRAM = """
┌─────────────────────────────────────────────────────────────────────────────┐
│ SnapshotMachine │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ QUEUED │◄────────────────┐ │
│ │ (initial) │ │ │
│ └──────┬──────┘ │ │
│ │ │ tick() unless can_start() │
│ │ tick() when │ │
│ │ can_start() │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ STARTED │─────────────────┘ │
│ │ │◄────────────────┐ │
│ │ enter: │ │ │
│ │ snapshot │ │ tick() unless is_finished() │
│ │ .run() │ │ │
│ │ (discover │─────────────────┘ │
│ │ Snapshot │ │
│ │ hooks, │ │
│ │ create │ │
│ │ pending │ │
│ │ results) │ │
│ └──────┬──────┘ │
│ │ │
│ │ tick() when is_finished() │
│ ▼ │
│ ┌─────────────┐ │
│ │ SEALED │ │
│ │ (final) │ │
│ │ │ │
│ │ enter: │ │
│ │ cleanup() │ │
│ └─────────────┘ │
│ │
│ Hooks triggered: on_Snapshot__* (creates ArchiveResults in STARTED.enter) │
└─────────────────────────────────────────────────────────────────────────────┘
"""
ARCHIVERESULT_MACHINE_DIAGRAM = """
┌─────────────────────────────────────────────────────────────────────────────┐
│ ArchiveResultMachine │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ QUEUED │◄────────────────┐ │
│ │ (initial) │ │ │
│ └──────┬──────┘ │ │
│ │ │ tick() unless can_start() │
│ │ tick() when │ │
│ │ can_start() │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ STARTED │─────────────────┘ │
│ │ │◄────────────────┐ │
│ │ enter: │ │ tick() unless is_finished() │
│ │ result.run()│─────────────────┘ │
│ │ (execute │ │
│ │ hook via │ │
│ │ run_hook())│ │
│ └──────┬──────┘ │
│ │ │
│ │ tick() checks status set by hook output │
│ ├────────────────┬────────────────┬────────────────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ SUCCEEDED │ │ FAILED │ │ SKIPPED │ │ BACKOFF │ │
│ │ (final) │ │ (final) │ │ (final) │ │ │ │
│ └───────────┘ └───────────┘ └───────────┘ └─────┬─────┘ │
│ │ │
│ can_start()───┘ │
│ loops back to STARTED │
│ │
│ Each ArchiveResult runs ONE specific hook (stored in .hook_name field) │
└─────────────────────────────────────────────────────────────────────────────┘
"""
BINARY_MACHINE_DIAGRAM = """
┌─────────────────────────────────────────────────────────────────────────────┐
│ BinaryMachine │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ QUEUED │◄────────────────┐ │
│ │ (initial) │ │ │
│ └──────┬──────┘ │ │
│ │ │ tick() unless can_start() │
│ │ tick() when │ │
│ │ can_start() │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ STARTED │─────────────────┘ │
│ │ │◄────────────────┐ │
│ │ enter: │ │ │
│ │ binary.run()│ │ tick() unless is_finished() │
│ │ (discover │─────────────────┘ │
│ │ Binary │ │
│ │ hooks, │ │
│ │ try each │ │
│ │ provider) │ │
│ └──────┬──────┘ │
│ │ │
│ │ tick() checks status set by hook output │
│ ├────────────────────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ SUCCEEDED │ │ FAILED │ │
│ │ (final) │ │ (final) │ │
│ │ │ │ │ │
│ │ abspath, │ │ no provider │ │
│ │ version set │ │ succeeded │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Hooks triggered: on_Binary__* (provider hooks during STARTED.enter) │
│ Providers tried in sequence until one succeeds: apt, brew, pip, npm, etc. │
└─────────────────────────────────────────────────────────────────────────────┘
"""
@enforce_types
def pluginmap(
show_disabled: bool = False,
model: Optional[str] = None,
quiet: bool = False,
) -> dict:
"""
Show a map of all state machines and their associated plugin hooks.
Displays ASCII art diagrams of the core model state machines (Crawl, Snapshot,
ArchiveResult, Binary) and lists all auto-detected on_Modelname_xyz hooks
that will run for each model's transitions.
"""
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich import box
from archivebox.hooks import (
discover_hooks,
extract_step,
is_background_hook,
BUILTIN_PLUGINS_DIR,
USER_PLUGINS_DIR,
)
console = Console()
prnt = console.print
# Model event types that can have hooks
model_events = {
'Crawl': {
'description': 'Hooks run when a Crawl starts (QUEUED→STARTED)',
'machine': 'CrawlMachine',
'diagram': CRAWL_MACHINE_DIAGRAM,
},
'CrawlEnd': {
'description': 'Hooks run when a Crawl finishes (STARTED→SEALED)',
'machine': 'CrawlMachine',
'diagram': None, # Part of CrawlMachine
},
'Snapshot': {
'description': 'Hooks run for each Snapshot (creates ArchiveResults)',
'machine': 'SnapshotMachine',
'diagram': SNAPSHOT_MACHINE_DIAGRAM,
},
'Binary': {
'description': 'Hooks for installing binary dependencies (providers)',
'machine': 'BinaryMachine',
'diagram': BINARY_MACHINE_DIAGRAM,
},
}
# Filter to specific model if requested
if model:
        # Match the model name case-insensitively; str.title() would mangle
        # CamelCase names like 'CrawlEnd' -> 'Crawlend'
        matched = next((name for name in model_events if name.lower() == model.lower()), None)
        if not matched:
            prnt(f'[red]Error: Unknown model "{model}". Available: {", ".join(model_events.keys())}[/red]')
            return {}
        model = matched
        model_events = {model: model_events[model]}
result = {
'models': {},
'plugins_dir': str(BUILTIN_PLUGINS_DIR),
'user_plugins_dir': str(USER_PLUGINS_DIR),
}
if not quiet:
prnt()
prnt('[bold cyan]ArchiveBox Plugin Map[/bold cyan]')
prnt(f'[dim]Built-in plugins: {BUILTIN_PLUGINS_DIR}[/dim]')
prnt(f'[dim]User plugins: {USER_PLUGINS_DIR}[/dim]')
prnt()
# Show diagrams first (unless quiet mode)
if not quiet:
# Show ArchiveResult diagram separately since it's different
prnt(Panel(
ARCHIVERESULT_MACHINE_DIAGRAM,
title='[bold green]ArchiveResultMachine[/bold green]',
border_style='green',
expand=False,
))
prnt()
for event_name, info in model_events.items():
# Discover hooks for this event
hooks = discover_hooks(event_name, filter_disabled=not show_disabled)
# Build hook info list
hook_infos = []
for hook_path in hooks:
# Get plugin name from parent directory (e.g., 'wget' from 'plugins/wget/on_Snapshot__61_wget.py')
plugin_name = hook_path.parent.name
step = extract_step(hook_path.name)
is_bg = is_background_hook(hook_path.name)
hook_infos.append({
'path': str(hook_path),
'name': hook_path.name,
'plugin': plugin_name,
'step': step,
'is_background': is_bg,
'extension': hook_path.suffix,
})
result['models'][event_name] = {
'description': info['description'],
'machine': info['machine'],
'hooks': hook_infos,
'hook_count': len(hook_infos),
}
if not quiet:
# Show diagram if this model has one
if info.get('diagram'):
prnt(Panel(
info['diagram'],
title=f'[bold green]{info["machine"]}[/bold green]',
border_style='green',
expand=False,
))
prnt()
# Create hooks table
table = Table(
title=f'[bold yellow]on_{event_name}__* Hooks[/bold yellow] ({len(hooks)} found)',
box=box.ROUNDED,
show_header=True,
header_style='bold magenta',
)
table.add_column('Step', justify='center', width=6)
table.add_column('Plugin', style='cyan', width=20)
table.add_column('Hook Name', style='green')
table.add_column('BG', justify='center', width=4)
table.add_column('Type', justify='center', width=5)
# Sort by step then by name
sorted_hooks = sorted(hook_infos, key=lambda h: (h['step'], h['name']))
for hook in sorted_hooks:
bg_marker = '[yellow]bg[/yellow]' if hook['is_background'] else ''
ext = hook['extension'].lstrip('.')
table.add_row(
str(hook['step']),
hook['plugin'],
hook['name'],
bg_marker,
ext,
)
prnt(table)
prnt()
prnt(f'[dim]{info["description"]}[/dim]')
prnt()
# Summary
if not quiet:
total_hooks = sum(m['hook_count'] for m in result['models'].values())
prnt(f'[bold]Total hooks discovered: {total_hooks}[/bold]')
prnt()
prnt('[dim]Hook naming convention: on_{Model}__{XX}_{description}[.bg].{ext}[/dim]')
prnt('[dim] - XX: Two-digit order (first digit = step 0-9)[/dim]')
prnt('[dim] - .bg: Background hook (non-blocking)[/dim]')
prnt('[dim] - ext: py, sh, or js[/dim]')
prnt()
return result
@click.command()
@click.option('--show-disabled', '-a', is_flag=True, help='Show hooks from disabled plugins too')
@click.option('--model', '-m', type=str, default=None, help='Filter to specific model (Crawl, Snapshot, Binary, CrawlEnd)')
@click.option('--quiet', '-q', is_flag=True, help='Output JSON only, no ASCII diagrams')
@docstring(pluginmap.__doc__)
def main(**kwargs):
import json
result = pluginmap(**kwargs)
if kwargs.get('quiet'):
print(json.dumps(result, indent=2))
if __name__ == '__main__':
main()

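`extract_step` and `is_background_hook` are imported from `archivebox.hooks` but not defined in this diff. Based on the naming convention printed in the summary (`on_{Model}__{XX}_{description}[.bg].{ext}`, first digit of XX = step 0-9), a plausible sketch is:

```python
import re

def extract_step(filename: str) -> int:
    """Return the step (first digit of the two-digit order prefix), or 0."""
    match = re.search(r'__(\d)\d_', filename)
    return int(match.group(1)) if match else 0

def is_background_hook(filename: str) -> bool:
    """Background (non-blocking) hooks carry a .bg marker before the extension."""
    return '.bg.' in filename
```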
View File

@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""
archivebox process <action> [--filters]
Manage Process records (system-managed, mostly read-only).
Process records track executions of binaries during extraction.
They are created automatically by the system and are primarily for debugging.
Actions:
list - List Processes as JSONL (with optional filters)
Examples:
# List all processes
archivebox process list
# List processes by binary
archivebox process list --binary-name=chrome
# List recent processes
archivebox process list --limit=10
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox process'
import sys
from typing import Optional
import rich_click as click
from rich import print as rprint
from archivebox.cli.cli_utils import apply_filters
# =============================================================================
# LIST
# =============================================================================
def list_processes(
binary_name: Optional[str] = None,
machine_id: Optional[str] = None,
limit: Optional[int] = None,
) -> int:
"""
List Processes as JSONL with optional filters.
Exit codes:
0: Success (even if no results)
"""
from archivebox.misc.jsonl import write_record
from archivebox.machine.models import Process
is_tty = sys.stdout.isatty()
queryset = Process.objects.all().select_related('binary', 'machine').order_by('-start_ts')
# Apply filters
filter_kwargs = {}
if binary_name:
filter_kwargs['binary__name'] = binary_name
if machine_id:
filter_kwargs['machine_id'] = machine_id
queryset = apply_filters(queryset, filter_kwargs, limit=limit)
count = 0
for process in queryset:
if is_tty:
binary_name_str = process.binary.name if process.binary else 'unknown'
exit_code = process.returncode if process.returncode is not None else '?'
status_color = 'green' if process.returncode == 0 else 'red' if process.returncode else 'yellow'
rprint(f'[{status_color}]exit={exit_code:3}[/{status_color}] [cyan]{binary_name_str:15}[/cyan] [dim]{process.id}[/dim]')
else:
write_record(process.to_json())
count += 1
rprint(f'[dim]Listed {count} processes[/dim]', file=sys.stderr)
return 0
# =============================================================================
# CLI Commands
# =============================================================================
@click.group()
def main():
"""Manage Process records (read-only, system-managed)."""
pass
@main.command('list')
@click.option('--binary-name', '-b', help='Filter by binary name')
@click.option('--machine-id', '-m', help='Filter by machine ID')
@click.option('--limit', '-n', type=int, help='Limit number of results')
def list_cmd(binary_name: Optional[str], machine_id: Optional[str], limit: Optional[int]):
"""List Processes as JSONL."""
sys.exit(list_processes(
binary_name=binary_name,
machine_id=machine_id,
limit=limit,
))
if __name__ == '__main__':
main()

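All of these commands depend on `read_stdin`, `read_args_or_stdin`, and `write_record` from `archivebox.misc.jsonl`, which this diff does not show. A rough sketch of the assumed contract, where each output line is one JSON object and bare URLs/IDs on input are wrapped as records, might be:

```python
import json
import sys
from typing import Iterable, Iterator

def write_record(record: dict, file=sys.stdout) -> None:
    """Emit one record as a single JSON line (JSONL)."""
    file.write(json.dumps(record) + '\n')

def read_args_or_stdin(args: Iterable[str]) -> Iterator[dict]:
    """Yield dict records from CLI args, falling back to stdin lines."""
    lines = args or (line.strip() for line in sys.stdin)
    for line in lines:
        if not line:
            continue
        if line.startswith('{'):
            yield json.loads(line)  # already a JSONL record
        else:
            yield {'url': line}     # bare URL or ID, wrapped as a record
```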
View File

@@ -0,0 +1,207 @@
#!/usr/bin/env python3
"""
archivebox run [--daemon]
Unified command for processing queued work.
Modes:
- With stdin JSONL: Process piped records, exit when complete
- Without stdin (TTY): Run orchestrator in foreground until killed
Examples:
# Run orchestrator in foreground (replaces `archivebox orchestrator`)
archivebox run
# Run as daemon (don't exit on idle)
archivebox run --daemon
# Process specific records (pipe any JSONL type, exits when done)
archivebox snapshot list --status=queued | archivebox run
archivebox archiveresult list --status=failed | archivebox run
archivebox crawl list --status=queued | archivebox run
# Mixed types work too
cat mixed_records.jsonl | archivebox run
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox run'
import sys
import rich_click as click
from rich import print as rprint
def process_stdin_records() -> int:
"""
Process JSONL records from stdin.
Create-or-update behavior:
- Records WITHOUT id: Create via Model.from_json(), then queue
- Records WITH id: Lookup existing, re-queue for processing
Outputs JSONL of all processed records (for chaining).
Handles any record type: Crawl, Snapshot, ArchiveResult.
Auto-cascades: Crawl → Snapshots → ArchiveResults.
Returns exit code (0 = success, 1 = error).
"""
from django.utils import timezone
from archivebox.misc.jsonl import read_stdin, write_record, TYPE_CRAWL, TYPE_SNAPSHOT, TYPE_ARCHIVERESULT
from archivebox.base_models.models import get_or_create_system_user_pk
from archivebox.core.models import Snapshot, ArchiveResult
from archivebox.crawls.models import Crawl
from archivebox.workers.orchestrator import Orchestrator
records = list(read_stdin())
is_tty = sys.stdout.isatty()
if not records:
return 0 # Nothing to process
created_by_id = get_or_create_system_user_pk()
queued_count = 0
output_records = []
for record in records:
record_type = record.get('type', '')
record_id = record.get('id')
try:
if record_type == TYPE_CRAWL:
if record_id:
# Existing crawl - re-queue
try:
crawl = Crawl.objects.get(id=record_id)
except Crawl.DoesNotExist:
crawl = Crawl.from_json(record, overrides={'created_by_id': created_by_id})
else:
# New crawl - create it
crawl = Crawl.from_json(record, overrides={'created_by_id': created_by_id})
if crawl:
crawl.retry_at = timezone.now()
if crawl.status not in [Crawl.StatusChoices.SEALED]:
crawl.status = Crawl.StatusChoices.QUEUED
crawl.save()
output_records.append(crawl.to_json())
queued_count += 1
elif record_type == TYPE_SNAPSHOT or (record.get('url') and not record_type):
if record_id:
# Existing snapshot - re-queue
try:
snapshot = Snapshot.objects.get(id=record_id)
except Snapshot.DoesNotExist:
snapshot = Snapshot.from_json(record, overrides={'created_by_id': created_by_id})
else:
# New snapshot - create it
snapshot = Snapshot.from_json(record, overrides={'created_by_id': created_by_id})
if snapshot:
snapshot.retry_at = timezone.now()
if snapshot.status not in [Snapshot.StatusChoices.SEALED]:
snapshot.status = Snapshot.StatusChoices.QUEUED
snapshot.save()
output_records.append(snapshot.to_json())
queued_count += 1
elif record_type == TYPE_ARCHIVERESULT:
if record_id:
# Existing archiveresult - re-queue
try:
archiveresult = ArchiveResult.objects.get(id=record_id)
except ArchiveResult.DoesNotExist:
archiveresult = ArchiveResult.from_json(record)
else:
# New archiveresult - create it
archiveresult = ArchiveResult.from_json(record)
if archiveresult:
archiveresult.retry_at = timezone.now()
if archiveresult.status in [ArchiveResult.StatusChoices.FAILED, ArchiveResult.StatusChoices.SKIPPED, ArchiveResult.StatusChoices.BACKOFF]:
archiveresult.status = ArchiveResult.StatusChoices.QUEUED
archiveresult.save()
output_records.append(archiveresult.to_json())
queued_count += 1
else:
# Unknown type - pass through
output_records.append(record)
except Exception as e:
rprint(f'[yellow]Error processing record: {e}[/yellow]', file=sys.stderr)
continue
# Output all processed records (for chaining)
if not is_tty:
for rec in output_records:
write_record(rec)
if queued_count == 0:
rprint('[yellow]No records to process[/yellow]', file=sys.stderr)
return 0
rprint(f'[blue]Processing {queued_count} records...[/blue]', file=sys.stderr)
# Run orchestrator until all queued work is done
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.runloop()
return 0
def run_orchestrator(daemon: bool = False) -> int:
"""
Run the orchestrator process.
The orchestrator:
1. Polls each model queue (Crawl, Snapshot, ArchiveResult)
2. Spawns worker processes when there is work to do
3. Monitors worker health and restarts failed workers
4. Exits when all queues are empty (unless --daemon)
Args:
daemon: Run forever (don't exit when idle)
Returns exit code (0 = success, 1 = error).
"""
from archivebox.workers.orchestrator import Orchestrator
if Orchestrator.is_running():
rprint('[yellow]Orchestrator is already running[/yellow]', file=sys.stderr)
return 0
try:
orchestrator = Orchestrator(exit_on_idle=not daemon)
orchestrator.runloop()
return 0
except KeyboardInterrupt:
return 0
except Exception as e:
rprint(f'[red]Orchestrator error: {type(e).__name__}: {e}[/red]', file=sys.stderr)
return 1
@click.command()
@click.option('--daemon', '-d', is_flag=True, help="Run forever (don't exit on idle)")
def main(daemon: bool):
"""
Process queued work.
When stdin is piped: Process those specific records and exit.
When run standalone: Run orchestrator in foreground.
"""
# Check if stdin has data (non-TTY means piped input)
if not sys.stdin.isatty():
sys.exit(process_stdin_records())
else:
sys.exit(run_orchestrator(daemon=daemon))
if __name__ == '__main__':
main()

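The three per-type branches in `process_stdin_records` repeat the same get-or-create shape. A generic version of that pattern (a hypothetical refactor, not code from this PR) makes the create-or-update rule explicit: a record whose `id` matches an existing row is reused, anything else goes through `Model.from_json()`:

```python
def get_or_create_from_record(model, record: dict, **overrides):
    """Look the record up by id when possible, else create it from JSON."""
    record_id = record.get('id')
    if record_id:
        try:
            return model.objects.get(id=record_id)
        except model.DoesNotExist:
            pass  # id supplied but unknown: fall through and create it
    return model.from_json(record, overrides=overrides)
```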
View File

@@ -1,95 +1,63 @@
#!/usr/bin/env python3
"""
archivebox snapshot [urls_or_crawl_ids...] [--tag=TAG] [--plugins=NAMES]
archivebox snapshot <action> [args...] [--filters]
Create Snapshots from URLs or Crawl jobs. Accepts URLs, Crawl JSONL, or Crawl IDs.
Manage Snapshot records.
Input formats:
- Plain URLs (one per line)
- JSONL: {"type": "Crawl", "id": "...", "urls": "..."}
- JSONL: {"type": "Snapshot", "url": "...", "title": "...", "tags": "..."}
- Crawl UUIDs (one per line)
Output (JSONL):
{"type": "Snapshot", "id": "...", "url": "...", "status": "queued", ...}
Actions:
create - Create Snapshots from URLs or Crawl JSONL
list - List Snapshots as JSONL (with optional filters)
update - Update Snapshots from stdin JSONL
delete - Delete Snapshots from stdin JSONL
Examples:
# Create snapshots from URLs directly
archivebox snapshot https://example.com https://foo.com
# Create
archivebox snapshot create https://example.com --tag=news
archivebox crawl create https://example.com | archivebox snapshot create
# Pipe from crawl command
archivebox crawl https://example.com | archivebox snapshot
# List with filters
archivebox snapshot list --status=queued
archivebox snapshot list --url__icontains=example.com
# Chain with extract
archivebox crawl https://example.com | archivebox snapshot | archivebox extract
# Update
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new
# Run specific plugins after creating snapshots
archivebox snapshot --plugins=screenshot,singlefile https://example.com
# Process existing Snapshot by ID
archivebox snapshot 01234567-89ab-cdef-0123-456789abcdef
# Delete
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox snapshot'
import sys
from typing import Optional
from typing import Optional, Iterable
import rich_click as click
from rich import print as rprint
from archivebox.misc.util import docstring
from archivebox.cli.cli_utils import apply_filters
def process_snapshot_by_id(snapshot_id: str) -> int:
"""
Process a single Snapshot by ID (used by workers).
Triggers the Snapshot's state machine tick() which will:
- Transition from queued -> started (creates pending ArchiveResults)
- Transition from started -> sealed (when all ArchiveResults done)
"""
from rich import print as rprint
from archivebox.core.models import Snapshot
try:
snapshot = Snapshot.objects.get(id=snapshot_id)
except Snapshot.DoesNotExist:
rprint(f'[red]Snapshot {snapshot_id} not found[/red]', file=sys.stderr)
return 1
rprint(f'[blue]Processing Snapshot {snapshot.id} {snapshot.url[:50]} (status={snapshot.status})[/blue]', file=sys.stderr)
try:
snapshot.sm.tick()
snapshot.refresh_from_db()
rprint(f'[green]Snapshot complete (status={snapshot.status})[/green]', file=sys.stderr)
return 0
except Exception as e:
rprint(f'[red]Snapshot error: {type(e).__name__}: {e}[/red]', file=sys.stderr)
return 1
# =============================================================================
# CREATE
# =============================================================================
def create_snapshots(
args: tuple,
urls: Iterable[str],
tag: str = '',
plugins: str = '',
status: str = 'queued',
depth: int = 0,
created_by_id: Optional[int] = None,
) -> int:
"""
Create Snapshots from URLs, Crawl JSONL, or Crawl IDs.
Reads from args or stdin, creates Snapshot objects, outputs JSONL.
If --plugins is passed, also runs specified plugins (blocking).
Create Snapshots from URLs or stdin JSONL (Crawl or Snapshot records).
Pass-through: Records that are not Crawl/Snapshot/URL are output unchanged.
Exit codes:
0: Success
1: Failure
"""
from rich import print as rprint
from django.utils import timezone
from archivebox.misc.jsonl import (
read_args_or_stdin, write_record,
TYPE_SNAPSHOT, TYPE_CRAWL
@@ -102,7 +70,7 @@ def create_snapshots(
is_tty = sys.stdout.isatty()
# Collect all input records
records = list(read_args_or_stdin(args))
records = list(read_args_or_stdin(urls))
if not records:
rprint('[yellow]No URLs or Crawls provided. Pass URLs as arguments or via stdin.[/yellow]', file=sys.stderr)
@@ -110,11 +78,17 @@ def create_snapshots(
# Process each record - handle Crawls and plain URLs/Snapshots
created_snapshots = []
pass_through_count = 0
for record in records:
record_type = record.get('type')
record_type = record.get('type', '')
try:
if record_type == TYPE_CRAWL:
# Pass through the Crawl record itself first
if not is_tty:
write_record(record)
# Input is a Crawl - get or create it, then create Snapshots for its URLs
crawl = None
crawl_id = record.get('id')
@@ -122,145 +96,295 @@ def create_snapshots(
try:
crawl = Crawl.objects.get(id=crawl_id)
except Crawl.DoesNotExist:
# Crawl doesn't exist, create it
crawl = Crawl.from_jsonl(record, overrides={'created_by_id': created_by_id})
crawl = Crawl.from_json(record, overrides={'created_by_id': created_by_id})
else:
# No ID, create new crawl
crawl = Crawl.from_jsonl(record, overrides={'created_by_id': created_by_id})
crawl = Crawl.from_json(record, overrides={'created_by_id': created_by_id})
if not crawl:
continue
# Create snapshots for each URL in the crawl
for url in crawl.get_urls_list():
# Merge CLI tags with crawl tags
merged_tags = crawl.tags_str
if tag:
if merged_tags:
merged_tags = f"{merged_tags},{tag}"
else:
merged_tags = tag
merged_tags = f"{merged_tags},{tag}" if merged_tags else tag
snapshot_record = {
'url': url,
'tags': merged_tags,
'crawl_id': str(crawl.id),
'depth': 0,
'depth': depth,
'status': status,
}
snapshot = Snapshot.from_jsonl(snapshot_record, overrides={'created_by_id': created_by_id})
snapshot = Snapshot.from_json(snapshot_record, overrides={'created_by_id': created_by_id})
if snapshot:
created_snapshots.append(snapshot)
if not is_tty:
write_record(snapshot.to_jsonl())
write_record(snapshot.to_json())
elif record_type == TYPE_SNAPSHOT or record.get('url'):
# Input is a Snapshot or plain URL
# Add tags if provided via CLI
if tag and not record.get('tags'):
record['tags'] = tag
if status:
record['status'] = status
record['depth'] = record.get('depth', depth)
snapshot = Snapshot.from_jsonl(record, overrides={'created_by_id': created_by_id})
snapshot = Snapshot.from_json(record, overrides={'created_by_id': created_by_id})
if snapshot:
created_snapshots.append(snapshot)
if not is_tty:
write_record(snapshot.to_jsonl())
write_record(snapshot.to_json())
else:
# Pass-through: output records we don't handle
if not is_tty:
write_record(record)
pass_through_count += 1
except Exception as e:
rprint(f'[red]Error creating snapshot: {e}[/red]', file=sys.stderr)
continue
if not created_snapshots:
if pass_through_count > 0:
rprint(f'[dim]Passed through {pass_through_count} records, no new snapshots[/dim]', file=sys.stderr)
return 0
rprint('[red]No snapshots created[/red]', file=sys.stderr)
return 1
rprint(f'[green]Created {len(created_snapshots)} snapshots[/green]', file=sys.stderr)
# If TTY, show human-readable output
if is_tty:
for snapshot in created_snapshots:
rprint(f' [dim]{snapshot.id}[/dim] {snapshot.url[:60]}', file=sys.stderr)
# If --plugins is passed, create ArchiveResults and run the orchestrator
if plugins:
from archivebox.core.models import ArchiveResult
from archivebox.workers.orchestrator import Orchestrator
# Parse comma-separated plugins list
plugins_list = [p.strip() for p in plugins.split(',') if p.strip()]
# Create ArchiveResults for the specific plugins on each snapshot
for snapshot in created_snapshots:
for plugin_name in plugins_list:
result, created = ArchiveResult.objects.get_or_create(
snapshot=snapshot,
plugin=plugin_name,
defaults={
'status': ArchiveResult.StatusChoices.QUEUED,
'retry_at': timezone.now(),
}
)
if not created and result.status in [ArchiveResult.StatusChoices.FAILED, ArchiveResult.StatusChoices.SKIPPED]:
# Reset for retry
result.status = ArchiveResult.StatusChoices.QUEUED
result.retry_at = timezone.now()
result.save()
rprint(f'[blue]Running plugins: {plugins}...[/blue]', file=sys.stderr)
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.runloop()
return 0
def is_snapshot_id(value: str) -> bool:
"""Check if value looks like a Snapshot UUID."""
import re
uuid_pattern = re.compile(r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)
if not uuid_pattern.match(value):
return False
# Verify it's actually a Snapshot (not a Crawl or other object)
# =============================================================================
# LIST
# =============================================================================
def list_snapshots(
status: Optional[str] = None,
url__icontains: Optional[str] = None,
url__istartswith: Optional[str] = None,
tag: Optional[str] = None,
crawl_id: Optional[str] = None,
limit: Optional[int] = None,
) -> int:
"""
List Snapshots as JSONL with optional filters.
Exit codes:
0: Success (even if no results)
"""
from archivebox.misc.jsonl import write_record
from archivebox.core.models import Snapshot
return Snapshot.objects.filter(id=value).exists()
is_tty = sys.stdout.isatty()
queryset = Snapshot.objects.all().order_by('-created_at')
# Apply filters
filter_kwargs = {
'status': status,
'url__icontains': url__icontains,
'url__istartswith': url__istartswith,
'crawl_id': crawl_id,
}
queryset = apply_filters(queryset, filter_kwargs, limit=limit)
# Tag filter requires special handling (M2M)
if tag:
queryset = queryset.filter(tags__name__iexact=tag)
count = 0
for snapshot in queryset:
if is_tty:
status_color = {
'queued': 'yellow',
'started': 'blue',
'sealed': 'green',
}.get(snapshot.status, 'dim')
rprint(f'[{status_color}]{snapshot.status:8}[/{status_color}] [dim]{snapshot.id}[/dim] {snapshot.url[:60]}')
else:
write_record(snapshot.to_json())
count += 1
rprint(f'[dim]Listed {count} snapshots[/dim]', file=sys.stderr)
return 0
# =============================================================================
# UPDATE
# =============================================================================
def update_snapshots(
    status: Optional[str] = None,
    tag: Optional[str] = None,
) -> int:
    """
    Update Snapshots from stdin JSONL.
    Reads Snapshot records from stdin and applies updates.
    Uses PATCH semantics - only specified fields are updated.
    Exit codes:
        0: Success
        1: No input or error
    """
    from django.utils import timezone
    from archivebox.misc.jsonl import read_stdin, write_record
    from archivebox.core.models import Snapshot, Tag
    is_tty = sys.stdout.isatty()
    records = list(read_stdin())
    if not records:
        rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
        return 1
    updated_count = 0
    for record in records:
        snapshot_id = record.get('id')
        if not snapshot_id:
            continue
        try:
            snapshot = Snapshot.objects.get(id=snapshot_id)
        except Snapshot.DoesNotExist:
            rprint(f'[yellow]Snapshot not found: {snapshot_id}[/yellow]', file=sys.stderr)
            continue
        # Apply updates from CLI flags (override stdin values)
        if status:
            snapshot.status = status
            snapshot.retry_at = timezone.now()
        if tag:
            # Add tag to existing tags (save first so M2M writes are safe)
            snapshot.save()
            tag_obj, _ = Tag.objects.get_or_create(name=tag)
            snapshot.tags.add(tag_obj)
        snapshot.save()
        updated_count += 1
        if not is_tty:
            write_record(snapshot.to_json())
    rprint(f'[green]Updated {updated_count} snapshots[/green]', file=sys.stderr)
    return 0
# =============================================================================
# DELETE
# =============================================================================
def delete_snapshots(yes: bool = False, dry_run: bool = False) -> int:
"""
Delete Snapshots from stdin JSONL.
Requires --yes flag to confirm deletion.
Exit codes:
0: Success
1: No input or missing --yes flag
"""
from archivebox.misc.jsonl import read_stdin
from archivebox.core.models import Snapshot
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
snapshot_ids = [r.get('id') for r in records if r.get('id')]
if not snapshot_ids:
rprint('[yellow]No valid snapshot IDs in input[/yellow]', file=sys.stderr)
return 1
snapshots = Snapshot.objects.filter(id__in=snapshot_ids)
count = snapshots.count()
if count == 0:
rprint('[yellow]No matching snapshots found[/yellow]', file=sys.stderr)
return 0
if dry_run:
rprint(f'[yellow]Would delete {count} snapshots (dry run)[/yellow]', file=sys.stderr)
for snapshot in snapshots:
rprint(f' [dim]{snapshot.id}[/dim] {snapshot.url[:60]}', file=sys.stderr)
return 0
if not yes:
rprint('[red]Use --yes to confirm deletion[/red]', file=sys.stderr)
return 1
# Perform deletion
deleted_count, _ = snapshots.delete()
rprint(f'[green]Deleted {deleted_count} snapshots[/green]', file=sys.stderr)
return 0
# =============================================================================
# CLI Commands
# =============================================================================
@click.group()
def main():
"""Manage Snapshot records."""
pass
@main.command('create')
@click.argument('urls', nargs=-1)
@click.option('--tag', '-t', default='', help='Comma-separated tags to add')
@click.option('--status', '-s', default='queued', help='Initial status (default: queued)')
@click.option('--depth', '-d', type=int, default=0, help='Crawl depth (default: 0)')
def create_cmd(urls: tuple, tag: str, status: str, depth: int):
"""Create Snapshots from URLs or stdin JSONL."""
sys.exit(create_snapshots(urls, tag=tag, status=status, depth=depth))
@main.command('list')
@click.option('--status', '-s', help='Filter by status (queued, started, sealed)')
@click.option('--url__icontains', help='Filter by URL contains')
@click.option('--url__istartswith', help='Filter by URL starts with')
@click.option('--tag', '-t', help='Filter by tag name')
@click.option('--crawl-id', help='Filter by crawl ID')
@click.option('--limit', '-n', type=int, help='Limit number of results')
def list_cmd(status: Optional[str], url__icontains: Optional[str], url__istartswith: Optional[str],
tag: Optional[str], crawl_id: Optional[str], limit: Optional[int]):
"""List Snapshots as JSONL."""
sys.exit(list_snapshots(
status=status,
url__icontains=url__icontains,
url__istartswith=url__istartswith,
tag=tag,
crawl_id=crawl_id,
limit=limit,
))
@main.command('update')
@click.option('--status', '-s', help='Set status')
@click.option('--tag', '-t', help='Add tag')
def update_cmd(status: Optional[str], tag: Optional[str]):
"""Update Snapshots from stdin JSONL."""
sys.exit(update_snapshots(status=status, tag=tag))
@main.command('delete')
@click.option('--yes', '-y', is_flag=True, help='Confirm deletion')
@click.option('--dry-run', is_flag=True, help='Show what would be deleted')
def delete_cmd(yes: bool, dry_run: bool):
"""Delete Snapshots from stdin JSONL."""
sys.exit(delete_snapshots(yes=yes, dry_run=dry_run))
if __name__ == '__main__':
    main()
@@ -0,0 +1,293 @@
#!/usr/bin/env python3
"""
archivebox tag <action> [args...] [--filters]
Manage Tag records.
Actions:
create - Create Tags
list - List Tags as JSONL (with optional filters)
update - Update Tags from stdin JSONL
delete - Delete Tags from stdin JSONL
Examples:
# Create
archivebox tag create news tech science
archivebox tag create "important stuff"
# List
archivebox tag list
archivebox tag list --name__icontains=news
# Update (rename tags)
archivebox tag list --name=oldname | archivebox tag update --name=newname
# Delete
archivebox tag list --name=unused | archivebox tag delete --yes
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox tag'
import sys
from typing import Optional, Iterable
import rich_click as click
from rich import print as rprint
from archivebox.cli.cli_utils import apply_filters
# =============================================================================
# CREATE
# =============================================================================
def create_tags(names: Iterable[str]) -> int:
"""
Create Tags from names.
Exit codes:
0: Success
1: Failure
"""
from archivebox.misc.jsonl import write_record
from archivebox.core.models import Tag
is_tty = sys.stdout.isatty()
# Convert to list if needed
name_list = list(names) if names else []
if not name_list:
rprint('[yellow]No tag names provided. Pass names as arguments.[/yellow]', file=sys.stderr)
return 1
created_count = 0
for name in name_list:
name = name.strip()
if not name:
continue
tag, created = Tag.objects.get_or_create(name=name)
if not is_tty:
write_record(tag.to_json())
if created:
created_count += 1
rprint(f'[green]Created tag: {name}[/green]', file=sys.stderr)
else:
rprint(f'[dim]Tag already exists: {name}[/dim]', file=sys.stderr)
rprint(f'[green]Created {created_count} new tags[/green]', file=sys.stderr)
return 0
# =============================================================================
# LIST
# =============================================================================
def list_tags(
name: Optional[str] = None,
name__icontains: Optional[str] = None,
limit: Optional[int] = None,
) -> int:
"""
List Tags as JSONL with optional filters.
Exit codes:
0: Success (even if no results)
"""
from archivebox.misc.jsonl import write_record
from archivebox.core.models import Tag
is_tty = sys.stdout.isatty()
queryset = Tag.objects.all().order_by('name')
# Apply filters
filter_kwargs = {
'name': name,
'name__icontains': name__icontains,
}
queryset = apply_filters(queryset, filter_kwargs, limit=limit)
count = 0
for tag in queryset:
snapshot_count = tag.snapshot_set.count()
if is_tty:
rprint(f'[cyan]{tag.name:30}[/cyan] [dim]({snapshot_count} snapshots)[/dim]')
else:
write_record(tag.to_json())
count += 1
rprint(f'[dim]Listed {count} tags[/dim]', file=sys.stderr)
return 0
# =============================================================================
# UPDATE
# =============================================================================
def update_tags(name: Optional[str] = None) -> int:
"""
Update Tags from stdin JSONL.
Reads Tag records from stdin and applies updates.
Uses PATCH semantics - only specified fields are updated.
Exit codes:
0: Success
1: No input or error
"""
from archivebox.misc.jsonl import read_stdin, write_record
from archivebox.core.models import Tag
is_tty = sys.stdout.isatty()
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
updated_count = 0
for record in records:
tag_id = record.get('id')
old_name = record.get('name')
if not tag_id and not old_name:
continue
try:
if tag_id:
tag = Tag.objects.get(id=tag_id)
else:
tag = Tag.objects.get(name=old_name)
# Apply updates from CLI flags
if name:
tag.name = name
tag.save()
updated_count += 1
if not is_tty:
write_record(tag.to_json())
except Tag.DoesNotExist:
rprint(f'[yellow]Tag not found: {tag_id or old_name}[/yellow]', file=sys.stderr)
continue
rprint(f'[green]Updated {updated_count} tags[/green]', file=sys.stderr)
return 0
# =============================================================================
# DELETE
# =============================================================================
def delete_tags(yes: bool = False, dry_run: bool = False) -> int:
"""
Delete Tags from stdin JSONL.
Requires --yes flag to confirm deletion.
Exit codes:
0: Success
1: No input or missing --yes flag
"""
from archivebox.misc.jsonl import read_stdin
from archivebox.core.models import Tag
records = list(read_stdin())
if not records:
rprint('[yellow]No records provided via stdin[/yellow]', file=sys.stderr)
return 1
# Collect tag IDs or names
tag_ids = []
tag_names = []
for r in records:
if r.get('id'):
tag_ids.append(r['id'])
elif r.get('name'):
tag_names.append(r['name'])
if not tag_ids and not tag_names:
rprint('[yellow]No valid tag IDs or names in input[/yellow]', file=sys.stderr)
return 1
from django.db.models import Q
query = Q()
if tag_ids:
query |= Q(id__in=tag_ids)
if tag_names:
query |= Q(name__in=tag_names)
tags = Tag.objects.filter(query)
count = tags.count()
if count == 0:
rprint('[yellow]No matching tags found[/yellow]', file=sys.stderr)
return 0
if dry_run:
rprint(f'[yellow]Would delete {count} tags (dry run)[/yellow]', file=sys.stderr)
for tag in tags:
rprint(f' {tag.name}', file=sys.stderr)
return 0
if not yes:
rprint('[red]Use --yes to confirm deletion[/red]', file=sys.stderr)
return 1
# Perform deletion
deleted_count, _ = tags.delete()
rprint(f'[green]Deleted {deleted_count} tags[/green]', file=sys.stderr)
return 0
# =============================================================================
# CLI Commands
# =============================================================================
@click.group()
def main():
"""Manage Tag records."""
pass
@main.command('create')
@click.argument('names', nargs=-1)
def create_cmd(names: tuple):
"""Create Tags from names."""
sys.exit(create_tags(names))
@main.command('list')
@click.option('--name', help='Filter by exact name')
@click.option('--name__icontains', help='Filter by name contains')
@click.option('--limit', '-n', type=int, help='Limit number of results')
def list_cmd(name: Optional[str], name__icontains: Optional[str], limit: Optional[int]):
"""List Tags as JSONL."""
sys.exit(list_tags(name=name, name__icontains=name__icontains, limit=limit))
@main.command('update')
@click.option('--name', '-n', help='Set new name')
def update_cmd(name: Optional[str]):
"""Update Tags from stdin JSONL."""
sys.exit(update_tags(name=name))
@main.command('delete')
@click.option('--yes', '-y', is_flag=True, help='Confirm deletion')
@click.option('--dry-run', is_flag=True, help='Show what would be deleted')
def delete_cmd(yes: bool, dry_run: bool):
"""Delete Tags from stdin JSONL."""
sys.exit(delete_tags(yes=yes, dry_run=dry_run))
if __name__ == '__main__':
main()


@@ -0,0 +1,46 @@
"""
Shared CLI utilities for ArchiveBox commands.
This module contains common utilities used across multiple CLI commands,
extracted to avoid code duplication.
"""
__package__ = 'archivebox.cli'
from typing import Optional
def apply_filters(queryset, filter_kwargs: dict, limit: Optional[int] = None):
"""
Apply Django-style filters from CLI kwargs to a QuerySet.
Supports: --status=queued, --url__icontains=example, --id__in=uuid1,uuid2
Args:
queryset: Django QuerySet to filter
filter_kwargs: Dict of filter key-value pairs from CLI
limit: Optional limit on results
Returns:
Filtered QuerySet
Example:
queryset = Snapshot.objects.all()
filter_kwargs = {'status': 'queued', 'url__icontains': 'example.com'}
filtered = apply_filters(queryset, filter_kwargs, limit=10)
"""
filters = {}
for key, value in filter_kwargs.items():
if value is None or key in ('limit', 'offset'):
continue
# Handle CSV lists for __in filters
if key.endswith('__in') and isinstance(value, str):
value = [v.strip() for v in value.split(',')]
filters[key] = value
if filters:
queryset = queryset.filter(**filters)
if limit:
queryset = queryset[:limit]
return queryset
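The key-normalization step inside `apply_filters` (dropping `None` values and `limit`/`offset`, splitting `__in` CSVs into lists) can be exercised without Django; a minimal standalone mirror of that logic:

```python
def normalize_filters(filter_kwargs: dict) -> dict:
    """Mirror apply_filters' key handling, minus the QuerySet: drop Nones and
    limit/offset keys, split comma-separated values for __in lookups."""
    filters = {}
    for key, value in filter_kwargs.items():
        if value is None or key in ('limit', 'offset'):
            continue
        if key.endswith('__in') and isinstance(value, str):
            value = [v.strip() for v in value.split(',')]
        filters[key] = value
    return filters

print(normalize_filters({'status': 'queued', 'id__in': 'a1, b2', 'limit': 10}))
# prints: {'status': 'queued', 'id__in': ['a1', 'b2']}
```

The resulting dict is exactly what gets splatted into `queryset.filter(**filters)`, so any Django field lookup spelled as a CLI flag (`--url__icontains=...`, `--id__in=a,b`) works with no per-field code.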


@@ -1,17 +1,18 @@
#!/usr/bin/env python3
"""
Tests for CLI piping workflow: crawl | snapshot | extract
Tests for CLI piping workflow: crawl | snapshot | archiveresult | run
This module tests the JSONL-based piping between CLI commands as described in:
https://github.com/ArchiveBox/ArchiveBox/issues/1363
Workflows tested:
archivebox crawl URL -> Crawl JSONL
archivebox snapshot -> Snapshot JSONL (accepts Crawl or URL input)
archivebox extract -> ArchiveResult JSONL (accepts Snapshot input)
archivebox crawl create URL -> Crawl JSONL
archivebox snapshot create -> Snapshot JSONL (accepts Crawl or URL input)
archivebox archiveresult create -> ArchiveResult JSONL (accepts Snapshot input)
archivebox run -> Process queued records (accepts any JSONL)
Pipeline:
archivebox crawl URL | archivebox snapshot | archivebox extract
archivebox crawl create URL | archivebox snapshot create | archivebox archiveresult create | archivebox run
Each command should:
- Accept URLs, IDs, or JSONL as input (args or stdin)
@@ -154,13 +155,13 @@ class TestJSONLParsing(unittest.TestCase):
class TestJSONLOutput(unittest.TestCase):
"""Test JSONL output formatting."""
def test_crawl_to_jsonl(self):
"""Crawl model should serialize to JSONL correctly."""
def test_crawl_to_json(self):
"""Crawl model should serialize to JSON correctly."""
from archivebox.misc.jsonl import TYPE_CRAWL
# Create a mock crawl with to_jsonl method configured
# Create a mock crawl with to_json method configured
mock_crawl = MagicMock()
mock_crawl.to_jsonl.return_value = {
mock_crawl.to_json.return_value = {
'type': TYPE_CRAWL,
'schema_version': '0.9.0',
'id': 'test-crawl-uuid',
@@ -172,7 +173,7 @@ class TestJSONLOutput(unittest.TestCase):
'created_at': None,
}
result = mock_crawl.to_jsonl()
result = mock_crawl.to_json()
self.assertEqual(result['type'], TYPE_CRAWL)
self.assertEqual(result['id'], 'test-crawl-uuid')
self.assertEqual(result['urls'], 'https://example.com')
@@ -351,8 +352,8 @@ class TestSnapshotCommand(unittest.TestCase):
# using real Snapshot instances.
class TestExtractCommand(unittest.TestCase):
"""Unit tests for archivebox extract command."""
class TestArchiveResultCommand(unittest.TestCase):
"""Unit tests for archivebox archiveresult command."""
def setUp(self):
"""Set up test environment."""
@@ -363,8 +364,8 @@ class TestExtractCommand(unittest.TestCase):
"""Clean up test environment."""
shutil.rmtree(self.test_dir, ignore_errors=True)
def test_extract_accepts_snapshot_id(self):
"""extract should accept snapshot IDs as input."""
def test_archiveresult_accepts_snapshot_id(self):
"""archiveresult should accept snapshot IDs as input."""
from archivebox.misc.jsonl import read_args_or_stdin
uuid = '01234567-89ab-cdef-0123-456789abcdef'
@@ -374,8 +375,8 @@ class TestExtractCommand(unittest.TestCase):
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['id'], uuid)
def test_extract_accepts_jsonl_snapshot(self):
"""extract should accept JSONL Snapshot records."""
def test_archiveresult_accepts_jsonl_snapshot(self):
"""archiveresult should accept JSONL Snapshot records."""
from archivebox.misc.jsonl import read_args_or_stdin, TYPE_SNAPSHOT
stdin = StringIO('{"type": "Snapshot", "id": "abc123", "url": "https://example.com"}\n')
@@ -387,8 +388,8 @@ class TestExtractCommand(unittest.TestCase):
self.assertEqual(records[0]['type'], TYPE_SNAPSHOT)
self.assertEqual(records[0]['id'], 'abc123')
def test_extract_gathers_snapshot_ids(self):
"""extract should gather snapshot IDs from various input formats."""
def test_archiveresult_gathers_snapshot_ids(self):
"""archiveresult should gather snapshot IDs from various input formats."""
from archivebox.misc.jsonl import TYPE_SNAPSHOT, TYPE_ARCHIVERESULT
records = [
@@ -529,7 +530,7 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
# Create crawl with multiple URLs (as newline-separated string)
urls = 'https://test-crawl-1.example.com\nhttps://test-crawl-2.example.com'
crawl = Crawl.from_jsonl({'urls': urls}, overrides={'created_by_id': created_by_id})
crawl = Crawl.from_json({'urls': urls}, overrides={'created_by_id': created_by_id})
self.assertIsNotNone(crawl)
self.assertIsNotNone(crawl.id)
@@ -543,7 +544,7 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
self.assertIn('https://test-crawl-2.example.com', urls_list)
# Verify output format
output = crawl.to_jsonl()
output = crawl.to_json()
self.assertEqual(output['type'], TYPE_CRAWL)
self.assertIn('id', output)
self.assertEqual(output['urls'], urls)
@@ -566,8 +567,8 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
# Step 1: Create crawl (simulating 'archivebox crawl')
urls = 'https://crawl-to-snap-1.example.com\nhttps://crawl-to-snap-2.example.com'
crawl = Crawl.from_jsonl({'urls': urls}, overrides={'created_by_id': created_by_id})
crawl_output = crawl.to_jsonl()
crawl = Crawl.from_json({'urls': urls}, overrides={'created_by_id': created_by_id})
crawl_output = crawl.to_json()
# Step 2: Parse crawl output as snapshot input
stdin = StringIO(json.dumps(crawl_output) + '\n')
@@ -581,7 +582,7 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
# Step 3: Create snapshots from crawl URLs
created_snapshots = []
for url in crawl.get_urls_list():
snapshot = Snapshot.from_jsonl({'url': url}, overrides={'created_by_id': created_by_id})
snapshot = Snapshot.from_json({'url': url}, overrides={'created_by_id': created_by_id})
if snapshot:
created_snapshots.append(snapshot)
@@ -589,7 +590,7 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
# Verify snapshot output
for snapshot in created_snapshots:
output = snapshot.to_jsonl()
output = snapshot.to_json()
self.assertEqual(output['type'], TYPE_SNAPSHOT)
self.assertIn(output['url'], [
'https://crawl-to-snap-1.example.com',
@@ -619,13 +620,13 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
# Create snapshot
overrides = {'created_by_id': created_by_id}
snapshot = Snapshot.from_jsonl(records[0], overrides=overrides)
snapshot = Snapshot.from_json(records[0], overrides=overrides)
self.assertIsNotNone(snapshot.id)
self.assertEqual(snapshot.url, url)
# Verify output format
output = snapshot.to_jsonl()
output = snapshot.to_json()
self.assertEqual(output['type'], TYPE_SNAPSHOT)
self.assertIn('id', output)
self.assertEqual(output['url'], url)
@@ -647,8 +648,8 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
# Step 1: Create snapshot (simulating 'archivebox snapshot')
url = 'https://test-extract-1.example.com'
overrides = {'created_by_id': created_by_id}
snapshot = Snapshot.from_jsonl({'url': url}, overrides=overrides)
snapshot_output = snapshot.to_jsonl()
snapshot = Snapshot.from_json({'url': url}, overrides=overrides)
snapshot_output = snapshot.to_json()
# Step 2: Parse snapshot output as extract input
stdin = StringIO(json.dumps(snapshot_output) + '\n')
@@ -686,8 +687,8 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
# === archivebox crawl https://example.com ===
url = 'https://test-pipeline-full.example.com'
crawl = Crawl.from_jsonl({'url': url}, overrides={'created_by_id': created_by_id})
crawl_jsonl = json.dumps(crawl.to_jsonl())
crawl = Crawl.from_json({'url': url}, overrides={'created_by_id': created_by_id})
crawl_jsonl = json.dumps(crawl.to_json())
# === | archivebox snapshot ===
stdin = StringIO(crawl_jsonl + '\n')
@@ -705,7 +706,7 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
if crawl_id:
db_crawl = Crawl.objects.get(id=crawl_id)
for crawl_url in db_crawl.get_urls_list():
snapshot = Snapshot.from_jsonl({'url': crawl_url}, overrides={'created_by_id': created_by_id})
snapshot = Snapshot.from_json({'url': crawl_url}, overrides={'created_by_id': created_by_id})
if snapshot:
created_snapshots.append(snapshot)
@@ -713,7 +714,7 @@ class TestPipingWorkflowIntegration(unittest.TestCase):
self.assertEqual(created_snapshots[0].url, url)
# === | archivebox extract ===
snapshot_jsonl_lines = [json.dumps(s.to_jsonl()) for s in created_snapshots]
snapshot_jsonl_lines = [json.dumps(s.to_json()) for s in created_snapshots]
stdin = StringIO('\n'.join(snapshot_jsonl_lines) + '\n')
stdin.isatty = lambda: False
@@ -757,12 +758,12 @@ class TestDepthWorkflows(unittest.TestCase):
# Create crawl with depth 0
url = 'https://depth0-test.example.com'
crawl = Crawl.from_jsonl({'url': url, 'max_depth': 0}, overrides={'created_by_id': created_by_id})
crawl = Crawl.from_json({'url': url, 'max_depth': 0}, overrides={'created_by_id': created_by_id})
self.assertEqual(crawl.max_depth, 0)
# Create snapshot
snapshot = Snapshot.from_jsonl({'url': url}, overrides={'created_by_id': created_by_id})
snapshot = Snapshot.from_json({'url': url}, overrides={'created_by_id': created_by_id})
self.assertEqual(snapshot.url, url)
def test_depth_metadata_in_crawl(self):
@@ -773,7 +774,7 @@ class TestDepthWorkflows(unittest.TestCase):
created_by_id = get_or_create_system_user_pk()
# Create crawl with depth
crawl = Crawl.from_jsonl(
crawl = Crawl.from_json(
{'url': 'https://depth-meta-test.example.com', 'max_depth': 2},
overrides={'created_by_id': created_by_id}
)
@@ -781,7 +782,7 @@ class TestDepthWorkflows(unittest.TestCase):
self.assertEqual(crawl.max_depth, 2)
# Verify in JSONL output
output = crawl.to_jsonl()
output = crawl.to_json()
self.assertEqual(output['max_depth'], 2)
@@ -956,5 +957,129 @@ class TestEdgeCases(unittest.TestCase):
self.assertEqual(urls[2], 'https://url3.com')
# =============================================================================
# Pass-Through Behavior Tests
# =============================================================================
class TestPassThroughBehavior(unittest.TestCase):
"""Test pass-through behavior in CLI commands."""
def test_crawl_passes_through_other_types(self):
"""crawl create should pass through records with other types."""
from archivebox.misc.jsonl import TYPE_CRAWL
# Input: a Tag record (not a Crawl or URL)
tag_record = {'type': 'Tag', 'id': 'test-tag', 'name': 'example'}
url_record = {'url': 'https://example.com'}
# Mock stdin with both records
stdin = StringIO(
json.dumps(tag_record) + '\n' +
json.dumps(url_record)
)
stdin.isatty = lambda: False
# The Tag should be passed through, the URL should create a Crawl
# (This is a unit test of the pass-through logic)
from archivebox.misc.jsonl import read_args_or_stdin
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 2)
# First record is a Tag (other type)
self.assertEqual(records[0]['type'], 'Tag')
# Second record has a URL
self.assertIn('url', records[1])
def test_snapshot_passes_through_crawl(self):
"""snapshot create should pass through Crawl records."""
from archivebox.misc.jsonl import TYPE_CRAWL, TYPE_SNAPSHOT
crawl_record = {
'type': TYPE_CRAWL,
'id': 'test-crawl',
'urls': 'https://example.com',
}
# Crawl records should be passed through AND create snapshots
# This tests the accumulation behavior
self.assertEqual(crawl_record['type'], TYPE_CRAWL)
self.assertIn('urls', crawl_record)
def test_archiveresult_passes_through_snapshot(self):
"""archiveresult create should pass through Snapshot records."""
from archivebox.misc.jsonl import TYPE_SNAPSHOT
snapshot_record = {
'type': TYPE_SNAPSHOT,
'id': 'test-snapshot',
'url': 'https://example.com',
}
# Snapshot records should be passed through
self.assertEqual(snapshot_record['type'], TYPE_SNAPSHOT)
self.assertIn('url', snapshot_record)
def test_run_passes_through_unknown_types(self):
"""run should pass through records with unknown types."""
unknown_record = {'type': 'Unknown', 'id': 'test', 'data': 'value'}
# Unknown types should be passed through unchanged
self.assertEqual(unknown_record['type'], 'Unknown')
self.assertIn('data', unknown_record)
class TestPipelineAccumulation(unittest.TestCase):
"""Test that pipelines accumulate records correctly."""
def test_full_pipeline_output_types(self):
"""Full pipeline should output all record types."""
from archivebox.misc.jsonl import TYPE_CRAWL, TYPE_SNAPSHOT, TYPE_ARCHIVERESULT
# Simulated pipeline output after: crawl | snapshot | archiveresult | run
# Should contain Crawl, Snapshot, and ArchiveResult records
pipeline_output = [
{'type': TYPE_CRAWL, 'id': 'c1', 'urls': 'https://example.com'},
{'type': TYPE_SNAPSHOT, 'id': 's1', 'url': 'https://example.com'},
{'type': TYPE_ARCHIVERESULT, 'id': 'ar1', 'plugin': 'title'},
]
types = {r['type'] for r in pipeline_output}
self.assertIn(TYPE_CRAWL, types)
self.assertIn(TYPE_SNAPSHOT, types)
self.assertIn(TYPE_ARCHIVERESULT, types)
def test_pipeline_preserves_ids(self):
"""Pipeline should preserve record IDs through all stages."""
records = [
{'type': 'Crawl', 'id': 'c1', 'urls': 'https://example.com'},
{'type': 'Snapshot', 'id': 's1', 'url': 'https://example.com'},
]
# All records should have IDs
for record in records:
self.assertIn('id', record)
self.assertTrue(record['id'])
def test_jq_transform_pattern(self):
"""Test pattern for jq transforms in pipeline."""
# Simulated: archiveresult list --status=failed | jq 'del(.id) | .status = "queued"'
failed_record = {
'type': 'ArchiveResult',
'id': 'ar1',
'status': 'failed',
'plugin': 'wget',
}
# Transform: delete id, set status to queued
transformed = {
'type': failed_record['type'],
'status': 'queued',
'plugin': failed_record['plugin'],
}
self.assertNotIn('id', transformed)
self.assertEqual(transformed['status'], 'queued')
if __name__ == '__main__':
unittest.main()
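The jq transform checked by `test_jq_transform_pattern` — `del(.id) | .status = "queued"` — has a direct Python equivalent, handy where jq is unavailable (the record shape below mirrors the test fixture):

```python
import json

def requeue(record: dict) -> dict:
    """Equivalent of jq's `del(.id) | .status = "queued"`: dropping the id
    means the record is treated as new on re-ingest instead of an update."""
    out = {k: v for k, v in record.items() if k != 'id'}
    out['status'] = 'queued'
    return out

failed = {'type': 'ArchiveResult', 'id': 'ar1', 'status': 'failed', 'plugin': 'wget'}
print(json.dumps(requeue(failed)))
# prints: {"type": "ArchiveResult", "status": "queued", "plugin": "wget"}
```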


@@ -120,6 +120,7 @@ class BaseConfigSet(BaseSettings):
def get_config(
scope: str = "global",
defaults: Optional[Dict] = None,
persona: Any = None,
user: Any = None,
crawl: Any = None,
snapshot: Any = None,
@@ -131,14 +132,16 @@ def get_config(
1. Per-snapshot config (snapshot.config JSON field)
2. Per-crawl config (crawl.config JSON field)
3. Per-user config (user.config JSON field)
4. Environment variables
5. Config file (ArchiveBox.conf)
6. Plugin schema defaults (config.json)
7. Core config defaults
4. Per-persona config (persona.get_derived_config() - includes CHROME_USER_DATA_DIR etc.)
5. Environment variables
6. Config file (ArchiveBox.conf)
7. Plugin schema defaults (config.json)
8. Core config defaults
Args:
scope: Config scope ('global', 'crawl', 'snapshot', etc.)
defaults: Default values to start with
persona: Persona object (provides derived paths like CHROME_USER_DATA_DIR)
user: User object with config JSON field
crawl: Crawl object with config JSON field
snapshot: Snapshot object with config JSON field
@@ -205,6 +208,10 @@ def get_config(
except ImportError:
pass
# Apply persona config overrides (includes derived paths like CHROME_USER_DATA_DIR)
if persona and hasattr(persona, "get_derived_config"):
config.update(persona.get_derived_config())
# Apply user config overrides
if user and hasattr(user, "config") and user.config:
config.update(user.config)
@@ -213,6 +220,10 @@ def get_config(
if crawl and hasattr(crawl, "config") and crawl.config:
config.update(crawl.config)
# Add CRAWL_OUTPUT_DIR for snapshot hooks to find shared Chrome session
if crawl and hasattr(crawl, "OUTPUT_DIR"):
config['CRAWL_OUTPUT_DIR'] = str(crawl.OUTPUT_DIR)
# Apply snapshot config overrides (highest priority)
if snapshot and hasattr(snapshot, "config") and snapshot.config:
config.update(snapshot.config)
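The precedence chain above reduces to successive `dict.update()` calls applied lowest-priority-first, so later (more specific) layers win; a minimal sketch with illustrative layer names:

```python
def resolve_config(*layers) -> dict:
    """Merge config layers, lowest priority first; later layers override earlier ones.
    None or empty layers (e.g. a user with no config) are skipped."""
    config: dict = {}
    for layer in layers:
        if layer:
            config.update(layer)
    return config

# Illustrative layers, ordered defaults -> env -> crawl -> snapshot:
defaults = {'TIMEOUT': 60, 'SAVE_WGET': True}
env      = {'TIMEOUT': 120}
crawl    = {'SAVE_WGET': False}
snapshot = {'TIMEOUT': 30}

print(resolve_config(defaults, env, crawl, snapshot))
# prints: {'TIMEOUT': 30, 'SAVE_WGET': False}
```

This is also why the persona layer can safely inject derived paths like `CHROME_USER_DATA_DIR`: anything a crawl or snapshot sets explicitly still takes precedence.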


@@ -158,7 +158,7 @@ class AddLinkForm(forms.Form):
'search_backend_ripgrep', 'search_backend_sonic', 'search_backend_sqlite'
}
binary = {'apt', 'brew', 'custom', 'env', 'npm', 'pip'}
extensions = {'captcha2', 'istilldontcareaboutcookies', 'ublock'}
extensions = {'twocaptcha', 'istilldontcareaboutcookies', 'ublock'}
# Populate plugin field choices
self.fields['chrome_plugins'].choices = [


@@ -10,8 +10,8 @@ import archivebox.base_models.models
def cleanup_extra_columns(apps, schema_editor):
"""
Remove extra columns that were needed for v0.7.2/v0.8.6rc0 migration but don't exist in final models.
The actual models use @property methods to access these values from the process FK.
Create Process records from old cmd/pwd/cmd_version columns and remove those columns.
This preserves the execution details by moving them to the Process model.
"""
with schema_editor.connection.cursor() as cursor:
# Check if cmd column exists (means we came from v0.7.2/v0.8.6rc0)
@@ -19,8 +19,41 @@ def cleanup_extra_columns(apps, schema_editor):
has_cmd = cursor.fetchone()[0] > 0
if has_cmd:
print(" Cleaning up temporary columns from core_archiveresult...")
# Rebuild table without the extra columns
print(" Migrating cmd/pwd/cmd_version data to Process records...")
# For each ArchiveResult, create a Process record with cmd/pwd data
# Note: cmd_version from old schema is not preserved (it's now derived from Binary)
cursor.execute("""
SELECT id, cmd, pwd, binary_id, iface_id, start_ts, end_ts, status
FROM core_archiveresult
""")
archive_results = cursor.fetchall()
from archivebox.uuid_compat import uuid7
from archivebox.base_models.models import get_or_create_system_user_pk
cursor.execute("SELECT id FROM machine_machine LIMIT 1")
machine_id = cursor.fetchone()[0]
for ar_id, cmd, pwd, binary_id, iface_id, start_ts, end_ts, status in archive_results:
# Create Process record
process_id = str(uuid7())
cursor.execute("""
INSERT INTO machine_process (
id, created_at, modified_at,
machine_id, binary_id, iface_id,
pwd, cmd, env, timeout,
pid, exit_code, stdout, stderr,
started_at, ended_at, url, status, retry_at
) VALUES (?, datetime('now'), datetime('now'), ?, ?, ?, ?, ?, '{}', 120, NULL, NULL, '', '', ?, ?, '', ?, NULL)
""", (process_id, machine_id, binary_id, iface_id, pwd or '', cmd or '[]', start_ts, end_ts, status or 'queued'))
# Update ArchiveResult to point to new Process
cursor.execute("UPDATE core_archiveresult SET process_id = ? WHERE id = ?", (process_id, ar_id))
print(f" ✓ Created {len(archive_results)} Process records from ArchiveResult data")
# Now rebuild table without the extra columns
print(" Rebuilding core_archiveresult table...")
cursor.execute("""
CREATE TABLE core_archiveresult_final (
id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -48,14 +81,14 @@ def cleanup_extra_columns(apps, schema_editor):
num_uses_succeeded INTEGER NOT NULL DEFAULT 0,
num_uses_failed INTEGER NOT NULL DEFAULT 0,
process_id TEXT,
process_id TEXT NOT NULL,
FOREIGN KEY (snapshot_id) REFERENCES core_snapshot(id) ON DELETE CASCADE,
FOREIGN KEY (process_id) REFERENCES machine_process(id) ON DELETE RESTRICT
)
""")
# Copy data (cmd, pwd, etc. are now accessed via process FK)
# Copy data (cmd, pwd, etc. are now in Process records)
cursor.execute("""
INSERT INTO core_archiveresult_final SELECT
id, uuid, created_at, modified_at,


@@ -0,0 +1,108 @@
# Generated by Django 6.0 on 2025-12-31 09:04
import django.db.models.deletion
import django.utils.timezone
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0026_final_field_adjustments'),
('crawls', '0002_upgrade_to_0_9_0'),
('machine', '0001_initial'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='hook_name',
field=models.CharField(blank=True, db_index=True, default='', help_text='Full filename of the hook that executed (e.g., on_Snapshot__50_wget.py)', max_length=255),
),
migrations.AlterField(
model_name='archiveresult',
name='id',
field=models.AutoField(editable=False, primary_key=True, serialize=False),
),
migrations.AlterField(
model_name='archiveresult',
name='output_files',
field=models.JSONField(default=dict, help_text='Dict of {relative_path: {metadata}}'),
),
migrations.AlterField(
model_name='archiveresult',
name='output_json',
field=models.JSONField(blank=True, default=None, help_text='Structured metadata (headers, redirects, etc.)', null=True),
),
migrations.AlterField(
model_name='archiveresult',
name='output_mimetypes',
field=models.CharField(blank=True, default='', help_text='CSV of mimetypes sorted by size', max_length=512),
),
migrations.AlterField(
model_name='archiveresult',
name='output_size',
field=models.BigIntegerField(default=0, help_text='Total bytes of all output files'),
),
migrations.AlterField(
model_name='archiveresult',
name='output_str',
field=models.TextField(blank=True, default='', help_text='Human-readable output summary'),
),
migrations.AlterField(
model_name='archiveresult',
name='plugin',
field=models.CharField(db_index=True, default='', max_length=32),
),
migrations.AlterField(
model_name='archiveresult',
name='process',
field=models.OneToOneField(help_text='Process execution details for this archive result', on_delete=django.db.models.deletion.PROTECT, related_name='archiveresult', to='machine.process'),
),
migrations.AlterField(
model_name='archiveresult',
name='retry_at',
field=models.DateTimeField(blank=True, db_index=True, default=django.utils.timezone.now, null=True),
),
migrations.AlterField(
model_name='archiveresult',
name='status',
field=models.CharField(choices=[('queued', 'Queued'), ('started', 'Started'), ('backoff', 'Waiting to retry'), ('succeeded', 'Succeeded'), ('failed', 'Failed'), ('skipped', 'Skipped')], db_index=True, default='queued', max_length=15),
),
migrations.AlterField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(blank=True, db_index=True, default=uuid.uuid7, null=True),
),
migrations.AlterField(
model_name='snapshot',
name='config',
field=models.JSONField(default=dict),
),
migrations.AlterField(
model_name='snapshot',
name='crawl',
field=models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='snapshot_set', to='crawls.crawl'),
),
migrations.AlterField(
model_name='snapshot',
name='current_step',
field=models.PositiveSmallIntegerField(db_index=True, default=0, help_text='Current hook step being executed (0-9). Used for sequential hook execution.'),
),
migrations.AlterField(
model_name='snapshot',
name='depth',
field=models.PositiveSmallIntegerField(db_index=True, default=0),
),
migrations.AlterField(
model_name='snapshot',
name='id',
field=models.UUIDField(default=uuid.uuid7, editable=False, primary_key=True, serialize=False, unique=True),
),
migrations.AlterField(
model_name='snapshottag',
name='id',
field=models.AutoField(primary_key=True, serialize=False),
),
]

View File

@@ -144,7 +144,7 @@ class BinaryAdmin(BaseModelAdmin):
class ProcessAdmin(BaseModelAdmin):
list_display = ('id', 'created_at', 'machine_info', 'archiveresult_link', 'cmd_str', 'status', 'exit_code', 'pid', 'binary_info', 'health')
list_display = ('id', 'created_at', 'machine_info', 'archiveresult_link', 'cmd_str', 'status', 'exit_code', 'pid', 'binary_info')
sort_fields = ('id', 'created_at', 'status', 'exit_code', 'pid')
search_fields = ('id', 'machine__id', 'binary__name', 'cmd', 'pwd', 'stdout', 'stderr')
@@ -171,10 +171,6 @@ class ProcessAdmin(BaseModelAdmin):
'fields': ('stdout', 'stderr'),
'classes': ('card', 'wide', 'collapse'),
}),
('Usage', {
'fields': ('num_uses_succeeded', 'num_uses_failed'),
'classes': ('card',),
}),
('Timestamps', {
'fields': ('created_at', 'modified_at'),
'classes': ('card',),

View File

@@ -105,8 +105,6 @@ class Migration(migrations.Migration):
id TEXT PRIMARY KEY NOT NULL,
created_at DATETIME NOT NULL,
modified_at DATETIME NOT NULL,
num_uses_succeeded INTEGER NOT NULL DEFAULT 0,
num_uses_failed INTEGER NOT NULL DEFAULT 0,
machine_id TEXT NOT NULL,
binary_id TEXT,
@@ -234,8 +232,6 @@ class Migration(migrations.Migration):
('id', models.UUIDField(default=uuid7, editable=False, primary_key=True, serialize=False, unique=True)),
('created_at', models.DateTimeField(db_index=True, default=django.utils.timezone.now)),
('modified_at', models.DateTimeField(auto_now=True)),
('num_uses_succeeded', models.PositiveIntegerField(default=0)),
('num_uses_failed', models.PositiveIntegerField(default=0)),
('pwd', models.CharField(blank=True, default='', help_text='Working directory for process execution', max_length=512)),
('cmd', models.JSONField(blank=True, default=list, help_text='Command as array of arguments')),
('env', models.JSONField(blank=True, default=dict, help_text='Environment variables for process')),

View File

@@ -24,7 +24,7 @@ __package__ = 'archivebox.misc'
import sys
import json
from typing import Iterator, Dict, Any, Optional, TextIO, Callable
from typing import Iterator, Dict, Any, Optional, TextIO
from pathlib import Path
@@ -150,36 +150,3 @@ def write_records(records: Iterator[Dict[str, Any]], stream: Optional[TextIO] =
count += 1
return count
def filter_by_type(records: Iterator[Dict[str, Any]], record_type: str) -> Iterator[Dict[str, Any]]:
"""
Filter records by type.
"""
for record in records:
if record.get('type') == record_type:
yield record
def process_records(
records: Iterator[Dict[str, Any]],
handlers: Dict[str, Callable[[Dict[str, Any]], Optional[Dict[str, Any]]]]
) -> Iterator[Dict[str, Any]]:
"""
Process records through type-specific handlers.
Args:
records: Input record iterator
handlers: Dict mapping type names to handler functions
Handlers return output records or None to skip
Yields output records from handlers.
"""
for record in records:
record_type = record.get('type')
handler = handlers.get(record_type)
if handler:
result = handler(record)
if result:
yield result

View File

@@ -480,12 +480,39 @@ for url_str, num_urls in _test_url_strs.items():
def chrome_cleanup():
"""
Cleans up any state or runtime files that chrome leaves behind when killed by
a timeout or other error
Cleans up any state or runtime files that Chrome leaves behind when killed by
a timeout or other error. Handles:
- All persona chrome_user_data directories (via Persona.cleanup_chrome_all())
- Explicit CHROME_USER_DATA_DIR from config
- Legacy Docker chromium path
"""
import os
from pathlib import Path
from archivebox.config.permissions import IN_DOCKER
# Clean up all persona chrome directories using Persona class
try:
from archivebox.personas.models import Persona
# Clean up all personas
Persona.cleanup_chrome_all()
# Also clean up the active persona's explicit CHROME_USER_DATA_DIR if set
# (in case it's a custom path not under PERSONAS_DIR)
from archivebox.config.configset import get_config
config = get_config()
chrome_user_data_dir = config.get('CHROME_USER_DATA_DIR')
if chrome_user_data_dir:
singleton_lock = Path(chrome_user_data_dir) / 'SingletonLock'
if os.path.lexists(singleton_lock):
try:
singleton_lock.unlink()
except OSError:
pass
except Exception:
pass # Persona/config not available during early startup
# Legacy Docker cleanup (for backwards compatibility)
if IN_DOCKER:
singleton_lock = "/home/archivebox/.config/chromium/SingletonLock"
if os.path.lexists(singleton_lock):

View File

@@ -0,0 +1,29 @@
# Generated by Django 6.0 on 2025-12-31 09:06
import archivebox.base_models.models
import django.db.models.deletion
import django.utils.timezone
from django.conf import settings
from django.db import migrations, models
class Migration(migrations.Migration):
initial = True
dependencies = [
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
migrations.CreateModel(
name='Persona',
fields=[
('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('config', models.JSONField(blank=True, default=dict, null=True)),
('name', models.CharField(max_length=64, unique=True)),
('created_at', models.DateTimeField(db_index=True, default=django.utils.timezone.now)),
('created_by', models.ForeignKey(default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL)),
],
),
]

View File

@@ -1,59 +1,155 @@
# from django.db import models
"""
Persona management for ArchiveBox.
# from django.conf import settings
A Persona represents a browser profile/identity used for archiving.
Each persona has its own:
- Chrome user data directory (for cookies, localStorage, extensions, etc.)
- Chrome extensions directory
- Cookies file
- Config overrides
"""
__package__ = 'archivebox.personas'
from pathlib import Path
from typing import TYPE_CHECKING, Iterator
from django.db import models
from django.conf import settings
from django.utils import timezone
from archivebox.base_models.models import ModelWithConfig, get_or_create_system_user_pk
if TYPE_CHECKING:
from django.db.models import QuerySet
# class Persona(models.Model):
# """Aka a "SessionType", its a template for a crawler browsing session containing some config."""
class Persona(ModelWithConfig):
"""
Browser persona/profile for archiving sessions.
# id = models.UUIDField(primary_key=True, default=None, null=False, editable=False, unique=True, verbose_name='ID')
# created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE, default=None, null=False)
# created_at = AutoDateTimeField(default=None, null=False, db_index=True)
# modified_at = models.DateTimeField(auto_now=True)
# name = models.CharField(max_length=100, blank=False, null=False, editable=False)
# persona_dir = models.FilePathField(path=settings.PERSONAS_DIR, allow_files=False, allow_folders=True, blank=True, null=False, editable=False)
# config = models.JSONField(default=dict)
# # e.g. {
# # USER_AGENT: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
# # COOKIES_TXT_FILE: '/path/to/cookies.txt',
# # CHROME_USER_DATA_DIR: '/path/to/chrome/user/data/dir',
# # CHECK_SSL_VALIDITY: False,
# # SAVE_ARCHIVEDOTORG: True,
# # CHROME_BINARY: 'chromium'
# # ...
# # }
# # domain_allowlist = models.CharField(max_length=1024, blank=True, null=False, default='')
# # domain_denylist = models.CharField(max_length=1024, blank=True, null=False, default='')
# class Meta:
# app_label = 'personas'
# verbose_name = 'Session Type'
# verbose_name_plural = 'Session Types'
# unique_together = (('created_by', 'name'),)
Each persona provides:
- CHROME_USER_DATA_DIR: Chrome profile directory
- CHROME_EXTENSIONS_DIR: Installed extensions directory
- COOKIES_FILE: Cookies file for wget/curl
- config: JSON field with persona-specific config overrides
# def clean(self):
# self.persona_dir = settings.PERSONAS_DIR / self.name
# assert self.persona_dir == settings.PERSONAS_DIR / self.name, f'Persona dir {self.persona_dir} must match settings.PERSONAS_DIR / self.name'
# # make sure config keys all exist in FLAT_CONFIG
# # make sure config values all match expected types
# pass
# def save(self, *args, **kwargs):
# self.full_clean()
# # make sure basic file structure is present in persona_dir:
# # - PERSONAS_DIR / self.name /
# # - chrome_profile/
# # - chrome_downloads/
# # - chrome_extensions/
# # - cookies.txt
# # - auth.json
# # - config.json # json dump of the model
# super().save(*args, **kwargs)
Usage:
# Get persona and its derived config
config = get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot)
chrome_dir = config['CHROME_USER_DATA_DIR']
# Or access directly from persona
persona = Persona.objects.get(name='Default')
persona.CHROME_USER_DATA_DIR # -> Path to chrome_user_data
"""
name = models.CharField(max_length=64, unique=True)
created_at = models.DateTimeField(default=timezone.now, db_index=True)
created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE, default=get_or_create_system_user_pk)
class Meta:
app_label = 'personas'
def __str__(self) -> str:
return self.name
@property
def path(self) -> Path:
"""Path to persona directory under PERSONAS_DIR."""
from archivebox.config.constants import CONSTANTS
return CONSTANTS.PERSONAS_DIR / self.name
@property
def CHROME_USER_DATA_DIR(self) -> str:
"""Derived path to Chrome user data directory for this persona."""
return str(self.path / 'chrome_user_data')
@property
def CHROME_EXTENSIONS_DIR(self) -> str:
"""Derived path to Chrome extensions directory for this persona."""
return str(self.path / 'chrome_extensions')
@property
def COOKIES_FILE(self) -> str:
"""Derived path to cookies.txt file for this persona (if exists)."""
cookies_path = self.path / 'cookies.txt'
return str(cookies_path) if cookies_path.exists() else ''
def get_derived_config(self) -> dict:
"""
Get config dict with derived paths filled in.
Returns dict with:
- All values from self.config JSONField
- CHROME_USER_DATA_DIR (derived from persona path)
- CHROME_EXTENSIONS_DIR (derived from persona path)
- COOKIES_FILE (derived from persona path, if file exists)
- ACTIVE_PERSONA (set to this persona's name)
"""
derived = dict(self.config or {})
# Add derived paths (don't override if explicitly set in config)
if 'CHROME_USER_DATA_DIR' not in derived:
derived['CHROME_USER_DATA_DIR'] = self.CHROME_USER_DATA_DIR
if 'CHROME_EXTENSIONS_DIR' not in derived:
derived['CHROME_EXTENSIONS_DIR'] = self.CHROME_EXTENSIONS_DIR
if 'COOKIES_FILE' not in derived and self.COOKIES_FILE:
derived['COOKIES_FILE'] = self.COOKIES_FILE
# Always set ACTIVE_PERSONA to this persona's name
derived['ACTIVE_PERSONA'] = self.name
return derived
def ensure_dirs(self) -> None:
"""Create persona directories if they don't exist."""
self.path.mkdir(parents=True, exist_ok=True)
(self.path / 'chrome_user_data').mkdir(parents=True, exist_ok=True)
(self.path / 'chrome_extensions').mkdir(parents=True, exist_ok=True)
def cleanup_chrome(self) -> bool:
"""
Clean up Chrome state files (SingletonLock, etc.) for this persona.
Returns:
True if cleanup was performed, False if no cleanup needed
"""
cleaned = False
chrome_dir = self.path / 'chrome_user_data'
if not chrome_dir.exists():
return False
# Clean up SingletonLock files
for lock_file in chrome_dir.glob('**/SingletonLock'):
try:
lock_file.unlink()
cleaned = True
except OSError:
pass
# Clean up SingletonSocket files
for socket_file in chrome_dir.glob('**/SingletonSocket'):
try:
socket_file.unlink()
cleaned = True
except OSError:
pass
return cleaned
@classmethod
def get_or_create_default(cls) -> 'Persona':
"""Get or create the Default persona."""
persona, _ = cls.objects.get_or_create(name='Default')
return persona
@classmethod
def cleanup_chrome_all(cls) -> int:
"""Clean up Chrome state files for all personas."""
cleaned = 0
for persona in cls.objects.all():
if persona.cleanup_chrome():
cleaned += 1
return cleaned
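
The precedence rules in `get_derived_config()` — explicit values in the `config` JSONField win over derived filesystem paths, while `ACTIVE_PERSONA` is always stamped with the persona's name — can be sketched standalone. This is an illustrative sketch only: `merge_persona_config` and the `/data/personas` base path are stand-ins, not the real model method (which also derives `COOKIES_FILE` when the file exists):

```python
def merge_persona_config(config: dict, name: str, base_dir: str = '/data/personas') -> dict:
    # Explicit config values take precedence over derived paths (setdefault),
    # but ACTIVE_PERSONA always reflects the persona actually in use.
    derived = dict(config or {})
    derived.setdefault('CHROME_USER_DATA_DIR', f'{base_dir}/{name}/chrome_user_data')
    derived.setdefault('CHROME_EXTENSIONS_DIR', f'{base_dir}/{name}/chrome_extensions')
    derived['ACTIVE_PERSONA'] = name
    return derived
```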

View File

@@ -56,6 +56,40 @@ function getEnvInt(name, defaultValue = 0) {
return isNaN(val) ? defaultValue : val;
}
/**
* Get array environment variable (JSON array or comma-separated string).
*
* Parsing strategy:
* - If value starts with '[', parse as JSON array
* - Otherwise, parse as comma-separated values
*
* This prevents incorrect splitting of arguments that contain internal commas.
* For arguments with commas, use JSON format:
* CHROME_ARGS='["--user-data-dir=/path/with,comma", "--window-size=1440,900"]'
*
* @param {string} name - Environment variable name
* @param {string[]} [defaultValue=[]] - Default value if not set
* @returns {string[]} - Array of strings
*/
function getEnvArray(name, defaultValue = []) {
const val = getEnv(name, '');
if (!val) return defaultValue;
// If starts with '[', parse as JSON array
if (val.startsWith('[')) {
try {
const parsed = JSON.parse(val);
if (Array.isArray(parsed)) return parsed;
} catch (e) {
console.error(`[!] Failed to parse ${name} as JSON array: ${e.message}`);
// Fall through to comma-separated parsing
}
}
// Parse as comma-separated values
return val.split(',').map(s => s.trim()).filter(Boolean);
}
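
The same JSON-or-CSV parsing strategy, sketched in Python for reference. The function name `parse_env_array` is hypothetical; the canonical implementation is the `getEnvArray` helper above:

```python
import json

def parse_env_array(val: str, default=None) -> list:
    # Mirror of getEnvArray(): JSON array if the value starts with '[',
    # otherwise comma-separated values (trimmed, empties dropped).
    if not val:
        return default or []
    if val.startswith('['):
        try:
            parsed = json.loads(val)
            if isinstance(parsed, list):
                return parsed
        except ValueError:
            pass  # fall through to comma-separated parsing
    return [s.strip() for s in val.split(',') if s.strip()]
```

Note how the JSON form preserves arguments with internal commas, which naive CSV splitting would break apart.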
/**
* Parse resolution string into width/height.
* @param {string} resolution - Resolution string like "1440,2000"
@@ -169,86 +203,115 @@ function waitForDebugPort(port, timeout = 30000) {
/**
* Kill zombie Chrome processes from stale crawls.
* Scans DATA_DIR/crawls/<crawl_id>/chrome/<name>.pid for stale processes.
* Recursively scans DATA_DIR for any */chrome/*.pid files from stale crawls.
* Does not assume specific directory structure - works with nested paths.
* @param {string} [dataDir] - Data directory (defaults to DATA_DIR env or '.')
* @returns {number} - Number of zombies killed
*/
function killZombieChrome(dataDir = null) {
dataDir = dataDir || getEnv('DATA_DIR', '.');
const crawlsDir = path.join(dataDir, 'crawls');
const now = Date.now();
const fiveMinutesAgo = now - 300000;
let killed = 0;
console.error('[*] Checking for zombie Chrome processes...');
if (!fs.existsSync(crawlsDir)) {
console.error('[+] No crawls directory found');
if (!fs.existsSync(dataDir)) {
console.error('[+] No data directory found');
return 0;
}
/**
* Recursively find all chrome/*.pid files in directory tree
* @param {string} dir - Directory to search
* @param {number} depth - Current recursion depth (limit to 10)
* @returns {Array<{pidFile: string, crawlDir: string}>} - Array of PID file info
*/
function findChromePidFiles(dir, depth = 0) {
if (depth > 10) return []; // Prevent infinite recursion
const results = [];
try {
const entries = fs.readdirSync(dir, { withFileTypes: true });
for (const entry of entries) {
if (!entry.isDirectory()) continue;
const fullPath = path.join(dir, entry.name);
// Found a chrome directory - check for .pid files
if (entry.name === 'chrome') {
try {
const pidFiles = fs.readdirSync(fullPath).filter(f => f.endsWith('.pid'));
const crawlDir = dir; // Parent of chrome/ is the crawl dir
for (const pidFileName of pidFiles) {
results.push({
pidFile: path.join(fullPath, pidFileName),
crawlDir: crawlDir,
});
}
} catch (e) {
// Skip if can't read chrome dir
}
} else {
// Recurse into subdirectory (skip hidden dirs and node_modules)
if (!entry.name.startsWith('.') && entry.name !== 'node_modules') {
results.push(...findChromePidFiles(fullPath, depth + 1));
}
}
}
} catch (e) {
// Skip if can't read directory
}
return results;
}
try {
const crawls = fs.readdirSync(crawlsDir, { withFileTypes: true });
for (const crawl of crawls) {
if (!crawl.isDirectory()) continue;
const crawlDir = path.join(crawlsDir, crawl.name);
const chromeDir = path.join(crawlDir, 'chrome');
if (!fs.existsSync(chromeDir)) continue;
const chromePids = findChromePidFiles(dataDir);
for (const {pidFile, crawlDir} of chromePids) {
// Check if crawl was modified recently (still active)
try {
const crawlStats = fs.statSync(crawlDir);
if (crawlStats.mtimeMs > fiveMinutesAgo) {
continue;
continue; // Crawl is active, skip
}
} catch (e) {
continue;
}
// Crawl is stale, check for PIDs
// Crawl is stale, check PID
try {
const pidFiles = fs.readdirSync(chromeDir).filter(f => f.endsWith('.pid'));
const pid = parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
if (isNaN(pid) || pid <= 0) continue;
for (const pidFileName of pidFiles) {
const pidFile = path.join(chromeDir, pidFileName);
// Check if process exists
try {
process.kill(pid, 0);
} catch (e) {
// Process dead, remove stale PID file
try { fs.unlinkSync(pidFile); } catch (e) {}
continue;
}
try {
const pid = parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
if (isNaN(pid) || pid <= 0) continue;
// Process alive and crawl is stale - zombie!
console.error(`[!] Found zombie (PID ${pid}) from stale crawl ${path.basename(crawlDir)}`);
// Check if process exists
try {
process.kill(pid, 0);
} catch (e) {
// Process dead, remove stale PID file
try { fs.unlinkSync(pidFile); } catch (e) {}
continue;
}
// Process alive and crawl is stale - zombie!
console.error(`[!] Found zombie (PID ${pid}) from stale crawl ${crawl.name}`);
try {
try { process.kill(-pid, 'SIGKILL'); } catch (e) { process.kill(pid, 'SIGKILL'); }
killed++;
console.error(`[+] Killed zombie (PID ${pid})`);
try { fs.unlinkSync(pidFile); } catch (e) {}
} catch (e) {
console.error(`[!] Failed to kill PID ${pid}: ${e.message}`);
}
} catch (e) {
// Skip invalid PID files
}
try {
try { process.kill(-pid, 'SIGKILL'); } catch (e) { process.kill(pid, 'SIGKILL'); }
killed++;
console.error(`[+] Killed zombie (PID ${pid})`);
try { fs.unlinkSync(pidFile); } catch (e) {}
} catch (e) {
console.error(`[!] Failed to kill PID ${pid}: ${e.message}`);
}
} catch (e) {
// Skip if can't read chrome dir
// Skip invalid PID files
}
}
} catch (e) {
console.error(`[!] Error scanning crawls: ${e.message}`);
console.error(`[!] Error scanning for Chrome processes: ${e.message}`);
}
if (killed > 0) {
@@ -257,6 +320,31 @@ function killZombieChrome(dataDir = null) {
console.error('[+] No zombies found');
}
// Clean up stale SingletonLock files from persona chrome_user_data directories
const personasDir = path.join(dataDir, 'personas');
if (fs.existsSync(personasDir)) {
try {
const personas = fs.readdirSync(personasDir, { withFileTypes: true });
for (const persona of personas) {
if (!persona.isDirectory()) continue;
const userDataDir = path.join(personasDir, persona.name, 'chrome_user_data');
const singletonLock = path.join(userDataDir, 'SingletonLock');
if (fs.existsSync(singletonLock)) {
try {
fs.unlinkSync(singletonLock);
console.error(`[+] Removed stale SingletonLock: ${singletonLock}`);
} catch (e) {
// Ignore - may be in use by active Chrome
}
}
}
} catch (e) {
// Ignore errors scanning personas directory
}
}
return killed;
}
@@ -270,8 +358,10 @@ function killZombieChrome(dataDir = null) {
* @param {Object} options - Launch options
* @param {string} [options.binary] - Chrome binary path (auto-detected if not provided)
* @param {string} [options.outputDir='chrome'] - Directory for output files
* @param {string} [options.userDataDir] - Chrome user data directory for persistent sessions
* @param {string} [options.resolution='1440,2000'] - Window resolution
* @param {boolean} [options.headless=true] - Run in headless mode
* @param {boolean} [options.sandbox=true] - Enable Chrome sandbox
* @param {boolean} [options.checkSsl=true] - Check SSL certificates
* @param {string[]} [options.extensionPaths=[]] - Paths to unpacked extensions
* @param {boolean} [options.killZombies=true] - Kill zombie processes first
@@ -281,8 +371,10 @@ async function launchChromium(options = {}) {
const {
binary = findChromium(),
outputDir = 'chrome',
userDataDir = getEnv('CHROME_USER_DATA_DIR'),
resolution = getEnv('CHROME_RESOLUTION') || getEnv('RESOLUTION', '1440,2000'),
headless = getEnvBool('CHROME_HEADLESS', true),
sandbox = getEnvBool('CHROME_SANDBOX', true),
checkSsl = getEnvBool('CHROME_CHECK_SSL_VALIDITY', getEnvBool('CHECK_SSL_VALIDITY', true)),
extensionPaths = [],
killZombies = true,
@@ -304,41 +396,65 @@ async function launchChromium(options = {}) {
fs.mkdirSync(outputDir, { recursive: true });
}
// Create user data directory if specified and doesn't exist
if (userDataDir) {
if (!fs.existsSync(userDataDir)) {
fs.mkdirSync(userDataDir, { recursive: true });
console.error(`[*] Created user data directory: ${userDataDir}`);
}
// Clean up any stale SingletonLock file from previous crashed sessions
const singletonLock = path.join(userDataDir, 'SingletonLock');
if (fs.existsSync(singletonLock)) {
try {
fs.unlinkSync(singletonLock);
console.error(`[*] Removed stale SingletonLock: ${singletonLock}`);
} catch (e) {
console.error(`[!] Failed to remove SingletonLock: ${e.message}`);
}
}
}
// Find a free port
const debugPort = await findFreePort();
console.error(`[*] Using debug port: ${debugPort}`);
// Build Chrome arguments
const chromiumArgs = [
// Get base Chrome args from config (static flags from CHROME_ARGS env var)
// These come from config.json defaults, merged by get_config() in Python
const baseArgs = getEnvArray('CHROME_ARGS', []);
// Get extra user-provided args
const extraArgs = getEnvArray('CHROME_ARGS_EXTRA', []);
// Build dynamic Chrome arguments (these must be computed at runtime)
const dynamicArgs = [
// Remote debugging setup
`--remote-debugging-port=${debugPort}`,
'--remote-debugging-address=127.0.0.1',
'--no-sandbox',
'--disable-setuid-sandbox',
// Sandbox settings (disable in Docker)
...(sandbox ? [] : ['--no-sandbox', '--disable-setuid-sandbox']),
// Docker-specific workarounds
'--disable-dev-shm-usage',
'--disable-gpu',
'--disable-sync',
'--no-first-run',
'--no-default-browser-check',
'--disable-default-apps',
'--disable-infobars',
'--disable-blink-features=AutomationControlled',
'--disable-component-update',
'--disable-domain-reliability',
'--disable-breakpad',
'--disable-background-networking',
'--disable-background-timer-throttling',
'--disable-backgrounding-occluded-windows',
'--disable-renderer-backgrounding',
'--disable-ipc-flooding-protection',
'--password-store=basic',
'--use-mock-keychain',
'--font-render-hinting=none',
'--force-color-profile=srgb',
// Window size
`--window-size=${width},${height}`,
// User data directory (for persistent sessions with persona)
...(userDataDir ? [`--user-data-dir=${userDataDir}`] : []),
// Headless mode
...(headless ? ['--headless=new'] : []),
// SSL certificate checking
...(checkSsl ? [] : ['--ignore-certificate-errors']),
];
// Combine all args: base (from config) + dynamic (runtime) + extra (user overrides)
// Dynamic args come after base so they can override if needed
const chromiumArgs = [...baseArgs, ...dynamicArgs, ...extraArgs];
// Add extension loading flags
if (extensionPaths.length > 0) {
const extPathsArg = extensionPaths.join(',');
@@ -533,9 +649,9 @@ async function killChrome(pid, outputDir = null) {
}
// Step 8: Clean up PID files
// Note: hook-specific .pid files are cleaned up by run_hook() and Snapshot.cleanup()
if (outputDir) {
try { fs.unlinkSync(path.join(outputDir, 'chrome.pid')); } catch (e) {}
try { fs.unlinkSync(path.join(outputDir, 'hook.pid')); } catch (e) {}
}
console.error('[*] Chrome cleanup completed');
@@ -766,7 +882,8 @@ async function loadOrInstallExtension(ext, extensions_dir = null) {
}
// Determine extensions directory
const EXTENSIONS_DIR = extensions_dir || process.env.CHROME_EXTENSIONS_DIR || './data/chrome_extensions';
// Use provided dir, or fall back to getExtensionsDir() which handles env vars and defaults
const EXTENSIONS_DIR = extensions_dir || getExtensionsDir();
// Set statically computable extension metadata
ext.webstore_id = ext.webstore_id || ext.id;
@@ -1225,12 +1342,183 @@ function findChromium() {
return null;
}
// ============================================================================
// Shared Extension Installer Utilities
// ============================================================================
/**
* Get the extensions directory path.
* Centralized path calculation used by extension installers and chrome launch.
*
* Path is derived from environment variables in this priority:
* 1. CHROME_EXTENSIONS_DIR (explicit override)
* 2. DATA_DIR/personas/ACTIVE_PERSONA/chrome_extensions (default)
*
* @returns {string} - Absolute path to extensions directory
*/
function getExtensionsDir() {
const dataDir = getEnv('DATA_DIR', '.');
const persona = getEnv('ACTIVE_PERSONA', 'Default');
return getEnv('CHROME_EXTENSIONS_DIR') ||
path.join(dataDir, 'personas', persona, 'chrome_extensions');
}
/**
* Get machine type string for platform-specific paths.
* Matches Python's archivebox.config.paths.get_machine_type()
*
* @returns {string} - Machine type (e.g., 'x86_64-linux', 'arm64-darwin')
*/
function getMachineType() {
if (process.env.MACHINE_TYPE) {
return process.env.MACHINE_TYPE;
}
let machine = process.arch;
const system = process.platform;
// Normalize machine type to match Python's convention
if (machine === 'arm64' || machine === 'aarch64') {
machine = 'arm64';
} else if (machine === 'x64' || machine === 'x86_64' || machine === 'amd64') {
machine = 'x86_64';
} else if (machine === 'ia32' || machine === 'x86') {
machine = 'x86';
}
return `${machine}-${system}`;
}
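
The normalization table above, factored into a parametrized Python sketch (the `normalize_machine_type` name is illustrative; per the JSDoc, the canonical Python version lives in `archivebox.config.paths.get_machine_type`):

```python
def normalize_machine_type(machine: str, system: str) -> str:
    # Same alias table as getMachineType(): collapse arch spellings
    # to arm64 / x86_64 / x86, then join with the platform name.
    machine = machine.lower()
    if machine in ('arm64', 'aarch64'):
        machine = 'arm64'
    elif machine in ('x64', 'x86_64', 'amd64'):
        machine = 'x86_64'
    elif machine in ('ia32', 'x86'):
        machine = 'x86'
    return f'{machine}-{system}'
```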
/**
* Get LIB_DIR path for platform-specific binaries.
* Returns DATA_DIR/lib/MACHINE_TYPE/
*
* @returns {string} - Absolute path to lib directory
*/
function getLibDir() {
if (process.env.LIB_DIR) {
return process.env.LIB_DIR;
}
const dataDir = getEnv('DATA_DIR', './data');
const machineType = getMachineType();
return path.join(dataDir, 'lib', machineType);
}
/**
* Get NODE_MODULES_DIR path for npm packages.
* Returns LIB_DIR/npm/node_modules/
*
* @returns {string} - Absolute path to node_modules directory
*/
function getNodeModulesDir() {
if (process.env.NODE_MODULES_DIR) {
return process.env.NODE_MODULES_DIR;
}
return path.join(getLibDir(), 'npm', 'node_modules');
}
/**
* Get all test environment paths as a JSON object.
* This is the single source of truth for path calculations - Python calls this
* to avoid duplicating path logic.
*
* @returns {Object} - Object with all test environment paths
*/
function getTestEnv() {
const dataDir = getEnv('DATA_DIR', './data');
const machineType = getMachineType();
const libDir = getLibDir();
const nodeModulesDir = getNodeModulesDir();
return {
DATA_DIR: dataDir,
MACHINE_TYPE: machineType,
LIB_DIR: libDir,
NODE_MODULES_DIR: nodeModulesDir,
NPM_BIN_DIR: path.join(libDir, 'npm', '.bin'),
CHROME_EXTENSIONS_DIR: getExtensionsDir(),
};
}
/**
* Install a Chrome extension with caching support.
*
* This is the main entry point for extension installer hooks. It handles:
* - Checking for cached extension metadata
* - Installing the extension if not cached
* - Writing cache file for future runs
*
* @param {Object} extension - Extension metadata object
* @param {string} extension.webstore_id - Chrome Web Store extension ID
* @param {string} extension.name - Human-readable extension name (used for cache file)
* @param {Object} [options] - Options
* @param {string} [options.extensionsDir] - Override extensions directory
* @param {boolean} [options.quiet=false] - Suppress info logging
* @returns {Promise<Object|null>} - Installed extension metadata or null on failure
*/
async function installExtensionWithCache(extension, options = {}) {
const {
extensionsDir = getExtensionsDir(),
quiet = false,
} = options;
const cacheFile = path.join(extensionsDir, `${extension.name}.extension.json`);
// Check if extension is already cached and valid
if (fs.existsSync(cacheFile)) {
try {
const cached = JSON.parse(fs.readFileSync(cacheFile, 'utf-8'));
const manifestPath = path.join(cached.unpacked_path, 'manifest.json');
if (fs.existsSync(manifestPath)) {
if (!quiet) {
console.log(`[*] ${extension.name} extension already installed (using cache)`);
}
return cached;
}
} catch (e) {
// Cache file corrupted, re-install
console.warn(`[⚠️] Extension cache corrupted for ${extension.name}, re-installing...`);
}
}
// Install extension
if (!quiet) {
console.log(`[*] Installing ${extension.name} extension...`);
}
const installedExt = await loadOrInstallExtension(extension, extensionsDir);
if (!installedExt?.version) {
console.error(`[❌] Failed to install ${extension.name} extension`);
return null;
}
// Write cache file
try {
await fs.promises.mkdir(extensionsDir, { recursive: true });
await fs.promises.writeFile(cacheFile, JSON.stringify(installedExt, null, 2));
if (!quiet) {
console.log(`[+] Extension metadata written to ${cacheFile}`);
}
} catch (e) {
console.warn(`[⚠️] Failed to write cache file: ${e.message}`);
}
if (!quiet) {
console.log(`[+] ${extension.name} extension installed`);
}
return installedExt;
}
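
The cache-or-rebuild pattern used by `installExtensionWithCache()` generalizes cleanly; here is a minimal Python sketch of the same flow. All names here (`load_or_rebuild_cache`, `validate`, `rebuild`) are hypothetical, not part of the real API:

```python
import json
import os

def load_or_rebuild_cache(cache_file: str, validate, rebuild):
    # Reuse the cached record if it still validates, otherwise rebuild it
    # and rewrite the cache file. `validate`/`rebuild` are caller-supplied.
    if os.path.exists(cache_file):
        try:
            with open(cache_file) as f:
                cached = json.load(f)
            if validate(cached):
                return cached  # cache hit: skip the expensive rebuild
        except (ValueError, OSError):
            pass  # corrupted cache file: fall through and re-install
    record = rebuild()
    if record:
        os.makedirs(os.path.dirname(cache_file) or '.', exist_ok=True)
        with open(cache_file, 'w') as f:
            json.dump(record, f, indent=2)
    return record
```

As in the JS version, a corrupted cache is treated as a miss rather than an error, and a failed rebuild is returned as-is without writing a cache entry.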
// Export all functions
module.exports = {
// Environment helpers
getEnv,
getEnvBool,
getEnvInt,
getEnvArray,
parseResolution,
// PID file management
writePidWithMtime,
@@ -1261,6 +1549,14 @@ module.exports = {
getExtensionPaths,
waitForExtensionTarget,
getExtensionTargets,
// Shared path utilities (single source of truth for Python/JS)
getMachineType,
getLibDir,
getNodeModulesDir,
getExtensionsDir,
getTestEnv,
// Shared extension installer utilities
installExtensionWithCache,
// Deprecated - use enableExtensions option instead
getExtensionLaunchArgs,
};
@@ -1273,16 +1569,31 @@ if (require.main === module) {
console.log('Usage: chrome_utils.js <command> [args...]');
console.log('');
console.log('Commands:');
console.log(' findChromium');
console.log(' installChromium');
console.log(' installPuppeteerCore [npm_prefix]');
console.log(' launchChromium [output_dir] [extension_paths_json]');
console.log(' killChrome <pid> [output_dir]');
console.log(' killZombieChrome [data_dir]');
console.log(' getExtensionId <path>');
console.log(' loadExtensionManifest <path>');
console.log(' getExtensionLaunchArgs <extensions_json>');
console.log(' loadOrInstallExtension <webstore_id> <name> [extensions_dir]');
console.log(' findChromium Find Chrome/Chromium binary');
console.log(' installChromium Install Chromium via @puppeteer/browsers');
console.log(' installPuppeteerCore Install puppeteer-core npm package');
console.log(' launchChromium Launch Chrome with CDP debugging');
console.log(' killChrome <pid> Kill Chrome process by PID');
console.log(' killZombieChrome Clean up zombie Chrome processes');
console.log('');
console.log(' getMachineType Get machine type (e.g., x86_64-linux)');
console.log(' getLibDir Get LIB_DIR path');
console.log(' getNodeModulesDir Get NODE_MODULES_DIR path');
console.log(' getExtensionsDir Get Chrome extensions directory');
console.log(' getTestEnv Get all paths as JSON (for tests)');
console.log('');
console.log(' getExtensionId <path> Get extension ID from unpacked path');
console.log(' loadExtensionManifest Load extension manifest.json');
console.log(' loadOrInstallExtension Load or install an extension');
console.log(' installExtensionWithCache Install extension with caching');
console.log('');
console.log('Environment variables:');
console.log(' DATA_DIR Base data directory');
console.log(' LIB_DIR Library directory (computed if not set)');
console.log(' MACHINE_TYPE Machine type override');
console.log(' NODE_MODULES_DIR Node modules directory');
console.log(' CHROME_BINARY Chrome binary path');
console.log(' CHROME_EXTENSIONS_DIR Extensions directory');
process.exit(1);
}
@@ -1395,6 +1706,46 @@ if (require.main === module) {
break;
}
case 'getMachineType': {
console.log(getMachineType());
break;
}
case 'getLibDir': {
console.log(getLibDir());
break;
}
case 'getNodeModulesDir': {
console.log(getNodeModulesDir());
break;
}
case 'getExtensionsDir': {
console.log(getExtensionsDir());
break;
}
case 'getTestEnv': {
console.log(JSON.stringify(getTestEnv(), null, 2));
break;
}
case 'installExtensionWithCache': {
const [webstore_id, name] = commandArgs;
if (!webstore_id || !name) {
console.error('Usage: installExtensionWithCache <webstore_id> <name>');
process.exit(1);
}
const ext = await installExtensionWithCache({ webstore_id, name });
if (ext) {
console.log(JSON.stringify(ext, null, 2));
} else {
process.exit(1);
}
break;
}
default:
console.error(`Unknown command: ${command}`);
process.exit(1);

View File

@@ -42,7 +42,7 @@
"CHROME_USER_DATA_DIR": {
"type": "string",
"default": "",
"description": "Path to Chrome user data directory for persistent sessions"
"description": "Path to Chrome user data directory for persistent sessions (derived from ACTIVE_PERSONA if not set)"
},
"CHROME_USER_AGENT": {
"type": "string",
@@ -53,16 +53,74 @@
"CHROME_ARGS": {
"type": "array",
"items": {"type": "string"},
"default": [],
"default": [
"--no-first-run",
"--no-default-browser-check",
"--disable-default-apps",
"--disable-sync",
"--disable-infobars",
"--disable-blink-features=AutomationControlled",
"--disable-component-update",
"--disable-domain-reliability",
"--disable-breakpad",
"--disable-client-side-phishing-detection",
"--disable-hang-monitor",
"--disable-speech-synthesis-api",
"--disable-speech-api",
"--disable-print-preview",
"--disable-notifications",
"--disable-desktop-notifications",
"--disable-popup-blocking",
"--disable-prompt-on-repost",
"--disable-external-intent-requests",
"--disable-session-crashed-bubble",
"--disable-search-engine-choice-screen",
"--disable-datasaver-prompt",
"--ash-no-nudges",
"--hide-crash-restore-bubble",
"--suppress-message-center-popups",
"--noerrdialogs",
"--no-pings",
"--silent-debugger-extension-api",
"--deny-permission-prompts",
"--safebrowsing-disable-auto-update",
"--metrics-recording-only",
"--password-store=basic",
"--use-mock-keychain",
"--disable-cookie-encryption",
"--font-render-hinting=none",
"--force-color-profile=srgb",
"--disable-partial-raster",
"--disable-skia-runtime-opts",
"--disable-2d-canvas-clip-aa",
"--enable-webgl",
"--hide-scrollbars",
"--export-tagged-pdf",
"--generate-pdf-document-outline",
"--disable-lazy-loading",
"--disable-renderer-backgrounding",
"--disable-background-networking",
"--disable-background-timer-throttling",
"--disable-backgrounding-occluded-windows",
"--disable-ipc-flooding-protection",
"--disable-extensions-http-throttling",
"--disable-field-trial-config",
"--disable-back-forward-cache",
"--autoplay-policy=no-user-gesture-required",
"--disable-gesture-requirement-for-media-playback",
"--lang=en-US,en;q=0.9",
"--log-level=2",
"--enable-logging=stderr"
],
"x-aliases": ["CHROME_DEFAULT_ARGS"],
"description": "Default Chrome command-line arguments"
"description": "Default Chrome command-line arguments (static flags only, dynamic args like --user-data-dir are added at runtime)"
},
"CHROME_ARGS_EXTRA": {
"type": "array",
"items": {"type": "string"},
"default": [],
"x-aliases": ["CHROME_EXTRA_ARGS"],
"description": "Extra arguments to append to Chrome command"
"description": "Extra arguments to append to Chrome command (for user customization)"
},
"CHROME_PAGELOAD_TIMEOUT": {
"type": "integer",

View File

@@ -0,0 +1,265 @@
#!/usr/bin/env python3
"""
Install hook for Chrome/Chromium and puppeteer-core.
Runs at crawl start to install/find Chromium and puppeteer-core.
Also validates config and computes derived values.
Outputs:
- JSONL for Binary and Machine config updates
- COMPUTED:KEY=VALUE lines that hooks.py parses and adds to env
Respects CHROME_BINARY env var for custom binary paths.
Uses `npx @puppeteer/browsers install chromium@<pinned_version>` and parses its output.
NOTE: We use Chromium instead of Chrome because Chrome 137+ removed support for
--load-extension and --disable-extensions-except flags, which are needed for
loading unpacked extensions in headless mode.
"""
import os
import sys
import json
import subprocess
from pathlib import Path
def get_env(name: str, default: str = '') -> str:
return os.environ.get(name, default).strip()
def get_env_bool(name: str, default: bool = False) -> bool:
val = get_env(name, '').lower()
if val in ('true', '1', 'yes', 'on'):
return True
if val in ('false', '0', 'no', 'off'):
return False
return default
def detect_docker() -> bool:
"""Detect if running inside Docker container."""
return (
os.path.exists('/.dockerenv') or
os.environ.get('IN_DOCKER', '').lower() in ('true', '1', 'yes') or
os.path.exists('/run/.containerenv')
)
def get_chrome_version(binary_path: str) -> str | None:
"""Get Chrome/Chromium version string."""
try:
result = subprocess.run(
[binary_path, '--version'],
capture_output=True,
text=True,
timeout=5
)
if result.returncode == 0:
return result.stdout.strip()
except Exception:
pass
return None
def install_puppeteer_core() -> bool:
"""Install puppeteer-core to NODE_MODULES_DIR if not present."""
node_modules_dir = os.environ.get('NODE_MODULES_DIR', '').strip()
if not node_modules_dir:
# No isolated node_modules, skip (will use global)
return True
node_modules_path = Path(node_modules_dir)
if (node_modules_path / 'puppeteer-core').exists():
return True
# Get npm prefix from NODE_MODULES_DIR (parent of node_modules)
npm_prefix = node_modules_path.parent
try:
print(f"[*] Installing puppeteer-core to {npm_prefix}...", file=sys.stderr)
result = subprocess.run(
['npm', 'install', '--prefix', str(npm_prefix), 'puppeteer-core', '@puppeteer/browsers'],
capture_output=True,
text=True,
timeout=60
)
if result.returncode == 0:
print(f"[+] puppeteer-core installed", file=sys.stderr)
return True
else:
print(f"[!] Failed to install puppeteer-core: {result.stderr}", file=sys.stderr)
return False
except Exception as e:
print(f"[!] Failed to install puppeteer-core: {e}", file=sys.stderr)
return False
def install_chromium() -> dict | None:
"""Install Chromium using @puppeteer/browsers and parse output for binary path.
Output format: "chromium@<version> <path_to_binary>"
e.g.: "chromium@1563294 /Users/x/.cache/puppeteer/chromium/.../Chromium"
Note: npx is fast when chromium is already cached - it returns the path without re-downloading.
"""
try:
print("[*] Installing Chromium via @puppeteer/browsers...", file=sys.stderr)
# Use --path to install to puppeteer's standard cache location
cache_path = os.path.expanduser('~/.cache/puppeteer')
result = subprocess.run(
['npx', '@puppeteer/browsers', 'install', 'chromium@1563297', f'--path={cache_path}'],
capture_output=True,
text=True,
stdin=subprocess.DEVNULL,
timeout=300
)
if result.returncode != 0:
print(f"[!] Failed to install Chromium: {result.stderr}", file=sys.stderr)
return None
# Parse output: "chromium@1563294 /path/to/Chromium"
output = result.stdout.strip()
parts = output.split(' ', 1)
if len(parts) != 2:
print(f"[!] Failed to parse install output: {output}", file=sys.stderr)
return None
version_str = parts[0] # "chromium@1563294"
binary_path = parts[1].strip()
if not binary_path or not os.path.exists(binary_path):
print(f"[!] Binary not found at: {binary_path}", file=sys.stderr)
return None
# Extract version number
version = version_str.split('@')[1] if '@' in version_str else None
print(f"[+] Chromium installed: {binary_path}", file=sys.stderr)
return {
'name': 'chromium',
'abspath': binary_path,
'version': version,
'binprovider': 'puppeteer',
}
except subprocess.TimeoutExpired:
print("[!] Chromium install timed out", file=sys.stderr)
except FileNotFoundError:
print("[!] npx not found - is Node.js installed?", file=sys.stderr)
except Exception as e:
print(f"[!] Failed to install Chromium: {e}", file=sys.stderr)
return None
def main():
warnings = []
errors = []
computed = {}
# Install puppeteer-core if NODE_MODULES_DIR is set
install_puppeteer_core()
# Check if Chrome is enabled
chrome_enabled = get_env_bool('CHROME_ENABLED', True)
# Detect Docker and adjust sandbox
in_docker = detect_docker()
computed['IN_DOCKER'] = str(in_docker).lower()
chrome_sandbox = get_env_bool('CHROME_SANDBOX', True)
if in_docker and chrome_sandbox:
warnings.append(
"Running in Docker with CHROME_SANDBOX=true. "
"Chrome may fail to start. Consider setting CHROME_SANDBOX=false."
)
# Auto-disable sandbox in Docker unless explicitly set
if not get_env('CHROME_SANDBOX'):
computed['CHROME_SANDBOX'] = 'false'
# Check Node.js availability
node_binary = get_env('NODE_BINARY', 'node')
computed['NODE_BINARY'] = node_binary
# Check if CHROME_BINARY is already set and valid
configured_binary = get_env('CHROME_BINARY', '')
if configured_binary and os.path.isfile(configured_binary) and os.access(configured_binary, os.X_OK):
version = get_chrome_version(configured_binary)
computed['CHROME_BINARY'] = configured_binary
computed['CHROME_VERSION'] = version or 'unknown'
print(json.dumps({
'type': 'Binary',
'name': 'chromium',
'abspath': configured_binary,
'version': version,
'binprovider': 'env',
}))
# Output computed values
for key, value in computed.items():
print(f"COMPUTED:{key}={value}")
for warning in warnings:
print(f"WARNING:{warning}", file=sys.stderr)
sys.exit(0)
# Install/find Chromium via puppeteer
result = install_chromium()
if result and result.get('abspath'):
computed['CHROME_BINARY'] = result['abspath']
computed['CHROME_VERSION'] = result['version'] or 'unknown'
print(json.dumps({
'type': 'Binary',
'name': result['name'],
'abspath': result['abspath'],
'version': result['version'],
'binprovider': result['binprovider'],
}))
print(json.dumps({
'type': 'Machine',
'_method': 'update',
'key': 'config/CHROME_BINARY',
'value': result['abspath'],
}))
if result['version']:
print(json.dumps({
'type': 'Machine',
'_method': 'update',
'key': 'config/CHROMIUM_VERSION',
'value': result['version'],
}))
# Output computed values
for key, value in computed.items():
print(f"COMPUTED:{key}={value}")
for warning in warnings:
print(f"WARNING:{warning}", file=sys.stderr)
sys.exit(0)
else:
errors.append("Chromium binary not found")
computed['CHROME_BINARY'] = ''
# Output computed values and errors
for key, value in computed.items():
print(f"COMPUTED:{key}={value}")
for warning in warnings:
print(f"WARNING:{warning}", file=sys.stderr)
for error in errors:
print(f"ERROR:{error}", file=sys.stderr)
sys.exit(1)
if __name__ == '__main__':
main()
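The hook above emits two interleaved line formats on stdout: JSONL records (`{"type": "Binary", ...}`, `{"type": "Machine", "_method": "update", ...}`) and `COMPUTED:KEY=VALUE` lines that the caller folds into the hook environment. A minimal sketch of a consumer-side parser, assuming a hypothetical `parse_hook_stdout()` helper (the real parsing lives in hooks.py):

```python
import json

def parse_hook_stdout(stdout: str):
    """Split install-hook stdout into (jsonl_records, computed_env_updates)."""
    records, computed = [], {}
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('COMPUTED:'):
            key, _, value = line[len('COMPUTED:'):].partition('=')
            computed[key] = value
        elif line.startswith('{'):
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                pass  # ignore non-JSON stdout noise
    return records, computed
```

Keeping warnings and errors on stderr (as the hook does) is what makes this split workable: stdout stays a machine-readable channel.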

View File

@@ -0,0 +1,323 @@
#!/usr/bin/env node
/**
* Launch a shared Chromium browser session for the entire crawl.
*
* This runs once per crawl and keeps Chromium alive for all snapshots to share.
* Each snapshot creates its own tab via on_Snapshot__20_chrome_tab.bg.js.
*
* NOTE: We use Chromium instead of Chrome because Chrome 137+ removed support for
* --load-extension and --disable-extensions-except flags.
*
* Usage: on_Crawl__30_chrome_launch.bg.js --crawl-id=<uuid> --source-url=<url>
* Output: Writes to current directory (executor creates chrome/ dir):
* - cdp_url.txt: WebSocket URL for CDP connection
* - chrome.pid: Chromium process ID (for cleanup)
* - port.txt: Debug port number
* - extensions.json: Loaded extensions metadata
*
* Environment variables:
* NODE_MODULES_DIR: Path to node_modules directory for module resolution
* CHROME_BINARY: Path to Chromium binary (falls back to auto-detection)
* CHROME_RESOLUTION: Page resolution (default: 1440,2000)
* CHROME_HEADLESS: Run in headless mode (default: true)
* CHROME_CHECK_SSL_VALIDITY: Whether to check SSL certificates (default: true)
* CHROME_EXTENSIONS_DIR: Directory containing Chrome extensions
*/
// Add NODE_MODULES_DIR to module resolution paths if set
if (process.env.NODE_MODULES_DIR) {
module.paths.unshift(process.env.NODE_MODULES_DIR);
}
const fs = require('fs');
const path = require('path');
const puppeteer = require('puppeteer-core');
const {
findChromium,
launchChromium,
killChrome,
getEnv,
writePidWithMtime,
getExtensionsDir,
} = require('./chrome_utils.js');
// Extractor metadata
const PLUGIN_NAME = 'chrome_launch';
const OUTPUT_DIR = '.';
// Global state for cleanup
let chromePid = null;
let browserInstance = null;
// Parse command line arguments
function parseArgs() {
const args = {};
process.argv.slice(2).forEach((arg) => {
if (arg.startsWith('--')) {
const [key, ...valueParts] = arg.slice(2).split('=');
args[key.replace(/-/g, '_')] = valueParts.join('=') || true;
}
});
return args;
}
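`parseArgs()` above maps `--crawl-id=<uuid>` to `{crawl_id: '<uuid>'}`, treating bare flags as `true` and only splitting on the first `=`. For parity, a Python sketch of the same behavior (an illustration, not a helper that exists in the codebase):

```python
def parse_args(argv):
    """Parse --key=value flags like parseArgs() in the JS hook:
    dashes in keys become underscores, bare flags become True,
    and only the first '=' splits key from value."""
    args = {}
    for arg in argv:
        if arg.startswith('--'):
            key, _, value = arg[2:].partition('=')
            args[key.replace('-', '_')] = value if value else True
    return args
```

Splitting on only the first `=` matters for URL-valued flags like `--source-url=https://example.com?x=1`, which must keep their embedded `=` signs intact.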
// Cleanup handler for SIGTERM
async function cleanup() {
console.error('[*] Cleaning up Chrome session...');
// Try graceful browser close first
if (browserInstance) {
try {
console.error('[*] Closing browser gracefully...');
await browserInstance.close();
browserInstance = null;
console.error('[+] Browser closed gracefully');
} catch (e) {
console.error(`[!] Graceful close failed: ${e.message}`);
}
}
// Kill Chrome process
if (chromePid) {
await killChrome(chromePid, OUTPUT_DIR);
}
process.exit(0);
}
// Register signal handlers
process.on('SIGTERM', cleanup);
process.on('SIGINT', cleanup);
async function main() {
const args = parseArgs();
const crawlId = args.crawl_id;
try {
const binary = findChromium();
if (!binary) {
console.error('ERROR: Chromium binary not found');
console.error('DEPENDENCY_NEEDED=chromium');
console.error('BIN_PROVIDERS=puppeteer,env,playwright,apt,brew');
console.error('INSTALL_HINT=npx @puppeteer/browsers install chromium@latest');
process.exit(1);
}
// Get Chromium version
let version = '';
try {
const { execSync } = require('child_process');
version = execSync(`"${binary}" --version`, { encoding: 'utf8', timeout: 5000 })
.trim()
.slice(0, 64);
} catch (e) {}
console.error(`[*] Using browser: ${binary}`);
if (version) console.error(`[*] Version: ${version}`);
// Load installed extensions
const extensionsDir = getExtensionsDir();
const userDataDir = getEnv('CHROME_USER_DATA_DIR');
if (userDataDir) {
console.error(`[*] Using user data dir: ${userDataDir}`);
}
const installedExtensions = [];
const extensionPaths = [];
if (fs.existsSync(extensionsDir)) {
const files = fs.readdirSync(extensionsDir);
for (const file of files) {
if (file.endsWith('.extension.json')) {
try {
const extPath = path.join(extensionsDir, file);
const extData = JSON.parse(fs.readFileSync(extPath, 'utf-8'));
if (extData.unpacked_path && fs.existsSync(extData.unpacked_path)) {
installedExtensions.push(extData);
extensionPaths.push(extData.unpacked_path);
console.error(`[*] Loading extension: ${extData.name || file}`);
}
} catch (e) {
console.warn(`[!] Skipping invalid extension cache: ${file}`);
}
}
}
}
if (installedExtensions.length > 0) {
console.error(`[+] Found ${installedExtensions.length} extension(s) to load`);
}
// Note: PID file is written by run_hook() with hook-specific name
// Snapshot.cleanup() kills all *.pid processes when done
if (!fs.existsSync(OUTPUT_DIR)) {
fs.mkdirSync(OUTPUT_DIR, { recursive: true });
}
// Launch Chromium using consolidated function
// userDataDir is derived from ACTIVE_PERSONA by get_config() if not explicitly set
const result = await launchChromium({
binary,
outputDir: OUTPUT_DIR,
userDataDir,
extensionPaths,
});
if (!result.success) {
console.error(`ERROR: ${result.error}`);
process.exit(1);
}
chromePid = result.pid;
const cdpUrl = result.cdpUrl;
// Connect puppeteer for extension verification
console.error(`[*] Connecting puppeteer to CDP...`);
const browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl,
defaultViewport: null,
});
browserInstance = browser;
// Get actual extension IDs from chrome://extensions page
if (extensionPaths.length > 0) {
await new Promise(r => setTimeout(r, 2000));
try {
const extPage = await browser.newPage();
await extPage.goto('chrome://extensions', { waitUntil: 'domcontentloaded', timeout: 10000 });
await new Promise(r => setTimeout(r, 2000));
// Parse extension info from the page
const extensionsFromPage = await extPage.evaluate(() => {
const extensions = [];
// Extensions manager uses shadow DOM
const manager = document.querySelector('extensions-manager');
if (!manager || !manager.shadowRoot) return extensions;
const itemList = manager.shadowRoot.querySelector('extensions-item-list');
if (!itemList || !itemList.shadowRoot) return extensions;
const items = itemList.shadowRoot.querySelectorAll('extensions-item');
for (const item of items) {
const id = item.getAttribute('id');
const nameEl = item.shadowRoot?.querySelector('#name');
const name = nameEl?.textContent?.trim() || '';
if (id && name) {
extensions.push({ id, name });
}
}
return extensions;
});
console.error(`[*] Found ${extensionsFromPage.length} extension(s) on chrome://extensions`);
for (const e of extensionsFromPage) {
console.error(` - ${e.id}: "${e.name}"`);
}
// Match extensions by name (strict matching)
for (const ext of installedExtensions) {
// Read the extension's manifest to get its display name
const manifestPath = path.join(ext.unpacked_path, 'manifest.json');
if (fs.existsSync(manifestPath)) {
const manifest = JSON.parse(fs.readFileSync(manifestPath, 'utf-8'));
let manifestName = manifest.name || '';
// Resolve message placeholder (e.g., __MSG_extName__)
if (manifestName.startsWith('__MSG_') && manifestName.endsWith('__')) {
const msgKey = manifestName.slice(6, -2); // Extract key from __MSG_key__
const defaultLocale = manifest.default_locale || 'en';
const messagesPath = path.join(ext.unpacked_path, '_locales', defaultLocale, 'messages.json');
if (fs.existsSync(messagesPath)) {
try {
const messages = JSON.parse(fs.readFileSync(messagesPath, 'utf-8'));
if (messages[msgKey] && messages[msgKey].message) {
manifestName = messages[msgKey].message;
}
} catch (e) {
console.error(`[!] Failed to read messages.json: ${e.message}`);
}
}
}
console.error(`[*] Looking for match: ext.name="${ext.name}" manifest.name="${manifestName}"`);
// Find matching extension from page by exact name match first
let match = extensionsFromPage.find(e => e.name === manifestName);
// If no exact match, try case-insensitive exact match
if (!match) {
match = extensionsFromPage.find(e =>
e.name.toLowerCase() === manifestName.toLowerCase()
);
}
if (match) {
ext.id = match.id;
console.error(`[+] Matched extension: ${ext.name} (${manifestName}) -> ${match.id}`);
} else {
console.error(`[!] No match found for: ${ext.name} (${manifestName})`);
}
}
}
await extPage.close();
} catch (e) {
console.error(`[!] Failed to get extensions from chrome://extensions: ${e.message}`);
}
// Fallback: check browser targets
const targets = browser.targets();
const builtinIds = [
'nkeimhogjdpnpccoofpliimaahmaaome',
'fignfifoniblkonapihmkfakmlgkbkcf',
'ahfgeienlihckogmohjhadlkjgocpleb',
'mhjfbmdgcfjbbpaeojofohoefgiehjai',
];
const customExtTargets = targets.filter(t => {
const url = t.url();
if (!url.startsWith('chrome-extension://')) return false;
const extId = url.split('://')[1].split('/')[0];
return !builtinIds.includes(extId);
});
console.error(`[+] Found ${customExtTargets.length} custom extension target(s)`);
for (const target of customExtTargets) {
const url = target.url();
const extId = url.split('://')[1].split('/')[0];
console.error(`[+] Extension target: ${extId} (${target.type()})`);
}
if (customExtTargets.length === 0 && extensionPaths.length > 0) {
console.error(`[!] Warning: No custom extensions detected. Extension loading may have failed.`);
console.error(`[!] Make sure you are using Chromium, not Chrome (Chrome 137+ removed --load-extension support)`);
}
}
// Write extensions metadata with actual IDs
if (installedExtensions.length > 0) {
fs.writeFileSync(
path.join(OUTPUT_DIR, 'extensions.json'),
JSON.stringify(installedExtensions, null, 2)
);
}
console.error(`[+] Chromium session started for crawl ${crawlId}`);
console.error(`[+] CDP URL: ${cdpUrl}`);
console.error(`[+] PID: ${chromePid}`);
// Stay alive to handle cleanup on SIGTERM
console.log('[*] Chromium launch hook staying alive to handle cleanup...');
setInterval(() => {}, 1000000);
} catch (e) {
console.error(`ERROR: ${e.name}: ${e.message}`);
process.exit(1);
}
}
main().catch((e) => {
console.error(`Fatal error: ${e.message}`);
process.exit(1);
});
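The extension-matching logic above has one subtle step: manifest names like `__MSG_extName__` are i18n placeholders that must be resolved from `_locales/<default_locale>/messages.json` before name comparison. A Python sketch of that resolution, assuming a hypothetical `resolve_manifest_name()` helper:

```python
import json
from pathlib import Path

def resolve_manifest_name(unpacked_path):
    """Return an extension's display name, expanding __MSG_<key>__ placeholders
    from _locales/<default_locale>/messages.json (mirrors the JS matching logic)."""
    root = Path(unpacked_path)
    manifest = json.loads((root / 'manifest.json').read_text())
    name = manifest.get('name', '')
    if name.startswith('__MSG_') and name.endswith('__'):
        key = name[6:-2]  # extract key from __MSG_key__
        locale = manifest.get('default_locale', 'en')
        messages_path = root / '_locales' / locale / 'messages.json'
        if messages_path.exists():
            messages = json.loads(messages_path.read_text())
            name = messages.get(key, {}).get('message', name)
    return name
```

Without this step, exact-name matching against the `chrome://extensions` page would silently fail for any localized extension.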

View File

@@ -2,7 +2,7 @@
/**
* Create a Chrome tab for this snapshot in the shared crawl Chrome session.
*
* If a crawl-level Chrome session exists (from on_Crawl__20_chrome_launch.bg.js),
* If a crawl-level Chrome session exists (from on_Crawl__30_chrome_launch.bg.js),
* this connects to it and creates a new tab. Otherwise, falls back to launching
* its own Chrome instance.
*
@@ -89,7 +89,7 @@ process.on('SIGINT', cleanup);
function findCrawlChromeSession(crawlId) {
if (!crawlId) return null;
// Use CRAWL_OUTPUT_DIR env var set by hooks.py
// Use CRAWL_OUTPUT_DIR env var set by get_config() in configset.py
const crawlOutputDir = getEnv('CRAWL_OUTPUT_DIR', '');
if (!crawlOutputDir) return null;
@@ -215,7 +215,7 @@ async function launchNewChrome(url, binary) {
console.log(`[*] Launched Chrome (PID: ${chromePid}), waiting for debug port...`);
// Write PID immediately for cleanup
fs.writeFileSync(path.join(OUTPUT_DIR, 'pid.txt'), String(chromePid));
fs.writeFileSync(path.join(OUTPUT_DIR, 'chrome.pid'), String(chromePid));
try {
// Wait for Chrome to be ready

View File

@@ -0,0 +1,869 @@
"""
Shared Chrome test helpers for plugin integration tests.
This module provides common utilities for Chrome-based plugin tests, reducing
duplication across test files. Functions delegate to chrome_utils.js (the single
source of truth) with Python fallbacks.
Function names match the JS equivalents in snake_case:
JS: getMachineType() -> Python: get_machine_type()
JS: getLibDir() -> Python: get_lib_dir()
JS: getNodeModulesDir() -> Python: get_node_modules_dir()
JS: getExtensionsDir() -> Python: get_extensions_dir()
JS: findChromium() -> Python: find_chromium()
JS: killChrome() -> Python: kill_chrome()
JS: getTestEnv() -> Python: get_test_env()
Usage:
# Path helpers (delegate to chrome_utils.js):
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env, # env dict with LIB_DIR, NODE_MODULES_DIR, MACHINE_TYPE
get_machine_type, # e.g., 'x86_64-linux', 'arm64-darwin'
get_lib_dir, # Path to lib dir
get_node_modules_dir, # Path to node_modules
get_extensions_dir, # Path to chrome extensions
find_chromium, # Find Chrome/Chromium binary
kill_chrome, # Kill Chrome process by PID
)
# Test file helpers:
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_plugin_dir, # get_plugin_dir(__file__) -> plugin dir Path
get_hook_script, # Find hook script by glob pattern
PLUGINS_ROOT, # Path to plugins root
LIB_DIR, # Path to lib dir (lazy-loaded)
NODE_MODULES_DIR, # Path to node_modules (lazy-loaded)
)
# For Chrome session tests:
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
setup_chrome_session, # Full Chrome + tab setup
cleanup_chrome, # Cleanup by PID
chrome_session, # Context manager
)
# For extension tests:
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
setup_test_env, # Full dir structure + Chrome install
launch_chromium_session, # Launch Chrome, return CDP URL
kill_chromium_session, # Cleanup Chrome
)
# Run hooks and parse JSONL:
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
run_hook, # Run hook, return (returncode, stdout, stderr)
parse_jsonl_output, # Parse JSONL from stdout
)
"""
import json
import os
import platform
import signal
import subprocess
import time
from datetime import datetime
from pathlib import Path
from typing import Tuple, Optional, List, Dict, Any
from contextlib import contextmanager
# Plugin directory locations
CHROME_PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = CHROME_PLUGIN_DIR.parent
# Hook script locations
CHROME_INSTALL_HOOK = CHROME_PLUGIN_DIR / 'on_Crawl__00_install_puppeteer_chromium.py'
CHROME_LAUNCH_HOOK = CHROME_PLUGIN_DIR / 'on_Crawl__30_chrome_launch.bg.js'
CHROME_TAB_HOOK = CHROME_PLUGIN_DIR / 'on_Snapshot__20_chrome_tab.bg.js'
CHROME_NAVIGATE_HOOK = next(CHROME_PLUGIN_DIR.glob('on_Snapshot__*_chrome_navigate.*'), None)
CHROME_UTILS = CHROME_PLUGIN_DIR / 'chrome_utils.js'
# =============================================================================
# Path Helpers - delegates to chrome_utils.js with Python fallback
# Function names match JS: getMachineType -> get_machine_type, etc.
# =============================================================================
def _call_chrome_utils(command: str, *args: str, env: Optional[dict] = None) -> Tuple[int, str, str]:
"""Call chrome_utils.js CLI command (internal helper).
This is the central dispatch for calling the JS utilities from Python.
All path calculations and Chrome operations are centralized in chrome_utils.js
to ensure consistency between Python and JavaScript code.
Args:
command: The CLI command (e.g., 'findChromium', 'getTestEnv')
*args: Additional command arguments
env: Environment dict (default: current env)
Returns:
Tuple of (returncode, stdout, stderr)
"""
cmd = ['node', str(CHROME_UTILS), command] + list(args)
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=30,
env=env or os.environ.copy()
)
return result.returncode, result.stdout, result.stderr
def get_plugin_dir(test_file: str) -> Path:
"""Get the plugin directory from a test file path.
Usage:
PLUGIN_DIR = get_plugin_dir(__file__)
Args:
test_file: The __file__ of the test module (e.g., test_screenshot.py)
Returns:
Path to the plugin directory (e.g., plugins/screenshot/)
"""
return Path(test_file).parent.parent
def get_hook_script(plugin_dir: Path, pattern: str) -> Optional[Path]:
"""Find a hook script in a plugin directory by pattern.
Usage:
HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_screenshot.*')
Args:
plugin_dir: Path to the plugin directory
pattern: Glob pattern to match
Returns:
Path to the hook script or None if not found
"""
matches = list(plugin_dir.glob(pattern))
return matches[0] if matches else None
def get_machine_type() -> str:
"""Get machine type string (e.g., 'x86_64-linux', 'arm64-darwin').
Matches JS: getMachineType()
Tries chrome_utils.js first, falls back to Python computation.
"""
# Try JS first (single source of truth)
returncode, stdout, stderr = _call_chrome_utils('getMachineType')
if returncode == 0 and stdout.strip():
return stdout.strip()
# Fallback to Python computation
if os.environ.get('MACHINE_TYPE'):
return os.environ['MACHINE_TYPE']
machine = platform.machine().lower()
system = platform.system().lower()
if machine in ('arm64', 'aarch64'):
machine = 'arm64'
elif machine in ('x86_64', 'amd64'):
machine = 'x86_64'
return f"{machine}-{system}"
def get_lib_dir() -> Path:
"""Get LIB_DIR path for platform-specific binaries.
Matches JS: getLibDir()
Tries chrome_utils.js first, falls back to Python computation.
"""
# Try JS first
returncode, stdout, stderr = _call_chrome_utils('getLibDir')
if returncode == 0 and stdout.strip():
return Path(stdout.strip())
# Fallback to Python
if os.environ.get('LIB_DIR'):
return Path(os.environ['LIB_DIR'])
from archivebox.config.common import STORAGE_CONFIG
return Path(str(STORAGE_CONFIG.LIB_DIR))
def get_node_modules_dir() -> Path:
"""Get NODE_MODULES_DIR path for npm packages.
Matches JS: getNodeModulesDir()
Tries chrome_utils.js first, falls back to Python computation.
"""
# Try JS first
returncode, stdout, stderr = _call_chrome_utils('getNodeModulesDir')
if returncode == 0 and stdout.strip():
return Path(stdout.strip())
# Fallback to Python
if os.environ.get('NODE_MODULES_DIR'):
return Path(os.environ['NODE_MODULES_DIR'])
lib_dir = get_lib_dir()
return lib_dir / 'npm' / 'node_modules'
def get_extensions_dir() -> str:
"""Get the Chrome extensions directory path.
Matches JS: getExtensionsDir()
Tries chrome_utils.js first, falls back to Python computation.
"""
try:
returncode, stdout, stderr = _call_chrome_utils('getExtensionsDir')
if returncode == 0 and stdout.strip():
return stdout.strip()
except subprocess.TimeoutExpired:
pass # Fall through to default computation
# Fallback to default computation if JS call fails
data_dir = os.environ.get('DATA_DIR', '.')
persona = os.environ.get('ACTIVE_PERSONA', 'Default')
return str(Path(data_dir) / 'personas' / persona / 'chrome_extensions')
def find_chromium(data_dir: Optional[str] = None) -> Optional[str]:
"""Find the Chromium binary path.
Matches JS: findChromium()
Uses chrome_utils.js which checks:
- CHROME_BINARY env var
- @puppeteer/browsers install locations
- System Chromium locations
- Falls back to Chrome (with warning)
Args:
data_dir: Optional DATA_DIR override
Returns:
Path to Chromium binary or None if not found
"""
env = os.environ.copy()
if data_dir:
env['DATA_DIR'] = str(data_dir)
returncode, stdout, stderr = _call_chrome_utils('findChromium', env=env)
if returncode == 0 and stdout.strip():
return stdout.strip()
return None
def kill_chrome(pid: int, output_dir: Optional[str] = None) -> bool:
"""Kill a Chrome process by PID.
Matches JS: killChrome()
Uses chrome_utils.js which handles:
- SIGTERM then SIGKILL
- Process group killing
- Zombie process cleanup
Args:
pid: Process ID to kill
output_dir: Optional chrome output directory for PID file cleanup
Returns:
True if the kill command succeeded
"""
args = [str(pid)]
if output_dir:
args.append(str(output_dir))
returncode, stdout, stderr = _call_chrome_utils('killChrome', *args)
return returncode == 0
def get_test_env() -> dict:
"""Get environment dict with all paths set correctly for tests.
Matches JS: getTestEnv()
Tries chrome_utils.js first for path values, builds env dict.
Use this for all subprocess calls in plugin tests.
"""
env = os.environ.copy()
# Try to get all paths from JS (single source of truth)
returncode, stdout, stderr = _call_chrome_utils('getTestEnv')
if returncode == 0 and stdout.strip():
try:
js_env = json.loads(stdout)
env.update(js_env)
return env
except json.JSONDecodeError:
pass
# Fallback to Python computation
lib_dir = get_lib_dir()
env['LIB_DIR'] = str(lib_dir)
env['NODE_MODULES_DIR'] = str(get_node_modules_dir())
env['MACHINE_TYPE'] = get_machine_type()
return env
# Backward compatibility aliases (deprecated, use new names)
find_chromium_binary = find_chromium
kill_chrome_via_js = kill_chrome
get_machine_type_from_js = get_machine_type
get_test_env_from_js = get_test_env
# =============================================================================
# Module-level constants (lazy-loaded on first access)
# Import these directly: from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
# =============================================================================
# These are computed once when first accessed
_LIB_DIR: Optional[Path] = None
_NODE_MODULES_DIR: Optional[Path] = None
def _get_lib_dir_cached() -> Path:
global _LIB_DIR
if _LIB_DIR is None:
_LIB_DIR = get_lib_dir()
return _LIB_DIR
def _get_node_modules_dir_cached() -> Path:
global _NODE_MODULES_DIR
if _NODE_MODULES_DIR is None:
_NODE_MODULES_DIR = get_node_modules_dir()
return _NODE_MODULES_DIR
# Module-level constants that can be imported directly
# Usage: from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
class _LazyPath:
"""Lazy path that computes value on first access."""
def __init__(self, getter):
self._getter = getter
self._value = None
def __fspath__(self):
if self._value is None:
self._value = self._getter()
return str(self._value)
def __truediv__(self, other):
if self._value is None:
self._value = self._getter()
return self._value / other
def __str__(self):
return self.__fspath__()
def __repr__(self):
return f"<LazyPath: {self.__fspath__()}>"
LIB_DIR = _LazyPath(_get_lib_dir_cached)
NODE_MODULES_DIR = _LazyPath(_get_node_modules_dir_cached)
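A minimal standalone sketch of why the `_LazyPath` indirection above matters: the getter only runs on first path access, so merely importing the module never triggers subprocess or filesystem work. The getter, path, and counter here are illustrative stand-ins, not the real `get_lib_dir()`.

```python
from pathlib import PurePosixPath

calls = {'n': 0}

def expensive_getter() -> PurePosixPath:
    # Stand-in for get_lib_dir(): pretend this shells out to chrome_utils.js
    calls['n'] += 1
    return PurePosixPath('/tmp/lib/arm64-darwin')

class LazyPath:
    # Same shape as _LazyPath above: defer until __fspath__ or __truediv__
    def __init__(self, getter):
        self._getter = getter
        self._value = None
    def __fspath__(self):
        if self._value is None:
            self._value = self._getter()
        return str(self._value)
    def __truediv__(self, other):
        if self._value is None:
            self._value = self._getter()
        return self._value / other

LIB = LazyPath(expensive_getter)
assert calls['n'] == 0                      # nothing computed at import time
node_modules = LIB / 'npm' / 'node_modules'
assert calls['n'] == 1                      # computed exactly once, then cached
print(node_modules)  # /tmp/lib/arm64-darwin/npm/node_modules
```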
# =============================================================================
# Hook Execution Helpers
# =============================================================================
def run_hook(
hook_script: Path,
url: str,
snapshot_id: str,
cwd: Optional[Path] = None,
env: Optional[dict] = None,
timeout: int = 60,
extra_args: Optional[List[str]] = None,
) -> Tuple[int, str, str]:
"""Run a hook script and return (returncode, stdout, stderr).
Usage:
returncode, stdout, stderr = run_hook(
HOOK_SCRIPT, 'https://example.com', 'test-snap-123',
cwd=tmpdir, env=get_test_env()
)
Args:
hook_script: Path to the hook script
url: URL to process
snapshot_id: Snapshot ID
cwd: Working directory (default: current dir)
env: Environment dict (default: get_test_env())
timeout: Timeout in seconds
extra_args: Additional arguments to pass
Returns:
Tuple of (returncode, stdout, stderr)
"""
if env is None:
env = get_test_env()
# Determine interpreter based on file extension
if hook_script.suffix == '.py':
cmd = ['python', str(hook_script)]
elif hook_script.suffix == '.js':
cmd = ['node', str(hook_script)]
else:
cmd = [str(hook_script)]
cmd.extend([f'--url={url}', f'--snapshot-id={snapshot_id}'])
if extra_args:
cmd.extend(extra_args)
result = subprocess.run(
cmd,
cwd=str(cwd) if cwd else None,
capture_output=True,
text=True,
env=env,
timeout=timeout
)
return result.returncode, result.stdout, result.stderr
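The interpreter dispatch in `run_hook()` can be sketched in isolation; this duplicate of the command-building step (hook filename and IDs are made up for illustration) shows how the extension picks `python` vs `node` vs direct execution:

```python
from pathlib import Path

def build_cmd(hook_script: Path, url: str, snapshot_id: str) -> list:
    # Same dispatch as run_hook(): choose interpreter by file extension
    if hook_script.suffix == '.py':
        cmd = ['python', str(hook_script)]
    elif hook_script.suffix == '.js':
        cmd = ['node', str(hook_script)]
    else:
        cmd = [str(hook_script)]
    cmd.extend([f'--url={url}', f'--snapshot-id={snapshot_id}'])
    return cmd

print(build_cmd(Path('on_Snapshot__50_dom.js'), 'https://example.com', 'snap-1'))
# ['node', 'on_Snapshot__50_dom.js', '--url=https://example.com', '--snapshot-id=snap-1']
```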
def parse_jsonl_output(stdout: str, record_type: str = 'ArchiveResult') -> Optional[Dict[str, Any]]:
"""Parse JSONL output from hook stdout and return the specified record type.
Usage:
result = parse_jsonl_output(stdout)
if result and result['status'] == 'succeeded':
print("Success!")
Args:
stdout: The stdout from a hook execution
record_type: The 'type' field to look for (default: 'ArchiveResult')
Returns:
The parsed JSON dict or None if not found
"""
for line in stdout.strip().split('\n'):
line = line.strip()
if not line.startswith('{'):
continue
try:
record = json.loads(line)
if record.get('type') == record_type:
return record
except json.JSONDecodeError:
continue
return None
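For reference, a standalone copy of the scanning logic above (duplicated so it runs without this module): hook stdout routinely mixes log lines with JSONL records, and only the first record whose `type` matches is returned. The sample stdout is fabricated for illustration.

```python
import json
from typing import Any, Dict, Optional

def parse_first_record(stdout: str, record_type: str = 'ArchiveResult') -> Optional[Dict[str, Any]]:
    # Mirror of parse_jsonl_output(): skip non-JSON lines, return first matching record
    for line in stdout.strip().split('\n'):
        line = line.strip()
        if not line.startswith('{'):
            continue
        try:
            record = json.loads(line)
            if record.get('type') == record_type:
                return record
        except json.JSONDecodeError:
            continue
    return None

stdout = '\n'.join([
    '[*] starting hook...',
    '{"type": "Binary", "abspath": "/usr/bin/chromium"}',
    '{"type": "ArchiveResult", "status": "succeeded", "output": "console.jsonl"}',
])
print(parse_first_record(stdout)['status'])             # succeeded
print(parse_first_record(stdout, 'Binary')['abspath'])  # /usr/bin/chromium
```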
def run_hook_and_parse(
hook_script: Path,
url: str,
snapshot_id: str,
cwd: Optional[Path] = None,
env: Optional[dict] = None,
timeout: int = 60,
extra_args: Optional[List[str]] = None,
) -> Tuple[int, Optional[Dict[str, Any]], str]:
"""Run a hook and parse its JSONL output.
Convenience function combining run_hook() and parse_jsonl_output().
Returns:
Tuple of (returncode, parsed_result_or_none, stderr)
"""
returncode, stdout, stderr = run_hook(
hook_script, url, snapshot_id,
cwd=cwd, env=env, timeout=timeout, extra_args=extra_args
)
result = parse_jsonl_output(stdout)
return returncode, result, stderr
# =============================================================================
# Extension Test Helpers
# Used by extension tests (ublock, istilldontcareaboutcookies, twocaptcha)
# =============================================================================
def setup_test_env(tmpdir: Path) -> dict:
"""Set up isolated data/lib directory structure for extension tests.
Creates structure matching real ArchiveBox data dir:
<tmpdir>/data/
lib/
arm64-darwin/ (or x86_64-linux, etc.)
npm/
.bin/
node_modules/
personas/
Default/
chrome_extensions/
users/
testuser/
crawls/
snapshots/
Calls chrome install hook which handles puppeteer-core and chromium installation.
Returns env dict with DATA_DIR, LIB_DIR, NPM_BIN_DIR, NODE_MODULES_DIR, CHROME_BINARY, etc.
Args:
tmpdir: Base temporary directory for the test
Returns:
Environment dict with all paths set, or pytest.skip() if Chrome install fails
"""
import pytest
# Determine machine type (matches archivebox.config.paths.get_machine_type())
machine = platform.machine().lower()
system = platform.system().lower()
if machine in ('arm64', 'aarch64'):
machine = 'arm64'
elif machine in ('x86_64', 'amd64'):
machine = 'x86_64'
machine_type = f"{machine}-{system}"
# Create proper directory structure matching real ArchiveBox layout
data_dir = tmpdir / 'data'
lib_dir = data_dir / 'lib' / machine_type
npm_dir = lib_dir / 'npm'
npm_bin_dir = npm_dir / '.bin'
node_modules_dir = npm_dir / 'node_modules'
# Extensions go under personas/Default/
chrome_extensions_dir = data_dir / 'personas' / 'Default' / 'chrome_extensions'
# User data goes under users/{username}/
date_str = datetime.now().strftime('%Y%m%d')
users_dir = data_dir / 'users' / 'testuser'
crawls_dir = users_dir / 'crawls' / date_str
snapshots_dir = users_dir / 'snapshots' / date_str
# Create all directories
node_modules_dir.mkdir(parents=True, exist_ok=True)
npm_bin_dir.mkdir(parents=True, exist_ok=True)
chrome_extensions_dir.mkdir(parents=True, exist_ok=True)
crawls_dir.mkdir(parents=True, exist_ok=True)
snapshots_dir.mkdir(parents=True, exist_ok=True)
# Build complete env dict
env = os.environ.copy()
env.update({
'DATA_DIR': str(data_dir),
'LIB_DIR': str(lib_dir),
'MACHINE_TYPE': machine_type,
'NPM_BIN_DIR': str(npm_bin_dir),
'NODE_MODULES_DIR': str(node_modules_dir),
'CHROME_EXTENSIONS_DIR': str(chrome_extensions_dir),
'CRAWLS_DIR': str(crawls_dir),
'SNAPSHOTS_DIR': str(snapshots_dir),
})
# Only set headless if not already in environment (allow override for debugging)
if 'CHROME_HEADLESS' not in os.environ:
env['CHROME_HEADLESS'] = 'true'
# Call chrome install hook (installs puppeteer-core and chromium, outputs JSONL)
result = subprocess.run(
['python', str(CHROME_INSTALL_HOOK)],
capture_output=True, text=True, timeout=120, env=env
)
if result.returncode != 0:
pytest.skip(f"Chrome install hook failed: {result.stderr}")
# Parse JSONL output to get CHROME_BINARY
chrome_binary = None
for line in result.stdout.strip().split('\n'):
if not line.strip():
continue
try:
data = json.loads(line)
if data.get('type') == 'Binary' and data.get('abspath'):
chrome_binary = data['abspath']
break
except json.JSONDecodeError:
continue
if not chrome_binary or not Path(chrome_binary).exists():
pytest.skip(f"Chromium binary not found: {chrome_binary}")
env['CHROME_BINARY'] = chrome_binary
return env
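The machine-type string built inside `setup_test_env()` follows the `<arch>-<system>` convention used by the lib directory layout; a standalone sketch of that normalization:

```python
import platform

def machine_type() -> str:
    # Same normalization as setup_test_env(): arm64/aarch64 -> arm64, x86_64/amd64 -> x86_64
    machine = platform.machine().lower()
    system = platform.system().lower()
    if machine in ('arm64', 'aarch64'):
        machine = 'arm64'
    elif machine in ('x86_64', 'amd64'):
        machine = 'x86_64'
    return f"{machine}-{system}"

print(machine_type())  # e.g. 'x86_64-linux' or 'arm64-darwin'
```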
def launch_chromium_session(env: dict, chrome_dir: Path, crawl_id: str) -> Tuple[subprocess.Popen, str]:
"""Launch Chromium and return (process, cdp_url).
This launches Chrome using the chrome launch hook and waits for the CDP URL
to become available. Use this for extension tests that need direct CDP access.
Args:
env: Environment dict (from setup_test_env)
chrome_dir: Directory for Chrome to write its files (cdp_url.txt, chrome.pid, etc.)
crawl_id: ID for the crawl
Returns:
Tuple of (chrome_launch_process, cdp_url)
Raises:
RuntimeError: If Chrome fails to launch or CDP URL not available after 20s
"""
chrome_dir.mkdir(parents=True, exist_ok=True)
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), f'--crawl-id={crawl_id}'],
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env
)
# Wait for Chromium to launch and CDP URL to be available
cdp_url = None
for i in range(20):
if chrome_launch_process.poll() is not None:
stdout, stderr = chrome_launch_process.communicate()
raise RuntimeError(f"Chromium launch failed:\nStdout: {stdout}\nStderr: {stderr}")
cdp_file = chrome_dir / 'cdp_url.txt'
if cdp_file.exists():
cdp_url = cdp_file.read_text().strip()
break
time.sleep(1)
if not cdp_url:
chrome_launch_process.kill()
raise RuntimeError("Chromium CDP URL not found after 20s")
return chrome_launch_process, cdp_url
def kill_chromium_session(chrome_launch_process: subprocess.Popen, chrome_dir: Path) -> None:
"""Clean up Chromium process launched by launch_chromium_session.
Uses chrome_utils.js killChrome for proper process group handling.
Args:
chrome_launch_process: The Popen object from launch_chromium_session
chrome_dir: The chrome directory containing chrome.pid
"""
# First try to terminate the launch process gracefully
try:
chrome_launch_process.send_signal(signal.SIGTERM)
chrome_launch_process.wait(timeout=5)
except Exception:
pass
# Read PID and use JS to kill with proper cleanup
chrome_pid_file = chrome_dir / 'chrome.pid'
if chrome_pid_file.exists():
try:
chrome_pid = int(chrome_pid_file.read_text().strip())
kill_chrome(chrome_pid, str(chrome_dir))
except (ValueError, FileNotFoundError):
pass
@contextmanager
def chromium_session(env: dict, chrome_dir: Path, crawl_id: str):
"""Context manager for Chromium sessions with automatic cleanup.
Usage:
with chromium_session(env, chrome_dir, 'test-crawl') as (process, cdp_url):
# Use cdp_url to connect with puppeteer
pass
# Chromium automatically cleaned up
Args:
env: Environment dict (from setup_test_env)
chrome_dir: Directory for Chrome files
crawl_id: ID for the crawl
Yields:
Tuple of (chrome_launch_process, cdp_url)
"""
chrome_launch_process = None
try:
chrome_launch_process, cdp_url = launch_chromium_session(env, chrome_dir, crawl_id)
yield chrome_launch_process, cdp_url
finally:
if chrome_launch_process:
kill_chromium_session(chrome_launch_process, chrome_dir)
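The cleanup guarantee of `chromium_session` can be shown generically: the `finally` block runs whether the body returns or raises, which is what keeps tests from leaking Chrome processes. Everything here is an illustrative stand-in, no real browser is launched.

```python
from contextlib import contextmanager

events = []

@contextmanager
def fake_session(crawl_id: str):
    # Mirrors chromium_session(): launch in try, always clean up in finally
    events.append(f'launch:{crawl_id}')
    try:
        yield ('proc', 'ws://127.0.0.1:9222/devtools')
    finally:
        events.append(f'kill:{crawl_id}')

try:
    with fake_session('test-crawl') as (proc, cdp_url):
        events.append('work')
        raise RuntimeError('navigation failed')
except RuntimeError:
    pass

print(events)  # ['launch:test-crawl', 'work', 'kill:test-crawl']
```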
# =============================================================================
# Tab-based Test Helpers
# Used by tab-based tests (infiniscroll, modalcloser)
# =============================================================================
def setup_chrome_session(
tmpdir: Path,
crawl_id: str = 'test-crawl',
snapshot_id: str = 'test-snapshot',
test_url: str = 'about:blank',
navigate: bool = True,
timeout: int = 15,
) -> Tuple[subprocess.Popen, int, Path]:
"""Set up a Chrome session with tab and optional navigation.
Creates the directory structure, launches Chrome, creates a tab,
and optionally navigates to the test URL.
Args:
tmpdir: Temporary directory for test files
crawl_id: ID to use for the crawl
snapshot_id: ID to use for the snapshot
test_url: URL to navigate to (if navigate=True)
navigate: Whether to navigate to the URL after creating tab
timeout: Seconds to wait for Chrome to start
Returns:
Tuple of (chrome_launch_process, chrome_pid, snapshot_chrome_dir)
Raises:
RuntimeError: If Chrome fails to start or tab creation fails
"""
crawl_dir = Path(tmpdir) / 'crawl'
crawl_dir.mkdir(exist_ok=True)
chrome_dir = crawl_dir / 'chrome'
chrome_dir.mkdir(exist_ok=True)
env = get_test_env()
env['CHROME_HEADLESS'] = 'true'
# Launch Chrome at crawl level
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), f'--crawl-id={crawl_id}'],
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env
)
# Wait for Chrome to launch
for i in range(timeout):
if chrome_launch_process.poll() is not None:
stdout, stderr = chrome_launch_process.communicate()
raise RuntimeError(f"Chrome launch failed:\nStdout: {stdout}\nStderr: {stderr}")
if (chrome_dir / 'cdp_url.txt').exists():
break
time.sleep(1)
    if not (chrome_dir / 'cdp_url.txt').exists():
        chrome_launch_process.kill()  # don't leak the launch process on timeout
        raise RuntimeError(f"Chrome CDP URL not found after {timeout}s")
chrome_pid = int((chrome_dir / 'chrome.pid').read_text().strip())
# Create snapshot directory structure
snapshot_dir = Path(tmpdir) / 'snapshot'
snapshot_dir.mkdir(exist_ok=True)
snapshot_chrome_dir = snapshot_dir / 'chrome'
snapshot_chrome_dir.mkdir(exist_ok=True)
# Create tab
tab_env = env.copy()
tab_env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
try:
result = subprocess.run(
['node', str(CHROME_TAB_HOOK), f'--url={test_url}', f'--snapshot-id={snapshot_id}', f'--crawl-id={crawl_id}'],
cwd=str(snapshot_chrome_dir),
capture_output=True,
text=True,
timeout=60,
env=tab_env
)
if result.returncode != 0:
cleanup_chrome(chrome_launch_process, chrome_pid)
raise RuntimeError(f"Tab creation failed: {result.stderr}")
except subprocess.TimeoutExpired:
cleanup_chrome(chrome_launch_process, chrome_pid)
raise RuntimeError("Tab creation timed out after 60s")
# Navigate to URL if requested
if navigate and CHROME_NAVIGATE_HOOK and test_url != 'about:blank':
try:
result = subprocess.run(
['node', str(CHROME_NAVIGATE_HOOK), f'--url={test_url}', f'--snapshot-id={snapshot_id}'],
cwd=str(snapshot_chrome_dir),
capture_output=True,
text=True,
timeout=120,
env=env
)
if result.returncode != 0:
cleanup_chrome(chrome_launch_process, chrome_pid)
raise RuntimeError(f"Navigation failed: {result.stderr}")
except subprocess.TimeoutExpired:
cleanup_chrome(chrome_launch_process, chrome_pid)
raise RuntimeError("Navigation timed out after 120s")
return chrome_launch_process, chrome_pid, snapshot_chrome_dir
def cleanup_chrome(chrome_launch_process: subprocess.Popen, chrome_pid: int, chrome_dir: Optional[Path] = None) -> None:
"""Clean up Chrome processes using chrome_utils.js killChrome.
Uses the centralized kill logic from chrome_utils.js which handles:
- SIGTERM then SIGKILL
- Process group killing
- Zombie process cleanup
Args:
chrome_launch_process: The Popen object for the chrome launch hook
chrome_pid: The PID of the Chrome process
chrome_dir: Optional path to chrome output directory
"""
# First try to terminate the launch process gracefully
try:
chrome_launch_process.send_signal(signal.SIGTERM)
chrome_launch_process.wait(timeout=5)
except Exception:
pass
# Use JS to kill Chrome with proper process group handling
kill_chrome(chrome_pid, str(chrome_dir) if chrome_dir else None)
@contextmanager
def chrome_session(
tmpdir: Path,
crawl_id: str = 'test-crawl',
snapshot_id: str = 'test-snapshot',
test_url: str = 'about:blank',
navigate: bool = True,
timeout: int = 15,
):
"""Context manager for Chrome sessions with automatic cleanup.
Usage:
with chrome_session(tmpdir, test_url='https://example.com') as (process, pid, chrome_dir):
# Run tests with chrome session
pass
# Chrome automatically cleaned up
Args:
tmpdir: Temporary directory for test files
crawl_id: ID to use for the crawl
snapshot_id: ID to use for the snapshot
test_url: URL to navigate to (if navigate=True)
navigate: Whether to navigate to the URL after creating tab
timeout: Seconds to wait for Chrome to start
Yields:
Tuple of (chrome_launch_process, chrome_pid, snapshot_chrome_dir)
"""
chrome_launch_process = None
chrome_pid = None
try:
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
tmpdir=tmpdir,
crawl_id=crawl_id,
snapshot_id=snapshot_id,
test_url=test_url,
navigate=navigate,
timeout=timeout,
)
yield chrome_launch_process, chrome_pid, snapshot_chrome_dir
finally:
if chrome_launch_process and chrome_pid:
cleanup_chrome(chrome_launch_process, chrome_pid)

View File

@@ -28,70 +28,25 @@ import tempfile
import shutil
import platform
PLUGIN_DIR = Path(__file__).parent.parent
CHROME_LAUNCH_HOOK = PLUGIN_DIR / 'on_Crawl__20_chrome_launch.bg.js'
CHROME_TAB_HOOK = PLUGIN_DIR / 'on_Snapshot__20_chrome_tab.bg.js'
CHROME_NAVIGATE_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_chrome_navigate.*'), None)
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env,
get_lib_dir,
get_node_modules_dir,
find_chromium_binary,
CHROME_PLUGIN_DIR as PLUGIN_DIR,
CHROME_LAUNCH_HOOK,
CHROME_TAB_HOOK,
CHROME_NAVIGATE_HOOK,
)
# Get LIB_DIR and MACHINE_TYPE from environment or compute them
def get_lib_dir_and_machine_type():
"""Get or compute LIB_DIR and MACHINE_TYPE for tests."""
from archivebox.config.paths import get_machine_type
from archivebox.config.common import STORAGE_CONFIG
lib_dir = os.environ.get('LIB_DIR') or str(STORAGE_CONFIG.LIB_DIR)
machine_type = os.environ.get('MACHINE_TYPE') or get_machine_type()
return Path(lib_dir), machine_type
# Setup NODE_MODULES_DIR to find npm packages
LIB_DIR, MACHINE_TYPE = get_lib_dir_and_machine_type()
# Note: LIB_DIR already includes machine_type (e.g., data/lib/arm64-darwin)
NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
# Get LIB_DIR and NODE_MODULES_DIR from shared helpers
LIB_DIR = get_lib_dir()
NODE_MODULES_DIR = get_node_modules_dir()
NPM_PREFIX = LIB_DIR / 'npm'
# Chromium install location (relative to DATA_DIR)
CHROMIUM_INSTALL_DIR = Path(os.environ.get('DATA_DIR', '.')).resolve() / 'chromium'
def get_test_env():
"""Get environment with NODE_MODULES_DIR and CHROME_BINARY set correctly."""
env = os.environ.copy()
env['NODE_MODULES_DIR'] = str(NODE_MODULES_DIR)
env['LIB_DIR'] = str(LIB_DIR)
env['MACHINE_TYPE'] = MACHINE_TYPE
# Ensure CHROME_BINARY is set to Chromium
if 'CHROME_BINARY' not in env:
chromium = find_chromium_binary()
if chromium:
env['CHROME_BINARY'] = chromium
return env
def find_chromium_binary(data_dir=None):
"""Find the Chromium binary using chrome_utils.js findChromium().
This uses the centralized findChromium() function which checks:
- CHROME_BINARY env var
- @puppeteer/browsers install locations (in data_dir/chromium)
- System Chromium locations
- Falls back to Chrome (with warning)
Args:
data_dir: Directory where chromium was installed (contains chromium/ subdir)
"""
chrome_utils = PLUGIN_DIR / 'chrome_utils.js'
# Use provided data_dir, or fall back to env var, or current dir
search_dir = data_dir or os.environ.get('DATA_DIR', '.')
result = subprocess.run(
['node', str(chrome_utils), 'findChromium', str(search_dir)],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0 and result.stdout.strip():
return result.stdout.strip()
return None
@pytest.fixture(scope="session", autouse=True)
def ensure_chromium_and_puppeteer_installed():
@@ -176,6 +131,7 @@ def test_chrome_launch_and_tab_creation():
crawl_dir = Path(tmpdir) / 'crawl'
crawl_dir.mkdir()
chrome_dir = crawl_dir / 'chrome'
chrome_dir.mkdir()
# Get test environment with NODE_MODULES_DIR set
env = get_test_env()
@@ -184,7 +140,7 @@ def test_chrome_launch_and_tab_creation():
# Launch Chrome at crawl level (background process)
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-crawl-123'],
cwd=str(crawl_dir),
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
@@ -292,7 +248,7 @@ def test_chrome_navigation():
# Launch Chrome (background process)
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-crawl-nav'],
cwd=str(crawl_dir),
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
@@ -363,7 +319,7 @@ def test_tab_cleanup_on_sigterm():
# Launch Chrome (background process)
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-cleanup'],
cwd=str(crawl_dir),
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
@@ -423,11 +379,12 @@ def test_multiple_snapshots_share_chrome():
crawl_dir = Path(tmpdir) / 'crawl'
crawl_dir.mkdir()
chrome_dir = crawl_dir / 'chrome'
chrome_dir.mkdir()
# Launch Chrome at crawl level
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-multi-crawl'],
cwd=str(crawl_dir),
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
@@ -513,7 +470,7 @@ def test_chrome_cleanup_on_crawl_end():
# Launch Chrome in background
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-crawl-end'],
cwd=str(crawl_dir),
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
@@ -554,11 +511,12 @@ def test_zombie_prevention_hook_killed():
crawl_dir = Path(tmpdir) / 'crawl'
crawl_dir.mkdir()
chrome_dir = crawl_dir / 'chrome'
chrome_dir.mkdir()
# Launch Chrome
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-zombie'],
cwd=str(crawl_dir),
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,

View File

@@ -19,7 +19,7 @@ const puppeteer = require('puppeteer-core');
const PLUGIN_NAME = 'consolelog';
const OUTPUT_DIR = '.';
const OUTPUT_FILE = 'console.jsonl';
const PID_FILE = 'hook.pid';
// PID file is now written by run_hook() with hook-specific name
const CHROME_SESSION_DIR = '../chrome';
function parseArgs() {
@@ -221,8 +221,8 @@ async function main() {
// Set up listeners BEFORE navigation
await setupListeners();
// Write PID file so chrome_cleanup can kill any remaining processes
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Note: PID file is written by run_hook() with hook-specific name
// Snapshot.cleanup() kills all *.pid processes when done
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();

View File

@@ -20,29 +20,22 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env,
get_plugin_dir,
get_hook_script,
run_hook_and_parse,
LIB_DIR,
NODE_MODULES_DIR,
PLUGINS_ROOT,
)
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
DOM_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_dom.*'), None)
NPM_PROVIDER_HOOK = next((PLUGINS_ROOT / 'npm').glob('on_Binary__install_using_npm_provider.py'), None)
PLUGIN_DIR = get_plugin_dir(__file__)
DOM_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_dom.*')
NPM_PROVIDER_HOOK = get_hook_script(PLUGINS_ROOT / 'npm', 'on_Binary__install_using_npm_provider.py')
TEST_URL = 'https://example.com'
# Get LIB_DIR for NODE_MODULES_DIR
def get_lib_dir():
"""Get LIB_DIR for tests."""
from archivebox.config.common import STORAGE_CONFIG
return Path(os.environ.get('LIB_DIR') or str(STORAGE_CONFIG.LIB_DIR))
LIB_DIR = get_lib_dir()
NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
def get_test_env():
"""Get environment with NODE_MODULES_DIR set correctly."""
env = os.environ.copy()
env['NODE_MODULES_DIR'] = str(NODE_MODULES_DIR)
env['LIB_DIR'] = str(LIB_DIR)
return env
def test_hook_script_exists():
"""Verify on_Snapshot hook exists."""

View File

@@ -2,7 +2,6 @@
Integration tests for favicon plugin
Tests verify:
pass
1. Plugin script exists
2. requests library is available
3. Favicon extraction works for real example.com
@@ -21,9 +20,15 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_plugin_dir,
get_hook_script,
parse_jsonl_output,
)
PLUGIN_DIR = Path(__file__).parent.parent
FAVICON_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_favicon.*'), None)
PLUGIN_DIR = get_plugin_dir(__file__)
FAVICON_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_favicon.*')
TEST_URL = 'https://example.com'

View File

@@ -14,7 +14,6 @@ Tests verify:
import json
import os
import re
import signal
import subprocess
import time
import tempfile
@@ -22,37 +21,19 @@ from pathlib import Path
import pytest
# Import shared Chrome test helpers
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env,
setup_chrome_session,
cleanup_chrome,
)
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
INFINISCROLL_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_infiniscroll.*'), None)
CHROME_LAUNCH_HOOK = PLUGINS_ROOT / 'chrome' / 'on_Crawl__20_chrome_launch.bg.js'
CHROME_TAB_HOOK = PLUGINS_ROOT / 'chrome' / 'on_Snapshot__20_chrome_tab.bg.js'
CHROME_NAVIGATE_HOOK = next((PLUGINS_ROOT / 'chrome').glob('on_Snapshot__*_chrome_navigate.*'), None)
TEST_URL = 'https://www.singsing.movie/'
def get_node_modules_dir():
"""Get NODE_MODULES_DIR for tests, checking env first."""
# Check if NODE_MODULES_DIR is already set in environment
if os.environ.get('NODE_MODULES_DIR'):
return Path(os.environ['NODE_MODULES_DIR'])
# Otherwise compute from LIB_DIR
from archivebox.config.common import STORAGE_CONFIG
lib_dir = Path(os.environ.get('LIB_DIR') or str(STORAGE_CONFIG.LIB_DIR))
return lib_dir / 'npm' / 'node_modules'
NODE_MODULES_DIR = get_node_modules_dir()
def get_test_env():
"""Get environment with NODE_MODULES_DIR set correctly."""
env = os.environ.copy()
env['NODE_MODULES_DIR'] = str(NODE_MODULES_DIR)
return env
def test_hook_script_exists():
"""Verify on_Snapshot hook exists."""
assert INFINISCROLL_HOOK is not None, "Infiniscroll hook not found"
@@ -117,94 +98,18 @@ def test_fails_gracefully_without_chrome_session():
f"Should mention chrome/CDP/puppeteer in error: {result.stderr}"
def setup_chrome_session(tmpdir):
"""Helper to set up Chrome session with tab and navigation."""
crawl_dir = Path(tmpdir) / 'crawl'
crawl_dir.mkdir()
chrome_dir = crawl_dir / 'chrome'
env = get_test_env()
env['CHROME_HEADLESS'] = 'true'
# Launch Chrome at crawl level
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-infiniscroll'],
cwd=str(crawl_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env
)
# Wait for Chrome to launch
for i in range(15):
if chrome_launch_process.poll() is not None:
stdout, stderr = chrome_launch_process.communicate()
raise RuntimeError(f"Chrome launch failed:\nStdout: {stdout}\nStderr: {stderr}")
if (chrome_dir / 'cdp_url.txt').exists():
break
time.sleep(1)
if not (chrome_dir / 'cdp_url.txt').exists():
raise RuntimeError("Chrome CDP URL not found after 15s")
chrome_pid = int((chrome_dir / 'chrome.pid').read_text().strip())
# Create snapshot directory structure
snapshot_dir = Path(tmpdir) / 'snapshot'
snapshot_dir.mkdir()
snapshot_chrome_dir = snapshot_dir / 'chrome'
snapshot_chrome_dir.mkdir()
# Create tab
tab_env = env.copy()
tab_env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
result = subprocess.run(
['node', str(CHROME_TAB_HOOK), f'--url={TEST_URL}', '--snapshot-id=snap-infiniscroll', '--crawl-id=test-infiniscroll'],
cwd=str(snapshot_chrome_dir),
capture_output=True,
text=True,
timeout=60,
env=tab_env
)
if result.returncode != 0:
raise RuntimeError(f"Tab creation failed: {result.stderr}")
# Navigate to URL
result = subprocess.run(
['node', str(CHROME_NAVIGATE_HOOK), f'--url={TEST_URL}', '--snapshot-id=snap-infiniscroll'],
cwd=str(snapshot_chrome_dir),
capture_output=True,
text=True,
timeout=120,
env=env
)
if result.returncode != 0:
raise RuntimeError(f"Navigation failed: {result.stderr}")
return chrome_launch_process, chrome_pid, snapshot_chrome_dir
def cleanup_chrome(chrome_launch_process, chrome_pid):
"""Helper to clean up Chrome processes."""
try:
chrome_launch_process.send_signal(signal.SIGTERM)
chrome_launch_process.wait(timeout=5)
except:
pass
try:
os.kill(chrome_pid, signal.SIGKILL)
except OSError:
pass
def test_scrolls_page_and_outputs_stats():
"""Integration test: scroll page and verify JSONL output format."""
with tempfile.TemporaryDirectory() as tmpdir:
chrome_launch_process = None
chrome_pid = None
try:
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(tmpdir)
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
Path(tmpdir),
crawl_id='test-infiniscroll',
snapshot_id='snap-infiniscroll',
test_url=TEST_URL,
)
# Create infiniscroll output directory (sibling to chrome)
infiniscroll_dir = snapshot_chrome_dir.parent / 'infiniscroll'
@@ -264,7 +169,12 @@ def test_config_scroll_limit_honored():
chrome_launch_process = None
chrome_pid = None
try:
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(tmpdir)
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
Path(tmpdir),
crawl_id='test-scroll-limit',
snapshot_id='snap-limit',
test_url=TEST_URL,
)
infiniscroll_dir = snapshot_chrome_dir.parent / 'infiniscroll'
infiniscroll_dir.mkdir()
@@ -316,7 +226,12 @@ def test_config_timeout_honored():
chrome_launch_process = None
chrome_pid = None
try:
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(tmpdir)
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
Path(tmpdir),
crawl_id='test-timeout',
snapshot_id='snap-timeout',
test_url=TEST_URL,
)
infiniscroll_dir = snapshot_chrome_dir.parent / 'infiniscroll'
infiniscroll_dir.mkdir()

View File

@@ -0,0 +1,59 @@
#!/usr/bin/env node
/**
* I Still Don't Care About Cookies Extension Plugin
*
* Installs and configures the "I still don't care about cookies" Chrome extension
* for automatic cookie consent banner dismissal during page archiving.
*
* Extension: https://chromewebstore.google.com/detail/edibdbjcniadpccecjdfdjjppcpchdlm
*
* Priority: 02 (early) - Must install before Chrome session starts at Crawl level
* Hook: on_Crawl (runs once per crawl, not per snapshot)
*
* This extension automatically:
* - Dismisses cookie consent popups
* - Removes cookie banners
* - Accepts necessary cookies to proceed with browsing
* - Works on thousands of websites out of the box
*/
// Import extension utilities
const { installExtensionWithCache } = require('../chrome/chrome_utils.js');
// Extension metadata
const EXTENSION = {
webstore_id: 'edibdbjcniadpccecjdfdjjppcpchdlm',
name: 'istilldontcareaboutcookies',
};
/**
* Main entry point - install extension before archiving
*
* Note: This extension works out of the box with no configuration needed.
* It automatically detects and dismisses cookie banners on page load.
*/
async function main() {
const extension = await installExtensionWithCache(EXTENSION);
if (extension) {
console.log('[+] Cookie banners will be automatically dismissed during archiving');
}
return extension;
}
// Export functions for use by other plugins
module.exports = {
EXTENSION,
};
// Run if executed directly
if (require.main === module) {
main().then(() => {
console.log('[✓] I Still Don\'t Care About Cookies extension setup complete');
process.exit(0);
}).catch(err => {
console.error('[❌] I Still Don\'t Care About Cookies extension setup failed:', err);
process.exit(1);
});
}

View File

@@ -14,9 +14,17 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
setup_test_env,
launch_chromium_session,
kill_chromium_session,
CHROME_LAUNCH_HOOK,
PLUGINS_ROOT,
)
PLUGIN_DIR = Path(__file__).parent.parent
INSTALL_SCRIPT = next(PLUGIN_DIR.glob('on_Crawl__*_istilldontcareaboutcookies.*'), None)
INSTALL_SCRIPT = next(PLUGIN_DIR.glob('on_Crawl__*_install_istilldontcareaboutcookies_extension.*'), None)
def test_install_script_exists():
@@ -124,79 +132,6 @@ def test_no_configuration_required():
assert "API" not in (result.stdout + result.stderr) or result.returncode == 0
def setup_test_lib_dirs(tmpdir: Path) -> dict:
"""Create isolated lib directories for tests and return env dict.
Sets up:
LIB_DIR: tmpdir/lib/<arch>
NODE_MODULES_DIR: tmpdir/lib/<arch>/npm/node_modules
NPM_BIN_DIR: tmpdir/lib/<arch>/npm/bin
PIP_VENV_DIR: tmpdir/lib/<arch>/pip/venv
PIP_BIN_DIR: tmpdir/lib/<arch>/pip/venv/bin
"""
import platform
arch = platform.machine()
system = platform.system().lower()
arch_dir = f"{arch}-{system}"
lib_dir = tmpdir / 'lib' / arch_dir
npm_dir = lib_dir / 'npm'
node_modules_dir = npm_dir / 'node_modules'
npm_bin_dir = npm_dir / 'bin'
pip_venv_dir = lib_dir / 'pip' / 'venv'
pip_bin_dir = pip_venv_dir / 'bin'
# Create directories
node_modules_dir.mkdir(parents=True, exist_ok=True)
npm_bin_dir.mkdir(parents=True, exist_ok=True)
pip_bin_dir.mkdir(parents=True, exist_ok=True)
# Install puppeteer-core to the test node_modules if not present
if not (node_modules_dir / 'puppeteer-core').exists():
result = subprocess.run(
['npm', 'install', '--prefix', str(npm_dir), 'puppeteer-core'],
capture_output=True,
text=True,
timeout=120
)
if result.returncode != 0:
pytest.skip(f"Failed to install puppeteer-core: {result.stderr}")
return {
'LIB_DIR': str(lib_dir),
'NODE_MODULES_DIR': str(node_modules_dir),
'NPM_BIN_DIR': str(npm_bin_dir),
'PIP_VENV_DIR': str(pip_venv_dir),
'PIP_BIN_DIR': str(pip_bin_dir),
}
PLUGINS_ROOT = PLUGIN_DIR.parent
def find_chromium_binary():
"""Find the Chromium binary using chrome_utils.js findChromium().
This uses the centralized findChromium() function which checks:
- CHROME_BINARY env var
- @puppeteer/browsers install locations
- System Chromium locations
- Falls back to Chrome (with warning)
"""
chrome_utils = PLUGINS_ROOT / 'chrome' / 'chrome_utils.js'
result = subprocess.run(
['node', str(chrome_utils), 'findChromium'],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0 and result.stdout.strip():
return result.stdout.strip()
return None
CHROME_LAUNCH_HOOK = PLUGINS_ROOT / 'chrome' / 'on_Crawl__20_chrome_launch.bg.js'
TEST_URL = 'https://www.filmin.es/'
@@ -210,22 +145,11 @@ def test_extension_loads_in_chromium():
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# Set up isolated lib directories for this test
lib_env = setup_test_lib_dirs(tmpdir)
# Set up isolated env with proper directory structure
env = setup_test_env(tmpdir)
env.setdefault('CHROME_HEADLESS', 'true')
# Set up extensions directory
ext_dir = tmpdir / 'chrome_extensions'
ext_dir.mkdir(parents=True)
env = os.environ.copy()
env.update(lib_env)
env['CHROME_EXTENSIONS_DIR'] = str(ext_dir)
env['CHROME_HEADLESS'] = 'true'
# Ensure CHROME_BINARY points to Chromium
chromium = find_chromium_binary()
if chromium:
env['CHROME_BINARY'] = chromium
ext_dir = Path(env['CHROME_EXTENSIONS_DIR'])
# Step 1: Install the extension
result = subprocess.run(
@@ -245,13 +169,16 @@ def test_extension_loads_in_chromium():
print(f"Extension installed: {ext_data.get('name')} v{ext_data.get('version')}")
# Step 2: Launch Chromium using the chrome hook (loads extensions automatically)
crawl_dir = tmpdir / 'crawl'
crawl_dir.mkdir()
crawl_id = 'test-cookies'
crawl_dir = Path(env['CRAWLS_DIR']) / crawl_id
crawl_dir.mkdir(parents=True, exist_ok=True)
chrome_dir = crawl_dir / 'chrome'
chrome_dir.mkdir(parents=True, exist_ok=True)
env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-cookies'],
cwd=str(crawl_dir),
['node', str(CHROME_LAUNCH_HOOK), f'--crawl-id={crawl_id}'],
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
@@ -400,156 +327,314 @@ const puppeteer = require('puppeteer-core');
pass
def test_hides_cookie_consent_on_filmin():
"""Live test: verify extension hides cookie consent popup on filmin.es.
def check_cookie_consent_visibility(cdp_url: str, test_url: str, env: dict, script_dir: Path) -> dict:
"""Check if cookie consent elements are visible on a page.
Uses Chromium with extensions loaded automatically via chrome hook.
Returns dict with:
- visible: bool - whether any cookie consent element is visible
- selector: str - which selector matched (if visible)
- elements_found: list - all cookie-related elements found in DOM
- html_snippet: str - snippet of the page HTML for debugging
"""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# Set up isolated lib directories for this test
lib_env = setup_test_lib_dirs(tmpdir)
# Set up extensions directory
ext_dir = tmpdir / 'chrome_extensions'
ext_dir.mkdir(parents=True)
env = os.environ.copy()
env.update(lib_env)
env['CHROME_EXTENSIONS_DIR'] = str(ext_dir)
env['CHROME_HEADLESS'] = 'true'
# Ensure CHROME_BINARY points to Chromium
chromium = find_chromium_binary()
if chromium:
env['CHROME_BINARY'] = chromium
# Step 1: Install the extension
result = subprocess.run(
['node', str(INSTALL_SCRIPT)],
cwd=str(tmpdir),
capture_output=True,
text=True,
env=env,
timeout=60
)
assert result.returncode == 0, f"Extension install failed: {result.stderr}"
# Verify extension cache was created
cache_file = ext_dir / 'istilldontcareaboutcookies.extension.json'
assert cache_file.exists(), "Extension cache not created"
ext_data = json.loads(cache_file.read_text())
print(f"Extension installed: {ext_data.get('name')} v{ext_data.get('version')}")
# Step 2: Launch Chromium using the chrome hook (loads extensions automatically)
crawl_dir = tmpdir / 'crawl'
crawl_dir.mkdir()
chrome_dir = crawl_dir / 'chrome'
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-cookies'],
cwd=str(crawl_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env
)
# Wait for Chromium to launch and CDP URL to be available
cdp_url = None
for i in range(20):
if chrome_launch_process.poll() is not None:
stdout, stderr = chrome_launch_process.communicate()
raise RuntimeError(f"Chromium launch failed:\nStdout: {stdout}\nStderr: {stderr}")
cdp_file = chrome_dir / 'cdp_url.txt'
if cdp_file.exists():
cdp_url = cdp_file.read_text().strip()
break
time.sleep(1)
assert cdp_url, "Chromium CDP URL not found after 20s"
print(f"Chromium launched with CDP URL: {cdp_url}")
try:
# Step 3: Connect to Chromium and test cookie consent hiding
test_script = f'''
test_script = f'''
if (process.env.NODE_MODULES_DIR) module.paths.unshift(process.env.NODE_MODULES_DIR);
const puppeteer = require('puppeteer-core');
(async () => {{
const browser = await puppeteer.connect({{ browserWSEndpoint: '{cdp_url}' }});
// Wait for extension to initialize
await new Promise(r => setTimeout(r, 2000));
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36');
await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');
await page.setViewport({{ width: 1440, height: 900 }});
console.error('Navigating to {TEST_URL}...');
await page.goto('{TEST_URL}', {{ waitUntil: 'networkidle2', timeout: 30000 }});
console.error('Navigating to {test_url}...');
await page.goto('{test_url}', {{ waitUntil: 'networkidle2', timeout: 30000 }});
// Wait for extension content script to process page
await new Promise(r => setTimeout(r, 5000));
// Wait for page to fully render and any cookie scripts to run
await new Promise(r => setTimeout(r, 3000));
// Check cookie consent visibility
// Check cookie consent visibility using multiple common selectors
const result = await page.evaluate(() => {{
const selectors = ['.cky-consent-container', '.cky-popup-center', '.cky-overlay'];
// Common cookie consent selectors used by various consent management platforms
const selectors = [
// CookieYes
'.cky-consent-container', '.cky-popup-center', '.cky-overlay', '.cky-modal',
// OneTrust
'#onetrust-consent-sdk', '#onetrust-banner-sdk', '.onetrust-pc-dark-filter',
// Cookiebot
'#CybotCookiebotDialog', '#CybotCookiebotDialogBodyUnderlay',
// Generic cookie banners
'[class*="cookie-consent"]', '[class*="cookie-banner"]', '[class*="cookie-notice"]',
'[class*="cookie-popup"]', '[class*="cookie-modal"]', '[class*="cookie-dialog"]',
'[id*="cookie-consent"]', '[id*="cookie-banner"]', '[id*="cookie-notice"]',
'[id*="cookieconsent"]', '[id*="cookie-law"]',
// GDPR banners
'[class*="gdpr"]', '[id*="gdpr"]',
// Consent banners
'[class*="consent-banner"]', '[class*="consent-modal"]', '[class*="consent-popup"]',
// Privacy banners
'[class*="privacy-banner"]', '[class*="privacy-notice"]',
// Common frameworks
'.cc-window', '.cc-banner', '#cc-main', // Cookie Consent by Insites
'.qc-cmp2-container', // Quantcast
'.sp-message-container', // SourcePoint
];
const elementsFound = [];
let visibleElement = null;
for (const sel of selectors) {{
const el = document.querySelector(sel);
if (el) {{
const style = window.getComputedStyle(el);
const rect = el.getBoundingClientRect();
const visible = style.display !== 'none' &&
style.visibility !== 'hidden' &&
rect.width > 0 && rect.height > 0;
if (visible) return {{ visible: true, selector: sel }};
try {{
const elements = document.querySelectorAll(sel);
for (const el of elements) {{
const style = window.getComputedStyle(el);
const rect = el.getBoundingClientRect();
const isVisible = style.display !== 'none' &&
style.visibility !== 'hidden' &&
style.opacity !== '0' &&
rect.width > 0 && rect.height > 0;
elementsFound.push({{
selector: sel,
visible: isVisible,
display: style.display,
visibility: style.visibility,
opacity: style.opacity,
width: rect.width,
height: rect.height
}});
if (isVisible && !visibleElement) {{
visibleElement = {{ selector: sel, width: rect.width, height: rect.height }};
}}
}}
}} catch (e) {{
// Invalid selector, skip
}}
}}
return {{ visible: false }};
// Also grab a snippet of the HTML to help debug
const bodyHtml = document.body.innerHTML.slice(0, 2000);
const hasCookieKeyword = bodyHtml.toLowerCase().includes('cookie') ||
bodyHtml.toLowerCase().includes('consent') ||
bodyHtml.toLowerCase().includes('gdpr');
return {{
visible: visibleElement !== null,
selector: visibleElement ? visibleElement.selector : null,
elements_found: elementsFound,
has_cookie_keyword_in_html: hasCookieKeyword,
html_snippet: bodyHtml.slice(0, 500)
}};
}});
console.error('Cookie consent:', JSON.stringify(result));
console.error('Cookie consent check result:', JSON.stringify({{
visible: result.visible,
selector: result.selector,
elements_found_count: result.elements_found.length
}}));
browser.disconnect();
console.log(JSON.stringify(result));
}})();
'''
script_path = tmpdir / 'test_extension.js'
script_path.write_text(test_script)
script_path = script_dir / 'check_cookies.js'
script_path.write_text(test_script)
result = subprocess.run(
['node', str(script_path)],
cwd=str(tmpdir),
capture_output=True,
text=True,
env=env,
timeout=90
result = subprocess.run(
['node', str(script_path)],
cwd=str(script_dir),
capture_output=True,
text=True,
env=env,
timeout=90
)
if result.returncode != 0:
raise RuntimeError(f"Cookie check script failed: {result.stderr}")
output_lines = [l for l in result.stdout.strip().split('\n') if l.startswith('{')]
if not output_lines:
raise RuntimeError(f"No JSON output from cookie check: {result.stdout}\nstderr: {result.stderr}")
return json.loads(output_lines[-1])
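The helper above relies on a simple output convention: the node script logs human-readable progress via `console.error` (stderr) and emits a single JSON object via `console.log` (stdout), so the Python side keeps only stdout lines starting with `{` and parses the last one. A standalone sketch of that parsing step (function name is illustrative):

```python
import json

def parse_last_json_line(stdout: str) -> dict:
    # Keep only JSON-looking lines; the last one is the script's result object
    lines = [l for l in stdout.strip().split('\n') if l.startswith('{')]
    if not lines:
        raise RuntimeError(f'No JSON output in: {stdout!r}')
    return json.loads(lines[-1])

print(parse_last_json_line('Navigating...\n{"visible": false, "selector": null}'))
```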
def test_hides_cookie_consent_on_filmin():
"""Live test: verify extension hides cookie consent popup on filmin.es.
This test runs TWO browser sessions:
1. WITHOUT extension - verifies cookie consent IS visible (baseline)
2. WITH extension - verifies cookie consent is HIDDEN
This ensures we're actually testing the extension's effect, not just
that a page happens to not have cookie consent.
"""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# Set up isolated env with proper directory structure
env_base = setup_test_env(tmpdir)
env_base['CHROME_HEADLESS'] = 'true'
ext_dir = Path(env_base['CHROME_EXTENSIONS_DIR'])
# ============================================================
# STEP 1: BASELINE - Run WITHOUT extension, verify cookie consent IS visible
# ============================================================
print("\n" + "="*60)
print("STEP 1: BASELINE TEST (no extension)")
print("="*60)
data_dir = Path(env_base['DATA_DIR'])
env_no_ext = env_base.copy()
env_no_ext['CHROME_EXTENSIONS_DIR'] = str(data_dir / 'personas' / 'Default' / 'empty_extensions')
(data_dir / 'personas' / 'Default' / 'empty_extensions').mkdir(parents=True, exist_ok=True)
# Launch baseline Chromium in crawls directory
baseline_crawl_id = 'baseline-no-ext'
baseline_crawl_dir = Path(env_base['CRAWLS_DIR']) / baseline_crawl_id
baseline_crawl_dir.mkdir(parents=True, exist_ok=True)
baseline_chrome_dir = baseline_crawl_dir / 'chrome'
env_no_ext['CRAWL_OUTPUT_DIR'] = str(baseline_crawl_dir)
baseline_process = None
try:
baseline_process, baseline_cdp_url = launch_chromium_session(
env_no_ext, baseline_chrome_dir, baseline_crawl_id
)
print(f"Baseline Chromium launched: {baseline_cdp_url}")
# Wait a moment for browser to be ready
time.sleep(2)
baseline_result = check_cookie_consent_visibility(
baseline_cdp_url, TEST_URL, env_no_ext, tmpdir
)
print(f"stderr: {result.stderr}")
print(f"stdout: {result.stdout}")
print(f"Baseline result: visible={baseline_result['visible']}, "
f"elements_found={len(baseline_result['elements_found'])}")
assert result.returncode == 0, f"Test failed: {result.stderr}"
output_lines = [l for l in result.stdout.strip().split('\n') if l.startswith('{')]
assert output_lines, f"No JSON output: {result.stdout}"
test_result = json.loads(output_lines[-1])
assert not test_result['visible'], \
f"Cookie consent should be hidden by extension. Result: {test_result}"
if baseline_result['elements_found']:
print("Elements found in baseline:")
for el in baseline_result['elements_found'][:5]: # Show first 5
print(f" - {el['selector']}: visible={el['visible']}, "
f"display={el['display']}, size={el['width']}x{el['height']}")
finally:
# Clean up Chromium
try:
chrome_launch_process.send_signal(signal.SIGTERM)
chrome_launch_process.wait(timeout=5)
except:
pass
chrome_pid_file = chrome_dir / 'chrome.pid'
if chrome_pid_file.exists():
try:
chrome_pid = int(chrome_pid_file.read_text().strip())
os.kill(chrome_pid, signal.SIGKILL)
except (OSError, ValueError):
pass
if baseline_process:
kill_chromium_session(baseline_process, baseline_chrome_dir)
# Verify baseline shows cookie consent
if not baseline_result['visible']:
# If no cookie consent visible in baseline, we can't test the extension
# This could happen if:
# - The site changed and no longer shows cookie consent
# - Cookie consent is region-specific
# - Our selectors don't match this site
print("\nWARNING: No cookie consent visible in baseline!")
print(f"HTML has cookie keywords: {baseline_result.get('has_cookie_keyword_in_html')}")
print(f"HTML snippet: {baseline_result.get('html_snippet', '')[:200]}")
pytest.skip(
f"Cannot test extension: no cookie consent visible in baseline on {TEST_URL}. "
f"Elements found: {len(baseline_result['elements_found'])}. "
f"The site may have changed or cookie consent may be region-specific."
)
print(f"\n✓ Baseline confirmed: Cookie consent IS visible (selector: {baseline_result['selector']})")
# ============================================================
# STEP 2: Install the extension
# ============================================================
print("\n" + "="*60)
print("STEP 2: INSTALLING EXTENSION")
print("="*60)
env_with_ext = env_base.copy()
env_with_ext['CHROME_EXTENSIONS_DIR'] = str(ext_dir)
result = subprocess.run(
['node', str(INSTALL_SCRIPT)],
cwd=str(tmpdir),
capture_output=True,
text=True,
env=env_with_ext,
timeout=60
)
assert result.returncode == 0, f"Extension install failed: {result.stderr}"
cache_file = ext_dir / 'istilldontcareaboutcookies.extension.json'
assert cache_file.exists(), "Extension cache not created"
ext_data = json.loads(cache_file.read_text())
print(f"Extension installed: {ext_data.get('name')} v{ext_data.get('version')}")
# ============================================================
# STEP 3: Run WITH extension, verify cookie consent is HIDDEN
# ============================================================
print("\n" + "="*60)
print("STEP 3: TEST WITH EXTENSION")
print("="*60)
# Launch extension test Chromium in crawls directory
ext_crawl_id = 'test-with-ext'
ext_crawl_dir = Path(env_base['CRAWLS_DIR']) / ext_crawl_id
ext_crawl_dir.mkdir(parents=True, exist_ok=True)
ext_chrome_dir = ext_crawl_dir / 'chrome'
env_with_ext['CRAWL_OUTPUT_DIR'] = str(ext_crawl_dir)
ext_process = None
try:
ext_process, ext_cdp_url = launch_chromium_session(
env_with_ext, ext_chrome_dir, ext_crawl_id
)
print(f"Extension Chromium launched: {ext_cdp_url}")
# Check that extension was loaded
extensions_file = ext_chrome_dir / 'extensions.json'
if extensions_file.exists():
loaded_exts = json.loads(extensions_file.read_text())
print(f"Extensions loaded: {[e.get('name') for e in loaded_exts]}")
# Wait for extension to initialize
time.sleep(3)
ext_result = check_cookie_consent_visibility(
ext_cdp_url, TEST_URL, env_with_ext, tmpdir
)
print(f"Extension result: visible={ext_result['visible']}, "
f"elements_found={len(ext_result['elements_found'])}")
if ext_result['elements_found']:
print("Elements found with extension:")
for el in ext_result['elements_found'][:5]:
print(f" - {el['selector']}: visible={el['visible']}, "
f"display={el['display']}, size={el['width']}x{el['height']}")
finally:
if ext_process:
kill_chromium_session(ext_process, ext_chrome_dir)
# ============================================================
# STEP 4: Compare results
# ============================================================
print("\n" + "="*60)
print("STEP 4: COMPARISON")
print("="*60)
print(f"Baseline (no extension): cookie consent visible = {baseline_result['visible']}")
print(f"With extension: cookie consent visible = {ext_result['visible']}")
assert baseline_result['visible'], \
"Baseline should show cookie consent (this shouldn't happen, we checked above)"
assert not ext_result['visible'], \
f"Cookie consent should be HIDDEN by extension.\n" \
f"Baseline showed consent at: {baseline_result['selector']}\n" \
f"But with extension, consent is still visible.\n" \
f"Elements still visible: {[e for e in ext_result['elements_found'] if e['visible']]}"
print("\n✓ SUCCESS: Extension correctly hides cookie consent!")
print(f" - Baseline showed consent at: {baseline_result['selector']}")
print(f" - Extension successfully hid it")
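The control flow of the test above boils down to a baseline-vs-treatment check: the extension is only credited with hiding the banner if the banner was demonstrably visible without it, and an inconclusive baseline becomes a skip rather than a pass. A minimal sketch of that decision logic (function name and return values are illustrative, not from the source):

```python
def judge(baseline_visible: bool, with_extension_visible: bool) -> str:
    # No banner in the baseline session -> nothing to prove, skip the test
    if not baseline_visible:
        return 'skip'
    # Banner existed; extension hid it -> pass, still visible -> fail
    return 'pass' if not with_extension_visible else 'fail'

print(judge(True, False))
```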


@@ -2,7 +2,6 @@
Integration tests for mercury plugin
Tests verify:
pass
1. Hook script exists
2. Dependencies installed via validation hooks
3. Verify deps with abx-pkg
@@ -19,9 +18,15 @@ import tempfile
from pathlib import Path
import pytest
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
MERCURY_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_mercury.*'), None)
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_plugin_dir,
get_hook_script,
PLUGINS_ROOT,
)
PLUGIN_DIR = get_plugin_dir(__file__)
MERCURY_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_mercury.*')
TEST_URL = 'https://example.com'
def test_hook_script_exists():


@@ -22,38 +22,20 @@ from pathlib import Path
import pytest
# Import shared Chrome test helpers
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env,
setup_chrome_session,
cleanup_chrome,
)
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
MODALCLOSER_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_modalcloser.*'), None)
CHROME_LAUNCH_HOOK = PLUGINS_ROOT / 'chrome' / 'on_Crawl__20_chrome_launch.bg.js'
CHROME_TAB_HOOK = PLUGINS_ROOT / 'chrome' / 'on_Snapshot__20_chrome_tab.bg.js'
CHROME_NAVIGATE_HOOK = next((PLUGINS_ROOT / 'chrome').glob('on_Snapshot__*_chrome_navigate.*'), None)
TEST_URL = 'https://www.singsing.movie/'
COOKIE_CONSENT_TEST_URL = 'https://www.filmin.es/'
def get_node_modules_dir():
"""Get NODE_MODULES_DIR for tests, checking env first."""
# Check if NODE_MODULES_DIR is already set in environment
if os.environ.get('NODE_MODULES_DIR'):
return Path(os.environ['NODE_MODULES_DIR'])
# Otherwise compute from LIB_DIR
from archivebox.config.common import STORAGE_CONFIG
lib_dir = Path(os.environ.get('LIB_DIR') or str(STORAGE_CONFIG.LIB_DIR))
return lib_dir / 'npm' / 'node_modules'
NODE_MODULES_DIR = get_node_modules_dir()
def get_test_env():
"""Get environment with NODE_MODULES_DIR set correctly."""
env = os.environ.copy()
env['NODE_MODULES_DIR'] = str(NODE_MODULES_DIR)
return env
def test_hook_script_exists():
"""Verify on_Snapshot hook exists."""
assert MODALCLOSER_HOOK is not None, "Modalcloser hook not found"
@@ -118,75 +100,6 @@ def test_fails_gracefully_without_chrome_session():
f"Should mention chrome/CDP/puppeteer in error: {result.stderr}"
def setup_chrome_session(tmpdir):
"""Helper to set up Chrome session with tab."""
crawl_dir = Path(tmpdir) / 'crawl'
crawl_dir.mkdir()
chrome_dir = crawl_dir / 'chrome'
env = get_test_env()
env['CHROME_HEADLESS'] = 'true'
# Launch Chrome at crawl level
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-modalcloser'],
cwd=str(crawl_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env
)
# Wait for Chrome to launch
for i in range(15):
if chrome_launch_process.poll() is not None:
stdout, stderr = chrome_launch_process.communicate()
raise RuntimeError(f"Chrome launch failed:\nStdout: {stdout}\nStderr: {stderr}")
if (chrome_dir / 'cdp_url.txt').exists():
break
time.sleep(1)
if not (chrome_dir / 'cdp_url.txt').exists():
raise RuntimeError("Chrome CDP URL not found after 15s")
chrome_pid = int((chrome_dir / 'chrome.pid').read_text().strip())
# Create snapshot directory structure
snapshot_dir = Path(tmpdir) / 'snapshot'
snapshot_dir.mkdir()
snapshot_chrome_dir = snapshot_dir / 'chrome'
snapshot_chrome_dir.mkdir()
# Create tab
tab_env = env.copy()
tab_env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
result = subprocess.run(
['node', str(CHROME_TAB_HOOK), f'--url={TEST_URL}', '--snapshot-id=snap-modalcloser', '--crawl-id=test-modalcloser'],
cwd=str(snapshot_chrome_dir),
capture_output=True,
text=True,
timeout=60,
env=tab_env
)
if result.returncode != 0:
raise RuntimeError(f"Tab creation failed: {result.stderr}")
return chrome_launch_process, chrome_pid, snapshot_chrome_dir
def cleanup_chrome(chrome_launch_process, chrome_pid):
"""Helper to clean up Chrome processes."""
try:
chrome_launch_process.send_signal(signal.SIGTERM)
chrome_launch_process.wait(timeout=5)
except:
pass
try:
os.kill(chrome_pid, signal.SIGKILL)
except OSError:
pass
def test_background_script_handles_sigterm():
"""Test that background script runs and handles SIGTERM correctly."""
with tempfile.TemporaryDirectory() as tmpdir:
@@ -194,7 +107,12 @@ def test_background_script_handles_sigterm():
chrome_pid = None
modalcloser_process = None
try:
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(tmpdir)
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
Path(tmpdir),
crawl_id='test-modalcloser',
snapshot_id='snap-modalcloser',
test_url=TEST_URL,
)
# Create modalcloser output directory (sibling to chrome)
modalcloser_dir = snapshot_chrome_dir.parent / 'modalcloser'
@@ -264,7 +182,12 @@ def test_dialog_handler_logs_dialogs():
chrome_pid = None
modalcloser_process = None
try:
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(tmpdir)
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
Path(tmpdir),
crawl_id='test-dialog',
snapshot_id='snap-dialog',
test_url=TEST_URL,
)
modalcloser_dir = snapshot_chrome_dir.parent / 'modalcloser'
modalcloser_dir.mkdir()
@@ -312,7 +235,12 @@ def test_config_poll_interval():
chrome_pid = None
modalcloser_process = None
try:
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(tmpdir)
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
Path(tmpdir),
crawl_id='test-poll',
snapshot_id='snap-poll',
test_url=TEST_URL,
)
modalcloser_dir = snapshot_chrome_dir.parent / 'modalcloser'
modalcloser_dir.mkdir()


@@ -21,29 +21,22 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env,
get_plugin_dir,
get_hook_script,
run_hook_and_parse,
LIB_DIR,
NODE_MODULES_DIR,
PLUGINS_ROOT,
)
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
PDF_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_pdf.*'), None)
PLUGIN_DIR = get_plugin_dir(__file__)
PDF_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_pdf.*')
NPM_PROVIDER_HOOK = PLUGINS_ROOT / 'npm' / 'on_Binary__install_using_npm_provider.py'
TEST_URL = 'https://example.com'
# Get LIB_DIR for NODE_MODULES_DIR
def get_lib_dir():
"""Get LIB_DIR for tests."""
from archivebox.config.common import STORAGE_CONFIG
return Path(os.environ.get('LIB_DIR') or str(STORAGE_CONFIG.LIB_DIR))
LIB_DIR = get_lib_dir()
NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
def get_test_env():
"""Get environment with NODE_MODULES_DIR set correctly."""
env = os.environ.copy()
env['NODE_MODULES_DIR'] = str(NODE_MODULES_DIR)
env['LIB_DIR'] = str(LIB_DIR)
return env
def test_hook_script_exists():
"""Verify on_Snapshot hook exists."""


@@ -2,7 +2,6 @@
Integration tests for readability plugin
Tests verify:
pass
1. Validate hook checks for readability-extractor binary
2. Verify deps with abx-pkg
3. Plugin reports missing dependency correctly
@@ -18,10 +17,15 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_plugin_dir,
get_hook_script,
PLUGINS_ROOT,
)
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
READABILITY_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_readability.*'))
PLUGIN_DIR = get_plugin_dir(__file__)
READABILITY_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_readability.*')
TEST_URL = 'https://example.com'


@@ -19,7 +19,7 @@ const puppeteer = require('puppeteer-core');
const PLUGIN_NAME = 'redirects';
const OUTPUT_DIR = '.';
const OUTPUT_FILE = 'redirects.jsonl';
const PID_FILE = 'hook.pid';
// PID file is now written by run_hook() with hook-specific name
const CHROME_SESSION_DIR = '../chrome';
// Global state
@@ -274,8 +274,8 @@ async function main() {
// Set up redirect listener BEFORE navigation
await setupRedirectListener();
// Write PID file
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Note: PID file is written by run_hook() with hook-specific name
// Snapshot.cleanup() kills all *.pid processes when done
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();


@@ -19,7 +19,7 @@ const puppeteer = require('puppeteer-core');
const PLUGIN_NAME = 'responses';
const OUTPUT_DIR = '.';
const PID_FILE = 'hook.pid';
// PID file is now written by run_hook() with hook-specific name
const CHROME_SESSION_DIR = '../chrome';
// Resource types to capture (by default, capture everything)
@@ -323,8 +323,8 @@ async function main() {
// Set up listener BEFORE navigation
await setupListener();
// Write PID file
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Note: PID file is written by run_hook() with hook-specific name
// Snapshot.cleanup() kills all *.pid processes when done
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();


@@ -20,28 +20,20 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env,
get_plugin_dir,
get_hook_script,
run_hook_and_parse,
LIB_DIR,
NODE_MODULES_DIR,
)
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
SCREENSHOT_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_screenshot.*'), None)
PLUGIN_DIR = get_plugin_dir(__file__)
SCREENSHOT_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_screenshot.*')
TEST_URL = 'https://example.com'
# Get LIB_DIR for NODE_MODULES_DIR
def get_lib_dir():
"""Get LIB_DIR for tests."""
from archivebox.config.common import STORAGE_CONFIG
return Path(os.environ.get('LIB_DIR') or str(STORAGE_CONFIG.LIB_DIR))
LIB_DIR = get_lib_dir()
NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
def get_test_env():
"""Get environment with NODE_MODULES_DIR set correctly."""
env = os.environ.copy()
env['NODE_MODULES_DIR'] = str(NODE_MODULES_DIR)
env['LIB_DIR'] = str(LIB_DIR)
return env
def test_hook_script_exists():
"""Verify on_Snapshot hook exists."""


@@ -0,0 +1,281 @@
#!/usr/bin/env node
/**
* SingleFile Extension Plugin
*
* DISABLED: Extension functionality commented out - using single-file-cli only
*
* Installs and uses the SingleFile Chrome extension for archiving complete web pages.
* Falls back to single-file-cli if the extension is not available.
*
* Extension: https://chromewebstore.google.com/detail/mpiodijhokgodhhofbcjdecpffjipkle
*
* Priority: 04 (early) - Must install before Chrome session starts at Crawl level
* Hook: on_Crawl (runs once per crawl, not per snapshot)
*
* This extension automatically:
* - Saves complete web pages as single HTML files
* - Inlines all resources (CSS, JS, images, fonts)
* - Preserves page fidelity better than wget/curl
* - Works with SPAs and dynamically loaded content
*/
const path = require('path');
const fs = require('fs');
const { promisify } = require('util');
const { exec } = require('child_process');
const execAsync = promisify(exec);
// DISABLED: Extension functionality - using single-file-cli only
// // Import extension utilities
// const extensionUtils = require('../chrome/chrome_utils.js');
// // Extension metadata
// const EXTENSION = {
// webstore_id: 'mpiodijhokgodhhofbcjdecpffjipkle',
// name: 'singlefile',
// };
// // Get extensions directory from environment or use default
// const EXTENSIONS_DIR = process.env.CHROME_EXTENSIONS_DIR ||
// path.join(process.env.DATA_DIR || './data', 'personas', process.env.ACTIVE_PERSONA || 'Default', 'chrome_extensions');
// const CHROME_DOWNLOADS_DIR = process.env.CHROME_DOWNLOADS_DIR ||
// path.join(process.env.DATA_DIR || './data', 'personas', process.env.ACTIVE_PERSONA || 'Default', 'chrome_downloads');
const OUTPUT_DIR = '.';
const OUTPUT_FILE = 'singlefile.html';
// DISABLED: Extension functionality - using single-file-cli only
// /**
// * Install the SingleFile extension
// */
// async function installSinglefileExtension() {
// console.log('[*] Installing SingleFile extension...');
// // Install the extension
// const extension = await extensionUtils.loadOrInstallExtension(EXTENSION, EXTENSIONS_DIR);
// if (!extension) {
// console.error('[❌] Failed to install SingleFile extension');
// return null;
// }
// console.log('[+] SingleFile extension installed');
// console.log('[+] Web pages will be saved as single HTML files');
// return extension;
// }
// /**
// * Wait for a specified amount of time
// */
// function wait(ms) {
// return new Promise(resolve => setTimeout(resolve, ms));
// }
// /**
// * Save a page using the SingleFile extension
// *
// * @param {Object} page - Puppeteer page object
// * @param {Object} extension - Extension metadata with dispatchAction method
// * @param {Object} options - Additional options
// * @returns {Promise<string|null>} - Path to saved file or null on failure
// */
// async function saveSinglefileWithExtension(page, extension, options = {}) {
// if (!extension || !extension.version) {
// throw new Error('SingleFile extension not found or not loaded');
// }
// const url = await page.url();
// // Check for unsupported URL schemes
// const URL_SCHEMES_IGNORED = ['about', 'chrome', 'chrome-extension', 'data', 'javascript', 'blob'];
// const scheme = url.split(':')[0];
// if (URL_SCHEMES_IGNORED.includes(scheme)) {
// console.log(`[⚠️] Skipping SingleFile for URL scheme: ${scheme}`);
// return null;
// }
// // Ensure downloads directory exists
// await fs.promises.mkdir(CHROME_DOWNLOADS_DIR, { recursive: true });
// // Get list of existing files to ignore
// const files_before = new Set(
// (await fs.promises.readdir(CHROME_DOWNLOADS_DIR))
// .filter(fn => fn.endsWith('.html'))
// );
// // Output directory is current directory (hook already runs in output dir)
// const out_path = path.join(OUTPUT_DIR, OUTPUT_FILE);
// console.log(`[🛠️] Saving SingleFile HTML using extension (${extension.id})...`);
// // Bring page to front (extension action button acts on foreground tab)
// await page.bringToFront();
// // Trigger the extension's action (toolbar button click)
// await extension.dispatchAction();
// // Wait for file to appear in downloads directory
// const check_delay = 3000; // 3 seconds
// const max_tries = 10;
// let files_new = [];
// for (let attempt = 0; attempt < max_tries; attempt++) {
// await wait(check_delay);
// const files_after = (await fs.promises.readdir(CHROME_DOWNLOADS_DIR))
// .filter(fn => fn.endsWith('.html'));
// files_new = files_after.filter(file => !files_before.has(file));
// if (files_new.length === 0) {
// continue;
// }
// // Find the matching file by checking if it contains the URL in the HTML header
// for (const file of files_new) {
// const dl_path = path.join(CHROME_DOWNLOADS_DIR, file);
// const dl_text = await fs.promises.readFile(dl_path, 'utf-8');
// const dl_header = dl_text.split('meta charset')[0];
// if (dl_header.includes(`url: ${url}`)) {
// console.log(`[✍️] Moving SingleFile download from ${file} to ${out_path}`);
// await fs.promises.rename(dl_path, out_path);
// return out_path;
// }
// }
// }
// console.warn(`[❌] Couldn't find matching SingleFile HTML in ${CHROME_DOWNLOADS_DIR} after waiting ${(check_delay * max_tries) / 1000}s`);
// console.warn(`[⚠️] New files found: ${files_new.join(', ')}`);
// return null;
// }
/**
 * Save a page using single-file-cli (now the only method; the extension path above is disabled)
*
* @param {string} url - URL to archive
* @param {Object} options - Additional options
* @returns {Promise<string|null>} - Path to saved file or null on failure
*/
async function saveSinglefileWithCLI(url, options = {}) {
console.log('[*] Saving with single-file-cli...');
// Find single-file binary
let binary = null;
try {
const { stdout } = await execAsync('which single-file');
binary = stdout.trim();
} catch (err) {
console.error('[❌] single-file-cli not found. Install with: npm install -g single-file-cli');
return null;
}
// Output directory is current directory (hook already runs in output dir)
const out_path = path.join(OUTPUT_DIR, OUTPUT_FILE);
// Build command
const cmd = [
binary,
'--browser-headless',
url,
out_path,
];
// Add optional args
if (options.userAgent) {
cmd.splice(2, 0, '--browser-user-agent', options.userAgent);
}
if (options.cookiesFile && fs.existsSync(options.cookiesFile)) {
cmd.splice(2, 0, '--browser-cookies-file', options.cookiesFile);
}
if (options.ignoreSSL) {
cmd.splice(2, 0, '--browser-ignore-insecure-certs');
}
// Execute (quote each arg so URLs with shell metacharacters don't break the command)
try {
const timeout = options.timeout || 120000;
const quoted = cmd.map(arg => "'" + String(arg).replace(/'/g, "'\\''") + "'").join(' ');
await execAsync(quoted, { timeout });
if (fs.existsSync(out_path) && fs.statSync(out_path).size > 0) {
console.log(`[+] SingleFile saved via CLI: ${out_path}`);
return out_path;
}
console.error('[❌] SingleFile CLI completed but no output file found');
return null;
} catch (err) {
console.error(`[❌] SingleFile CLI error: ${err.message}`);
return null;
}
}
// DISABLED: Extension functionality - using single-file-cli only
// /**
// * Main entry point - install extension before archiving
// */
// async function main() {
// // Check if extension is already cached
// const cacheFile = path.join(EXTENSIONS_DIR, 'singlefile.extension.json');
// if (fs.existsSync(cacheFile)) {
// try {
// const cached = JSON.parse(fs.readFileSync(cacheFile, 'utf-8'));
// const manifestPath = path.join(cached.unpacked_path, 'manifest.json');
// if (fs.existsSync(manifestPath)) {
// console.log('[*] SingleFile extension already installed (using cache)');
// return cached;
// }
// } catch (e) {
// // Cache file corrupted, re-install
// console.warn('[⚠️] Extension cache corrupted, re-installing...');
// }
// }
// // Install extension
// const extension = await installSinglefileExtension();
// // Export extension metadata for chrome plugin to load
// if (extension) {
// // Write extension info to a cache file that chrome plugin can read
// await fs.promises.mkdir(EXTENSIONS_DIR, { recursive: true });
// await fs.promises.writeFile(
// cacheFile,
// JSON.stringify(extension, null, 2)
// );
// console.log(`[+] Extension metadata written to ${cacheFile}`);
// }
// return extension;
// }
// Export functions for use by other plugins
module.exports = {
// DISABLED: Extension functionality - using single-file-cli only
// EXTENSION,
// installSinglefileExtension,
// saveSinglefileWithExtension,
saveSinglefileWithCLI,
};
// DISABLED: Extension functionality - using single-file-cli only
// // Run if executed directly
// if (require.main === module) {
// main().then(() => {
// console.log('[✓] SingleFile extension setup complete');
// process.exit(0);
// }).catch(err => {
// console.error('[❌] SingleFile extension setup failed:', err);
// process.exit(1);
// });
// }
// No-op when run directly (extension install disabled)
if (require.main === module) {
console.log('[*] SingleFile extension install disabled - using single-file-cli only');
process.exit(0);
}


@@ -77,27 +77,9 @@ def has_staticfile_output() -> bool:
return staticfile_dir.exists() and any(staticfile_dir.iterdir())
# Chrome binary search paths
CHROMIUM_BINARY_NAMES_LINUX = [
'chromium', 'chromium-browser', 'chromium-browser-beta',
'chromium-browser-unstable', 'chromium-browser-canary', 'chromium-browser-dev',
]
CHROME_BINARY_NAMES_LINUX = [
'google-chrome', 'google-chrome-stable', 'google-chrome-beta',
'google-chrome-canary', 'google-chrome-unstable', 'google-chrome-dev', 'chrome',
]
CHROME_BINARY_NAMES_MACOS = [
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
'/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary',
]
CHROMIUM_BINARY_NAMES_MACOS = ['/Applications/Chromium.app/Contents/MacOS/Chromium']
ALL_CHROME_BINARIES = (
CHROME_BINARY_NAMES_LINUX + CHROMIUM_BINARY_NAMES_LINUX +
CHROME_BINARY_NAMES_MACOS + CHROMIUM_BINARY_NAMES_MACOS
)
# Chrome session directory (relative to extractor output dir)
# Note: Chrome binary is obtained via CHROME_BINARY env var, not searched for.
# The centralized Chrome binary search is in chrome_utils.js findChromium().
CHROME_SESSION_DIR = '../chrome'


@@ -2,195 +2,173 @@
Integration tests for singlefile plugin
Tests verify:
1. Hook script exists and has correct metadata
2. Extension installation and caching works
3. Chrome/node dependencies available
4. Hook can be executed successfully
1. Hook scripts exist with correct naming
2. CLI-based singlefile extraction works
3. Dependencies available via abx-pkg
4. Output contains valid HTML
5. Connects to Chrome session via CDP when available
6. Works with extensions loaded (ublock, etc.)
"""
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_test_env,
get_plugin_dir,
get_hook_script,
setup_chrome_session,
cleanup_chrome,
)
PLUGIN_DIR = Path(__file__).parent.parent
PLUGINS_ROOT = PLUGIN_DIR.parent
INSTALL_SCRIPT = next(PLUGIN_DIR.glob('on_Crawl__*_singlefile.*'), None)
NPM_PROVIDER_HOOK = PLUGINS_ROOT / 'npm' / 'on_Binary__install_using_npm_provider.py'
PLUGIN_DIR = get_plugin_dir(__file__)
SNAPSHOT_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_singlefile.py')
TEST_URL = "https://example.com"
def test_install_script_exists():
"""Verify install script exists"""
assert INSTALL_SCRIPT.exists(), f"Install script not found: {INSTALL_SCRIPT}"
def test_snapshot_hook_exists():
"""Verify snapshot extraction hook exists"""
assert SNAPSHOT_HOOK is not None and SNAPSHOT_HOOK.exists(), f"Snapshot hook not found in {PLUGIN_DIR}"
def test_extension_metadata():
"""Test that SingleFile extension has correct metadata"""
with tempfile.TemporaryDirectory() as tmpdir:
env = os.environ.copy()
env["CHROME_EXTENSIONS_DIR"] = str(Path(tmpdir) / "chrome_extensions")
result = subprocess.run(
["node", "-e", f"const ext = require('{INSTALL_SCRIPT}'); console.log(JSON.stringify(ext.EXTENSION))"],
capture_output=True,
text=True,
env=env
)
assert result.returncode == 0, f"Failed to load extension metadata: {result.stderr}"
metadata = json.loads(result.stdout)
assert metadata["webstore_id"] == "mpiodijhokgodhhofbcjdecpffjipkle"
assert metadata["name"] == "singlefile"
def test_install_creates_cache():
"""Test that install creates extension cache"""
with tempfile.TemporaryDirectory() as tmpdir:
ext_dir = Path(tmpdir) / "chrome_extensions"
ext_dir.mkdir(parents=True)
env = os.environ.copy()
env["CHROME_EXTENSIONS_DIR"] = str(ext_dir)
result = subprocess.run(
["node", str(INSTALL_SCRIPT)],
capture_output=True,
text=True,
env=env,
timeout=60
)
# Check output mentions installation
assert "SingleFile" in result.stdout or "singlefile" in result.stdout
# Check cache file was created
cache_file = ext_dir / "singlefile.extension.json"
assert cache_file.exists(), "Cache file should be created"
# Verify cache content
cache_data = json.loads(cache_file.read_text())
assert cache_data["webstore_id"] == "mpiodijhokgodhhofbcjdecpffjipkle"
assert cache_data["name"] == "singlefile"
def test_install_twice_uses_cache():
"""Test that running install twice uses existing cache on second run"""
with tempfile.TemporaryDirectory() as tmpdir:
ext_dir = Path(tmpdir) / "chrome_extensions"
ext_dir.mkdir(parents=True)
env = os.environ.copy()
env["CHROME_EXTENSIONS_DIR"] = str(ext_dir)
# First install - downloads the extension
result1 = subprocess.run(
["node", str(INSTALL_SCRIPT)],
capture_output=True,
text=True,
env=env,
timeout=60
)
assert result1.returncode == 0, f"First install failed: {result1.stderr}"
# Verify cache was created
cache_file = ext_dir / "singlefile.extension.json"
assert cache_file.exists(), "Cache file should exist after first install"
# Second install - should use cache
result2 = subprocess.run(
["node", str(INSTALL_SCRIPT)],
capture_output=True,
text=True,
env=env,
timeout=30
)
assert result2.returncode == 0, f"Second install failed: {result2.stderr}"
# Second run should be faster (uses cache) and mention cache
assert "already installed" in result2.stdout or "cache" in result2.stdout.lower() or result2.returncode == 0
def test_no_configuration_required():
"""Test that SingleFile works without configuration"""
with tempfile.TemporaryDirectory() as tmpdir:
ext_dir = Path(tmpdir) / "chrome_extensions"
ext_dir.mkdir(parents=True)
env = os.environ.copy()
env["CHROME_EXTENSIONS_DIR"] = str(ext_dir)
# No API keys needed
result = subprocess.run(
["node", str(INSTALL_SCRIPT)],
capture_output=True,
text=True,
env=env,
timeout=60
)
# Should work without API keys
assert result.returncode == 0
def test_priority_order():
"""Test that singlefile has correct priority (04)"""
# Extract priority from filename
filename = INSTALL_SCRIPT.name
assert "04" in filename, "SingleFile should have priority 04"
assert filename.startswith("on_Crawl__04_"), "Should follow priority naming convention for Crawl hooks"
def test_output_directory_structure():
"""Test that plugin defines correct output structure"""
# Verify the script mentions singlefile output directory
script_content = INSTALL_SCRIPT.read_text()
# Should mention singlefile output directory
assert "singlefile" in script_content.lower()
# Should mention HTML output
assert ".html" in script_content or "html" in script_content.lower()
def test_snapshot_hook_priority():
"""Test that snapshot hook has correct priority (50)"""
filename = SNAPSHOT_HOOK.name
assert "50" in filename, "SingleFile snapshot hook should have priority 50"
assert filename.startswith("on_Snapshot__50_"), "Should follow priority naming convention"
def test_verify_deps_with_abx_pkg():
"""Verify dependencies are available via abx-pkg after hook installation."""
from abx_pkg import Binary, EnvProvider, BinProviderOverrides
"""Verify dependencies are available via abx-pkg."""
from abx_pkg import Binary, EnvProvider
EnvProvider.model_rebuild()
# Verify node is available (singlefile uses Chrome extension, needs Node)
# Verify node is available
node_binary = Binary(name='node', binproviders=[EnvProvider()])
node_loaded = node_binary.load()
assert node_loaded and node_loaded.abspath, "Node.js required for singlefile plugin"
def test_singlefile_hook_runs():
"""Verify singlefile hook can be executed and completes."""
# Prerequisites checked by earlier test
def test_singlefile_cli_archives_example_com():
"""Test that singlefile CLI archives example.com and produces valid HTML."""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# Run singlefile extraction hook
env = get_test_env()
env['SINGLEFILE_ENABLED'] = 'true'
# Run singlefile snapshot hook
result = subprocess.run(
['node', str(INSTALL_SCRIPT), f'--url={TEST_URL}', '--snapshot-id=test789'],
['python', str(SNAPSHOT_HOOK), f'--url={TEST_URL}', '--snapshot-id=test789'],
cwd=tmpdir,
capture_output=True,
text=True,
env=env,
timeout=120
)
# Hook should complete successfully (even if it just installs extension)
assert result.returncode == 0, f"Hook execution failed: {result.stderr}"
# Verify extension installation happens
assert 'SingleFile extension' in result.stdout or result.returncode == 0, "Should install extension or complete"
# Verify output file exists
output_file = tmpdir / 'singlefile.html'
assert output_file.exists(), f"singlefile.html not created. stdout: {result.stdout}, stderr: {result.stderr}"
# Verify it contains real HTML
html_content = output_file.read_text()
assert len(html_content) > 500, "Output file too small to be valid HTML"
assert '<!DOCTYPE html>' in html_content or '<html' in html_content, "Output should contain HTML doctype or html tag"
assert 'Example Domain' in html_content, "Output should contain example.com content"
def test_singlefile_with_chrome_session():
"""Test singlefile connects to existing Chrome session via CDP.
When a Chrome session exists (chrome/cdp_url.txt), singlefile should
connect to it instead of launching a new Chrome instance.
"""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
try:
# Set up Chrome session using shared helper
chrome_launch_process, chrome_pid, snapshot_chrome_dir = setup_chrome_session(
tmpdir=tmpdir,
crawl_id='singlefile-test-crawl',
snapshot_id='singlefile-test-snap',
test_url=TEST_URL,
navigate=False, # Don't navigate, singlefile will do that
timeout=20,
)
# singlefile looks for ../chrome/cdp_url.txt relative to cwd
# So we need to run from a directory that has ../chrome pointing to our chrome dir
singlefile_output_dir = tmpdir / 'snapshot' / 'singlefile'
singlefile_output_dir.mkdir(parents=True, exist_ok=True)
# Create symlink so singlefile can find the chrome session
chrome_link = singlefile_output_dir.parent / 'chrome'
if not chrome_link.exists():
chrome_link.symlink_to(tmpdir / 'crawl' / 'chrome')
env = get_test_env()
env['SINGLEFILE_ENABLED'] = 'true'
env['CHROME_HEADLESS'] = 'true'
# Run singlefile - it should find and use the existing Chrome session
result = subprocess.run(
['python', str(SNAPSHOT_HOOK), f'--url={TEST_URL}', '--snapshot-id=singlefile-test-snap'],
cwd=str(singlefile_output_dir),
capture_output=True,
text=True,
env=env,
timeout=120
)
# Verify output
output_file = singlefile_output_dir / 'singlefile.html'
if output_file.exists():
html_content = output_file.read_text()
assert len(html_content) > 500, "Output file too small"
assert 'Example Domain' in html_content, "Should contain example.com content"
else:
# If singlefile couldn't connect to Chrome, it may have failed
# Check if it mentioned browser-server in its args (indicating it tried to use CDP)
assert result.returncode == 0 or 'browser-server' in result.stderr or 'cdp' in result.stderr.lower(), \
f"Singlefile should attempt CDP connection. stderr: {result.stderr}"
finally:
cleanup_chrome(chrome_launch_process, chrome_pid)
def test_singlefile_disabled_skips():
"""Test that SINGLEFILE_ENABLED=False exits without JSONL."""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
env = get_test_env()
env['SINGLEFILE_ENABLED'] = 'False'
result = subprocess.run(
['python', str(SNAPSHOT_HOOK), f'--url={TEST_URL}', '--snapshot-id=test-disabled'],
cwd=tmpdir,
capture_output=True,
text=True,
env=env,
timeout=30
)
assert result.returncode == 0, f"Should exit 0 when disabled: {result.stderr}"
# Should NOT emit JSONL when disabled
jsonl_lines = [line for line in result.stdout.strip().split('\n') if line.strip().startswith('{')]
assert len(jsonl_lines) == 0, f"Should not emit JSONL when disabled, but got: {jsonl_lines}"
if __name__ == '__main__':


@@ -19,7 +19,7 @@ const puppeteer = require('puppeteer-core');
const PLUGIN_NAME = 'ssl';
const OUTPUT_DIR = '.';
const OUTPUT_FILE = 'ssl.jsonl';
const PID_FILE = 'hook.pid';
// PID file is now written by run_hook() with hook-specific name
const CHROME_SESSION_DIR = '../chrome';
function parseArgs() {
@@ -211,8 +211,8 @@ async function main() {
// Set up listener BEFORE navigation
await setupListener(url);
// Write PID file so chrome_cleanup can kill any remaining processes
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Note: PID file is written by run_hook() with hook-specific name
// Snapshot.cleanup() kills all *.pid processes when done
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();


@@ -18,7 +18,7 @@ const puppeteer = require('puppeteer-core');
const PLUGIN_NAME = 'staticfile';
const OUTPUT_DIR = '.';
const PID_FILE = 'hook.pid';
// PID file is now written by run_hook() with hook-specific name
const CHROME_SESSION_DIR = '../chrome';
// Content-Types that indicate static files
@@ -398,8 +398,8 @@ async function main() {
// Set up static file listener BEFORE navigation
await setupStaticFileListener();
// Write PID file
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Note: PID file is written by run_hook() with hook-specific name
// Snapshot.cleanup() kills all *.pid processes when done
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();


@@ -2,7 +2,6 @@
Integration tests for title plugin
Tests verify:
pass
1. Plugin script exists
2. Node.js is available
3. Title extraction works for real example.com
@@ -20,9 +19,15 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
get_plugin_dir,
get_hook_script,
parse_jsonl_output,
)
PLUGIN_DIR = Path(__file__).parent.parent
TITLE_HOOK = next(PLUGIN_DIR.glob('on_Snapshot__*_title.*'), None)
PLUGIN_DIR = get_plugin_dir(__file__)
TITLE_HOOK = get_hook_script(PLUGIN_DIR, 'on_Snapshot__*_title.*')
TEST_URL = 'https://example.com'


@@ -0,0 +1,50 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"additionalProperties": false,
"required_plugins": ["chrome"],
"properties": {
"TWOCAPTCHA_ENABLED": {
"type": "boolean",
"default": true,
"x-aliases": ["CAPTCHA2_ENABLED", "USE_CAPTCHA2", "USE_TWOCAPTCHA"],
"description": "Enable 2captcha browser extension for automatic CAPTCHA solving"
},
"TWOCAPTCHA_API_KEY": {
"type": "string",
"default": "",
"x-aliases": ["API_KEY_2CAPTCHA", "CAPTCHA2_API_KEY"],
"x-sensitive": true,
"description": "2captcha API key for CAPTCHA solving service (get from https://2captcha.com)"
},
"TWOCAPTCHA_RETRY_COUNT": {
"type": "integer",
"default": 3,
"minimum": 0,
"maximum": 10,
"x-aliases": ["CAPTCHA2_RETRY_COUNT"],
"description": "Number of times to retry CAPTCHA solving on error"
},
"TWOCAPTCHA_RETRY_DELAY": {
"type": "integer",
"default": 5,
"minimum": 0,
"maximum": 60,
"x-aliases": ["CAPTCHA2_RETRY_DELAY"],
"description": "Delay in seconds between CAPTCHA solving retries"
},
"TWOCAPTCHA_TIMEOUT": {
"type": "integer",
"default": 60,
"minimum": 5,
"x-fallback": "TIMEOUT",
"x-aliases": ["CAPTCHA2_TIMEOUT"],
"description": "Timeout for CAPTCHA solving in seconds"
},
"TWOCAPTCHA_AUTO_SUBMIT": {
"type": "boolean",
"default": false,
"description": "Automatically submit forms after CAPTCHA is solved"
}
}
}
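The schema above declares `x-aliases` for legacy variable names and `x-fallback` (`TIMEOUT`) for a shared default. A minimal Python sketch of alias-aware resolution, assuming the canonical name wins over aliases, which win over the fallback key (the helper name is hypothetical):

```python
import os

def resolve_env(name, aliases=(), fallback=None, default=None, cast=str):
    """Return the first set env var among name, its x-aliases, then x-fallback."""
    for key in (name, *aliases):
        if key in os.environ:
            return cast(os.environ[key])
    if fallback is not None and fallback in os.environ:
        return cast(os.environ[fallback])
    return default

# Example: mirrors TWOCAPTCHA_TIMEOUT from the schema above
timeout = resolve_env(
    'TWOCAPTCHA_TIMEOUT',
    aliases=('CAPTCHA2_TIMEOUT',),
    fallback='TIMEOUT',
    default=60,
    cast=int,
)
```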


@@ -0,0 +1,66 @@
#!/usr/bin/env node
/**
* 2Captcha Extension Plugin
*
* Installs and configures the 2captcha Chrome extension for automatic
* CAPTCHA solving during page archiving.
*
* Extension: https://chromewebstore.google.com/detail/ifibfemgeogfhoebkmokieepdoobkbpo
* Documentation: https://2captcha.com/blog/how-to-use-2captcha-solver-extension-in-puppeteer
*
 * Priority: 20 (early) - Must install before Chrome session starts at Crawl level
* Hook: on_Crawl (runs once per crawl, not per snapshot)
*
* Requirements:
* - TWOCAPTCHA_API_KEY environment variable must be set
* - Extension will automatically solve reCAPTCHA, hCaptcha, Cloudflare Turnstile, etc.
*/
// Import extension utilities
const { installExtensionWithCache } = require('../chrome/chrome_utils.js');
// Extension metadata
const EXTENSION = {
webstore_id: 'ifibfemgeogfhoebkmokieepdoobkbpo',
name: 'twocaptcha',
};
/**
* Main entry point - install extension before archiving
*
* Note: 2captcha configuration is handled by on_Crawl__25_configure_twocaptcha_extension_options.js
* during first-time browser setup to avoid repeated configuration on every snapshot.
* The API key is injected via chrome.storage API once per browser session.
*/
async function main() {
const extension = await installExtensionWithCache(EXTENSION);
if (extension) {
// Check if API key is configured
const apiKey = process.env.TWOCAPTCHA_API_KEY || process.env.API_KEY_2CAPTCHA;
if (!apiKey || apiKey === 'YOUR_API_KEY_HERE') {
console.warn('[⚠️] 2captcha extension installed but TWOCAPTCHA_API_KEY not configured');
console.warn('[⚠️] Set TWOCAPTCHA_API_KEY environment variable to enable automatic CAPTCHA solving');
} else {
console.log('[+] 2captcha extension installed and API key configured');
}
}
return extension;
}
// Export functions for use by other plugins
module.exports = {
EXTENSION,
};
// Run if executed directly
if (require.main === module) {
main().then(() => {
console.log('[✓] 2captcha extension setup complete');
process.exit(0);
}).catch(err => {
console.error('[❌] 2captcha extension setup failed:', err);
process.exit(1);
});
}


@@ -0,0 +1,348 @@
#!/usr/bin/env node
/**
* 2Captcha Extension Configuration
*
* Configures the 2captcha extension with API key and settings after Crawl-level Chrome session starts.
* Runs once per crawl to inject configuration into extension storage.
*
* Priority: 25 (after chrome_launch at 30, before snapshots start)
* Hook: on_Crawl (runs once per crawl, not per snapshot)
*
* Config Options (from config.json / environment):
* - TWOCAPTCHA_API_KEY: API key for 2captcha service
* - TWOCAPTCHA_ENABLED: Enable/disable the extension
* - TWOCAPTCHA_RETRY_COUNT: Number of retries on error
* - TWOCAPTCHA_RETRY_DELAY: Delay between retries (seconds)
* - TWOCAPTCHA_AUTO_SUBMIT: Auto-submit forms after solving
*
* Requirements:
* - TWOCAPTCHA_API_KEY environment variable must be set
* - chrome plugin must have loaded extensions (extensions.json must exist)
*/
const path = require('path');
const fs = require('fs');
// Add NODE_MODULES_DIR to module resolution paths if set
if (process.env.NODE_MODULES_DIR) module.paths.unshift(process.env.NODE_MODULES_DIR);
const puppeteer = require('puppeteer-core');
// Get crawl's chrome directory from environment variable set by hooks.py
function getCrawlChromeSessionDir() {
const crawlOutputDir = process.env.CRAWL_OUTPUT_DIR || '';
if (!crawlOutputDir) {
return null;
}
return path.join(crawlOutputDir, 'chrome');
}
const CHROME_SESSION_DIR = getCrawlChromeSessionDir() || '../chrome';
const CONFIG_MARKER = path.join(CHROME_SESSION_DIR, '.twocaptcha_configured');
// Get environment variable with default
function getEnv(name, defaultValue = '') {
return (process.env[name] || defaultValue).trim();
}
// Get boolean environment variable
function getEnvBool(name, defaultValue = false) {
const val = getEnv(name, '').toLowerCase();
if (['true', '1', 'yes', 'on'].includes(val)) return true;
if (['false', '0', 'no', 'off'].includes(val)) return false;
return defaultValue;
}
// Get integer environment variable
function getEnvInt(name, defaultValue = 0) {
const val = parseInt(getEnv(name, String(defaultValue)), 10);
return isNaN(val) ? defaultValue : val;
}
// Parse command line arguments
function parseArgs() {
const args = {};
process.argv.slice(2).forEach(arg => {
if (arg.startsWith('--')) {
const [key, ...valueParts] = arg.slice(2).split('=');
args[key.replace(/-/g, '_')] = valueParts.join('=') || true;
}
});
return args;
}
/**
* Get 2captcha configuration from environment variables.
* Supports both TWOCAPTCHA_* and legacy API_KEY_2CAPTCHA naming.
*/
function getTwoCaptchaConfig() {
const apiKey = getEnv('TWOCAPTCHA_API_KEY') || getEnv('API_KEY_2CAPTCHA') || getEnv('CAPTCHA2_API_KEY');
const isEnabled = getEnvBool('TWOCAPTCHA_ENABLED', true);
const retryCount = getEnvInt('TWOCAPTCHA_RETRY_COUNT', 3);
const retryDelay = getEnvInt('TWOCAPTCHA_RETRY_DELAY', 5);
const autoSubmit = getEnvBool('TWOCAPTCHA_AUTO_SUBMIT', false);
// Build the full config object matching the extension's storage structure
// Structure: chrome.storage.local.set({config: {...}})
return {
// API key - both variants for compatibility
apiKey: apiKey,
api_key: apiKey,
// Plugin enabled state
isPluginEnabled: isEnabled,
// Retry settings
repeatOnErrorTimes: retryCount,
repeatOnErrorDelay: retryDelay,
// Auto-submit setting
autoSubmitForms: autoSubmit,
submitFormsDelay: 0,
// Enable all CAPTCHA types
enabledForNormal: true,
enabledForRecaptchaV2: true,
enabledForInvisibleRecaptchaV2: true,
enabledForRecaptchaV3: true,
enabledForRecaptchaAudio: false,
enabledForGeetest: true,
enabledForGeetest_v4: true,
enabledForKeycaptcha: true,
enabledForArkoselabs: true,
enabledForLemin: true,
enabledForYandex: true,
enabledForCapyPuzzle: true,
enabledForTurnstile: true,
enabledForAmazonWaf: true,
enabledForMTCaptcha: true,
// Auto-solve all CAPTCHA types
autoSolveNormal: true,
autoSolveRecaptchaV2: true,
autoSolveInvisibleRecaptchaV2: true,
autoSolveRecaptchaV3: true,
autoSolveRecaptchaAudio: false,
autoSolveGeetest: true,
autoSolveGeetest_v4: true,
autoSolveKeycaptcha: true,
autoSolveArkoselabs: true,
autoSolveLemin: true,
autoSolveYandex: true,
autoSolveCapyPuzzle: true,
autoSolveTurnstile: true,
autoSolveAmazonWaf: true,
autoSolveMTCaptcha: true,
// Other settings with sensible defaults
recaptchaV2Type: 'token',
recaptchaV3MinScore: 0.3,
buttonPosition: 'inner',
useProxy: false,
proxy: '',
proxytype: 'HTTP',
blackListDomain: '',
autoSubmitRules: [],
normalSources: [],
};
}
async function configure2Captcha() {
// Check if already configured in this session
if (fs.existsSync(CONFIG_MARKER)) {
console.error('[*] 2captcha already configured in this browser session');
return { success: true, skipped: true };
}
// Get configuration
const config = getTwoCaptchaConfig();
// Check if API key is set
if (!config.apiKey || config.apiKey === 'YOUR_API_KEY_HERE') {
console.warn('[!] 2captcha extension loaded but TWOCAPTCHA_API_KEY not configured');
console.warn('[!] Set TWOCAPTCHA_API_KEY environment variable to enable automatic CAPTCHA solving');
return { success: false, error: 'TWOCAPTCHA_API_KEY not configured' };
}
console.error('[*] Configuring 2captcha extension...');
console.error(`[*] API Key: ${config.apiKey.slice(0, 8)}...${config.apiKey.slice(-4)}`);
console.error(`[*] Enabled: ${config.isPluginEnabled}`);
console.error(`[*] Retry Count: ${config.repeatOnErrorTimes}`);
console.error(`[*] Retry Delay: ${config.repeatOnErrorDelay}s`);
console.error(`[*] Auto Submit: ${config.autoSubmitForms}`);
console.error(`[*] Auto Solve: all CAPTCHA types enabled`);
try {
// Connect to the existing Chrome session via CDP
const cdpFile = path.join(CHROME_SESSION_DIR, 'cdp_url.txt');
if (!fs.existsSync(cdpFile)) {
return { success: false, error: 'CDP URL not found - chrome plugin must run first' };
}
const cdpUrl = fs.readFileSync(cdpFile, 'utf-8').trim();
const browser = await puppeteer.connect({ browserWSEndpoint: cdpUrl });
try {
// First, navigate to a page to trigger extension content scripts and wake up service worker
console.error('[*] Waking up extension by visiting a page...');
const triggerPage = await browser.newPage();
try {
await triggerPage.goto('https://www.google.com', { waitUntil: 'domcontentloaded', timeout: 10000 });
await new Promise(r => setTimeout(r, 3000)); // Give extension time to initialize
} catch (e) {
console.warn(`[!] Trigger page failed: ${e.message}`);
}
try { await triggerPage.close(); } catch (e) {}
// Get 2captcha extension info from extensions.json
const extensionsFile = path.join(CHROME_SESSION_DIR, 'extensions.json');
if (!fs.existsSync(extensionsFile)) {
return { success: false, error: 'extensions.json not found - chrome plugin must run first' };
}
const extensions = JSON.parse(fs.readFileSync(extensionsFile, 'utf-8'));
const captchaExt = extensions.find(ext => ext.name === 'twocaptcha');
if (!captchaExt) {
console.error('[*] 2captcha extension not installed, skipping configuration');
return { success: true, skipped: true };
}
if (!captchaExt.id) {
return { success: false, error: '2captcha extension ID not found in extensions.json' };
}
const extensionId = captchaExt.id;
console.error(`[*] 2captcha Extension ID: ${extensionId}`);
// Configure via options page
console.error('[*] Configuring via options page...');
const optionsUrl = `chrome-extension://${extensionId}/options/options.html`;
let configPage = await browser.newPage();
try {
// Navigate to options page - catch error but continue since page may still load
try {
await configPage.goto(optionsUrl, { waitUntil: 'networkidle0', timeout: 10000 });
} catch (navError) {
// Navigation may throw ERR_BLOCKED_BY_CLIENT but page still loads
console.error(`[*] Navigation threw error (may still work): ${navError.message}`);
}
// Wait a moment for page to settle
await new Promise(r => setTimeout(r, 3000));
// Check all pages for the extension page (Chrome may open it in a different tab)
const pages = await browser.pages();
for (const page of pages) {
const url = page.url();
if (url.startsWith(`chrome-extension://${extensionId}`)) {
configPage = page;
break;
}
}
const currentUrl = configPage.url();
console.error(`[*] Current URL: ${currentUrl}`);
if (!currentUrl.startsWith(`chrome-extension://${extensionId}`)) {
return { success: false, error: `Failed to navigate to options page, got: ${currentUrl}` };
}
// Wait for Config object to be available
console.error('[*] Waiting for Config object...');
await configPage.waitForFunction(() => typeof Config !== 'undefined', { timeout: 10000 });
// Use chrome.storage.local.set with the config wrapper
const result = await configPage.evaluate((cfg) => {
return new Promise((resolve) => {
if (typeof chrome !== 'undefined' && chrome.storage) {
chrome.storage.local.set({ config: cfg }, () => {
if (chrome.runtime.lastError) {
resolve({ success: false, error: chrome.runtime.lastError.message });
} else {
resolve({ success: true, method: 'options_page' });
}
});
} else {
resolve({ success: false, error: 'chrome.storage not available' });
}
});
}, config);
if (result.success) {
console.error(`[+] 2captcha configured via ${result.method}`);
fs.writeFileSync(CONFIG_MARKER, JSON.stringify({
timestamp: new Date().toISOString(),
method: result.method,
extensionId: extensionId,
config: {
apiKeySet: !!config.apiKey,
isPluginEnabled: config.isPluginEnabled,
repeatOnErrorTimes: config.repeatOnErrorTimes,
repeatOnErrorDelay: config.repeatOnErrorDelay,
autoSubmitForms: config.autoSubmitForms,
autoSolveEnabled: true,
}
}, null, 2));
return { success: true, method: result.method };
}
return { success: false, error: result.error || 'Config failed' };
} finally {
try { await configPage.close(); } catch (e) {}
}
} finally {
browser.disconnect();
}
} catch (e) {
return { success: false, error: `${e.name}: ${e.message}` };
}
}
async function main() {
const args = parseArgs();
const url = args.url;
const snapshotId = args.snapshot_id;
if (!url || !snapshotId) {
console.error('Usage: on_Crawl__25_configure_twocaptcha_extension_options.js --url=<url> --snapshot-id=<uuid>');
process.exit(1);
}
const startTs = new Date();
let status = 'failed';
let error = '';
try {
const result = await configure2Captcha();
if (result.skipped) {
status = 'skipped';
} else if (result.success) {
status = 'succeeded';
} else {
status = 'failed';
error = result.error || 'Configuration failed';
}
} catch (e) {
error = `${e.name}: ${e.message}`;
status = 'failed';
}
const endTs = new Date();
const duration = (endTs - startTs) / 1000;
if (error) {
console.error(`ERROR: ${error}`);
}
// Config hooks don't emit JSONL - they're utility hooks for setup
// Exit code indicates success/failure
process.exit(status === 'succeeded' || status === 'skipped' ? 0 : 1);
}
main().catch(e => {
console.error(`Fatal error: ${e.message}`);
process.exit(1);
});


@@ -0,0 +1,237 @@
"""
Integration tests for twocaptcha plugin
Run with: TWOCAPTCHA_API_KEY=your_key pytest archivebox/plugins/twocaptcha/tests/ -xvs
NOTE: Chrome 137+ removed --load-extension support, so these tests MUST use Chromium.
"""
import json
import os
import signal
import subprocess
import tempfile
import time
from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
setup_test_env,
launch_chromium_session,
kill_chromium_session,
CHROME_LAUNCH_HOOK,
PLUGINS_ROOT,
)
PLUGIN_DIR = Path(__file__).parent.parent
INSTALL_SCRIPT = PLUGIN_DIR / 'on_Crawl__20_install_twocaptcha_extension.js'
CONFIG_SCRIPT = PLUGIN_DIR / 'on_Crawl__25_configure_twocaptcha_extension_options.js'
TEST_URL = 'https://2captcha.com/demo/recaptcha-v2'
# Alias for backward compatibility with existing test names
launch_chrome = launch_chromium_session
kill_chrome = kill_chromium_session
class TestTwoCaptcha:
"""Integration tests requiring TWOCAPTCHA_API_KEY."""
@pytest.fixture(autouse=True)
def setup(self):
self.api_key = os.environ.get('TWOCAPTCHA_API_KEY') or os.environ.get('API_KEY_2CAPTCHA')
if not self.api_key:
pytest.skip("TWOCAPTCHA_API_KEY required")
def test_install_and_load(self):
"""Extension installs and loads in Chromium."""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
env = setup_test_env(tmpdir)
env['TWOCAPTCHA_API_KEY'] = self.api_key
# Install
result = subprocess.run(['node', str(INSTALL_SCRIPT)], env=env, timeout=120, capture_output=True, text=True)
assert result.returncode == 0, f"Install failed: {result.stderr}"
cache = Path(env['CHROME_EXTENSIONS_DIR']) / 'twocaptcha.extension.json'
assert cache.exists()
data = json.loads(cache.read_text())
assert data['webstore_id'] == 'ifibfemgeogfhoebkmokieepdoobkbpo'
# Launch Chromium in crawls directory
crawl_id = 'test'
crawl_dir = Path(env['CRAWLS_DIR']) / crawl_id
chrome_dir = crawl_dir / 'chrome'
env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
process, cdp_url = launch_chrome(env, chrome_dir, crawl_id)
try:
exts = json.loads((chrome_dir / 'extensions.json').read_text())
assert any(e['name'] == 'twocaptcha' for e in exts), f"Not loaded: {exts}"
print(f"[+] Extension loaded: id={next(e['id'] for e in exts if e['name']=='twocaptcha')}")
finally:
kill_chrome(process, chrome_dir)
def test_config_applied(self):
"""Configuration is applied to extension and verified via Config.getAll()."""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
env = setup_test_env(tmpdir)
env['TWOCAPTCHA_API_KEY'] = self.api_key
env['TWOCAPTCHA_RETRY_COUNT'] = '5'
env['TWOCAPTCHA_RETRY_DELAY'] = '10'
subprocess.run(['node', str(INSTALL_SCRIPT)], env=env, timeout=120, capture_output=True)
# Launch Chromium in crawls directory
crawl_id = 'cfg'
crawl_dir = Path(env['CRAWLS_DIR']) / crawl_id
chrome_dir = crawl_dir / 'chrome'
env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
process, cdp_url = launch_chrome(env, chrome_dir, crawl_id)
try:
result = subprocess.run(
['node', str(CONFIG_SCRIPT), '--url=https://example.com', '--snapshot-id=test'],
env=env, timeout=30, capture_output=True, text=True
)
assert result.returncode == 0, f"Config failed: {result.stderr}"
assert (chrome_dir / '.twocaptcha_configured').exists()
# Verify config via options.html and Config.getAll()
# Get the actual extension ID from the config marker (Chrome computes IDs differently)
config_marker = json.loads((chrome_dir / '.twocaptcha_configured').read_text())
ext_id = config_marker['extensionId']
script = f'''
if (process.env.NODE_MODULES_DIR) module.paths.unshift(process.env.NODE_MODULES_DIR);
const puppeteer = require('puppeteer-core');
(async () => {{
const browser = await puppeteer.connect({{ browserWSEndpoint: '{cdp_url}' }});
// Load options.html and use Config.getAll() to verify
const optionsUrl = 'chrome-extension://{ext_id}/options/options.html';
const page = await browser.newPage();
console.error('[*] Loading options page:', optionsUrl);
// Navigate - catch error but continue since page may still load
try {{
await page.goto(optionsUrl, {{ waitUntil: 'networkidle0', timeout: 10000 }});
}} catch (e) {{
console.error('[*] Navigation threw error (may still work):', e.message);
}}
// Wait for page to settle
await new Promise(r => setTimeout(r, 2000));
console.error('[*] Current URL:', page.url());
// Wait for Config object to be available
await page.waitForFunction(() => typeof Config !== 'undefined', {{ timeout: 5000 }});
// Call Config.getAll() - the extension's own API (returns a Promise)
const cfg = await page.evaluate(async () => await Config.getAll());
console.error('[*] Config.getAll() returned:', JSON.stringify(cfg));
await page.close();
browser.disconnect();
console.log(JSON.stringify(cfg));
}})();
'''
(tmpdir / 'v.js').write_text(script)
r = subprocess.run(['node', str(tmpdir / 'v.js')], env=env, timeout=30, capture_output=True, text=True)
print(r.stderr)
assert r.returncode == 0, f"Verify failed: {r.stderr}"
cfg = json.loads(r.stdout.strip().split('\n')[-1])
print(f"[*] Config from extension: {json.dumps(cfg, indent=2)}")
# Verify all the fields we care about
assert cfg.get('apiKey') == self.api_key or cfg.get('api_key') == self.api_key, f"API key not set: {cfg}"
assert cfg.get('isPluginEnabled') == True, f"Plugin not enabled: {cfg}"
assert cfg.get('repeatOnErrorTimes') == 5, f"Retry count wrong: {cfg}"
assert cfg.get('repeatOnErrorDelay') == 10, f"Retry delay wrong: {cfg}"
assert cfg.get('autoSolveRecaptchaV2') == True, f"autoSolveRecaptchaV2 not enabled: {cfg}"
assert cfg.get('autoSolveRecaptchaV3') == True, f"autoSolveRecaptchaV3 not enabled: {cfg}"
assert cfg.get('autoSolveTurnstile') == True, f"autoSolveTurnstile not enabled: {cfg}"
assert cfg.get('enabledForRecaptchaV2') == True, f"enabledForRecaptchaV2 not enabled: {cfg}"
print(f"[+] Config verified via Config.getAll()!")
finally:
kill_chrome(process, chrome_dir)
def test_solves_recaptcha(self):
"""Extension solves reCAPTCHA on demo page."""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
env = setup_test_env(tmpdir)
env['TWOCAPTCHA_API_KEY'] = self.api_key
subprocess.run(['node', str(INSTALL_SCRIPT)], env=env, timeout=120, capture_output=True)
# Launch Chromium in crawls directory
crawl_id = 'solve'
crawl_dir = Path(env['CRAWLS_DIR']) / crawl_id
chrome_dir = crawl_dir / 'chrome'
env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
process, cdp_url = launch_chrome(env, chrome_dir, crawl_id)
try:
subprocess.run(['node', str(CONFIG_SCRIPT), '--url=x', '--snapshot-id=x'], env=env, timeout=30, capture_output=True)
script = f'''
if (process.env.NODE_MODULES_DIR) module.paths.unshift(process.env.NODE_MODULES_DIR);
const puppeteer = require('puppeteer-core');
(async () => {{
const browser = await puppeteer.connect({{ browserWSEndpoint: '{cdp_url}' }});
const page = await browser.newPage();
await page.setViewport({{ width: 1440, height: 900 }});
console.error('[*] Loading {TEST_URL}...');
await page.goto('{TEST_URL}', {{ waitUntil: 'networkidle2', timeout: 30000 }});
await new Promise(r => setTimeout(r, 3000));
const start = Date.now();
const maxWait = 90000;
while (Date.now() - start < maxWait) {{
const state = await page.evaluate(() => {{
const resp = document.querySelector('textarea[name="g-recaptcha-response"]');
const solver = document.querySelector('.captcha-solver');
return {{
solved: resp ? resp.value.length > 0 : false,
state: solver?.getAttribute('data-state'),
text: solver?.textContent?.trim() || ''
}};
}});
const sec = Math.round((Date.now() - start) / 1000);
console.error('[*] ' + sec + 's state=' + state.state + ' solved=' + state.solved + ' text=' + state.text.slice(0,30));
if (state.solved) {{ console.error('[+] SOLVED!'); break; }}
if (state.state === 'error') {{ console.error('[!] ERROR'); break; }}
await new Promise(r => setTimeout(r, 2000));
}}
const final = await page.evaluate(() => {{
const resp = document.querySelector('textarea[name="g-recaptcha-response"]');
return {{ solved: resp ? resp.value.length > 0 : false, preview: resp?.value?.slice(0,50) || '' }};
}});
browser.disconnect();
console.log(JSON.stringify(final));
}})();
'''
(tmpdir / 's.js').write_text(script)
print("\n[*] Solving CAPTCHA (10-60s)...")
r = subprocess.run(['node', str(tmpdir / 's.js')], env=env, timeout=120, capture_output=True, text=True)
print(r.stderr)
assert r.returncode == 0, f"Failed: {r.stderr}"
final = json.loads([l for l in r.stdout.strip().split('\n') if l.startswith('{')][-1])
assert final.get('solved'), f"Not solved: {final}"
print(f"[+] SOLVED! {final.get('preview','')[:30]}...")
finally:
kill_chrome(process, chrome_dir)
if __name__ == '__main__':
pytest.main([__file__, '-xvs'])


@@ -0,0 +1,60 @@
#!/usr/bin/env node
/**
* uBlock Origin Extension Plugin
*
* Installs and configures the uBlock Origin Chrome extension for ad blocking
* and privacy protection during page archiving.
*
* Extension: https://chromewebstore.google.com/detail/cjpalhdlnbpafiamejdnhcphjbkeiagm
*
* Priority: 03 (early) - Must install before Chrome session starts at Crawl level
* Hook: on_Crawl (runs once per crawl, not per snapshot)
*
* This extension automatically:
* - Blocks ads, trackers, and malware domains
* - Reduces page load time and bandwidth usage
* - Improves privacy during archiving
* - Removes clutter from archived pages
* - Uses efficient blocking with filter lists
*/
// Import extension utilities
const { installExtensionWithCache } = require('../chrome/chrome_utils.js');
// Extension metadata
const EXTENSION = {
webstore_id: 'cjpalhdlnbpafiamejdnhcphjbkeiagm',
name: 'ublock',
};
/**
* Main entry point - install extension before archiving
*
* Note: uBlock Origin works automatically with default filter lists.
* No configuration needed - blocks ads, trackers, and malware domains out of the box.
*/
async function main() {
const extension = await installExtensionWithCache(EXTENSION);
if (extension) {
console.log('[+] Ads and trackers will be blocked during archiving');
}
return extension;
}
// Export functions for use by other plugins
module.exports = {
EXTENSION,
};
// Run if executed directly
if (require.main === module) {
main().then(() => {
console.log('[✓] uBlock Origin extension setup complete');
process.exit(0);
}).catch(err => {
console.error('[❌] uBlock Origin extension setup failed:', err);
process.exit(1);
});
}
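Downstream hooks can read back the `<name>.extension.json` cache that `installExtensionWithCache()` writes into `CHROME_EXTENSIONS_DIR`. A minimal sketch of such a reader (the fallback path and helper name are assumptions; the field names follow the `ublock.extension.json` cache asserted on in this PR's tests):

```python
import json
import os
from pathlib import Path
from typing import Optional

def load_extension_cache(name: str) -> Optional[dict]:
    """Read the <name>.extension.json cache written by installExtensionWithCache()."""
    # CHROME_EXTENSIONS_DIR is set by setup_test_env() / the chrome plugin env;
    # the fallback path here is only an illustrative assumption.
    ext_dir = Path(os.environ.get('CHROME_EXTENSIONS_DIR', 'data/chrome_extensions'))
    cache = ext_dir / f'{name}.extension.json'
    if not cache.exists():
        return None
    return json.loads(cache.read_text())
```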


@@ -12,9 +12,17 @@ from pathlib import Path
import pytest
from archivebox.plugins.chrome.tests.chrome_test_helpers import (
setup_test_env,
launch_chromium_session,
kill_chromium_session,
CHROME_LAUNCH_HOOK,
PLUGINS_ROOT,
)
PLUGIN_DIR = Path(__file__).parent.parent
INSTALL_SCRIPT = next(PLUGIN_DIR.glob('on_Crawl__*_ublock.*'), None)
INSTALL_SCRIPT = next(PLUGIN_DIR.glob('on_Crawl__*_install_ublock_extension.*'), None)
def test_install_script_exists():
@@ -157,91 +165,143 @@ def test_large_extension_size():
assert size_bytes > 1_000_000, f"uBlock Origin should be > 1MB, got {size_bytes} bytes"
PLUGINS_ROOT = PLUGIN_DIR.parent
CHROME_INSTALL_HOOK = PLUGINS_ROOT / 'chrome' / 'on_Crawl__00_chrome_install.py'
CHROME_LAUNCH_HOOK = PLUGINS_ROOT / 'chrome' / 'on_Crawl__20_chrome_launch.bg.js'
def check_ad_blocking(cdp_url: str, test_url: str, env: dict, script_dir: Path) -> dict:
"""Check ad blocking effectiveness by counting ad elements on page.
def setup_test_env(tmpdir: Path) -> dict:
"""Set up isolated data/lib directory structure for tests.
Creates structure like:
<tmpdir>/data/
lib/
arm64-darwin/ (or x86_64-linux, etc.)
npm/
bin/
node_modules/
chrome_extensions/
Calls chrome install hook which handles puppeteer-core and chromium installation.
Returns env dict with DATA_DIR, LIB_DIR, NPM_BIN_DIR, NODE_MODULES_DIR, CHROME_BINARY, etc.
Returns dict with:
- adElementsFound: int - number of ad-related elements found
- adElementsVisible: int - number of visible ad elements
- blockedRequests: int - number of blocked network requests (ads/trackers)
- totalRequests: int - total network requests made
- percentBlocked: int - percentage of ad elements hidden (0-100)
"""
import platform
test_script = f'''
if (process.env.NODE_MODULES_DIR) module.paths.unshift(process.env.NODE_MODULES_DIR);
const puppeteer = require('puppeteer-core');
# Determine machine type (matches archivebox.config.paths.get_machine_type())
machine = platform.machine().lower()
system = platform.system().lower()
if machine in ('arm64', 'aarch64'):
machine = 'arm64'
elif machine in ('x86_64', 'amd64'):
machine = 'x86_64'
machine_type = f"{machine}-{system}"
(async () => {{
const browser = await puppeteer.connect({{ browserWSEndpoint: '{cdp_url}' }});
# Create proper directory structure
data_dir = tmpdir / 'data'
lib_dir = data_dir / 'lib' / machine_type
npm_dir = lib_dir / 'npm'
npm_bin_dir = npm_dir / 'bin'
node_modules_dir = npm_dir / 'node_modules'
chrome_extensions_dir = data_dir / 'chrome_extensions'
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');
await page.setViewport({{ width: 1440, height: 900 }});
# Create all directories
node_modules_dir.mkdir(parents=True, exist_ok=True)
npm_bin_dir.mkdir(parents=True, exist_ok=True)
chrome_extensions_dir.mkdir(parents=True, exist_ok=True)
// Track network requests
let blockedRequests = 0;
let totalRequests = 0;
const adDomains = ['doubleclick', 'googlesyndication', 'googleadservices', 'facebook.com/tr',
'analytics', 'adservice', 'advertising', 'taboola', 'outbrain', 'criteo',
'amazon-adsystem', 'ads.yahoo', 'gemini.yahoo', 'yimg.com/cv/', 'beap.gemini'];
# Build complete env dict
env = os.environ.copy()
env.update({
'DATA_DIR': str(data_dir),
'LIB_DIR': str(lib_dir),
'MACHINE_TYPE': machine_type,
'NPM_BIN_DIR': str(npm_bin_dir),
'NODE_MODULES_DIR': str(node_modules_dir),
'CHROME_EXTENSIONS_DIR': str(chrome_extensions_dir),
})
page.on('request', request => {{
totalRequests++;
const url = request.url().toLowerCase();
if (adDomains.some(d => url.includes(d))) {{
// This is an ad request
}}
}});
page.on('requestfailed', request => {{
const url = request.url().toLowerCase();
if (adDomains.some(d => url.includes(d))) {{
blockedRequests++;
}}
}});
console.error('Navigating to {test_url}...');
await page.goto('{test_url}', {{ waitUntil: 'domcontentloaded', timeout: 60000 }});
// Wait for page to fully render and ads to load
await new Promise(r => setTimeout(r, 5000));
// Check for ad elements in the DOM
const result = await page.evaluate(() => {{
// Common ad-related selectors
const adSelectors = [
// Generic ad containers
'[class*="ad-"]', '[class*="ad_"]', '[class*="-ad"]', '[class*="_ad"]',
'[id*="ad-"]', '[id*="ad_"]', '[id*="-ad"]', '[id*="_ad"]',
'[class*="advertisement"]', '[id*="advertisement"]',
'[class*="sponsored"]', '[id*="sponsored"]',
// Google ads
'ins.adsbygoogle', '[data-ad-client]', '[data-ad-slot]',
// Yahoo specific
'[class*="gemini"]', '[data-beacon]', '[class*="native-ad"]',
'[class*="stream-ad"]', '[class*="LDRB"]', '[class*="ntv-ad"]',
// iframes (often ads)
'iframe[src*="ad"]', 'iframe[src*="doubleclick"]', 'iframe[src*="googlesyndication"]',
// Common ad sizes
'[style*="300px"][style*="250px"]', '[style*="728px"][style*="90px"]',
'[style*="160px"][style*="600px"]', '[style*="320px"][style*="50px"]',
];
let adElementsFound = 0;
let adElementsVisible = 0;
for (const selector of adSelectors) {{
try {{
const elements = document.querySelectorAll(selector);
for (const el of elements) {{
adElementsFound++;
const style = window.getComputedStyle(el);
const rect = el.getBoundingClientRect();
const isVisible = style.display !== 'none' &&
style.visibility !== 'hidden' &&
style.opacity !== '0' &&
rect.width > 0 && rect.height > 0;
if (isVisible) {{
adElementsVisible++;
}}
}}
}} catch (e) {{
// Invalid selector, skip
}}
}}
return {{
adElementsFound,
adElementsVisible,
pageTitle: document.title
}};
}});
result.blockedRequests = blockedRequests;
result.totalRequests = totalRequests;
// Calculate how many ad elements were hidden (found but not visible)
const hiddenAds = result.adElementsFound - result.adElementsVisible;
result.percentBlocked = result.adElementsFound > 0
? Math.round((hiddenAds / result.adElementsFound) * 100)
: 0;
console.error('Ad blocking result:', JSON.stringify(result));
browser.disconnect();
console.log(JSON.stringify(result));
}})();
'''
script_path = script_dir / 'check_ads.js'
script_path.write_text(test_script)
# Call chrome install hook (installs puppeteer-core and chromium, outputs JSONL)
result = subprocess.run(
['python', str(CHROME_INSTALL_HOOK)],
capture_output=True, text=True, timeout=10, env=env
['node', str(script_path)],
cwd=str(script_dir),
capture_output=True,
text=True,
env=env,
timeout=90
)
if result.returncode != 0:
pytest.skip(f"Chrome install hook failed: {result.stderr}")
raise RuntimeError(f"Ad check script failed: {result.stderr}")
# Parse JSONL output to get CHROME_BINARY
chrome_binary = None
for line in result.stdout.strip().split('\n'):
if not line.strip():
continue
try:
data = json.loads(line)
if data.get('type') == 'Binary' and data.get('abspath'):
chrome_binary = data['abspath']
break
except json.JSONDecodeError:
continue
output_lines = [l for l in result.stdout.strip().split('\n') if l.startswith('{')]
if not output_lines:
raise RuntimeError(f"No JSON output from ad check: {result.stdout}\nstderr: {result.stderr}")
if not chrome_binary or not Path(chrome_binary).exists():
pytest.skip(f"Chromium binary not found: {chrome_binary}")
env['CHROME_BINARY'] = chrome_binary
return env
return json.loads(output_lines[-1])
# Test URL: ad blocker test page that shows if ads are blocked
TEST_URL = 'https://d3ward.github.io/toolz/adblock.html'
# Test URL: Yahoo has many ads that uBlock should block
TEST_URL = 'https://www.yahoo.com/'
@pytest.mark.timeout(15)
@@ -290,14 +350,18 @@ def test_extension_loads_in_chromium():
print(f"[test] NODE_MODULES_DIR={env.get('NODE_MODULES_DIR')}", flush=True)
print(f"[test] puppeteer-core exists: {(Path(env['NODE_MODULES_DIR']) / 'puppeteer-core').exists()}", flush=True)
print("[test] Launching Chromium...", flush=True)
data_dir = Path(env['DATA_DIR'])
crawl_dir = data_dir / 'crawl'
crawl_dir.mkdir()
# Launch Chromium in crawls directory
crawl_id = 'test-ublock'
crawl_dir = Path(env['CRAWLS_DIR']) / crawl_id
crawl_dir.mkdir(parents=True, exist_ok=True)
chrome_dir = crawl_dir / 'chrome'
chrome_dir.mkdir(parents=True, exist_ok=True)
env['CRAWL_OUTPUT_DIR'] = str(crawl_dir)
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-ublock'],
cwd=str(crawl_dir),
['node', str(CHROME_LAUNCH_HOOK), f'--crawl-id={crawl_id}'],
cwd=str(chrome_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
@@ -457,161 +521,177 @@ const puppeteer = require('puppeteer-core');
def test_blocks_ads_on_test_page():
"""Live test: verify uBlock Origin blocks ads on a test page.
Uses Chromium with extensions loaded automatically via chrome hook.
Tests against d3ward's ad blocker test page which checks ad domains.
This test runs TWO browser sessions:
1. WITHOUT extension - verifies ads are NOT blocked (baseline)
2. WITH extension - verifies ads ARE blocked
This ensures we're actually testing the extension's effect, not just
that a test page happens to show ads as blocked.
"""
import signal
import time
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# Set up isolated env with proper directory structure
env = setup_test_env(tmpdir)
env['CHROME_HEADLESS'] = 'true'
env_base = setup_test_env(tmpdir)
env_base['CHROME_HEADLESS'] = 'true'
ext_dir = Path(env['CHROME_EXTENSIONS_DIR'])
# ============================================================
# STEP 1: BASELINE - Run WITHOUT extension, verify ads are NOT blocked
# ============================================================
print("\n" + "="*60)
print("STEP 1: BASELINE TEST (no extension)")
print("="*60)
data_dir = Path(env_base['DATA_DIR'])
env_no_ext = env_base.copy()
env_no_ext['CHROME_EXTENSIONS_DIR'] = str(data_dir / 'personas' / 'Default' / 'empty_extensions')
(data_dir / 'personas' / 'Default' / 'empty_extensions').mkdir(parents=True, exist_ok=True)
# Launch baseline Chromium in crawls directory
baseline_crawl_id = 'baseline-no-ext'
baseline_crawl_dir = Path(env_base['CRAWLS_DIR']) / baseline_crawl_id
baseline_crawl_dir.mkdir(parents=True, exist_ok=True)
baseline_chrome_dir = baseline_crawl_dir / 'chrome'
env_no_ext['CRAWL_OUTPUT_DIR'] = str(baseline_crawl_dir)
baseline_process = None
try:
baseline_process, baseline_cdp_url = launch_chromium_session(
env_no_ext, baseline_chrome_dir, baseline_crawl_id
)
print(f"Baseline Chromium launched: {baseline_cdp_url}")
# Wait a moment for browser to be ready
time.sleep(2)
baseline_result = check_ad_blocking(
baseline_cdp_url, TEST_URL, env_no_ext, tmpdir
)
print(f"Baseline result: {baseline_result['adElementsVisible']} visible ads "
f"(found {baseline_result['adElementsFound']} ad elements)")
finally:
if baseline_process:
kill_chromium_session(baseline_process, baseline_chrome_dir)
# Verify baseline shows ads ARE visible (not blocked)
if baseline_result['adElementsFound'] == 0:
pytest.skip(
f"Cannot test extension: no ad elements found on {TEST_URL}. "
f"The page may have changed or loaded differently."
)
if baseline_result['adElementsVisible'] == 0:
print(f"\nWARNING: Baseline shows 0 visible ads despite finding {baseline_result['adElementsFound']} elements!")
print("This suggests either:")
print(" - There's another ad blocker interfering")
print(" - Network-level ad blocking is in effect")
pytest.skip(
f"Cannot test extension: baseline shows no visible ads "
f"despite finding {baseline_result['adElementsFound']} ad elements."
)
print(f"\n✓ Baseline confirmed: {baseline_result['adElementsVisible']} visible ads without extension")
# ============================================================
# STEP 2: Install the uBlock extension
# ============================================================
print("\n" + "="*60)
print("STEP 2: INSTALLING EXTENSION")
print("="*60)
ext_dir = Path(env_base['CHROME_EXTENSIONS_DIR'])
# Step 1: Install the uBlock extension
result = subprocess.run(
['node', str(INSTALL_SCRIPT)],
capture_output=True,
text=True,
env=env,
timeout=15
env=env_base,
timeout=60
)
assert result.returncode == 0, f"Extension install failed: {result.stderr}"
# Verify extension cache was created
cache_file = ext_dir / 'ublock.extension.json'
assert cache_file.exists(), "Extension cache not created"
ext_data = json.loads(cache_file.read_text())
print(f"Extension installed: {ext_data.get('name')} v{ext_data.get('version')}")
# Step 2: Launch Chromium using the chrome hook (loads extensions automatically)
data_dir = Path(env['DATA_DIR'])
crawl_dir = data_dir / 'crawl'
crawl_dir.mkdir()
chrome_dir = crawl_dir / 'chrome'
# ============================================================
# STEP 3: Run WITH extension, verify ads ARE blocked
# ============================================================
print("\n" + "="*60)
print("STEP 3: TEST WITH EXTENSION")
print("="*60)
chrome_launch_process = subprocess.Popen(
['node', str(CHROME_LAUNCH_HOOK), '--crawl-id=test-ublock'],
cwd=str(crawl_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env
)
# Wait for Chrome to launch and CDP URL to be available
cdp_url = None
for i in range(20):
if chrome_launch_process.poll() is not None:
stdout, stderr = chrome_launch_process.communicate()
raise RuntimeError(f"Chrome launch failed:\nStdout: {stdout}\nStderr: {stderr}")
cdp_file = chrome_dir / 'cdp_url.txt'
if cdp_file.exists():
cdp_url = cdp_file.read_text().strip()
break
time.sleep(1)
assert cdp_url, "Chrome CDP URL not found after 20s"
print(f"Chrome launched with CDP URL: {cdp_url}")
# Check that extensions were loaded
extensions_file = chrome_dir / 'extensions.json'
if extensions_file.exists():
loaded_exts = json.loads(extensions_file.read_text())
print(f"Extensions loaded: {[e.get('name') for e in loaded_exts]}")
# Launch extension test Chromium in crawls directory
ext_crawl_id = 'test-with-ext'
ext_crawl_dir = Path(env_base['CRAWLS_DIR']) / ext_crawl_id
ext_crawl_dir.mkdir(parents=True, exist_ok=True)
ext_chrome_dir = ext_crawl_dir / 'chrome'
env_base['CRAWL_OUTPUT_DIR'] = str(ext_crawl_dir)
ext_process = None
try:
# Step 3: Connect to Chrome and test ad blocking
test_script = f'''
if (process.env.NODE_MODULES_DIR) module.paths.unshift(process.env.NODE_MODULES_DIR);
const puppeteer = require('puppeteer-core');
ext_process, ext_cdp_url = launch_chromium_session(
env_base, ext_chrome_dir, ext_crawl_id
)
print(f"Extension Chromium launched: {ext_cdp_url}")
(async () => {{
const browser = await puppeteer.connect({{ browserWSEndpoint: '{cdp_url}' }});
# Check that extension was loaded
extensions_file = ext_chrome_dir / 'extensions.json'
if extensions_file.exists():
loaded_exts = json.loads(extensions_file.read_text())
print(f"Extensions loaded: {[e.get('name') for e in loaded_exts]}")
// Wait for extension to initialize
await new Promise(r => setTimeout(r, 500));
# Wait for extension to initialize
time.sleep(3)
// Check extension loaded by looking at targets
const targets = browser.targets();
const extTargets = targets.filter(t =>
t.url().startsWith('chrome-extension://') ||
t.type() === 'service_worker' ||
t.type() === 'background_page'
);
console.error('Extension targets found:', extTargets.length);
extTargets.forEach(t => console.error(' -', t.type(), t.url().substring(0, 60)));
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36');
await page.setViewport({{ width: 1440, height: 900 }});
console.error('Navigating to {TEST_URL}...');
await page.goto('{TEST_URL}', {{ waitUntil: 'networkidle2', timeout: 60000 }});
// Wait for the test page to run its checks
await new Promise(r => setTimeout(r, 5000));
// The d3ward test page shows blocked percentage
const result = await page.evaluate(() => {{
const scoreEl = document.querySelector('#score');
const score = scoreEl ? scoreEl.textContent : null;
const blockedItems = document.querySelectorAll('.blocked').length;
const totalItems = document.querySelectorAll('.testlist li').length;
return {{
score,
blockedItems,
totalItems,
percentBlocked: totalItems > 0 ? Math.round((blockedItems / totalItems) * 100) : 0
}};
}});
console.error('Ad blocking result:', JSON.stringify(result));
browser.disconnect();
console.log(JSON.stringify(result));
}})();
'''
script_path = tmpdir / 'test_ublock.js'
script_path.write_text(test_script)
result = subprocess.run(
['node', str(script_path)],
cwd=str(tmpdir),
capture_output=True,
text=True,
env=env,
timeout=10
ext_result = check_ad_blocking(
ext_cdp_url, TEST_URL, env_base, tmpdir
)
print(f"stderr: {result.stderr}")
print(f"stdout: {result.stdout}")
assert result.returncode == 0, f"Test failed: {result.stderr}"
output_lines = [l for l in result.stdout.strip().split('\n') if l.startswith('{')]
assert output_lines, f"No JSON output: {result.stdout}"
test_result = json.loads(output_lines[-1])
# uBlock should block most ad domains on the test page
assert test_result['percentBlocked'] >= 50, \
f"uBlock should block at least 50% of ads, only blocked {test_result['percentBlocked']}%. Result: {test_result}"
print(f"Extension result: {ext_result['adElementsVisible']} visible ads "
f"(found {ext_result['adElementsFound']} ad elements)")
finally:
# Clean up Chrome
try:
chrome_launch_process.send_signal(signal.SIGTERM)
chrome_launch_process.wait(timeout=5)
except:
pass
chrome_pid_file = chrome_dir / 'chrome.pid'
if chrome_pid_file.exists():
try:
chrome_pid = int(chrome_pid_file.read_text().strip())
os.kill(chrome_pid, signal.SIGKILL)
except (OSError, ValueError):
pass
if ext_process:
kill_chromium_session(ext_process, ext_chrome_dir)
# ============================================================
# STEP 4: Compare results
# ============================================================
print("\n" + "="*60)
print("STEP 4: COMPARISON")
print("="*60)
print(f"Baseline (no extension): {baseline_result['adElementsVisible']} visible ads")
print(f"With extension: {ext_result['adElementsVisible']} visible ads")
# Calculate reduction in visible ads
ads_blocked = baseline_result['adElementsVisible'] - ext_result['adElementsVisible']
reduction_percent = (ads_blocked / baseline_result['adElementsVisible'] * 100) if baseline_result['adElementsVisible'] > 0 else 0
print(f"Reduction: {ads_blocked} fewer visible ads ({reduction_percent:.0f}% reduction)")
# Extension should significantly reduce visible ads
assert ext_result['adElementsVisible'] < baseline_result['adElementsVisible'], \
f"uBlock should reduce visible ads.\n" \
f"Baseline: {baseline_result['adElementsVisible']} visible ads\n" \
f"With extension: {ext_result['adElementsVisible']} visible ads\n" \
f"Expected fewer ads with extension."
# Extension should block at least 30% of ads
assert reduction_percent >= 30, \
f"uBlock should block at least 30% of ads.\n" \
f"Baseline: {baseline_result['adElementsVisible']} visible ads\n" \
f"With extension: {ext_result['adElementsVisible']} visible ads\n" \
f"Reduction: only {reduction_percent:.0f}% (expected at least 30%)"
print(f"\n✓ SUCCESS: uBlock correctly blocks ads!")
print(f" - Baseline: {baseline_result['adElementsVisible']} visible ads")
print(f" - With extension: {ext_result['adElementsVisible']} visible ads")
print(f" - Blocked: {ads_blocked} ads ({reduction_percent:.0f}% reduction)")


@@ -0,0 +1,130 @@
#!/usr/bin/env python3
"""
Validate and compute derived wget config values.
This hook runs early in the Crawl lifecycle to:
1. Validate config values with warnings (not hard errors)
2. Compute derived values (USE_WGET from WGET_ENABLED)
3. Check binary availability and version
Output:
- COMPUTED:KEY=VALUE lines that hooks.py parses and adds to env
- Binary JSONL records to stdout when binaries are found
"""
import json
import os
import shutil
import subprocess
import sys
from abx_pkg import Binary, EnvProvider
# Read config from environment (already validated by JSONSchema)
def get_env(name: str, default: str = '') -> str:
return os.environ.get(name, default).strip()
def get_env_bool(name: str, default: bool = False) -> bool:
val = get_env(name, '').lower()
if val in ('true', '1', 'yes', 'on'):
return True
if val in ('false', '0', 'no', 'off'):
return False
return default
def get_env_int(name: str, default: int = 0) -> int:
try:
return int(get_env(name, str(default)))
except ValueError:
return default
def output_binary(binary: Binary, name: str):
"""Output Binary JSONL record to stdout."""
machine_id = os.environ.get('MACHINE_ID', '')
record = {
'type': 'Binary',
'name': name,
'abspath': str(binary.abspath),
'version': str(binary.version) if binary.version else '',
'sha256': binary.sha256 or '',
'binprovider': 'env',
'machine_id': machine_id,
}
print(json.dumps(record))
def main():
warnings = []
errors = []
computed = {}
# Get config values
wget_enabled = get_env_bool('WGET_ENABLED', True)
wget_save_warc = get_env_bool('WGET_SAVE_WARC', True)
wget_timeout = get_env_int('WGET_TIMEOUT') or get_env_int('TIMEOUT', 60)
wget_binary = get_env('WGET_BINARY', 'wget')
# Compute derived values (USE_WGET for backward compatibility)
use_wget = wget_enabled
computed['USE_WGET'] = str(use_wget).lower()
# Validate timeout with warning (not error)
if use_wget and wget_timeout < 20:
warnings.append(
f"WGET_TIMEOUT={wget_timeout} is very low. "
"wget may fail to archive sites if set to less than ~20 seconds. "
"Consider setting WGET_TIMEOUT=60 or higher."
)
# Check binary availability using abx-pkg
provider = EnvProvider()
try:
binary = Binary(name=wget_binary, binproviders=[provider]).load()
binary_path = str(binary.abspath) if binary.abspath else ''
except Exception:
binary = None
binary_path = ''
if not binary_path:
if use_wget:
errors.append(f"WGET_BINARY={wget_binary} not found. Install wget or set WGET_ENABLED=false.")
computed['WGET_BINARY'] = ''
else:
computed['WGET_BINARY'] = binary_path
wget_version = str(binary.version) if binary.version else 'unknown'
computed['WGET_VERSION'] = wget_version
# Output Binary JSONL record
output_binary(binary, name='wget')
# Check for compression support
if computed.get('WGET_BINARY'):
try:
result = subprocess.run(
[computed['WGET_BINARY'], '--compression=auto', '--help'],
capture_output=True, timeout=5
)
computed['WGET_AUTO_COMPRESSION'] = 'true' if result.returncode == 0 else 'false'
except Exception:
computed['WGET_AUTO_COMPRESSION'] = 'false'
# Output results
# Format: KEY=VALUE lines that hooks.py will parse and add to env
for key, value in computed.items():
print(f"COMPUTED:{key}={value}")
for warning in warnings:
print(f"WARNING:{warning}", file=sys.stderr)
for error in errors:
print(f"ERROR:{error}", file=sys.stderr)
# Exit with error if any hard errors
sys.exit(1 if errors else 0)
if __name__ == '__main__':
main()
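
The hook above communicates with its caller purely through stdout line prefixes. As a minimal sketch of the consuming side (the `parse_hook_output` helper name and the sample values are hypothetical, not from hooks.py itself), parsing splits lines into computed config vs. Binary JSONL records:

```python
import json

def parse_hook_output(stdout: str):
    """Hypothetical sketch of the hooks.py side: split hook stdout into
    COMPUTED:KEY=VALUE config pairs and Binary JSONL records."""
    computed, binaries = {}, []
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith('COMPUTED:'):
            key, _, value = line[len('COMPUTED:'):].partition('=')
            computed[key] = value
        elif line.startswith('{'):
            binaries.append(json.loads(line))
    return computed, binaries

out = 'COMPUTED:USE_WGET=true\n{"type": "Binary", "name": "wget", "binprovider": "env"}\n'
computed, binaries = parse_hook_output(out)
# computed → {'USE_WGET': 'true'}; binaries → one Binary record for wget
```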

View File

@@ -0,0 +1,218 @@
"""archivebox/tests/conftest.py - Pytest fixtures for CLI tests."""
import os
import sys
import json
import subprocess
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple
import pytest
# =============================================================================
# Fixtures
# =============================================================================
@pytest.fixture
def isolated_data_dir(tmp_path, settings):
"""
Create isolated DATA_DIR for each test.
Uses tmp_path for isolation, configures Django settings.
"""
data_dir = tmp_path / 'archivebox_data'
data_dir.mkdir()
# Set environment for subprocess calls
os.environ['DATA_DIR'] = str(data_dir)
# Update Django settings
settings.DATA_DIR = data_dir
yield data_dir
# Cleanup handled by tmp_path fixture
@pytest.fixture
def initialized_archive(isolated_data_dir):
"""
Initialize ArchiveBox archive in isolated directory.
Runs `archivebox init` to set up database and directories.
"""
from archivebox.cli.archivebox_init import init
init(setup=True, quick=True)
return isolated_data_dir
@pytest.fixture
def cli_env(initialized_archive):
"""
Environment dict for CLI subprocess calls.
Includes DATA_DIR and disables slow extractors.
"""
return {
**os.environ,
'DATA_DIR': str(initialized_archive),
'USE_COLOR': 'False',
'SHOW_PROGRESS': 'False',
'SAVE_TITLE': 'True',
'SAVE_FAVICON': 'False',
'SAVE_WGET': 'False',
'SAVE_WARC': 'False',
'SAVE_PDF': 'False',
'SAVE_SCREENSHOT': 'False',
'SAVE_DOM': 'False',
'SAVE_SINGLEFILE': 'False',
'SAVE_READABILITY': 'False',
'SAVE_MERCURY': 'False',
'SAVE_GIT': 'False',
'SAVE_YTDLP': 'False',
'SAVE_HEADERS': 'False',
}
# =============================================================================
# CLI Helpers
# =============================================================================
def run_archivebox_cmd(
args: List[str],
stdin: Optional[str] = None,
cwd: Optional[Path] = None,
env: Optional[Dict[str, str]] = None,
timeout: int = 60,
) -> Tuple[str, str, int]:
"""
Run archivebox command, return (stdout, stderr, returncode).
Args:
args: Command arguments (e.g., ['crawl', 'create', 'https://example.com'])
stdin: Optional string to pipe to stdin
cwd: Working directory (defaults to DATA_DIR from env)
env: Environment variables (defaults to os.environ with DATA_DIR)
timeout: Command timeout in seconds
Returns:
Tuple of (stdout, stderr, returncode)
"""
cmd = [sys.executable, '-m', 'archivebox'] + args
env = env or {**os.environ}
cwd = cwd or Path(env.get('DATA_DIR', '.'))
result = subprocess.run(
cmd,
input=stdin,
capture_output=True,
text=True,
cwd=cwd,
env=env,
timeout=timeout,
)
return result.stdout, result.stderr, result.returncode
# =============================================================================
# Output Assertions
# =============================================================================
def parse_jsonl_output(stdout: str) -> List[Dict[str, Any]]:
"""Parse JSONL output into list of dicts."""
records = []
for line in stdout.strip().split('\n'):
line = line.strip()
if line and line.startswith('{'):
try:
records.append(json.loads(line))
except json.JSONDecodeError:
pass
return records
def assert_jsonl_contains_type(stdout: str, record_type: str, min_count: int = 1):
"""Assert output contains at least min_count records of type."""
records = parse_jsonl_output(stdout)
matching = [r for r in records if r.get('type') == record_type]
assert len(matching) >= min_count, \
f"Expected >= {min_count} {record_type}, got {len(matching)}"
return matching
def assert_jsonl_pass_through(stdout: str, input_records: List[Dict[str, Any]]):
"""Assert that input records appear in output (pass-through behavior)."""
output_records = parse_jsonl_output(stdout)
output_ids = {r.get('id') for r in output_records if r.get('id')}
for input_rec in input_records:
input_id = input_rec.get('id')
if input_id:
assert input_id in output_ids, \
f"Input record {input_id} not found in output (pass-through failed)"
def assert_record_has_fields(record: Dict[str, Any], required_fields: List[str]):
"""Assert record has all required fields with non-None values."""
for field in required_fields:
assert field in record, f"Record missing field: {field}"
assert record[field] is not None, f"Record field is None: {field}"
# =============================================================================
# Database Assertions
# =============================================================================
def assert_db_count(model_class, filters: Dict[str, Any], expected: int):
"""Assert database count matches expected."""
actual = model_class.objects.filter(**filters).count()
assert actual == expected, \
f"Expected {expected} {model_class.__name__}, got {actual}"
def assert_db_exists(model_class, **filters):
"""Assert at least one record exists matching filters."""
assert model_class.objects.filter(**filters).exists(), \
f"No {model_class.__name__} found matching {filters}"
# =============================================================================
# Test Data Factories
# =============================================================================
def create_test_url(domain: str = 'example.com', path: str = None) -> str:
"""Generate unique test URL."""
import uuid
path = path or uuid.uuid4().hex[:8]
return f'https://{domain}/{path}'
def create_test_crawl_json(urls: List[str] = None, **kwargs) -> Dict[str, Any]:
"""Create Crawl JSONL record for testing."""
from archivebox.misc.jsonl import TYPE_CRAWL
urls = urls or [create_test_url()]
return {
'type': TYPE_CRAWL,
'urls': '\n'.join(urls),
'max_depth': kwargs.get('max_depth', 0),
'tags_str': kwargs.get('tags_str', ''),
'status': kwargs.get('status', 'queued'),
**{k: v for k, v in kwargs.items() if k not in ('max_depth', 'tags_str', 'status')},
}
def create_test_snapshot_json(url: str = None, **kwargs) -> Dict[str, Any]:
"""Create Snapshot JSONL record for testing."""
from archivebox.misc.jsonl import TYPE_SNAPSHOT
return {
'type': TYPE_SNAPSHOT,
'url': url or create_test_url(),
'tags_str': kwargs.get('tags_str', ''),
'status': kwargs.get('status', 'queued'),
**{k: v for k, v in kwargs.items() if k not in ('tags_str', 'status')},
}
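
The tolerant parsing in `parse_jsonl_output` matters because CLI stdout can interleave progress/log lines with JSONL records. A self-contained sketch of that behavior (same logic inlined, with a made-up mixed-output sample):

```python
import json

def parse_jsonl_output(stdout: str):
    # Same tolerant parsing as the conftest helper: lines that don't start
    # with '{' or fail to parse as JSON are silently skipped.
    records = []
    for line in stdout.strip().split('\n'):
        line = line.strip()
        if line and line.startswith('{'):
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                pass
    return records

mixed = (
    '[i] Starting crawl...\n'
    '{"type": "Crawl", "id": "abc"}\n'
    '{not valid json\n'
    '{"type": "Snapshot", "url": "https://example.com"}\n'
)
records = parse_jsonl_output(mixed)
# → only the two valid JSONL records survive
```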

View File

@@ -0,0 +1,264 @@
"""
Tests for archivebox archiveresult CLI command.
Tests cover:
- archiveresult create (from Snapshot JSONL, with --plugin, pass-through)
- archiveresult list (with filters)
- archiveresult update
- archiveresult delete
"""
import json
import pytest
from archivebox.tests.conftest import (
run_archivebox_cmd,
parse_jsonl_output,
create_test_url,
)
class TestArchiveResultCreate:
"""Tests for `archivebox archiveresult create`."""
def test_create_from_snapshot_jsonl(self, cli_env, initialized_archive):
"""Create archive results from Snapshot JSONL input."""
url = create_test_url()
# Create a snapshot first
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
# Pipe snapshot to archiveresult create
stdout2, stderr, code = run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
assert code == 0, f"Command failed: {stderr}"
records = parse_jsonl_output(stdout2)
# Should have the Snapshot passed through and ArchiveResult created
types = [r.get('type') for r in records]
assert 'Snapshot' in types
assert 'ArchiveResult' in types
ar = next(r for r in records if r['type'] == 'ArchiveResult')
assert ar['plugin'] == 'title'
def test_create_with_specific_plugin(self, cli_env, initialized_archive):
"""Create archive result for specific plugin."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout2, stderr, code = run_archivebox_cmd(
['archiveresult', 'create', '--plugin=screenshot'],
stdin=json.dumps(snapshot),
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout2)
ar_records = [r for r in records if r.get('type') == 'ArchiveResult']
assert len(ar_records) >= 1
assert ar_records[0]['plugin'] == 'screenshot'
def test_create_pass_through_crawl(self, cli_env, initialized_archive):
"""Pass-through Crawl records unchanged."""
url = create_test_url()
# Create crawl and snapshot
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
stdout2, _, _ = run_archivebox_cmd(
['snapshot', 'create'],
stdin=json.dumps(crawl),
env=cli_env,
)
# Now pipe all to archiveresult create
stdout3, stderr, code = run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=stdout2,
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout3)
types = [r.get('type') for r in records]
assert 'Crawl' in types
assert 'Snapshot' in types
assert 'ArchiveResult' in types
def test_create_pass_through_only_when_no_snapshots(self, cli_env, initialized_archive):
"""Only pass-through records but no new snapshots returns success."""
crawl_record = {'type': 'Crawl', 'id': 'fake-id', 'urls': 'https://example.com'}
stdout, stderr, code = run_archivebox_cmd(
['archiveresult', 'create'],
stdin=json.dumps(crawl_record),
env=cli_env,
)
assert code == 0
assert 'Passed through' in stderr
class TestArchiveResultList:
"""Tests for `archivebox archiveresult list`."""
def test_list_empty(self, cli_env, initialized_archive):
"""List with no archive results returns empty."""
stdout, stderr, code = run_archivebox_cmd(
['archiveresult', 'list'],
env=cli_env,
)
assert code == 0
assert 'Listed 0 archive results' in stderr
def test_list_filter_by_status(self, cli_env, initialized_archive):
"""Filter archive results by status."""
# Create snapshot and archive result
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
stdout, stderr, code = run_archivebox_cmd(
['archiveresult', 'list', '--status=queued'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
for r in records:
assert r['status'] == 'queued'
def test_list_filter_by_plugin(self, cli_env, initialized_archive):
"""Filter archive results by plugin."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
stdout, stderr, code = run_archivebox_cmd(
['archiveresult', 'list', '--plugin=title'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
for r in records:
assert r['plugin'] == 'title'
def test_list_with_limit(self, cli_env, initialized_archive):
"""Limit number of results."""
# Create multiple archive results
for _ in range(3):
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
stdout, stderr, code = run_archivebox_cmd(
['archiveresult', 'list', '--limit=2'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) == 2
class TestArchiveResultUpdate:
"""Tests for `archivebox archiveresult update`."""
def test_update_status(self, cli_env, initialized_archive):
"""Update archive result status."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout2, _, _ = run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
ar = next(r for r in parse_jsonl_output(stdout2) if r.get('type') == 'ArchiveResult')
stdout3, stderr, code = run_archivebox_cmd(
['archiveresult', 'update', '--status=failed'],
stdin=json.dumps(ar),
env=cli_env,
)
assert code == 0
assert 'Updated 1 archive results' in stderr
records = parse_jsonl_output(stdout3)
assert records[0]['status'] == 'failed'
class TestArchiveResultDelete:
"""Tests for `archivebox archiveresult delete`."""
def test_delete_requires_yes(self, cli_env, initialized_archive):
"""Delete requires --yes flag."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout2, _, _ = run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
ar = next(r for r in parse_jsonl_output(stdout2) if r.get('type') == 'ArchiveResult')
stdout, stderr, code = run_archivebox_cmd(
['archiveresult', 'delete'],
stdin=json.dumps(ar),
env=cli_env,
)
assert code == 1
assert '--yes' in stderr
def test_delete_with_yes(self, cli_env, initialized_archive):
"""Delete with --yes flag works."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout2, _, _ = run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
ar = next(r for r in parse_jsonl_output(stdout2) if r.get('type') == 'ArchiveResult')
stdout, stderr, code = run_archivebox_cmd(
['archiveresult', 'delete', '--yes'],
stdin=json.dumps(ar),
env=cli_env,
)
assert code == 0
assert 'Deleted 1 archive results' in stderr
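
The pass-through assertions in these tests all exercise the same contract: each stage re-emits its input records and appends what it created. A minimal standalone sketch of that accumulating behavior (the `pipeline_stage` function and record values are illustrative, not actual ArchiveBox code):

```python
import json

def pipeline_stage(input_jsonl: str, created: list) -> str:
    # Accumulating pass-through: re-emit every input record, then append
    # the records this stage created, so downstream stages see everything.
    records = [json.loads(line) for line in input_jsonl.splitlines() if line.strip()]
    records.extend(created)
    return '\n'.join(json.dumps(r) for r in records)

out1 = pipeline_stage('', [{'type': 'Crawl', 'id': 'c1'}])
out2 = pipeline_stage(out1, [{'type': 'Snapshot', 'id': 's1'}])
out3 = pipeline_stage(out2, [{'type': 'ArchiveResult', 'id': 'a1'}])
# out3 carries all three record types, in creation order
```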

View File

@@ -0,0 +1,261 @@
"""
Tests for archivebox crawl CLI command.
Tests cover:
- crawl create (with URLs, from stdin, pass-through)
- crawl list (with filters)
- crawl update
- crawl delete
"""
import json
import pytest
from archivebox.tests.conftest import (
run_archivebox_cmd,
parse_jsonl_output,
assert_jsonl_contains_type,
create_test_url,
create_test_crawl_json,
)
class TestCrawlCreate:
"""Tests for `archivebox crawl create`."""
def test_create_from_url_args(self, cli_env, initialized_archive):
"""Create crawl from URL arguments."""
url = create_test_url()
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'create', url],
env=cli_env,
)
assert code == 0, f"Command failed: {stderr}"
assert 'Created crawl' in stderr
# Check JSONL output
records = parse_jsonl_output(stdout)
assert len(records) == 1
assert records[0]['type'] == 'Crawl'
assert url in records[0]['urls']
def test_create_from_stdin_urls(self, cli_env, initialized_archive):
"""Create crawl from stdin URLs (one per line)."""
urls = [create_test_url() for _ in range(3)]
stdin = '\n'.join(urls)
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'create'],
stdin=stdin,
env=cli_env,
)
assert code == 0, f"Command failed: {stderr}"
records = parse_jsonl_output(stdout)
assert len(records) == 1
crawl = records[0]
assert crawl['type'] == 'Crawl'
# All URLs should be in the crawl
for url in urls:
assert url in crawl['urls']
def test_create_with_depth(self, cli_env, initialized_archive):
"""Create crawl with --depth flag."""
url = create_test_url()
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'create', '--depth=2', url],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert records[0]['max_depth'] == 2
def test_create_with_tag(self, cli_env, initialized_archive):
"""Create crawl with --tag flag."""
url = create_test_url()
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'create', '--tag=test-tag', url],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert 'test-tag' in records[0].get('tags_str', '')
def test_create_pass_through_other_types(self, cli_env, initialized_archive):
"""Pass-through records of other types unchanged."""
tag_record = {'type': 'Tag', 'id': 'fake-tag-id', 'name': 'test'}
url = create_test_url()
stdin = json.dumps(tag_record) + '\n' + json.dumps({'url': url})
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'create'],
stdin=stdin,
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
# Should have both the passed-through Tag and the new Crawl
types = [r.get('type') for r in records]
assert 'Tag' in types
assert 'Crawl' in types
def test_create_pass_through_existing_crawl(self, cli_env, initialized_archive):
"""Existing Crawl records (with id) are passed through."""
# First create a crawl
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
# Now pipe it back - should pass through
stdout2, stderr, code = run_archivebox_cmd(
['crawl', 'create'],
stdin=json.dumps(crawl),
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout2)
assert len(records) == 1
assert records[0]['id'] == crawl['id']
class TestCrawlList:
"""Tests for `archivebox crawl list`."""
def test_list_empty(self, cli_env, initialized_archive):
"""List with no crawls returns empty."""
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'list'],
env=cli_env,
)
assert code == 0
assert 'Listed 0 crawls' in stderr
def test_list_returns_created(self, cli_env, initialized_archive):
"""List returns previously created crawls."""
url = create_test_url()
run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'list'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) >= 1
assert any(url in r.get('urls', '') for r in records)
def test_list_filter_by_status(self, cli_env, initialized_archive):
"""Filter crawls by status."""
url = create_test_url()
run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'list', '--status=queued'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
for r in records:
assert r['status'] == 'queued'
def test_list_with_limit(self, cli_env, initialized_archive):
"""Limit number of results."""
# Create multiple crawls
for _ in range(3):
run_archivebox_cmd(['crawl', 'create', create_test_url()], env=cli_env)
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'list', '--limit=2'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) == 2
class TestCrawlUpdate:
"""Tests for `archivebox crawl update`."""
def test_update_status(self, cli_env, initialized_archive):
"""Update crawl status."""
# Create a crawl
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
# Update it
stdout2, stderr, code = run_archivebox_cmd(
['crawl', 'update', '--status=started'],
stdin=json.dumps(crawl),
env=cli_env,
)
assert code == 0
assert 'Updated 1 crawls' in stderr
records = parse_jsonl_output(stdout2)
assert records[0]['status'] == 'started'
class TestCrawlDelete:
"""Tests for `archivebox crawl delete`."""
def test_delete_requires_yes(self, cli_env, initialized_archive):
"""Delete requires --yes flag."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'delete'],
stdin=json.dumps(crawl),
env=cli_env,
)
assert code == 1
assert '--yes' in stderr
def test_delete_with_yes(self, cli_env, initialized_archive):
"""Delete with --yes flag works."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'delete', '--yes'],
stdin=json.dumps(crawl),
env=cli_env,
)
assert code == 0
assert 'Deleted 1 crawls' in stderr
def test_delete_dry_run(self, cli_env, initialized_archive):
"""Dry run shows what would be deleted."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
stdout, stderr, code = run_archivebox_cmd(
['crawl', 'delete', '--dry-run'],
stdin=json.dumps(crawl),
env=cli_env,
)
assert code == 0
assert 'Would delete' in stderr
assert 'dry run' in stderr.lower()

View File

@@ -0,0 +1,254 @@
"""
Tests for archivebox run CLI command.
Tests cover:
- run with stdin JSONL (Crawl, Snapshot, ArchiveResult)
- create-or-update behavior (records with/without id)
- pass-through output (for chaining)
"""
import json
import pytest
from archivebox.tests.conftest import (
run_archivebox_cmd,
parse_jsonl_output,
create_test_url,
create_test_crawl_json,
create_test_snapshot_json,
)
class TestRunWithCrawl:
"""Tests for `archivebox run` with Crawl input."""
def test_run_with_new_crawl(self, cli_env, initialized_archive):
"""Run creates and processes a new Crawl (no id)."""
crawl_record = create_test_crawl_json()
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(crawl_record),
env=cli_env,
timeout=120,
)
assert code == 0, f"Command failed: {stderr}"
# Should output the created Crawl
records = parse_jsonl_output(stdout)
crawl_records = [r for r in records if r.get('type') == 'Crawl']
assert len(crawl_records) >= 1
assert crawl_records[0].get('id') # Should have an id now
def test_run_with_existing_crawl(self, cli_env, initialized_archive):
"""Run re-queues an existing Crawl (with id)."""
url = create_test_url()
# First create a crawl
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
# Run with the existing crawl
stdout2, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(crawl),
env=cli_env,
timeout=120,
)
assert code == 0
records = parse_jsonl_output(stdout2)
assert len(records) >= 1
class TestRunWithSnapshot:
"""Tests for `archivebox run` with Snapshot input."""
def test_run_with_new_snapshot(self, cli_env, initialized_archive):
"""Run creates and processes a new Snapshot (no id, just url)."""
snapshot_record = create_test_snapshot_json()
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(snapshot_record),
env=cli_env,
timeout=120,
)
assert code == 0, f"Command failed: {stderr}"
records = parse_jsonl_output(stdout)
snapshot_records = [r for r in records if r.get('type') == 'Snapshot']
assert len(snapshot_records) >= 1
assert snapshot_records[0].get('id')
def test_run_with_existing_snapshot(self, cli_env, initialized_archive):
"""Run re-queues an existing Snapshot (with id)."""
url = create_test_url()
# First create a snapshot
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
# Run with the existing snapshot
stdout2, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(snapshot),
env=cli_env,
timeout=120,
)
assert code == 0
records = parse_jsonl_output(stdout2)
assert len(records) >= 1
def test_run_with_plain_url(self, cli_env, initialized_archive):
"""Run accepts plain URL records (no type field)."""
url = create_test_url()
url_record = {'url': url}
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(url_record),
env=cli_env,
timeout=120,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) >= 1
class TestRunWithArchiveResult:
"""Tests for `archivebox run` with ArchiveResult input."""
def test_run_requeues_failed_archiveresult(self, cli_env, initialized_archive):
"""Run re-queues a failed ArchiveResult."""
url = create_test_url()
# Create snapshot and archive result
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout2, _, _ = run_archivebox_cmd(
['archiveresult', 'create', '--plugin=title'],
stdin=json.dumps(snapshot),
env=cli_env,
)
ar = next(r for r in parse_jsonl_output(stdout2) if r.get('type') == 'ArchiveResult')
# Update to failed
ar['status'] = 'failed'
run_archivebox_cmd(
['archiveresult', 'update', '--status=failed'],
stdin=json.dumps(ar),
env=cli_env,
)
# Now run should re-queue it
stdout3, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(ar),
env=cli_env,
timeout=120,
)
assert code == 0
records = parse_jsonl_output(stdout3)
ar_records = [r for r in records if r.get('type') == 'ArchiveResult']
assert len(ar_records) >= 1
class TestRunPassThrough:
"""Tests for pass-through behavior in `archivebox run`."""
def test_run_passes_through_unknown_types(self, cli_env, initialized_archive):
"""Run passes through records with unknown types."""
unknown_record = {'type': 'Unknown', 'id': 'fake-id', 'data': 'test'}
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(unknown_record),
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
unknown_records = [r for r in records if r.get('type') == 'Unknown']
assert len(unknown_records) == 1
assert unknown_records[0]['data'] == 'test'
def test_run_outputs_all_processed_records(self, cli_env, initialized_archive):
"""Run outputs all processed records for chaining."""
url = create_test_url()
crawl_record = create_test_crawl_json(urls=[url])
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(crawl_record),
env=cli_env,
timeout=120,
)
assert code == 0
records = parse_jsonl_output(stdout)
# Should have at least the Crawl in output
assert len(records) >= 1
class TestRunMixedInput:
"""Tests for `archivebox run` with mixed record types."""
def test_run_handles_mixed_types(self, cli_env, initialized_archive):
"""Run handles mixed Crawl/Snapshot/ArchiveResult input."""
crawl = create_test_crawl_json()
snapshot = create_test_snapshot_json()
unknown = {'type': 'Tag', 'id': 'fake', 'name': 'test'}
stdin = '\n'.join([
json.dumps(crawl),
json.dumps(snapshot),
json.dumps(unknown),
])
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin=stdin,
env=cli_env,
timeout=120,
)
assert code == 0
records = parse_jsonl_output(stdout)
types = set(r.get('type') for r in records)
        # At least one of the input record types should survive into the output
        # (Crawl/Snapshot processed, Tag passed through)
assert 'Crawl' in types or 'Snapshot' in types or 'Tag' in types
class TestRunEmpty:
"""Tests for `archivebox run` edge cases."""
def test_run_empty_stdin(self, cli_env, initialized_archive):
"""Run with empty stdin returns success."""
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin='',
env=cli_env,
)
assert code == 0
def test_run_no_records_to_process(self, cli_env, initialized_archive):
"""Run with only pass-through records shows message."""
unknown = {'type': 'Unknown', 'id': 'fake'}
stdout, stderr, code = run_archivebox_cmd(
['run'],
stdin=json.dumps(unknown),
env=cli_env,
)
assert code == 0
assert 'No records to process' in stderr
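
Because `run` emits JSONL for everything it processes, its output can be filtered downstream and piped back in, e.g. to retry failures. A hedged sketch with hypothetical record values (in practice this would be a jq one-liner between two `archivebox` invocations):

```python
import json

# Hypothetical `archivebox run` output: mixed record types with statuses
run_output = '\n'.join(json.dumps(r) for r in [
    {'type': 'Crawl', 'id': 'c1', 'status': 'sealed'},
    {'type': 'ArchiveResult', 'id': 'a1', 'plugin': 'title', 'status': 'succeeded'},
    {'type': 'ArchiveResult', 'id': 'a2', 'plugin': 'wget', 'status': 'failed'},
])

# jq-style downstream filter: keep only failed ArchiveResults so they can
# be re-emitted as JSONL and piped back into `archivebox run` for a retry
failed = [
    r for r in (json.loads(line) for line in run_output.splitlines())
    if r.get('type') == 'ArchiveResult' and r.get('status') == 'failed'
]
```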

View File

@@ -0,0 +1,274 @@
"""
Tests for archivebox snapshot CLI command.
Tests cover:
- snapshot create (from URLs, from Crawl JSONL, pass-through)
- snapshot list (with filters)
- snapshot update
- snapshot delete
"""
import json
import pytest
from archivebox.tests.conftest import (
run_archivebox_cmd,
parse_jsonl_output,
assert_jsonl_contains_type,
create_test_url,
)
class TestSnapshotCreate:
"""Tests for `archivebox snapshot create`."""
def test_create_from_url_args(self, cli_env, initialized_archive):
"""Create snapshot from URL arguments."""
url = create_test_url()
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'create', url],
env=cli_env,
)
assert code == 0, f"Command failed: {stderr}"
assert 'Created' in stderr
records = parse_jsonl_output(stdout)
assert len(records) == 1
assert records[0]['type'] == 'Snapshot'
assert records[0]['url'] == url
def test_create_from_crawl_jsonl(self, cli_env, initialized_archive):
"""Create snapshots from Crawl JSONL input."""
url = create_test_url()
# First create a crawl
stdout1, _, _ = run_archivebox_cmd(['crawl', 'create', url], env=cli_env)
crawl = parse_jsonl_output(stdout1)[0]
# Pipe crawl to snapshot create
stdout2, stderr, code = run_archivebox_cmd(
['snapshot', 'create'],
stdin=json.dumps(crawl),
env=cli_env,
)
assert code == 0, f"Command failed: {stderr}"
records = parse_jsonl_output(stdout2)
# Should have the Crawl passed through and the Snapshot created
types = [r.get('type') for r in records]
assert 'Crawl' in types
assert 'Snapshot' in types
snapshot = next(r for r in records if r['type'] == 'Snapshot')
assert snapshot['url'] == url
def test_create_with_tag(self, cli_env, initialized_archive):
"""Create snapshot with --tag flag."""
url = create_test_url()
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'create', '--tag=test-tag', url],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert 'test-tag' in records[0].get('tags_str', '')
def test_create_pass_through_other_types(self, cli_env, initialized_archive):
"""Pass-through records of other types unchanged."""
tag_record = {'type': 'Tag', 'id': 'fake-tag-id', 'name': 'test'}
url = create_test_url()
stdin = json.dumps(tag_record) + '\n' + json.dumps({'url': url})
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'create'],
stdin=stdin,
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
types = [r.get('type') for r in records]
assert 'Tag' in types
assert 'Snapshot' in types
def test_create_multiple_urls(self, cli_env, initialized_archive):
"""Create snapshots from multiple URLs."""
urls = [create_test_url() for _ in range(3)]
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'create'] + urls,
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) == 3
created_urls = {r['url'] for r in records}
for url in urls:
assert url in created_urls
class TestSnapshotList:
"""Tests for `archivebox snapshot list`."""
def test_list_empty(self, cli_env, initialized_archive):
"""List with no snapshots returns empty."""
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'list'],
env=cli_env,
)
assert code == 0
assert 'Listed 0 snapshots' in stderr
def test_list_returns_created(self, cli_env, initialized_archive):
"""List returns previously created snapshots."""
url = create_test_url()
run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'list'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) >= 1
assert any(r.get('url') == url for r in records)
def test_list_filter_by_status(self, cli_env, initialized_archive):
"""Filter snapshots by status."""
url = create_test_url()
run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'list', '--status=queued'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
for r in records:
assert r['status'] == 'queued'
def test_list_filter_by_url_contains(self, cli_env, initialized_archive):
"""Filter snapshots by URL contains."""
url = create_test_url(domain='unique-domain-12345.com')
run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'list', '--url__icontains=unique-domain-12345'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) == 1
assert 'unique-domain-12345' in records[0]['url']
def test_list_with_limit(self, cli_env, initialized_archive):
"""Limit number of results."""
for _ in range(3):
run_archivebox_cmd(['snapshot', 'create', create_test_url()], env=cli_env)
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'list', '--limit=2'],
env=cli_env,
)
assert code == 0
records = parse_jsonl_output(stdout)
assert len(records) == 2
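The `--status=` and `--url__icontains=` flags exercised above follow the plan's "queryset in → queryset out" filter design. As a dependency-free illustration only (the real implementation operates on Django querysets, not dicts), a Django-style lookup filter over plain records could be sketched as:

```python
def make_filter(**lookups):
    """Return a function filtering record dicts with Django-style lookups.

    Supports exact match (``status='queued'``) and ``__icontains``
    (``url__icontains='example'``), mirroring the CLI flags above.
    """
    def matches(record):
        for lookup, expected in lookups.items():
            field, _, op = lookup.partition('__')
            value = record.get(field)
            if op == 'icontains':
                if str(expected).lower() not in str(value).lower():
                    return False
            elif op in ('', 'exact'):
                if value != expected:
                    return False
            else:
                raise ValueError(f'unsupported lookup: {op}')
        return True
    return lambda records: [r for r in records if matches(r)]
```

In the actual codebase this role is played by the shared `apply_filters()` helper the plan extracts, applied to querysets for efficient DB-side filtering.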
class TestSnapshotUpdate:
"""Tests for `archivebox snapshot update`."""
def test_update_status(self, cli_env, initialized_archive):
"""Update snapshot status."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout2, stderr, code = run_archivebox_cmd(
['snapshot', 'update', '--status=started'],
stdin=json.dumps(snapshot),
env=cli_env,
)
assert code == 0
assert 'Updated 1 snapshots' in stderr
records = parse_jsonl_output(stdout2)
assert records[0]['status'] == 'started'
def test_update_add_tag(self, cli_env, initialized_archive):
"""Update snapshot by adding tag."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout2, stderr, code = run_archivebox_cmd(
['snapshot', 'update', '--tag=new-tag'],
stdin=json.dumps(snapshot),
env=cli_env,
)
assert code == 0
assert 'Updated 1 snapshots' in stderr
class TestSnapshotDelete:
"""Tests for `archivebox snapshot delete`."""
def test_delete_requires_yes(self, cli_env, initialized_archive):
"""Delete requires --yes flag."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'delete'],
stdin=json.dumps(snapshot),
env=cli_env,
)
assert code == 1
assert '--yes' in stderr
def test_delete_with_yes(self, cli_env, initialized_archive):
"""Delete with --yes flag works."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'delete', '--yes'],
stdin=json.dumps(snapshot),
env=cli_env,
)
assert code == 0
assert 'Deleted 1 snapshots' in stderr
def test_delete_dry_run(self, cli_env, initialized_archive):
"""Dry run shows what would be deleted."""
url = create_test_url()
stdout1, _, _ = run_archivebox_cmd(['snapshot', 'create', url], env=cli_env)
snapshot = parse_jsonl_output(stdout1)[0]
stdout, stderr, code = run_archivebox_cmd(
['snapshot', 'delete', '--dry-run'],
stdin=json.dumps(snapshot),
env=cli_env,
)
assert code == 0
assert 'Would delete' in stderr
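The tests above rely on a shared `parse_jsonl_output()` helper defined earlier in the suite. A minimal sketch of what such a helper can look like (a hypothetical reconstruction, not the canonical implementation) is:

```python
import json

def parse_jsonl_output(stdout: str) -> list[dict]:
    """Parse JSONL records from command stdout, skipping non-JSON lines."""
    records = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # skip log noise that is not a JSONL record
        records.append(json.loads(line))
    return records
```

Skipping non-`{` lines keeps the parser tolerant of stray log output on stdout, matching the pass-through pipeline design where only JSONL records flow between commands.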

View File

@@ -32,7 +32,7 @@ _supervisord_proc = None
 ORCHESTRATOR_WORKER = {
     "name": "worker_orchestrator",
-    "command": "archivebox manage orchestrator",
+    "command": "archivebox run",  # runs forever by default
     "autostart": "true",
     "autorestart": "true",
     "stdout_logfile": "logs/worker_orchestrator.log",

View File

@@ -133,7 +133,7 @@ This plugin provides shared Chrome infrastructure for other plugins. It manages
 chrome/
 ├── on_Crawl__00_chrome_install_config.py # Configure Chrome settings
 ├── on_Crawl__00_chrome_install.py # Install Chrome binary
-├── on_Crawl__20_chrome_launch.bg.js # Launch Chrome (Crawl-level, bg)
+├── on_Crawl__30_chrome_launch.bg.js # Launch Chrome (Crawl-level, bg)
 ├── on_Snapshot__20_chrome_tab.bg.js # Open tab (Snapshot-level, bg)
 ├── on_Snapshot__30_chrome_navigate.js # Navigate to URL (foreground)
 ├── on_Snapshot__45_chrome_tab_cleanup.py # Close tab, kill bg hooks
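The hook filenames listed above encode their trigger event, run order, and background flag in the name itself (`on_<Event>__<NN>_<name>[.bg].<ext>`). Assuming that convention holds exactly as shown, a small parser for such filenames could be sketched as (hypothetical, not the plugin loader's actual code):

```python
import re

# Matches hook filenames like on_Crawl__30_chrome_launch.bg.js
HOOK_RE = re.compile(
    r'^on_(?P<event>[A-Za-z]+)__(?P<order>\d+)_(?P<name>.+?)(?P<bg>\.bg)?\.(?P<ext>\w+)$'
)

def parse_hook_filename(filename: str) -> dict:
    """Split a hook filename into event, order, name, background flag, ext."""
    match = HOOK_RE.match(filename)
    if not match:
        raise ValueError(f'not a hook filename: {filename}')
    return {
        'event': match.group('event'),       # e.g. Crawl or Snapshot
        'order': int(match.group('order')),  # numeric run priority
        'name': match.group('name'),
        'background': match.group('bg') is not None,
        'ext': match.group('ext'),
    }
```

Parsed this way, the `20 → 30` rename in the diff above simply moves `chrome_launch` later in the Crawl-level run order.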