Files
ArchiveBox/TODO_cli_refactor.md
2025-12-30 16:12:53 -08:00

4.7 KiB

ArchiveBox CLI Refactor TODO

Design Decisions

  1. Keep archivebox add as high-level convenience command
  2. Unified archivebox run for processing (replaces per-model run and orchestrator)
  3. Expose all models including binary, process, machine
  4. Clean break from old command structure (no backward compatibility aliases)

Final Architecture

archivebox <model> <action> [args...] [--filters]
archivebox run [stdin JSONL]

Actions (4 per model):

  • create - Create records (from args, stdin, or JSONL), dedupes by indexed fields
  • list - Query records (with filters, returns JSONL)
  • update - Modify records (from stdin JSONL, PATCH semantics)
  • delete - Remove records (from stdin JSONL, requires --yes)

Unified Run Command:

  • archivebox run - Process queued work
    • With stdin JSONL: Process piped records, exit when complete
    • Without stdin (TTY): Run orchestrator in foreground until killed

Models (7 total):

  • crawl - Crawl jobs
  • snapshot - Individual archived pages
  • archiveresult - Plugin extraction results
  • tag - Tags/labels
  • binary - Detected binaries (chrome, wget, etc.)
  • process - Process execution records (read-only)
  • machine - Machine/host records (read-only)

Implementation Checklist

Phase 1: Unified Run Command

  • Create archivebox/cli/archivebox_run.py - unified processing command

Phase 2: Core Model Commands

  • Refactor archivebox/cli/archivebox_snapshot.py to Click group with create|list|update|delete
  • Refactor archivebox/cli/archivebox_crawl.py to Click group with create|list|update|delete
  • Create archivebox/cli/archivebox_archiveresult.py with create|list|update|delete
  • Create archivebox/cli/archivebox_tag.py with create|list|update|delete

Phase 3: System Model Commands

  • Create archivebox/cli/archivebox_binary.py with create|list|update|delete
  • Create archivebox/cli/archivebox_process.py with list only (read-only)
  • Create archivebox/cli/archivebox_machine.py with list only (read-only)

Phase 4: Registry & Cleanup

  • Update archivebox/cli/__init__.py command registry
  • Delete archivebox/cli/archivebox_extract.py
  • Delete archivebox/cli/archivebox_remove.py
  • Delete archivebox/cli/archivebox_search.py
  • Delete archivebox/cli/archivebox_orchestrator.py
  • Update archivebox/cli/archivebox_add.py internals (no changes needed - uses models directly)
  • Update archivebox/cli/tests_piping.py

Phase 5: Tests for New Commands

  • Add tests for archivebox run command
  • Add tests for archivebox crawl create|list|update|delete
  • Add tests for archivebox snapshot create|list|update|delete
  • Add tests for archivebox archiveresult create|list|update|delete
  • Add tests for archivebox tag create|list|update|delete
  • Add tests for archivebox binary create|list|update|delete
  • Add tests for archivebox process list
  • Add tests for archivebox machine list

Usage Examples

Basic CRUD

# Create
archivebox crawl create https://example.com https://foo.com --depth=1
archivebox snapshot create https://example.com --tag=news

# List with filters
archivebox crawl list --status=queued
archivebox snapshot list --url__icontains=example.com
archivebox archiveresult list --status=failed --plugin=screenshot

# Update (reads JSONL from stdin, applies changes)
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new

# Delete (requires --yes)
archivebox crawl list --url__icontains=example.com | archivebox crawl delete --yes

Unified Run Command

# Run orchestrator in foreground (replaces `archivebox orchestrator`)
archivebox run

# Process specific records (pipe any JSONL type, exits when done)
archivebox snapshot list --status=queued | archivebox run
archivebox archiveresult list --status=failed | archivebox run
archivebox crawl list --status=queued | archivebox run

# Mixed types work too - run handles any JSONL
cat mixed_records.jsonl | archivebox run

Composed Workflows

# Full pipeline (replaces old `archivebox add`)
archivebox crawl create https://example.com --status=queued \
  | archivebox snapshot create --status=queued \
  | archivebox archiveresult create --status=queued \
  | archivebox run

# Re-run failed extractions
archivebox archiveresult list --status=failed | archivebox run

# Delete all snapshots for a domain
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes

Keep archivebox add as convenience

# This remains the simple user-friendly interface:
archivebox add https://example.com --depth=1 --tag=news

# Internally equivalent to the composed pipeline above