mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-01-02 17:05:38 +10:00
ArchiveBox CLI Refactor TODO
Design Decisions
- Keep `archivebox add` as high-level convenience command
- Unified `archivebox run` for processing (replaces per-model `run` and `orchestrator`)
- Expose all models including binary, process, machine
- Clean break from old command structure (no backward compatibility aliases)
Final Architecture
archivebox <model> <action> [args...] [--filters]
archivebox run [stdin JSONL]
Actions (4 per model):
- `create` - Create records (from args, stdin, or JSONL), dedupes by indexed fields
- `list` - Query records (with filters, returns JSONL)
- `update` - Modify records (from stdin JSONL, PATCH semantics)
- `delete` - Remove records (from stdin JSONL, requires --yes)
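The `create` and `update` semantics above can be sketched in a few lines. This is a minimal in-memory illustration, not ArchiveBox's implementation: the helper names `create_records` and `patch_record` are hypothetical, and treating `url` as the indexed field is an assumption made only for the example.

```python
def create_records(store, incoming, key_fields):
    """Append records to store, skipping any whose key fields are already present.

    key_fields stands in for the model's indexed fields (assumed here).
    """
    seen = {tuple(record.get(field) for field in key_fields) for record in store}
    created = []
    for record in incoming:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            continue  # duplicate by indexed fields: do not create it again
        seen.add(key)
        store.append(record)
        created.append(record)
    return created


def patch_record(existing, changes):
    """PATCH semantics: overwrite only the keys present in changes."""
    return {**existing, **changes}


# Hypothetical records for illustration only.
store = [{"url": "https://example.com", "status": "queued"}]
created = create_records(
    store,
    [{"url": "https://example.com"}, {"url": "https://foo.com"}],
    key_fields=["url"],
)
patched = patch_record({"url": "https://foo.com", "tag": "old"}, {"tag": "new"})
```

Under PATCH semantics, fields missing from the incoming record are left untouched rather than cleared, which is what lets `update` accept partial records.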
Unified Run Command:
- `archivebox run` - Process queued work
  - With stdin JSONL: Process piped records, exit when complete
  - Without stdin (TTY): Run orchestrator in foreground until killed
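One way the two modes could be told apart is by checking whether stdin is a terminal. The sketch below assumes that approach. `read_jsonl` and `choose_mode` are hypothetical helper names, and the real command may decide differently.

```python
import io
import json
import sys


def read_jsonl(stream):
    """Yield one dict per non-blank line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def choose_mode(stdin=None):
    """Return "foreground" when stdin is a terminal, otherwise "piped"."""
    stdin = sys.stdin if stdin is None else stdin
    return "foreground" if stdin.isatty() else "piped"


# A piped stream (simulated here with StringIO) is not a TTY,
# so its records can be read to the end and the command can exit.
piped = io.StringIO('{"url": "https://example.com"}\n\n{"url": "https://foo.com"}\n')
mode = choose_mode(piped)
records = list(read_jsonl(piped))
```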
Models (7 total):
- `crawl` - Crawl jobs
- `snapshot` - Individual archived pages
- `archiveresult` - Plugin extraction results
- `tag` - Tags/labels
- `binary` - Detected binaries (chrome, wget, etc.)
- `process` - Process execution records (read-only)
- `machine` - Machine/host records (read-only)
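The `archivebox <model> <action> [args...] [--filters]` shape and the read-only restriction can be sketched as follows. The plan calls for Click groups. This sketch uses the standard library's `argparse` instead so that it stays self-contained. `MODEL_ACTIONS` and `parse_command` are hypothetical names, and treating every unknown `--key=value` token as an option is an assumption.

```python
import argparse

# Actions exposed per model, as listed in this plan.
# `process` and `machine` are read-only, so they only get `list`.
CRUD = ["create", "list", "update", "delete"]
MODEL_ACTIONS = {
    "crawl": CRUD,
    "snapshot": CRUD,
    "archiveresult": CRUD,
    "tag": CRUD,
    "binary": CRUD,
    "process": ["list"],
    "machine": ["list"],
}


def build_parser():
    """Build a two-level parser: first the model, then the action."""
    parser = argparse.ArgumentParser(prog="archivebox")
    models = parser.add_subparsers(dest="model", required=True)
    for model, actions in MODEL_ACTIONS.items():
        model_parser = models.add_parser(model)
        action_parsers = model_parser.add_subparsers(dest="action", required=True)
        for action in actions:
            action_parsers.add_parser(action)
    return parser


def parse_command(argv):
    """Split argv into (model, action, positional args, --key=value options)."""
    namespace, extra = build_parser().parse_known_args(argv)
    positional, options = [], {}
    for token in extra:
        if token.startswith("--"):
            key, sep, value = token[2:].partition("=")
            options[key] = value if sep else True  # bare flags such as --yes
        else:
            positional.append(token)
    return namespace.model, namespace.action, positional, options
```

With this layout, an action that a model does not expose (for example `process delete`) is rejected by the parser itself and never reaches model code.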
Implementation Checklist
Phase 1: Unified Run Command
- Create `archivebox/cli/archivebox_run.py` - unified processing command
Phase 2: Core Model Commands
- Refactor `archivebox/cli/archivebox_snapshot.py` to Click group with create|list|update|delete
- Refactor `archivebox/cli/archivebox_crawl.py` to Click group with create|list|update|delete
- Create `archivebox/cli/archivebox_archiveresult.py` with create|list|update|delete
- Create `archivebox/cli/archivebox_tag.py` with create|list|update|delete
Phase 3: System Model Commands
- Create `archivebox/cli/archivebox_binary.py` with create|list|update|delete
- Create `archivebox/cli/archivebox_process.py` with list only (read-only)
- Create `archivebox/cli/archivebox_machine.py` with list only (read-only)
Phase 4: Registry & Cleanup
- Update `archivebox/cli/__init__.py` command registry
- Delete `archivebox/cli/archivebox_extract.py`
- Delete `archivebox/cli/archivebox_remove.py`
- Delete `archivebox/cli/archivebox_search.py`
- Delete `archivebox/cli/archivebox_orchestrator.py`
- Update `archivebox/cli/archivebox_add.py` internals (no changes needed - uses models directly)
- Update `archivebox/cli/tests_piping.py`
Phase 5: Tests for New Commands
- Add tests for `archivebox run` command
- Add tests for `archivebox crawl create|list|update|delete`
- Add tests for `archivebox snapshot create|list|update|delete`
- Add tests for `archivebox archiveresult create|list|update|delete`
- Add tests for `archivebox tag create|list|update|delete`
- Add tests for `archivebox binary create|list|update|delete`
- Add tests for `archivebox process list`
- Add tests for `archivebox machine list`
Usage Examples
Basic CRUD
# Create
archivebox crawl create https://example.com https://foo.com --depth=1
archivebox snapshot create https://example.com --tag=news
# List with filters
archivebox crawl list --status=queued
archivebox snapshot list --url__icontains=example.com
archivebox archiveresult list --status=failed --plugin=screenshot
# Update (reads JSONL from stdin, applies changes)
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new
# Delete (requires --yes)
archivebox crawl list --url__icontains=example.com | archivebox crawl delete --yes
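Filters such as `--url__icontains=example.com` follow the `field__lookup` naming used by Django querysets. The sketch below shows that naming applied to plain dicts. It is an in-memory stand-in for illustration only, and it assumes only exact matches and `icontains` need handling. The `matches` helper is hypothetical, and how the real commands pass filters to the database is not shown here.

```python
def matches(record, filters):
    """Return True when record satisfies every field or field__lookup filter."""
    for lookup, expected in filters.items():
        field, _, op = lookup.partition("__")
        value = record.get(field)
        if op == "":
            # Plain --field=value: exact match, compared as strings.
            if str(value) != str(expected):
                return False
        elif op == "icontains":
            # Case-insensitive substring match.
            if value is None or str(expected).lower() not in str(value).lower():
                return False
        else:
            raise ValueError(f"unsupported lookup: {op}")
    return True


# Hypothetical records for illustration only.
snapshots = [
    {"url": "https://Example.com/a", "status": "queued"},
    {"url": "https://foo.com/b", "status": "queued"},
]
selected = [s for s in snapshots if matches(s, {"url__icontains": "example.com"})]
```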
Unified Run Command
# Run orchestrator in foreground (replaces `archivebox orchestrator`)
archivebox run
# Process specific records (pipe any JSONL type, exits when done)
archivebox snapshot list --status=queued | archivebox run
archivebox archiveresult list --status=failed | archivebox run
archivebox crawl list --status=queued | archivebox run
# Mixed types work too - run handles any JSONL
cat mixed_records.jsonl | archivebox run
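Handling mixed input implies that each JSONL line carries enough information to identify its model. The sketch below assumes a `type` field for that purpose. The field name and the `group_by_type` helper are hypothetical and not confirmed by this document.

```python
import json
from collections import defaultdict


def group_by_type(lines, type_field="type"):
    """Parse JSONL lines and bucket the records by their (assumed) type field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        groups[record.get(type_field, "unknown")].append(record)
    return dict(groups)


# Hypothetical mixed input for illustration only.
mixed = [
    '{"type": "Snapshot", "url": "https://example.com"}',
    '{"type": "Crawl", "urls": "https://foo.com"}',
    '{"type": "Snapshot", "url": "https://foo.com"}',
]
groups = group_by_type(mixed)
```

Grouping before processing is only one option. A streaming dispatcher that handles each record as it arrives would avoid buffering the whole input.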
Composed Workflows
# Full pipeline (explicit equivalent of what `archivebox add` does internally)
archivebox crawl create https://example.com --status=queued \
| archivebox snapshot create --status=queued \
| archivebox archiveresult create --status=queued \
| archivebox run
# Re-run failed extractions
archivebox archiveresult list --status=failed | archivebox run
# Delete all snapshots for a domain
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes
Keep archivebox add as convenience
# This remains the simple user-friendly interface:
archivebox add https://example.com --depth=1 --tag=news
# Internally equivalent to the composed pipeline above