mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-01-02 17:05:38 +10:00
132 lines
4.7 KiB
Markdown
132 lines
4.7 KiB
Markdown
# ArchiveBox CLI Refactor TODO
|
|
|
|
## Design Decisions
|
|
|
|
1. **Keep `archivebox add`** as high-level convenience command
|
|
2. **Unified `archivebox run`** for processing (replaces per-model `run` and `orchestrator`)
|
|
3. **Expose all models** including binary, process, machine
|
|
4. **Clean break** from old command structure (no backward compatibility aliases)
|
|
|
|
## Final Architecture
|
|
|
|
```
|
|
archivebox <model> <action> [args...] [--filters]
|
|
archivebox run [stdin JSONL]
|
|
```
|
|
|
|
### Actions (4 per model):
|
|
- `create` - Create records (from args, stdin, or JSONL), dedupes by indexed fields
|
|
- `list` - Query records (with filters, returns JSONL)
|
|
- `update` - Modify records (from stdin JSONL, PATCH semantics)
|
|
- `delete` - Remove records (from stdin JSONL, requires --yes)
|
|
|
|
### Unified Run Command:
|
|
- `archivebox run` - Process queued work
|
|
- With stdin JSONL: Process piped records, exit when complete
|
|
- Without stdin (TTY): Run orchestrator in foreground until killed
|
|
|
|
### Models (7 total):
|
|
- `crawl` - Crawl jobs
|
|
- `snapshot` - Individual archived pages
|
|
- `archiveresult` - Plugin extraction results
|
|
- `tag` - Tags/labels
|
|
- `binary` - Detected binaries (chrome, wget, etc.)
|
|
- `process` - Process execution records (read-only)
|
|
- `machine` - Machine/host records (read-only)
|
|
|
|
---
|
|
|
|
## Implementation Checklist
|
|
|
|
### Phase 1: Unified Run Command
|
|
- [x] Create `archivebox/cli/archivebox_run.py` - unified processing command
|
|
|
|
### Phase 2: Core Model Commands
|
|
- [x] Refactor `archivebox/cli/archivebox_snapshot.py` to Click group with create|list|update|delete
|
|
- [x] Refactor `archivebox/cli/archivebox_crawl.py` to Click group with create|list|update|delete
|
|
- [x] Create `archivebox/cli/archivebox_archiveresult.py` with create|list|update|delete
|
|
- [x] Create `archivebox/cli/archivebox_tag.py` with create|list|update|delete
|
|
|
|
### Phase 3: System Model Commands
|
|
- [x] Create `archivebox/cli/archivebox_binary.py` with create|list|update|delete
|
|
- [x] Create `archivebox/cli/archivebox_process.py` with list only (read-only)
|
|
- [x] Create `archivebox/cli/archivebox_machine.py` with list only (read-only)
|
|
|
|
### Phase 4: Registry & Cleanup
|
|
- [x] Update `archivebox/cli/__init__.py` command registry
|
|
- [x] Delete `archivebox/cli/archivebox_extract.py`
|
|
- [x] Delete `archivebox/cli/archivebox_remove.py`
|
|
- [x] Delete `archivebox/cli/archivebox_search.py`
|
|
- [x] Delete `archivebox/cli/archivebox_orchestrator.py`
|
|
- [x] Update `archivebox/cli/archivebox_add.py` internals (no changes needed - uses models directly)
|
|
- [x] Update `archivebox/cli/tests_piping.py`
|
|
|
|
### Phase 5: Tests for New Commands
|
|
- [ ] Add tests for `archivebox run` command
|
|
- [ ] Add tests for `archivebox crawl create|list|update|delete`
|
|
- [ ] Add tests for `archivebox snapshot create|list|update|delete`
|
|
- [ ] Add tests for `archivebox archiveresult create|list|update|delete`
|
|
- [ ] Add tests for `archivebox tag create|list|update|delete`
|
|
- [ ] Add tests for `archivebox binary create|list|update|delete`
|
|
- [ ] Add tests for `archivebox process list`
|
|
- [ ] Add tests for `archivebox machine list`
|
|
|
|
---
|
|
|
|
## Usage Examples
|
|
|
|
### Basic CRUD
|
|
```bash
|
|
# Create
|
|
archivebox crawl create https://example.com https://foo.com --depth=1
|
|
archivebox snapshot create https://example.com --tag=news
|
|
|
|
# List with filters
|
|
archivebox crawl list --status=queued
|
|
archivebox snapshot list --url__icontains=example.com
|
|
archivebox archiveresult list --status=failed --plugin=screenshot
|
|
|
|
# Update (reads JSONL from stdin, applies changes)
|
|
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new
|
|
|
|
# Delete (requires --yes)
|
|
archivebox crawl list --url__icontains=example.com | archivebox crawl delete --yes
|
|
```
|
|
|
|
### Unified Run Command
|
|
```bash
|
|
# Run orchestrator in foreground (replaces `archivebox orchestrator`)
|
|
archivebox run
|
|
|
|
# Process specific records (pipe any JSONL type, exits when done)
|
|
archivebox snapshot list --status=queued | archivebox run
|
|
archivebox archiveresult list --status=failed | archivebox run
|
|
archivebox crawl list --status=queued | archivebox run
|
|
|
|
# Mixed types work too - run handles any JSONL
|
|
cat mixed_records.jsonl | archivebox run
|
|
```
|
|
|
|
### Composed Workflows
|
|
```bash
|
|
# Full pipeline (replaces old `archivebox add`)
|
|
archivebox crawl create https://example.com --status=queued \
|
|
| archivebox snapshot create --status=queued \
|
|
| archivebox archiveresult create --status=queued \
|
|
| archivebox run
|
|
|
|
# Re-run failed extractions
|
|
archivebox archiveresult list --status=failed | archivebox run
|
|
|
|
# Delete all snapshots for a domain
|
|
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes
|
|
```
|
|
|
|
### Keep `archivebox add` as convenience
|
|
```bash
|
|
# This remains the simple user-friendly interface:
|
|
archivebox add https://example.com --depth=1 --tag=news
|
|
|
|
# Internally equivalent to the composed pipeline above
|
|
```
|