ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 09:25:42 +10:00

Files

Claude f3e11b61fd Implement JSONL CLI pipeline architecture (Phases 1-4, 6)

Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)

Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating

Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
  - Records WITHOUT id: Create via Model.from_json()
  - Records WITH id: Look up the existing record and re-queue it
  - Outputs JSONL of all processed records for chaining
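
The Phase 3 behavior above can be sketched roughly as follows. This is a minimal illustration, not ArchiveBox's actual code: the `type` and `status` fields, the `process_stream` helper, and the in-memory `store` dict (standing in for the database and `Model.from_json()`) are all assumptions made for the example.

```python
import json
import uuid
from typing import Dict, Iterable, Iterator


def process_stream(lines: Iterable[str], handled_type: str,
                   store: Dict[str, dict]) -> Iterator[str]:
    """Hypothetical pass-through + create-or-update loop over JSONL input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)

        # Pass-through: records of other types are re-emitted unchanged
        # so that later stages in the pipe can still see them.
        if record.get("type") != handled_type:
            yield json.dumps(record)
            continue

        record_id = record.get("id")
        if record_id is None:
            # No id: create a new record (stand-in for Model.from_json()).
            record["id"] = str(uuid.uuid4())
            record["status"] = "queued"
            store[record["id"]] = record
        else:
            # Has id: look up the existing record and re-queue it.
            existing = store.get(record_id)
            if existing is None:
                continue  # assumption: unknown ids are skipped in this sketch
            existing["status"] = "queued"
            record = existing

        # Emit every processed record as JSONL so commands can be chained.
        yield json.dumps(record)
```

Because every record is written back out, handled or not, each command can sit anywhere in a pipe without dropping data meant for the next stage.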

Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions

Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command

This enables Unix-style piping:
  archivebox crawl create URL | archivebox run
  archivebox archiveresult list --status=failed | archivebox run
  curl API | jq transform | archivebox crawl create | archivebox run
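
As an illustration of the last pipeline, the `jq transform` stage could equally be a small script that turns an API response into one JSON object per line. The sketch below is hypothetical: the `link` key in the API response and the `{"url": ...}` output shape are assumptions, not taken from ArchiveBox's documentation.

```python
import json
import sys
from typing import Iterator


def to_jsonl(payload: str, url_key: str = "link") -> Iterator[str]:
    """Convert a JSON array of objects into JSONL records with a `url` field.

    `url_key` names the field holding the URL in the (assumed) API response.
    """
    if not payload.strip():
        return  # empty input produces no output
    for item in json.loads(payload):
        url = item.get(url_key)
        if url:  # skip entries that carry no URL
            yield json.dumps({"url": url})


if __name__ == "__main__":
    # Read the whole response from stdin, write one record per line to stdout.
    for line in to_jsonl(sys.stdin.read()):
        print(line)
```

Saved under a name such as `transform.py` (hypothetical), it would take the place of `jq transform` in the pipe: it reads stdin and writes JSONL to stdout.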

2025-12-31 10:07:14 +00:00

management/commands

…

migrations

…

templatetags

…

__init__.py

…

actors.py

…

admin_archiveresults.py

…

admin_site.py

…

admin_snapshots.py

…

admin_tags.py

…

admin_users.py

…

admin.py

…

apps.py

…

asgi.py

…

forms.py

new jsonl cli interface

2025-12-30 16:12:53 -08:00

middleware.py

…

models.py

Implement JSONL CLI pipeline architecture (Phases 1-4, 6)

2025-12-31 10:07:14 +00:00

settings_logging.py

…

settings.py

…

tests.py

…

urls.py

…

views.py

…

widgets.py

…

wsgi.py

…