10 Commits

Author SHA1 Message Date
Claude
f3e11b61fd Implement JSONL CLI pipeline architecture (Phases 1-4, 6)
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)

Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating

Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
  - Records WITHOUT id: Create via Model.from_json()
  - Records WITH id: Look up existing record, re-queue it
  - Outputs JSONL of all processed records for chaining

Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions

Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command

This enables Unix-style piping:
  archivebox crawl create URL | archivebox run
  archivebox archiveresult list --status=failed | archivebox run
  curl API | jq transform | archivebox crawl create | archivebox run
2025-12-31 10:07:14 +00:00
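
The create-or-update behavior described for the run command in f3e11b61fd can be sketched roughly as below. This is a minimal illustration, not the actual ArchiveBox code: the in-memory `DB` dict, `process_record()`, `run()`, and the `status` field are hypothetical stand-ins for the Django models and for `Model.from_json()` named in the commit message.

```python
import io
import json
import uuid

# Hypothetical in-memory stand-in for the database; the real command uses Django models.
DB = {}


def process_record(record):
    """Create the record if it has no id, otherwise look it up and re-queue it."""
    if not record.get("id"):
        # No id: treat as a new record (the commit does this via Model.from_json()).
        created = {**record, "id": uuid.uuid4().hex, "status": "queued"}
        DB[created["id"]] = created
        return created
    existing = DB.get(record["id"])
    if existing is None:
        # Unknown id: this sketch simply skips it (an assumption, not documented behavior).
        return None
    existing["status"] = "queued"
    return existing


def run(stdin, stdout):
    """Read JSONL from stdin, process each record, and emit JSONL for chaining."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        result = process_record(json.loads(line))
        if result is not None:
            stdout.write(json.dumps(result) + "\n")
```

Writing every processed record back out as one JSON object per line is what lets the output of one command feed the next one in a pipe.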
Nick Sweeting
dd2302ad92 new jsonl cli interface 2025-12-30 16:12:53 -08:00
claude[bot]
762cddc8c5 fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00
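
The JSONL-to-JSON fallback mentioned in 762cddc8c5 might look something like the following sketch. It assumes the index is available as a string; `parse_index()` is a hypothetical name, not the function in legacy.py, and it catches a specific exception type rather than using a bare except.

```python
import json


def parse_index(text):
    """Parse text as JSONL first; fall back to a single JSON document on failure."""
    try:
        # JSONL: one complete JSON value per non-blank line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    except json.JSONDecodeError:
        pass  # not valid JSONL, try the whole text as one JSON document
    data = json.loads(text)  # raises json.JSONDecodeError if neither format matches
    return data if isinstance(data, list) else [data]
```

A pretty-printed JSON file fails line-by-line parsing on its first line, so it drops through to the second attempt.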
Claude
cf387ed59f refactor: batch all URLs into single Crawl, update tests
- archivebox crawl now creates one Crawl with all URLs as newline-separated string
- Updated tests to reflect new pipeline: crawl -> snapshot -> extract
- Added tests for Crawl JSONL parsing and output
- Tests verify Crawl.from_jsonl() handles multiple URLs correctly
2025-12-30 20:06:56 +00:00
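
As a rough illustration of the batching described in cf387ed59f, the sketch below packs several URLs into one record with a newline-separated `urls` string and unpacks them again. The helper names and the `type` field are assumptions made for the example; the real logic lives in `Crawl.from_jsonl()` and archivebox crawl.

```python
import json


def crawl_record_from_urls(urls):
    """Build one Crawl-style record holding all URLs as a newline-separated string."""
    cleaned = [u.strip() for u in urls if u.strip()]
    return {"type": "Crawl", "urls": "\n".join(cleaned)}


def urls_from_crawl_record(record):
    """Split the newline-separated string back into a list of URLs."""
    return [u for u in record.get("urls", "").split("\n") if u]


record = crawl_record_from_urls(["https://example.com", " https://example.org ", ""])
line = json.dumps(record)  # newlines inside the string are escaped, so this stays one JSONL line
```

Because `json.dumps` escapes the embedded newlines, the batched record still occupies a single line of JSONL.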
Claude
69965a2782 fix: correct CLI pipeline data flow for crawl -> snapshot -> extract
- archivebox crawl: creates Crawl records, outputs Crawl JSONL
- archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL
- archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL

Changes:
- Add Crawl.from_jsonl() method for creating Crawl from JSONL records
- Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them
- Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs
- Update jsonl.py docstring to document the pipeline
2025-12-30 19:42:41 +00:00
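
The crawl -> snapshot -> extract data flow from 69965a2782 can be pictured as three chained generators. This is only a sketch: the record fields (`type`, `urls`, `url`, `extractor`, `status`), the stage function names, and the way URLs are grouped into Crawl records are assumptions, and the extract stage here only emits placeholder results instead of running real extractors.

```python
def crawl_stage(urls):
    """Emit a Crawl record for the given URLs without starting any work."""
    yield {"type": "Crawl", "urls": "\n".join(urls)}


def snapshot_stage(records):
    """Accept Crawl records or plain URL strings and emit one Snapshot record per URL."""
    for record in records:
        if isinstance(record, str):
            yield {"type": "Snapshot", "url": record}
        elif record.get("type") == "Crawl":
            for url in record["urls"].split("\n"):
                if url:
                    yield {"type": "Snapshot", "url": url}


def extract_stage(records, extractors=("title",)):
    """Accept Snapshot records and emit one ArchiveResult record per extractor."""
    for record in records:
        if record.get("type") == "Snapshot":
            for name in extractors:
                # Placeholder result; a real extractor would run here.
                yield {
                    "type": "ArchiveResult",
                    "snapshot_url": record["url"],
                    "extractor": name,
                    "status": "queued",
                }


results = list(extract_stage(snapshot_stage(crawl_stage(["https://a.example", "https://b.example"]))))
```

Each stage consumes the record type produced by the one before it, which mirrors the way the three commands are meant to be piped together.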
Nick Sweeting
f4e7820533 use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Nick Sweeting
bb53228ebf remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00