ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 01:15:57 +10:00

Author	SHA1	Message	Date
Claude	f3e11b61fd	Implement JSONL CLI pipeline architecture (Phases 1-4, 6) Phase 1: Model Prerequisites - Add ArchiveResult.from_json() and from_jsonl() methods - Fix Snapshot.to_json() to use tags_str (consistent with Crawl) Phase 2: Shared Utilities - Create archivebox/cli/cli_utils.py with shared apply_filters() - Update 7 CLI files to import from cli_utils.py instead of duplicating Phase 3: Pass-Through Behavior - Add pass-through to crawl create (non-Crawl records pass unchanged) - Add pass-through to snapshot create (Crawl records + others pass through) - Add pass-through to archiveresult create (Snapshot records + others) - Add create-or-update behavior to run command: - Records WITHOUT id: Create via Model.from_json() - Records WITH id: Lookup existing, re-queue - Outputs JSONL of all processed records for chaining Phase 4: Test Infrastructure - Create archivebox/tests/conftest.py with pytest-django fixtures - Include CLI helpers, output assertions, database assertions Phase 6: Config Update - Update supervisord_util.py: orchestrator -> run command This enables Unix-style piping: archivebox crawl create URL | archivebox run; archivebox archiveresult list --status=failed | archivebox run; curl API | jq transform | archivebox crawl create | archivebox run	2025-12-31 10:07:14 +00:00
Nick Sweeting	dd2302ad92	new jsonl cli interface	2025-12-30 16:12:53 -08:00
claude[bot]	762cddc8c5	fix: address PR review comments from cubic-dev-ai - Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency - Fix fallback logic in legacy.py to try JSON when JSONL parsing fails - Replace bare except clauses with specific exception types - Fix stdin double-consumption in archivebox_crawl.py - Merge CLI --tag option with crawl tags in archivebox_snapshot.py - Remove tautological mock tests (covered by integration tests) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:09:51 +00:00
Claude	cf387ed59f	refactor: batch all URLs into single Crawl, update tests - archivebox crawl now creates one Crawl with all URLs as newline-separated string - Updated tests to reflect new pipeline: crawl -> snapshot -> extract - Added tests for Crawl JSONL parsing and output - Tests verify Crawl.from_jsonl() handles multiple URLs correctly	2025-12-30 20:06:56 +00:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Nick Sweeting	f4e7820533	use full dotted paths for all archivebox imports, add migrations and more fixes	2025-12-29 00:47:08 -08:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Nick Sweeting	bb53228ebf	remove Seed model in favor of Crawl as template	2025-12-25 01:52:41 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00

10 Commits
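
The pass-through and create-or-update behavior described in commit f3e11b61fd can be illustrated with a small, self-contained sketch. This is not ArchiveBox's code: the function names (read_jsonl, process_records), the "type" and "status" field names, the in-memory dict standing in for the database, and the choice to skip records whose id is unknown are all assumptions made for illustration.

```python
import io
import json
import uuid
from typing import Dict, Iterable, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-blank line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def process_records(
    records: Iterable[dict],
    handled_type: str,
    store: Dict[str, dict],
) -> Iterator[dict]:
    """Create-or-update records of one type; pass everything else through.

    `store` is a hypothetical stand-in for a database table keyed by id.
    The "type" and "status" keys are assumed field names, not confirmed ones.
    """
    for record in records:
        # Pass-through: records of other types are emitted unchanged so
        # that a later stage in the pipe can handle them.
        if record.get("type") != handled_type:
            yield record
            continue

        record_id = record.get("id")
        if record_id is None:
            # No id: create a new record and queue it.
            created = {**record, "id": uuid.uuid4().hex, "status": "queued"}
            store[created["id"]] = created
            yield created
        elif record_id in store:
            # Known id: look up the existing record and re-queue it.
            existing = store[record_id]
            existing["status"] = "queued"
            yield existing
        # Unknown id: skipped here (an assumption; the commit message
        # does not say what happens in this case).


def write_jsonl(records: Iterable[dict], stream: TextIO) -> None:
    """Write one JSON object per line so the output can be piped onward."""
    for record in records:
        stream.write(json.dumps(record) + "\n")
```

Under these assumptions, a stage would be wired up as write_jsonl(process_records(read_jsonl(sys.stdin), "Crawl", store), sys.stdout), so every processed or passed-through record reaches the next command in the pipe.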