6 Commits

Author SHA1 Message Date
Nick Sweeting
65ee09ceab move tests into subfolder, add missing install hooks 2026-01-02 00:22:07 -08:00
Claude
bb52b5902a Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6)
Add comprehensive unit tests for the CLI piping architecture:
- test_cli_crawl.py: crawl create/list/update/delete tests
- test_cli_snapshot.py: snapshot create/list/update/delete tests
- test_cli_archiveresult.py: archiveresult create/list/update/delete tests
- test_cli_run.py: run command create-or-update and pass-through tests

Extend tests_piping.py with:
- TestPassThroughBehavior: tests for pass-through behavior in all commands
- TestPipelineAccumulation: tests for accumulating records through pipeline

All tests use pytest fixtures from conftest.py with isolated DATA_DIR.
2025-12-31 10:21:05 +00:00
Claude
f3e11b61fd Implement JSONL CLI pipeline architecture (Phases 1-4, 6)
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)

Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating
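
A minimal sketch of what a shared `apply_filters()` helper could look like, to illustrate the deduplication in Phase 2. The signature, the handling of `None` values, and the use of plain dicts are all assumptions made for illustration. The real helper in `cli_utils.py` most likely operates on Django querysets rather than lists.

```python
from typing import Any, Iterable


def apply_filters(records: Iterable[dict], **filters: Any) -> list[dict]:
    """Hypothetical stand-in: filter dict records with Django-style lookups.

    Supports only `field=value` and `field__icontains=value`; filters whose
    value is None are ignored (as if the CLI flag had not been passed).
    """
    active = {key: value for key, value in filters.items() if value is not None}

    def matches(record: dict) -> bool:
        for key, expected in active.items():
            field, _, lookup = key.partition("__")
            value = record.get(field)
            if lookup == "icontains":
                if value is None or str(expected).lower() not in str(value).lower():
                    return False
            elif lookup in ("", "exact"):
                if value != expected:
                    return False
            else:
                raise ValueError(f"unsupported lookup in this sketch: {lookup}")
        return True

    return [record for record in records if matches(record)]
```

Each CLI command could then pass its parsed options straight through, e.g. `apply_filters(records, status="failed", url__icontains=None)`.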

Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
  - Records WITHOUT id: Create via Model.from_json()
  - Records WITH id: Lookup existing, re-queue
  - Outputs JSONL of all processed records for chaining
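
The pass-through and create-or-update rules above can be sketched roughly as follows. This is an illustration only: the in-memory `DB`, the `save_new()` helper, the `type`/`status` field names, and the choice to emit records with an unknown id unchanged are assumptions, not ArchiveBox's actual models or behavior.

```python
import itertools
import json
from typing import Iterable, Iterator

# Hypothetical in-memory stand-in for the database, keyed by record id.
DB: dict[str, dict] = {}
_ids = itertools.count(1)


def save_new(record: dict) -> dict:
    """Stand-in for Model.from_json(): assign an id and queue the record."""
    saved = {**record, "id": f"id-{next(_ids)}", "status": "queued"}
    DB[saved["id"]] = saved
    return saved


def create(lines: Iterable[str], handles: str) -> Iterator[str]:
    """A `<model> create` command: act on its own record type, pass the rest through."""
    for line in filter(str.strip, lines):  # skip blank lines
        record = json.loads(line)
        if record.get("type") == handles and "id" not in record:
            record = save_new(record)
        yield json.dumps(record)  # every input record is emitted for chaining


def run(lines: Iterable[str]) -> Iterator[str]:
    """The `run` command: create records without an id, re-queue those with one."""
    for line in filter(str.strip, lines):
        record = json.loads(line)
        if "id" not in record:
            record = save_new(record)
        elif record["id"] in DB:
            DB[record["id"]]["status"] = "queued"
            record = DB[record["id"]]
        # Assumption: a record with an unknown id is emitted unchanged.
        yield json.dumps(record)
```

Because both functions consume and emit JSONL line by line, they compose the same way the shell pipeline does: `run(create(lines, handles="Crawl"))`.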

Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions

Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command

This enables Unix-style piping:
  archivebox crawl create URL | archivebox run
  archivebox archiveresult list --status=failed | archivebox run
  curl API | jq transform | archivebox crawl create | archivebox run
2025-12-31 10:07:14 +00:00
Claude
1c85b4daa3 Refine use cases: 8 examples with efficient patterns
- Trimmed from 10 to 8 focused examples
- Emphasize CLI args for DB filtering (efficient), jq for transforms
- Added key examples showing `run` emits JSONL enabling chained processing:
  - #4: Retry failed with different binary/timeout via jq transform
  - #8: Recursive link following (run → jq filter → crawl → run)
- Removed redundant jq domain filtering (use --url__icontains instead)
- Updated summary table with "Retry w/ Changes" and "Chain Processing" patterns
2025-12-31 09:26:23 +00:00
Claude
0f46d8a22e Add real-world use cases to CLI pipeline plan
Added 10 practical examples demonstrating the JSONL piping architecture:
1. Basic archive with auto-cascade
2. Retry failed extractions (by status, plugin, domain)
3. Pinboard bookmark import with jq
4. GitHub repo filtering with jq regex
5. Selective extraction (screenshots only)
6. Bulk tag management
7. Deep documentation crawling
8. RSS feed monitoring
9. Archive audit with jq aggregation
10. Incremental backup with diff

Also added auto-cascade principle: `archivebox run` automatically
creates Snapshots from Crawls and ArchiveResults from Snapshots,
so intermediate commands are only needed for customization.
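
A rough sketch of the auto-cascade principle, assuming dict records with a `type` field. The `urls`, `url`, `snapshot_url`, and `plugin` field names and the `DEFAULT_PLUGINS` list are hypothetical and chosen only to make the example self-contained.

```python
from typing import Iterator

# Hypothetical default plugin list; the real set of extractors is not shown here.
DEFAULT_PLUGINS = ["screenshot", "title"]


def cascade(record: dict) -> Iterator[dict]:
    """Yield the record itself, then every record derived from it."""
    yield record
    if record.get("type") == "Crawl":
        # Each URL in a Crawl becomes a Snapshot, which cascades further.
        for url in record.get("urls", []):
            yield from cascade({"type": "Snapshot", "url": url})
    elif record.get("type") == "Snapshot":
        # Each Snapshot gets one ArchiveResult per plugin.
        for plugin in DEFAULT_PLUGINS:
            yield {"type": "ArchiveResult", "snapshot_url": record["url"], "plugin": plugin}
```

With this shape, piping a single Crawl into `run` is enough for the basic case; an intermediate `snapshot create` or `archiveresult create` step would only be needed to customize the derived records.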
2025-12-31 09:20:25 +00:00
Claude
df2a0dcd44 Add revised CLI pipeline architecture plan
Comprehensive plan for implementing JSONL-based CLI piping:
- Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix)
- Phase 2: Extract shared apply_filters() to cli_utils.py
- Phase 3: Implement pass-through behavior for all create commands
- Phases 4-6: Test infrastructure with pytest-django, unit/integration tests

Key changes from original plan:
- ArchiveResult.from_json() identified as missing prerequisite
- Pass-through documented as new feature to implement
- archivebox run updated to create-or-update pattern
- conftest.py redesigned to use pytest-django with isolated tmp_path
- Standardized on tags_str field name across all models
- Reordered phases: implement before test
2025-12-31 01:46:07 +00:00