254 Commits

Author SHA1 Message Date
Nick Sweeting
f3622d8cd3 update working changes 2026-03-25 05:36:07 -07:00
Nick Sweeting
50286d3c38 Reuse cached binaries in archivebox runtime 2026-03-24 11:03:43 -07:00
Nick Sweeting
39450111dd Update CI uv handling and runner changes 2026-03-23 13:27:23 -07:00
Nick Sweeting
25f935b9d1 split CrawlSetup into Install phase with new Binary + BinaryRequest events 2026-03-23 13:15:41 -07:00
Nick Sweeting
b749b26c5d wip 2026-03-23 03:58:32 -07:00
Nick Sweeting
f400a2cd67 WIP: checkpoint working tree before rebasing onto dev 2026-03-22 20:25:18 -07:00
Nick Sweeting
c87079aa0a Refactor ArchiveBox onto abx-dl bus runner 2026-03-21 11:47:57 -07:00
Nick Sweeting
57e11879ec cleanup archivebox tests 2026-03-15 22:09:56 -07:00
Nick Sweeting
9de084da65 bump package versions 2026-03-15 20:47:28 -07:00
Nick Sweeting
bc21d4bfdb type and test fixes 2026-03-15 20:12:27 -07:00
Nick Sweeting
4756697a17 Use ruff pyright and ty for linting 2026-03-15 19:43:59 -07:00
Nick Sweeting
49436af869 Tighten CLI and admin typing 2026-03-15 19:33:15 -07:00
Nick Sweeting
5381f7584c Tighten API typing and add return values 2026-03-15 19:24:54 -07:00
Nick Sweeting
f932054915 add stricter locking around state machine models 2026-03-15 19:21:41 -07:00
Nick Sweeting
311e4340ec Fix add CLI input handling and lint regressions 2026-03-15 19:04:13 -07:00
Nick Sweeting
934e02695b fix lint 2026-03-15 18:45:29 -07:00
Nick Sweeting
70c9358cf9 Improve scheduling, runtime paths, and API behavior 2026-03-15 18:31:56 -07:00
Nick Sweeting
7d42c6c8b5 bump versions and fix docs 2026-03-15 17:43:07 -07:00
Nick Sweeting
1f792d7199 Restore CLI compat and plugin dependency handling 2026-03-15 06:06:18 -07:00
Nick Sweeting
6b482c62df Restore top-level list command compatibility 2026-03-15 05:04:31 -07:00
Nick Sweeting
c4d30a853f Restore index-only snapshot output links 2026-03-15 04:58:46 -07:00
Nick Sweeting
cc3e72b92f Preserve tags for index-only adds 2026-03-15 04:54:55 -07:00
Nick Sweeting
58f801c220 Fix update orphan import and host-aware tests 2026-03-15 04:51:06 -07:00
Nick Sweeting
ecb1764590 switch to external plugins 2026-03-15 03:46:23 -07:00
Nick Sweeting
ec4b27056e wip 2026-01-21 03:19:56 -08:00
Nick Sweeting
86e7973334 cleanup tui, startup, card templates, and more 2026-01-19 14:33:20 -08:00
Nick Sweeting
c7b2217cd6 tons of fixes with codex 2026-01-19 01:00:53 -08:00
Nick Sweeting
b80e80439d more binary fixes 2026-01-05 02:18:38 -08:00
Nick Sweeting
7ceaeae2d9 rename archive_org to archivedotorg, add BinaryWorker, fix config pass-through 2026-01-04 22:38:15 -08:00
Nick Sweeting
839ae744cf simplify entrypoints for orchestrator and workers 2026-01-04 13:17:07 -08:00
Nick Sweeting
dd77511026 unified Process source of truth and better screenshot tests 2026-01-02 04:20:34 -08:00
Nick Sweeting
65ee09ceab move tests into subfolder, add missing install hooks 2026-01-02 00:22:07 -08:00
Nick Sweeting
c2afb40350 fix lib bin dir and archivebox add hanging 2026-01-01 16:58:47 -08:00
Nick Sweeting
a04e4a7345 cleanup migrations, json, jsonl 2025-12-31 15:36:43 -08:00
Nick Sweeting
edc83bfac6 Add persona CLI command with browser cookie import (#1747) 2025-12-31 10:56:40 -08:00
claude[bot]
3659adeb7e Fix path traversal vulnerabilities in persona management
Add input validation and path safety checks to prevent path traversal
attacks in persona name handling:

- Add validate_persona_name() to block dangerous characters (/, \, .., etc)
- Add ensure_path_within_personas_dir() to verify resolved paths stay within PERSONAS_DIR
- Apply validation at persona creation, renaming, and deletion operations

Fixes security issues identified by cubic-dev-ai in PR review.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 18:30:26 +00:00
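A minimal sketch of the two checks named in commit 3659adeb7e, for illustration only. The function names come from the commit message; their signatures, the allowed-character pattern, and the `PERSONAS_DIR` location are assumptions, not the actual ArchiveBox code.

```python
import re
from pathlib import Path

# Hypothetical location; the real PERSONAS_DIR comes from ArchiveBox config.
PERSONAS_DIR = Path("/tmp/archivebox-data/personas")

# Assumed allow-list: letters, digits, space, underscore, hyphen.
_SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9 _-]*$")


def validate_persona_name(name: str) -> str:
    """Reject names that could escape the personas directory."""
    if not name or not name.strip():
        raise ValueError("persona name must not be empty")
    # Explicitly block the characters called out in the commit message.
    if "/" in name or "\\" in name or ".." in name or "\x00" in name:
        raise ValueError(f"persona name contains a dangerous sequence: {name!r}")
    if not _SAFE_NAME.match(name):
        raise ValueError(f"persona name contains unsupported characters: {name!r}")
    return name


def ensure_path_within_personas_dir(path: Path, personas_dir: Path = PERSONAS_DIR) -> Path:
    """Resolve the path and verify it stays strictly inside personas_dir."""
    base = personas_dir.resolve()
    resolved = path.resolve()  # collapses ".." segments and follows symlinks
    # Path.is_relative_to() requires Python 3.9+.
    if resolved == base or not resolved.is_relative_to(base):
        raise ValueError(f"path escapes personas dir: {resolved}")
    return resolved
```

Checking the resolved path in addition to the name is a defense-in-depth choice: the name check catches obvious input, while the path check also covers symlinks and any code path that builds the path some other way.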
Nick Sweeting
20690fabbf Fix CLI tests to use subprocess and remove mocks (#1746) 2025-12-31 10:20:50 -08:00
Claude
73425fa984 Add persona CLI command with browser cookie import
- Add `archivebox persona create/list/update/delete` commands
- Support `--import=chrome|firefox|brave` to copy browser profile
- Extract cookies via CDP to generate cookies.txt for non-browser tools
- Fix JSDoc comment parsing issue in chrome_utils.js
2025-12-31 12:13:07 +00:00
claude[bot]
5121b0e5f9 Merge branch 'dev' into claude/refactor-process-management-WcQyZ
Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:28:47 +00:00
Claude
b87bbbbecb Fix CLI tests to use subprocess and remove mocks
- Fix conftest.py: use subprocess for init, remove unused cli_env fixture
- Update all test files to use data_dir parameter instead of env
- Remove mock-based TestJSONLOutput class from tests_piping.py
- Remove unused imports (MagicMock, patch)
- Fix file permissions for cli_utils.py

All tests now use real subprocess calls per CLAUDE.md guidelines:
- NO MOCKS - tests exercise real code paths
- NO SKIPS - every test runs
2025-12-31 10:53:45 +00:00
Nick Sweeting
7dd2d65770 Add pluginmap management command (#1742) 2025-12-31 02:29:24 -08:00
Claude
bb52b5902a Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6)
Add comprehensive unit tests for the CLI piping architecture:
- test_cli_crawl.py: crawl create/list/update/delete tests
- test_cli_snapshot.py: snapshot create/list/update/delete tests
- test_cli_archiveresult.py: archiveresult create/list/update/delete tests
- test_cli_run.py: run command create-or-update and pass-through tests

Extend tests_piping.py with:
- TestPassThroughBehavior: tests for pass-through behavior in all commands
- TestPipelineAccumulation: tests for accumulating records through pipeline

All tests use pytest fixtures from conftest.py with isolated DATA_DIR.
2025-12-31 10:21:05 +00:00
Claude
672ccf918d Add pluginmap management command
Adds a new CLI command `archivebox pluginmap` that displays:
- ASCII art diagrams of all core state machines (Crawl, Snapshot,
  ArchiveResult, Binary)
- Lists all auto-detected on_Modelname_xyz hooks grouped by model/event
- Shows hook execution order (step 0-9), plugin name, and background status

Usage:
  archivebox pluginmap              # Show all diagrams and hooks
  archivebox pluginmap -m Snapshot  # Filter to specific model
  archivebox pluginmap -a           # Include disabled plugins
  archivebox pluginmap -q           # Output JSON only
2025-12-31 10:19:58 +00:00
Claude
f3e11b61fd Implement JSONL CLI pipeline architecture (Phases 1-4, 6)
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)

Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating

Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
  - Records WITHOUT id: Create via Model.from_json()
  - Records WITH id: Lookup existing, re-queue
  - Outputs JSONL of all processed records for chaining

Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions

Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command

This enables Unix-style piping:
  archivebox crawl create URL | archivebox run
  archivebox archiveresult list --status=failed | archivebox run
  curl API | jq transform | archivebox crawl create | archivebox run
2025-12-31 10:07:14 +00:00
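The create-or-update behavior that commit f3e11b61fd describes for the `run` command can be sketched roughly as follows. This is a standalone illustration, not the ArchiveBox implementation: the in-memory `store` dict stands in for the database, `process_records` and the `status` field are hypothetical, and passing records with an unknown id through unchanged is an assumption the commit message does not confirm.

```python
import json
import uuid
from typing import Dict, Iterable, Iterator


def process_records(lines: Iterable[str], store: Dict[str, dict]) -> Iterator[dict]:
    """Yield one output record per JSONL input line, creating or re-queueing."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        record_id = record.get("id")
        if record_id is None:
            # Record WITHOUT id: create it (stand-in for Model.from_json()).
            record = {**record, "id": uuid.uuid4().hex, "status": "queued"}
            store[record["id"]] = record
        elif record_id in store:
            # Record WITH id: look up the existing row and re-queue it.
            store[record_id]["status"] = "queued"
            record = store[record_id]
        # Assumption: a record with an unknown id passes through unchanged.
        yield record


if __name__ == "__main__":
    import sys

    # Emit every processed record as JSONL so the output can be piped onward.
    for rec in process_records(sys.stdin, {}):
        print(json.dumps(rec))
```

Emitting every processed record, not only the newly created ones, is what lets the output feed another command in the pipe.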
Nick Sweeting
dd2302ad92 new jsonl cli interface 2025-12-30 16:12:53 -08:00
Nick Sweeting
08366cfa46 document chrome configs 2025-12-30 12:42:50 -08:00
claude[bot]
251fe33e49 fix: rename --plugin to --plugins for consistency
Changed from singular --plugin to plural --plugins in both snapshot and extract
commands to match the pattern in archivebox add command. Updated to accept
comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title).

- Updated CLI option from --plugin to --plugins
- Added parsing for comma-separated plugin names
- Updated function signatures and logic to handle multiple plugins
- Updated help text, docstrings, and examples

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:20:29 +00:00
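As an illustration of the comma-separated parsing that commit 251fe33e49 mentions, a helper along these lines would turn `--plugins=screenshot,singlefile,title` into a list of names. The function name `parse_plugins` and its handling of whitespace, empty entries, and duplicates are assumptions, not the code from the commit.

```python
from typing import List, Optional


def parse_plugins(value: Optional[str]) -> List[str]:
    """Split a comma-separated --plugins value into a clean list of names."""
    if not value:
        return []  # option not passed: no specific plugins selected
    names: List[str] = []
    for part in value.split(","):
        name = part.strip()
        # Drop empty entries (e.g. trailing commas) and repeated names.
        if name and name not in names:
            names.append(name)
    return names
```

For example, `parse_plugins("screenshot, singlefile,,title")` yields `["screenshot", "singlefile", "title"]`.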
claude[bot]
64db6deab3 fix: revert incorrect --extract renaming, restore --plugin parameter
The --plugins parameter was incorrectly renamed to --extract (boolean).
This restores --plugin (singular, matching extract command) with correct
semantics: specify which plugin to run after creating snapshots.

- Changed --extract/--no-extract back to --plugin (string parameter)
- Updated function signature and logic to use plugin parameter
- Added ArchiveResult creation for specific plugin when --plugin is passed
- Updated docstring and examples

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:15:48 +00:00
claude[bot]
762cddc8c5 fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00
Claude
cf387ed59f refactor: batch all URLs into single Crawl, update tests
- archivebox crawl now creates one Crawl with all URLs as newline-separated string
- Updated tests to reflect new pipeline: crawl -> snapshot -> extract
- Added tests for Crawl JSONL parsing and output
- Tests verify Crawl.from_jsonl() handles multiple URLs correctly
2025-12-30 20:06:56 +00:00
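The batching described in commit cf387ed59f, where all URLs go into one Crawl as a newline-separated string, might look roughly like this at the JSONL level. The helper names and the `type` / `urls` field names are hypothetical and used only for illustration; the real record layout is defined by ArchiveBox's `Crawl` model.

```python
import json
from typing import Iterable, List


def urls_to_crawl_line(urls: Iterable[str]) -> str:
    """Batch all URLs into a single Crawl record serialized as one JSONL line."""
    cleaned = [u.strip() for u in urls if u.strip()]
    record = {"type": "Crawl", "urls": "\n".join(cleaned)}
    # json.dumps escapes the embedded newlines, so the record stays on one line.
    return json.dumps(record)


def crawl_line_to_urls(line: str) -> List[str]:
    """Recover the individual URLs from a Crawl JSONL line."""
    record = json.loads(line)
    return [u for u in record["urls"].split("\n") if u]
```

Because JSON escapes the newline characters inside the string, a multi-URL Crawl still occupies exactly one line of JSONL output.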