Commit Graph

4857 Commits

Author SHA1 Message Date
Claude
09a1ca3134 Fix hook priority conflicts and standardize on_Binary naming
on_Snapshot priority fixes:
- redirects.bg.js stays at 31, staticfile.bg.js → 32
- headers.js stays at 55, readability.py → 56
- mercury.py → 57, htmltotext.py → 58

on_Binary hooks now have numeric priorities:
- 10: npm_install.py
- 11: pip_install.py
- 12: brew_install.py
- 13: apt_install.py
- 14: custom_install.py
- 15: env_install.py
2026-01-01 01:31:52 +00:00
Claude
4d33084496 Remove redundant chrome_validate hook, rename wget_validate to wget_install
- Delete chrome/on_Crawl__10_chrome_validate.py (duplicates chrome_install)
- Rename wget/on_Crawl__11_wget_validate.py → on_Crawl__06_wget_install.py

All hooks now follow consistent naming: install, launch, or config
2025-12-31 23:41:40 +00:00
Nick Sweeting
8f518500a4 comments 2025-12-31 15:36:43 -08:00
Nick Sweeting
a04e4a7345 cleanup migrations, json, jsonl 2025-12-31 15:36:43 -08:00
Nick Sweeting
0930911a15 Clean up on_Crawl hooks and remove dead code (#1751)
Deleted dead/duplicate hooks:
- wget/on_Crawl__10_install_wget.py (duplicate of
__10_wget_validate_config.py)
- chrome/on_Crawl__00_chrome_install.py (simpler version, kept full one)
- chrome/on_Crawl__20_chrome_launch.bg.js (legacy, kept __30 version)
- singlefile/on_Crawl__20_install_singlefile_extension.js
(disabled/dead)
- istilldontcareaboutcookies/on_Crawl__20_install_*.js (legacy)
- ublock/on_Crawl__03_ublock.js (legacy, kept __20 version)
- Entire captcha2/ plugin (legacy version of twocaptcha/)

Renamed hooks to follow consistent pattern:
on_Crawl__XX_<plugin>_<action>.<ext>
Priority bands:
00-09: Binary/extension installation 10-19: Config validation 20-29:
Browser launch and post-launch config

Final hooks:
00 ripgrep_install.py, 01 chrome_install.py 02
istilldontcareaboutcookies_install.js 03 ublock_install.js, 04
singlefile_install.js 05 twocaptcha_install.js 10 chrome_validate.py, 11
wget_validate.py 20 chrome_launch.bg.js, 25 twocaptcha_config.js

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Cleaned up Crawl-level hooks by removing legacy/duplicate code and
standardizing hook names and priorities. Chrome launch is now a single,
updated hook with better extension detection and cleaner outputs.

- **Refactors**
- Removed dead hooks (legacy chrome install/launch, singlefile
extension, old ublock/cookies scripts, duplicate wget validate) and the
legacy captcha2 plugin in favor of twocaptcha.
- Renamed hooks to on_Crawl__XX_<plugin>_<action> with priority bands:
00-09 install, 10-19 validate, 20-29 launch/config.
- Consolidated Chrome launch into on_Crawl__20_chrome_launch.bg.js;
writes outputs to the current dir, resolves real extension IDs via
chrome://extensions, and records extensions.json after verification.

- **Migration**
- If you used captcha2, switch to the twocaptcha hooks
(on_Crawl__05_twocaptcha_install.js and
on_Crawl__25_twocaptcha_config.js).
  - Update any docs/scripts that reference old hook filenames.

<sup>Written for commit 4c77949197.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-31 14:52:15 -08:00
Claude
4c77949197 Clean up on_Crawl hooks: remove duplicates and standardize naming
Deleted dead/duplicate hooks:
- wget/on_Crawl__10_install_wget.py (duplicate of __10_wget_validate_config.py)
- chrome/on_Crawl__00_chrome_install.py (simpler version, kept full one)
- chrome/on_Crawl__20_chrome_launch.bg.js (legacy, kept __30 version)
- singlefile/on_Crawl__20_install_singlefile_extension.js (disabled/dead)
- istilldontcareaboutcookies/on_Crawl__20_install_*.js (legacy)
- ublock/on_Crawl__03_ublock.js (legacy, kept __20 version)
- Entire captcha2/ plugin (legacy version of twocaptcha/)

Renamed hooks to follow consistent pattern: on_Crawl__XX_<plugin>_<action>.<ext>
Priority bands:
  00-09: Binary/extension installation
  10-19: Config validation
  20-29: Browser launch and post-launch config

Final hooks:
  00 ripgrep_install.py, 01 chrome_install.py
  02 istilldontcareaboutcookies_install.js
  03 ublock_install.js, 04 singlefile_install.js
  05 twocaptcha_install.js
  10 chrome_validate.py, 11 wget_validate.py
  20 chrome_launch.bg.js, 25 twocaptcha_config.js
2025-12-31 22:47:36 +00:00
Nick Sweeting
f12c3b4b55 less healthstats 2025-12-31 12:34:32 -08:00
Nick Sweeting
bd757188e4 keep stripping healthstats from iface and other things 2025-12-31 12:34:31 -08:00
Nick Sweeting
73fde81fce more migrations tweaks 2025-12-31 12:34:31 -08:00
Nick Sweeting
469932b469 more 2025-12-31 12:34:31 -08:00
Nick Sweeting
72f6a91b31 more progress bar and migrations fixes 2025-12-31 12:34:31 -08:00
Nick Sweeting
d5c0c64dcd fix progress bars 2025-12-31 12:34:29 -08:00
Nick Sweeting
cb97f6651b Add DNS traffic recorder plugin (#1748) 2025-12-31 11:02:43 -08:00
Nick Sweeting
60a4581ed8 Add tests for accessibility, parse_dom_outlinks, and consolelog plugins (#1749) 2025-12-31 11:01:56 -08:00
claude[bot]
1f84d1b467 Fix test assertions to fail when data is missing
- Add assertIsNotNone for accessibility_data to ensure test fails if no data generated
- Capture and report JSON decode errors in parse_dom_outlinks test
- Add assertIsNotNone for outlinks_data with error details
- Removes conditional checks that allowed tests to pass without verifying functionality

Addresses review comments from cubic-dev-ai

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 19:00:30 +00:00
claude[bot]
483929391d Fix test assertions to fail properly and add NXDOMAIN deduplication
- test_seo.py: Add assertIsNotNone before conditional to catch SEO extraction failures
- test_ssl.py: Add assertIsNotNone to ensure SSL data is captured from HTTPS URLs
- test_pip_provider.py: Assert jsonl_found variable to verify binary discovery
- dns plugin: Deduplicate NXDOMAIN records using seenResolutions map

Tests now fail when functionality doesn't work (no cheating).

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 19:00:28 +00:00
Nick Sweeting
edc83bfac6 Add persona CLI command with browser cookie import (#1747) 2025-12-31 10:56:40 -08:00
Claude
2a68248602 Update all Chrome plugins to use shared chrome_utils.js
Refactored 8 plugins to import shared utilities instead of
duplicating code locally:
- consolelog, redirects: Complete rewrite using shared utils
- modalcloser, staticfile: Use readCdpUrl, readTargetId, parseArgs
- dom, screenshot, pdf: Remove local parseArgs/getCdpUrl
- headers: Import getEnv, getEnvBool, getEnvInt, parseArgs

Removes ~380 lines of duplicated boilerplate code.
2025-12-31 18:35:25 +00:00
claude[bot]
3659adeb7e Fix path traversal vulnerabilities in persona management
Add input validation and path safety checks to prevent path traversal
attacks in persona name handling:

- Add validate_persona_name() to block dangerous characters (/, \, .., etc)
- Add ensure_path_within_personas_dir() to verify resolved paths stay within PERSONAS_DIR
- Apply validation at persona creation, renaming, and deletion operations

Fixes security issues identified by cubic-dev-ai in PR review.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 18:30:26 +00:00
Claude
263335dc6d Add tests for merkletree and custom binary provider plugins
- merkletree: Tests merkle tree generation with real files,
  empty directory handling, and disabled mode
- custom: Tests custom bash command execution and binary discovery
2025-12-31 18:30:04 +00:00
Claude
9703a8e88c Add tests for responses, staticfile, and env provider plugins
- responses: Tests network response capture during page load
- staticfile: Tests static file detection and download skip for HTML
- env: Tests PATH-based binary discovery (python3, bash)
2025-12-31 18:28:01 +00:00
Claude
cfa5edb160 Add tests for accessibility, parse_dom_outlinks, and consolelog plugins
Real integration tests using Chrome sessions with example.com:
- accessibility: Tests page outline and accessibility tree extraction
- parse_dom_outlinks: Tests link extraction and categorization
- consolelog: Tests console output capture
2025-12-31 18:25:48 +00:00
Claude
47d9874c1f Merge remote-tracking branch 'origin/dev' into claude/dns-traffic-recorder-plugin-dNbxC 2025-12-31 18:24:56 +00:00
Nick Sweeting
cd0394c858 Add comprehensive tests for machine/process models, orchestrator, and search backends (#1745) 2025-12-31 10:21:12 -08:00
Nick Sweeting
20690fabbf Fix CLI tests to use subprocess and remove mocks (#1746) 2025-12-31 10:20:50 -08:00
claude[bot]
08383c4d83 Fix tautological assertion in SEO test
The assertion was checking 'has_seo_data or seo_data' inside an 'if seo_data:' block,
making it always truthy. Changed to just check 'has_seo_data' to properly verify
that expected SEO keys were extracted.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 18:19:47 +00:00
Claude
5d8c93eaf4 Consolidate CDP connection logic into chrome_utils.js
Add shared snapshot hook utilities to chrome_utils.js:
- parseArgs(): CLI argument parsing
- waitForChromeSession(): Wait for CDP session files
- readCdpUrl(): Read CDP WebSocket URL
- readTargetId(): Read target page ID
- connectToPage(): High-level browser/page connection
- waitForPageLoaded(): Wait for navigation completion

Refactor ssl, responses, and dns plugins to use shared utilities,
eliminating ~100 lines of duplicated code across plugins.
2025-12-31 12:15:30 +00:00
Claude
73425fa984 Add persona CLI command with browser cookie import
- Add `archivebox persona create/list/update/delete` commands
- Support `--import=chrome|firefox|brave` to copy browser profile
- Extract cookies via CDP to generate cookies.txt for non-browser tools
- Fix JSDoc comment parsing issue in chrome_utils.js
2025-12-31 12:13:07 +00:00
Claude
f2c20f141c Refactor dns plugin to use chrome_utils.js
Import shared utilities (getEnv, getEnvBool, getEnvInt) from
chrome_utils.js instead of duplicating them locally.
Also use DNS_TIMEOUT config for dynamic timeout calculations.
2025-12-31 12:08:28 +00:00
Claude
13148fd6b5 Add DNS traffic recorder plugin
Records hostname → IP resolutions during page load using Chrome CDP.
Uses Network.responseReceived events to capture DNS resolution data
and writes one JSON line per record to dns.jsonl.

Features:
- Captures hostname to IP address mappings (A/AAAA records)
- Records failed DNS lookups (NXDOMAIN)
- Deduplicates resolution records per page load
- Integrates with existing Chrome plugin infrastructure
2025-12-31 12:05:02 +00:00
Claude
8a0acdebcd Add SSL, redirects, SEO plugin tests and fix fake test issues
- Add real integration tests for SSL, redirects, and SEO plugins
  using Chrome session helpers for live URL testing
- Remove fake "format" tests that just created dicts and asserted on them
  (apt, pip, npm provider output format tests)
- Remove npm integration test that created dirs then checked they existed
- Fix SQLite search test to use SQLITEFTS_DB constant instead of hardcoded value
2025-12-31 12:00:00 +00:00
Claude
9bf7a520a0 Update tests for new Process model-based architecture
- Remove pid_utils tests (module deleted in dev)
- Update orchestrator tests to use Process model for tracking
- Add tests for Process.current(), cleanup_stale_running(), terminate()
- Add tests for Process hierarchy (parent/child, root, depth)
- Add tests for Process.get_running(), get_running_count()
- Add tests for ProcessMachine state machine
- Update machine model tests to match current API (from_jsonl vs from_json)
2025-12-31 11:51:42 +00:00
Nick Sweeting
bbbfffd0fa Improve admin snapshot list/grid views with better UX (#1744) 2025-12-31 03:51:11 -08:00
Claude
a063d8cd43 Merge remote-tracking branch 'origin/dev' into claude/analyze-test-coverage-mWgwv 2025-12-31 11:45:22 +00:00
Nick Sweeting
bdb3d946b8 Delete pid_utils.py and migrate to Process model (#1741) 2025-12-31 03:43:18 -08:00
claude[bot]
b2132d1f14 Fix cubic review issues: process_type detection, cmd storage, PID cleanup, and migration
- Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation
- Fix worker process_type detection: explicitly set to WORKER after registration
- Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently
- Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN)
- Fix get_running_workers() to return process_id instead of incorrectly named worker_id
- Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed
- Fix migration to include all indexes in state_operations (parent_id, process_type)
- Fix documentation to use Machine.current() scoping and StatusChoices constants

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:42:07 +00:00
Claude
0cb5f0712d Add comprehensive tests for machine/process models, orchestrator, and search backends
This adds new test coverage for previously untested areas:

Machine module (archivebox/machine/tests/):
- Machine, NetworkInterface, Binary, Process model tests
- BinaryMachine and ProcessMachine state machine tests
- JSONL serialization/deserialization tests
- Manager method tests

Workers module (archivebox/workers/tests/):
- PID file utility tests (write, read, cleanup)
- Orchestrator lifecycle and queue management tests
- Worker spawning logic tests
- Idle detection and exit condition tests

Search backends:
- SQLite FTS5 search tests with real indexed content
- Phrase search, stemming, and unicode support
- Ripgrep search tests with archive directory structure
- Environment variable configuration tests

Binary provider plugins:
- pip provider hook tests
- npm provider hook tests with PATH updates
- apt provider hook tests
2025-12-31 11:33:27 +00:00
claude[bot]
5121b0e5f9 Merge branch 'dev' into claude/refactor-process-management-WcQyZ
Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:28:47 +00:00
Claude
2e6dcb2b87 Improve admin snapshot list/grid views with better UX
- Add prominent view mode switcher with List/Grid toggle buttons
- Improve filter sidebar CSS with modern styling, rounded corners
- Add live progress bar for in-progress snapshots showing hooks status
- Show plugin icons only when output directory has content
- Display archive result output_size sum from new field
- Show hooks succeeded/total count in size column
- Add get_progress_stats() method to Snapshot model
- Add CSS for progress spinner and status badges
- Update grid view template with progress indicator for archiving cards
- Add tests for admin views, search, and progress stats
2025-12-31 11:28:03 +00:00
claude[bot]
ee201a0f83 Fix code review issues in process management refactor
- Add pwd validation in Process.launch() to prevent crashes
- Fix psutil returncode handling (use wait() return value, not returncode attr)
- Add None check for proc.pid in cleanup_stale_running()
- Add stale process cleanup in Orchestrator.is_running()
- Ensure orchestrator process_type is correctly set to ORCHESTRATOR
- Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown)
- Throttle cleanup_stale_running() to once per 30 seconds for performance
- Fix worker process_type to use TypeChoices.WORKER consistently
- Fix get_running_workers() API to return list of dicts (not Process objects)
- Only delete PID files after successful kill or confirmed stale
- Fix migration index names to match between SQL and Django state
- Remove db_index=True from process_type (index created manually)
- Update documentation to reflect actual implementation
- Add explanatory comments to empty except blocks
- Fix exit codes to use Unix convention (128 + signal number)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:14:47 +00:00
Claude
b87bbbbecb Fix CLI tests to use subprocess and remove mocks
- Fix conftest.py: use subprocess for init, remove unused cli_env fixture
- Update all test files to use data_dir parameter instead of env
- Remove mock-based TestJSONLOutput class from tests_piping.py
- Remove unused imports (MagicMock, patch)
- Fix file permissions for cli_utils.py

All tests now use real subprocess calls per CLAUDE.md guidelines:
- NO MOCKS - tests exercise real code paths
- NO SKIPS - every test runs
2025-12-31 10:53:45 +00:00
Nick Sweeting
7dd2d65770 Add pluginmap management command (#1742) 2025-12-31 02:29:24 -08:00
Nick Sweeting
575a595f26 Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6) (#1743) 2025-12-31 02:27:17 -08:00
Claude
bb52b5902a Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6)
Add comprehensive unit tests for the CLI piping architecture:
- test_cli_crawl.py: crawl create/list/update/delete tests
- test_cli_snapshot.py: snapshot create/list/update/delete tests
- test_cli_archiveresult.py: archiveresult create/list/update/delete tests
- test_cli_run.py: run command create-or-update and pass-through tests

Extend tests_piping.py with:
- TestPassThroughBehavior: tests for pass-through behavior in all commands
- TestPipelineAccumulation: tests for accumulating records through pipeline

All tests use pytest fixtures from conftest.py with isolated DATA_DIR.
2025-12-31 10:21:05 +00:00
Claude
672ccf918d Add pluginmap management command
Adds a new CLI command `archivebox pluginmap` that displays:
- ASCII art diagrams of all core state machines (Crawl, Snapshot,
  ArchiveResult, Binary)
- Lists all auto-detected on_Modelname_xyz hooks grouped by model/event
- Shows hook execution order (step 0-9), plugin name, and background status

Usage:
  archivebox pluginmap              # Show all diagrams and hooks
  archivebox pluginmap -m Snapshot  # Filter to specific model
  archivebox pluginmap -a           # Include disabled plugins
  archivebox pluginmap -q           # Output JSON only
2025-12-31 10:19:58 +00:00
Claude
b822352fc3 Delete pid_utils.py and migrate to Process model
DELETED:
- workers/pid_utils.py (-192 lines) - replaced by Process model methods

SIMPLIFIED:
- crawls/models.py Crawl.cleanup() (80 lines -> 10 lines)
- hooks.py: deleted process_is_alive() and kill_process() (-45 lines)

UPDATED to use Process model:
- core/models.py: Snapshot.cleanup() and has_running_background_hooks()
- machine/models.py: Binary.cleanup()
- workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start
- workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running

All subprocess management now uses:
- Process.current() for registering current process
- Process.get_running() / get_running_count() for querying
- Process.cleanup_stale_running() for cleanup
- safe_kill_process() for validated PID killing

Total line reduction: ~250 lines
2025-12-31 10:15:22 +00:00
Claude
2d3a2fec57 Add terminate, kill_tree, and query methods to Process model
This consolidates scattered subprocess management logic into the Process model:

- terminate(): Graceful SIGTERM → wait → SIGKILL (replaces stop_worker, etc.)
- kill_tree(): Kill process and all OS children (replaces os.killpg logic)
- kill_children_db(): Kill DB-tracked child processes
- get_running(): Query running processes by type (replaces get_all_worker_pids)
- get_running_count(): Count running processes (replaces get_running_worker_count)
- stop_all(): Stop all processes of a type
- get_next_worker_id(): Get next worker ID for spawning

Added Phase 8 to TODO documenting ~390 lines that can be deleted after
consolidation, including workers/pid_utils.py which becomes obsolete.

Also includes migration 0002 for parent FK and process_type fields.
2025-12-31 10:08:45 +00:00
Claude
f3e11b61fd Implement JSONL CLI pipeline architecture (Phases 1-4, 6)
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)

Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating

Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
  - Records WITHOUT id: Create via Model.from_json()
  - Records WITH id: Lookup existing, re-queue
  - Outputs JSONL of all processed records for chaining

Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions

Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command

This enables Unix-style piping:
  archivebox crawl create URL | archivebox run
  archivebox archiveresult list --status=failed | archivebox run
  curl API | jq transform | archivebox crawl create | archivebox run
2025-12-31 10:07:14 +00:00
Nick Sweeting
28a4f99f55 Add real-world use cases to CLI pipeline plan (#1740)
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-12-31 01:57:08 -08:00
Nick Sweeting
95d61b001e fix migrations 2025-12-31 01:40:59 -08:00