ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 09:25:42 +10:00

Author	SHA1	Message	Date
Nick Sweeting	469932b469	more	2025-12-31 12:34:31 -08:00
Nick Sweeting	72f6a91b31	more progress bar and migrations fixes	2025-12-31 12:34:31 -08:00
Nick Sweeting	d5c0c64dcd	fix progress bars	2025-12-31 12:34:29 -08:00
Nick Sweeting	cb97f6651b	Add DNS traffic recorder plugin (#1748)	2025-12-31 11:02:43 -08:00
Nick Sweeting	60a4581ed8	Add tests for accessibility, parse_dom_outlinks, and consolelog plugins (#1749)	2025-12-31 11:01:56 -08:00
claude[bot]	1f84d1b467	Fix test assertions to fail when data is missing - Add assertIsNotNone for accessibility_data to ensure test fails if no data generated - Capture and report JSON decode errors in parse_dom_outlinks test - Add assertIsNotNone for outlinks_data with error details - Removes conditional checks that allowed tests to pass without verifying functionality Addresses review comments from cubic-dev-ai Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 19:00:30 +00:00
claude[bot]	483929391d	Fix test assertions to fail properly and add NXDOMAIN deduplication - test_seo.py: Add assertIsNotNone before conditional to catch SEO extraction failures - test_ssl.py: Add assertIsNotNone to ensure SSL data is captured from HTTPS URLs - test_pip_provider.py: Assert jsonl_found variable to verify binary discovery - dns plugin: Deduplicate NXDOMAIN records using seenResolutions map Tests now fail when functionality doesn't work (no cheating). Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 19:00:28 +00:00
Nick Sweeting	edc83bfac6	Add persona CLI command with browser cookie import (#1747)	2025-12-31 10:56:40 -08:00
Claude	2a68248602	Update all Chrome plugins to use shared chrome_utils.js Refactored 8 plugins to import shared utilities instead of duplicating code locally: - consolelog, redirects: Complete rewrite using shared utils - modalcloser, staticfile: Use readCdpUrl, readTargetId, parseArgs - dom, screenshot, pdf: Remove local parseArgs/getCdpUrl - headers: Import getEnv, getEnvBool, getEnvInt, parseArgs Removes ~380 lines of duplicated boilerplate code.	2025-12-31 18:35:25 +00:00
claude[bot]	3659adeb7e	Fix path traversal vulnerabilities in persona management Add input validation and path safety checks to prevent path traversal attacks in persona name handling: - Add validate_persona_name() to block dangerous characters (/, \, .., etc) - Add ensure_path_within_personas_dir() to verify resolved paths stay within PERSONAS_DIR - Apply validation at persona creation, renaming, and deletion operations Fixes security issues identified by cubic-dev-ai in PR review. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 18:30:26 +00:00
Claude	263335dc6d	Add tests for merkletree and custom binary provider plugins - merkletree: Tests merkle tree generation with real files, empty directory handling, and disabled mode - custom: Tests custom bash command execution and binary discovery	2025-12-31 18:30:04 +00:00
Claude	9703a8e88c	Add tests for responses, staticfile, and env provider plugins - responses: Tests network response capture during page load - staticfile: Tests static file detection and download skip for HTML - env: Tests PATH-based binary discovery (python3, bash)	2025-12-31 18:28:01 +00:00
Claude	cfa5edb160	Add tests for accessibility, parse_dom_outlinks, and consolelog plugins Real integration tests using Chrome sessions with example.com: - accessibility: Tests page outline and accessibility tree extraction - parse_dom_outlinks: Tests link extraction and categorization - consolelog: Tests console output capture	2025-12-31 18:25:48 +00:00
Claude	47d9874c1f	Merge remote-tracking branch 'origin/dev' into claude/dns-traffic-recorder-plugin-dNbxC	2025-12-31 18:24:56 +00:00
Nick Sweeting	cd0394c858	Add comprehensive tests for machine/process models, orchestrator, and search backends (#1745)	2025-12-31 10:21:12 -08:00
Nick Sweeting	20690fabbf	Fix CLI tests to use subprocess and remove mocks (#1746)	2025-12-31 10:20:50 -08:00
claude[bot]	08383c4d83	Fix tautological assertion in SEO test The assertion was checking 'has_seo_data or seo_data' inside an 'if seo_data:' block, making it always truthy. Changed to just check 'has_seo_data' to properly verify that expected SEO keys were extracted. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 18:19:47 +00:00
Claude	5d8c93eaf4	Consolidate CDP connection logic into chrome_utils.js Add shared snapshot hook utilities to chrome_utils.js: - parseArgs(): CLI argument parsing - waitForChromeSession(): Wait for CDP session files - readCdpUrl(): Read CDP WebSocket URL - readTargetId(): Read target page ID - connectToPage(): High-level browser/page connection - waitForPageLoaded(): Wait for navigation completion Refactor ssl, responses, and dns plugins to use shared utilities, eliminating ~100 lines of duplicated code across plugins.	2025-12-31 12:15:30 +00:00
Claude	73425fa984	Add persona CLI command with browser cookie import - Add `archivebox persona create/list/update/delete` commands - Support `--import=chrome|firefox|brave` to copy browser profile - Extract cookies via CDP to generate cookies.txt for non-browser tools - Fix JSDoc comment parsing issue in chrome_utils.js	2025-12-31 12:13:07 +00:00
Claude	f2c20f141c	Refactor dns plugin to use chrome_utils.js Import shared utilities (getEnv, getEnvBool, getEnvInt) from chrome_utils.js instead of duplicating them locally. Also use DNS_TIMEOUT config for dynamic timeout calculations.	2025-12-31 12:08:28 +00:00
Claude	13148fd6b5	Add DNS traffic recorder plugin Records hostname → IP resolutions during page load using Chrome CDP. Uses Network.responseReceived events to capture DNS resolution data and writes one JSON line per record to dns.jsonl. Features: - Captures hostname to IP address mappings (A/AAAA records) - Records failed DNS lookups (NXDOMAIN) - Deduplicates resolution records per page load - Integrates with existing Chrome plugin infrastructure	2025-12-31 12:05:02 +00:00
Claude	8a0acdebcd	Add SSL, redirects, SEO plugin tests and fix fake test issues - Add real integration tests for SSL, redirects, and SEO plugins using Chrome session helpers for live URL testing - Remove fake "format" tests that just created dicts and asserted on them (apt, pip, npm provider output format tests) - Remove npm integration test that created dirs then checked they existed - Fix SQLite search test to use SQLITEFTS_DB constant instead of hardcoded value	2025-12-31 12:00:00 +00:00
Claude	9bf7a520a0	Update tests for new Process model-based architecture - Remove pid_utils tests (module deleted in dev) - Update orchestrator tests to use Process model for tracking - Add tests for Process.current(), cleanup_stale_running(), terminate() - Add tests for Process hierarchy (parent/child, root, depth) - Add tests for Process.get_running(), get_running_count() - Add tests for ProcessMachine state machine - Update machine model tests to match current API (from_jsonl vs from_json)	2025-12-31 11:51:42 +00:00
Nick Sweeting	bbbfffd0fa	Improve admin snapshot list/grid views with better UX (#1744)	2025-12-31 03:51:11 -08:00
Claude	a063d8cd43	Merge remote-tracking branch 'origin/dev' into claude/analyze-test-coverage-mWgwv	2025-12-31 11:45:22 +00:00
Nick Sweeting	bdb3d946b8	Delete pid_utils.py and migrate to Process model (#1741)	2025-12-31 03:43:18 -08:00
claude[bot]	b2132d1f14	Fix cubic review issues: process_type detection, cmd storage, PID cleanup, and migration - Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation - Fix worker process_type detection: explicitly set to WORKER after registration - Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently - Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN) - Fix get_running_workers() to return process_id instead of incorrectly named worker_id - Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed - Fix migration to include all indexes in state_operations (parent_id, process_type) - Fix documentation to use Machine.current() scoping and StatusChoices constants Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:42:07 +00:00
Claude	0cb5f0712d	Add comprehensive tests for machine/process models, orchestrator, and search backends This adds new test coverage for previously untested areas: Machine module (archivebox/machine/tests/): - Machine, NetworkInterface, Binary, Process model tests - BinaryMachine and ProcessMachine state machine tests - JSONL serialization/deserialization tests - Manager method tests Workers module (archivebox/workers/tests/): - PID file utility tests (write, read, cleanup) - Orchestrator lifecycle and queue management tests - Worker spawning logic tests - Idle detection and exit condition tests Search backends: - SQLite FTS5 search tests with real indexed content - Phrase search, stemming, and unicode support - Ripgrep search tests with archive directory structure - Environment variable configuration tests Binary provider plugins: - pip provider hook tests - npm provider hook tests with PATH updates - apt provider hook tests	2025-12-31 11:33:27 +00:00
claude[bot]	5121b0e5f9	Merge branch 'dev' into claude/refactor-process-management-WcQyZ Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:28:47 +00:00
Claude	2e6dcb2b87	Improve admin snapshot list/grid views with better UX - Add prominent view mode switcher with List/Grid toggle buttons - Improve filter sidebar CSS with modern styling, rounded corners - Add live progress bar for in-progress snapshots showing hooks status - Show plugin icons only when output directory has content - Display archive result output_size sum from new field - Show hooks succeeded/total count in size column - Add get_progress_stats() method to Snapshot model - Add CSS for progress spinner and status badges - Update grid view template with progress indicator for archiving cards - Add tests for admin views, search, and progress stats	2025-12-31 11:28:03 +00:00
claude[bot]	ee201a0f83	Fix code review issues in process management refactor - Add pwd validation in Process.launch() to prevent crashes - Fix psutil returncode handling (use wait() return value, not returncode attr) - Add None check for proc.pid in cleanup_stale_running() - Add stale process cleanup in Orchestrator.is_running() - Ensure orchestrator process_type is correctly set to ORCHESTRATOR - Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown) - Throttle cleanup_stale_running() to once per 30 seconds for performance - Fix worker process_type to use TypeChoices.WORKER consistently - Fix get_running_workers() API to return list of dicts (not Process objects) - Only delete PID files after successful kill or confirmed stale - Fix migration index names to match between SQL and Django state - Remove db_index=True from process_type (index created manually) - Update documentation to reflect actual implementation - Add explanatory comments to empty except blocks - Fix exit codes to use Unix convention (128 + signal number) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:14:47 +00:00
Claude	b87bbbbecb	Fix CLI tests to use subprocess and remove mocks - Fix conftest.py: use subprocess for init, remove unused cli_env fixture - Update all test files to use data_dir parameter instead of env - Remove mock-based TestJSONLOutput class from tests_piping.py - Remove unused imports (MagicMock, patch) - Fix file permissions for cli_utils.py All tests now use real subprocess calls per CLAUDE.md guidelines: - NO MOCKS - tests exercise real code paths - NO SKIPS - every test runs	2025-12-31 10:53:45 +00:00
Nick Sweeting	7dd2d65770	Add pluginmap management command (#1742)	2025-12-31 02:29:24 -08:00
Nick Sweeting	575a595f26	Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6) (#1743)	2025-12-31 02:27:17 -08:00
Claude	bb52b5902a	Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6) Add comprehensive unit tests for the CLI piping architecture: - test_cli_crawl.py: crawl create/list/update/delete tests - test_cli_snapshot.py: snapshot create/list/update/delete tests - test_cli_archiveresult.py: archiveresult create/list/update/delete tests - test_cli_run.py: run command create-or-update and pass-through tests Extend tests_piping.py with: - TestPassThroughBehavior: tests for pass-through behavior in all commands - TestPipelineAccumulation: tests for accumulating records through pipeline All tests use pytest fixtures from conftest.py with isolated DATA_DIR.	2025-12-31 10:21:05 +00:00
Claude	672ccf918d	Add pluginmap management command Adds a new CLI command `archivebox pluginmap` that displays: - ASCII art diagrams of all core state machines (Crawl, Snapshot, ArchiveResult, Binary) - Lists all auto-detected on_Modelname_xyz hooks grouped by model/event - Shows hook execution order (step 0-9), plugin name, and background status Usage: archivebox pluginmap # Show all diagrams and hooks archivebox pluginmap -m Snapshot # Filter to specific model archivebox pluginmap -a # Include disabled plugins archivebox pluginmap -q # Output JSON only	2025-12-31 10:19:58 +00:00
Claude	b822352fc3	Delete pid_utils.py and migrate to Process model DELETED: - workers/pid_utils.py (-192 lines) - replaced by Process model methods SIMPLIFIED: - crawls/models.py Crawl.cleanup() (80 lines -> 10 lines) - hooks.py: deleted process_is_alive() and kill_process() (-45 lines) UPDATED to use Process model: - core/models.py: Snapshot.cleanup() and has_running_background_hooks() - machine/models.py: Binary.cleanup() - workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start - workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running All subprocess management now uses: - Process.current() for registering current process - Process.get_running() / get_running_count() for querying - Process.cleanup_stale_running() for cleanup - safe_kill_process() for validated PID killing Total line reduction: ~250 lines	2025-12-31 10:15:22 +00:00
Claude	2d3a2fec57	Add terminate, kill_tree, and query methods to Process model This consolidates scattered subprocess management logic into the Process model: - terminate(): Graceful SIGTERM → wait → SIGKILL (replaces stop_worker, etc.) - kill_tree(): Kill process and all OS children (replaces os.killpg logic) - kill_children_db(): Kill DB-tracked child processes - get_running(): Query running processes by type (replaces get_all_worker_pids) - get_running_count(): Count running processes (replaces get_running_worker_count) - stop_all(): Stop all processes of a type - get_next_worker_id(): Get next worker ID for spawning Added Phase 8 to TODO documenting ~390 lines that can be deleted after consolidation, including workers/pid_utils.py which becomes obsolete. Also includes migration 0002 for parent FK and process_type fields.	2025-12-31 10:08:45 +00:00
Claude	f3e11b61fd	Implement JSONL CLI pipeline architecture (Phases 1-4, 6) Phase 1: Model Prerequisites - Add ArchiveResult.from_json() and from_jsonl() methods - Fix Snapshot.to_json() to use tags_str (consistent with Crawl) Phase 2: Shared Utilities - Create archivebox/cli/cli_utils.py with shared apply_filters() - Update 7 CLI files to import from cli_utils.py instead of duplicating Phase 3: Pass-Through Behavior - Add pass-through to crawl create (non-Crawl records pass unchanged) - Add pass-through to snapshot create (Crawl records + others pass through) - Add pass-through to archiveresult create (Snapshot records + others) - Add create-or-update behavior to run command: - Records WITHOUT id: Create via Model.from_json() - Records WITH id: Lookup existing, re-queue - Outputs JSONL of all processed records for chaining Phase 4: Test Infrastructure - Create archivebox/tests/conftest.py with pytest-django fixtures - Include CLI helpers, output assertions, database assertions Phase 6: Config Update - Update supervisord_util.py: orchestrator -> run command This enables Unix-style piping: archivebox crawl create URL | archivebox run archivebox archiveresult list --status=failed | archivebox run curl API | jq transform | archivebox crawl create | archivebox run	2025-12-31 10:07:14 +00:00
Nick Sweeting	28a4f99f55	Add real-world use cases to CLI pipeline plan (#1740)	2025-12-31 01:57:08 -08:00
Nick Sweeting	95d61b001e	fix migrations	2025-12-31 01:40:59 -08:00
Nick Sweeting	1d15901304	fix process health stats	2025-12-31 01:40:59 -08:00
Nick Sweeting	3d8c62ffb1	fix extensions dir paths add personas migration	2025-12-31 01:40:59 -08:00
Nick Sweeting	1bbb9b45a7	Change hook timeout enforcement strategy (#1739) Summary by cubic: Switch background hook cleanup to a graceful termination flow using plugin-specific timeouts, only SIGKILLing if needed. This improves reliability and records accurate exit codes and stderr for better result reporting. - Refactors - Added graceful_terminate_background_hooks(): send SIGTERM to all hooks, wait per plugin timeout, SIGKILL remaining, reap with waitpid, write .returncode files. - Snapshot.cleanup() now uses merged config (get_config) to apply plugin-specific timeouts and terminate hooks gracefully. - update_from_output() reads .returncode and .stderr.log, infers status when no JSONL (handles signals like -9/-15), includes stderr on failures, and cleans up .returncode files. Written for commit `524e8e98c3`.	2025-12-31 01:32:43 -08:00
Claude	1c85b4daa3	Refine use cases: 8 examples with efficient patterns - Trimmed from 10 to 8 focused examples - Emphasize CLI args for DB filtering (efficient), jq for transforms - Added key examples showing `run` emits JSONL enabling chained processing: - #4: Retry failed with different binary/timeout via jq transform - #8: Recursive link following (run → jq filter → crawl → run) - Removed redundant jq domain filtering (use --url__icontains instead) - Updated summary table with "Retry w/ Changes" and "Chain Processing" patterns	2025-12-31 09:26:23 +00:00
Nick Sweeting	8dab2966cc	Consolidate Chrome test helpers across all plugin tests (#1738)	2025-12-31 01:25:39 -08:00
Claude	1cfb77a355	Rename Python helpers to match JS function names in snake_case - get_machine_type() matches JS getMachineType() - get_lib_dir() matches JS getLibDir() - get_node_modules_dir() matches JS getNodeModulesDir() - get_extensions_dir() matches JS getExtensionsDir() - find_chromium() matches JS findChromium() - kill_chrome() matches JS killChrome() - get_test_env() matches JS getTestEnv() All functions now try JS first (single source of truth) with Python fallback. Added backward compatibility aliases for old names.	2025-12-31 09:23:47 +00:00
Claude	524e8e98c3	Capture exit codes and stderr from background hooks Extended graceful_terminate_background_hooks() to: - Reap processes with os.waitpid() to get exit codes - Write returncode to .returncode file for update_from_output() - Return detailed result dict with status, returncode, and pid Updated update_from_output() to: - Read .returncode and .stderr.log files - Determine status from returncode if no ArchiveResult JSONL record - Include stderr in output_str for failed hooks - Handle signal termination (negative returncodes like -9 for SIGKILL) - Clean up .returncode files along with other hook output files	2025-12-31 09:23:41 +00:00
Claude	0f46d8a22e	Add real-world use cases to CLI pipeline plan Added 10 practical examples demonstrating the JSONL piping architecture: 1. Basic archive with auto-cascade 2. Retry failed extractions (by status, plugin, domain) 3. Pinboard bookmark import with jq 4. GitHub repo filtering with jq regex 5. Selective extraction (screenshots only) 6. Bulk tag management 7. Deep documentation crawling 8. RSS feed monitoring 9. Archive audit with jq aggregation 10. Incremental backup with diff Also added auto-cascade principle: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots, so intermediate commands are only needed for customization.	2025-12-31 09:20:25 +00:00
Claude	adeffb4bc5	Add JS-Python path delegation to reduce Chrome-related duplication - Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js These are now the single source of truth for path calculations - Update chrome_test_helpers.py with call_chrome_utils() dispatcher - Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers - Update cleanup_chrome and kill_chromium_session to use JS killChrome - Remove unused Chrome binary search lists from singlefile hook (~25 lines) - Update readability, mercury, favicon, title tests to use shared helpers	2025-12-31 09:11:11 +00:00
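
Several commits above (2d3a2fec57, ee201a0f83, 524e8e98c3) describe a graceful SIGTERM → wait → SIGKILL flow and reporting exit codes with the Unix convention of 128 + signal number. The following is a minimal sketch of that general pattern using only the Python standard library. It is not ArchiveBox's implementation: the function names `terminate_gracefully` and `shell_exit_code` and the default timeout are assumptions chosen for illustration, and the signal behaviour described applies to POSIX systems.

```python
import subprocess


def terminate_gracefully(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask a child process to exit, escalating only if it does not comply.

    Hypothetical helper, not taken from the ArchiveBox codebase.
    """
    if proc.poll() is None:          # still running
        proc.terminate()             # sends SIGTERM on POSIX
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()              # sends SIGKILL on POSIX
            proc.wait()              # reap so no zombie is left behind
    return proc.returncode


def shell_exit_code(returncode: int) -> int:
    """Convert a Popen returncode to the shell-style exit code.

    Popen reports death-by-signal N as -N; shells report it as 128 + N.
    """
    return 128 - returncode if returncode < 0 else returncode
```

With this sketch, a child killed by SIGTERM yields a return code of -15 from `terminate_gracefully()`, which `shell_exit_code()` maps to 143; a child that needed SIGKILL yields -9, which maps to 137.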


4848 Commits
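
Commit 3659adeb7e in the log above describes two layers of protection against path traversal in persona names: rejecting dangerous characters, then verifying that the resolved path stays inside the personas directory. A rough sketch of that two-layer idea is shown here. The function names echo the commit message, but the bodies, the forbidden-token list, and the `ensure_path_within` signature are assumptions for illustration and may differ from the actual code. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path

# Assumed deny-list; the real project may block a different set of tokens.
FORBIDDEN_TOKENS = ("/", "\\", "..", "\0")


def validate_persona_name(name: str) -> str:
    """Reject names that are empty, padded, or contain path-like tokens."""
    if not name or name != name.strip() or name == ".":
        raise ValueError(f"invalid persona name: {name!r}")
    if any(token in name for token in FORBIDDEN_TOKENS):
        raise ValueError(f"persona name contains a forbidden token: {name!r}")
    return name


def ensure_path_within(base_dir: Path, name: str) -> Path:
    """Resolve base_dir/name and confirm it is strictly inside base_dir."""
    base = base_dir.resolve()
    candidate = (base / name).resolve()
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {candidate}")
    return candidate
```

The second check is deliberately independent of the first: even if the deny-list misses a case (a symlink, for example), resolving the path and comparing it against the base directory still catches an escape.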