ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 09:25:42 +10:00

Author	SHA1	Message	Date
Nick Sweeting	1d15901304	fix process health stats	2025-12-31 01:40:59 -08:00
Nick Sweeting	3d8c62ffb1	fix extensions dir paths add personas migration	2025-12-31 01:40:59 -08:00
Nick Sweeting	1bbb9b45a7	Change hook timeout enforcement strategy (#1739) ## Summary by cubic Switch background hook cleanup to a graceful termination flow using plugin-specific timeouts, only SIGKILLing if needed. This improves reliability and records accurate exit codes and stderr for better result reporting. - Refactors - Added graceful_terminate_background_hooks(): send SIGTERM to all hooks, wait per plugin timeout, SIGKILL remaining, reap with waitpid, write .returncode files. - Snapshot.cleanup() now uses merged config (get_config) to apply plugin-specific timeouts and terminate hooks gracefully. - update_from_output() reads .returncode and .stderr.log, infers status when no JSONL (handles signals like -9/-15), includes stderr on failures, and cleans up .returncode files. Written for commit `524e8e98c3`.	2025-12-31 01:32:43 -08:00
Nick Sweeting	8dab2966cc	Consolidate Chrome test helpers across all plugin tests (#1738)	2025-12-31 01:25:39 -08:00
Claude	1cfb77a355	Rename Python helpers to match JS function names in snake_case - get_machine_type() matches JS getMachineType() - get_lib_dir() matches JS getLibDir() - get_node_modules_dir() matches JS getNodeModulesDir() - get_extensions_dir() matches JS getExtensionsDir() - find_chromium() matches JS findChromium() - kill_chrome() matches JS killChrome() - get_test_env() matches JS getTestEnv() All functions now try JS first (single source of truth) with Python fallback. Added backward compatibility aliases for old names.	2025-12-31 09:23:47 +00:00
Claude	524e8e98c3	Capture exit codes and stderr from background hooks Extended graceful_terminate_background_hooks() to: - Reap processes with os.waitpid() to get exit codes - Write returncode to .returncode file for update_from_output() - Return detailed result dict with status, returncode, and pid Updated update_from_output() to: - Read .returncode and .stderr.log files - Determine status from returncode if no ArchiveResult JSONL record - Include stderr in output_str for failed hooks - Handle signal termination (negative returncodes like -9 for SIGKILL) - Clean up .returncode files along with other hook output files	2025-12-31 09:23:41 +00:00
Claude	adeffb4bc5	Add JS-Python path delegation to reduce Chrome-related duplication - Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js These are now the single source of truth for path calculations - Update chrome_test_helpers.py with call_chrome_utils() dispatcher - Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers - Update cleanup_chrome and kill_chromium_session to use JS killChrome - Remove unused Chrome binary search lists from singlefile hook (~25 lines) - Update readability, mercury, favicon, title tests to use shared helpers	2025-12-31 09:11:11 +00:00
Claude	b73199b33e	Refactor background hook cleanup to use graceful termination Changed Snapshot.cleanup() to gracefully terminate background hooks: 1. Send SIGTERM to all background hook processes first 2. Wait up to each hook's plugin-specific timeout 3. Send SIGKILL only to hooks still running after their timeout Added graceful_terminate_background_hooks() function in hooks.py that: - Collects all .pid files from output directory - Validates process identity using mtime - Sends SIGTERM to all valid processes in phase 1 - Polls each process for up to its plugin-specific timeout - Sends SIGKILL as last resort if timeout expires - Returns status for each hook (sigterm/sigkill/already_dead/invalid)	2025-12-31 09:03:27 +00:00
Claude	d72ab7c397	Add simpler Chrome test helpers and update test files New helpers in chrome_test_helpers.py: - get_plugin_dir(__file__) - get plugin dir from test file path - get_hook_script(dir, pattern) - find hook script by glob pattern - run_hook() - run hook script and return (returncode, stdout, stderr) - parse_jsonl_output() - parse JSONL from hook output - run_hook_and_parse() - convenience combo of above two - LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants - _LazyPath class for deferred path resolution Updated test files to use simpler patterns: - screenshot/tests/test_screenshot.py - dom/tests/test_dom.py - pdf/tests/test_pdf.py - singlefile/tests/test_singlefile.py Before: PLUGIN_DIR = Path(__file__).parent.parent After: PLUGIN_DIR = get_plugin_dir(__file__) Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules' After: from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR	2025-12-31 09:02:34 +00:00
Claude	7d74dd906c	Add Chrome CDP integration tests for singlefile - Import shared Chrome test helpers - Add test_singlefile_with_chrome_session() to verify CDP connection - Add test_singlefile_disabled_skips() for config testing - Update existing test to use get_test_env()	2025-12-31 08:57:13 +00:00
Claude	ef92a99c4a	Refactor test_chrome.py to use shared helpers - Add get_machine_type() to chrome_test_helpers.py - Update get_test_env() to include MACHINE_TYPE - Refactor test_chrome.py to import from shared helpers - Removes ~50 lines of duplicate code	2025-12-31 08:34:35 +00:00
Claude	65c839032a	Consolidate Chrome test helpers across all plugin tests - Add setup_test_env, launch_chromium_session, kill_chromium_session to chrome_test_helpers.py for extension tests - Add chromium_session context manager for cleaner test code - Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use shared helpers (~450 lines removed) - Refactor screenshot, dom, pdf tests to use shared get_test_env and get_lib_dir (~60 lines removed) - Net reduction: 228 lines of duplicate code	2025-12-31 08:30:14 +00:00
Nick Sweeting	29eb6280d3	tweak comment	2025-12-31 00:25:01 -08:00
Nick Sweeting	65b93d5a3b	tweak comment	2025-12-31 00:25:01 -08:00
Nick Sweeting	4394ce5f40	Reduce code duplication between Chrome utilities (#1737) This change consolidates duplicated logic between chrome_utils.js and extension installer hooks, as well as between Python plugin tests: JavaScript changes: - Add getExtensionsDir() to centralize extension directory path calculation - Add installExtensionWithCache() to handle extension install + cache workflow - Add CLI commands for new utilities - Refactor all 3 extension installers (ublock, istilldontcareaboutcookies, twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60 - Update chrome_launch hook to use getExtensionsDir() Python test changes: - Add chrome_test_helpers.py with shared Chrome session management utilities - Refactor infiniscroll and modalcloser tests to use shared helpers - setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized - Add chrome_session() context manager for automatic cleanup Net result: ~208 lines of code removed while maintaining same functionality.	2025-12-31 00:19:44 -08:00
Nick Sweeting	987f4fbe0a	Review output file paths and data directory structure (#1736) - Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths - Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ - Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl - Update comment in chrome_tab hook to reflect new config source	2025-12-31 00:19:03 -08:00
Claude	04c23badc2	Fix output path structure for 0.9.x data directory - Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths - Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ - Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl - Update comment in chrome_tab hook to reflect new config source	2025-12-31 08:18:24 +00:00
Claude	fd9ba86220	Reduce Chrome-related code duplication across JS and Python This change consolidates duplicated logic between chrome_utils.js and extension installer hooks, as well as between Python plugin tests: JavaScript changes: - Add getExtensionsDir() to centralize extension directory path calculation - Add installExtensionWithCache() to handle extension install + cache workflow - Add CLI commands for new utilities - Refactor all 3 extension installers (ublock, istilldontcareaboutcookies, twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60 - Update chrome_launch hook to use getExtensionsDir() Python test changes: - Add chrome_test_helpers.py with shared Chrome session management utilities - Refactor infiniscroll and modalcloser tests to use shared helpers - setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized - Add chrome_session() context manager for automatic cleanup Net result: ~208 lines of code removed while maintaining same functionality.	2025-12-31 08:13:00 +00:00
Nick Sweeting	cead22afc2	archivebox <modelname> create|list|update|delete | ... piping support (#1735) Comprehensive plan for implementing JSONL-based CLI piping: - Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix) - Phase 2: Extract shared apply_filters() to cli_utils.py - Phase 3: Implement pass-through behavior for all create commands - Phase 4-6: Test infrastructure with pytest-django, unit/integration tests Key changes from original plan: - ArchiveResult.from_json() identified as missing prerequisite - Pass-through documented as new feature to implement - archivebox run updated to create-or-update pattern - conftest.py redesigned to use pytest-django with isolated tmp_path - Standardized on tags_str field name across all models - Reordered phases: implement before test	2025-12-30 23:57:33 -08:00
Nick Sweeting	84a4fb0785	fix cubic comments	2025-12-30 23:53:47 -08:00
Nick Sweeting	ca3f8f0ff1	Process.launch()/kill()/.pidfile/.wait()/etc. centralize process handling logic on model methods (#1734) ## Summary by cubic Added an implementation plan to centralize subprocess handling on the machine.Process model. It covers process hierarchy, Process.current(), safe lifecycle methods (launch/kill/wait), PID reuse protection, and phased changes across hooks, workers, CLI, migrations, and admin. Written for commit `3ae9410127`.	2025-12-30 23:42:29 -08:00
Nick Sweeting	dfe68412af	Merge branch 'dev' into claude/refactor-process-management-WcQyZ	2025-12-30 23:42:23 -08:00
claude[bot]	4285a05d19	Fix getEnvArray to parse JSON when '[' present, CSV otherwise Simplifies the comma-separated parsing logic to: - If value contains '[', parse as JSON array - Otherwise, parse as comma-separated values This prevents incorrect splitting of arguments containing internal commas when there's only one argument. For arguments with commas, users should use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]' Also exports getEnvArray in module.exports for consistency. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 07:39:49 +00:00
Nick Sweeting	3ae9410127	Update TODO_process_tracking.md	2025-12-31 02:39:36 -05:00
Nick Sweeting	e26a0f6fc0	Fix hook file overwrites in plugin directory (#1732) Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name: - on_Snapshot__20_chrome_tab.bg.stdout.log - on_Snapshot__20_chrome_tab.bg.stderr.log - on_Snapshot__20_chrome_tab.bg.pid - on_Snapshot__20_chrome_tab.bg.sh Updated: - hooks.py run_hook() to use hook-specific names - core/models.py cleanup and update_from_output methods - Plugin scripts to no longer write redundant hook.pid files ## Summary by cubic Prevented hook file collisions by giving each hook its own stdout, stderr, pid, and cmd filenames. This fixes mixed logs and ensures correct cleanup and status checks when multiple hooks run in the same plugin directory. - Bug Fixes - hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude them from new_files; derive cmd.sh from pid for safe kill. - core/models.py: read hook-specific logs; exclude hook output files when computing outputs; cleanup and background detection use *.pid. - Plugins: stop writing redundant hook.pid files; minor chrome utils cleanup. Written for commit `754b096193`.	2025-12-30 23:36:09 -08:00
Nick Sweeting	f7b186d7c8	Apply suggestion from @cubic-dev-ai[bot] Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>	2025-12-31 02:31:46 -05:00
Nick Sweeting	dac6c63bba	working extension tests	2025-12-30 18:30:16 -08:00
Nick Sweeting	42d3fb7025	extension test fixes	2025-12-30 18:28:14 -08:00
Claude	754b096193	use hook-specific filenames to avoid overwrites Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name: - on_Snapshot__20_chrome_tab.bg.stdout.log - on_Snapshot__20_chrome_tab.bg.stderr.log - on_Snapshot__20_chrome_tab.bg.pid - on_Snapshot__20_chrome_tab.bg.sh Updated: - hooks.py run_hook() to use hook-specific names - core/models.py cleanup and update_from_output methods - Plugin scripts to no longer write redundant hook.pid files	2025-12-31 02:00:15 +00:00
Claude	df2a0dcd44	Add revised CLI pipeline architecture plan Comprehensive plan for implementing JSONL-based CLI piping: - Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix) - Phase 2: Extract shared apply_filters() to cli_utils.py - Phase 3: Implement pass-through behavior for all create commands - Phase 4-6: Test infrastructure with pytest-django, unit/integration tests Key changes from original plan: - ArchiveResult.from_json() identified as missing prerequisite - Pass-through documented as new feature to implement - archivebox run updated to create-or-update pattern - conftest.py redesigned to use pytest-django with isolated tmp_path - Standardized on tags_str field name across all models - Reordered phases: implement before test	2025-12-31 01:46:07 +00:00
Claude	b8a66c4a84	Convert Persona to Django ModelWithConfig, add to get_config() - Convert Persona from plain Python class to Django model with ModelWithConfig - Add config JSONField for persona-specific config overrides - Add get_derived_config() method that returns config with derived paths: - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA - Update get_config() to accept persona parameter in merge chain: get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot) - Remove _derive_persona_paths() - derivation now happens in Persona model - Merge order (highest to lowest priority): 1. snapshot.config 2. crawl.config 3. user.config 4. persona.get_derived_config() <- NEW 5. environment variables 6. ArchiveBox.conf file 7. plugin defaults 8. core defaults Usage: config = get_config(persona=crawl.persona, crawl=crawl) config['CHROME_USER_DATA_DIR'] # derived from persona	2025-12-31 01:07:29 +00:00
Claude	b1e31c3def	Simplify Persona class: remove convenience functions, fix get_active() - Remove standalone convenience functions (cleanup_chrome_for_persona, cleanup_chrome_all_personas) to reduce LOC - Change Persona.get_active(config) to accept config dict as argument instead of calling get_config() internally, since the caller needs to pass user/crawl/snapshot/archiveresult context for proper config	2025-12-31 01:00:52 +00:00
Claude	503a2f77cb	Add Persona class with cleanup_chrome() method - Create Persona class in personas/models.py for managing browser profiles/identities used for archiving sessions - Each Persona has: - chrome_user_data_dir: Chrome profile directory - chrome_extensions_dir: Installed extensions - cookies_file: Cookies for wget/curl - config_file: Persona-specific config overrides - Add Persona methods: - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files - get_config(): Load persona config from config.json - save_config(): Save persona config to config.json - ensure_dirs(): Create persona directory structure - all(): Iterator over all personas - get_active(): Get persona based on ACTIVE_PERSONA config - cleanup_chrome_all(): Clean up all personas - Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all() instead of manual directory iteration - Add convenience functions: - cleanup_chrome_for_persona(name) - cleanup_chrome_all_personas()	2025-12-31 00:59:37 +00:00
Claude	1a86789523	Move Chrome default args to config.json CHROME_ARGS - Add comprehensive default CHROME_ARGS in config.json with 55+ flags for deterministic rendering, security, performance, and UI suppression - Update chrome_utils.js launchChromium() to read CHROME_ARGS and CHROME_ARGS_EXTRA from environment variables (set by get_config()) - Add getEnvArray() helper to parse JSON arrays or comma-separated strings from environment variables - Separate args into three categories: 1. baseArgs: Static flags from CHROME_ARGS config (configurable) 2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.) 3. extraArgs: User overrides from CHROME_ARGS_EXTRA - Add CHROME_SANDBOX config option to control --no-sandbox flag Args are now configurable via: - config.json defaults - ArchiveBox.conf file - Environment variables - Per-crawl/snapshot config overrides	2025-12-31 00:57:29 +00:00
Claude	caee376749	Add Process.proc property for validated psutil access New section 1.5 adds @property proc that returns psutil.Process ONLY if: - PID exists in OS - OS start time matches our started_at (within tolerance) - We're on the same machine Safety features: - Validates start time via psutil.Process.create_time() - Optional command validation (binary name matches) - Returns None instead of wrong process on PID reuse Also adds convenience methods: - is_running: Check via validated psutil - get_memory_info(): RSS/VMS if running - get_cpu_percent(): CPU usage if running - get_children_pids(): Child PIDs from OS Updated kill() to use self.proc for safe killing - never kills a recycled PID since we validate start time first.	2025-12-31 00:49:58 +00:00
Claude	f3c91b4c4e	Add detailed supervisord Process tracking to plan Phase 3.3 now includes: - Module-level _supervisord_db_process variable - start_new_supervisord_process(): Create Process record after Popen - stop_existing_supervisord_process(): Update Process status on shutdown - Process hierarchy diagram showing CLI → supervisord → workers chain Key insight: PPID-based linking works because workers call Process.current() in on_startup(), which finds supervisord's Process via PPID lookup.	2025-12-31 00:45:10 +00:00
Claude	e41ca37848	Add detailed hook/run() changes to Process tracking plan Phase 2 now includes line-by-line mapping of: - run_hook(): Create Process record, use Process.launch(), parse JSONL for child binary Process records - process_is_alive(): Accept Path or Process, use Process.is_alive() - kill_process(): Accept Path or Process, use Process.kill() - ArchiveResult.run(): Pass self.process as parent_process to run_hook() - ArchiveResult.update_from_output(): Read from Process.stdout/stderr - Snapshot.cleanup(): Kill via Process model, fallback to PID files - Snapshot.has_running_background_hooks(): Check via Process model Hook JSONL contract updated to support {"type": "Process"} records for tracking binary executions within hooks.	2025-12-31 00:44:10 +00:00
Claude	554d743719	Add robust PID reuse protection to Process.current() plan PIDs are recycled by OS, so all Process queries now: - Filter by machine=Machine.current() (PIDs unique per machine) - Filter by started_at within PID_REUSE_WINDOW (24h) - Validate start time matches OS via psutil.Process.create_time() Added: - ProcessManager.get_by_pid() for safe PID lookups - Process.cleanup_stale_running() to mark orphaned RUNNING as EXITED - START_TIME_TOLERANCE (5s) for start time comparison - Uses psutil.Process.create_time() for accurate started_at	2025-12-31 00:36:01 +00:00
Claude	4c4c065697	Add Process.current() to implementation plan Key addition: Process.current() class method (like Machine.current()) that auto-creates/retrieves the Process record for the current OS process. Benefits: - Uses PPID lookup to find parent Process automatically - Detects process_type from sys.argv - Cached with validation (like Machine.current()) - Eliminates need for thread-local context management Simplified Phase 3 (workers) and Phase 4 (CLI) to just call Process.current() instead of manual Process creation.	2025-12-31 00:32:05 +00:00
Claude	f21fb55a2c	Add comprehensive implementation plan for Process hierarchy tracking Documents 7-phase refactoring to use machine.Process as the core data model for all subprocess management: - Phase 1: Add parent FK and process_type to Process model - Phase 2: Add lifecycle methods (launch, kill, poll, wait) - Phase 3: Update hook system to create Process records - Phase 4-5: Track workers/orchestrator/supervisord as Process - Phase 6: Create root Process on CLI invocation - Phase 7: Admin UI with tree visualization Enables full process hierarchy tracking from CLI → binary execution.	2025-12-31 00:28:17 +00:00
Claude	877b5f91c2	Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system - Add _derive_persona_paths() in configset.py to automatically derive CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA when not explicitly set. This allows plugins to use these paths without knowing about the persona system. - Update chrome_utils.js launchChromium() to accept userDataDir option and pass --user-data-dir to Chrome. Also cleans up SingletonLock before launch. - Update killZombieChrome() to clean up SingletonLock files from all persona chrome_user_data directories after killing zombies. - Update chrome_cleanup() in misc/util.py to handle persona-based user data directories when cleaning up stale Chrome state. - Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from env (derived by get_config()). Config priority flow: ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot) -> get_config() derives: CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions -> hooks receive these as env vars without needing persona logic	2025-12-31 00:21:07 +00:00
Nick Sweeting	dd2302ad92	new jsonl cli interface	2025-12-30 16:12:53 -08:00
Nick Sweeting	ba8c28a866	use process_set for related name not processes	2025-12-30 12:55:23 -08:00
Nick Sweeting	1b49ea9a0e	improve jsonl logic	2025-12-30 12:43:36 -08:00
Nick Sweeting	08366cfa46	document chrome configs	2025-12-30 12:42:50 -08:00
Nick Sweeting	93a78ce595	Convert snapshot index from JSON to JSONL (#1730) # Changes these areas - [x] Snapshot data layout on disk ## Summary by cubic Switch snapshot index storage from index.json to a flat index.jsonl format for easier parsing and extensibility. Includes automatic migration and backward-compatible reading, plus updated CLI pipeline to emit/consume JSONL records. - New Features - Write and read index.jsonl with per-line records (Snapshot, ArchiveResult, Binary, Process); reconcile prefers JSONL. - Auto-convert legacy index.json to JSONL during migration/update; load_from_directory/create_from_directory support both formats. - Serialization moved to model to_jsonl methods; added schema_version to all records, including Tag, Crawl, Binary, and Process. - CLI pipeline updated: crawl creates a single Crawl job from all input URLs and outputs Crawl JSONL (no immediate crawling); snapshot accepts Crawl JSONL/IDs and outputs Snapshot JSONL; extract outputs ArchiveResult JSONL via model methods. - Migration - Conversion runs during filesystem migration and reconcile; no manual steps. - Legacy index.json is deleted after conversion; external tools should switch to index.jsonl. Written for commit `251fe33e49`.	2025-12-30 12:32:52 -08:00
claude[bot]	251fe33e49	fix: rename --plugin to --plugins for consistency Changed from singular --plugin to plural --plugins in both snapshot and extract commands to match the pattern in archivebox add command. Updated to accept comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title). - Updated CLI option from --plugin to --plugins - Added parsing for comma-separated plugin names - Updated function signatures and logic to handle multiple plugins - Updated help text, docstrings, and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:20:29 +00:00
claude[bot]	64db6deab3	fix: revert incorrect --extract renaming, restore --plugin parameter The --plugins parameter was incorrectly renamed to --extract (boolean). This restores --plugin (singular, matching extract command) with correct semantics: specify which plugin to run after creating snapshots. - Changed --extract/--no-extract back to --plugin (string parameter) - Updated function signature and logic to use plugin parameter - Added ArchiveResult creation for specific plugin when --plugin is passed - Updated docstring and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:15:48 +00:00
claude[bot]	762cddc8c5	fix: address PR review comments from cubic-dev-ai - Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency - Fix fallback logic in legacy.py to try JSON when JSONL parsing fails - Replace bare except clauses with specific exception types - Fix stdin double-consumption in archivebox_crawl.py - Merge CLI --tag option with crawl tags in archivebox_snapshot.py - Remove tautological mock tests (covered by integration tests) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:09:51 +00:00
Claude	cf387ed59f	refactor: batch all URLs into single Crawl, update tests - archivebox crawl now creates one Crawl with all URLs as newline-separated string - Updated tests to reflect new pipeline: crawl -> snapshot -> extract - Added tests for Crawl JSONL parsing and output - Tests verify Crawl.from_jsonl() handles multiple URLs correctly	2025-12-30 20:06:56 +00:00
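
Note on the graceful termination flow described in commits b73199b33e, 524e8e98c3, and #1739: the sequence (SIGTERM to every hook, wait up to each hook's own timeout, SIGKILL whatever is left, reap to record the exit code) can be sketched roughly as below. This is an illustration only, not the code in hooks.py. It assumes POSIX signals and works on subprocess.Popen handles rather than the .pid files the commits mention; the function name graceful_terminate and its parameters are hypothetical.

```python
import subprocess
import sys
import time


def graceful_terminate(procs, timeouts, default_timeout=5.0, poll_interval=0.05):
    """Terminate processes with SIGTERM first, escalating to SIGKILL on timeout.

    procs:    dict mapping a hook name to its subprocess.Popen handle (assumption)
    timeouts: dict mapping a hook name to its grace period in seconds (assumption)
    Returns a dict of hook name -> {"status": ..., "returncode": ...}.
    """
    results = {}

    # Phase 1: ask every still-running process to exit.
    for name, proc in procs.items():
        if proc.poll() is not None:
            results[name] = {"status": "already_dead", "returncode": proc.returncode}
        else:
            proc.terminate()  # sends SIGTERM on POSIX

    # Phase 2: give each process its own grace period, then force-kill.
    started = time.monotonic()
    for name, proc in procs.items():
        if name in results:
            continue
        deadline = started + timeouts.get(name, default_timeout)
        while proc.poll() is None and time.monotonic() < deadline:
            time.sleep(poll_interval)
        if proc.poll() is None:
            proc.kill()  # sends SIGKILL on POSIX
            status = "sigkill"
        else:
            status = "sigterm"
        proc.wait()  # reap the child so its exit code is available
        results[name] = {"status": status, "returncode": proc.returncode}

    return results


if __name__ == "__main__":
    sleeper = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(graceful_terminate({"sleeper": sleeper}, {"sleeper": 2.0}))
```

On POSIX, a process ended by a signal reports a negative return code, which matches the commit's mention of values like -9 and -15.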

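
Note on the getEnvArray rule from commit 4285a05d19 (parse as a JSON array when the value contains '[', otherwise split on commas): the real helper lives in chrome_utils.js, so the Python below is only a hedged translation of the rule as the commit message states it. The name get_env_array, its default and env parameters, and the choice to raise on malformed JSON are assumptions made for this sketch.

```python
import json
import os


def get_env_array(name, default=None, env=None):
    """Read a list-valued setting from the environment.

    Values containing '[' are parsed as a JSON array; anything else is
    treated as comma-separated values. `env` is a hypothetical parameter
    that lets callers pass a plain dict instead of os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if not raw:
        return list(default or [])

    if "[" in raw:
        # JSON form keeps commas inside a single argument intact.
        parsed = json.loads(raw)  # raises ValueError on malformed JSON (assumption)
        if not isinstance(parsed, list):
            raise ValueError(f"{name} must be a JSON array")
        return [str(item) for item in parsed]

    # CSV form: split on commas and drop empty entries.
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    sample = {"CHROME_ARGS": '["--arg1,val", "--arg2"]'}
    print(get_env_array("CHROME_ARGS", env=sample))
```

The JSON form is what lets an argument with an internal comma, such as the commit's `--arg1,val` example, survive as one entry.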

4805 Commits