ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 01:15:57 +10:00

Author	SHA1	Message	Date
Nick Sweeting	bdb3d946b8	Delete pid_utils.py and migrate to Process model (#1741 )	2025-12-31 03:43:18 -08:00
claude[bot]	b2132d1f14	Fix cubic review issues: process_type detection, cmd storage, PID cleanup, and migration - Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation - Fix worker process_type detection: explicitly set to WORKER after registration - Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently - Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN) - Fix get_running_workers() to return process_id instead of incorrectly named worker_id - Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed - Fix migration to include all indexes in state_operations (parent_id, process_type) - Fix documentation to use Machine.current() scoping and StatusChoices constants Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:42:07 +00:00
claude[bot]	5121b0e5f9	Merge branch 'dev' into claude/refactor-process-management-WcQyZ Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:28:47 +00:00
claude[bot]	ee201a0f83	Fix code review issues in process management refactor - Add pwd validation in Process.launch() to prevent crashes - Fix psutil returncode handling (use wait() return value, not returncode attr) - Add None check for proc.pid in cleanup_stale_running() - Add stale process cleanup in Orchestrator.is_running() - Ensure orchestrator process_type is correctly set to ORCHESTRATOR - Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown) - Throttle cleanup_stale_running() to once per 30 seconds for performance - Fix worker process_type to use TypeChoices.WORKER consistently - Fix get_running_workers() API to return list of dicts (not Process objects) - Only delete PID files after successful kill or confirmed stale - Fix migration index names to match between SQL and Django state - Remove db_index=True from process_type (index created manually) - Update documentation to reflect actual implementation - Add explanatory comments to empty except blocks - Fix exit codes to use Unix convention (128 + signal number) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:14:47 +00:00
Nick Sweeting	7dd2d65770	Add pluginmap management command (#1742 )	2025-12-31 02:29:24 -08:00
Nick Sweeting	575a595f26	Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6) (#1743 )	2025-12-31 02:27:17 -08:00
Claude	bb52b5902a	Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6) Add comprehensive unit tests for the CLI piping architecture: - test_cli_crawl.py: crawl create/list/update/delete tests - test_cli_snapshot.py: snapshot create/list/update/delete tests - test_cli_archiveresult.py: archiveresult create/list/update/delete tests - test_cli_run.py: run command create-or-update and pass-through tests Extend tests_piping.py with: - TestPassThroughBehavior: tests for pass-through behavior in all commands - TestPipelineAccumulation: tests for accumulating records through pipeline All tests use pytest fixtures from conftest.py with isolated DATA_DIR.	2025-12-31 10:21:05 +00:00
Claude	672ccf918d	Add pluginmap management command Adds a new CLI command `archivebox pluginmap` that displays: - ASCII art diagrams of all core state machines (Crawl, Snapshot, ArchiveResult, Binary) - Lists all auto-detected on_Modelname_xyz hooks grouped by model/event - Shows hook execution order (step 0-9), plugin name, and background status Usage: archivebox pluginmap # Show all diagrams and hooks archivebox pluginmap -m Snapshot # Filter to specific model archivebox pluginmap -a # Include disabled plugins archivebox pluginmap -q # Output JSON only	2025-12-31 10:19:58 +00:00
Claude	b822352fc3	Delete pid_utils.py and migrate to Process model DELETED: - workers/pid_utils.py (-192 lines) - replaced by Process model methods SIMPLIFIED: - crawls/models.py Crawl.cleanup() (80 lines -> 10 lines) - hooks.py: deleted process_is_alive() and kill_process() (-45 lines) UPDATED to use Process model: - core/models.py: Snapshot.cleanup() and has_running_background_hooks() - machine/models.py: Binary.cleanup() - workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start - workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running All subprocess management now uses: - Process.current() for registering current process - Process.get_running() / get_running_count() for querying - Process.cleanup_stale_running() for cleanup - safe_kill_process() for validated PID killing Total line reduction: ~250 lines	2025-12-31 10:15:22 +00:00
Claude	2d3a2fec57	Add terminate, kill_tree, and query methods to Process model This consolidates scattered subprocess management logic into the Process model: - terminate(): Graceful SIGTERM → wait → SIGKILL (replaces stop_worker, etc.) - kill_tree(): Kill process and all OS children (replaces os.killpg logic) - kill_children_db(): Kill DB-tracked child processes - get_running(): Query running processes by type (replaces get_all_worker_pids) - get_running_count(): Count running processes (replaces get_running_worker_count) - stop_all(): Stop all processes of a type - get_next_worker_id(): Get next worker ID for spawning Added Phase 8 to TODO documenting ~390 lines that can be deleted after consolidation, including workers/pid_utils.py which becomes obsolete. Also includes migration 0002 for parent FK and process_type fields.	2025-12-31 10:08:45 +00:00
Claude	f3e11b61fd	Implement JSONL CLI pipeline architecture (Phases 1-4, 6) Phase 1: Model Prerequisites - Add ArchiveResult.from_json() and from_jsonl() methods - Fix Snapshot.to_json() to use tags_str (consistent with Crawl) Phase 2: Shared Utilities - Create archivebox/cli/cli_utils.py with shared apply_filters() - Update 7 CLI files to import from cli_utils.py instead of duplicating Phase 3: Pass-Through Behavior - Add pass-through to crawl create (non-Crawl records pass unchanged) - Add pass-through to snapshot create (Crawl records + others pass through) - Add pass-through to archiveresult create (Snapshot records + others) - Add create-or-update behavior to run command: - Records WITHOUT id: Create via Model.from_json() - Records WITH id: Lookup existing, re-queue - Outputs JSONL of all processed records for chaining Phase 4: Test Infrastructure - Create archivebox/tests/conftest.py with pytest-django fixtures - Include CLI helpers, output assertions, database assertions Phase 6: Config Update - Update supervisord_util.py: orchestrator -> run command This enables Unix-style piping: archivebox crawl create URL \| archivebox run archivebox archiveresult list --status=failed \| archivebox run curl API \| jq transform \| archivebox crawl create \| archivebox run	2025-12-31 10:07:14 +00:00
Nick Sweeting	28a4f99f55	Add real-world use cases to CLI pipeline plan (#1740 ) <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk	2025-12-31 01:57:08 -08:00
Nick Sweeting	95d61b001e	fix migrations	2025-12-31 01:40:59 -08:00
Nick Sweeting	1d15901304	fix process health stats	2025-12-31 01:40:59 -08:00
Nick Sweeting	3d8c62ffb1	fix extensions dir paths add personas migration	2025-12-31 01:40:59 -08:00
Nick Sweeting	1bbb9b45a7	Change hook timeout enforcement strategy (#1739 ) <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk <!-- This is an auto-generated description by cubic. --> --- ## Summary by cubic Switch background hook cleanup to a graceful termination flow using plugin-specific timeouts, only SIGKILLing if needed. This improves reliability and records accurate exit codes and stderr for better result reporting. - Refactors - Added graceful_terminate_background_hooks(): send SIGTERM to all hooks, wait per plugin timeout, SIGKILL remaining, reap with waitpid, write .returncode files. - Snapshot.cleanup() now uses merged config (get_config) to apply plugin-specific timeouts and terminate hooks gracefully. - update_from_output() reads .returncode and .stderr.log, infers status when no JSONL (handles signals like -9/-15), includes stderr on failures, and cleans up .returncode files. <sup>Written for commit `524e8e98c3`. Summary will update on new commits.</sup> <!-- End of auto-generated description by cubic. -->	2025-12-31 01:32:43 -08:00
Claude	1c85b4daa3	Refine use cases: 8 examples with efficient patterns - Trimmed from 10 to 8 focused examples - Emphasize CLI args for DB filtering (efficient), jq for transforms - Added key examples showing `run` emits JSONL enabling chained processing: - #4: Retry failed with different binary/timeout via jq transform - #8: Recursive link following (run → jq filter → crawl → run) - Removed redundant jq domain filtering (use --url__icontains instead) - Updated summary table with "Retry w/ Changes" and "Chain Processing" patterns	2025-12-31 09:26:23 +00:00
Nick Sweeting	8dab2966cc	Consolidate Chrome test helpers across all plugin tests (#1738 ) <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk	2025-12-31 01:25:39 -08:00
Claude	1cfb77a355	Rename Python helpers to match JS function names in snake_case - get_machine_type() matches JS getMachineType() - get_lib_dir() matches JS getLibDir() - get_node_modules_dir() matches JS getNodeModulesDir() - get_extensions_dir() matches JS getExtensionsDir() - find_chromium() matches JS findChromium() - kill_chrome() matches JS killChrome() - get_test_env() matches JS getTestEnv() All functions now try JS first (single source of truth) with Python fallback. Added backward compatibility aliases for old names.	2025-12-31 09:23:47 +00:00
Claude	524e8e98c3	Capture exit codes and stderr from background hooks Extended graceful_terminate_background_hooks() to: - Reap processes with os.waitpid() to get exit codes - Write returncode to .returncode file for update_from_output() - Return detailed result dict with status, returncode, and pid Updated update_from_output() to: - Read .returncode and .stderr.log files - Determine status from returncode if no ArchiveResult JSONL record - Include stderr in output_str for failed hooks - Handle signal termination (negative returncodes like -9 for SIGKILL) - Clean up .returncode files along with other hook output files	2025-12-31 09:23:41 +00:00
Claude	0f46d8a22e	Add real-world use cases to CLI pipeline plan Added 10 practical examples demonstrating the JSONL piping architecture: 1. Basic archive with auto-cascade 2. Retry failed extractions (by status, plugin, domain) 3. Pinboard bookmark import with jq 4. GitHub repo filtering with jq regex 5. Selective extraction (screenshots only) 6. Bulk tag management 7. Deep documentation crawling 8. RSS feed monitoring 9. Archive audit with jq aggregation 10. Incremental backup with diff Also added auto-cascade principle: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots, so intermediate commands are only needed for customization.	2025-12-31 09:20:25 +00:00
Claude	adeffb4bc5	Add JS-Python path delegation to reduce Chrome-related duplication - Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js These are now the single source of truth for path calculations - Update chrome_test_helpers.py with call_chrome_utils() dispatcher - Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers - Update cleanup_chrome and kill_chromium_session to use JS killChrome - Remove unused Chrome binary search lists from singlefile hook (~25 lines) - Update readability, mercury, favicon, title tests to use shared helpers	2025-12-31 09:11:11 +00:00
Claude	b73199b33e	Refactor background hook cleanup to use graceful termination Changed Snapshot.cleanup() to gracefully terminate background hooks: 1. Send SIGTERM to all background hook processes first 2. Wait up to each hook's plugin-specific timeout 3. Send SIGKILL only to hooks still running after their timeout Added graceful_terminate_background_hooks() function in hooks.py that: - Collects all .pid files from output directory - Validates process identity using mtime - Sends SIGTERM to all valid processes in phase 1 - Polls each process for up to its plugin-specific timeout - Sends SIGKILL as last resort if timeout expires - Returns status for each hook (sigterm/sigkill/already_dead/invalid)	2025-12-31 09:03:27 +00:00
Claude	d72ab7c397	Add simpler Chrome test helpers and update test files New helpers in chrome_test_helpers.py: - get_plugin_dir(__file__) - get plugin dir from test file path - get_hook_script(dir, pattern) - find hook script by glob pattern - run_hook() - run hook script and return (returncode, stdout, stderr) - parse_jsonl_output() - parse JSONL from hook output - run_hook_and_parse() - convenience combo of above two - LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants - _LazyPath class for deferred path resolution Updated test files to use simpler patterns: - screenshot/tests/test_screenshot.py - dom/tests/test_dom.py - pdf/tests/test_pdf.py - singlefile/tests/test_singlefile.py Before: PLUGIN_DIR = Path(__file__).parent.parent After: PLUGIN_DIR = get_plugin_dir(__file__) Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules' After: from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR	2025-12-31 09:02:34 +00:00
Claude	7d74dd906c	Add Chrome CDP integration tests for singlefile - Import shared Chrome test helpers - Add test_singlefile_with_chrome_session() to verify CDP connection - Add test_singlefile_disabled_skips() for config testing - Update existing test to use get_test_env()	2025-12-31 08:57:13 +00:00
Claude	ef92a99c4a	Refactor test_chrome.py to use shared helpers - Add get_machine_type() to chrome_test_helpers.py - Update get_test_env() to include MACHINE_TYPE - Refactor test_chrome.py to import from shared helpers - Removes ~50 lines of duplicate code	2025-12-31 08:34:35 +00:00
Claude	65c839032a	Consolidate Chrome test helpers across all plugin tests - Add setup_test_env, launch_chromium_session, kill_chromium_session to chrome_test_helpers.py for extension tests - Add chromium_session context manager for cleaner test code - Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use shared helpers (~450 lines removed) - Refactor screenshot, dom, pdf tests to use shared get_test_env and get_lib_dir (~60 lines removed) - Net reduction: 228 lines of duplicate code	2025-12-31 08:30:14 +00:00
Nick Sweeting	29eb6280d3	tweak comment	2025-12-31 00:25:01 -08:00
Nick Sweeting	65b93d5a3b	tweak comment	2025-12-31 00:25:01 -08:00
Nick Sweeting	4394ce5f40	Reduce code duplication between Chrome utilities (#1737 ) This change consolidates duplicated logic between chrome_utils.js and extension installer hooks, as well as between Python plugin tests: JavaScript changes: - Add getExtensionsDir() to centralize extension directory path calculation - Add installExtensionWithCache() to handle extension install + cache workflow - Add CLI commands for new utilities - Refactor all 3 extension installers (ublock, istilldontcareaboutcookies, twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60 - Update chrome_launch hook to use getExtensionsDir() Python test changes: - Add chrome_test_helpers.py with shared Chrome session management utilities - Refactor infiniscroll and modalcloser tests to use shared helpers - setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized - Add chrome_session() context manager for automatic cleanup Net result: ~208 lines of code removed while maintaining same functionality. <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk	2025-12-31 00:19:44 -08:00
Nick Sweeting	987f4fbe0a	Review output file paths and data directory structure (#1736 ) - Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths - Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ - Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl - Update comment in chrome_tab hook to reflect new config source <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk	2025-12-31 00:19:03 -08:00
Claude	04c23badc2	Fix output path structure for 0.9.x data directory - Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths - Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ - Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl - Update comment in chrome_tab hook to reflect new config source	2025-12-31 08:18:24 +00:00
Claude	fd9ba86220	Reduce Chrome-related code duplication across JS and Python This change consolidates duplicated logic between chrome_utils.js and extension installer hooks, as well as between Python plugin tests: JavaScript changes: - Add getExtensionsDir() to centralize extension directory path calculation - Add installExtensionWithCache() to handle extension install + cache workflow - Add CLI commands for new utilities - Refactor all 3 extension installers (ublock, istilldontcareaboutcookies, twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60 - Update chrome_launch hook to use getExtensionsDir() Python test changes: - Add chrome_test_helpers.py with shared Chrome session management utilities - Refactor infiniscroll and modalcloser tests to use shared helpers - setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized - Add chrome_session() context manager for automatic cleanup Net result: ~208 lines of code removed while maintaining same functionality.	2025-12-31 08:13:00 +00:00
Nick Sweeting	cead22afc2	archivebox <modelname> create\|list\|update\|delete \| ... piping support (#1735 ) Comprehensive plan for implementing JSONL-based CLI piping: - Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix) - Phase 2: Extract shared apply_filters() to cli_utils.py - Phase 3: Implement pass-through behavior for all create commands - Phase 4-6: Test infrastructure with pytest-django, unit/integration tests Key changes from original plan: - ArchiveResult.from_json() identified as missing prerequisite - Pass-through documented as new feature to implement - archivebox run updated to create-or-update pattern - conftest.py redesigned to use pytest-django with isolated tmp_path - Standardized on tags_str field name across all models - Reordered phases: implement before test <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk	2025-12-30 23:57:33 -08:00
Nick Sweeting	84a4fb0785	fix cubic comments	2025-12-30 23:53:47 -08:00
Nick Sweeting	ca3f8f0ff1	Process.launch()/kill()/.pidfile/.wait()/etc. centralize process handling logic on model methods (#1734 ) <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk <!-- This is an auto-generated description by cubic. --> --- ## Summary by cubic Added an implementation plan to centralize subprocess handling on the machine.Process model. It covers process hierarchy, Process.current(), safe lifecycle methods (launch/kill/wait), PID reuse protection, and phased changes across hooks, workers, CLI, migrations, and admin. <sup>Written for commit `3ae9410127`. Summary will update on new commits.</sup> <!-- End of auto-generated description by cubic. -->	2025-12-30 23:42:29 -08:00
Nick Sweeting	dfe68412af	Merge branch 'dev' into claude/refactor-process-management-WcQyZ	2025-12-30 23:42:23 -08:00
claude[bot]	4285a05d19	Fix getEnvArray to parse JSON when '[' present, CSV otherwise Simplifies the comma-separated parsing logic to: - If value contains '[', parse as JSON array - Otherwise, parse as comma-separated values This prevents incorrect splitting of arguments containing internal commas when there's only one argument. For arguments with commas, users should use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]' Also exports getEnvArray in module.exports for consistency. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 07:39:49 +00:00
Nick Sweeting	3ae9410127	Update TODO_process_tracking.md	2025-12-31 02:39:36 -05:00
Nick Sweeting	e26a0f6fc0	Fix hook file overwrites in plugin directory (#1732 ) Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name: - on_Snapshot__20_chrome_tab.bg.stdout.log - on_Snapshot__20_chrome_tab.bg.stderr.log - on_Snapshot__20_chrome_tab.bg.pid - on_Snapshot__20_chrome_tab.bg.sh Updated: - hooks.py run_hook() to use hook-specific names - core/models.py cleanup and update_from_output methods - Plugin scripts to no longer write redundant hook.pid files <!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk <!-- This is an auto-generated description by cubic. --> --- ## Summary by cubic Prevented hook file collisions by giving each hook its own stdout, stderr, pid, and cmd filenames. This fixes mixed logs and ensures correct cleanup and status checks when multiple hooks run in the same plugin directory. - Bug Fixes - hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude them from new_files; derive cmd.sh from pid for safe kill. - core/models.py: read hook-specific logs; exclude hook output files when computing outputs; cleanup and background detection use *.pid. - Plugins: stop writing redundant hook.pid files; minor chrome utils cleanup. <sup>Written for commit `754b096193`. Summary will update on new commits.</sup> <!-- End of auto-generated description by cubic. -->	2025-12-30 23:36:09 -08:00
Nick Sweeting	f7b186d7c8	Apply suggestion from @cubic-dev-ai[bot] Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>	2025-12-31 02:31:46 -05:00
Nick Sweeting	dac6c63bba	working extension tests	2025-12-30 18:30:16 -08:00
Nick Sweeting	42d3fb7025	extension test fixes	2025-12-30 18:28:14 -08:00
Claude	754b096193	use hook-specific filenames to avoid overwrites Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name: - on_Snapshot__20_chrome_tab.bg.stdout.log - on_Snapshot__20_chrome_tab.bg.stderr.log - on_Snapshot__20_chrome_tab.bg.pid - on_Snapshot__20_chrome_tab.bg.sh Updated: - hooks.py run_hook() to use hook-specific names - core/models.py cleanup and update_from_output methods - Plugin scripts to no longer write redundant hook.pid files	2025-12-31 02:00:15 +00:00
Claude	df2a0dcd44	Add revised CLI pipeline architecture plan Comprehensive plan for implementing JSONL-based CLI piping: - Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix) - Phase 2: Extract shared apply_filters() to cli_utils.py - Phase 3: Implement pass-through behavior for all create commands - Phase 4-6: Test infrastructure with pytest-django, unit/integration tests Key changes from original plan: - ArchiveResult.from_json() identified as missing prerequisite - Pass-through documented as new feature to implement - archivebox run updated to create-or-update pattern - conftest.py redesigned to use pytest-django with isolated tmp_path - Standardized on tags_str field name across all models - Reordered phases: implement before test	2025-12-31 01:46:07 +00:00
Claude	b8a66c4a84	Convert Persona to Django ModelWithConfig, add to get_config() - Convert Persona from plain Python class to Django model with ModelWithConfig - Add config JSONField for persona-specific config overrides - Add get_derived_config() method that returns config with derived paths: - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA - Update get_config() to accept persona parameter in merge chain: get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot) - Remove _derive_persona_paths() - derivation now happens in Persona model - Merge order (highest to lowest priority): 1. snapshot.config 2. crawl.config 3. user.config 4. persona.get_derived_config() <- NEW 5. environment variables 6. ArchiveBox.conf file 7. plugin defaults 8. core defaults Usage: config = get_config(persona=crawl.persona, crawl=crawl) config['CHROME_USER_DATA_DIR'] # derived from persona	2025-12-31 01:07:29 +00:00
Claude	b1e31c3def	Simplify Persona class: remove convenience functions, fix get_active() - Remove standalone convenience functions (cleanup_chrome_for_persona, cleanup_chrome_all_personas) to reduce LOC - Change Persona.get_active(config) to accept config dict as argument instead of calling get_config() internally, since the caller needs to pass user/crawl/snapshot/archiveresult context for proper config	2025-12-31 01:00:52 +00:00
Claude	503a2f77cb	Add Persona class with cleanup_chrome() method - Create Persona class in personas/models.py for managing browser profiles/identities used for archiving sessions - Each Persona has: - chrome_user_data_dir: Chrome profile directory - chrome_extensions_dir: Installed extensions - cookies_file: Cookies for wget/curl - config_file: Persona-specific config overrides - Add Persona methods: - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files - get_config(): Load persona config from config.json - save_config(): Save persona config to config.json - ensure_dirs(): Create persona directory structure - all(): Iterator over all personas - get_active(): Get persona based on ACTIVE_PERSONA config - cleanup_chrome_all(): Clean up all personas - Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all() instead of manual directory iteration - Add convenience functions: - cleanup_chrome_for_persona(name) - cleanup_chrome_all_personas()	2025-12-31 00:59:37 +00:00
Claude	1a86789523	Move Chrome default args to config.json CHROME_ARGS - Add comprehensive default CHROME_ARGS in config.json with 55+ flags for deterministic rendering, security, performance, and UI suppression - Update chrome_utils.js launchChromium() to read CHROME_ARGS and CHROME_ARGS_EXTRA from environment variables (set by get_config()) - Add getEnvArray() helper to parse JSON arrays or comma-separated strings from environment variables - Separate args into three categories: 1. baseArgs: Static flags from CHROME_ARGS config (configurable) 2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.) 3. extraArgs: User overrides from CHROME_ARGS_EXTRA - Add CHROME_SANDBOX config option to control --no-sandbox flag Args are now configurable via: - config.json defaults - ArchiveBox.conf file - Environment variables - Per-crawl/snapshot config overrides	2025-12-31 00:57:29 +00:00
Claude	caee376749	Add Process.proc property for validated psutil access New section 1.5 adds @property proc that returns psutil.Process ONLY if: - PID exists in OS - OS start time matches our started_at (within tolerance) - We're on the same machine Safety features: - Validates start time via psutil.Process.create_time() - Optional command validation (binary name matches) - Returns None instead of wrong process on PID reuse Also adds convenience methods: - is_running: Check via validated psutil - get_memory_info(): RSS/VMS if running - get_cpu_percent(): CPU usage if running - get_children_pids(): Child PIDs from OS Updated kill() to use self.proc for safe killing - never kills a recycled PID since we validate start time first.	2025-12-31 00:49:58 +00:00

1 2 3 4 5 ...

4820 Commits