Commit Graph

4806 Commits

Author SHA1 Message Date
Nick Sweeting
95d61b001e fix migrations 2025-12-31 01:40:59 -08:00
Nick Sweeting
1d15901304 fix process health stats 2025-12-31 01:40:59 -08:00
Nick Sweeting
3d8c62ffb1 fix extensions dir paths add personas migration 2025-12-31 01:40:59 -08:00
Nick Sweeting
1bbb9b45a7 Change hook timeout enforcement strategy (#1739)
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Switch background hook cleanup to a graceful termination flow using
plugin-specific timeouts, only SIGKILLing if needed. This improves
reliability and records accurate exit codes and stderr for better result
reporting.

- **Refactors**
- Added graceful_terminate_background_hooks(): send SIGTERM to all
hooks, wait per plugin timeout, SIGKILL remaining, reap with waitpid,
write .returncode files.
- Snapshot.cleanup() now uses merged config (get_config) to apply
plugin-specific timeouts and terminate hooks gracefully.
- update_from_output() reads .returncode and .stderr.log, infers status
when no JSONL (handles signals like -9/-15), includes stderr on
failures, and cleans up .returncode files.

<sup>Written for commit 524e8e98c3.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-31 01:32:43 -08:00
Nick Sweeting
8dab2966cc Consolidate Chrome test helpers across all plugin tests (#1738)
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-12-31 01:25:39 -08:00
Claude
1cfb77a355 Rename Python helpers to match JS function names in snake_case
- get_machine_type() matches JS getMachineType()
- get_lib_dir() matches JS getLibDir()
- get_node_modules_dir() matches JS getNodeModulesDir()
- get_extensions_dir() matches JS getExtensionsDir()
- find_chromium() matches JS findChromium()
- kill_chrome() matches JS killChrome()
- get_test_env() matches JS getTestEnv()

All functions now try JS first (single source of truth) with Python fallback.
Added backward compatibility aliases for old names.
2025-12-31 09:23:47 +00:00
Claude
524e8e98c3 Capture exit codes and stderr from background hooks
Extended graceful_terminate_background_hooks() to:
- Reap processes with os.waitpid() to get exit codes
- Write returncode to .returncode file for update_from_output()
- Return detailed result dict with status, returncode, and pid

Updated update_from_output() to:
- Read .returncode and .stderr.log files
- Determine status from returncode if no ArchiveResult JSONL record
- Include stderr in output_str for failed hooks
- Handle signal termination (negative returncodes like -9 for SIGKILL)
- Clean up .returncode files along with other hook output files
2025-12-31 09:23:41 +00:00
Claude
adeffb4bc5 Add JS-Python path delegation to reduce Chrome-related duplication
- Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js
  These are now the single source of truth for path calculations
- Update chrome_test_helpers.py with call_chrome_utils() dispatcher
- Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers
- Update cleanup_chrome and kill_chromium_session to use JS killChrome
- Remove unused Chrome binary search lists from singlefile hook (~25 lines)
- Update readability, mercury, favicon, title tests to use shared helpers
2025-12-31 09:11:11 +00:00
Claude
b73199b33e Refactor background hook cleanup to use graceful termination
Changed Snapshot.cleanup() to gracefully terminate background hooks:
1. Send SIGTERM to all background hook processes first
2. Wait up to each hook's plugin-specific timeout
3. Send SIGKILL only to hooks still running after their timeout

Added graceful_terminate_background_hooks() function in hooks.py that:
- Collects all .pid files from output directory
- Validates process identity using mtime
- Sends SIGTERM to all valid processes in phase 1
- Polls each process for up to its plugin-specific timeout
- Sends SIGKILL as last resort if timeout expires
- Returns status for each hook (sigterm/sigkill/already_dead/invalid)
2025-12-31 09:03:27 +00:00
Claude
d72ab7c397 Add simpler Chrome test helpers and update test files
New helpers in chrome_test_helpers.py:
- get_plugin_dir(__file__) - get plugin dir from test file path
- get_hook_script(dir, pattern) - find hook script by glob pattern
- run_hook() - run hook script and return (returncode, stdout, stderr)
- parse_jsonl_output() - parse JSONL from hook output
- run_hook_and_parse() - convenience combo of above two
- LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants
- _LazyPath class for deferred path resolution

Updated test files to use simpler patterns:
- screenshot/tests/test_screenshot.py
- dom/tests/test_dom.py
- pdf/tests/test_pdf.py
- singlefile/tests/test_singlefile.py

Before: PLUGIN_DIR = Path(__file__).parent.parent
After:  PLUGIN_DIR = get_plugin_dir(__file__)

Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
After:  from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
2025-12-31 09:02:34 +00:00
Claude
7d74dd906c Add Chrome CDP integration tests for singlefile
- Import shared Chrome test helpers
- Add test_singlefile_with_chrome_session() to verify CDP connection
- Add test_singlefile_disabled_skips() for config testing
- Update existing test to use get_test_env()
2025-12-31 08:57:13 +00:00
Claude
ef92a99c4a Refactor test_chrome.py to use shared helpers
- Add get_machine_type() to chrome_test_helpers.py
- Update get_test_env() to include MACHINE_TYPE
- Refactor test_chrome.py to import from shared helpers
- Removes ~50 lines of duplicate code
2025-12-31 08:34:35 +00:00
Claude
65c839032a Consolidate Chrome test helpers across all plugin tests
- Add setup_test_env, launch_chromium_session, kill_chromium_session
  to chrome_test_helpers.py for extension tests
- Add chromium_session context manager for cleaner test code
- Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use
  shared helpers (~450 lines removed)
- Refactor screenshot, dom, pdf tests to use shared get_test_env
  and get_lib_dir (~60 lines removed)
- Net reduction: 228 lines of duplicate code
2025-12-31 08:30:14 +00:00
Nick Sweeting
29eb6280d3 tweak comment 2025-12-31 00:25:01 -08:00
Nick Sweeting
65b93d5a3b tweak comment 2025-12-31 00:25:01 -08:00
Nick Sweeting
4394ce5f40 Reduce code duplication between Chrome utilities (#1737)
This change consolidates duplicated logic between chrome_utils.js and
extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path
calculation
- Add installExtensionWithCache() to handle extension install + cache
workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock,
istilldontcareaboutcookies, twocaptcha) to use shared utilities,
reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management
utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now
centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining same
functionality.

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-12-31 00:19:44 -08:00
Nick Sweeting
987f4fbe0a Review output file paths and data directory structure (#1736)
- Update Crawl.output_dir_parent to use username instead of user_id for
consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier
debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-12-31 00:19:03 -08:00
Claude
04c23badc2 Fix output path structure for 0.9.x data directory
- Update Crawl.output_dir_parent to use username instead of user_id
  for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging:
  users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
  can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
2025-12-31 08:18:24 +00:00
Claude
fd9ba86220 Reduce Chrome-related code duplication across JS and Python
This change consolidates duplicated logic between chrome_utils.js and
extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path calculation
- Add installExtensionWithCache() to handle extension install + cache workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock, istilldontcareaboutcookies,
  twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining same functionality.
2025-12-31 08:13:00 +00:00
Nick Sweeting
cead22afc2 archivebox <modelname> create|list|update|delete | ... piping support (#1735)
Comprehensive plan for implementing JSONL-based CLI piping:
- Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix)
- Phase 2: Extract shared apply_filters() to cli_utils.py
- Phase 3: Implement pass-through behavior for all create commands
- Phase 4-6: Test infrastructure with pytest-django, unit/integration
tests

Key changes from original plan:
- ArchiveResult.from_json() identified as missing prerequisite
- Pass-through documented as new feature to implement
- archivebox run updated to create-or-update pattern
- conftest.py redesigned to use pytest-django with isolated tmp_path
- Standardized on tags_str field name across all models
- Reordered phases: implement before test

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-12-30 23:57:33 -08:00
Nick Sweeting
84a4fb0785 fix cubic comments 2025-12-30 23:53:47 -08:00
Nick Sweeting
ca3f8f0ff1 Process.launch()/kill()/.pidfile/.wait()/etc. centralize process handling logic on model methods (#1734)
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Added an implementation plan to centralize subprocess handling on the
machine.Process model. It covers process hierarchy, Process.current(),
safe lifecycle methods (launch/kill/wait), PID reuse protection, and
phased changes across hooks, workers, CLI, migrations, and admin.

<sup>Written for commit 3ae9410127.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-30 23:42:29 -08:00
Nick Sweeting
dfe68412af Merge branch 'dev' into claude/refactor-process-management-WcQyZ 2025-12-30 23:42:23 -08:00
claude[bot]
4285a05d19 Fix getEnvArray to parse JSON when '[' present, CSV otherwise
Simplifies the comma-separated parsing logic to:
- If value contains '[', parse as JSON array
- Otherwise, parse as comma-separated values

This prevents incorrect splitting of arguments containing internal commas
when there's only one argument. For arguments with commas, users should
use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]'

Also exports getEnvArray in module.exports for consistency.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 07:39:49 +00:00
Nick Sweeting
3ae9410127 Update TODO_process_tracking.md 2025-12-31 02:39:36 -05:00
Nick Sweeting
e26a0f6fc0 Fix hook file overwrites in plugin directory (#1732)
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each
hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout,
stderr, pid, and cmd filenames. This fixes mixed logs and ensures
correct cleanup and status checks when multiple hooks run in the same
plugin directory.

- **Bug Fixes**
- hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude
them from new_files; derive cmd.sh from pid for safe kill.
- core/models.py: read hook-specific logs; exclude hook output files
when computing outputs; cleanup and background detection use *.pid.
- Plugins: stop writing redundant hook.pid files; minor chrome utils
cleanup.

<sup>Written for commit 754b096193.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-30 23:36:09 -08:00
Nick Sweeting
f7b186d7c8 Apply suggestion from @cubic-dev-ai[bot]
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-12-31 02:31:46 -05:00
Nick Sweeting
dac6c63bba working extension tests 2025-12-30 18:30:16 -08:00
Nick Sweeting
42d3fb7025 extension test fixes 2025-12-30 18:28:14 -08:00
Claude
754b096193 use hook-specific filenames to avoid overwrites
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now
each hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files
2025-12-31 02:00:15 +00:00
Claude
df2a0dcd44 Add revised CLI pipeline architecture plan
Comprehensive plan for implementing JSONL-based CLI piping:
- Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix)
- Phase 2: Extract shared apply_filters() to cli_utils.py
- Phase 3: Implement pass-through behavior for all create commands
- Phase 4-6: Test infrastructure with pytest-django, unit/integration tests

Key changes from original plan:
- ArchiveResult.from_json() identified as missing prerequisite
- Pass-through documented as new feature to implement
- archivebox run updated to create-or-update pattern
- conftest.py redesigned to use pytest-django with isolated tmp_path
- Standardized on tags_str field name across all models
- Reordered phases: implement before test
2025-12-31 01:46:07 +00:00
Claude
b8a66c4a84 Convert Persona to Django ModelWithConfig, add to get_config()
- Convert Persona from plain Python class to Django model with ModelWithConfig
- Add config JSONField for persona-specific config overrides
- Add get_derived_config() method that returns config with derived paths:
  - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA

- Update get_config() to accept persona parameter in merge chain:
  get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot)

- Remove _derive_persona_paths() - derivation now happens in Persona model

- Merge order (highest to lowest priority):
  1. snapshot.config
  2. crawl.config
  3. user.config
  4. persona.get_derived_config()  <- NEW
  5. environment variables
  6. ArchiveBox.conf file
  7. plugin defaults
  8. core defaults

Usage:
  config = get_config(persona=crawl.persona, crawl=crawl)
  config['CHROME_USER_DATA_DIR']  # derived from persona
2025-12-31 01:07:29 +00:00
Claude
b1e31c3def Simplify Persona class: remove convenience functions, fix get_active()
- Remove standalone convenience functions (cleanup_chrome_for_persona,
  cleanup_chrome_all_personas) to reduce LOC
- Change Persona.get_active(config) to accept config dict as argument
  instead of calling get_config() internally, since the caller needs
  to pass user/crawl/snapshot/archiveresult context for proper config
2025-12-31 01:00:52 +00:00
Claude
503a2f77cb Add Persona class with cleanup_chrome() method
- Create Persona class in personas/models.py for managing browser
  profiles/identities used for archiving sessions

- Each Persona has:
  - chrome_user_data_dir: Chrome profile directory
  - chrome_extensions_dir: Installed extensions
  - cookies_file: Cookies for wget/curl
  - config_file: Persona-specific config overrides

- Add Persona methods:
  - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files
  - get_config(): Load persona config from config.json
  - save_config(): Save persona config to config.json
  - ensure_dirs(): Create persona directory structure
  - all(): Iterator over all personas
  - get_active(): Get persona based on ACTIVE_PERSONA config
  - cleanup_chrome_all(): Clean up all personas

- Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all()
  instead of manual directory iteration

- Add convenience functions:
  - cleanup_chrome_for_persona(name)
  - cleanup_chrome_all_personas()
2025-12-31 00:59:37 +00:00
Claude
1a86789523 Move Chrome default args to config.json CHROME_ARGS
- Add comprehensive default CHROME_ARGS in config.json with 55+ flags
  for deterministic rendering, security, performance, and UI suppression

- Update chrome_utils.js launchChromium() to read CHROME_ARGS and
  CHROME_ARGS_EXTRA from environment variables (set by get_config())

- Add getEnvArray() helper to parse JSON arrays or comma-separated
  strings from environment variables

- Separate args into three categories:
  1. baseArgs: Static flags from CHROME_ARGS config (configurable)
  2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.)
  3. extraArgs: User overrides from CHROME_ARGS_EXTRA

- Add CHROME_SANDBOX config option to control --no-sandbox flag

Args are now configurable via:
  - config.json defaults
  - ArchiveBox.conf file
  - Environment variables
  - Per-crawl/snapshot config overrides
2025-12-31 00:57:29 +00:00
Claude
caee376749 Add Process.proc property for validated psutil access
New section 1.5 adds @property proc that returns psutil.Process ONLY if:
- PID exists in OS
- OS start time matches our started_at (within tolerance)
- We're on the same machine

Safety features:
- Validates start time via psutil.Process.create_time()
- Optional command validation (binary name matches)
- Returns None instead of wrong process on PID reuse

Also adds convenience methods:
- is_running: Check via validated psutil
- get_memory_info(): RSS/VMS if running
- get_cpu_percent(): CPU usage if running
- get_children_pids(): Child PIDs from OS

Updated kill() to use self.proc for safe killing - never kills
a recycled PID since we validate start time first.
2025-12-31 00:49:58 +00:00
Claude
f3c91b4c4e Add detailed supervisord Process tracking to plan
Phase 3.3 now includes:
- Module-level _supervisord_db_process variable
- start_new_supervisord_process(): Create Process record after Popen
- stop_existing_supervisord_process(): Update Process status on shutdown
- Process hierarchy diagram showing CLI → supervisord → workers chain

Key insight: PPID-based linking works because workers call Process.current()
in on_startup(), which finds supervisord's Process via PPID lookup.
2025-12-31 00:45:10 +00:00
Claude
e41ca37848 Add detailed hook/run() changes to Process tracking plan
Phase 2 now includes line-by-line mapping of:

- run_hook(): Create Process record, use Process.launch(), parse
  JSONL for child binary Process records
- process_is_alive(): Accept Path or Process, use Process.is_alive()
- kill_process(): Accept Path or Process, use Process.kill()
- ArchiveResult.run(): Pass self.process as parent_process to run_hook()
- ArchiveResult.update_from_output(): Read from Process.stdout/stderr
- Snapshot.cleanup(): Kill via Process model, fallback to PID files
- Snapshot.has_running_background_hooks(): Check via Process model

Hook JSONL contract updated to support {"type": "Process"} records
for tracking binary executions within hooks.
2025-12-31 00:44:10 +00:00
Claude
554d743719 Add robust PID reuse protection to Process.current() plan
PIDs are recycled by OS, so all Process queries now:
- Filter by machine=Machine.current() (PIDs unique per machine)
- Filter by started_at within PID_REUSE_WINDOW (24h)
- Validate start time matches OS via psutil.Process.create_time()

Added:
- ProcessManager.get_by_pid() for safe PID lookups
- Process.cleanup_stale_running() to mark orphaned RUNNING as EXITED
- START_TIME_TOLERANCE (5s) for start time comparison
- Uses psutil.Process.create_time() for accurate started_at
2025-12-31 00:36:01 +00:00
Claude
4c4c065697 Add Process.current() to implementation plan
Key addition: Process.current() class method (like Machine.current())
that auto-creates/retrieves the Process record for the current OS process.

Benefits:
- Uses PPID lookup to find parent Process automatically
- Detects process_type from sys.argv
- Cached with validation (like Machine.current())
- Eliminates need for thread-local context management

Simplified Phase 3 (workers) and Phase 4 (CLI) to just call
Process.current() instead of manual Process creation.
2025-12-31 00:32:05 +00:00
Claude
f21fb55a2c Add comprehensive implementation plan for Process hierarchy tracking
Documents 7-phase refactoring to use machine.Process as the core data
model for all subprocess management:

- Phase 1: Add parent FK and process_type to Process model
- Phase 2: Add lifecycle methods (launch, kill, poll, wait)
- Phase 3: Update hook system to create Process records
- Phase 4-5: Track workers/orchestrator/supervisord as Process
- Phase 6: Create root Process on CLI invocation
- Phase 7: Admin UI with tree visualization

Enables full process hierarchy tracking from CLI → binary execution.
2025-12-31 00:28:17 +00:00
Claude
877b5f91c2 Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system
- Add _derive_persona_paths() in configset.py to automatically derive
  CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA
  when not explicitly set. This allows plugins to use these paths
  without knowing about the persona system.

- Update chrome_utils.js launchChromium() to accept userDataDir option
  and pass --user-data-dir to Chrome. Also cleans up SingletonLock
  before launch.

- Update killZombieChrome() to clean up SingletonLock files from all
  persona chrome_user_data directories after killing zombies.

- Update chrome_cleanup() in misc/util.py to handle persona-based
  user data directories when cleaning up stale Chrome state.

- Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR
  and CHROME_EXTENSIONS_DIR from env (derived by get_config()).

Config priority flow:
  ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot)
  -> get_config() derives:
     CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data
     CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions
  -> hooks receive these as env vars without needing persona logic
2025-12-31 00:21:07 +00:00
Nick Sweeting
dd2302ad92 new jsonl cli interface 2025-12-30 16:12:53 -08:00
Nick Sweeting
ba8c28a866 use process_set for related name not processes 2025-12-30 12:55:23 -08:00
Nick Sweeting
1b49ea9a0e improve jsonl logic 2025-12-30 12:43:36 -08:00
Nick Sweeting
08366cfa46 document chrome configs 2025-12-30 12:42:50 -08:00
Nick Sweeting
93a78ce595 Convert snapshot index from JSON to JSONL (#1730)
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [x] Snapshot data layout on disk

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Switch snapshot index storage from index.json to a flat index.jsonl
format for easier parsing and extensibility. Includes automatic
migration and backward-compatible reading, plus updated CLI pipeline to
emit/consume JSONL records.

- **New Features**
- Write and read index.jsonl with per-line records (Snapshot,
ArchiveResult, Binary, Process); reconcile prefers JSONL.
- Auto-convert legacy index.json to JSONL during migration/update;
load_from_directory/create_from_directory support both formats.
- Serialization moved to model to_jsonl methods; added schema_version to
all records, including Tag, Crawl, Binary, and Process.
- CLI pipeline updated: crawl creates a single Crawl job from all input
URLs and outputs Crawl JSONL (no immediate crawling); snapshot accepts
Crawl JSONL/IDs and outputs Snapshot JSONL; extract outputs
ArchiveResult JSONL via model methods.

- **Migration**
- Conversion runs during filesystem migration and reconcile; no manual
steps.
- Legacy index.json is deleted after conversion; external tools should
switch to index.jsonl.

<sup>Written for commit 251fe33e49.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-30 12:32:52 -08:00
claude[bot]
251fe33e49 fix: rename --plugin to --plugins for consistency
Changed from singular --plugin to plural --plugins in both snapshot and extract
commands to match the pattern in archivebox add command. Updated to accept
comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title).

- Updated CLI option from --plugin to --plugins
- Added parsing for comma-separated plugin names
- Updated function signatures and logic to handle multiple plugins
- Updated help text, docstrings, and examples

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:20:29 +00:00
claude[bot]
64db6deab3 fix: revert incorrect --extract renaming, restore --plugin parameter
The --plugins parameter was incorrectly renamed to --extract (boolean).
This restores --plugin (singular, matching extract command) with correct
semantics: specify which plugin to run after creating snapshots.

- Changed --extract/--no-extract back to --plugin (string parameter)
- Updated function signature and logic to use plugin parameter
- Added ArchiveResult creation for specific plugin when --plugin is passed
- Updated docstring and examples

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:15:48 +00:00
claude[bot]
762cddc8c5 fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00