44 Commits

Author SHA1 Message Date
Nick Sweeting
dd77511026 unified Process source of truth and better screenshot tests 2026-01-02 04:20:34 -08:00
Nick Sweeting
3672174dad fix transition mid transition 2026-01-02 00:24:44 -08:00
Nick Sweeting
65ee09ceab move tests into subfolder, add missing install hooks 2026-01-02 00:22:07 -08:00
Nick Sweeting
9008cefca2 codecov, migrations, orchestrator fixes 2026-01-01 16:57:04 -08:00
Nick Sweeting
876feac522 actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage 2026-01-01 15:50:00 -08:00
Claude
4d33084496 Remove redundant chrome_validate hook, rename wget_validate to wget_install
- Delete chrome/on_Crawl__10_chrome_validate.py (duplicates chrome_install)
- Rename wget/on_Crawl__11_wget_validate.py → on_Crawl__06_wget_install.py

All hooks now follow consistent naming: install, launch, or config
2025-12-31 23:41:40 +00:00
Claude
4c77949197 Clean up on_Crawl hooks: remove duplicates and standardize naming
Deleted dead/duplicate hooks:
- wget/on_Crawl__10_install_wget.py (duplicate of __10_wget_validate_config.py)
- chrome/on_Crawl__00_chrome_install.py (simpler version, kept full one)
- chrome/on_Crawl__20_chrome_launch.bg.js (legacy, kept __30 version)
- singlefile/on_Crawl__20_install_singlefile_extension.js (disabled/dead)
- istilldontcareaboutcookies/on_Crawl__20_install_*.js (legacy)
- ublock/on_Crawl__03_ublock.js (legacy, kept __20 version)
- Entire captcha2/ plugin (legacy version of twocaptcha/)

Renamed hooks to follow consistent pattern: on_Crawl__XX_<plugin>_<action>.<ext>
Priority bands:
  00-09: Binary/extension installation
  10-19: Config validation
  20-29: Browser launch and post-launch config

Final hooks:
  00 ripgrep_install.py, 01 chrome_install.py
  02 istilldontcareaboutcookies_install.js
  03 ublock_install.js, 04 singlefile_install.js
  05 twocaptcha_install.js
  10 chrome_validate.py, 11 wget_validate.py
  20 chrome_launch.bg.js, 25 twocaptcha_config.js
2025-12-31 22:47:36 +00:00
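The naming pattern and priority bands described in the commit above can be sketched as a small filename parser. This is only an illustration, not code from the repository: `parse_hook`, `band_for`, and the `Hook` tuple are hypothetical names, and treating a `.bg` infix as a "background" marker is an assumption based on the filenames listed.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: on_<Event>__<NN>_<plugin>_<action>[.bg].<ext>
HOOK_RE = re.compile(
    r"^on_(?P<event>[A-Za-z]+)__(?P<priority>\d{2})_"
    r"(?P<plugin>[A-Za-z0-9_]+)_(?P<action>[A-Za-z0-9]+)"
    r"(?P<bg>\.bg)?\.(?P<ext>[A-Za-z0-9]+)$"
)

# Priority bands as listed in the commit message.
BANDS = [
    (range(0, 10), "Binary/extension installation"),
    (range(10, 20), "Config validation"),
    (range(20, 30), "Browser launch and post-launch config"),
]


class Hook(NamedTuple):
    event: str
    priority: int
    plugin: str
    action: str
    background: bool  # assumption: ".bg" marks a background hook
    ext: str


def parse_hook(filename: str) -> Optional[Hook]:
    """Split a hook filename into its parts, or return None if it does not match."""
    m = HOOK_RE.match(filename)
    if m is None:
        return None
    return Hook(
        event=m["event"],
        priority=int(m["priority"]),
        plugin=m["plugin"],
        action=m["action"],
        background=m["bg"] is not None,
        ext=m["ext"],
    )


def band_for(priority: int) -> Optional[str]:
    """Return the band label for a priority, or None outside the listed bands."""
    for band, label in BANDS:
        if priority in band:
            return label
    return None
```

For example, `parse_hook("on_Crawl__20_chrome_launch.bg.js")` yields priority 20, plugin `chrome`, action `launch`, and `band_for(20)` maps it to the browser-launch band.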
Nick Sweeting
d5c0c64dcd fix progress bars 2025-12-31 12:34:29 -08:00
Nick Sweeting
cb97f6651b Add DNS traffic recorder plugin (#1748) 2025-12-31 11:02:43 -08:00
Claude
5d8c93eaf4 Consolidate CDP connection logic into chrome_utils.js
Add shared snapshot hook utilities to chrome_utils.js:
- parseArgs(): CLI argument parsing
- waitForChromeSession(): Wait for CDP session files
- readCdpUrl(): Read CDP WebSocket URL
- readTargetId(): Read target page ID
- connectToPage(): High-level browser/page connection
- waitForPageLoaded(): Wait for navigation completion

Refactor ssl, responses, and dns plugins to use shared utilities,
eliminating ~100 lines of duplicated code across plugins.
2025-12-31 12:15:30 +00:00
Claude
73425fa984 Add persona CLI command with browser cookie import
- Add `archivebox persona create/list/update/delete` commands
- Support `--import=chrome|firefox|brave` to copy browser profile
- Extract cookies via CDP to generate cookies.txt for non-browser tools
- Fix JSDoc comment parsing issue in chrome_utils.js
2025-12-31 12:13:07 +00:00
claude[bot]
5121b0e5f9 Merge branch 'dev' into claude/refactor-process-management-WcQyZ
Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:28:47 +00:00
Nick Sweeting
1d15901304 fix process health stats 2025-12-31 01:40:59 -08:00
Nick Sweeting
3d8c62ffb1 fix extensions dir paths add personas migration 2025-12-31 01:40:59 -08:00
Nick Sweeting
8dab2966cc Consolidate Chrome test helpers across all plugin tests (#1738)
2025-12-31 01:25:39 -08:00
Claude
1cfb77a355 Rename Python helpers to match JS function names in snake_case
- get_machine_type() matches JS getMachineType()
- get_lib_dir() matches JS getLibDir()
- get_node_modules_dir() matches JS getNodeModulesDir()
- get_extensions_dir() matches JS getExtensionsDir()
- find_chromium() matches JS findChromium()
- kill_chrome() matches JS killChrome()
- get_test_env() matches JS getTestEnv()

All functions now try JS first (single source of truth) with Python fallback.
Added backward compatibility aliases for old names.
2025-12-31 09:23:47 +00:00
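The "try JS first, fall back to Python" behaviour mentioned above might look roughly like the following. This is a sketch under assumptions: the helper name `call_js_or_fallback`, the `node chrome_utils.js <command>` invocation form, and the 10 second timeout are all hypothetical, not taken from the repository.

```python
import subprocess
from typing import Callable


def call_js_or_fallback(
    command: str,
    fallback: Callable[[], str],
    script: str = "chrome_utils.js",   # assumed script path
    runner=subprocess.run,             # injectable so it can be faked in tests
) -> str:
    """Ask the JS utility for a value; use the Python fallback if that fails."""
    try:
        result = runner(
            ["node", script, command],
            capture_output=True,
            text=True,
            timeout=10,
        )
        output = result.stdout.strip()
        if result.returncode == 0 and output:
            return output
    except (OSError, subprocess.SubprocessError):
        # node missing, script missing, or the call timed out
        pass
    return fallback()
```

Keeping the fallback as a callable means the Python implementation only runs when the JS call does not produce a usable answer.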
Claude
adeffb4bc5 Add JS-Python path delegation to reduce Chrome-related duplication
- Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js
  These are now the single source of truth for path calculations
- Update chrome_test_helpers.py with call_chrome_utils() dispatcher
- Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers
- Update cleanup_chrome and kill_chromium_session to use JS killChrome
- Remove unused Chrome binary search lists from singlefile hook (~25 lines)
- Update readability, mercury, favicon, title tests to use shared helpers
2025-12-31 09:11:11 +00:00
Claude
d72ab7c397 Add simpler Chrome test helpers and update test files
New helpers in chrome_test_helpers.py:
- get_plugin_dir(__file__) - get plugin dir from test file path
- get_hook_script(dir, pattern) - find hook script by glob pattern
- run_hook() - run hook script and return (returncode, stdout, stderr)
- parse_jsonl_output() - parse JSONL from hook output
- run_hook_and_parse() - convenience combo of above two
- LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants
- _LazyPath class for deferred path resolution

Updated test files to use simpler patterns:
- screenshot/tests/test_screenshot.py
- dom/tests/test_dom.py
- pdf/tests/test_pdf.py
- singlefile/tests/test_singlefile.py

Before: PLUGIN_DIR = Path(__file__).parent.parent
After:  PLUGIN_DIR = get_plugin_dir(__file__)

Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
After:  from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
2025-12-31 09:02:34 +00:00
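A deferred path constant like the `_LazyPath` mentioned above could be built along these lines. This is a minimal sketch, not the repository's class: `LazyPath` and its resolver argument are hypothetical, and only `/`, `str()`, and `os.fspath()` are supported here.

```python
import os
from pathlib import Path
from typing import Callable, Optional, Union


class LazyPath(os.PathLike):
    """Path-like wrapper that resolves its value on first use and caches it."""

    def __init__(self, resolver: Callable[[], Union[str, Path]]):
        self._resolver = resolver
        self._value: Optional[Path] = None

    def _get(self) -> Path:
        # Run the (possibly expensive) resolver only once.
        if self._value is None:
            self._value = Path(self._resolver())
        return self._value

    def __fspath__(self) -> str:
        return os.fspath(self._get())

    def __truediv__(self, other: Union[str, Path]) -> Path:
        return self._get() / other

    def __str__(self) -> str:
        return str(self._get())
```

Importing a module-level `LIB_DIR = LazyPath(...)` then costs nothing until a test actually joins or opens the path.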
Claude
ef92a99c4a Refactor test_chrome.py to use shared helpers
- Add get_machine_type() to chrome_test_helpers.py
- Update get_test_env() to include MACHINE_TYPE
- Refactor test_chrome.py to import from shared helpers
- Removes ~50 lines of duplicate code
2025-12-31 08:34:35 +00:00
Claude
65c839032a Consolidate Chrome test helpers across all plugin tests
- Add setup_test_env, launch_chromium_session, kill_chromium_session
  to chrome_test_helpers.py for extension tests
- Add chromium_session context manager for cleaner test code
- Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use
  shared helpers (~450 lines removed)
- Refactor screenshot, dom, pdf tests to use shared get_test_env
  and get_lib_dir (~60 lines removed)
- Net reduction: 228 lines of duplicate code
2025-12-31 08:30:14 +00:00
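The context-manager pattern named above can be illustrated with a generic sketch. The `launch` and `kill` callables are hypothetical stand-ins for the real session helpers, whose actual signatures are not shown in this log.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

S = TypeVar("S")


@contextmanager
def chromium_session(launch: Callable[[], S], kill: Callable[[S], None]) -> Iterator[S]:
    """Launch a session, hand it to the test, and always clean it up afterwards."""
    session = launch()
    try:
        yield session
    finally:
        # Runs even if the test body raises, so no browser is left behind.
        kill(session)
```

A test would then wrap its body in `with chromium_session(launch, kill) as session:` instead of pairing setup and teardown calls by hand.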
Nick Sweeting
4394ce5f40 Reduce code duplication between Chrome utilities (#1737)
This change consolidates duplicated logic between chrome_utils.js and
extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path calculation
- Add installExtensionWithCache() to handle extension install + cache workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock, istilldontcareaboutcookies,
  twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining same functionality.

2025-12-31 00:19:44 -08:00
Claude
04c23badc2 Fix output path structure for 0.9.x data directory
- Update Crawl.output_dir_parent to use username instead of user_id
  for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging:
  users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
  can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
2025-12-31 08:18:24 +00:00
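The crawl path layout `users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/` described above could be assembled roughly as follows. The function name, its parameters, and the `"unknown"` placeholder for URLs without a hostname are assumptions made for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath
from urllib.parse import urlparse


def crawl_output_dir(
    username: str,
    created_at: datetime,
    first_url: str,
    crawl_id: str,
) -> PurePosixPath:
    """Build the relative crawl directory from the layout in the commit message."""
    # Hypothetical fallback when the first URL has no hostname.
    domain = urlparse(first_url).hostname or "unknown"
    return (
        PurePosixPath("users")
        / username
        / "crawls"
        / created_at.strftime("%Y%m%d")
        / domain
        / crawl_id
    )
```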
Claude
fd9ba86220 Reduce Chrome-related code duplication across JS and Python
This change consolidates duplicated logic between chrome_utils.js and
extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path calculation
- Add installExtensionWithCache() to handle extension install + cache workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock, istilldontcareaboutcookies,
  twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining same functionality.
2025-12-31 08:13:00 +00:00
Nick Sweeting
84a4fb0785 fix cubic comments 2025-12-30 23:53:47 -08:00
claude[bot]
4285a05d19 Fix getEnvArray to parse JSON when '[' present, CSV otherwise
Simplifies the comma-separated parsing logic to:
- If value contains '[', parse as JSON array
- Otherwise, parse as comma-separated values

This prevents incorrect splitting of arguments containing internal commas
when there's only one argument. For arguments with commas, users should
use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]'

Also exports getEnvArray in module.exports for consistency.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 07:39:49 +00:00
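The parsing rule above (JSON when a `[` is present, comma-separated otherwise) is small enough to sketch in Python. This is not the JS `getEnvArray` itself: the `get_env_array` name, the `env` parameter, and letting invalid JSON raise are assumptions for the sake of a testable example.

```python
import json
import os
from typing import List, Mapping, Optional


def get_env_array(name: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Read a list-valued setting from the environment."""
    source = os.environ if env is None else env
    raw = source.get(name, "").strip()
    if not raw:
        return []
    if "[" in raw:
        # JSON form keeps commas that are part of a single argument.
        parsed = json.loads(raw)  # assumption: invalid JSON is allowed to raise
        if not isinstance(parsed, list):
            raise ValueError(f"{name} must be a JSON array")
        return [str(item) for item in parsed]
    # Plain form: split on commas and drop empty pieces.
    return [part.strip() for part in raw.split(",") if part.strip()]
```

With `CHROME_ARGS='["--arg1,val", "--arg2"]'` the first argument keeps its internal comma, while `CHROME_ARGS='--a, --b'` is split into two flags.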
Nick Sweeting
e26a0f6fc0 Fix hook file overwrites in plugin directory (#1732)
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each
hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files

## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout,
stderr, pid, and cmd filenames. This fixes mixed logs and ensures
correct cleanup and status checks when multiple hooks run in the same
plugin directory.

- **Bug Fixes**
  - hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude
    them from new_files; derive cmd.sh from pid for safe kill.
  - core/models.py: read hook-specific logs; exclude hook output files
    when computing outputs; cleanup and background detection use *.pid.
  - Plugins: stop writing redundant hook.pid files; minor chrome utils
    cleanup.

Written for commit 754b096193.
2025-12-30 23:36:09 -08:00
Nick Sweeting
dac6c63bba working extension tests 2025-12-30 18:30:16 -08:00
Nick Sweeting
42d3fb7025 extension test fixes 2025-12-30 18:28:14 -08:00
Claude
754b096193 use hook-specific filenames to avoid overwrites
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now
each hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files
2025-12-31 02:00:15 +00:00
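The hook-specific filenames listed above can be derived from the hook script name, as in this sketch. The helper name `hook_output_paths` and the assumption that the prefix is the script name minus its final extension are illustrative, not taken from `hooks.py`.

```python
from pathlib import Path
from typing import Dict


def hook_output_paths(hook_script: str, output_dir: str) -> Dict[str, Path]:
    """Return per-hook log, pid, and cmd paths so hooks do not overwrite each other."""
    # .stem drops only the last suffix, so "x.bg.js" becomes "x.bg".
    prefix = Path(hook_script).stem
    out = Path(output_dir)
    return {
        "stdout": out / f"{prefix}.stdout.log",
        "stderr": out / f"{prefix}.stderr.log",
        "pid": out / f"{prefix}.pid",
        "cmd": out / f"{prefix}.sh",
    }
```

Given a script named `on_Snapshot__20_chrome_tab.bg.js` (an assumed input), the pid file becomes `on_Snapshot__20_chrome_tab.bg.pid`, matching the names in the commit message.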
Claude
1a86789523 Move Chrome default args to config.json CHROME_ARGS
- Add comprehensive default CHROME_ARGS in config.json with 55+ flags
  for deterministic rendering, security, performance, and UI suppression

- Update chrome_utils.js launchChromium() to read CHROME_ARGS and
  CHROME_ARGS_EXTRA from environment variables (set by get_config())

- Add getEnvArray() helper to parse JSON arrays or comma-separated
  strings from environment variables

- Separate args into three categories:
  1. baseArgs: Static flags from CHROME_ARGS config (configurable)
  2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.)
  3. extraArgs: User overrides from CHROME_ARGS_EXTRA

- Add CHROME_SANDBOX config option to control --no-sandbox flag

Args are now configurable via:
  - config.json defaults
  - ArchiveBox.conf file
  - Environment variables
  - Per-crawl/snapshot config overrides
2025-12-31 00:57:29 +00:00
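The three argument categories above could be combined along these lines. This is a sketch only: the function name, the base/dynamic/extra ordering, and the exact-duplicate removal are assumptions, and the real `launchChromium()` computes more dynamic flags than the three shown.

```python
from typing import List, Sequence


def build_chrome_args(
    base_args: Sequence[str],
    extra_args: Sequence[str],
    *,
    port: int,
    sandbox: bool,
    headless: bool,
) -> List[str]:
    """Merge static, runtime-computed, and user-supplied Chrome flags."""
    # dynamicArgs: values only known at launch time.
    dynamic_args = [f"--remote-debugging-port={port}"]
    if not sandbox:
        dynamic_args.append("--no-sandbox")
    if headless:
        dynamic_args.append("--headless=new")

    # Assumed order: baseArgs, then dynamicArgs, then extraArgs.
    merged = [*base_args, *dynamic_args, *extra_args]
    # Drop exact duplicates while keeping first-seen order.
    return list(dict.fromkeys(merged))
```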
Claude
877b5f91c2 Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system
- Add _derive_persona_paths() in configset.py to automatically derive
  CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA
  when not explicitly set. This allows plugins to use these paths
  without knowing about the persona system.

- Update chrome_utils.js launchChromium() to accept userDataDir option
  and pass --user-data-dir to Chrome. Also cleans up SingletonLock
  before launch.

- Update killZombieChrome() to clean up SingletonLock files from all
  persona chrome_user_data directories after killing zombies.

- Update chrome_cleanup() in misc/util.py to handle persona-based
  user data directories when cleaning up stale Chrome state.

- Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR
  and CHROME_EXTENSIONS_DIR from env (derived by get_config()).

Config priority flow:
  ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot)
  -> get_config() derives:
     CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data
     CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions
  -> hooks receive these as env vars without needing persona logic
2025-12-31 00:21:07 +00:00
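The config priority flow above can be sketched as a pure function over a config dict. This is an illustration, not the `_derive_persona_paths()` in configset.py: the function name here, the dict-in/dict-out shape, and treating empty strings as "not explicitly set" are assumptions.

```python
from pathlib import Path
from typing import Dict


def derive_persona_paths(config: Dict[str, str]) -> Dict[str, str]:
    """Fill in Chrome paths from ACTIVE_PERSONA unless they are already set."""
    derived = dict(config)
    persona = derived.get("ACTIVE_PERSONA")
    personas_dir = derived.get("PERSONAS_DIR")
    if not persona or not personas_dir:
        return derived

    base = Path(personas_dir) / persona
    defaults = {
        "CHROME_USER_DATA_DIR": base / "chrome_user_data",
        "CHROME_EXTENSIONS_DIR": base / "chrome_extensions",
    }
    for key, path in defaults.items():
        # Explicit values win over derived ones.
        if not derived.get(key):
            derived[key] = str(path)
    return derived
```

Hooks can then read `CHROME_USER_DATA_DIR` and `CHROME_EXTENSIONS_DIR` from their environment without knowing that personas exist.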
Nick Sweeting
dd2302ad92 new jsonl cli interface 2025-12-30 16:12:53 -08:00
Nick Sweeting
08366cfa46 document chrome configs 2025-12-30 12:42:50 -08:00
Nick Sweeting
80f75126c6 more fixes 2025-12-29 21:03:05 -08:00
Nick Sweeting
7e6e3be9e7 messing with chrome install process to reuse cached chromium with pinned version 2025-12-29 18:49:36 -08:00
Nick Sweeting
b670612685 centralize chrome pid and zombie logic in chrome_utils 2025-12-29 17:57:23 -08:00
Nick Sweeting
4ba3e8d120 fix extension loading and consolidate chromium logic 2025-12-29 17:47:37 -08:00
Nick Sweeting
967c5d53e0 make plugin config more consistent 2025-12-29 13:21:46 -08:00
Nick Sweeting
30c60eef76 much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
1e4d3ffd11 improve plugin tests and config 2025-12-29 00:45:23 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Nick Sweeting
4ccb0863bb continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script 2025-12-28 05:29:24 -08:00
Nick Sweeting
bd265c0083 rename extractor to plugin everywhere 2025-12-28 04:43:15 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00