ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-04 09:55:33 +10:00

Author	SHA1	Message	Date
Claude	adeffb4bc5	Add JS-Python path delegation to reduce Chrome-related duplication - Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js These are now the single source of truth for path calculations - Update chrome_test_helpers.py with call_chrome_utils() dispatcher - Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers - Update cleanup_chrome and kill_chromium_session to use JS killChrome - Remove unused Chrome binary search lists from singlefile hook (~25 lines) - Update readability, mercury, favicon, title tests to use shared helpers	2025-12-31 09:11:11 +00:00
Claude	d72ab7c397	Add simpler Chrome test helpers and update test files New helpers in chrome_test_helpers.py: - get_plugin_dir(__file__) - get plugin dir from test file path - get_hook_script(dir, pattern) - find hook script by glob pattern - run_hook() - run hook script and return (returncode, stdout, stderr) - parse_jsonl_output() - parse JSONL from hook output - run_hook_and_parse() - convenience combo of above two - LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants - _LazyPath class for deferred path resolution Updated test files to use simpler patterns: - screenshot/tests/test_screenshot.py - dom/tests/test_dom.py - pdf/tests/test_pdf.py - singlefile/tests/test_singlefile.py Before: PLUGIN_DIR = Path(__file__).parent.parent After: PLUGIN_DIR = get_plugin_dir(__file__) Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules' After: from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR	2025-12-31 09:02:34 +00:00
Claude	7d74dd906c	Add Chrome CDP integration tests for singlefile - Import shared Chrome test helpers - Add test_singlefile_with_chrome_session() to verify CDP connection - Add test_singlefile_disabled_skips() for config testing - Update existing test to use get_test_env()	2025-12-31 08:57:13 +00:00
Claude	ef92a99c4a	Refactor test_chrome.py to use shared helpers - Add get_machine_type() to chrome_test_helpers.py - Update get_test_env() to include MACHINE_TYPE - Refactor test_chrome.py to import from shared helpers - Removes ~50 lines of duplicate code	2025-12-31 08:34:35 +00:00
Claude	65c839032a	Consolidate Chrome test helpers across all plugin tests - Add setup_test_env, launch_chromium_session, kill_chromium_session to chrome_test_helpers.py for extension tests - Add chromium_session context manager for cleaner test code - Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use shared helpers (~450 lines removed) - Refactor screenshot, dom, pdf tests to use shared get_test_env and get_lib_dir (~60 lines removed) - Net reduction: 228 lines of duplicate code	2025-12-31 08:30:14 +00:00
Claude	fd9ba86220	Reduce Chrome-related code duplication across JS and Python This change consolidates duplicated logic between chrome_utils.js and extension installer hooks, as well as between Python plugin tests: JavaScript changes: - Add getExtensionsDir() to centralize extension directory path calculation - Add installExtensionWithCache() to handle extension install + cache workflow - Add CLI commands for new utilities - Refactor all 3 extension installers (ublock, istilldontcareaboutcookies, twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60 - Update chrome_launch hook to use getExtensionsDir() Python test changes: - Add chrome_test_helpers.py with shared Chrome session management utilities - Refactor infiniscroll and modalcloser tests to use shared helpers - setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized - Add chrome_session() context manager for automatic cleanup Net result: ~208 lines of code removed while maintaining same functionality.	2025-12-31 08:13:00 +00:00
Nick Sweeting	84a4fb0785	fix cubic comments	2025-12-30 23:53:47 -08:00
claude[bot]	4285a05d19	Fix getEnvArray to parse JSON when '[' present, CSV otherwise Simplifies the comma-separated parsing logic to: - If value contains '[', parse as JSON array - Otherwise, parse as comma-separated values This prevents incorrect splitting of arguments containing internal commas when there's only one argument. For arguments with commas, users should use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]' Also exports getEnvArray in module.exports for consistency. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 07:39:49 +00:00
Nick Sweeting	e26a0f6fc0	Fix hook file overwrites in plugin directory (#1732) Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name: - on_Snapshot__20_chrome_tab.bg.stdout.log - on_Snapshot__20_chrome_tab.bg.stderr.log - on_Snapshot__20_chrome_tab.bg.pid - on_Snapshot__20_chrome_tab.bg.sh Updated: - hooks.py run_hook() to use hook-specific names - core/models.py cleanup and update_from_output methods - Plugin scripts to no longer write redundant hook.pid files ## Summary by cubic Prevented hook file collisions by giving each hook its own stdout, stderr, pid, and cmd filenames. This fixes mixed logs and ensures correct cleanup and status checks when multiple hooks run in the same plugin directory. - Bug Fixes - hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude them from new_files; derive cmd.sh from pid for safe kill. - core/models.py: read hook-specific logs; exclude hook output files when computing outputs; cleanup and background detection use *.pid. - Plugins: stop writing redundant hook.pid files; minor chrome utils cleanup. Written for commit `754b096193`.	2025-12-30 23:36:09 -08:00
Nick Sweeting	f7b186d7c8	Apply suggestion from @cubic-dev-ai[bot] Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>	2025-12-31 02:31:46 -05:00
Nick Sweeting	dac6c63bba	working extension tests	2025-12-30 18:30:16 -08:00
Nick Sweeting	42d3fb7025	extension test fixes	2025-12-30 18:28:14 -08:00
Claude	754b096193	use hook-specific filenames to avoid overwrites Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name: - on_Snapshot__20_chrome_tab.bg.stdout.log - on_Snapshot__20_chrome_tab.bg.stderr.log - on_Snapshot__20_chrome_tab.bg.pid - on_Snapshot__20_chrome_tab.bg.sh Updated: - hooks.py run_hook() to use hook-specific names - core/models.py cleanup and update_from_output methods - Plugin scripts to no longer write redundant hook.pid files	2025-12-31 02:00:15 +00:00
Claude	b8a66c4a84	Convert Persona to Django ModelWithConfig, add to get_config() - Convert Persona from plain Python class to Django model with ModelWithConfig - Add config JSONField for persona-specific config overrides - Add get_derived_config() method that returns config with derived paths: - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA - Update get_config() to accept persona parameter in merge chain: get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot) - Remove _derive_persona_paths() - derivation now happens in Persona model - Merge order (highest to lowest priority): 1. snapshot.config 2. crawl.config 3. user.config 4. persona.get_derived_config() <- NEW 5. environment variables 6. ArchiveBox.conf file 7. plugin defaults 8. core defaults Usage: config = get_config(persona=crawl.persona, crawl=crawl) config['CHROME_USER_DATA_DIR'] # derived from persona	2025-12-31 01:07:29 +00:00
Claude	b1e31c3def	Simplify Persona class: remove convenience functions, fix get_active() - Remove standalone convenience functions (cleanup_chrome_for_persona, cleanup_chrome_all_personas) to reduce LOC - Change Persona.get_active(config) to accept config dict as argument instead of calling get_config() internally, since the caller needs to pass user/crawl/snapshot/archiveresult context for proper config	2025-12-31 01:00:52 +00:00
Claude	503a2f77cb	Add Persona class with cleanup_chrome() method - Create Persona class in personas/models.py for managing browser profiles/identities used for archiving sessions - Each Persona has: - chrome_user_data_dir: Chrome profile directory - chrome_extensions_dir: Installed extensions - cookies_file: Cookies for wget/curl - config_file: Persona-specific config overrides - Add Persona methods: - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files - get_config(): Load persona config from config.json - save_config(): Save persona config to config.json - ensure_dirs(): Create persona directory structure - all(): Iterator over all personas - get_active(): Get persona based on ACTIVE_PERSONA config - cleanup_chrome_all(): Clean up all personas - Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all() instead of manual directory iteration - Add convenience functions: - cleanup_chrome_for_persona(name) - cleanup_chrome_all_personas()	2025-12-31 00:59:37 +00:00
Claude	1a86789523	Move Chrome default args to config.json CHROME_ARGS - Add comprehensive default CHROME_ARGS in config.json with 55+ flags for deterministic rendering, security, performance, and UI suppression - Update chrome_utils.js launchChromium() to read CHROME_ARGS and CHROME_ARGS_EXTRA from environment variables (set by get_config()) - Add getEnvArray() helper to parse JSON arrays or comma-separated strings from environment variables - Separate args into three categories: 1. baseArgs: Static flags from CHROME_ARGS config (configurable) 2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.) 3. extraArgs: User overrides from CHROME_ARGS_EXTRA - Add CHROME_SANDBOX config option to control --no-sandbox flag Args are now configurable via: - config.json defaults - ArchiveBox.conf file - Environment variables - Per-crawl/snapshot config overrides	2025-12-31 00:57:29 +00:00
Claude	877b5f91c2	Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system - Add _derive_persona_paths() in configset.py to automatically derive CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA when not explicitly set. This allows plugins to use these paths without knowing about the persona system. - Update chrome_utils.js launchChromium() to accept userDataDir option and pass --user-data-dir to Chrome. Also cleans up SingletonLock before launch. - Update killZombieChrome() to clean up SingletonLock files from all persona chrome_user_data directories after killing zombies. - Update chrome_cleanup() in misc/util.py to handle persona-based user data directories when cleaning up stale Chrome state. - Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from env (derived by get_config()). Config priority flow: ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot) -> get_config() derives: CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions -> hooks receive these as env vars without needing persona logic	2025-12-31 00:21:07 +00:00
Nick Sweeting	dd2302ad92	new jsonl cli interface	2025-12-30 16:12:53 -08:00
Nick Sweeting	ba8c28a866	use process_set for related name not processes	2025-12-30 12:55:23 -08:00
Nick Sweeting	1b49ea9a0e	improve jsonl logic	2025-12-30 12:43:36 -08:00
Nick Sweeting	08366cfa46	document chrome configs	2025-12-30 12:42:50 -08:00
claude[bot]	251fe33e49	fix: rename --plugin to --plugins for consistency Changed from singular --plugin to plural --plugins in both snapshot and extract commands to match the pattern in archivebox add command. Updated to accept comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title). - Updated CLI option from --plugin to --plugins - Added parsing for comma-separated plugin names - Updated function signatures and logic to handle multiple plugins - Updated help text, docstrings, and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:20:29 +00:00
claude[bot]	64db6deab3	fix: revert incorrect --extract renaming, restore --plugin parameter The --plugins parameter was incorrectly renamed to --extract (boolean). This restores --plugin (singular, matching extract command) with correct semantics: specify which plugin to run after creating snapshots. - Changed --extract/--no-extract back to --plugin (string parameter) - Updated function signature and logic to use plugin parameter - Added ArchiveResult creation for specific plugin when --plugin is passed - Updated docstring and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:15:48 +00:00
claude[bot]	762cddc8c5	fix: address PR review comments from cubic-dev-ai - Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency - Fix fallback logic in legacy.py to try JSON when JSONL parsing fails - Replace bare except clauses with specific exception types - Fix stdin double-consumption in archivebox_crawl.py - Merge CLI --tag option with crawl tags in archivebox_snapshot.py - Remove tautological mock tests (covered by integration tests) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:09:51 +00:00
Claude	cf387ed59f	refactor: batch all URLs into single Crawl, update tests - archivebox crawl now creates one Crawl with all URLs as newline-separated string - Updated tests to reflect new pipeline: crawl -> snapshot -> extract - Added tests for Crawl JSONL parsing and output - Tests verify Crawl.from_jsonl() handles multiple URLs correctly	2025-12-30 20:06:56 +00:00
Nick Sweeting	bb59287411	Merge branch 'dev' into claude/snapshot-index-jsonl-UxEXK	2025-12-30 12:05:05 -08:00
Nick Sweeting	099d955ef5	Implement tags editor widget for Django admin (#1729) Implement a sleek inline tag editor with autocomplete and AJAX support: - Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py - Pills display with X remove button, sorted alphabetically - Text input with HTML5 datalist autocomplete - Enter/Space/Comma to add tags, auto-creates if doesn't exist - Backspace removes last tag when input is empty - Add API endpoints in api/v1_core.py - GET /tags/autocomplete/ - search tags by name - POST /tags/create/ - get_or_create tag - POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX - POST /tags/remove-from-snapshot/ - remove tag from snapshot - Update admin_snapshots.py - Replace FilteredSelectMultiple with TagEditorWidget in bulk actions - Create SnapshotAdminForm with tags_editor field - Update title_str() to render inline tag editor in list view - Remove TagInline, use widget instead - Add CSS styles in templates/admin/base.html - Blue gradient pill styling matching admin theme - Focus ring and hover states - Compact inline variant for list view ## Summary by cubic Implemented a new interactive tags editor for Django admin with autocomplete and AJAX add/remove, replacing the old multi-select and inline. This makes tagging snapshots faster and safer in detail, list, and bulk actions. - New Features - TagEditorWidget and InlineTagEditorWidget with pill UI and remove buttons, XSS-safe rendering, and delegated events. - Keyboard support: Enter/Space/Comma to add, Backspace to remove last when input is empty. - Datalist autocomplete and debounced search via GET /tags/autocomplete/. - AJAX endpoints: POST /tags/create/, /tags/add-to-snapshot/, /tags/remove-from-snapshot/. - Refactors - Replaced FilteredSelectMultiple with TagEditorWidget in bulk actions; parse comma-separated tags and use bulk_create/delete for efficient add/remove. - Added SnapshotAdminForm with tags_editor field; saves tags case-insensitively and fixes remove_tags matching. - Rendered inline tag editor in list view via title_str; removed TagInline. - Added CSS in admin/base.html for pill styling, focus ring, and compact inline variant. Written for commit `0dee662f41`.	2025-12-30 11:59:39 -08:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Claude	ae648c9bc1	refactor: move remaining JSONL methods to models, clean up jsonl.py - Add Tag.to_jsonl() method with schema_version - Add Crawl.to_jsonl() method with schema_version - Fix Tag.from_jsonl() to not depend on jsonl.py helper - Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot Remove model-specific functions from misc/jsonl.py: - tag_to_jsonl() - use Tag.to_jsonl() instead - crawl_to_jsonl() - use Crawl.to_jsonl() instead - get_or_create_tag() - use Tag.from_jsonl() instead - process_jsonl_records() - use model from_jsonl() methods directly jsonl.py now only contains generic I/O utilities: - Type constants (TYPE_SNAPSHOT, etc.) - parse_line(), read_stdin(), read_file(), read_args_or_stdin() - write_record(), write_records() - filter_by_type(), process_records()	2025-12-30 19:30:18 +00:00
Claude	0dee662f41	Use bulk operations for add/remove tags actions - add_tags: Uses SnapshotTag.objects.bulk_create() with ignore_conflicts Instead of N calls to obj.tags.add(), now makes 1 query per tag - remove_tags: Uses single SnapshotTag.objects.filter().delete() Instead of N calls to obj.tags.remove(), now makes 1 query total Works correctly with "select all across pages" via queryset.values_list()	2025-12-30 19:29:34 +00:00
Claude	bc273c5a7f	feat: add schema_version to JSONL outputs and remove dead code - Add schema_version (archivebox.VERSION) to all to_jsonl() outputs: - Snapshot.to_jsonl() - ArchiveResult.to_jsonl() - Binary.to_jsonl() - Process.to_jsonl() - Update CLI commands to use model methods directly: - archivebox_snapshot.py: snapshot.to_jsonl() - archivebox_extract.py: result.to_jsonl() - Remove dead wrapper functions from misc/jsonl.py: - snapshot_to_jsonl() - archiveresult_to_jsonl() - binary_to_jsonl() - process_to_jsonl() - machine_to_jsonl() - Update tests to use model methods directly	2025-12-30 19:24:53 +00:00
claude[bot]	03b96ef4ce	Fix security issues in tag editor widgets - Fix case-sensitivity mismatch in remove_tags (use name__iexact) - Fix XSS vulnerability by removing onclick attributes - Use data attributes and event delegation instead - Apply DOM APIs to prevent injection attacks Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 19:18:41 +00:00
Claude	a5206e7648	refactor: move to_jsonl() methods to models Move JSONL serialization from standalone functions to model methods to mirror the from_jsonl() pattern: - Add Binary.to_jsonl() method - Add Process.to_jsonl() method - Add ArchiveResult.to_jsonl() method - Add Snapshot.to_jsonl() method - Update write_index_jsonl() to use model methods - Update jsonl.py functions to be thin wrappers	2025-12-30 18:35:22 +00:00
Nick Sweeting	91375d35a3	more migrations	2025-12-30 10:30:52 -08:00
Claude	d36079829b	feat: replace index.json with index.jsonl flat JSONL format Switch from hierarchical index.json to flat index.jsonl format for snapshot metadata storage. Each line is a self-contained JSON record with a 'type' field (Snapshot, ArchiveResult, Binary, Process). Changes: - Add JSONL_INDEX_FILENAME constant to constants.py - Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants - Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters - Add Snapshot.write_index_jsonl() to write new format - Add Snapshot.read_index_jsonl() to read new format - Add Snapshot.convert_index_json_to_jsonl() for migration - Update Snapshot.reconcile_with_index() to handle both formats - Update fs_migrate to convert during filesystem migration - Update load_from_directory/create_from_directory for both formats - Update legacy.py parse_json_links_details for JSONL support The new format is easier to parse, extend, and mix record types.	2025-12-30 18:21:06 +00:00
Nick Sweeting	96ee1bf686	more migration fixes	2025-12-30 09:57:33 -08:00
Nick Sweeting	4cd2fceb8a	even more migration fixes	2025-12-29 22:30:37 -08:00
Nick Sweeting	95beddc5fc	more migration fixes	2025-12-29 22:12:57 -08:00
Nick Sweeting	2e350d317d	fix initial migrations	2025-12-29 21:27:31 -08:00
Nick Sweeting	3dd329600e	comment updates	2025-12-29 21:05:34 -08:00
Nick Sweeting	80f75126c6	more fixes	2025-12-29 21:03:05 -08:00
Nick Sweeting	a648c17ec7	Merge branch 'dev' into claude/tags-editor-widget-0Dq7f	2025-12-29 19:25:36 -08:00
Nick Sweeting	147d567d3f	fix migrations	2025-12-29 19:25:26 -08:00
Nick Sweeting	dcf1136afa	Merge branch 'dev' into claude/tags-editor-widget-0Dq7f	2025-12-29 19:24:51 -08:00
Nick Sweeting	64dccb7a19	passing	2025-12-29 18:55:57 -08:00
Nick Sweeting	5549a79869	more speed fixes	2025-12-29 18:55:37 -08:00
Nick Sweeting	abf5f44134	more debug logging	2025-12-29 18:53:52 -08:00
Nick Sweeting	bcf0513d05	more debug logging	2025-12-29 18:50:04 -08:00
Nick Sweeting	7e6e3be9e7	messing with chrome install process to reuse cached chromium with pinned version	2025-12-29 18:49:36 -08:00
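
Commit 4285a05d19 in the log above states the parsing rule used by getEnvArray: if the value contains '[', parse it as a JSON array; otherwise split it on commas. The Python sketch below restates that rule for illustration only. The real helper is JavaScript in chrome_utils.js; the name get_env_array, the default parameter, the treatment of empty values, and the whitespace trimming are assumptions, not details taken from the commit.

```python
import json


def get_env_array(value, default=None):
    """Hypothetical Python restatement of the rule in commit 4285a05d19."""
    # Assumption: a missing or blank value falls back to the default list.
    if value is None or not value.strip():
        return list(default or [])
    # Rule from the commit message: '[' present means JSON array.
    if "[" in value:
        parsed = json.loads(value)
        if not isinstance(parsed, list):
            raise ValueError("expected a JSON array")
        return [str(item) for item in parsed]
    # Otherwise treat the value as comma-separated (trimming is an assumption).
    return [part.strip() for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    # An argument with an internal comma survives only in JSON form.
    print(get_env_array('["--arg1,val", "--arg2"]'))
    print(get_env_array("--arg1,--arg2"))
```

As the commit message notes, an argument that itself contains a comma has to be given in JSON form, since the comma-separated branch would split it.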

2150 Commits
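
Commit d36079829b in the log above describes index.jsonl as a flat file in which each line is a self-contained JSON record carrying a 'type' field (Snapshot, ArchiveResult, Binary, Process). A minimal sketch of reading such a file follows. The function names and the sample fields other than 'type' are hypothetical and are not ArchiveBox's actual API or schema.

```python
import json


def read_jsonl_records(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
    return records


def records_of_type(records, record_type):
    """Return only the records whose 'type' field matches record_type."""
    return [r for r in records if r.get("type") == record_type]


# Hypothetical sample: only the 'type' field comes from the commit message.
SAMPLE = "\n".join([
    '{"type": "Snapshot", "url": "https://example.com"}',
    '{"type": "ArchiveResult", "plugin": "title"}',
    '',
    '{"type": "ArchiveResult", "plugin": "screenshot"}',
])

if __name__ == "__main__":
    records = read_jsonl_records(SAMPLE)
    for record in records_of_type(records, "ArchiveResult"):
        print(record)
```

Because each line parses independently, different record types can share one file and a reader can keep only the types it cares about, which is the benefit the commit message attributes to the flat format.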