ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-06 07:47:53 +10:00

Author	SHA1	Message	Date
Claude	df2a0dcd44	Add revised CLI pipeline architecture plan Comprehensive plan for implementing JSONL-based CLI piping: - Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix) - Phase 2: Extract shared apply_filters() to cli_utils.py - Phase 3: Implement pass-through behavior for all create commands - Phase 4-6: Test infrastructure with pytest-django, unit/integration tests Key changes from original plan: - ArchiveResult.from_json() identified as missing prerequisite - Pass-through documented as new feature to implement - archivebox run updated to create-or-update pattern - conftest.py redesigned to use pytest-django with isolated tmp_path - Standardized on tags_str field name across all models - Reordered phases: implement before test	2025-12-31 01:46:07 +00:00
Claude	b8a66c4a84	Convert Persona to Django ModelWithConfig, add to get_config() - Convert Persona from plain Python class to Django model with ModelWithConfig - Add config JSONField for persona-specific config overrides - Add get_derived_config() method that returns config with derived paths: - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA - Update get_config() to accept persona parameter in merge chain: get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot) - Remove _derive_persona_paths() - derivation now happens in Persona model - Merge order (highest to lowest priority): 1. snapshot.config 2. crawl.config 3. user.config 4. persona.get_derived_config() <- NEW 5. environment variables 6. ArchiveBox.conf file 7. plugin defaults 8. core defaults Usage: config = get_config(persona=crawl.persona, crawl=crawl) config['CHROME_USER_DATA_DIR'] # derived from persona	2025-12-31 01:07:29 +00:00
Claude	b1e31c3def	Simplify Persona class: remove convenience functions, fix get_active() - Remove standalone convenience functions (cleanup_chrome_for_persona, cleanup_chrome_all_personas) to reduce LOC - Change Persona.get_active(config) to accept config dict as argument instead of calling get_config() internally, since the caller needs to pass user/crawl/snapshot/archiveresult context for proper config	2025-12-31 01:00:52 +00:00
Claude	503a2f77cb	Add Persona class with cleanup_chrome() method - Create Persona class in personas/models.py for managing browser profiles/identities used for archiving sessions - Each Persona has: - chrome_user_data_dir: Chrome profile directory - chrome_extensions_dir: Installed extensions - cookies_file: Cookies for wget/curl - config_file: Persona-specific config overrides - Add Persona methods: - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files - get_config(): Load persona config from config.json - save_config(): Save persona config to config.json - ensure_dirs(): Create persona directory structure - all(): Iterator over all personas - get_active(): Get persona based on ACTIVE_PERSONA config - cleanup_chrome_all(): Clean up all personas - Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all() instead of manual directory iteration - Add convenience functions: - cleanup_chrome_for_persona(name) - cleanup_chrome_all_personas()	2025-12-31 00:59:37 +00:00
Claude	1a86789523	Move Chrome default args to config.json CHROME_ARGS - Add comprehensive default CHROME_ARGS in config.json with 55+ flags for deterministic rendering, security, performance, and UI suppression - Update chrome_utils.js launchChromium() to read CHROME_ARGS and CHROME_ARGS_EXTRA from environment variables (set by get_config()) - Add getEnvArray() helper to parse JSON arrays or comma-separated strings from environment variables - Separate args into three categories: 1. baseArgs: Static flags from CHROME_ARGS config (configurable) 2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.) 3. extraArgs: User overrides from CHROME_ARGS_EXTRA - Add CHROME_SANDBOX config option to control --no-sandbox flag Args are now configurable via: - config.json defaults - ArchiveBox.conf file - Environment variables - Per-crawl/snapshot config overrides	2025-12-31 00:57:29 +00:00
Claude	caee376749	Add Process.proc property for validated psutil access New section 1.5 adds @property proc that returns psutil.Process ONLY if: - PID exists in OS - OS start time matches our started_at (within tolerance) - We're on the same machine Safety features: - Validates start time via psutil.Process.create_time() - Optional command validation (binary name matches) - Returns None instead of wrong process on PID reuse Also adds convenience methods: - is_running: Check via validated psutil - get_memory_info(): RSS/VMS if running - get_cpu_percent(): CPU usage if running - get_children_pids(): Child PIDs from OS Updated kill() to use self.proc for safe killing - never kills a recycled PID since we validate start time first.	2025-12-31 00:49:58 +00:00
Claude	f3c91b4c4e	Add detailed supervisord Process tracking to plan Phase 3.3 now includes: - Module-level _supervisord_db_process variable - start_new_supervisord_process(): Create Process record after Popen - stop_existing_supervisord_process(): Update Process status on shutdown - Process hierarchy diagram showing CLI → supervisord → workers chain Key insight: PPID-based linking works because workers call Process.current() in on_startup(), which finds supervisord's Process via PPID lookup.	2025-12-31 00:45:10 +00:00
Claude	e41ca37848	Add detailed hook/run() changes to Process tracking plan Phase 2 now includes line-by-line mapping of: - run_hook(): Create Process record, use Process.launch(), parse JSONL for child binary Process records - process_is_alive(): Accept Path or Process, use Process.is_alive() - kill_process(): Accept Path or Process, use Process.kill() - ArchiveResult.run(): Pass self.process as parent_process to run_hook() - ArchiveResult.update_from_output(): Read from Process.stdout/stderr - Snapshot.cleanup(): Kill via Process model, fallback to PID files - Snapshot.has_running_background_hooks(): Check via Process model Hook JSONL contract updated to support {"type": "Process"} records for tracking binary executions within hooks.	2025-12-31 00:44:10 +00:00
Claude	554d743719	Add robust PID reuse protection to Process.current() plan PIDs are recycled by OS, so all Process queries now: - Filter by machine=Machine.current() (PIDs unique per machine) - Filter by started_at within PID_REUSE_WINDOW (24h) - Validate start time matches OS via psutil.Process.create_time() Added: - ProcessManager.get_by_pid() for safe PID lookups - Process.cleanup_stale_running() to mark orphaned RUNNING as EXITED - START_TIME_TOLERANCE (5s) for start time comparison - Uses psutil.Process.create_time() for accurate started_at	2025-12-31 00:36:01 +00:00
Claude	4c4c065697	Add Process.current() to implementation plan Key addition: Process.current() class method (like Machine.current()) that auto-creates/retrieves the Process record for the current OS process. Benefits: - Uses PPID lookup to find parent Process automatically - Detects process_type from sys.argv - Cached with validation (like Machine.current()) - Eliminates need for thread-local context management Simplified Phase 3 (workers) and Phase 4 (CLI) to just call Process.current() instead of manual Process creation.	2025-12-31 00:32:05 +00:00
Claude	f21fb55a2c	Add comprehensive implementation plan for Process hierarchy tracking Documents 7-phase refactoring to use machine.Process as the core data model for all subprocess management: - Phase 1: Add parent FK and process_type to Process model - Phase 2: Add lifecycle methods (launch, kill, poll, wait) - Phase 3: Update hook system to create Process records - Phase 4-5: Track workers/orchestrator/supervisord as Process - Phase 6: Create root Process on CLI invocation - Phase 7: Admin UI with tree visualization Enables full process hierarchy tracking from CLI → binary execution.	2025-12-31 00:28:17 +00:00
Claude	877b5f91c2	Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system - Add _derive_persona_paths() in configset.py to automatically derive CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA when not explicitly set. This allows plugins to use these paths without knowing about the persona system. - Update chrome_utils.js launchChromium() to accept userDataDir option and pass --user-data-dir to Chrome. Also cleans up SingletonLock before launch. - Update killZombieChrome() to clean up SingletonLock files from all persona chrome_user_data directories after killing zombies. - Update chrome_cleanup() in misc/util.py to handle persona-based user data directories when cleaning up stale Chrome state. - Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from env (derived by get_config()). Config priority flow: ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot) -> get_config() derives: CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions -> hooks receive these as env vars without needing persona logic	2025-12-31 00:21:07 +00:00
Nick Sweeting	dd2302ad92	new jsonl cli interface	2025-12-30 16:12:53 -08:00
Nick Sweeting	ba8c28a866	use process_set for related name not processes	2025-12-30 12:55:23 -08:00
Nick Sweeting	1b49ea9a0e	improve jsonl logic	2025-12-30 12:43:36 -08:00
Nick Sweeting	08366cfa46	document chrome configs	2025-12-30 12:42:50 -08:00
Nick Sweeting	93a78ce595	Convert snapshot index from JSON to JSONL (#1730) # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [x] Snapshot data layout on disk --- ## Summary by cubic Switch snapshot index storage from index.json to a flat index.jsonl format for easier parsing and extensibility. Includes automatic migration and backward-compatible reading, plus updated CLI pipeline to emit/consume JSONL records. - New Features - Write and read index.jsonl with per-line records (Snapshot, ArchiveResult, Binary, Process); reconcile prefers JSONL. - Auto-convert legacy index.json to JSONL during migration/update; load_from_directory/create_from_directory support both formats. - Serialization moved to model to_jsonl methods; added schema_version to all records, including Tag, Crawl, Binary, and Process. - CLI pipeline updated: crawl creates a single Crawl job from all input URLs and outputs Crawl JSONL (no immediate crawling); snapshot accepts Crawl JSONL/IDs and outputs Snapshot JSONL; extract outputs ArchiveResult JSONL via model methods. - Migration - Conversion runs during filesystem migration and reconcile; no manual steps. - Legacy index.json is deleted after conversion; external tools should switch to index.jsonl. Written for commit `251fe33e49`. Summary will update on new commits.	2025-12-30 12:32:52 -08:00
claude[bot]	251fe33e49	fix: rename --plugin to --plugins for consistency Changed from singular --plugin to plural --plugins in both snapshot and extract commands to match the pattern in archivebox add command. Updated to accept comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title). - Updated CLI option from --plugin to --plugins - Added parsing for comma-separated plugin names - Updated function signatures and logic to handle multiple plugins - Updated help text, docstrings, and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:20:29 +00:00
claude[bot]	64db6deab3	fix: revert incorrect --extract renaming, restore --plugin parameter The --plugins parameter was incorrectly renamed to --extract (boolean). This restores --plugin (singular, matching extract command) with correct semantics: specify which plugin to run after creating snapshots. - Changed --extract/--no-extract back to --plugin (string parameter) - Updated function signature and logic to use plugin parameter - Added ArchiveResult creation for specific plugin when --plugin is passed - Updated docstring and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:15:48 +00:00
claude[bot]	762cddc8c5	fix: address PR review comments from cubic-dev-ai - Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency - Fix fallback logic in legacy.py to try JSON when JSONL parsing fails - Replace bare except clauses with specific exception types - Fix stdin double-consumption in archivebox_crawl.py - Merge CLI --tag option with crawl tags in archivebox_snapshot.py - Remove tautological mock tests (covered by integration tests) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:09:51 +00:00
Claude	cf387ed59f	refactor: batch all URLs into single Crawl, update tests - archivebox crawl now creates one Crawl with all URLs as newline-separated string - Updated tests to reflect new pipeline: crawl -> snapshot -> extract - Added tests for Crawl JSONL parsing and output - Tests verify Crawl.from_jsonl() handles multiple URLs correctly	2025-12-30 20:06:56 +00:00
Nick Sweeting	bb59287411	Merge branch 'dev' into claude/snapshot-index-jsonl-UxEXK	2025-12-30 12:05:05 -08:00
Nick Sweeting	099d955ef5	Implement tags editor widget for Django admin (#1729) Implement a sleek inline tag editor with autocomplete and AJAX support: - Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py - Pills display with X remove button, sorted alphabetically - Text input with HTML5 datalist autocomplete - Enter/Space/Comma to add tags, auto-creates if doesn't exist - Backspace removes last tag when input is empty - Add API endpoints in api/v1_core.py - GET /tags/autocomplete/ - search tags by name - POST /tags/create/ - get_or_create tag - POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX - POST /tags/remove-from-snapshot/ - remove tag from snapshot - Update admin_snapshots.py - Replace FilteredSelectMultiple with TagEditorWidget in bulk actions - Create SnapshotAdminForm with tags_editor field - Update title_str() to render inline tag editor in list view - Remove TagInline, use widget instead - Add CSS styles in templates/admin/base.html - Blue gradient pill styling matching admin theme - Focus ring and hover states - Compact inline variant for list view --- ## Summary by cubic Implemented a new interactive tags editor for Django admin with autocomplete and AJAX add/remove, replacing the old multi-select and inline. This makes tagging snapshots faster and safer in detail, list, and bulk actions. - New Features - TagEditorWidget and InlineTagEditorWidget with pill UI and remove buttons, XSS-safe rendering, and delegated events. - Keyboard support: Enter/Space/Comma to add, Backspace to remove last when input is empty. - Datalist autocomplete and debounced search via GET /tags/autocomplete/. - AJAX endpoints: POST /tags/create/, /tags/add-to-snapshot/, /tags/remove-from-snapshot/. - Refactors - Replaced FilteredSelectMultiple with TagEditorWidget in bulk actions; parse comma-separated tags and use bulk_create/delete for efficient add/remove. - Added SnapshotAdminForm with tags_editor field; saves tags case-insensitively and fixes remove_tags matching. - Rendered inline tag editor in list view via title_str; removed TagInline. - Added CSS in admin/base.html for pill styling, focus ring, and compact inline variant. Written for commit `0dee662f41`. Summary will update on new commits.	2025-12-30 11:59:39 -08:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Claude	ae648c9bc1	refactor: move remaining JSONL methods to models, clean up jsonl.py - Add Tag.to_jsonl() method with schema_version - Add Crawl.to_jsonl() method with schema_version - Fix Tag.from_jsonl() to not depend on jsonl.py helper - Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot Remove model-specific functions from misc/jsonl.py: - tag_to_jsonl() - use Tag.to_jsonl() instead - crawl_to_jsonl() - use Crawl.to_jsonl() instead - get_or_create_tag() - use Tag.from_jsonl() instead - process_jsonl_records() - use model from_jsonl() methods directly jsonl.py now only contains generic I/O utilities: - Type constants (TYPE_SNAPSHOT, etc.) - parse_line(), read_stdin(), read_file(), read_args_or_stdin() - write_record(), write_records() - filter_by_type(), process_records()	2025-12-30 19:30:18 +00:00
Claude	0dee662f41	Use bulk operations for add/remove tags actions - add_tags: Uses SnapshotTag.objects.bulk_create() with ignore_conflicts Instead of N calls to obj.tags.add(), now makes 1 query per tag - remove_tags: Uses single SnapshotTag.objects.filter().delete() Instead of N calls to obj.tags.remove(), now makes 1 query total Works correctly with "select all across pages" via queryset.values_list()	2025-12-30 19:29:34 +00:00
Claude	bc273c5a7f	feat: add schema_version to JSONL outputs and remove dead code - Add schema_version (archivebox.VERSION) to all to_jsonl() outputs: - Snapshot.to_jsonl() - ArchiveResult.to_jsonl() - Binary.to_jsonl() - Process.to_jsonl() - Update CLI commands to use model methods directly: - archivebox_snapshot.py: snapshot.to_jsonl() - archivebox_extract.py: result.to_jsonl() - Remove dead wrapper functions from misc/jsonl.py: - snapshot_to_jsonl() - archiveresult_to_jsonl() - binary_to_jsonl() - process_to_jsonl() - machine_to_jsonl() - Update tests to use model methods directly	2025-12-30 19:24:53 +00:00
claude[bot]	03b96ef4ce	Fix security issues in tag editor widgets - Fix case-sensitivity mismatch in remove_tags (use name__iexact) - Fix XSS vulnerability by removing onclick attributes - Use data attributes and event delegation instead - Apply DOM APIs to prevent injection attacks Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 19:18:41 +00:00
Claude	a5206e7648	refactor: move to_jsonl() methods to models Move JSONL serialization from standalone functions to model methods to mirror the from_jsonl() pattern: - Add Binary.to_jsonl() method - Add Process.to_jsonl() method - Add ArchiveResult.to_jsonl() method - Add Snapshot.to_jsonl() method - Update write_index_jsonl() to use model methods - Update jsonl.py functions to be thin wrappers	2025-12-30 18:35:22 +00:00
Nick Sweeting	91375d35a3	more migrations	2025-12-30 10:30:52 -08:00
Claude	d36079829b	feat: replace index.json with index.jsonl flat JSONL format Switch from hierarchical index.json to flat index.jsonl format for snapshot metadata storage. Each line is a self-contained JSON record with a 'type' field (Snapshot, ArchiveResult, Binary, Process). Changes: - Add JSONL_INDEX_FILENAME constant to constants.py - Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants - Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters - Add Snapshot.write_index_jsonl() to write new format - Add Snapshot.read_index_jsonl() to read new format - Add Snapshot.convert_index_json_to_jsonl() for migration - Update Snapshot.reconcile_with_index() to handle both formats - Update fs_migrate to convert during filesystem migration - Update load_from_directory/create_from_directory for both formats - Update legacy.py parse_json_links_details for JSONL support The new format is easier to parse, extend, and mix record types.	2025-12-30 18:21:06 +00:00
Nick Sweeting	96ee1bf686	more migration fixes	2025-12-30 09:57:33 -08:00
Nick Sweeting	4cd2fceb8a	even more migration fixes	2025-12-29 22:30:37 -08:00
Nick Sweeting	95beddc5fc	more migration fixes	2025-12-29 22:12:57 -08:00
Nick Sweeting	2e350d317d	fix initial migrations	2025-12-29 21:27:31 -08:00
Nick Sweeting	3dd329600e	comment updates	2025-12-29 21:05:34 -08:00
Nick Sweeting	80f75126c6	more fixes	2025-12-29 21:03:05 -08:00
Nick Sweeting	a648c17ec7	Merge branch 'dev' into claude/tags-editor-widget-0Dq7f	2025-12-29 19:25:36 -08:00
Nick Sweeting	147d567d3f	fix migrations	2025-12-29 19:25:26 -08:00
Nick Sweeting	dcf1136afa	Merge branch 'dev' into claude/tags-editor-widget-0Dq7f	2025-12-29 19:24:51 -08:00
Nick Sweeting	64dccb7a19	passing	2025-12-29 18:55:57 -08:00
Nick Sweeting	5549a79869	more speed fixes	2025-12-29 18:55:37 -08:00
Nick Sweeting	abf5f44134	more debug logging	2025-12-29 18:53:52 -08:00
Nick Sweeting	bcf0513d05	more debug logging	2025-12-29 18:50:04 -08:00
Nick Sweeting	7e6e3be9e7	messing with chrome install process to reuse cached chromium with pinned version	2025-12-29 18:49:36 -08:00
Claude	202e5b2e59	Add interactive tags editor widget for Django admin Implement a sleek inline tag editor with autocomplete and AJAX support: - Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py - Pills display with X remove button, sorted alphabetically - Text input with HTML5 datalist autocomplete - Enter/Space/Comma to add tags, auto-creates if doesn't exist - Backspace removes last tag when input is empty - Add API endpoints in api/v1_core.py - GET /tags/autocomplete/ - search tags by name - POST /tags/create/ - get_or_create tag - POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX - POST /tags/remove-from-snapshot/ - remove tag from snapshot - Update admin_snapshots.py - Replace FilteredSelectMultiple with TagEditorWidget in bulk actions - Create SnapshotAdminForm with tags_editor field - Update title_str() to render inline tag editor in list view - Remove TagInline, use widget instead - Add CSS styles in templates/admin/base.html - Blue gradient pill styling matching admin theme - Focus ring and hover states - Compact inline variant for list view	2025-12-30 02:18:08 +00:00
Nick Sweeting	b670612685	centralize chrome pid and zombie logic in chrome_utils	2025-12-29 17:57:23 -08:00
Nick Sweeting	4ba3e8d120	fix extension loading and consolidate chromium logic	2025-12-29 17:47:37 -08:00
claude[bot]	a101449cba	Fix: Make SingleFile use SINGLEFILE_CHROME_ARGS with fallback to CHROME_ARGS This fix resolves issue #1445 where SingleFile was not respecting Chrome user data directory and other Chrome launch options that work for other Chrome-based extractors (PDF, Screenshot, etc.). Changes: - Added SINGLEFILE_CHROME_ARGS config option in config.json with x-fallback to CHROME_ARGS - Updated SingleFile extractor to read and pass Chrome arguments via --browser-args parameter - Updated docstring to document the new environment variable This ensures SingleFile respects the same Chrome configuration (user data directory, cookies, etc.) as other Chrome-based extractors. Fixes #1445 Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-29 22:39:48 +00:00
Nick Sweeting	638b3ba774	add modalcloser plugin	2025-12-29 14:36:15 -08:00
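
Several of the commits above describe a flat index.jsonl format in which each line is a self-contained JSON record carrying a 'type' field (Snapshot, ArchiveResult, Binary, Process) and a schema_version. As a rough illustration only, the sketch below groups such records by type using the Python standard library. The function name group_records_by_type and every field other than type and schema_version are hypothetical, and this is not ArchiveBox's own jsonl.py code.

```python
import json
from collections import defaultdict


def group_records_by_type(text):
    """Group flat JSONL records by their 'type' field (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Records without a 'type' field are collected under "Unknown".
        grouped[record.get("type", "Unknown")].append(record)
    return dict(grouped)


# Sample data: 'url' and 'plugin' are made-up fields for illustration.
sample = "\n".join([
    json.dumps({"type": "Snapshot", "schema_version": "0.0.0", "url": "https://example.com"}),
    json.dumps({"type": "ArchiveResult", "schema_version": "0.0.0", "plugin": "title"}),
    json.dumps({"type": "ArchiveResult", "schema_version": "0.0.0", "plugin": "screenshot"}),
])

grouped = group_records_by_type(sample)
for record_type, records in sorted(grouped.items()):
    print(record_type, len(records))
```

Because every line stands alone, a consumer in a crawl -> snapshot -> extract style pipeline can pick out the record types it understands and pass the rest through without parsing a nested document.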

4927 Commits