ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 01:15:57 +10:00

Author	SHA1	Message	Date
Nick Sweeting	3da523fc74	more consistent crawl, snapshot, hook cleanup and Process tracking	2026-01-02 04:27:38 -08:00
Nick Sweeting	dd77511026	unified Process source of truth and better screenshot tests	2026-01-02 04:20:34 -08:00
claude[bot]	b2132d1f14	Fix cubic review issues: process_type detection, cmd storage, PID cleanup, and migration - Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation - Fix worker process_type detection: explicitly set to WORKER after registration - Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently - Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN) - Fix get_running_workers() to return process_id instead of incorrectly named worker_id - Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed - Fix migration to include all indexes in state_operations (parent_id, process_type) - Fix documentation to use Machine.current() scoping and StatusChoices constants Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:42:07 +00:00
Nick Sweeting	84a4fb0785	fix cubic comments	2025-12-30 23:53:47 -08:00
Nick Sweeting	f7b186d7c8	Apply suggestion from @cubic-dev-ai[bot] Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>	2025-12-31 02:31:46 -05:00
Claude	503a2f77cb	Add Persona class with cleanup_chrome() method - Create Persona class in personas/models.py for managing browser profiles/identities used for archiving sessions - Each Persona has: - chrome_user_data_dir: Chrome profile directory - chrome_extensions_dir: Installed extensions - cookies_file: Cookies for wget/curl - config_file: Persona-specific config overrides - Add Persona methods: - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files - get_config(): Load persona config from config.json - save_config(): Save persona config to config.json - ensure_dirs(): Create persona directory structure - all(): Iterator over all personas - get_active(): Get persona based on ACTIVE_PERSONA config - cleanup_chrome_all(): Clean up all personas - Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all() instead of manual directory iteration - Add convenience functions: - cleanup_chrome_for_persona(name) - cleanup_chrome_all_personas()	2025-12-31 00:59:37 +00:00
Claude	877b5f91c2	Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system - Add _derive_persona_paths() in configset.py to automatically derive CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA when not explicitly set. This allows plugins to use these paths without knowing about the persona system. - Update chrome_utils.js launchChromium() to accept userDataDir option and pass --user-data-dir to Chrome. Also cleans up SingletonLock before launch. - Update killZombieChrome() to clean up SingletonLock files from all persona chrome_user_data directories after killing zombies. - Update chrome_cleanup() in misc/util.py to handle persona-based user data directories when cleaning up stale Chrome state. - Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from env (derived by get_config()). Config priority flow: ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot) -> get_config() derives: CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions -> hooks receive these as env vars without needing persona logic	2025-12-31 00:21:07 +00:00
Nick Sweeting	dd2302ad92	new jsonl cli interface	2025-12-30 16:12:53 -08:00
claude[bot]	762cddc8c5	fix: address PR review comments from cubic-dev-ai - Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency - Fix fallback logic in legacy.py to try JSON when JSONL parsing fails - Replace bare except clauses with specific exception types - Fix stdin double-consumption in archivebox_crawl.py - Merge CLI --tag option with crawl tags in archivebox_snapshot.py - Remove tautological mock tests (covered by integration tests) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:09:51 +00:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Claude	ae648c9bc1	refactor: move remaining JSONL methods to models, clean up jsonl.py - Add Tag.to_jsonl() method with schema_version - Add Crawl.to_jsonl() method with schema_version - Fix Tag.from_jsonl() to not depend on jsonl.py helper - Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot Remove model-specific functions from misc/jsonl.py: - tag_to_jsonl() - use Tag.to_jsonl() instead - crawl_to_jsonl() - use Crawl.to_jsonl() instead - get_or_create_tag() - use Tag.from_jsonl() instead - process_jsonl_records() - use model from_jsonl() methods directly jsonl.py now only contains generic I/O utilities: - Type constants (TYPE_SNAPSHOT, etc.) - parse_line(), read_stdin(), read_file(), read_args_or_stdin() - write_record(), write_records() - filter_by_type(), process_records()	2025-12-30 19:30:18 +00:00
Claude	bc273c5a7f	feat: add schema_version to JSONL outputs and remove dead code - Add schema_version (archivebox.VERSION) to all to_jsonl() outputs: - Snapshot.to_jsonl() - ArchiveResult.to_jsonl() - Binary.to_jsonl() - Process.to_jsonl() - Update CLI commands to use model methods directly: - archivebox_snapshot.py: snapshot.to_jsonl() - archivebox_extract.py: result.to_jsonl() - Remove dead wrapper functions from misc/jsonl.py: - snapshot_to_jsonl() - archiveresult_to_jsonl() - binary_to_jsonl() - process_to_jsonl() - machine_to_jsonl() - Update tests to use model methods directly	2025-12-30 19:24:53 +00:00
Claude	a5206e7648	refactor: move to_jsonl() methods to models Move JSONL serialization from standalone functions to model methods to mirror the from_jsonl() pattern: - Add Binary.to_jsonl() method - Add Process.to_jsonl() method - Add ArchiveResult.to_jsonl() method - Add Snapshot.to_jsonl() method - Update write_index_jsonl() to use model methods - Update jsonl.py functions to be thin wrappers	2025-12-30 18:35:22 +00:00
Claude	d36079829b	feat: replace index.json with index.jsonl flat JSONL format Switch from hierarchical index.json to flat index.jsonl format for snapshot metadata storage. Each line is a self-contained JSON record with a 'type' field (Snapshot, ArchiveResult, Binary, Process). Changes: - Add JSONL_INDEX_FILENAME constant to constants.py - Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants - Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters - Add Snapshot.write_index_jsonl() to write new format - Add Snapshot.read_index_jsonl() to read new format - Add Snapshot.convert_index_json_to_jsonl() for migration - Update Snapshot.reconcile_with_index() to handle both formats - Update fs_migrate to convert during filesystem migration - Update load_from_directory/create_from_directory for both formats - Update legacy.py parse_json_links_details for JSONL support The new format is easier to parse, extend, and mix record types.	2025-12-30 18:21:06 +00:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Nick Sweeting	4ccb0863bb	continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script	2025-12-28 05:29:24 -08:00
Nick Sweeting	bd265c0083	rename extractor to plugin everywhere	2025-12-28 04:43:15 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Claude	b632894bc9	Update views, API, and exports for new ArchiveResult output fields Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.	2025-12-27 20:28:22 +00:00
Claude	c3acadd528	Remove extractor field from Crawl model and fix tests - Remove extractor field from Crawl model (moved to config dict) - Update migration 0002_drop_seed_model to not add extractor - Update archivebox_add.py to use config['PARSER'] instead - Update admin.py recrawl to not pass extractor - Update jsonl.py serialization to not include extractor - Update test schema SCHEMA_0_8 to not include extractor - Set default timeout to 60s for test commands	2025-12-27 01:49:09 +00:00
Nick Sweeting	4fd7fcdbcf	new gallerydl plugin and more	2025-12-26 11:55:03 -08:00
Nick Sweeting	9838d7ba02	tons of ui fixes and plugin fixes	2025-12-25 03:59:51 -08:00
Nick Sweeting	bb53228ebf	remove Seed model in favor of Crawl as template	2025-12-25 01:52:41 -08:00
Nick Sweeting	866f993f26	logging and admin ui improvements	2025-12-25 01:10:41 -08:00
Nick Sweeting	d95f0dc186	remove huey	2025-12-24 23:40:18 -08:00
Nick Sweeting	6c769d831c	wip 2	2025-12-24 21:46:14 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00
Ben Muthalaly	71c02ca4eb	Update archivebox/misc/logging_util.py Co-authored-by: Nick Sweeting <git@sweeting.me>	2025-02-05 17:55:45 -06:00
Ben Muthalaly	9f4cf0a8e1	Kill the timer process if it doesn't properly terminate.	2025-02-03 02:47:33 -06:00
Nick Sweeting	c5fc4068f4	fix unneeded import	2024-12-18 18:09:21 -08:00
Nick Sweeting	7975b47c85	remove dependencies on unneeded libraries	2024-12-18 18:07:35 -08:00
Nick Sweeting	d192eb5c48	add filestore content addressible store draft	2024-12-04 02:15:04 -08:00
Nick Sweeting	a3fe78afaa	add basename to hashing get_dir_info	2024-12-04 02:15:04 -08:00
Nick Sweeting	eae7ed8447	add hashing misc library for merkle tree generation	2024-12-03 02:12:20 -08:00
Nick Sweeting	2595139180	improve statemachine logging and archivebox update CLI cmd	2024-11-19 03:31:05 -08:00
Nick Sweeting	c9a05c9d94	working archivebox update CLI cmd	2024-11-19 02:32:05 -08:00
Nick Sweeting	328eb98a38	move main funcs into cli files and switch to using click for CLI	2024-11-19 00:18:51 -08:00
Nick Sweeting	4c25e90378	move monkey_patches.py into archivebox.misc subfolder	2024-11-18 19:10:42 -08:00
Nick Sweeting	4a5d607296	move logging_util into archivebox.misc subfolder	2024-11-18 19:08:49 -08:00
Nick Sweeting	b3c1cb716e	move abx plugins inside vendor dir	2024-10-28 04:07:35 -07:00
Nick Sweeting	4b6f08b0fe	swap more direct settings.CONFIG access to abx getters	2024-10-24 15:42:19 -07:00
Nick Sweeting	60f0458c77	rename configfile to collection	2024-10-24 15:40:24 -07:00
Nick Sweeting	657eec479b	fix CONSTANTS.LIB_DIR old style access	2024-10-21 03:20:20 -07:00
Nick Sweeting	b3107ab830	move final legacy config to plugins and fix archivebox config cmd and add search opt	2024-10-21 02:56:00 -07:00
Nick Sweeting	a211461ffc	fix LIB_DIR and TMP_DIR loading when primary option isnt available	2024-10-21 00:35:56 -07:00
Nick Sweeting	bb9c3fda14	fix makemigrations being blocked by check_migrations func	2024-10-14 17:40:06 -07:00
Nick Sweeting	9a04ed7c76	move serve_static and shell_welcome_message into misc	2024-10-14 17:35:28 -07:00
Nick Sweeting	f75ae805f8	comment out Crawl api methods temporarily	2024-10-14 15:41:58 -07:00
Nick Sweeting	2f68a1d476	fix ldap lib loading after apt install	2024-10-09 04:03:02 -07:00
Nick Sweeting	afc24e802a	tweak version log output	2024-10-09 03:18:22 -07:00

1 2

75 Commits