ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-04 01:46:54 +10:00

Author	SHA1	Message	Date
Claude	04c23badc2	Fix output path structure for 0.9.x data directory - Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths - Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ - Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl - Update comment in chrome_tab hook to reflect new config source	2025-12-31 08:18:24 +00:00
Claude	b8a66c4a84	Convert Persona to Django ModelWithConfig, add to get_config() - Convert Persona from plain Python class to Django model with ModelWithConfig - Add config JSONField for persona-specific config overrides - Add get_derived_config() method that returns config with derived paths: - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA - Update get_config() to accept persona parameter in merge chain: get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot) - Remove _derive_persona_paths() - derivation now happens in Persona model - Merge order (highest to lowest priority): 1. snapshot.config 2. crawl.config 3. user.config 4. persona.get_derived_config() <- NEW 5. environment variables 6. ArchiveBox.conf file 7. plugin defaults 8. core defaults Usage: config = get_config(persona=crawl.persona, crawl=crawl) config['CHROME_USER_DATA_DIR'] # derived from persona	2025-12-31 01:07:29 +00:00
Claude	877b5f91c2	Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system - Add _derive_persona_paths() in configset.py to automatically derive CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA when not explicitly set. This allows plugins to use these paths without knowing about the persona system. - Update chrome_utils.js launchChromium() to accept userDataDir option and pass --user-data-dir to Chrome. Also cleans up SingletonLock before launch. - Update killZombieChrome() to clean up SingletonLock files from all persona chrome_user_data directories after killing zombies. - Update chrome_cleanup() in misc/util.py to handle persona-based user data directories when cleaning up stale Chrome state. - Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from env (derived by get_config()). Config priority flow: ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot) -> get_config() derives: CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions -> hooks receive these as env vars without needing persona logic	2025-12-31 00:21:07 +00:00
claude[bot]	762cddc8c5	fix: address PR review comments from cubic-dev-ai - Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency - Fix fallback logic in legacy.py to try JSON when JSONL parsing fails - Replace bare except clauses with specific exception types - Fix stdin double-consumption in archivebox_crawl.py - Merge CLI --tag option with crawl tags in archivebox_snapshot.py - Remove tautological mock tests (covered by integration tests) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:09:51 +00:00
Claude	d36079829b	feat: replace index.json with index.jsonl flat JSONL format Switch from hierarchical index.json to flat index.jsonl format for snapshot metadata storage. Each line is a self-contained JSON record with a 'type' field (Snapshot, ArchiveResult, Binary, Process). Changes: - Add JSONL_INDEX_FILENAME constant to constants.py - Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants - Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters - Add Snapshot.write_index_jsonl() to write new format - Add Snapshot.read_index_jsonl() to read new format - Add Snapshot.convert_index_json_to_jsonl() for migration - Update Snapshot.reconcile_with_index() to handle both formats - Update fs_migrate to convert during filesystem migration - Update load_from_directory/create_from_directory for both formats - Update legacy.py parse_json_links_details for JSONL support The new format is easier to parse, extend, and mix record types.	2025-12-30 18:21:06 +00:00
claude[bot]	329d185d95	Fix: Make CUSTOM_TEMPLATES_DIR configurable again Resolves issue #1484 where CUSTOM_TEMPLATES_DIR configuration was being ignored. The setting was previously removed from ServerConfig and hardcoded as a constant, preventing users from customizing the templates directory location. Changes: - Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py - Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR - Updated paths.py to use configurable value in version output Users can now configure the custom templates directory via: - ArchiveBox.conf: CUSTOM_TEMPLATES_DIR = ./custom_templates - Environment variable: export CUSTOM_TEMPLATES_DIR=/path/to/templates - Defaults to DATA_DIR/user_templates if not configured 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-29 21:50:21 +00:00
Claude	88d7906033	Add MAX_URL_ATTEMPTS config option to stop retries after too many failures Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops retrying ArchiveResult hooks for a snapshot once that many failures have been recorded. This prevents infinite retry loops for problematic URLs. When the limit is reached, any pending ArchiveResults for that snapshot are marked as SKIPPED with an explanatory message.	2025-12-29 20:20:50 +00:00
Nick Sweeting	30c60eef76	much better tests and add page ui	2025-12-29 04:02:11 -08:00
Nick Sweeting	f4e7820533	use full dotted paths for all archivebox imports, add migrations and more fixes	2025-12-29 00:47:08 -08:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Nick Sweeting	9838d7ba02	tons of ui fixes and plugin fixes	2025-12-25 03:59:51 -08:00
Nick Sweeting	bb53228ebf	remove Seed model in favor of Crawl as template	2025-12-25 01:52:41 -08:00
Nick Sweeting	866f993f26	logging and admin ui improvements	2025-12-25 01:10:41 -08:00
Nick Sweeting	d95f0dc186	remove huey	2025-12-24 23:40:18 -08:00
Nick Sweeting	6c769d831c	wip 2	2025-12-24 21:46:14 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00
Nick Sweeting	ac53fdf677	make chrome binary and configs directly runnable and make extractor use external bin	2024-12-06 02:06:39 -08:00
Nick Sweeting	c9a05c9d94	working archivebox update CLI cmd	2024-11-19 02:32:05 -08:00
Nick Sweeting	328eb98a38	move main funcs into cli files and switch to using click for CLI	2024-11-19 00:18:51 -08:00
Nick Sweeting	4a5d607296	move logging_util into archivebox.misc subfolder	2024-11-18 19:08:49 -08:00
Nick Sweeting	e469c5a344	merge queues and actors apps into new workers app	2024-11-18 18:52:48 -08:00
Nick Sweeting	67c22b2df0	fix config set not working with constants	2024-11-18 04:27:37 -08:00
Nick Sweeting	c8e186f21b	fix plugin loading order, admin, abx-pkg	2024-11-16 06:44:12 -08:00
Nick Sweeting	684a394cba	add HOSTNAME to config.permissions	2024-11-16 02:45:58 -08:00
Nick Sweeting	9b24fe7390	merge dev	2024-11-02 17:34:33 -07:00
Nick Sweeting	721427a484	hide progress bar on startup	2024-10-31 07:11:15 -07:00
Nick Sweeting	d93aa46949	fix django.forms.JSONField does not exist 500 error	2024-10-28 18:47:45 -07:00
Nick Sweeting	b3c1cb716e	move abx plugins inside vendor dir	2024-10-28 04:07:35 -07:00
Nick Sweeting	60f0458c77	rename configfile to collection	2024-10-24 15:40:24 -07:00
Nick Sweeting	9e40dd69a4	more config improvements, move away from settings GLOBALS to getters	2024-10-24 14:50:07 -07:00
Nick Sweeting	312e40b95b	finally get rid of config/legacy in favor of configfile.py and django.py	2024-10-21 03:06:19 -07:00
Nick Sweeting	b3107ab830	move final legacy config to plugins and fix archivebox config cmd and add search opt	2024-10-21 02:56:00 -07:00
Nick Sweeting	7a6f1f36d2	trigger abx.pm.hook.ready from core.AppConfig.ready	2024-10-21 01:31:02 -07:00
Nick Sweeting	a211461ffc	fix LIB_DIR and TMP_DIR loading when primary option isnt available	2024-10-21 00:35:56 -07:00
Nick Sweeting	80d8a6b667	split archivebox.use into archivebox.reads and archivebox.writes	2024-10-15 01:03:01 -07:00
Nick Sweeting	df79b8e038	rename config sections to match old sections	2024-10-15 01:01:34 -07:00
Nick Sweeting	01ba6d49d3	new vastly simplified plugin spec without pydantic	2024-10-14 21:50:47 -07:00
Nick Sweeting	86380a1ef2	fix .archivebox_id being created outside collection dir	2024-10-14 17:35:43 -07:00
Nick Sweeting	6e7071bd19	add new binproviders and binaries args to install and version, bump pydantic-pkgr version	2024-10-11 00:45:59 -07:00
Nick Sweeting	0c29e08f73	avoid creating collection id file on every startup since its not needed	2024-10-09 19:12:08 -07:00
Nick Sweeting	de7ab65f11	ignore errors when chowning at initial startup	2024-10-09 04:48:09 -07:00
Nick Sweeting	ad675a8e7c	properly handle chowning DATA_DIR on init when using sudo	2024-10-09 04:39:09 -07:00
Nick Sweeting	1b7aca130b	properly detect sudo UID	2024-10-09 04:02:46 -07:00
Nick Sweeting	db65af898b	correctly update environment HOME and USER vars when dropping permissions	2024-10-09 03:18:04 -07:00
Nick Sweeting	613caec8eb	improve install flow with sudo, check package managers, and fix docker build	2024-10-09 00:41:16 -07:00
Nick Sweeting	7c34f2bc90	hide errors if user is just getting help or version info	2024-10-08 19:20:03 -07:00
Nick Sweeting	9f274cf9f4	remove platformdirs dependency	2024-10-08 19:17:18 -07:00
Nick Sweeting	4b34b729ab	fuck it go back to nested lib and tmp dirs with supervisord sock workaround	2024-10-08 17:48:59 -07:00
Nick Sweeting	1888691ee8	try creating shared libs as 777 when running as root	2024-10-08 17:10:56 -07:00
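
Note on commit b8a66c4a84: its message lists a config merge order from highest to lowest priority (snapshot.config, crawl.config, user.config, persona.get_derived_config(), environment variables, ArchiveBox.conf, plugin defaults, core defaults). The sketch below only illustrates that ordering with plain dicts. It is not ArchiveBox's get_config() implementation; the helper name merge_config_layers, the layer contents, and the EXAMPLE_PLUGIN_OPTION key are assumptions made for illustration.

```python
from collections import ChainMap


def merge_config_layers(*layers):
    """Merge config dicts; layers listed first take priority over later ones."""
    # ChainMap resolves keys left to right, so the first layer that
    # defines a key wins. A layer given as None is treated as empty.
    return dict(ChainMap(*[layer or {} for layer in layers]))


# Hypothetical layer contents, ordered highest to lowest priority
# in the order the commit message lists them.
snapshot_config = {"TIMEOUT": 120}
crawl_config = {"TIMEOUT": 60, "ACTIVE_PERSONA": "WorkAccount"}
user_config = None
persona_derived = {"CHROME_USER_DATA_DIR": "personas/WorkAccount/chrome_user_data"}
env_vars = {"CHROME_USER_DATA_DIR": "/tmp/chrome", "TIMEOUT": 30}
conf_file = {"CUSTOM_TEMPLATES_DIR": "./custom_templates"}
plugin_defaults = {"EXAMPLE_PLUGIN_OPTION": True}  # made-up key
core_defaults = {"TIMEOUT": 10, "MAX_URL_ATTEMPTS": 50}

merged = merge_config_layers(
    snapshot_config,
    crawl_config,
    user_config,
    persona_derived,
    env_vars,
    conf_file,
    plugin_defaults,
    core_defaults,
)

print(merged["TIMEOUT"])               # snapshot layer wins
print(merged["CHROME_USER_DATA_DIR"])  # persona layer beats env vars
```

In this sketch the persona-derived path overrides the environment variable because the commit message places persona.get_derived_config() above environment variables in the merge chain.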

170 Commits
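
Note on commit d36079829b: its message describes index.jsonl as a flat file in which each line is a self-contained JSON record carrying a 'type' field (Snapshot, ArchiveResult, Binary, Process). The following is a minimal sketch of writing and reading such a file, assuming nothing about the real schema beyond the 'type' field. The function names dump_jsonl and load_jsonl and every other record field shown are hypothetical, and this is not the code behind Snapshot.write_index_jsonl() or Snapshot.read_index_jsonl().

```python
import json
from collections import defaultdict


def dump_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def load_jsonl(text):
    """Parse JSONL text and group the records by their 'type' field."""
    by_type = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        by_type[record.get("type", "Unknown")].append(record)
    return dict(by_type)


# Hypothetical records: only the 'type' field comes from the commit message.
records = [
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "ArchiveResult", "status": "succeeded"},
    {"type": "ArchiveResult", "status": "failed"},
    {"type": "Binary", "name": "chrome"},
]

text = dump_jsonl(records)
grouped = load_jsonl(text)
print(sorted(grouped))
print(len(grouped["ArchiveResult"]))
```

Because every line parses on its own, a reader can skip record types it does not understand, which matches the commit's remark that the format is easier to extend and to mix record types in.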