Commit Graph

170 Commits

Author SHA1 Message Date
Claude
04c23badc2 Fix output path structure for 0.9.x data directory
- Update Crawl.output_dir_parent to use username instead of user_id
  for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging:
  users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
  can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
2025-12-31 08:18:24 +00:00
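
The path layout named in commit 04c23badc2 can be sketched roughly as below. This is only an illustration: crawl_output_dir() and its parameters are hypothetical, not ArchiveBox functions, and it assumes "domain" means the hostname of the crawl's first URL.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def crawl_output_dir(username: str, first_url: str, crawl_id: str,
                     created_at: datetime) -> PurePosixPath:
    """Build users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id} (hypothetical helper)."""
    # Assumption: the domain is the hostname of the first URL; fall back if there is none
    domain = urlparse(first_url).hostname or "unknown"
    date_part = created_at.strftime("%Y%m%d")
    return PurePosixPath("users") / username / "crawls" / date_part / domain / crawl_id


example = crawl_output_dir(
    username="alice",
    first_url="https://example.com/some/page",
    crawl_id="abc123",
    created_at=datetime(2025, 12, 31, tzinfo=timezone.utc),
)
print(example)
```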
Claude
b8a66c4a84 Convert Persona to Django ModelWithConfig, add to get_config()
- Convert Persona from plain Python class to Django model with ModelWithConfig
- Add config JSONField for persona-specific config overrides
- Add get_derived_config() method that returns config with derived paths:
  - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA

- Update get_config() to accept persona parameter in merge chain:
  get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot)

- Remove _derive_persona_paths() - derivation now happens in Persona model

- Merge order (highest to lowest priority):
  1. snapshot.config
  2. crawl.config
  3. user.config
  4. persona.get_derived_config()  <- NEW
  5. environment variables
  6. ArchiveBox.conf file
  7. plugin defaults
  8. core defaults

Usage:
  config = get_config(persona=crawl.persona, crawl=crawl)
  config['CHROME_USER_DATA_DIR']  # derived from persona
2025-12-31 01:07:29 +00:00
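
The merge order listed in commit b8a66c4a84 behaves like a chain of lookups where the first layer that defines a key wins. A minimal sketch using the standard library's ChainMap, assuming each layer is a plain dict; merge_config() and the sample keys and values are hypothetical and are not the real get_config() implementation:

```python
from collections import ChainMap


def merge_config(*layers):
    """Merge config layers given from highest to lowest priority (hypothetical helper)."""
    # ChainMap returns the value from the first mapping that contains the key,
    # so earlier layers override later ones. Empty or missing layers are skipped.
    return dict(ChainMap(*[layer for layer in layers if layer]))


# Sample layers, ordered as in the commit message (values are made up)
snapshot_config = {"CHROME_HEADLESS": False}
crawl_config = {"TIMEOUT": 30}
user_config = None
persona_derived = {"CHROME_USER_DATA_DIR": "personas/WorkAccount/chrome_user_data"}
env_vars = {}
conf_file = {"TIMEOUT": 120}
plugin_defaults = {"CHROME_HEADLESS": True}
core_defaults = {"TIMEOUT": 60}

config = merge_config(
    snapshot_config, crawl_config, user_config, persona_derived,
    env_vars, conf_file, plugin_defaults, core_defaults,
)
print(config["TIMEOUT"], config["CHROME_HEADLESS"])
```

Here crawl.config overrides both the conf file and the core default for TIMEOUT, and the persona-derived path fills in a key that no higher layer sets.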
Claude
877b5f91c2 Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system
- Add _derive_persona_paths() in configset.py to automatically derive
  CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA
  when not explicitly set. This allows plugins to use these paths
  without knowing about the persona system.

- Update chrome_utils.js launchChromium() to accept userDataDir option
  and pass --user-data-dir to Chrome. Also cleans up SingletonLock
  before launch.

- Update killZombieChrome() to clean up SingletonLock files from all
  persona chrome_user_data directories after killing zombies.

- Update chrome_cleanup() in misc/util.py to handle persona-based
  user data directories when cleaning up stale Chrome state.

- Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR
  and CHROME_EXTENSIONS_DIR from env (derived by get_config()).

Config priority flow:
  ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot)
  -> get_config() derives:
     CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data
     CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions
  -> hooks receive these as env vars without needing persona logic
2025-12-31 00:21:07 +00:00
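
The derivation flow described in commit 877b5f91c2 might look something like the sketch below. derive_persona_paths() here is a stand-in written for illustration; the real _derive_persona_paths() in configset.py may differ, and the config is assumed to be a plain dict.

```python
from pathlib import PurePosixPath


def derive_persona_paths(config: dict) -> dict:
    """Fill in Chrome paths from ACTIVE_PERSONA unless they are already set (illustrative)."""
    derived = dict(config)  # copy so the caller's dict is left untouched
    persona = derived.get("ACTIVE_PERSONA")
    if not persona:
        return derived
    persona_dir = PurePosixPath(derived["PERSONAS_DIR"]) / persona
    defaults = {
        "CHROME_USER_DATA_DIR": persona_dir / "chrome_user_data",
        "CHROME_EXTENSIONS_DIR": persona_dir / "chrome_extensions",
    }
    for key, path in defaults.items():
        # Explicitly configured values take precedence over derived ones
        if not derived.get(key):
            derived[key] = str(path)
    return derived


derived = derive_persona_paths({
    "PERSONAS_DIR": "/data/personas",
    "ACTIVE_PERSONA": "WorkAccount",
})
print(derived["CHROME_USER_DATA_DIR"])
```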
claude[bot]
762cddc8c5 fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00
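
Two of the fixes in commit 762cddc8c5 (falling back to JSON when JSONL parsing fails, and catching specific exception types instead of using a bare except) can be illustrated together. This is a sketch under assumptions: parse_index_text() is a hypothetical function and does not reproduce the code in legacy.py.

```python
import json


def parse_index_text(text: str) -> list:
    """Try line-delimited JSON first, then fall back to a single JSON document."""
    try:
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
        if records and all(isinstance(record, dict) for record in records):
            return records
    except json.JSONDecodeError:
        pass  # not valid JSONL, try the whole text as one JSON document below
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # neither format could be parsed
    return data if isinstance(data, list) else [data]


jsonl_text = '{"type": "Snapshot"}\n{"type": "ArchiveResult"}\n'
json_text = '{\n  "type": "Snapshot",\n  "history": {}\n}'
print(len(parse_index_text(jsonl_text)), len(parse_index_text(json_text)))
```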
Claude
d36079829b feat: replace index.json with index.jsonl flat JSONL format
Switch from hierarchical index.json to flat index.jsonl format for
snapshot metadata storage. Each line is a self-contained JSON record
with a 'type' field (Snapshot, ArchiveResult, Binary, Process).

Changes:
- Add JSONL_INDEX_FILENAME constant to constants.py
- Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants
- Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters
- Add Snapshot.write_index_jsonl() to write new format
- Add Snapshot.read_index_jsonl() to read new format
- Add Snapshot.convert_index_json_to_jsonl() for migration
- Update Snapshot.reconcile_with_index() to handle both formats
- Update fs_migrate to convert during filesystem migration
- Update load_from_directory/create_from_directory for both formats
- Update legacy.py parse_json_links_details for JSONL support

The new format is easier to parse, extend, and mix record types.
2025-12-30 18:21:06 +00:00
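
The flat format described in commit d36079829b, with one self-contained JSON record per line and a 'type' field on each, can be sketched as follows. The function names mirror the commit message but these are standalone illustrations, not the Snapshot methods themselves; every field other than 'type' is made up for the example.

```python
import io
import json


def write_index_jsonl(records, fp):
    """Write one JSON object per line."""
    for record in records:
        fp.write(json.dumps(record, sort_keys=True) + "\n")


def read_index_jsonl(fp):
    """Read records back and group them by their 'type' field."""
    by_type = {}
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        by_type.setdefault(record["type"], []).append(record)
    return by_type


records = [
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "ArchiveResult", "plugin": "wget", "status": "succeeded"},
    {"type": "ArchiveResult", "plugin": "screenshot", "status": "failed"},
]
buffer = io.StringIO()
write_index_jsonl(records, buffer)
buffer.seek(0)
grouped = read_index_jsonl(buffer)
print(sorted(grouped))
```

Because each line parses on its own, new record types can be appended without rewriting or re-nesting the existing records.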
claude[bot]
329d185d95 Fix: Make CUSTOM_TEMPLATES_DIR configurable again
Resolves issue #1484 where CUSTOM_TEMPLATES_DIR configuration was
being ignored. The setting was previously removed from ServerConfig
and hardcoded as a constant, preventing users from customizing the
templates directory location.

Changes:
- Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py
- Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR
- Updated paths.py to use configurable value in version output

Users can now configure the custom templates directory via:
- ArchiveBox.conf: CUSTOM_TEMPLATES_DIR = ./custom_templates
- Environment variable: export CUSTOM_TEMPLATES_DIR=/path/to/templates
- Defaults to DATA_DIR/user_templates if not configured

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-29 21:50:21 +00:00
Claude
88d7906033 Add MAX_URL_ATTEMPTS config option to stop retries after too many failures
Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops
retrying ArchiveResult hooks for a snapshot once that many failures have
been recorded. This prevents infinite retry loops for problematic URLs.

When the limit is reached, any pending ArchiveResults for that snapshot
are marked as SKIPPED with an explanatory message.
2025-12-29 20:20:50 +00:00
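
The retry cutoff described in commit 88d7906033 could be sketched like this. It is a simplified model and not the actual implementation: results are assumed to be plain dicts, the "failed" and "queued" status names are assumptions, and apply_attempt_limit() is a hypothetical helper.

```python
MAX_URL_ATTEMPTS = 50  # default stated in the commit message


def apply_attempt_limit(results, max_attempts=MAX_URL_ATTEMPTS):
    """Return results with pending entries marked skipped once the failure limit is hit."""
    # Assumption: each result is a dict with a lowercase "status" string
    failures = sum(1 for result in results if result["status"] == "failed")
    if failures < max_attempts:
        return list(results)
    message = f"Skipped: snapshot reached {failures} failures (limit {max_attempts})"
    return [
        {**result, "status": "skipped", "output": message}
        if result["status"] == "queued" else result
        for result in results
    ]


results = [
    {"plugin": "wget", "status": "failed"},
    {"plugin": "pdf", "status": "failed"},
    {"plugin": "screenshot", "status": "queued"},
]
limited = apply_attempt_limit(results, max_attempts=2)
print([result["status"] for result in limited])
```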
Nick Sweeting
30c60eef76 much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533 use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Nick Sweeting
9838d7ba02 tons of ui fixes and plugin fixes 2025-12-25 03:59:51 -08:00
Nick Sweeting
bb53228ebf remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186 remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
ac53fdf677 make chrome binary and configs directly runnable and make extractor use external bin 2024-12-06 02:06:39 -08:00
Nick Sweeting
c9a05c9d94 working archivebox update CLI cmd 2024-11-19 02:32:05 -08:00
Nick Sweeting
328eb98a38 move main funcs into cli files and switch to using click for CLI 2024-11-19 00:18:51 -08:00
Nick Sweeting
4a5d607296 move logging_util into archivebox.misc subfolder 2024-11-18 19:08:49 -08:00
Nick Sweeting
e469c5a344 merge queues and actors apps into new workers app 2024-11-18 18:52:48 -08:00
Nick Sweeting
67c22b2df0 fix config set not working with constants 2024-11-18 04:27:37 -08:00
Nick Sweeting
c8e186f21b fix plugin loading order, admin, abx-pkg 2024-11-16 06:44:12 -08:00
Nick Sweeting
684a394cba add HOSTNAME to config.permissions 2024-11-16 02:45:58 -08:00
Nick Sweeting
9b24fe7390 merge dev 2024-11-02 17:34:33 -07:00
Nick Sweeting
721427a484 hide progress bar on startup 2024-10-31 07:11:15 -07:00
Nick Sweeting
d93aa46949 fix django.forms.JSONField does not exist 500 error 2024-10-28 18:47:45 -07:00
Nick Sweeting
b3c1cb716e move abx plugins inside vendor dir 2024-10-28 04:07:35 -07:00
Nick Sweeting
60f0458c77 rename configfile to collection 2024-10-24 15:40:24 -07:00
Nick Sweeting
9e40dd69a4 more config improvements, move away from settings GLOBALS to getters 2024-10-24 14:50:07 -07:00
Nick Sweeting
312e40b95b finally get rid of config/legacy in favor of configfile.py and django.py 2024-10-21 03:06:19 -07:00
Nick Sweeting
b3107ab830 move final legacy config to plugins and fix archivebox config cmd and add search opt 2024-10-21 02:56:00 -07:00
Nick Sweeting
7a6f1f36d2 trigger abx.pm.hook.ready from core.AppConfig.ready 2024-10-21 01:31:02 -07:00
Nick Sweeting
a211461ffc fix LIB_DIR and TMP_DIR loading when primary option isn't available 2024-10-21 00:35:56 -07:00
Nick Sweeting
80d8a6b667 split archivebox.use into archivebox.reads and archivebox.writes 2024-10-15 01:03:01 -07:00
Nick Sweeting
df79b8e038 rename config sections to match old sections 2024-10-15 01:01:34 -07:00
Nick Sweeting
01ba6d49d3 new vastly simplified plugin spec without pydantic 2024-10-14 21:50:47 -07:00
Nick Sweeting
86380a1ef2 fix .archivebox_id being created outside collection dir 2024-10-14 17:35:43 -07:00
Nick Sweeting
6e7071bd19 add new binproviders and binaries args to install and version, bump pydantic-pkgr version 2024-10-11 00:45:59 -07:00
Nick Sweeting
0c29e08f73 avoid creating collection id file on every startup since it's not needed 2024-10-09 19:12:08 -07:00
Nick Sweeting
de7ab65f11 ignore errors when chowning at initial startup 2024-10-09 04:48:09 -07:00
Nick Sweeting
ad675a8e7c properly handle chowning DATA_DIR on init when using sudo 2024-10-09 04:39:09 -07:00
Nick Sweeting
1b7aca130b properly detect sudo UID 2024-10-09 04:02:46 -07:00
Nick Sweeting
db65af898b correctly update environment HOME and USER vars when dropping permissions 2024-10-09 03:18:04 -07:00
Nick Sweeting
613caec8eb improve install flow with sudo, check package managers, and fix docker build 2024-10-09 00:41:16 -07:00
Nick Sweeting
7c34f2bc90 hide errors if user is just getting help or version info 2024-10-08 19:20:03 -07:00
Nick Sweeting
9f274cf9f4 remove platformdirs dependency 2024-10-08 19:17:18 -07:00
Nick Sweeting
4b34b729ab fuck it go back to nested lib and tmp dirs with supervisord sock workaround 2024-10-08 17:48:59 -07:00
Nick Sweeting
1888691ee8 try creating shared libs as 777 when running as root 2024-10-08 17:10:56 -07:00