75 Commits

Author SHA1 Message Date
Nick Sweeting
3da523fc74 more consistent crawl, snapshot, hook cleanup and Process tracking 2026-01-02 04:27:38 -08:00
Nick Sweeting
dd77511026 unified Process source of truth and better screenshot tests 2026-01-02 04:20:34 -08:00
claude[bot]
b2132d1f14 Fix cubic review issues: process_type detection, cmd storage, PID cleanup, and migration
- Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation
- Fix worker process_type detection: explicitly set to WORKER after registration
- Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently
- Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN)
- Fix get_running_workers() to return process_id instead of incorrectly named worker_id
- Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed
- Fix migration to include all indexes in state_operations (parent_id, process_type)
- Fix documentation to use Machine.current() scoping and StatusChoices constants

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:42:07 +00:00
Nick Sweeting
84a4fb0785 fix cubic comments 2025-12-30 23:53:47 -08:00
Nick Sweeting
f7b186d7c8 Apply suggestion from @cubic-dev-ai[bot]
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-12-31 02:31:46 -05:00
Claude
503a2f77cb Add Persona class with cleanup_chrome() method
- Create Persona class in personas/models.py for managing browser
  profiles/identities used for archiving sessions

- Each Persona has:
  - chrome_user_data_dir: Chrome profile directory
  - chrome_extensions_dir: Installed extensions
  - cookies_file: Cookies for wget/curl
  - config_file: Persona-specific config overrides

- Add Persona methods:
  - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files
  - get_config(): Load persona config from config.json
  - save_config(): Save persona config to config.json
  - ensure_dirs(): Create persona directory structure
  - all(): Iterator over all personas
  - get_active(): Get persona based on ACTIVE_PERSONA config
  - cleanup_chrome_all(): Clean up all personas

- Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all()
  instead of manual directory iteration

- Add convenience functions:
  - cleanup_chrome_for_persona(name)
  - cleanup_chrome_all_personas()
2025-12-31 00:59:37 +00:00
Claude
877b5f91c2 Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system
- Add _derive_persona_paths() in configset.py to automatically derive
  CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA
  when not explicitly set. This allows plugins to use these paths
  without knowing about the persona system.

- Update chrome_utils.js launchChromium() to accept userDataDir option
  and pass --user-data-dir to Chrome. Also cleans up SingletonLock
  before launch.

- Update killZombieChrome() to clean up SingletonLock files from all
  persona chrome_user_data directories after killing zombies.

- Update chrome_cleanup() in misc/util.py to handle persona-based
  user data directories when cleaning up stale Chrome state.

- Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR
  and CHROME_EXTENSIONS_DIR from env (derived by get_config()).

Config priority flow:
  ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot)
  -> get_config() derives:
     CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data
     CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions
  -> hooks receive these as env vars without needing persona logic
2025-12-31 00:21:07 +00:00
Nick Sweeting
dd2302ad92 new jsonl cli interface 2025-12-30 16:12:53 -08:00
claude[bot]
762cddc8c5 fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00
Claude
69965a2782 fix: correct CLI pipeline data flow for crawl -> snapshot -> extract
- archivebox crawl: creates Crawl records, outputs Crawl JSONL
- archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL
- archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL

Changes:
- Add Crawl.from_jsonl() method for creating Crawl from JSONL records
- Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them
- Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs
- Update jsonl.py docstring to document the pipeline
2025-12-30 19:42:41 +00:00
Claude
ae648c9bc1 refactor: move remaining JSONL methods to models, clean up jsonl.py
- Add Tag.to_jsonl() method with schema_version
- Add Crawl.to_jsonl() method with schema_version
- Fix Tag.from_jsonl() to not depend on jsonl.py helper
- Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot

Remove model-specific functions from misc/jsonl.py:
- tag_to_jsonl() - use Tag.to_jsonl() instead
- crawl_to_jsonl() - use Crawl.to_jsonl() instead
- get_or_create_tag() - use Tag.from_jsonl() instead
- process_jsonl_records() - use model from_jsonl() methods directly

jsonl.py now only contains generic I/O utilities:
- Type constants (TYPE_SNAPSHOT, etc.)
- parse_line(), read_stdin(), read_file(), read_args_or_stdin()
- write_record(), write_records()
- filter_by_type(), process_records()
2025-12-30 19:30:18 +00:00
Claude
bc273c5a7f feat: add schema_version to JSONL outputs and remove dead code
- Add schema_version (archivebox.VERSION) to all to_jsonl() outputs:
  - Snapshot.to_jsonl()
  - ArchiveResult.to_jsonl()
  - Binary.to_jsonl()
  - Process.to_jsonl()

- Update CLI commands to use model methods directly:
  - archivebox_snapshot.py: snapshot.to_jsonl()
  - archivebox_extract.py: result.to_jsonl()

- Remove dead wrapper functions from misc/jsonl.py:
  - snapshot_to_jsonl()
  - archiveresult_to_jsonl()
  - binary_to_jsonl()
  - process_to_jsonl()
  - machine_to_jsonl()

- Update tests to use model methods directly
2025-12-30 19:24:53 +00:00
Claude
a5206e7648 refactor: move to_jsonl() methods to models
Move JSONL serialization from standalone functions to model methods
to mirror the from_jsonl() pattern:

- Add Binary.to_jsonl() method
- Add Process.to_jsonl() method
- Add ArchiveResult.to_jsonl() method
- Add Snapshot.to_jsonl() method
- Update write_index_jsonl() to use model methods
- Update jsonl.py functions to be thin wrappers
2025-12-30 18:35:22 +00:00
Claude
d36079829b feat: replace index.json with index.jsonl flat JSONL format
Switch from hierarchical index.json to flat index.jsonl format for
snapshot metadata storage. Each line is a self-contained JSON record
with a 'type' field (Snapshot, ArchiveResult, Binary, Process).

Changes:
- Add JSONL_INDEX_FILENAME constant to constants.py
- Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants
- Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters
- Add Snapshot.write_index_jsonl() to write new format
- Add Snapshot.read_index_jsonl() to read new format
- Add Snapshot.convert_index_json_to_jsonl() for migration
- Update Snapshot.reconcile_with_index() to handle both formats
- Update fs_migrate to convert during filesystem migration
- Update load_from_directory/create_from_directory for both formats
- Update legacy.py parse_json_links_details for JSONL support

The new format is easier to parse, extend, and mix record types.
2025-12-30 18:21:06 +00:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Nick Sweeting
4ccb0863bb continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script 2025-12-28 05:29:24 -08:00
Nick Sweeting
bd265c0083 rename extractor to plugin everywhere 2025-12-28 04:43:15 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
b632894bc9 Update views, API, and exports for new ArchiveResult output fields
Replace old `output` field with new fields across the codebase:
- output_str: Human-readable output summary
- output_json: Structured metadata (optional)
- output_files: Dict of output files with metadata
- output_size: Total size in bytes
- output_mimetypes: CSV of file mimetypes

Files updated:
- api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields
- api/v1_core.py: Update ArchiveResultFilterSchema to search output_str
- cli/archivebox_extract.py: Use output_str in CLI output
- core/admin_archiveresults.py: Update admin fields, search, and fieldsets
- core/admin_archiveresults.py: Fix output_html variable name bug in output_summary
- misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields
- plugins/extractor_utils.py: Update ExtractorResult helper class

The embed_path() method already uses output_files and output_str,
so snapshot detail page and template tags work correctly.
2025-12-27 20:28:22 +00:00
Claude
c3acadd528 Remove extractor field from Crawl model and fix tests
- Remove extractor field from Crawl model (moved to config dict)
- Update migration 0002_drop_seed_model to not add extractor
- Update archivebox_add.py to use config['PARSER'] instead
- Update admin.py recrawl to not pass extractor
- Update jsonl.py serialization to not include extractor
- Update test schema SCHEMA_0_8 to not include extractor
- Set default timeout to 60s for test commands
2025-12-27 01:49:09 +00:00
Nick Sweeting
4fd7fcdbcf new gallerydl plugin and more 2025-12-26 11:55:03 -08:00
Nick Sweeting
9838d7ba02 tons of ui fixes and plugin fixes 2025-12-25 03:59:51 -08:00
Nick Sweeting
bb53228ebf remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186 remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00
Ben Muthalaly
71c02ca4eb Update archivebox/misc/logging_util.py
Co-authored-by: Nick Sweeting <git@sweeting.me>
2025-02-05 17:55:45 -06:00
Ben Muthalaly
9f4cf0a8e1 Kill the timer process if it doesn't properly terminate. 2025-02-03 02:47:33 -06:00
Nick Sweeting
c5fc4068f4 fix unneeded import 2024-12-18 18:09:21 -08:00
Nick Sweeting
7975b47c85 remove dependencies on unneeded libraries 2024-12-18 18:07:35 -08:00
Nick Sweeting
d192eb5c48 add filestore content addressible store draft 2024-12-04 02:15:04 -08:00
Nick Sweeting
a3fe78afaa add basename to hashing get_dir_info 2024-12-04 02:15:04 -08:00
Nick Sweeting
eae7ed8447 add hashing misc library for merkle tree generation 2024-12-03 02:12:20 -08:00
Nick Sweeting
2595139180 improve statemachine logging and archivebox update CLI cmd 2024-11-19 03:31:05 -08:00
Nick Sweeting
c9a05c9d94 working archivebox update CLI cmd 2024-11-19 02:32:05 -08:00
Nick Sweeting
328eb98a38 move main funcs into cli files and switch to using click for CLI 2024-11-19 00:18:51 -08:00
Nick Sweeting
4c25e90378 move monkey_patches.py into archivebox.misc subfolder 2024-11-18 19:10:42 -08:00
Nick Sweeting
4a5d607296 move logging_util into archivebox.misc subfolder 2024-11-18 19:08:49 -08:00
Nick Sweeting
b3c1cb716e move abx plugins inside vendor dir 2024-10-28 04:07:35 -07:00
Nick Sweeting
4b6f08b0fe swap more direct settings.CONFIG access to abx getters 2024-10-24 15:42:19 -07:00
Nick Sweeting
60f0458c77 rename configfile to collection 2024-10-24 15:40:24 -07:00
Nick Sweeting
657eec479b fix CONSTANTS.LIB_DIR old style access 2024-10-21 03:20:20 -07:00
Nick Sweeting
b3107ab830 move final legacy config to plugins and fix archivebox config cmd and add search opt 2024-10-21 02:56:00 -07:00
Nick Sweeting
a211461ffc fix LIB_DIR and TMP_DIR loading when primary option isnt available 2024-10-21 00:35:56 -07:00
Nick Sweeting
bb9c3fda14 fix makemigrations being blocked by check_migrations func 2024-10-14 17:40:06 -07:00
Nick Sweeting
9a04ed7c76 move serve_static and shell_welcome_message into misc 2024-10-14 17:35:28 -07:00
Nick Sweeting
f75ae805f8 comment out Crawl api methods temporarily 2024-10-14 15:41:58 -07:00
Nick Sweeting
2f68a1d476 fix ldap lib loading after apt install 2024-10-09 04:03:02 -07:00
Nick Sweeting
afc24e802a tweak version log output 2024-10-09 03:18:22 -07:00