90 Commits

Author SHA1 Message Date
Nick Sweeting
dd77511026 unified Process source of truth and better screenshot tests 2026-01-02 04:20:34 -08:00
Nick Sweeting
65ee09ceab move tests into subfolder, add missing install hooks 2026-01-02 00:22:07 -08:00
Nick Sweeting
60422adc87 fix orchestrator statemachine and Process from archiveresult migrations 2026-01-01 16:43:02 -08:00
Nick Sweeting
876feac522 actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage 2026-01-01 15:50:00 -08:00
Nick Sweeting
6fadcf5168 remove model health stats from models that dont need it 2026-01-01 15:50:00 -08:00
Nick Sweeting
f7457b13ad more migrations fixes attempts 2025-12-31 17:46:10 -08:00
Nick Sweeting
1c7b0cb2d3 working migrations again 2025-12-31 16:19:50 -08:00
Nick Sweeting
6521e7ddda more migrations fixes 2025-12-31 16:10:56 -08:00
Nick Sweeting
a04e4a7345 cleanup migrations, json, jsonl 2025-12-31 15:36:43 -08:00
Nick Sweeting
73fde81fce more migrations tweaks 2025-12-31 12:34:31 -08:00
Nick Sweeting
72f6a91b31 more progress bar and migrations fixes 2025-12-31 12:34:31 -08:00
Nick Sweeting
d5c0c64dcd fix progress bars 2025-12-31 12:34:29 -08:00
Nick Sweeting
3d8c62ffb1 fix extensions dir paths add personas migration 2025-12-31 01:40:59 -08:00
Nick Sweeting
91375d35a3 more migrations 2025-12-30 10:30:52 -08:00
Nick Sweeting
96ee1bf686 more migration fixes 2025-12-30 09:57:33 -08:00
Nick Sweeting
4cd2fceb8a even more migration fixes 2025-12-29 22:30:37 -08:00
Nick Sweeting
30c60eef76 much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533 use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Claude
1b5a816022 Implement hook step-based concurrency system
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
2025-12-28 13:47:25 +00:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
3d985fa8c8 Implement hook architecture with JSONL output support
Phase 1: Database migration for new ArchiveResult fields
- Add output_str (TextField) for human-readable summary
- Add output_json (JSONField) for structured metadata
- Add output_files (JSONField) for dict of {relative_path: {}}
- Add output_size (BigIntegerField) for total bytes
- Add output_mimetypes (CharField) for CSV of mimetypes
- Add binary FK to InstalledBinary (optional)
- Migrate existing 'output' field to new split fields

Phase 3: Update run_hook() for JSONL parsing
- Support new JSONL format (any line with {type: 'ModelName', ...})
- Maintain backwards compatibility with RESULT_JSON= format
- Add plugin metadata to each parsed record
- Detect background hooks with .bg. suffix in filename
- Add find_binary_for_cmd() helper function
- Add create_model_record() for processing side-effect records

Phase 6: Update ArchiveResult.run()
- Handle background hooks (return immediately when result is None)
- Process 'records' from HookResult for side-effect models
- Use new output fields (output_str, output_json, output_files, etc.)
- Call create_model_record() for InstalledBinary, Machine updates

Phase 7: Add background hook support
- Add is_background_hook() method to ArchiveResult
- Add check_background_completed() to check if process exited
- Add finalize_background_hook() to collect results from completed hooks
- Update SnapshotMachine.is_finished() to check/finalize background hooks
- Update _populate_output_fields() to walk directory and populate stats

Also updated references to old 'output' field in:
- admin_archiveresults.py
- statemachines.py
- templatetags/core_tags.py
2025-12-27 08:38:49 +00:00
Nick Sweeting
35dd9acafe implement fs_version migrations 2025-12-27 00:25:35 -08:00
Claude
ea6fe94c93 Add crawls_crawlschedule table to 0.8.x test schema and fix migrations
- Add missing crawls_crawlschedule table definition to SCHEMA_0_8 in test file
- Record all replaced dev branch migrations (0023-0074) for squashed migration
- Update 0024_snapshot_crawl migration to depend on squashed machine migration
- Remove 'extractor' field references from crawls admin
- All 45 migration tests now pass (0.4.x, 0.7.x, 0.8.x, fresh install)
2025-12-27 04:32:58 +00:00
Claude
766bb28536 Fix migration tests and M2M field alteration issue
- Remove M2M tags field alteration from migration 0027 (Django doesn't support altering M2M fields via migration)
- Add machine app tables to 0.8.x test schema
- Add missing columns (config, num_uses_failed, num_uses_succeeded) to 0.8.x test schema
- Skip 0.8.x migration tests due to complex migration state dependencies with machine app
- All 15 0.7.x migration tests now pass
- Merge dev branch and resolve pyproject.toml conflict (keep both uuid7 and gallery-dl deps)
2025-12-27 03:00:44 +00:00
Claude
13be196fd7 Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh
# Conflicts:
#	pyproject.toml
2025-12-27 02:27:51 +00:00
Nick Sweeting
e2cbcd17f6 more tests and migrations fixes 2025-12-26 18:22:48 -08:00
Claude
ae2ab5b273 Add Python 3.13 support with uuid7 backport compatibility
- Create uuid_compat.py module that provides uuid7 for Python <3.14
  using uuid_extensions package, and native uuid.uuid7 for Python 3.14+
- Update all model files and migrations to use archivebox.uuid_compat
- Add uuid7 conditional dependency in pyproject.toml for Python <3.14
- Update requires-python to >=3.13 (from >=3.14)
- Update GitHub workflows, lock_pkgs.sh to use Python 3.13
- Update tool configs (ruff, pyright, uv) for Python 3.13

This enables running ArchiveBox on Python 3.13 while maintaining
forward compatibility with Python 3.14's native uuid7 support.
2025-12-27 01:07:30 +00:00
Nick Sweeting
bb53228ebf remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
569081a9eb rename abid_utils to base_models 2024-11-18 19:40:05 -08:00
Nick Sweeting
d69df359ea remove Crawl migration in favor of separate app 2024-10-14 17:41:07 -07:00
Nick Sweeting
12f32c4690 fix tmp data dir resolution when running help or version outside data dir 2024-10-04 01:40:41 -07:00
Nick Sweeting
295c5c46e0 add new crawl model 2024-10-01 21:47:16 -07:00
Nick Sweeting
3e5b6ddeae move config into dedicated global app 2024-09-30 15:59:05 -07:00
Nick Sweeting
3f76e0a87f fix migrations import errors 2024-09-06 03:48:52 -07:00
Nick Sweeting
ed5357cec9 add migrations for datetime field renames 2024-09-04 23:44:13 -07:00
Nick Sweeting
cbf2a8fdc3 rename datetime fields to _at, massively improve ABID generation safety and determinism 2024-09-04 23:42:36 -07:00
Nick Sweeting
68a39b7392 remove .old_id entirely and make ABID generation only happen once on initial save 2024-09-04 16:40:15 -07:00
Nick Sweeting
1e73a06ba0 change ABIDModel.created to use AutoTimeField seeded on .save instead of auto_now_add so that ts_src for ABID is available on creation before DB row is created 2024-08-28 03:02:37 -07:00
Nick Sweeting
d0fefc0279 add chunk_size=500 to more iterator calls 2024-08-27 19:28:00 -07:00
Nick Sweeting
73a3e6aad0 handle tag with no slug or name 2024-08-22 18:25:15 -07:00
Nick Sweeting
1d31b88fa3 fix migration failing when Tag name is empty 2024-08-22 16:30:25 -07:00
Nick Sweeting
09553d8340 hardcode EXTRACTOR_CHOICES to prevent nondeterministic migrations 2024-08-22 15:36:02 -07:00
Nick Sweeting
afe1307617 fix created_by field migration to create User properly if none exists 2024-08-22 15:20:36 -07:00
Nick Sweeting
849b4963a1 add migrations 2024-08-20 01:58:44 -07:00
Nick Sweeting
9273db528e fix abid generation migrations to be historically consistent 2024-08-20 01:58:19 -07:00
Nick Sweeting
344e902fc6 migrate SnapshotTag to use new snapshot id 2024-08-19 19:42:25 -07:00