Commit Graph

72 Commits

Author SHA1 Message Date
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Claude
1b5a816022 Implement hook step-based concurrency system
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one ArchiveResult (AR) per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
2025-12-28 13:47:25 +00:00
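
The step rules described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the hook filename shape (`on_Snapshot__63_media.bg.py`), the `default` fallback step, and the helpers `is_claimable()` and `next_step()` are assumptions made for the example. Only the names `extract_step` and `is_background_hook`, the `__XX_` pattern, the `.bg.` marker, and the `step <= snapshot.current_step` rule come from the message above.

```python
import re

# Assumption: hook filenames embed a two-digit order as "__XX_",
# e.g. "on_Snapshot__63_media.bg.py"; the tens digit is taken as the step (0-9).
_ORDER_RE = re.compile(r"__(\d{2})_")


def extract_step(hook_name: str, default: int = 9) -> int:
    """Return the step encoded in the name; `default` is a hypothetical fallback."""
    match = _ORDER_RE.search(hook_name)
    return int(match.group(1)) // 10 if match else default


def is_background_hook(hook_name: str) -> bool:
    """Background hooks carry a ".bg." segment in the filename."""
    return ".bg." in hook_name


def is_claimable(hook_name: str, current_step: int) -> bool:
    """Hypothetical helper: a hook may run once its step <= the snapshot's step."""
    return extract_step(hook_name) <= current_step


def next_step(current_step: int, hook_finished: dict[str, bool]) -> int:
    """Hypothetical helper: advance one step unless a foreground hook is pending.

    `hook_finished` maps hook name -> whether that hook has finished.
    Background hooks never block advancement.
    """
    blocking = [
        name
        for name, finished in hook_finished.items()
        if extract_step(name) <= current_step
        and not is_background_hook(name)
        and not finished
    ]
    if blocking or current_step >= 9:
        return current_step
    return current_step + 1
```

Under these assumptions, a snapshot at step 6 with a finished `wget` hook and a still-running `media.bg` hook would move on to step 7, which is the behavior the last paragraph of the commit message describes.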
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
3d985fa8c8 Implement hook architecture with JSONL output support
Phase 1: Database migration for new ArchiveResult fields
- Add output_str (TextField) for human-readable summary
- Add output_json (JSONField) for structured metadata
- Add output_files (JSONField) for dict of {relative_path: {}}
- Add output_size (BigIntegerField) for total bytes
- Add output_mimetypes (CharField) for CSV of mimetypes
- Add binary FK to InstalledBinary (optional)
- Migrate existing 'output' field to new split fields

Phase 3: Update run_hook() for JSONL parsing
- Support new JSONL format (any line with {type: 'ModelName', ...})
- Maintain backwards compatibility with RESULT_JSON= format
- Add plugin metadata to each parsed record
- Detect background hooks with .bg. suffix in filename
- Add find_binary_for_cmd() helper function
- Add create_model_record() for processing side-effect records

Phase 6: Update ArchiveResult.run()
- Handle background hooks (return immediately when result is None)
- Process 'records' from HookResult for side-effect models
- Use new output fields (output_str, output_json, output_files, etc.)
- Call create_model_record() for InstalledBinary, Machine updates

Phase 7: Add background hook support
- Add is_background_hook() method to ArchiveResult
- Add check_background_completed() to check if process exited
- Add finalize_background_hook() to collect results from completed hooks
- Update SnapshotMachine.is_finished() to check/finalize background hooks
- Update _populate_output_fields() to walk directory and populate stats

Also updated references to old 'output' field in:
- admin_archiveresults.py
- statemachines.py
- templatetags/core_tags.py
2025-12-27 08:38:49 +00:00
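
As a rough illustration of the Phase 3 parsing rules, the sketch below accepts both output formats named in the commit message. It is not the actual `run_hook()` code: the function name `parse_hook_output()`, the `plugin` key used for the added metadata, and the choice to label legacy `RESULT_JSON=` records as `ArchiveResult` are all assumptions made for the example.

```python
import json

LEGACY_PREFIX = "RESULT_JSON="


def parse_hook_output(stdout: str, plugin: str) -> list[dict]:
    """Collect JSON records from a hook's stdout, skipping ordinary log lines."""
    records = []
    for raw_line in stdout.splitlines():
        line = raw_line.strip()
        if line.startswith(LEGACY_PREFIX):
            payload, legacy = line[len(LEGACY_PREFIX):], True
        elif line.startswith("{"):
            payload, legacy = line, False
        else:
            continue  # plain log output, not a record

        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed JSON is ignored in this sketch
        if not isinstance(record, dict):
            continue

        if legacy:
            # Assumption: legacy lines describe the ArchiveResult itself.
            record.setdefault("type", "ArchiveResult")
        elif "type" not in record:
            continue  # new-format records must name a model in "type"

        record["plugin"] = plugin  # hypothetical key for the plugin metadata
        records.append(record)
    return records
```

Treating any line that does not parse as a typed record as ordinary log output keeps the sketch tolerant of hooks that print progress messages alongside their results.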
Nick Sweeting
35dd9acafe implement fs_version migrations 2025-12-27 00:25:35 -08:00
Claude
ea6fe94c93 Add crawls_crawlschedule table to 0.8.x test schema and fix migrations
- Add missing crawls_crawlschedule table definition to SCHEMA_0_8 in test file
- Record all replaced dev branch migrations (0023-0074) for squashed migration
- Update 0024_snapshot_crawl migration to depend on squashed machine migration
- Remove 'extractor' field references from crawls admin
- All 45 migration tests now pass (0.4.x, 0.7.x, 0.8.x, fresh install)
2025-12-27 04:32:58 +00:00
Claude
766bb28536 Fix migration tests and M2M field alteration issue
- Remove M2M tags field alteration from migration 0027 (Django doesn't support altering M2M fields via migration)
- Add machine app tables to 0.8.x test schema
- Add missing columns (config, num_uses_failed, num_uses_succeeded) to 0.8.x test schema
- Skip 0.8.x migration tests due to complex migration state dependencies with machine app
- All 15 0.7.x migration tests now pass
- Merge dev branch and resolve pyproject.toml conflict (keep both uuid7 and gallery-dl deps)
2025-12-27 03:00:44 +00:00
Claude
13be196fd7 Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh
# Conflicts:
#	pyproject.toml
2025-12-27 02:27:51 +00:00
Nick Sweeting
e2cbcd17f6 more tests and migrations fixes 2025-12-26 18:22:48 -08:00
Claude
ae2ab5b273 Add Python 3.13 support with uuid7 backport compatibility
- Create uuid_compat.py module that provides uuid7 for Python <3.14
  using uuid_extensions package, and native uuid.uuid7 for Python 3.14+
- Update all model files and migrations to use archivebox.uuid_compat
- Add uuid7 conditional dependency in pyproject.toml for Python <3.14
- Update requires-python to >=3.13 (from >=3.14)
- Update GitHub workflows, lock_pkgs.sh to use Python 3.13
- Update tool configs (ruff, pyright, uv) for Python 3.13

This enables running ArchiveBox on Python 3.13 while maintaining
forward compatibility with Python 3.14's native uuid7 support.
2025-12-27 01:07:30 +00:00
Nick Sweeting
bb53228ebf remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
569081a9eb rename abid_utils to base_models 2024-11-18 19:40:05 -08:00
Nick Sweeting
d69df359ea remove Crawl migration in favor of separate app 2024-10-14 17:41:07 -07:00
Nick Sweeting
12f32c4690 fix tmp data dir resolution when running help or version outside data dir 2024-10-04 01:40:41 -07:00
Nick Sweeting
295c5c46e0 add new crawl model 2024-10-01 21:47:16 -07:00
Nick Sweeting
3e5b6ddeae move config into dedicated global app 2024-09-30 15:59:05 -07:00
Nick Sweeting
3f76e0a87f fix migrations import errors 2024-09-06 03:48:52 -07:00
Nick Sweeting
ed5357cec9 add migrations for datetime field renames 2024-09-04 23:44:13 -07:00
Nick Sweeting
cbf2a8fdc3 rename datetime fields to _at, massively improve ABID generation safety and determinism 2024-09-04 23:42:36 -07:00
Nick Sweeting
68a39b7392 remove .old_id entirely and make ABID generation only happen once on initial save 2024-09-04 16:40:15 -07:00
Nick Sweeting
1e73a06ba0 change ABIDModel.created to use AutoTimeField seeded on .save instead of auto_now_add so that ts_src for ABID is available on creation before DB row is created 2024-08-28 03:02:37 -07:00
Nick Sweeting
d0fefc0279 add chunk_size=500 to more iterator calls 2024-08-27 19:28:00 -07:00
Nick Sweeting
73a3e6aad0 handle tag with no slug or name 2024-08-22 18:25:15 -07:00
Nick Sweeting
1d31b88fa3 fix migration failing when Tag name is empty 2024-08-22 16:30:25 -07:00
Nick Sweeting
09553d8340 hardcode EXTRACTOR_CHOICES to prevent nondeterministic migrations 2024-08-22 15:36:02 -07:00
Nick Sweeting
afe1307617 fix created_by field migration to create User properly if none exists 2024-08-22 15:20:36 -07:00
Nick Sweeting
849b4963a1 add migrations 2024-08-20 01:58:44 -07:00
Nick Sweeting
9273db528e fix abid generation migrations to be historically consistent 2024-08-20 01:58:19 -07:00
Nick Sweeting
344e902fc6 migrate SnapshotTag to use new snapshot id 2024-08-19 19:42:25 -07:00
Nick Sweeting
cf2faecf61 add migrations for SnapshotTag through model 2024-08-19 18:36:20 -07:00
Nick Sweeting
033ec08d0c save snapshot ids during migration 2024-08-17 21:56:45 -07:00
Nick Sweeting
de489d3c60 minor snapshot details ui fixes and migrations log msg improvements 2024-06-04 04:17:32 -07:00
Nick Sweeting
a4cc10d7f8 add migrations for third round of field changes 2024-05-13 07:50:22 -07:00
Nick Sweeting
206e7e74b3 add migrations to create and populate ABIDField and UUIDField values 2024-05-13 05:13:42 -07:00
Nick Sweeting
457c42bf84 load EXTRACTORS dynamically using importlib.import_module 2024-05-11 22:28:59 -07:00
Ross Williams
310b4d1242 Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Joseph Turian
22d8e57637 Add missing migration 0021 2022-09-14 09:36:17 +00:00
Nick Sweeting
6949803395 enforce new models to use uuid keys 2021-04-10 06:32:45 -04:00
Nick Sweeting
d73f7d7d96 add db_index on url field 2021-04-01 14:00:07 -04:00
Nick Sweeting
a58ad5b272 allow larger tags 2021-03-27 05:52:42 -04:00
Nick Sweeting
0036e9cce2 add migration 2021-02-28 22:55:12 -05:00
Nick Sweeting
acbce25201 missing migrations 2021-02-18 08:05:05 -05:00
Nick Sweeting
74a9dd8880 add missing migrations 2021-02-18 02:36:21 -05:00
Nick Sweeting
71cf8d5224 add migrations 2021-02-16 15:57:13 -05:00
Nick Sweeting
171bbeb69b catch exception on import of old index.json into ArchiveResult 2021-02-01 16:31:29 -05:00
Nick Sweeting
0aea5ed3e8 fix handling of skipped ArchiveResult entries with null output 2021-02-01 14:37:34 -05:00
Nick Sweeting
aa84a7ff2b fix migration creating conflicting tags based on slug 2021-02-01 05:13:23 -05:00