Commit Graph

138 Commits

Author SHA1 Message Date
Nick Sweeting
80f75126c6 more fixes 2025-12-29 21:03:05 -08:00
Claude
f88182df7a Merge remote-tracking branch 'origin/dev' into claude/add-max-url-attempts-oBHCD 2025-12-29 21:29:01 +00:00
Nick Sweeting
92c26124a3 remove more hardcoded plugin names from codebase 2025-12-29 13:21:47 -08:00
Claude
88d7906033 Add MAX_URL_ATTEMPTS config option to stop retries after too many failures
Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops
retrying ArchiveResult hooks for a snapshot once that many failures have
been recorded. This prevents infinite retry loops for problematic URLs.

When the limit is reached, any pending ArchiveResults for that snapshot
are marked as SKIPPED with an explanatory message.
2025-12-29 20:20:50 +00:00
Nick Sweeting
30c60eef76 much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533 use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Claude
1b5a816022 Implement hook step-based concurrency system
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
2025-12-28 13:47:25 +00:00
Nick Sweeting
4ccb0863bb continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script 2025-12-28 05:29:24 -08:00
Nick Sweeting
bd265c0083 rename extractor to plugin everywhere 2025-12-28 04:43:15 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
3d985fa8c8 Implement hook architecture with JSONL output support
Phase 1: Database migration for new ArchiveResult fields
- Add output_str (TextField) for human-readable summary
- Add output_json (JSONField) for structured metadata
- Add output_files (JSONField) for dict of {relative_path: {}}
- Add output_size (BigIntegerField) for total bytes
- Add output_mimetypes (CharField) for CSV of mimetypes
- Add binary FK to InstalledBinary (optional)
- Migrate existing 'output' field to new split fields

Phase 3: Update run_hook() for JSONL parsing
- Support new JSONL format (any line with {type: 'ModelName', ...})
- Maintain backwards compatibility with RESULT_JSON= format
- Add plugin metadata to each parsed record
- Detect background hooks with .bg. suffix in filename
- Add find_binary_for_cmd() helper function
- Add create_model_record() for processing side-effect records

Phase 6: Update ArchiveResult.run()
- Handle background hooks (return immediately when result is None)
- Process 'records' from HookResult for side-effect models
- Use new output fields (output_str, output_json, output_files, etc.)
- Call create_model_record() for InstalledBinary, Machine updates

Phase 7: Add background hook support
- Add is_background_hook() method to ArchiveResult
- Add check_background_completed() to check if process exited
- Add finalize_background_hook() to collect results from completed hooks
- Update SnapshotMachine.is_finished() to check/finalize background hooks
- Update _populate_output_fields() to walk directory and populate stats

Also updated references to old 'output' field in:
- admin_archiveresults.py
- statemachines.py
- templatetags/core_tags.py
2025-12-27 08:38:49 +00:00
Nick Sweeting
35dd9acafe implement fs_version migrations 2025-12-27 00:25:35 -08:00
Claude
766bb28536 Fix migration tests and M2M field alteration issue
- Remove M2M tags field alteration from migration 0027 (Django doesn't support altering M2M fields via migration)
- Add machine app tables to 0.8.x test schema
- Add missing columns (config, num_uses_failed, num_uses_succeeded) to 0.8.x test schema
- Skip 0.8.x migration tests due to complex migration state dependencies with machine app
- All 15 0.7.x migration tests now pass
- Merge dev branch and resolve pyproject.toml conflict (keep both uuid7 and gallery-dl deps)
2025-12-27 03:00:44 +00:00
Claude
ae2ab5b273 Add Python 3.13 support with uuid7 backport compatibility
- Create uuid_compat.py module that provides uuid7 for Python <3.14
  using uuid_extensions package, and native uuid.uuid7 for Python 3.14+
- Update all model files and migrations to use archivebox.uuid_compat
- Add uuid7 conditional dependency in pyproject.toml for Python <3.14
- Update requires-python to >=3.13 (from >=3.14)
- Update GitHub workflows, lock_pkgs.sh to use Python 3.13
- Update tool configs (ruff, pyright, uv) for Python 3.13

This enables running ArchiveBox on Python 3.13 while maintaining
forward compatibility with Python 3.14's native uuid7 support.
2025-12-27 01:07:30 +00:00
Nick Sweeting
4fd7fcdbcf new gallerydl plugin and more 2025-12-26 11:55:03 -08:00
Nick Sweeting
9838d7ba02 tons of ui fixes and plugin fixes 2025-12-25 03:59:51 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186 remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
c1335fed37 Remove ABID system and KVTag model - use UUIDv7 IDs exclusively
This commit completes the simplification of the ID system by:

- Removing the ABID (ArchiveBox ID) system entirely
- Removing the base_models/abid.py file
- Removing KVTag model in favor of the existing Tag model in core/models.py
- Simplifying all models to use standard UUIDv7 primary keys
- Removing ABID-related admin functionality
- Cleaning up commented-out ABID code from views and statemachines
- Deleting migration files for ABID field removal (no longer needed)

All models now use simple UUIDv7 ids via `id = models.UUIDField(primary_key=True, default=uuid7)`

Note: Old migrations containing ABID references are preserved for database
migration history compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-24 06:13:49 -08:00
dish
9ca66c6a2b fix syntax error in archivebox/core/models.py 2024-12-18 18:17:17 -05:00
Nick Sweeting
2a1afcf6c2 move crawl models back into dedicated app 2024-12-12 21:45:55 -08:00
Nick Sweeting
bd5dd2f949 clearer core models separation of concerns using new basemodels 2024-12-12 21:45:53 -08:00
Nick Sweeting
ac53fdf677 make chrome binary and configs directly runnable and make extractor use external bin 2024-12-06 02:06:39 -08:00
Nick Sweeting
1ceaa1ac7a add ABID model check and fix model inheritance 2024-12-03 02:14:21 -08:00
Nick Sweeting
b948e49013 add urls log to Crawl model 2024-11-19 06:32:33 -08:00
Nick Sweeting
c9a05c9d94 working archivebox update CLI cmd 2024-11-19 02:32:05 -08:00
Nick Sweeting
569081a9eb rename abid_utils to base_models 2024-11-18 19:40:05 -08:00
Nick Sweeting
e469c5a344 merge queues and actors apps into new workers app 2024-11-18 18:52:48 -08:00
Nick Sweeting
0acd388c02 fix imports and deps 2024-11-18 18:07:34 -08:00
Nick Sweeting
385ccaa14d extend core models with ModelWithOutputDir 2024-11-18 04:27:38 -08:00
Nick Sweeting
1ec2753664 fix statemachine create_root_snapshot and retry timing 2024-11-18 04:27:37 -08:00
Nick Sweeting
c8e186f21b fix plugin loading order, admin, abx-pkg 2024-11-16 06:44:12 -08:00
Nick Sweeting
ba26d75079 add notes and label fields, fix model getters 2024-11-16 02:47:35 -08:00
Nick Sweeting
a9a3b153b1 more StateMachine, Actor, and Orchestrator improvements 2024-11-04 07:08:39 -08:00
Nick Sweeting
48f8416762 add new core and crawsl statemachine manager 2024-11-03 00:41:11 -07:00
Nick Sweeting
02a1fc3049 rename sessions app in INSTALLED_APPS to personas 2024-10-21 00:37:57 -07:00
Nick Sweeting
59b669691f fix Admin data view for Config to render both sections and individual values 2024-10-14 17:39:14 -07:00
Nick Sweeting
1d7f0ab20d fix Tag creation via admin erroring because slug field is not filled 2024-10-14 17:38:53 -07:00
Nick Sweeting
c0b7887fd7 fix admin registration using abx hooks 2024-10-14 17:38:38 -07:00
Nick Sweeting
a0bef4e27b move crawl model out of core 2024-10-14 15:42:36 -07:00
Nick Sweeting
de2ab43f7f switch .is_dir and .exists for os.access to avoid PermissionError on startup 2024-10-08 03:02:34 -07:00
Nick Sweeting
e315905721 add new InstalledBinary model to cache binaries on host machine 2024-10-03 03:10:22 -07:00
Nick Sweeting
295c5c46e0 add new crawl model 2024-10-01 21:47:16 -07:00
Nick Sweeting
363a499289 move util.py into misc folder 2024-09-30 17:25:15 -07:00
Nick Sweeting
dfca4b13b2 move system.py into misc folder 2024-09-30 17:13:55 -07:00
Nick Sweeting
3e5b6ddeae move config into dedicated global app 2024-09-30 15:59:05 -07:00
Nick Sweeting
d8a9dca0f6 use constants in more places 2024-09-26 02:38:45 -07:00