573 Commits

Author SHA1 Message Date
Claude
f88182df7a Merge remote-tracking branch 'origin/dev' into claude/add-max-url-attempts-oBHCD 2025-12-29 21:29:01 +00:00
Nick Sweeting
92c26124a3 remove more hardcoded plugin names from codebase 2025-12-29 13:21:47 -08:00
Claude
88d7906033 Add MAX_URL_ATTEMPTS config option to stop retries after too many failures
Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops
retrying ArchiveResult hooks for a snapshot once that many failures have
been recorded. This prevents infinite retry loops for problematic URLs.

When the limit is reached, any pending ArchiveResults for that snapshot
are marked as SKIPPED with an explanatory message.
2025-12-29 20:20:50 +00:00
Nick Sweeting
30c60eef76 much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533 use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Claude
1b5a816022 Implement hook step-based concurrency system
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
2025-12-28 13:47:25 +00:00
Nick Sweeting
4ccb0863bb continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script 2025-12-28 05:29:24 -08:00
Nick Sweeting
bd265c0083 rename extractor to plugin everywhere 2025-12-28 04:43:15 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
b632894bc9 Update views, API, and exports for new ArchiveResult output fields
Replace old `output` field with new fields across the codebase:
- output_str: Human-readable output summary
- output_json: Structured metadata (optional)
- output_files: Dict of output files with metadata
- output_size: Total size in bytes
- output_mimetypes: CSV of file mimetypes

Files updated:
- api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields
- api/v1_core.py: Update ArchiveResultFilterSchema to search output_str
- cli/archivebox_extract.py: Use output_str in CLI output
- core/admin_archiveresults.py: Update admin fields, search, and fieldsets
- core/admin_archiveresults.py: Fix output_html variable name bug in output_summary
- misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields
- plugins/extractor_utils.py: Update ExtractorResult helper class

The embed_path() method already uses output_files and output_str,
so snapshot detail page and template tags work correctly.
2025-12-27 20:28:22 +00:00
Claude
3d985fa8c8 Implement hook architecture with JSONL output support
Phase 1: Database migration for new ArchiveResult fields
- Add output_str (TextField) for human-readable summary
- Add output_json (JSONField) for structured metadata
- Add output_files (JSONField) for dict of {relative_path: {}}
- Add output_size (BigIntegerField) for total bytes
- Add output_mimetypes (CharField) for CSV of mimetypes
- Add binary FK to InstalledBinary (optional)
- Migrate existing 'output' field to new split fields

Phase 3: Update run_hook() for JSONL parsing
- Support new JSONL format (any line with {type: 'ModelName', ...})
- Maintain backwards compatibility with RESULT_JSON= format
- Add plugin metadata to each parsed record
- Detect background hooks with .bg. suffix in filename
- Add find_binary_for_cmd() helper function
- Add create_model_record() for processing side-effect records

Phase 6: Update ArchiveResult.run()
- Handle background hooks (return immediately when result is None)
- Process 'records' from HookResult for side-effect models
- Use new output fields (output_str, output_json, output_files, etc.)
- Call create_model_record() for InstalledBinary, Machine updates

Phase 7: Add background hook support
- Add is_background_hook() method to ArchiveResult
- Add check_background_completed() to check if process exited
- Add finalize_background_hook() to collect results from completed hooks
- Update SnapshotMachine.is_finished() to check/finalize background hooks
- Update _populate_output_fields() to walk directory and populate stats

Also updated references to old 'output' field in:
- admin_archiveresults.py
- statemachines.py
- templatetags/core_tags.py
2025-12-27 08:38:49 +00:00
Nick Sweeting
35dd9acafe implement fs_version migrations 2025-12-27 00:25:35 -08:00
Claude
ea6fe94c93 Add crawls_crawlschedule table to 0.8.x test schema and fix migrations
- Add missing crawls_crawlschedule table definition to SCHEMA_0_8 in test file
- Record all replaced dev branch migrations (0023-0074) for squashed migration
- Update 0024_snapshot_crawl migration to depend on squashed machine migration
- Remove 'extractor' field references from crawls admin
- All 45 migration tests now pass (0.4.x, 0.7.x, 0.8.x, fresh install)
2025-12-27 04:32:58 +00:00
Claude
766bb28536 Fix migration tests and M2M field alteration issue
- Remove M2M tags field alteration from migration 0027 (Django doesn't support altering M2M fields via migration)
- Add machine app tables to 0.8.x test schema
- Add missing columns (config, num_uses_failed, num_uses_succeeded) to 0.8.x test schema
- Skip 0.8.x migration tests due to complex migration state dependencies with machine app
- All 15 0.7.x migration tests now pass
- Merge dev branch and resolve pyproject.toml conflict (keep both uuid7 and gallery-dl deps)
2025-12-27 03:00:44 +00:00
Claude
13be196fd7 Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh
# Conflicts:
#	pyproject.toml
2025-12-27 02:27:51 +00:00
Nick Sweeting
e2cbcd17f6 more tests and migrations fixes 2025-12-26 18:22:48 -08:00
Claude
ae2ab5b273 Add Python 3.13 support with uuid7 backport compatibility
- Create uuid_compat.py module that provides uuid7 for Python <3.14
  using uuid_extensions package, and native uuid.uuid7 for Python 3.14+
- Update all model files and migrations to use archivebox.uuid_compat
- Add uuid7 conditional dependency in pyproject.toml for Python <3.14
- Update requires-python to >=3.13 (from >=3.14)
- Update GitHub workflows, lock_pkgs.sh to use Python 3.13
- Update tool configs (ruff, pyright, uv) for Python 3.13

This enables running ArchiveBox on Python 3.13 while maintaining
forward compatibility with Python 3.14's native uuid7 support.
2025-12-27 01:07:30 +00:00
Nick Sweeting
4fd7fcdbcf new gallerydl plugin and more 2025-12-26 11:55:03 -08:00
Nick Sweeting
9838d7ba02 tons of ui fixes and plugin fixes 2025-12-25 03:59:51 -08:00
Nick Sweeting
bb53228ebf remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186 remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
c1335fed37 Remove ABID system and KVTag model - use UUIDv7 IDs exclusively
This commit completes the simplification of the ID system by:

- Removing the ABID (ArchiveBox ID) system entirely
- Removing the base_models/abid.py file
- Removing KVTag model in favor of the existing Tag model in core/models.py
- Simplifying all models to use standard UUIDv7 primary keys
- Removing ABID-related admin functionality
- Cleaning up commented-out ABID code from views and statemachines
- Deleting migration files for ABID field removal (no longer needed)

All models now use simple UUIDv7 ids via `id = models.UUIDField(primary_key=True, default=uuid7)`

Note: Old migrations containing ABID references are preserved for database
migration history compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-24 06:13:49 -08:00
Nick Sweeting
7975b47c85 remove dependencies on unneeded libraries 2024-12-18 18:07:35 -08:00
dish
9ca66c6a2b fix syntax error in archivebox/core/models.py 2024-12-18 18:17:17 -05:00
Nick Sweeting
f6d22a3cc4 tweak worker updated logic and add output_dir_template and symlinks logic 2024-12-13 06:03:52 -08:00
Nick Sweeting
5c06b8ff00 add new Event model to workers/models 2024-12-12 22:08:17 -08:00
Nick Sweeting
2a1afcf6c2 move crawl models back into dedicated app 2024-12-12 21:45:55 -08:00
Nick Sweeting
bd5dd2f949 clearer core models separation of concerns using new basemodels 2024-12-12 21:45:53 -08:00
Nick Sweeting
ac53fdf677 make chrome binary and configs directly runnable and make extractor use external bin 2024-12-06 02:06:39 -08:00
Nick Sweeting
d192eb5c48 add filestore content addressible store draft 2024-12-04 02:15:04 -08:00
Nick Sweeting
1ceaa1ac7a add ABID model check and fix model inheritance 2024-12-03 02:14:21 -08:00
Nick Sweeting
c374d7695e allow getting crawl from API as rss feed 2024-12-03 02:13:45 -08:00
Nick Sweeting
b948e49013 add urls log to Crawl model 2024-11-19 06:32:33 -08:00
Nick Sweeting
2595139180 improve statemachine logging and archivebox update CLI cmd 2024-11-19 03:31:05 -08:00
Nick Sweeting
c9a05c9d94 working archivebox update CLI cmd 2024-11-19 02:32:05 -08:00
Nick Sweeting
5f01fc8307 fix archivebox shell and manage CLI commands 2024-11-19 00:48:39 -08:00
Nick Sweeting
328eb98a38 move main funcs into cli files and switch to using click for CLI 2024-11-19 00:18:51 -08:00
Nick Sweeting
569081a9eb rename abid_utils to base_models 2024-11-18 19:40:05 -08:00
Nick Sweeting
65afd405b1 merge seeds and crawls apps 2024-11-18 19:23:14 -08:00
Nick Sweeting
4c25e90378 move monkey_patches.py into archivebox.misc subfolder 2024-11-18 19:10:42 -08:00
Nick Sweeting
4a5d607296 move logging_util into archivebox.misc subfolder 2024-11-18 19:08:49 -08:00
Nick Sweeting
e469c5a344 merge queues and actors apps into new workers app 2024-11-18 18:52:48 -08:00
Nick Sweeting
0acd388c02 fix imports and deps 2024-11-18 18:07:34 -08:00
Nick Sweeting
eeb2671e4d API improvements 2024-11-18 04:27:38 -08:00
Nick Sweeting
eb53145e4e working state machine flow yay 2024-11-18 04:27:38 -08:00
Nick Sweeting
9adfe0e2e6 add code to log all SQL queries for DEBUG 2024-11-18 04:27:38 -08:00