Commit Graph

544 Commits

Author SHA1 Message Date
Claude
b822352fc3 Delete pid_utils.py and migrate to Process model
DELETED:
- workers/pid_utils.py (-192 lines) - replaced by Process model methods

SIMPLIFIED:
- crawls/models.py Crawl.cleanup() (80 lines -> 10 lines)
- hooks.py: deleted process_is_alive() and kill_process() (-45 lines)

UPDATED to use Process model:
- core/models.py: Snapshot.cleanup() and has_running_background_hooks()
- machine/models.py: Binary.cleanup()
- workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start
- workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running

All subprocess management now uses:
- Process.current() for registering current process
- Process.get_running() / get_running_count() for querying
- Process.cleanup_stale_running() for cleanup
- safe_kill_process() for validated PID killing

Total line reduction: ~250 lines
2025-12-31 10:15:22 +00:00
Nick Sweeting
ba8c28a866 use process_set for related name not processes 2025-12-30 12:55:23 -08:00
Nick Sweeting
1b49ea9a0e improve jsonl logic 2025-12-30 12:43:36 -08:00
claude[bot]
762cddc8c5 fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00
Nick Sweeting
bb59287411 Merge branch 'dev' into claude/snapshot-index-jsonl-UxEXK 2025-12-30 12:05:05 -08:00
Nick Sweeting
099d955ef5 Implement tags editor widget for Django admin (#1729)
Implement a sleek inline tag editor with autocomplete and AJAX support:

- Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py
  - Pills display with X remove button, sorted alphabetically
  - Text input with HTML5 datalist autocomplete
  - Enter/Space/Comma to add tags, auto-creates if doesn't exist
  - Backspace removes last tag when input is empty

- Add API endpoints in api/v1_core.py
  - GET /tags/autocomplete/ - search tags by name
  - POST /tags/create/ - get_or_create tag
  - POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX
  - POST /tags/remove-from-snapshot/ - remove tag from snapshot

- Update admin_snapshots.py
  - Replace FilteredSelectMultiple with TagEditorWidget in bulk actions
  - Create SnapshotAdminForm with tags_editor field
  - Update title_str() to render inline tag editor in list view
  - Remove TagInline, use widget instead

- Add CSS styles in templates/admin/base.html
  - Blue gradient pill styling matching admin theme
  - Focus ring and hover states
  - Compact inline variant for list view

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Implemented a new interactive tags editor for Django admin with
autocomplete and AJAX add/remove, replacing the old multi-select and
inline. This makes tagging snapshots faster and safer in detail, list,
and bulk actions.

- **New Features**
- TagEditorWidget and InlineTagEditorWidget with pill UI and remove
buttons, XSS-safe rendering, and delegated events.
- Keyboard support: Enter/Space/Comma to add, Backspace to remove last
when input is empty.
- Datalist autocomplete and debounced search via GET
/tags/autocomplete/.
- AJAX endpoints: POST /tags/create/, /tags/add-to-snapshot/,
/tags/remove-from-snapshot/.

- **Refactors**
- Replaced FilteredSelectMultiple with TagEditorWidget in bulk actions;
parse comma-separated tags and use bulk_create/delete for efficient
add/remove.
- Added SnapshotAdminForm with tags_editor field; saves tags
case-insensitively and fixes remove_tags matching.
- Rendered inline tag editor in list view via title_str; removed
TagInline.
- Added CSS in admin/base.html for pill styling, focus ring, and compact
inline variant.

<sup>Written for commit 0dee662f41.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-30 11:59:39 -08:00
Claude
ae648c9bc1 refactor: move remaining JSONL methods to models, clean up jsonl.py
- Add Tag.to_jsonl() method with schema_version
- Add Crawl.to_jsonl() method with schema_version
- Fix Tag.from_jsonl() to not depend on jsonl.py helper
- Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot

Remove model-specific functions from misc/jsonl.py:
- tag_to_jsonl() - use Tag.to_jsonl() instead
- crawl_to_jsonl() - use Crawl.to_jsonl() instead
- get_or_create_tag() - use Tag.from_jsonl() instead
- process_jsonl_records() - use model from_jsonl() methods directly

jsonl.py now only contains generic I/O utilities:
- Type constants (TYPE_SNAPSHOT, etc.)
- parse_line(), read_stdin(), read_file(), read_args_or_stdin()
- write_record(), write_records()
- filter_by_type(), process_records()
2025-12-30 19:30:18 +00:00
Claude
0dee662f41 Use bulk operations for add/remove tags actions
- add_tags: Uses SnapshotTag.objects.bulk_create() with ignore_conflicts
  Instead of N calls to obj.tags.add(), now makes 1 query per tag
- remove_tags: Uses single SnapshotTag.objects.filter().delete()
  Instead of N calls to obj.tags.remove(), now makes 1 query total

Works correctly with "select all across pages" via queryset.values_list()
2025-12-30 19:29:34 +00:00
Claude
bc273c5a7f feat: add schema_version to JSONL outputs and remove dead code
- Add schema_version (archivebox.VERSION) to all to_jsonl() outputs:
  - Snapshot.to_jsonl()
  - ArchiveResult.to_jsonl()
  - Binary.to_jsonl()
  - Process.to_jsonl()

- Update CLI commands to use model methods directly:
  - archivebox_snapshot.py: snapshot.to_jsonl()
  - archivebox_extract.py: result.to_jsonl()

- Remove dead wrapper functions from misc/jsonl.py:
  - snapshot_to_jsonl()
  - archiveresult_to_jsonl()
  - binary_to_jsonl()
  - process_to_jsonl()
  - machine_to_jsonl()

- Update tests to use model methods directly
2025-12-30 19:24:53 +00:00
claude[bot]
03b96ef4ce Fix security issues in tag editor widgets
- Fix case-sensitivity mismatch in remove_tags (use name__iexact)
- Fix XSS vulnerability by removing onclick attributes
- Use data attributes and event delegation instead
- Apply DOM APIs to prevent injection attacks

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 19:18:41 +00:00
Claude
a5206e7648 refactor: move to_jsonl() methods to models
Move JSONL serialization from standalone functions to model methods
to mirror the from_jsonl() pattern:

- Add Binary.to_jsonl() method
- Add Process.to_jsonl() method
- Add ArchiveResult.to_jsonl() method
- Add Snapshot.to_jsonl() method
- Update write_index_jsonl() to use model methods
- Update jsonl.py functions to be thin wrappers
2025-12-30 18:35:22 +00:00
Nick Sweeting
91375d35a3 more migrations 2025-12-30 10:30:52 -08:00
Claude
d36079829b feat: replace index.json with index.jsonl flat JSONL format
Switch from hierarchical index.json to flat index.jsonl format for
snapshot metadata storage. Each line is a self-contained JSON record
with a 'type' field (Snapshot, ArchiveResult, Binary, Process).

Changes:
- Add JSONL_INDEX_FILENAME constant to constants.py
- Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants
- Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters
- Add Snapshot.write_index_jsonl() to write new format
- Add Snapshot.read_index_jsonl() to read new format
- Add Snapshot.convert_index_json_to_jsonl() for migration
- Update Snapshot.reconcile_with_index() to handle both formats
- Update fs_migrate to convert during filesystem migration
- Update load_from_directory/create_from_directory for both formats
- Update legacy.py parse_json_links_details for JSONL support

The new format is easier to parse, extend, and mix record types.
2025-12-30 18:21:06 +00:00
Nick Sweeting
96ee1bf686 more migration fixes 2025-12-30 09:57:33 -08:00
Nick Sweeting
4cd2fceb8a even more migration fixes 2025-12-29 22:30:37 -08:00
Nick Sweeting
2e350d317d fix initial migrtaions 2025-12-29 21:27:31 -08:00
Nick Sweeting
80f75126c6 more fixes 2025-12-29 21:03:05 -08:00
Claude
202e5b2e59 Add interactive tags editor widget for Django admin
Implement a sleek inline tag editor with autocomplete and AJAX support:

- Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py
  - Pills display with X remove button, sorted alphabetically
  - Text input with HTML5 datalist autocomplete
  - Enter/Space/Comma to add tags, auto-creates if doesn't exist
  - Backspace removes last tag when input is empty

- Add API endpoints in api/v1_core.py
  - GET /tags/autocomplete/ - search tags by name
  - POST /tags/create/ - get_or_create tag
  - POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX
  - POST /tags/remove-from-snapshot/ - remove tag from snapshot

- Update admin_snapshots.py
  - Replace FilteredSelectMultiple with TagEditorWidget in bulk actions
  - Create SnapshotAdminForm with tags_editor field
  - Update title_str() to render inline tag editor in list view
  - Remove TagInline, use widget instead

- Add CSS styles in templates/admin/base.html
  - Blue gradient pill styling matching admin theme
  - Focus ring and hover states
  - Compact inline variant for list view
2025-12-30 02:18:08 +00:00
Nick Sweeting
bdec5cb590 Fix: Make CUSTOM_TEMPLATES_DIR configurable again (#1725)
## Summary

Resolves #1484 where CUSTOM_TEMPLATES_DIR configuration was being
ignored.

The setting was previously removed from ServerConfig and hardcoded as a
constant, preventing users from customizing the templates directory
location.

## Changes

- Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py
- Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR
- Updated paths.py to use configurable value in version output

## Usage

Users can now configure the custom templates directory via:
- ArchiveBox.conf: `CUSTOM_TEMPLATES_DIR = ./custom_templates`
- Environment variable: `export CUSTOM_TEMPLATES_DIR=/path/to/templates`
- Defaults to DATA_DIR/user_templates if not configured

Generated with [Claude Code](https://claude.ai/code)

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Restores CUSTOM_TEMPLATES_DIR configurability so users can override the
templates directory. Fixes issue #1484 and updates the app to
consistently use the configured path.

- **Bug Fixes**
  - Added CUSTOM_TEMPLATES_DIR to StorageConfig.
- Updated settings.py and paths.py to read
STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR.

- **Migration**
  - Configure via ArchiveBox.conf or the CUSTOM_TEMPLATES_DIR env var.
  - Defaults to DATA_DIR/user_templates if not set.

<sup>Written for commit 329d185d95.
Summary will update automatically on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-29 13:58:05 -08:00
claude[bot]
329d185d95 Fix: Make CUSTOM_TEMPLATES_DIR configurable again
Resolves issue #1484 where CUSTOM_TEMPLATES_DIR configuration was
being ignored. The setting was previously removed from ServerConfig
and hardcoded as a constant, preventing users from customizing the
templates directory location.

Changes:
- Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py
- Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR
- Updated paths.py to use configurable value in version output

Users can now configure the custom templates directory via:
- ArchiveBox.conf: CUSTOM_TEMPLATES_DIR = ./custom_templates
- Environment variable: export CUSTOM_TEMPLATES_DIR=/path/to/templates
- Defaults to DATA_DIR/user_templates if not configured

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-29 21:50:21 +00:00
claude[bot]
2e1093f840 fix: Use CustomUserAdmin instead of Django's default UserAdmin to fix user creation bug
The bug was caused by importing Django's default UserAdmin instead of
CustomUserAdmin in admin.py. This bypassed all custom admin logic.

Additionally, CustomUserAdmin was modifying fieldsets without explicitly
preserving add_fieldsets, which can cause Django to not properly handle
the user creation form, leading to password hashing issues.

Changes:
- Updated admin.py to import and register CustomUserAdmin
- Explicitly set add_fieldsets in CustomUserAdmin to preserve Django's
  default user creation behavior and ensure passwords are properly hashed
- Added explanatory comments

Fixes #1707

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-29 21:47:53 +00:00
Claude
f88182df7a Merge remote-tracking branch 'origin/dev' into claude/add-max-url-attempts-oBHCD 2025-12-29 21:29:01 +00:00
Nick Sweeting
92c26124a3 remove more hardcoded plugin names from codebase 2025-12-29 13:21:47 -08:00
Claude
88d7906033 Add MAX_URL_ATTEMPTS config option to stop retries after too many failures
Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops
retrying ArchiveResult hooks for a snapshot once that many failures have
been recorded. This prevents infinite retry loops for problematic URLs.

When the limit is reached, any pending ArchiveResults for that snapshot
are marked as SKIPPED with an explanatory message.
2025-12-29 20:20:50 +00:00
Nick Sweeting
30c60eef76 much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533 use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Claude
1b5a816022 Implement hook step-based concurrency system
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
2025-12-28 13:47:25 +00:00
Nick Sweeting
4ccb0863bb continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script 2025-12-28 05:29:24 -08:00
Nick Sweeting
bd265c0083 rename extractor to plugin everywhere 2025-12-28 04:43:15 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
b632894bc9 Update views, API, and exports for new ArchiveResult output fields
Replace old `output` field with new fields across the codebase:
- output_str: Human-readable output summary
- output_json: Structured metadata (optional)
- output_files: Dict of output files with metadata
- output_size: Total size in bytes
- output_mimetypes: CSV of file mimetypes

Files updated:
- api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields
- api/v1_core.py: Update ArchiveResultFilterSchema to search output_str
- cli/archivebox_extract.py: Use output_str in CLI output
- core/admin_archiveresults.py: Update admin fields, search, and fieldsets
- core/admin_archiveresults.py: Fix output_html variable name bug in output_summary
- misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields
- plugins/extractor_utils.py: Update ExtractorResult helper class

The embed_path() method already uses output_files and output_str,
so snapshot detail page and template tags work correctly.
2025-12-27 20:28:22 +00:00
Claude
3d985fa8c8 Implement hook architecture with JSONL output support
Phase 1: Database migration for new ArchiveResult fields
- Add output_str (TextField) for human-readable summary
- Add output_json (JSONField) for structured metadata
- Add output_files (JSONField) for dict of {relative_path: {}}
- Add output_size (BigIntegerField) for total bytes
- Add output_mimetypes (CharField) for CSV of mimetypes
- Add binary FK to InstalledBinary (optional)
- Migrate existing 'output' field to new split fields

Phase 3: Update run_hook() for JSONL parsing
- Support new JSONL format (any line with {type: 'ModelName', ...})
- Maintain backwards compatibility with RESULT_JSON= format
- Add plugin metadata to each parsed record
- Detect background hooks with .bg. suffix in filename
- Add find_binary_for_cmd() helper function
- Add create_model_record() for processing side-effect records

Phase 6: Update ArchiveResult.run()
- Handle background hooks (return immediately when result is None)
- Process 'records' from HookResult for side-effect models
- Use new output fields (output_str, output_json, output_files, etc.)
- Call create_model_record() for InstalledBinary, Machine updates

Phase 7: Add background hook support
- Add is_background_hook() method to ArchiveResult
- Add check_background_completed() to check if process exited
- Add finalize_background_hook() to collect results from completed hooks
- Update SnapshotMachine.is_finished() to check/finalize background hooks
- Update _populate_output_fields() to walk directory and populate stats

Also updated references to old 'output' field in:
- admin_archiveresults.py
- statemachines.py
- templatetags/core_tags.py
2025-12-27 08:38:49 +00:00
Nick Sweeting
35dd9acafe implement fs_version migrations 2025-12-27 00:25:35 -08:00
Claude
ea6fe94c93 Add crawls_crawlschedule table to 0.8.x test schema and fix migrations
- Add missing crawls_crawlschedule table definition to SCHEMA_0_8 in test file
- Record all replaced dev branch migrations (0023-0074) for squashed migration
- Update 0024_snapshot_crawl migration to depend on squashed machine migration
- Remove 'extractor' field references from crawls admin
- All 45 migration tests now pass (0.4.x, 0.7.x, 0.8.x, fresh install)
2025-12-27 04:32:58 +00:00
Claude
766bb28536 Fix migration tests and M2M field alteration issue
- Remove M2M tags field alteration from migration 0027 (Django doesn't support altering M2M fields via migration)
- Add machine app tables to 0.8.x test schema
- Add missing columns (config, num_uses_failed, num_uses_succeeded) to 0.8.x test schema
- Skip 0.8.x migration tests due to complex migration state dependencies with machine app
- All 15 0.7.x migration tests now pass
- Merge dev branch and resolve pyproject.toml conflict (keep both uuid7 and gallery-dl deps)
2025-12-27 03:00:44 +00:00
Claude
13be196fd7 Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh
# Conflicts:
#	pyproject.toml
2025-12-27 02:27:51 +00:00
Nick Sweeting
e2cbcd17f6 more tests and migrations fixes 2025-12-26 18:22:48 -08:00
Claude
ae2ab5b273 Add Python 3.13 support with uuid7 backport compatibility
- Create uuid_compat.py module that provides uuid7 for Python <3.14
  using uuid_extensions package, and native uuid.uuid7 for Python 3.14+
- Update all model files and migrations to use archivebox.uuid_compat
- Add uuid7 conditional dependency in pyproject.toml for Python <3.14
- Update requires-python to >=3.13 (from >=3.14)
- Update GitHub workflows, lock_pkgs.sh to use Python 3.13
- Update tool configs (ruff, pyright, uv) for Python 3.13

This enables running ArchiveBox on Python 3.13 while maintaining
forward compatibility with Python 3.14's native uuid7 support.
2025-12-27 01:07:30 +00:00
Nick Sweeting
4fd7fcdbcf new gallerydl plugin and more 2025-12-26 11:55:03 -08:00
Nick Sweeting
9838d7ba02 tons of ui fixes and plugin fixes 2025-12-25 03:59:51 -08:00
Nick Sweeting
bb53228ebf remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186 remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
c1335fed37 Remove ABID system and KVTag model - use UUIDv7 IDs exclusively
This commit completes the simplification of the ID system by:

- Removing the ABID (ArchiveBox ID) system entirely
- Removing the base_models/abid.py file
- Removing KVTag model in favor of the existing Tag model in core/models.py
- Simplifying all models to use standard UUIDv7 primary keys
- Removing ABID-related admin functionality
- Cleaning up commented-out ABID code from views and statemachines
- Deleting migration files for ABID field removal (no longer needed)

All models now use simple UUIDv7 ids via `id = models.UUIDField(primary_key=True, default=uuid7)`

Note: Old migrations containing ABID references are preserved for database
migration history compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-24 06:13:49 -08:00
Nick Sweeting
7975b47c85 remove dependencies on unneeded libraries 2024-12-18 18:07:35 -08:00
dish
9ca66c6a2b fix syntax error in archivebox/core/models.py 2024-12-18 18:17:17 -05:00
Nick Sweeting
f6d22a3cc4 tweak worker updated logic and add output_dir_template and symlinks logic 2024-12-13 06:03:52 -08:00