Commit Graph

2135 Commits

Author SHA1 Message Date
Claude
b87bbbbecb Fix CLI tests to use subprocess and remove mocks
- Fix conftest.py: use subprocess for init, remove unused cli_env fixture
- Update all test files to use data_dir parameter instead of env
- Remove mock-based TestJSONLOutput class from tests_piping.py
- Remove unused imports (MagicMock, patch)
- Fix file permissions for cli_utils.py

All tests now use real subprocess calls per CLAUDE.md guidelines:
- NO MOCKS - tests exercise real code paths
- NO SKIPS - every test runs
2025-12-31 10:53:45 +00:00
Claude
bb52b5902a Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6)
Add comprehensive unit tests for the CLI piping architecture:
- test_cli_crawl.py: crawl create/list/update/delete tests
- test_cli_snapshot.py: snapshot create/list/update/delete tests
- test_cli_archiveresult.py: archiveresult create/list/update/delete tests
- test_cli_run.py: run command create-or-update and pass-through tests

Extend tests_piping.py with:
- TestPassThroughBehavior: tests for pass-through behavior in all commands
- TestPipelineAccumulation: tests for accumulating records through pipeline

All tests use pytest fixtures from conftest.py with isolated DATA_DIR.
2025-12-31 10:21:05 +00:00
Claude
f3e11b61fd Implement JSONL CLI pipeline architecture (Phases 1-4, 6)
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)

Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating

Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
  - Records WITHOUT id: Create via Model.from_json()
  - Records WITH id: Lookup existing, re-queue
  - Outputs JSONL of all processed records for chaining

Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions

Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command

This enables Unix-style piping:
  archivebox crawl create URL | archivebox run
  archivebox archiveresult list --status=failed | archivebox run
  curl API | jq transform | archivebox crawl create | archivebox run
2025-12-31 10:07:14 +00:00
Nick Sweeting
dd2302ad92 new jsonl cli interface 2025-12-30 16:12:53 -08:00
Nick Sweeting
ba8c28a866 use process_set for related name not processes 2025-12-30 12:55:23 -08:00
Nick Sweeting
1b49ea9a0e improve jsonl logic 2025-12-30 12:43:36 -08:00
Nick Sweeting
08366cfa46 document chrome configs 2025-12-30 12:42:50 -08:00
claude[bot]
251fe33e49 fix: rename --plugin to --plugins for consistency
Changed from singular --plugin to plural --plugins in both snapshot and extract
commands to match the pattern in archivebox add command. Updated to accept
comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title).

- Updated CLI option from --plugin to --plugins
- Added parsing for comma-separated plugin names
- Updated function signatures and logic to handle multiple plugins
- Updated help text, docstrings, and examples

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:20:29 +00:00
claude[bot]
64db6deab3 fix: revert incorrect --extract renaming, restore --plugin parameter
The --plugins parameter was incorrectly renamed to --extract (boolean).
This restores --plugin (singular, matching extract command) with correct
semantics: specify which plugin to run after creating snapshots.

- Changed --extract/--no-extract back to --plugin (string parameter)
- Updated function signature and logic to use plugin parameter
- Added ArchiveResult creation for specific plugin when --plugin is passed
- Updated docstring and examples

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:15:48 +00:00
claude[bot]
762cddc8c5 fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00
Claude
cf387ed59f refactor: batch all URLs into single Crawl, update tests
- archivebox crawl now creates one Crawl with all URLs as newline-separated string
- Updated tests to reflect new pipeline: crawl -> snapshot -> extract
- Added tests for Crawl JSONL parsing and output
- Tests verify Crawl.from_jsonl() handles multiple URLs correctly
2025-12-30 20:06:56 +00:00
Nick Sweeting
bb59287411 Merge branch 'dev' into claude/snapshot-index-jsonl-UxEXK 2025-12-30 12:05:05 -08:00
Nick Sweeting
099d955ef5 Implement tags editor widget for Django admin (#1729)
Implement a sleek inline tag editor with autocomplete and AJAX support:

- Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py
  - Pills display with X remove button, sorted alphabetically
  - Text input with HTML5 datalist autocomplete
  - Enter/Space/Comma to add tags, auto-creates if doesn't exist
  - Backspace removes last tag when input is empty

- Add API endpoints in api/v1_core.py
  - GET /tags/autocomplete/ - search tags by name
  - POST /tags/create/ - get_or_create tag
  - POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX
  - POST /tags/remove-from-snapshot/ - remove tag from snapshot

- Update admin_snapshots.py
  - Replace FilteredSelectMultiple with TagEditorWidget in bulk actions
  - Create SnapshotAdminForm with tags_editor field
  - Update title_str() to render inline tag editor in list view
  - Remove TagInline, use widget instead

- Add CSS styles in templates/admin/base.html
  - Blue gradient pill styling matching admin theme
  - Focus ring and hover states
  - Compact inline variant for list view

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Implemented a new interactive tags editor for Django admin with
autocomplete and AJAX add/remove, replacing the old multi-select and
inline. This makes tagging snapshots faster and safer in detail, list,
and bulk actions.

- **New Features**
- TagEditorWidget and InlineTagEditorWidget with pill UI and remove
buttons, XSS-safe rendering, and delegated events.
- Keyboard support: Enter/Space/Comma to add, Backspace to remove last
when input is empty.
- Datalist autocomplete and debounced search via GET
/tags/autocomplete/.
- AJAX endpoints: POST /tags/create/, /tags/add-to-snapshot/,
/tags/remove-from-snapshot/.

- **Refactors**
- Replaced FilteredSelectMultiple with TagEditorWidget in bulk actions;
parse comma-separated tags and use bulk_create/delete for efficient
add/remove.
- Added SnapshotAdminForm with tags_editor field; saves tags
case-insensitively and fixes remove_tags matching.
- Rendered inline tag editor in list view via title_str; removed
TagInline.
- Added CSS in admin/base.html for pill styling, focus ring, and compact
inline variant.

<sup>Written for commit 0dee662f41.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-30 11:59:39 -08:00
Claude
69965a2782 fix: correct CLI pipeline data flow for crawl -> snapshot -> extract
- archivebox crawl: creates Crawl records, outputs Crawl JSONL
- archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL
- archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL

Changes:
- Add Crawl.from_jsonl() method for creating Crawl from JSONL records
- Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them
- Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs
- Update jsonl.py docstring to document the pipeline
2025-12-30 19:42:41 +00:00
Claude
ae648c9bc1 refactor: move remaining JSONL methods to models, clean up jsonl.py
- Add Tag.to_jsonl() method with schema_version
- Add Crawl.to_jsonl() method with schema_version
- Fix Tag.from_jsonl() to not depend on jsonl.py helper
- Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot

Remove model-specific functions from misc/jsonl.py:
- tag_to_jsonl() - use Tag.to_jsonl() instead
- crawl_to_jsonl() - use Crawl.to_jsonl() instead
- get_or_create_tag() - use Tag.from_jsonl() instead
- process_jsonl_records() - use model from_jsonl() methods directly

jsonl.py now only contains generic I/O utilities:
- Type constants (TYPE_SNAPSHOT, etc.)
- parse_line(), read_stdin(), read_file(), read_args_or_stdin()
- write_record(), write_records()
- filter_by_type(), process_records()
2025-12-30 19:30:18 +00:00
Claude
0dee662f41 Use bulk operations for add/remove tags actions
- add_tags: Uses SnapshotTag.objects.bulk_create() with ignore_conflicts
  Instead of N calls to obj.tags.add(), now makes 1 query per tag
- remove_tags: Uses single SnapshotTag.objects.filter().delete()
  Instead of N calls to obj.tags.remove(), now makes 1 query total

Works correctly with "select all across pages" via queryset.values_list()
2025-12-30 19:29:34 +00:00
Claude
bc273c5a7f feat: add schema_version to JSONL outputs and remove dead code
- Add schema_version (archivebox.VERSION) to all to_jsonl() outputs:
  - Snapshot.to_jsonl()
  - ArchiveResult.to_jsonl()
  - Binary.to_jsonl()
  - Process.to_jsonl()

- Update CLI commands to use model methods directly:
  - archivebox_snapshot.py: snapshot.to_jsonl()
  - archivebox_extract.py: result.to_jsonl()

- Remove dead wrapper functions from misc/jsonl.py:
  - snapshot_to_jsonl()
  - archiveresult_to_jsonl()
  - binary_to_jsonl()
  - process_to_jsonl()
  - machine_to_jsonl()

- Update tests to use model methods directly
2025-12-30 19:24:53 +00:00
claude[bot]
03b96ef4ce Fix security issues in tag editor widgets
- Fix case-sensitivity mismatch in remove_tags (use name__iexact)
- Fix XSS vulnerability by removing onclick attributes
- Use data attributes and event delegation instead
- Apply DOM APIs to prevent injection attacks

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 19:18:41 +00:00
Claude
a5206e7648 refactor: move to_jsonl() methods to models
Move JSONL serialization from standalone functions to model methods
to mirror the from_jsonl() pattern:

- Add Binary.to_jsonl() method
- Add Process.to_jsonl() method
- Add ArchiveResult.to_jsonl() method
- Add Snapshot.to_jsonl() method
- Update write_index_jsonl() to use model methods
- Update jsonl.py functions to be thin wrappers
2025-12-30 18:35:22 +00:00
Nick Sweeting
91375d35a3 more migrations 2025-12-30 10:30:52 -08:00
Claude
d36079829b feat: replace index.json with index.jsonl flat JSONL format
Switch from hierarchical index.json to flat index.jsonl format for
snapshot metadata storage. Each line is a self-contained JSON record
with a 'type' field (Snapshot, ArchiveResult, Binary, Process).

Changes:
- Add JSONL_INDEX_FILENAME constant to constants.py
- Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants
- Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters
- Add Snapshot.write_index_jsonl() to write new format
- Add Snapshot.read_index_jsonl() to read new format
- Add Snapshot.convert_index_json_to_jsonl() for migration
- Update Snapshot.reconcile_with_index() to handle both formats
- Update fs_migrate to convert during filesystem migration
- Update load_from_directory/create_from_directory for both formats
- Update legacy.py parse_json_links_details for JSONL support

The new format is easier to parse, extend, and mix record types.
2025-12-30 18:21:06 +00:00
Nick Sweeting
96ee1bf686 more migration fixes 2025-12-30 09:57:33 -08:00
Nick Sweeting
4cd2fceb8a even more migration fixes 2025-12-29 22:30:37 -08:00
Nick Sweeting
95beddc5fc more migration fixes 2025-12-29 22:12:57 -08:00
Nick Sweeting
2e350d317d fix initial migrtaions 2025-12-29 21:27:31 -08:00
Nick Sweeting
3dd329600e comment updates 2025-12-29 21:05:34 -08:00
Nick Sweeting
80f75126c6 more fixes 2025-12-29 21:03:05 -08:00
Nick Sweeting
a648c17ec7 Merge branch 'dev' into claude/tags-editor-widget-0Dq7f 2025-12-29 19:25:36 -08:00
Nick Sweeting
147d567d3f fix migrations 2025-12-29 19:25:26 -08:00
Nick Sweeting
dcf1136afa Merge branch 'dev' into claude/tags-editor-widget-0Dq7f 2025-12-29 19:24:51 -08:00
Nick Sweeting
64dccb7a19 passing 2025-12-29 18:55:57 -08:00
Nick Sweeting
5549a79869 more speed fixes 2025-12-29 18:55:37 -08:00
Nick Sweeting
abf5f44134 more debug logging 2025-12-29 18:53:52 -08:00
Nick Sweeting
bcf0513d05 more debug logging 2025-12-29 18:50:04 -08:00
Nick Sweeting
7e6e3be9e7 messing with chrome install process to reuse cached chromium with pinned version 2025-12-29 18:49:36 -08:00
Claude
202e5b2e59 Add interactive tags editor widget for Django admin
Implement a sleek inline tag editor with autocomplete and AJAX support:

- Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py
  - Pills display with X remove button, sorted alphabetically
  - Text input with HTML5 datalist autocomplete
  - Enter/Space/Comma to add tags, auto-creates if doesn't exist
  - Backspace removes last tag when input is empty

- Add API endpoints in api/v1_core.py
  - GET /tags/autocomplete/ - search tags by name
  - POST /tags/create/ - get_or_create tag
  - POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX
  - POST /tags/remove-from-snapshot/ - remove tag from snapshot

- Update admin_snapshots.py
  - Replace FilteredSelectMultiple with TagEditorWidget in bulk actions
  - Create SnapshotAdminForm with tags_editor field
  - Update title_str() to render inline tag editor in list view
  - Remove TagInline, use widget instead

- Add CSS styles in templates/admin/base.html
  - Blue gradient pill styling matching admin theme
  - Focus ring and hover states
  - Compact inline variant for list view
2025-12-30 02:18:08 +00:00
Nick Sweeting
b670612685 centralize chrome pid and zombie logic in chrome_utils 2025-12-29 17:57:23 -08:00
Nick Sweeting
4ba3e8d120 fix extension loading and consolidate chromium logic 2025-12-29 17:47:37 -08:00
Nick Sweeting
638b3ba774 add modalcloser plugin 2025-12-29 14:36:15 -08:00
Nick Sweeting
bdec5cb590 Fix: Make CUSTOM_TEMPLATES_DIR configurable again (#1725)
## Summary

Resolves #1484 where CUSTOM_TEMPLATES_DIR configuration was being
ignored.

The setting was previously removed from ServerConfig and hardcoded as a
constant, preventing users from customizing the templates directory
location.

## Changes

- Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py
- Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR
- Updated paths.py to use configurable value in version output

## Usage

Users can now configure the custom templates directory via:
- ArchiveBox.conf: `CUSTOM_TEMPLATES_DIR = ./custom_templates`
- Environment variable: `export CUSTOM_TEMPLATES_DIR=/path/to/templates`
- Defaults to DATA_DIR/user_templates if not configured

Generated with [Claude Code](https://claude.ai/code)

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Restores CUSTOM_TEMPLATES_DIR configurability so users can override the
templates directory. Fixes issue #1484 and updates the app to
consistently use the configured path.

- **Bug Fixes**
  - Added CUSTOM_TEMPLATES_DIR to StorageConfig.
- Updated settings.py and paths.py to read
STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR.

- **Migration**
  - Configure via ArchiveBox.conf or the CUSTOM_TEMPLATES_DIR env var.
  - Defaults to DATA_DIR/user_templates if not set.

<sup>Written for commit 329d185d95.
Summary will update automatically on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-29 13:58:05 -08:00
Nick Sweeting
2be21ac592 fix: Use CustomUserAdmin to fix user creation bug (#1726)
### Summary

Fixed the bug where users created via the web GUI cannot login.

### Root Cause

The issue was in `archivebox/core/admin.py` which imported and
registered Django's default `UserAdmin` instead of the custom
`CustomUserAdmin` class. This bypassed all custom admin logic.
Additionally, `CustomUserAdmin` was modifying `fieldsets` without
explicitly preserving `add_fieldsets`, which could cause Django to not
properly handle the user creation form.

### Changes

- Updated `admin.py` to import and register `CustomUserAdmin`
- Explicitly set `add_fieldsets` in `CustomUserAdmin` to preserve
Django's default user creation behavior
- Added explanatory comments

### Testing

To verify the fix:
1. Start ArchiveBox web server
2. Navigate to the admin user creation page (`/admin/auth/user/add/`)
3. Create a new user with staff and superuser permissions
4. Log out and attempt to log in with the new user's credentials
5. Login should now succeed

Fixes #1707

Generated with [Claude Code](https://claude.ai/code)
2025-12-29 13:57:31 -08:00
Nick Sweeting
8c69124935 make infiniscroll plugin also expand details and comments sections 2025-12-29 13:55:27 -08:00
Nick Sweeting
b649db5294 fix infiniscroll plugin 2025-12-29 13:55:26 -08:00
claude[bot]
329d185d95 Fix: Make CUSTOM_TEMPLATES_DIR configurable again
Resolves issue #1484 where CUSTOM_TEMPLATES_DIR configuration was
being ignored. The setting was previously removed from ServerConfig
and hardcoded as a constant, preventing users from customizing the
templates directory location.

Changes:
- Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py
- Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR
- Updated paths.py to use configurable value in version output

Users can now configure the custom templates directory via:
- ArchiveBox.conf: CUSTOM_TEMPLATES_DIR = ./custom_templates
- Environment variable: export CUSTOM_TEMPLATES_DIR=/path/to/templates
- Defaults to DATA_DIR/user_templates if not configured

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-29 21:50:21 +00:00
claude[bot]
2e1093f840 fix: Use CustomUserAdmin instead of Django's default UserAdmin to fix user creation bug
The bug was caused by importing Django's default UserAdmin instead of
CustomUserAdmin in admin.py. This bypassed all custom admin logic.

Additionally, CustomUserAdmin was modifying fieldsets without explicitly
preserving add_fieldsets, which can cause Django to not properly handle
the user creation form, leading to password hashing issues.

Changes:
- Updated admin.py to import and register CustomUserAdmin
- Explicitly set add_fieldsets in CustomUserAdmin to preserve Django's
  default user creation behavior and ensure passwords are properly hashed
- Added explanatory comments

Fixes #1707

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-29 21:47:53 +00:00
Nick Sweeting
34d03be891 Add MAX_URL_ATTEMPTS option to ArchiveBox (#1723)
…lures

Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that
stops retrying ArchiveResult hooks for a snapshot once that many
failures have been recorded. This prevents infinite retry loops for
problematic URLs.

When the limit is reached, any pending ArchiveResults for that snapshot
are marked as SKIPPED with an explanatory message.

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-12-29 13:32:11 -08:00
Nick Sweeting
690f0669cd remove uneeded test 2025-12-29 13:30:25 -08:00
Claude
f88182df7a Merge remote-tracking branch 'origin/dev' into claude/add-max-url-attempts-oBHCD 2025-12-29 21:29:01 +00:00
Nick Sweeting
73e977ea97 ytdlp fixes 2025-12-29 13:26:50 -08:00
Nick Sweeting
92c26124a3 remove more hardcoded plugin names from codebase 2025-12-29 13:21:47 -08:00