ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 01:15:57 +10:00

Author	SHA1	Message	Date
Nick Sweeting	2be21ac592	fix: Use CustomUserAdmin to fix user creation bug (#1726) ### Summary Fixed the bug where users created via the web GUI cannot login. ### Root Cause The issue was in `archivebox/core/admin.py` which imported and registered Django's default `UserAdmin` instead of the custom `CustomUserAdmin` class. This bypassed all custom admin logic. Additionally, `CustomUserAdmin` was modifying `fieldsets` without explicitly preserving `add_fieldsets`, which could cause Django to not properly handle the user creation form. ### Changes - Updated `admin.py` to import and register `CustomUserAdmin` - Explicitly set `add_fieldsets` in `CustomUserAdmin` to preserve Django's default user creation behavior - Added explanatory comments ### Testing To verify the fix: 1. Start ArchiveBox web server 2. Navigate to the admin user creation page (`/admin/auth/user/add/`) 3. Create a new user with staff and superuser permissions 4. Log out and attempt to log in with the new user's credentials 5. Login should now succeed Fixes #1707 Generated with [Claude Code](https://claude.ai/code)	2025-12-29 13:57:31 -08:00
Nick Sweeting	8c69124935	make infiniscroll plugin also expand details and comments sections	2025-12-29 13:55:27 -08:00
Nick Sweeting	621359c37c	add duplicate issue detection bot with opencode	2025-12-29 13:55:26 -08:00
Nick Sweeting	b649db5294	fix infiniscroll plugin	2025-12-29 13:55:26 -08:00
claude[bot]	2e1093f840	fix: Use CustomUserAdmin instead of Django's default UserAdmin to fix user creation bug The bug was caused by importing Django's default UserAdmin instead of CustomUserAdmin in admin.py. This bypassed all custom admin logic. Additionally, CustomUserAdmin was modifying fieldsets without explicitly preserving add_fieldsets, which can cause Django to not properly handle the user creation form, leading to password hashing issues. Changes: - Updated admin.py to import and register CustomUserAdmin - Explicitly set add_fieldsets in CustomUserAdmin to preserve Django's default user creation behavior and ensure passwords are properly hashed - Added explanatory comments Fixes #1707 Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-29 21:47:53 +00:00
Nick Sweeting	9f015df0d8	Add Claude Code GitHub Workflow (#1724) ## 🤖 Installing Claude Code GitHub App This PR adds a GitHub Actions workflow that enables Claude Code integration in our repository. ### What is Claude Code? [Claude Code](https://claude.com/claude-code) is an AI coding agent that can help with: - Bug fixes and improvements - Documentation updates - Implementing new features - Code reviews and suggestions - Writing tests - And more! ### How it works Once this PR is merged, we'll be able to interact with Claude by mentioning @claude in a pull request or issue comment. Once the workflow is triggered, Claude will analyze the comment and surrounding context, and execute on the request in a GitHub action. ### Important Notes - This workflow won't take effect until this PR is merged - @claude mentions won't work until after the merge is complete - The workflow runs automatically whenever Claude is mentioned in PR or issue comments - Claude gets access to the entire PR or issue context including files, diffs, and previous comments ### Security - Our Anthropic API key is securely stored as a GitHub Actions secret - Only users with write access to the repository can trigger the workflow - All Claude runs are stored in the GitHub Actions run history - Claude's default tools are limited to reading/writing files and interacting with our repo by creating comments, branches, and commits. - We can add more allowed tools by adding them to the workflow file like: ``` allowed_tools: Bash(npm install),Bash(npm run build),Bash(npm run lint),Bash(npm run test) ``` There's more information in the [Claude Code action repo](https://github.com/anthropics/claude-code-action). After merging this PR, let's try mentioning @claude in a comment on any PR to get started!	2025-12-29 13:43:10 -08:00
Nick Sweeting	8c280100c7	Change permissions for pull-requests and issues	2025-12-29 13:42:59 -08:00
Nick Sweeting	d8b10d0827	Delete .github/workflows/claude-code-review.yml	2025-12-29 13:40:55 -08:00
Nick Sweeting	58b7f9c334	"Claude Code Review workflow"	2025-12-29 13:40:20 -08:00
Nick Sweeting	0162ee2434	"Claude PR Assistant workflow"	2025-12-29 13:40:18 -08:00
Nick Sweeting	34d03be891	Add MAX_URL_ATTEMPTS option to ArchiveBox (#1723) …lures Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops retrying ArchiveResult hooks for a snapshot once that many failures have been recorded. This prevents infinite retry loops for problematic URLs. When the limit is reached, any pending ArchiveResults for that snapshot are marked as SKIPPED with an explanatory message.	2025-12-29 13:32:11 -08:00
Nick Sweeting	690f0669cd	remove uneeded test	2025-12-29 13:30:25 -08:00
Claude	f88182df7a	Merge remote-tracking branch 'origin/dev' into claude/add-max-url-attempts-oBHCD	2025-12-29 21:29:01 +00:00
Nick Sweeting	73e977ea97	ytdlp fixes	2025-12-29 13:26:50 -08:00
Nick Sweeting	92c26124a3	remove more hardcoded plugin names from codebase	2025-12-29 13:21:47 -08:00
Nick Sweeting	967c5d53e0	make plugin config more consistent	2025-12-29 13:21:46 -08:00
Nick Sweeting	8d76b2b0c6	add infiniscroll plugin	2025-12-29 13:14:40 -08:00
Nick Sweeting	e20fdae2a5	fix gh ci cd	2025-12-29 13:14:40 -08:00
Claude	88d7906033	Add MAX_URL_ATTEMPTS config option to stop retries after too many failures Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops retrying ArchiveResult hooks for a snapshot once that many failures have been recorded. This prevents infinite retry loops for problematic URLs. When the limit is reached, any pending ArchiveResults for that snapshot are marked as SKIPPED with an explanatory message.	2025-12-29 20:20:50 +00:00
Nick Sweeting	e38ddf3a25	Rename media plugin to ytdlp (#1722) - Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/ - Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py - Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases - Update templates CSS classes: media-* → ytdlp-* - Fix gallerydl bug: remove incorrect dependency on media plugin output - Update all codebase references to use YTDLP_* and SAVE_YTDLP - Add backwards compatibility test for MEDIA_ENABLED alias	2025-12-29 11:47:05 -08:00
Claude	ac64c77341	move default yt-dlp args to config.json YTDLP_ARGS for user override - Move hardcoded default args from Python to config.json YTDLP_ARGS - Add get_ytdlp_args() function to read from YTDLP_ARGS env var - Keep format arg with max_size in code (depends on YTDLP_MAX_SIZE) - YTDLP_ARGS can be overridden as JSON array in environment	2025-12-29 19:38:37 +00:00
Claude	a5654e877f	rename media plugin to ytdlp with backwards-compatible aliases - Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/ - Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py - Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases - Update templates CSS classes: media-* → ytdlp-* - Fix gallerydl bug: remove incorrect dependency on media plugin output - Update all codebase references to use YTDLP_* and SAVE_YTDLP - Add backwards compatibility test for MEDIA_ENABLED alias	2025-12-29 19:09:05 +00:00
Nick Sweeting	30c60eef76	much better tests and add page ui	2025-12-29 04:02:11 -08:00
Nick Sweeting	9487f8a0de	add ci for parallel tests	2025-12-29 02:39:24 -08:00
Nick Sweeting	f4e7820533	use full dotted paths for all archivebox imports, add migrations and more fixes	2025-12-29 00:47:08 -08:00
Nick Sweeting	1e4d3ffd11	improve plugin tests and config	2025-12-29 00:45:23 -08:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Nick Sweeting	54f91c1339	Improve concurrency control between plugin hooks (#1721)	2025-12-28 12:48:53 -08:00
Nick Sweeting	6d991a08ea	fix final_status uneeded	2025-12-28 12:47:36 -08:00
Claude	057b49ad85	Update status command to use DB as source of truth Remove imports of deleted folder utility functions and rewrite status command to query Snapshot model directly. This aligns with the fs_version refactor where the DB is the single source of truth. - Use Snapshot.objects queries for indexed/archived/unarchived counts - Scan filesystem directly for present/orphaned directory counts - Simplify output to focus on essential status information	2025-12-28 19:19:03 +00:00
Claude	767458e4e0	Revert "Restore missing folder utility functions" This reverts commit `32bcf0896d`.	2025-12-28 19:16:52 +00:00
Claude	32bcf0896d	Restore missing folder utility functions Restored 10 folder status functions that were accidentally removed: - get_indexed_folders, get_archived_folders, get_unarchived_folders - get_present_folders, get_valid_folders, get_invalid_folders - get_duplicate_folders, get_orphaned_folders - get_corrupted_folders, get_unrecognized_folders These are required by archivebox_status.py for the status command. Added safety checks for non-existent archive directories.	2025-12-28 14:00:48 +00:00
Claude	6b3c87276f	Mark hook renumbering testing as complete in TODO All hook utility tests pass (extract_step, is_background_hook, discover_hooks). Model fields and methods verified (current_step, hook_name, advance_step_if_ready).	2025-12-28 13:48:11 +00:00
Claude	1b5a816022	Implement hook step-based concurrency system This implements the hook concurrency plan from TODO_hook_concurrency.md: ## Schema Changes - Add Snapshot.current_step (IntegerField 0-9, default=0) - Create migration 0034_snapshot_current_step.py - Fix uuid_compat imports in migrations 0032 and 0003 ## Core Logic - Add extract_step(hook_name) utility - extracts step from __XX_ pattern - Add is_background_hook(hook_name) utility - checks for .bg. suffix - Update Snapshot.create_pending_archiveresults() to create one AR per hook - Update ArchiveResult.run() to handle hook_name field - Add Snapshot.advance_step_if_ready() method for step advancement - Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready() ## Worker Coordination - Update ArchiveResultWorker.get_queue() for step-based filtering - ARs are only claimable when their step <= snapshot.current_step ## Hook Renumbering - Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53, title→54, readability→55, headers→55, mercury→56, htmltotext→57 - Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg, forumdl→65.bg, papersdl→66.bg - Step 7 (URL extraction): parse_* hooks moved to 70-75 Background hooks (.bg suffix) don't block step advancement, enabling long-running downloads to continue while other hooks proceed.	2025-12-28 13:47:25 +00:00
Nick Sweeting	b1e354619f	minor bugfixes	2025-12-28 05:33:09 -08:00
Nick Sweeting	4ccb0863bb	continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script	2025-12-28 05:29:24 -08:00
Nick Sweeting	d2e65cfd38	move todos	2025-12-28 04:44:38 -08:00
Nick Sweeting	bd265c0083	rename extractor to plugin everywhere	2025-12-28 04:43:15 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Nick Sweeting	a38624a4dd	Improve filesystem based hook architecture (#1720)	2025-12-27 13:03:21 -08:00
Claude	b632894bc9	Update views, API, and exports for new ArchiveResult output fields Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.	2025-12-27 20:28:22 +00:00
Nick Sweeting	9b533ad3c8	tweak concurrency for more speed	2025-12-27 12:08:53 -08:00
Claude	d65eb587d9	Add hook architecture unit tests + mark remaining work complete - Add test_hooks.py with 31 unit tests covering: - Background hook detection (.bg. suffix) - JSONL parsing (clean format and legacy RESULT_JSON= format) - Install hook XYZ_BINARY env var handling - Hook discovery and sorting - get_extractor_name() function - Hook execution with real subprocesses - Install hook output format compliance - Snapshot hook output format compliance - Plugin metadata addition - Update TODO_hook_architecture.md to mark all tasks complete: - Tests: 31 tests in archivebox/tests/test_hooks.py - Migrations: 0029 and 0030 applied successfully All phases of the hook architecture implementation are now complete.	2025-12-27 20:05:09 +00:00
Claude	4e50c4f182	Mark snapshot hook checklist items as complete All snapshot hooks now: - Read XYZ_BINARY env vars and use in cmd - Output exactly one clean JSONL line (no RESULT_JSON= prefix) - No extra output lines (VERSION=, START_TS=, etc.) - Only provide allowed fields - Don't include computed fields - Python hooks include cmd array with binary path	2025-12-27 10:14:14 +00:00
Claude	e3ba599812	Update install hooks to respect XYZ_BINARY env vars - All install hooks now respect their respective XYZ_BINARY env vars (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.) - Support both absolute paths (/usr/bin/wget2) and binary names (wget2) - Dynamic bin_name used in Dependency JSONL output - Updated 11 install hooks to follow the new pattern - Mark checklist items as complete in TODO_hook_architecture.md	2025-12-27 10:12:45 +00:00
Claude	8c846b7d1c	Rename validate hooks to install hooks - Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_* - This better reflects what these hooks actually do (check/install binaries) - Update TODO_hook_architecture.md to reflect renamed hooks	2025-12-27 10:06:34 +00:00
Claude	2623c6cc11	Complete JS hooks to clean JSONL format + rename background hooks - Update 12 remaining JS snapshot hooks to output clean JSONL - Remove RESULT_JSON= prefix, START_TS=, END_TS=, STATUS= output - Rename 3 background hooks with .bg. suffix: - consolelog -> on_Snapshot__21_consolelog.bg.js - ssl -> on_Snapshot__23_ssl.bg.js - responses -> on_Snapshot__24_responses.bg.js - Update TODO_hook_architecture.md with completion status	2025-12-27 09:46:59 +00:00
Claude	c52eef1459	Update Python/JS hooks to clean JSONL format + add audit report Phase 4 Plugin Audit Progress: - Audited all 6 Dependency hooks (all already compliant) - Audited all 11 Crawl Validate hooks (all already compliant) - Updated 8 Python Snapshot hooks to clean JSONL format - Updated 1 JS Snapshot hook (title.js) to clean JSONL format Snapshot hooks updated to remove: - RESULT_JSON= prefix - Extra output lines (START_TS=, END_TS=, DURATION=, VERSION=, OUTPUT=, STATUS=) Now output clean JSONL: {"type": "ArchiveResult", "status": "...", "output_str": "..."} Added implementation report to TODO_hook_architecture.md documenting: - All completed phases (1, 3, 6, 7) - Plugin audit results with status tables - Remaining 13 JS hooks that need updating - Files modified list	2025-12-27 09:31:03 +00:00
Claude	3d985fa8c8	Implement hook architecture with JSONL output support Phase 1: Database migration for new ArchiveResult fields - Add output_str (TextField) for human-readable summary - Add output_json (JSONField) for structured metadata - Add output_files (JSONField) for dict of {relative_path: {}} - Add output_size (BigIntegerField) for total bytes - Add output_mimetypes (CharField) for CSV of mimetypes - Add binary FK to InstalledBinary (optional) - Migrate existing 'output' field to new split fields Phase 3: Update run_hook() for JSONL parsing - Support new JSONL format (any line with {type: 'ModelName', ...}) - Maintain backwards compatibility with RESULT_JSON= format - Add plugin metadata to each parsed record - Detect background hooks with .bg. suffix in filename - Add find_binary_for_cmd() helper function - Add create_model_record() for processing side-effect records Phase 6: Update ArchiveResult.run() - Handle background hooks (return immediately when result is None) - Process 'records' from HookResult for side-effect models - Use new output fields (output_str, output_json, output_files, etc.) - Call create_model_record() for InstalledBinary, Machine updates Phase 7: Add background hook support - Add is_background_hook() method to ArchiveResult - Add check_background_completed() to check if process exited - Add finalize_background_hook() to collect results from completed hooks - Update SnapshotMachine.is_finished() to check/finalize background hooks - Update _populate_output_fields() to walk directory and populate stats Also updated references to old 'output' field in: - admin_archiveresults.py - statemachines.py - templatetags/core_tags.py	2025-12-27 08:38:49 +00:00
Nick Sweeting	cffbef84ed	make Claude.md stricter and improve migration tests	2025-12-27 00:33:51 -08:00
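
Note on the hook naming convention used in commits 1b5a816022 and 3d985fa8c8: those messages mention an `extract_step(hook_name)` utility that reads the step from the `__XX_` pattern, an `is_background_hook(hook_name)` utility that checks for the `.bg.` suffix, and the rule that ArchiveResults are claimable only when their step <= `snapshot.current_step`. The following is a minimal Python sketch of that idea, not the repository's actual code. It assumes the step is the tens digit of the two-digit prefix (inferred from "Step 5 ... singlefile→50" and "Step 6 ... wget→61"), and the `None` return for unmatched names and the `is_claimable` helper are hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Assumed filename shape, based on names quoted in the log:
#   on_<Event>__<NN>_<plugin>[.bg].<ext>   e.g. on_Snapshot__63_ytdlp.bg.py
_HOOK_NUMBER = re.compile(r"__(\d{2})_")


def extract_step(hook_name: str) -> Optional[int]:
    """Return the step (assumed: tens digit of the __NN_ prefix), or None.

    Returning None for names without a numeric prefix is a choice made
    for this sketch; the real utility may behave differently.
    """
    match = _HOOK_NUMBER.search(hook_name)
    if match is None:
        return None
    return int(match.group(1)) // 10


def is_background_hook(hook_name: str) -> bool:
    """True when the filename carries the .bg. marker before its extension."""
    return ".bg." in hook_name


def is_claimable(hook_name: str, current_step: int) -> bool:
    """Hypothetical helper: a hook may run once the snapshot reaches its step."""
    step = extract_step(hook_name)
    return step is not None and step <= current_step
```

Under these assumptions, `on_Snapshot__63_ytdlp.bg.py` maps to step 6 and is treated as a background hook, so it would become claimable once the snapshot reaches step 6 without blocking later step advancement.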


4725 Commits