ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-03 09:25:42 +10:00

Author	SHA1	Message	Date
Nick Sweeting	a04e4a7345	cleanup migrations, json, jsonl	2025-12-31 15:36:43 -08:00
Nick Sweeting	edc83bfac6	Add persona CLI command with browser cookie import (#1747)	2025-12-31 10:56:40 -08:00
claude[bot]	3659adeb7e	Fix path traversal vulnerabilities in persona management Add input validation and path safety checks to prevent path traversal attacks in persona name handling: - Add validate_persona_name() to block dangerous characters (/, \, .., etc) - Add ensure_path_within_personas_dir() to verify resolved paths stay within PERSONAS_DIR - Apply validation at persona creation, renaming, and deletion operations Fixes security issues identified by cubic-dev-ai in PR review. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 18:30:26 +00:00
Nick Sweeting	20690fabbf	Fix CLI tests to use subprocess and remove mocks (#1746)	2025-12-31 10:20:50 -08:00
Claude	73425fa984	Add persona CLI command with browser cookie import - Add `archivebox persona create/list/update/delete` commands - Support `--import=chrome|firefox|brave` to copy browser profile - Extract cookies via CDP to generate cookies.txt for non-browser tools - Fix JSDoc comment parsing issue in chrome_utils.js	2025-12-31 12:13:07 +00:00
claude[bot]	5121b0e5f9	Merge branch 'dev' into claude/refactor-process-management-WcQyZ Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:28:47 +00:00
Claude	b87bbbbecb	Fix CLI tests to use subprocess and remove mocks - Fix conftest.py: use subprocess for init, remove unused cli_env fixture - Update all test files to use data_dir parameter instead of env - Remove mock-based TestJSONLOutput class from tests_piping.py - Remove unused imports (MagicMock, patch) - Fix file permissions for cli_utils.py All tests now use real subprocess calls per CLAUDE.md guidelines: - NO MOCKS - tests exercise real code paths - NO SKIPS - every test runs	2025-12-31 10:53:45 +00:00
Nick Sweeting	7dd2d65770	Add pluginmap management command (#1742)	2025-12-31 02:29:24 -08:00
Claude	bb52b5902a	Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6) Add comprehensive unit tests for the CLI piping architecture: - test_cli_crawl.py: crawl create/list/update/delete tests - test_cli_snapshot.py: snapshot create/list/update/delete tests - test_cli_archiveresult.py: archiveresult create/list/update/delete tests - test_cli_run.py: run command create-or-update and pass-through tests Extend tests_piping.py with: - TestPassThroughBehavior: tests for pass-through behavior in all commands - TestPipelineAccumulation: tests for accumulating records through pipeline All tests use pytest fixtures from conftest.py with isolated DATA_DIR.	2025-12-31 10:21:05 +00:00
Claude	672ccf918d	Add pluginmap management command Adds a new CLI command `archivebox pluginmap` that displays: - ASCII art diagrams of all core state machines (Crawl, Snapshot, ArchiveResult, Binary) - Lists all auto-detected on_Modelname_xyz hooks grouped by model/event - Shows hook execution order (step 0-9), plugin name, and background status Usage: archivebox pluginmap # Show all diagrams and hooks archivebox pluginmap -m Snapshot # Filter to specific model archivebox pluginmap -a # Include disabled plugins archivebox pluginmap -q # Output JSON only	2025-12-31 10:19:58 +00:00
Claude	f3e11b61fd	Implement JSONL CLI pipeline architecture (Phases 1-4, 6) Phase 1: Model Prerequisites - Add ArchiveResult.from_json() and from_jsonl() methods - Fix Snapshot.to_json() to use tags_str (consistent with Crawl) Phase 2: Shared Utilities - Create archivebox/cli/cli_utils.py with shared apply_filters() - Update 7 CLI files to import from cli_utils.py instead of duplicating Phase 3: Pass-Through Behavior - Add pass-through to crawl create (non-Crawl records pass unchanged) - Add pass-through to snapshot create (Crawl records + others pass through) - Add pass-through to archiveresult create (Snapshot records + others) - Add create-or-update behavior to run command: - Records WITHOUT id: Create via Model.from_json() - Records WITH id: Lookup existing, re-queue - Outputs JSONL of all processed records for chaining Phase 4: Test Infrastructure - Create archivebox/tests/conftest.py with pytest-django fixtures - Include CLI helpers, output assertions, database assertions Phase 6: Config Update - Update supervisord_util.py: orchestrator -> run command This enables Unix-style piping: archivebox crawl create URL | archivebox run archivebox archiveresult list --status=failed | archivebox run curl API | jq transform | archivebox crawl create | archivebox run	2025-12-31 10:07:14 +00:00
Nick Sweeting	dd2302ad92	new jsonl cli interface	2025-12-30 16:12:53 -08:00
Nick Sweeting	08366cfa46	document chrome configs	2025-12-30 12:42:50 -08:00
claude[bot]	251fe33e49	fix: rename --plugin to --plugins for consistency Changed from singular --plugin to plural --plugins in both snapshot and extract commands to match the pattern in archivebox add command. Updated to accept comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title). - Updated CLI option from --plugin to --plugins - Added parsing for comma-separated plugin names - Updated function signatures and logic to handle multiple plugins - Updated help text, docstrings, and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:20:29 +00:00
claude[bot]	64db6deab3	fix: revert incorrect --extract renaming, restore --plugin parameter The --plugins parameter was incorrectly renamed to --extract (boolean). This restores --plugin (singular, matching extract command) with correct semantics: specify which plugin to run after creating snapshots. - Changed --extract/--no-extract back to --plugin (string parameter) - Updated function signature and logic to use plugin parameter - Added ArchiveResult creation for specific plugin when --plugin is passed - Updated docstring and examples Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:15:48 +00:00
claude[bot]	762cddc8c5	fix: address PR review comments from cubic-dev-ai - Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency - Fix fallback logic in legacy.py to try JSON when JSONL parsing fails - Replace bare except clauses with specific exception types - Fix stdin double-consumption in archivebox_crawl.py - Merge CLI --tag option with crawl tags in archivebox_snapshot.py - Remove tautological mock tests (covered by integration tests) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-30 20:09:51 +00:00
Claude	cf387ed59f	refactor: batch all URLs into single Crawl, update tests - archivebox crawl now creates one Crawl with all URLs as newline-separated string - Updated tests to reflect new pipeline: crawl -> snapshot -> extract - Added tests for Crawl JSONL parsing and output - Tests verify Crawl.from_jsonl() handles multiple URLs correctly	2025-12-30 20:06:56 +00:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Claude	ae648c9bc1	refactor: move remaining JSONL methods to models, clean up jsonl.py - Add Tag.to_jsonl() method with schema_version - Add Crawl.to_jsonl() method with schema_version - Fix Tag.from_jsonl() to not depend on jsonl.py helper - Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot Remove model-specific functions from misc/jsonl.py: - tag_to_jsonl() - use Tag.to_jsonl() instead - crawl_to_jsonl() - use Crawl.to_jsonl() instead - get_or_create_tag() - use Tag.from_jsonl() instead - process_jsonl_records() - use model from_jsonl() methods directly jsonl.py now only contains generic I/O utilities: - Type constants (TYPE_SNAPSHOT, etc.) - parse_line(), read_stdin(), read_file(), read_args_or_stdin() - write_record(), write_records() - filter_by_type(), process_records()	2025-12-30 19:30:18 +00:00
Claude	bc273c5a7f	feat: add schema_version to JSONL outputs and remove dead code - Add schema_version (archivebox.VERSION) to all to_jsonl() outputs: - Snapshot.to_jsonl() - ArchiveResult.to_jsonl() - Binary.to_jsonl() - Process.to_jsonl() - Update CLI commands to use model methods directly: - archivebox_snapshot.py: snapshot.to_jsonl() - archivebox_extract.py: result.to_jsonl() - Remove dead wrapper functions from misc/jsonl.py: - snapshot_to_jsonl() - archiveresult_to_jsonl() - binary_to_jsonl() - process_to_jsonl() - machine_to_jsonl() - Update tests to use model methods directly	2025-12-30 19:24:53 +00:00
Nick Sweeting	96ee1bf686	more migration fixes	2025-12-30 09:57:33 -08:00
Nick Sweeting	95beddc5fc	more migration fixes	2025-12-29 22:12:57 -08:00
Nick Sweeting	2e350d317d	fix initial migrtaions	2025-12-29 21:27:31 -08:00
Nick Sweeting	3dd329600e	comment updates	2025-12-29 21:05:34 -08:00
Nick Sweeting	80f75126c6	more fixes	2025-12-29 21:03:05 -08:00
Claude	a5654e877f	rename media plugin to ytdlp with backwards-compatible aliases - Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/ - Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py - Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases - Update templates CSS classes: media-* → ytdlp-* - Fix gallerydl bug: remove incorrect dependency on media plugin output - Update all codebase references to use YTDLP_* and SAVE_YTDLP - Add backwards compatibility test for MEDIA_ENABLED alias	2025-12-29 19:09:05 +00:00
Nick Sweeting	30c60eef76	much better tests and add page ui	2025-12-29 04:02:11 -08:00
Nick Sweeting	f4e7820533	use full dotted paths for all archivebox imports, add migrations and more fixes	2025-12-29 00:47:08 -08:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Claude	057b49ad85	Update status command to use DB as source of truth Remove imports of deleted folder utility functions and rewrite status command to query Snapshot model directly. This aligns with the fs_version refactor where the DB is the single source of truth. - Use Snapshot.objects queries for indexed/archived/unarchived counts - Scan filesystem directly for present/orphaned directory counts - Simplify output to focus on essential status information	2025-12-28 19:19:03 +00:00
Nick Sweeting	bd265c0083	rename extractor to plugin everywhere	2025-12-28 04:43:15 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Claude	b632894bc9	Update views, API, and exports for new ArchiveResult output fields Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.	2025-12-27 20:28:22 +00:00
Claude	c3acadd528	Remove extractor field from Crawl model and fix tests - Remove extractor field from Crawl model (moved to config dict) - Update migration 0002_drop_seed_model to not add extractor - Update archivebox_add.py to use config['PARSER'] instead - Update admin.py recrawl to not pass extractor - Update jsonl.py serialization to not include extractor - Update test schema SCHEMA_0_8 to not include extractor - Set default timeout to 60s for test commands	2025-12-27 01:49:09 +00:00
Nick Sweeting	9838d7ba02	tons of ui fixes and plugin fixes	2025-12-25 03:59:51 -08:00
Nick Sweeting	bb53228ebf	remove Seed model in favor of Crawl as template	2025-12-25 01:52:41 -08:00
Nick Sweeting	28e6c5bb65	add mcp server support	2025-12-25 01:51:42 -08:00
Nick Sweeting	866f993f26	logging and admin ui improvements	2025-12-25 01:10:41 -08:00
Nick Sweeting	d95f0dc186	remove huey	2025-12-24 23:40:18 -08:00
Nick Sweeting	6c769d831c	wip 2	2025-12-24 21:46:14 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00
Nick Sweeting	c1335fed37	Remove ABID system and KVTag model - use UUIDv7 IDs exclusively This commit completes the simplification of the ID system by: - Removing the ABID (ArchiveBox ID) system entirely - Removing the base_models/abid.py file - Removing KVTag model in favor of the existing Tag model in core/models.py - Simplifying all models to use standard UUIDv7 primary keys - Removing ABID-related admin functionality - Cleaning up commented-out ABID code from views and statemachines - Deleting migration files for ABID field removal (no longer needed) All models now use simple UUIDv7 ids via `id = models.UUIDField(primary_key=True, default=uuid7)` Note: Old migrations containing ABID references are preserved for database migration history compatibility. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-24 06:13:49 -08:00
Nick Sweeting	930b9bf386	add archivebox worker cli cmd to list of all cmds	2024-12-12 21:44:44 -08:00
Nick Sweeting	5cf7725f0e	add new archivebox worker implementation based on better distributed systems principles	2024-12-12 21:41:45 -08:00
Nick Sweeting	dcd7e2555e	add new archivebox_extract cli command	2024-12-03 02:14:56 -08:00
Nick Sweeting	b948e49013	add urls log to Crawl model	2024-11-19 06:32:33 -08:00
Nick Sweeting	4dd53dc12a	Merge branch 'newchanges' into dev	2024-11-19 05:28:20 -08:00
Nick Sweeting	b852951c58	fix cli loading edge case where setup_django wasnt running when it should	2024-11-19 05:27:35 -08:00
Nick Sweeting	f8e2f7c753	restore missing archivebox_update work	2024-11-19 05:09:19 -08:00
Nick Sweeting	52446b86ba	restore missing archivebox_status work	2024-11-19 05:08:41 -08:00

221 Commits
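
Commit 3659adeb7e above describes two persona safety checks: validate_persona_name(), which blocks dangerous characters in names, and ensure_path_within_personas_dir(), which verifies that resolved paths stay within PERSONAS_DIR. The repository's actual implementation is not shown in this log, so the following is only a minimal sketch of how such checks might look. The function names are taken from the commit message; the PERSONAS_DIR value, the exact forbidden-token list, the signatures, and the error messages are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical location; the real PERSONAS_DIR is defined by ArchiveBox's config.
PERSONAS_DIR = Path("/data/personas")

# Assumed list: the commit message names "/", "\" and ".." and then says "etc".
FORBIDDEN_TOKENS = ("/", "\\", "..", "\0")


def validate_persona_name(name: str) -> str:
    """Reject names that are empty or contain path-like tokens."""
    if not name or not name.strip():
        raise ValueError("persona name must not be empty")
    for token in FORBIDDEN_TOKENS:
        if token in name:
            raise ValueError(f"persona name contains forbidden sequence {token!r}")
    return name


def ensure_path_within_personas_dir(path: Path, personas_dir: Path = PERSONAS_DIR) -> Path:
    """Resolve the path and confirm it is strictly inside personas_dir."""
    root = personas_dir.resolve()
    resolved = path.resolve()  # collapses ".." segments and follows symlinks
    # is_relative_to() requires Python 3.9+; also reject the root itself.
    if resolved == root or not resolved.is_relative_to(root):
        raise ValueError(f"{resolved} is outside of {root}")
    return resolved
```

In this sketch the two checks are complementary: the name check rejects obviously hostile input early, while the resolved-path check catches anything that still escapes the directory after normalization (for example via a symlink). A caller would typically validate the name first, then build personas_dir / name and pass that path to the second check before creating, renaming, or deleting anything.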