ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-06 07:47:53 +10:00

Author	SHA1	Message	Date
Claude	cf387ed59f	refactor: batch all URLs into single Crawl, update tests - archivebox crawl now creates one Crawl with all URLs as newline-separated string - Updated tests to reflect new pipeline: crawl -> snapshot -> extract - Added tests for Crawl JSONL parsing and output - Tests verify Crawl.from_jsonl() handles multiple URLs correctly	2025-12-30 20:06:56 +00:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Claude	ae648c9bc1	refactor: move remaining JSONL methods to models, clean up jsonl.py - Add Tag.to_jsonl() method with schema_version - Add Crawl.to_jsonl() method with schema_version - Fix Tag.from_jsonl() to not depend on jsonl.py helper - Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot Remove model-specific functions from misc/jsonl.py: - tag_to_jsonl() - use Tag.to_jsonl() instead - crawl_to_jsonl() - use Crawl.to_jsonl() instead - get_or_create_tag() - use Tag.from_jsonl() instead - process_jsonl_records() - use model from_jsonl() methods directly jsonl.py now only contains generic I/O utilities: - Type constants (TYPE_SNAPSHOT, etc.) - parse_line(), read_stdin(), read_file(), read_args_or_stdin() - write_record(), write_records() - filter_by_type(), process_records()	2025-12-30 19:30:18 +00:00
Claude	bc273c5a7f	feat: add schema_version to JSONL outputs and remove dead code - Add schema_version (archivebox.VERSION) to all to_jsonl() outputs: - Snapshot.to_jsonl() - ArchiveResult.to_jsonl() - Binary.to_jsonl() - Process.to_jsonl() - Update CLI commands to use model methods directly: - archivebox_snapshot.py: snapshot.to_jsonl() - archivebox_extract.py: result.to_jsonl() - Remove dead wrapper functions from misc/jsonl.py: - snapshot_to_jsonl() - archiveresult_to_jsonl() - binary_to_jsonl() - process_to_jsonl() - machine_to_jsonl() - Update tests to use model methods directly	2025-12-30 19:24:53 +00:00
Nick Sweeting	96ee1bf686	more migration fixes	2025-12-30 09:57:33 -08:00
Nick Sweeting	95beddc5fc	more migration fixes	2025-12-29 22:12:57 -08:00
Nick Sweeting	2e350d317d	fix initial migrations	2025-12-29 21:27:31 -08:00
Nick Sweeting	3dd329600e	comment updates	2025-12-29 21:05:34 -08:00
Nick Sweeting	80f75126c6	more fixes	2025-12-29 21:03:05 -08:00
Claude	a5654e877f	rename media plugin to ytdlp with backwards-compatible aliases - Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/ - Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py - Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases - Update templates CSS classes: media-* → ytdlp-* - Fix gallerydl bug: remove incorrect dependency on media plugin output - Update all codebase references to use YTDLP_* and SAVE_YTDLP - Add backwards compatibility test for MEDIA_ENABLED alias	2025-12-29 19:09:05 +00:00
Nick Sweeting	30c60eef76	much better tests and add page ui	2025-12-29 04:02:11 -08:00
Nick Sweeting	f4e7820533	use full dotted paths for all archivebox imports, add migrations and more fixes	2025-12-29 00:47:08 -08:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Claude	057b49ad85	Update status command to use DB as source of truth Remove imports of deleted folder utility functions and rewrite status command to query Snapshot model directly. This aligns with the fs_version refactor where the DB is the single source of truth. - Use Snapshot.objects queries for indexed/archived/unarchived counts - Scan filesystem directly for present/orphaned directory counts - Simplify output to focus on essential status information	2025-12-28 19:19:03 +00:00
Nick Sweeting	bd265c0083	rename extractor to plugin everywhere	2025-12-28 04:43:15 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Claude	b632894bc9	Update views, API, and exports for new ArchiveResult output fields Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.	2025-12-27 20:28:22 +00:00
Claude	c3acadd528	Remove extractor field from Crawl model and fix tests - Remove extractor field from Crawl model (moved to config dict) - Update migration 0002_drop_seed_model to not add extractor - Update archivebox_add.py to use config['PARSER'] instead - Update admin.py recrawl to not pass extractor - Update jsonl.py serialization to not include extractor - Update test schema SCHEMA_0_8 to not include extractor - Set default timeout to 60s for test commands	2025-12-27 01:49:09 +00:00
Nick Sweeting	9838d7ba02	tons of ui fixes and plugin fixes	2025-12-25 03:59:51 -08:00
Nick Sweeting	bb53228ebf	remove Seed model in favor of Crawl as template	2025-12-25 01:52:41 -08:00
Nick Sweeting	28e6c5bb65	add mcp server support	2025-12-25 01:51:42 -08:00
Nick Sweeting	866f993f26	logging and admin ui improvements	2025-12-25 01:10:41 -08:00
Nick Sweeting	d95f0dc186	remove huey	2025-12-24 23:40:18 -08:00
Nick Sweeting	6c769d831c	wip 2	2025-12-24 21:46:14 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00
Nick Sweeting	c1335fed37	Remove ABID system and KVTag model - use UUIDv7 IDs exclusively This commit completes the simplification of the ID system by: - Removing the ABID (ArchiveBox ID) system entirely - Removing the base_models/abid.py file - Removing KVTag model in favor of the existing Tag model in core/models.py - Simplifying all models to use standard UUIDv7 primary keys - Removing ABID-related admin functionality - Cleaning up commented-out ABID code from views and statemachines - Deleting migration files for ABID field removal (no longer needed) All models now use simple UUIDv7 ids via `id = models.UUIDField(primary_key=True, default=uuid7)` Note: Old migrations containing ABID references are preserved for database migration history compatibility. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-24 06:13:49 -08:00
Nick Sweeting	930b9bf386	add archivebox worker cli cmd to list of all cmds	2024-12-12 21:44:44 -08:00
Nick Sweeting	5cf7725f0e	add new archivebox worker implementation based on better distributed systems principles	2024-12-12 21:41:45 -08:00
Nick Sweeting	dcd7e2555e	add new archivebox_extract cli command	2024-12-03 02:14:56 -08:00
Nick Sweeting	b948e49013	add urls log to Crawl model	2024-11-19 06:32:33 -08:00
Nick Sweeting	4dd53dc12a	Merge branch 'newchanges' into dev	2024-11-19 05:28:20 -08:00
Nick Sweeting	b852951c58	fix cli loading edge case where setup_django wasnt running when it should	2024-11-19 05:27:35 -08:00
Nick Sweeting	f8e2f7c753	restore missing archivebox_update work	2024-11-19 05:09:19 -08:00
Nick Sweeting	52446b86ba	restore missing archivebox_status work	2024-11-19 05:08:41 -08:00
Nick Sweeting	0f536ff18b	restore missing archivebox_schedule work	2024-11-19 05:07:55 -08:00
Nick Sweeting	fe3320eff0	restore missing archivebox_remove work	2024-11-19 05:07:12 -08:00
Nick Sweeting	230bf34e14	restore missing archivebox_config work	2024-11-19 05:05:06 -08:00
Nick Sweeting	6740202d78	fix cli loading edge case where setup_django wasnt running when it should	2024-11-19 04:20:00 -08:00
Nick Sweeting	f21b86aba8	better cli colors	2024-11-19 04:10:07 -08:00
Nick Sweeting	0f860d40f1	working archivebox_status CLI cmd	2024-11-19 04:05:05 -08:00
Nick Sweeting	292730ebad	working archivebox_schedule cmd	2024-11-19 03:54:47 -08:00
Nick Sweeting	3a64ced697	fix archivebox delete errors	2024-11-19 03:45:44 -08:00
Nick Sweeting	0347b911aa	archivebox add and remove CLI cmds	2024-11-19 03:40:01 -08:00
Nick Sweeting	2595139180	improve statemachine logging and archivebox update CLI cmd	2024-11-19 03:31:05 -08:00
Nick Sweeting	c9a05c9d94	working archivebox update CLI cmd	2024-11-19 02:32:05 -08:00
Nick Sweeting	a0edf218e8	fix archivebox init and archivebox install CLI commands	2024-11-19 01:05:49 -08:00
Nick Sweeting	5f01fc8307	fix archivebox shell and manage CLI commands	2024-11-19 00:48:39 -08:00
Nick Sweeting	328eb98a38	move main funcs into cli files and switch to using click for CLI	2024-11-19 00:18:51 -08:00
Nick Sweeting	569081a9eb	rename abid_utils to base_models	2024-11-18 19:40:05 -08:00
Nick Sweeting	65afd405b1	merge seeds and crawls apps	2024-11-18 19:23:14 -08:00


205 Commits
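
The commits from 2025-12-30 describe a CLI pipeline (crawl -> snapshot -> extract) in which each stage reads and writes JSONL records, with one Crawl holding all URLs as a newline-separated string. The sketch below illustrates that data flow only; it is not ArchiveBox's code. The helper names `parse_line`, `write_record`, and `filter_by_type` are borrowed from the commit messages, but their signatures, the record fields (`type`, `schema_version`, `urls`, `url`), the `SCHEMA_VERSION` value, the treatment of plain URLs, and the functions `crawl_record` and `snapshots_from_crawl` are all assumptions made for illustration.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional, TextIO

# Hypothetical constants and record layout, for illustration only.
TYPE_CRAWL = "Crawl"
TYPE_SNAPSHOT = "Snapshot"
SCHEMA_VERSION = "0.0.0-example"  # stand-in for archivebox.VERSION


def parse_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one input line into a record dict, or None if it is unusable."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if line.startswith("{"):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        return record if isinstance(record, dict) else None
    # Assumption: a non-JSON line is treated as a plain URL.
    return {"type": TYPE_SNAPSHOT, "url": line}


def write_record(record: Dict[str, Any], stream: TextIO) -> None:
    """Write one record as a single JSON line."""
    stream.write(json.dumps(record) + "\n")


def filter_by_type(records: Iterable[Dict[str, Any]], record_type: str) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose 'type' field matches record_type."""
    return (r for r in records if r.get("type") == record_type)


def crawl_record(urls: List[str]) -> Dict[str, Any]:
    """Batch all URLs into one Crawl record as a newline-separated string."""
    return {"type": TYPE_CRAWL, "schema_version": SCHEMA_VERSION, "urls": "\n".join(urls)}


def snapshots_from_crawl(crawl: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Expand one Crawl record into one Snapshot record per non-empty URL."""
    for url in crawl["urls"].splitlines():
        url = url.strip()
        if url:
            yield {"type": TYPE_SNAPSHOT, "schema_version": SCHEMA_VERSION, "url": url}


if __name__ == "__main__":
    # Stage 1: emit a single Crawl record for all URLs.
    pipe = io.StringIO()
    write_record(crawl_record(["https://example.com", "https://example.org"]), pipe)

    # Stage 2: read the Crawl JSONL back and emit one Snapshot record per URL.
    records = [r for r in map(parse_line, pipe.getvalue().splitlines()) if r]
    out = io.StringIO()
    for crawl in filter_by_type(records, TYPE_CRAWL):
        for snapshot in snapshots_from_crawl(crawl):
            write_record(snapshot, out)
    print(out.getvalue(), end="")
```

Because `json.dumps` escapes the newlines inside the `urls` string, a Crawl with many URLs still occupies exactly one line, which is what lets stages be chained line by line.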