Previously, `archivebox search --json` exported tags as a comma-separated
string (e.g. "tag1,tag2"), which required manual parsing by consumers like
LlamaIndex, LangChain, and other RAG frameworks.
Now `to_dict()` returns tags as a proper JSON array (e.g. ["tag1", "tag2"]),
making the export directly usable as structured metadata in LLM/RAG pipelines
without additional preprocessing.
`from_json()` is updated to accept both list and string formats for backward
compatibility with existing JSON imports.
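A minimal sketch of the new behavior (assuming a tags_str source field, as on Crawl; exact field names in ArchiveBox may differ):

# Sketch only: tags_str and the surrounding record shape are assumptions.
def to_dict(self) -> dict:
    return {
        # ...other snapshot fields...
        'tags': self.tags_str.split(',') if self.tags_str else [],  # JSON array, not "tag1,tag2"
    }

@classmethod
def from_json(cls, record: dict):
    tags = record.get('tags') or []
    if isinstance(tags, str):  # legacy comma-separated string
        tags = [t.strip() for t in tags.split(',') if t.strip()]
    record = {**record, 'tags': tags}  # normalized to a list either way
    # ...build the model instance from the normalized record...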
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add prominent view mode switcher with List/Grid toggle buttons
- Improve filter sidebar CSS with modern styling, rounded corners
- Add live progress bar for in-progress snapshots showing hooks status
- Show plugin icons only when output directory has content
- Display the sum of ArchiveResult output_size values (new field)
- Show hooks succeeded/total count in size column
- Add get_progress_stats() method to Snapshot model (sketched after this list)
- Add CSS for progress spinner and status badges
- Update grid view template with progress indicator for archiving cards
- Add tests for admin views, search, and progress stats
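For context, a rough sketch of what get_progress_stats() might compute (statuses and the reverse relation name are assumptions, not the exact implementation):

def get_progress_stats(self) -> dict:
    results = self.archiveresult_set.all()
    total = results.count()
    succeeded = results.filter(status='succeeded').count()
    failed = results.filter(status='failed').count()
    return {
        'total': total,
        'succeeded': succeeded,
        'failed': failed,
        'in_progress': total - succeeded - failed,
        'percent_done': int(100 * (succeeded + failed) / total) if total else 0,
    }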
DELETED:
- workers/pid_utils.py (-192 lines) - replaced by Process model methods
SIMPLIFIED:
- crawls/models.py Crawl.cleanup() (80 lines -> 10 lines)
- hooks.py: deleted process_is_alive() and kill_process() (-45 lines)
UPDATED to use Process model:
- core/models.py: Snapshot.cleanup() and has_running_background_hooks()
- machine/models.py: Binary.cleanup()
- workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start
- workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running
All subprocess management now uses the Process model APIs (usage sketched below):
- Process.current() for registering current process
- Process.get_running() / get_running_count() for querying
- Process.cleanup_stale_running() for cleanup
- safe_kill_process() for validated PID killing
Total line reduction: ~250 lines
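Illustrative call sites (signatures inferred from the names above, not verified):

proc = Process.current()             # register/fetch a row for this PID
workers = Process.get_running()      # live, identity-validated processes
count = Process.get_running_count()  # cheap count of the above
Process.cleanup_stale_running()      # prune rows whose PIDs died or were reused
safe_kill_process(pid)               # validates PID identity before killing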
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)
Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating the logic
Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to the run command (dispatch loop sketched below):
- Records WITHOUT id: Create via Model.from_json()
- Records WITH id: Lookup existing, re-queue
- Outputs JSONL of all processed records for chaining
Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions
Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command
This enables Unix-style piping:
archivebox crawl create URL | archivebox run
archivebox archiveresult list --status=failed | archivebox run
curl API | jq transform | archivebox crawl create | archivebox run
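A simplified sketch of the run command's dispatch loop (MODEL_BY_TYPE and requeue() are hypothetical names; read_stdin()/write_record() are the jsonl.py helpers):

for record in read_stdin():                       # one parsed JSONL dict per line
    Model = MODEL_BY_TYPE[record['type']]         # e.g. Snapshot, ArchiveResult
    if not record.get('id'):
        obj = Model.from_json(record)             # no id: create a new record
    else:
        obj = Model.objects.get(id=record['id'])  # id present: lookup existing
        obj.requeue()                             # hypothetical re-queue helper
    write_record(obj.to_jsonl())                  # echo JSONL for further piping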
Extended graceful_terminate_background_hooks() to:
- Reap processes with os.waitpid() to get exit codes
- Write returncode to .returncode file for update_from_output()
- Return detailed result dict with status, returncode, and pid
Updated update_from_output() to:
- Read .returncode and .stderr.log files
- Determine status from returncode if no ArchiveResult JSONL record
- Include stderr in output_str for failed hooks
- Handle signal termination (negative returncodes like -9 for SIGKILL)
- Clean up .returncode files along with other hook output files
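Sketch of the status determination (file naming and the stderr tail size are illustrative):

returncode_file = output_dir / f'{hook_name}.returncode'
if returncode_file.exists():
    returncode = int(returncode_file.read_text().strip())
    if returncode == 0:
        status = 'succeeded'
    else:
        status = 'failed'  # covers signal kills too: -9 == SIGKILL, -15 == SIGTERM
    stderr_log = output_dir / f'{hook_name}.stderr.log'
    if status == 'failed' and stderr_log.exists():
        output_str = stderr_log.read_text()[-4096:]  # stderr tail for failed hooks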
Changed Snapshot.cleanup() to gracefully terminate background hooks:
1. Send SIGTERM to all background hook processes first
2. Wait up to each hook's plugin-specific timeout
3. Send SIGKILL only to hooks still running after their timeout
Added graceful_terminate_background_hooks() function in hooks.py that:
- Collects all .pid files from output directory
- Validates process identity using mtime
- Sends SIGTERM to all valid processes in phase 1
- Polls each process for up to its plugin-specific timeout
- Sends SIGKILL as last resort if timeout expires
- Returns status for each hook (sigterm/sigkill/already_dead/invalid)
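Condensed sketch of the two-phase shutdown (pid-file parsing and mtime identity validation omitted; hook.timeout is the plugin-specific timeout):

import os, signal, time

def pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0 only probes for existence
        return True
    except ProcessLookupError:
        return False

def graceful_terminate(hooks) -> dict:
    for hook in hooks:                              # phase 1: SIGTERM everyone
        if pid_alive(hook.pid):
            os.kill(hook.pid, signal.SIGTERM)
    results = {}
    for hook in hooks:                              # phase 2: wait, then SIGKILL
        deadline = time.monotonic() + hook.timeout
        while pid_alive(hook.pid) and time.monotonic() < deadline:
            time.sleep(0.1)
        if pid_alive(hook.pid):
            os.kill(hook.pid, signal.SIGKILL)
            results[hook.name] = 'sigkill'
        else:
            results[hook.name] = 'sigterm'
    return results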
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now
each hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh
Updated:
- hooks.py run_hook() to use hook-specific names (sketched below)
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files
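Sketch of the naming change in run_hook() (the exact derivation is an assumption):

from pathlib import Path

hook_name = Path(hook_script).stem         # e.g. 'on_Snapshot__20_chrome_tab.bg'
stdout_log = output_dir / f'{hook_name}.stdout.log'
stderr_log = output_dir / f'{hook_name}.stderr.log'
pid_file = output_dir / f'{hook_name}.pid'
cmd_file = output_dir / f'{hook_name}.sh'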
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py (pattern sketched below)
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)
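The stdin fix follows the usual pattern of reading the stream once and passing the cached value around (sketch; parse_records() is a hypothetical stand-in for the actual consumers):

import sys

# Read stdin exactly once; later consumers reuse the cache instead of
# re-reading the now-exhausted stream.
stdin_text = None if sys.stdin.isatty() else sys.stdin.read()
records = parse_records(stdin_text or '')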
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Add Tag.to_jsonl() method with schema_version
- Add Crawl.to_jsonl() method with schema_version
- Fix Tag.from_jsonl() to not depend on jsonl.py helper
- Update tests to use Snapshot.from_jsonl() instead of the non-existent get_or_create_snapshot()
Remove model-specific functions from misc/jsonl.py:
- tag_to_jsonl() - use Tag.to_jsonl() instead
- crawl_to_jsonl() - use Crawl.to_jsonl() instead
- get_or_create_tag() - use Tag.from_jsonl() instead
- process_jsonl_records() - use model from_jsonl() methods directly
jsonl.py now only contains generic I/O utilities:
- Type constants (TYPE_SNAPSHOT, etc.)
- parse_line(), read_stdin(), read_file(), read_args_or_stdin()
- write_record(), write_records()
- filter_by_type(), process_records()
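The resulting split, sketched (field set and schema_version value are illustrative):

# On the model: serialization lives next to the data.
def to_jsonl(self) -> dict:
    return {
        'type': 'Tag',
        'schema_version': '0.9.0',  # illustrative value
        'name': self.name,
        'slug': self.slug,
    }

# In misc/jsonl.py: generic I/O only.
import json, sys

def write_record(record: dict, file=sys.stdout) -> None:
    file.write(json.dumps(record, default=str) + '\n')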
Move JSONL serialization from standalone functions to model methods
to mirror the from_jsonl() pattern:
- Add Binary.to_jsonl() method
- Add Process.to_jsonl() method
- Add ArchiveResult.to_jsonl() method
- Add Snapshot.to_jsonl() method
- Update write_index_jsonl() to use model methods
- Update jsonl.py functions to be thin wrappers
Switch from hierarchical index.json to flat index.jsonl format for
snapshot metadata storage. Each line is a self-contained JSON record
with a 'type' field (Snapshot, ArchiveResult, Binary, Process).
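For illustration, a snapshot's index.jsonl might look like this (field names beyond 'type' are hypothetical):

{"type": "Snapshot", "id": "abc123", "url": "https://example.com", "created_at": "2024-01-01T00:00:00Z"}
{"type": "ArchiveResult", "snapshot_id": "abc123", "extractor": "wget", "status": "succeeded"}
{"type": "Binary", "name": "wget", "version": "1.21.4"}
{"type": "Process", "pid": 12345, "cmd": "archivebox run"}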
Changes:
- Add JSONL_INDEX_FILENAME constant to constants.py
- Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants
- Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters
- Add Snapshot.write_index_jsonl() to write new format
- Add Snapshot.read_index_jsonl() to read new format
- Add Snapshot.convert_index_json_to_jsonl() for migration
- Update Snapshot.reconcile_with_index() to handle both formats
- Update fs_migrate to convert during filesystem migration
- Update load_from_directory/create_from_directory for both formats
- Update legacy.py parse_json_links_details for JSONL support
The new format is easier to parse and extend, and allows mixing record types in a single file.
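For example, a consumer can stream the file and dispatch on 'type' (sketch; HANDLERS is a hypothetical mapping of type name to handler function):

import json

with open('index.jsonl') as f:
    for line in f:
        record = json.loads(line)
        HANDLERS.get(record['type'], lambda r: None)(record)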