ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-07 19:35:49 +10:00

Author	SHA1	Message	Date
Claude	83fdc51b45	Implement serial extractor execution with .env state sharing Major architectural change: extractors now run serially in a predefined order with state sharing via a .env file in each snapshot directory. Key changes: 1. New puppeteer extractor: - Launches Chrome with user data dir - Writes CDP URL and page target ID to .env - Leaves browser running for other extractors to reuse 2. Serial execution (src/extractors.ts): - Hardcoded EXTRACTOR_ORDER array defines execution sequence - Created runExtractorsSerial() method - Each extractor reads .env before running - Environment variables from .env passed to extractors - Extractors can append to .env for later extractors 3. Updated browser-based extractors (title, headers, screenshot): - Now reuse existing Chrome tab from puppeteer extractor - Read CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from env - Connect to existing browser instead of launching new one - Leave page open after extraction for next extractor 4. .env file management: - Created in snapshot dir before first extractor runs - Loaded before each extractor execution - Simple KEY=VALUE parser with quote handling - Enables state passing between extractors 5. Updated CLI (src/cli.ts): - Use runExtractorsSerial instead of parallel execution - Better output formatting for serial execution - Show extractor execution order Benefits: - Single Chrome instance shared across all extractors (faster) - Predictable execution order (easier to debug) - State sharing enables complex workflows - Browser stays on same page (more efficient) - No need for separate Chrome remote debugging setup Breaking changes: - Extractors now run serially (not in parallel) - puppeteer extractor must run first for browser-based extractors - Added puppeteer package dependency Updated documentation: - README.md with new architecture details - Added section on .env state sharing - Updated extractor documentation - Added execution order section	2025-11-03 20:12:01 +00:00
Claude	7d92a2079a	Add comprehensive Chrome CDP setup guide	2025-11-03 19:11:57 +00:00
Claude	ee1db04b73	Update extractors to use Puppeteer with Chrome DevTools Protocol Changed screenshot, title, and headers extractors from various implementations to use Puppeteer connecting to Chrome via CDP. Key changes: - All three extractors now use puppeteer-core - Connect to Chrome via CHROME_CDP_URL environment variable - Shared browser instance across all extractors for efficiency - Added puppeteer-core as dependency (npm install) - Removed auto-install logic (cleaner, more predictable) - Better error messages when CHROME_CDP_URL not set Benefits: - Single Chrome instance for all extractors (better performance) - Consistent browser environment across extractors - Can use remote/containerized Chrome - Better for production deployments Breaking changes: - CHROME_CDP_URL environment variable now required for: - screenshot extractor - title extractor - headers extractor - Users must start Chrome with remote debugging: chrome --remote-debugging-port=9222 --headless Updated documentation: - README.md with Chrome setup instructions - Added section on Chrome DevTools Protocol setup - Added Docker setup example - Updated extractor documentation with CDP requirements	2025-11-03 19:10:55 +00:00
Claude	f4bb10bdae	Add TypeScript-based ArchiveBox implementation with simplified architecture This commit introduces archivebox-ts, a TypeScript reimplementation of ArchiveBox with a simplified, modular architecture. Key features: - Standalone executable extractors (bash, Node.js, Python with shebang) - Auto-installing dependencies per extractor - Simple interface: URL as $1 CLI arg, output to current directory - Environment variable-based configuration only - SQLite database with schema matching original ArchiveBox - Language-agnostic extractor system Core components: - src/cli.ts: Main CLI with Commander.js (init, add, list, status, extractors) - src/db.ts: SQLite operations using better-sqlite3 - src/models.ts: TypeScript interfaces matching database schema - src/extractors.ts: Extractor discovery and orchestration Sample extractors included: - favicon: Download site favicon (bash + curl) - title: Extract page title (Node.js) - headers: Extract HTTP headers (bash + curl) - wget: Full page download with WARC (bash + wget) - screenshot: Capture screenshot (Python + Playwright) Documentation: - README.md: Architecture overview and usage - QUICKSTART.md: 5-minute getting started guide - EXTRACTOR_GUIDE.md: Comprehensive extractor development guide - ARCHITECTURE.md: Design decisions and implementation details Tested and working: - Database initialization - URL archiving with multiple extractors - Parallel extractor execution - Result tracking and status reporting - All CLI commands functional	2025-11-03 18:16:57 +00:00
Nick Sweeting	c3024815f3	Add link to Proxmox installer (#1682)	2025-05-19 15:29:45 -07:00
Nelson Minar	f72f04768c	Add link to Proxmox installer	2025-05-11 11:10:20 -07:00
Nick Sweeting	d93f32ab24	fix(export_browser_history): tilde doesn't expand in quotes (#1661) # Summary Patch submitted by @pcrockett # Related issues - Fixes https://github.com/ArchiveBox/ArchiveBox/issues/1657#issue-2856003985 # Changes these areas - [x] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk	2025-03-20 16:09:40 -07:00
Nick Sweeting	8b67186c93	make sure uv is using the right python binary	2025-03-20 16:04:58 -07:00
Nick Sweeting	26eb75e4e6	archivebox swag is now available!	2025-03-20 15:52:56 -07:00
Nick Sweeting	d9d67e9864	add swag link to funding links	2025-03-20 15:51:20 -07:00
Nick Sweeting	1ab4e06a15	remove dead competitor links	2025-03-19 19:22:35 -07:00
Philip Crockett	ba6a8c2da5	support XDG standard, search for chrome and chromium DBs	2025-02-18 21:38:52 +01:00
Philip Crockett	639aa7242b	fix typo	2025-02-18 21:22:52 +01:00
Phil Crockett	9fbc2d3818	fix chrome browser history export on Linux	2025-02-18 21:08:56 +01:00
Phil Crockett	58bf8d07e1	feat(export_browser_history): add linux support for firefox	2025-02-16 10:24:37 +01:00
Phil Crockett	feded9e3d4	fix(export_browser_history): fix sqlite quote syntax error	2025-02-16 10:24:13 +01:00
Phil Crockett	2e1ac0409d	feat(export_browser_history): fail script when errors occur	2025-02-16 08:34:41 +01:00
Phil Crockett	2ff3fc434e	feat(export_browser_history): basic arg parsing error message	2025-02-16 08:31:21 +01:00
Phil Crockett	0043b59bc8	fix(export_browser_history): tilde doesn't expand in quotes	2025-02-16 08:22:17 +01:00
Nick Sweeting	a27a91bbaa	Update README.md	2025-02-13 02:45:52 -05:00
Nick Sweeting	3ae30c43a9	Update README.md	2025-02-13 02:37:41 -05:00
Nick Sweeting	37c0ea7eba	Kill the timer process if it doesn't properly terminate. (#1649)	2025-02-05 19:06:19 -05:00
Ben Muthalaly	71c02ca4eb	Update archivebox/misc/logging_util.py Co-authored-by: Nick Sweeting <git@sweeting.me>	2025-02-05 17:55:45 -06:00
Ben Muthalaly	9f4cf0a8e1	Kill the timer process if it doesn't properly terminate.	2025-02-03 02:47:33 -06:00
Nick Sweeting	12f109b1be	Update docker-compose.yml minor tweaks	2025-01-18 04:20:21 -05:00
Nick Sweeting	6edcac6a40	Fix two small errors in abx-{readwise,spec-config} (#1635)	2025-01-17 17:17:36 -05:00
ckie	952bde6cfa	spec-config: fix CONSTANTS import I was getting: ImportError: cannot import name 'CONSTANTS' from partially initialized module 'archivebox' (most likely due to a circular import) (/nix/store/6fy0wgy7r3ld3k590kxgxrc0r1cca347-archivebox-0.8.6rc3/lib/python3.12/site-packages/archivebox/__init__.py)	2025-01-17 21:02:53 +02:00
ckie	58fc6d9cf8	readwise: fix SOURCES_DIR syntax Fixes: attributeerror: 'list' object has no attribute 'SOURCES_DIR'	2025-01-17 21:02:27 +02:00
Nick Sweeting	aa55e0d02e	Update 2-feature_request.yml	2025-01-08 19:20:50 -05:00
Nick Sweeting	e1c443aac4	Update 2-feature_request.yml	2025-01-08 19:19:04 -05:00
Nick Sweeting	d1c8acd3ff	Update 1-bug_report.yml	2025-01-08 19:15:21 -05:00
Nick Sweeting	fd21728732	Update 1-bug_report.yml	2025-01-08 19:12:46 -05:00
Nick Sweeting	b93918f926	Update 1-bug_report.yml	2025-01-08 19:12:18 -05:00
Nick Sweeting	ba5380f60b	Update 1-bug_report.yml	2025-01-08 19:11:23 -05:00
Nick Sweeting	7ba7ad6b3e	Update 1-bug_report.yml	2025-01-08 19:10:47 -05:00
Nick Sweeting	91eb3472e3	Update 1-bug_report.yml	2025-01-08 19:09:12 -05:00
Nick Sweeting	b28f2e704c	Update 1-bug_report.yml fix markdown formatting	2025-01-08 19:07:38 -05:00
Nick Sweeting	62a99c88d2	clarify filesystems selections in bug report github template	2025-01-08 19:05:41 -05:00
Nick Sweeting	765abc9d5a	Update pip.yml	2025-01-08 18:53:13 -05:00
Nick Sweeting	83bb8a211a	Remove outdated architecture diagram	2025-01-08 18:50:46 -05:00
Nick Sweeting	55a347c32e	Update file_migrations.py	2025-01-02 23:58:59 -08:00
Nick Sweeting	a851ad4c87	Update models.py	2025-01-02 23:58:45 -08:00
Nick Sweeting	96c5d2f7de	Update statemachines.py	2025-01-02 23:58:32 -08:00
Nick Sweeting	b74b0d23b4	Fix typo in timestamp scale factor (#1627)	2024-12-26 01:57:08 -05:00
1over137	3312a34b39	Fix typo in timestamp scale factor	2024-12-25 11:50:40 +00:00
Nick Sweeting	1fb5ecf13d	change pip flow to use PAT	2024-12-18 18:55:29 -08:00
Nick Sweeting	46f4a90a2a	install needed packages to run archivebox during pip build	2024-12-18 18:39:58 -08:00
Nick Sweeting	e862031981	use uv to build pip package in github actions instead of pdm	2024-12-18 18:38:25 -08:00
Nick Sweeting	b78e892bf8	update github actions to build docker image	2024-12-18 18:19:35 -08:00
Nick Sweeting	baa3be7525	ignore requirements.txt now that its not needed	2024-12-18 18:09:56 -08:00


4649 Commits
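
Commit 83fdc51b45 above describes extractors running serially, with each one reading a `.env` file (a "simple KEY=VALUE parser with quote handling") before it runs and optionally appending to it for later extractors. The following is only a rough sketch of that idea, not the code from `src/extractors.ts`: `parseEnv`, the `Extractor` type, and this in-memory `runExtractorsSerial` are hypothetical stand-ins, and the `.env` content is kept as a string rather than read from the snapshot directory.

```typescript
// Hypothetical sketch: names and shapes are illustrative assumptions,
// not taken from the ArchiveBox repository.

type Env = Record<string, string>;

/** Parse KEY=VALUE lines; skip blanks and # comments; strip one pair of matching quotes. */
function parseEnv(text: string): Env {
  const env: Env = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no '=' or empty key: ignore the line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const first = value.charAt(0);
    const last = value.charAt(value.length - 1);
    if (value.length >= 2 && (first === '"' || first === "'") && last === first) {
      value = value.slice(1, -1);
    }
    env[key] = value;
  }
  return env;
}

/** Assumed model: an extractor receives the current env and returns lines to append to .env. */
type Extractor = { name: string; run: (env: Env) => string[] };

/** Run extractors one at a time, re-parsing the .env text before each run. */
function runExtractorsSerial(
  order: Extractor[],
  initialEnvText: string = ""
): { envText: string; ran: string[] } {
  let envText = initialEnvText;
  const ran: string[] = [];
  for (const extractor of order) {
    const env = parseEnv(envText); // state written by earlier extractors
    const appended = extractor.run(env);
    if (appended.length > 0) {
      envText += appended.join("\n") + "\n";
    }
    ran.push(extractor.name);
  }
  return { envText, ran };
}
```

In this model, a first extractor that returns a `CHROME_CDP_URL=...` line makes that value visible to every extractor that runs after it, which mirrors the ordering constraint the commit lists under breaking changes (the puppeteer extractor must run first).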