Implements three new extractors that run early in the extraction process to capture dynamic content:
New extractors:
- downloads: Catches file downloads triggered by the page using CDP download handlers
- images: Catches all image HTTP responses (matched by MIME type) and saves them to the images/ directory
- infiniscroll: Scrolls the page up to 10 times to load lazy content, then scrolls back to the top (see the sketch below)
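A minimal sketch of the infiniscroll loop, assuming a Puppeteer `page` handle already connected to the shared Chrome tab; the pass count and delay are illustrative:

```ts
import type { Page } from "puppeteer-core";

// Scroll until the page stops growing, capped at 10 passes (values are illustrative).
async function infiniscroll(page: Page): Promise<void> {
  for (let pass = 0; pass < 10; pass++) {
    const before = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((resolve) => setTimeout(resolve, 1000)); // give lazy loaders time to fire
    const after = await page.evaluate(() => document.body.scrollHeight);
    if (after === before) break; // nothing new loaded, stop early
  }
  await page.evaluate(() => window.scrollTo(0, 0)); // return to the top for later extractors
}
```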
These extractors run after puppeteer (which launches Chrome) but before other extractors that capture static content. They ensure dynamic and lazy-loaded content is captured.
Extractor order:
1. puppeteer - launches Chrome
2. downloads - catches downloads (reloads page with listeners)
3. images - catches images (reloads page with listeners)
4. infiniscroll - scrolls to load lazy content
5. favicon, title, headers, screenshot, etc. - capture final state
All three extractors:
- Use Puppeteer to connect to existing Chrome tab via CDP
- Reuse CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from .env
- Are fully configurable via environment variables
- Follow the extractor contract (executable, URL as $1, output to CWD)
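As a concrete illustration of that contract, a hypothetical do-nothing extractor might look like this (the file name and output are made up for the example):

```ts
#!/usr/bin/env node
// Hypothetical skeleton showing the contract; "example.txt" is a made-up output name.
import { writeFileSync } from "node:fs";

const url = process.argv[2]; // the URL arrives as $1
if (!url) {
  console.error("usage: example-extractor <url>");
  process.exit(1); // non-zero exit tells the runner we failed
}

// Shared state (e.g. CHROME_CDP_URL) arrives via environment variables.
// ... real extraction work would happen here ...
writeFileSync("example.txt", `placeholder output for ${url}\n`); // outputs go to CWD
process.exit(0); // 0 = success
```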
Updates:
- Added ExtractorName types to models.ts
- Updated EXTRACTOR_ORDER in extractors.ts
- Made all new extractors executable
- Updated README with complete documentation
- Verified all 17 extractors are discovered correctly
Implements all remaining extractors from the original ArchiveBox, following the serial execution pattern:
Browser-based extractors (using Puppeteer + CDP):
- dom: Extract full DOM HTML
- pdf: Generate PDF of page
- htmltotext: Extract plain text content
- readability: Extract article content using Mozilla's Readability algorithm (sketched below)
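A rough sketch of the readability extractor, using the jsdom and @mozilla/readability dependencies noted below; fetching the HTML directly here is a simplification, since the real extractor reads it from the shared Chrome tab:

```ts
import { writeFileSync } from "node:fs";
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

const url = process.argv[2];
// Simplification: fetch the HTML directly; the real extractor reads it from the shared tab.
const html = await fetch(url).then((res) => res.text());

const dom = new JSDOM(html, { url }); // passing the URL lets relative links resolve
const article = new Readability(dom.window.document).parse();
if (article === null) {
  console.error("Readability could not extract an article");
  process.exit(1);
}
writeFileSync("readability.html", article.content ?? "");
writeFileSync("readability.txt", article.textContent ?? "");
```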
Binary-based extractors (using native tools):
- singlefile: Create single-file archive using single-file-cli
- git: Clone git repositories
- media: Download media using yt-dlp (see the sketch after this list)
- archive_org: Submit to Internet Archive Wayback Machine
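The binary-based extractors are thin wrappers around their tools. A rough sketch of the media extractor; the yt-dlp flags shown are illustrative, not necessarily the exact invocation used:

```ts
import { spawnSync } from "node:child_process";

const url = process.argv[2];
// Flags are illustrative; the real extractor's yt-dlp invocation may differ.
const result = spawnSync(
  "yt-dlp",
  ["--no-playlist", "--output", "media/%(title)s.%(ext)s", url],
  { stdio: "inherit" }, // stream yt-dlp progress straight to the console
);
process.exit(result.status ?? 1); // propagate the tool's exit code
```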
All extractors:
- Auto-install dependencies if needed (see the sketch after this list)
- Accept the URL as the $1 argument
- Output to the current working directory
- Are configured via environment variables only
- Read from the .env file for shared state
- Follow the exit code contract (0 = success)
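The auto-install step could look roughly like this; the installer command is an assumption and differs per tool:

```ts
import { spawnSync } from "node:child_process";

// True if the binary resolves on PATH ("which" assumes a POSIX environment).
function hasBinary(name: string): boolean {
  return spawnSync("which", [name]).status === 0;
}

if (!hasBinary("yt-dlp")) {
  // Assumption: pip is the installer here; each extractor knows its own tool's install command.
  const install = spawnSync("pip", ["install", "--user", "yt-dlp"], { stdio: "inherit" });
  if (install.status !== 0) process.exit(1); // can't proceed without the dependency
}
```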
Updates:
- Added all extractor types to ExtractorName union in models.ts
- Updated EXTRACTOR_ORDER with complete 14-extractor sequence
- Installed jsdom and @mozilla/readability dependencies
- Made all extractors executable
- Updated README with complete documentation for all extractors
Major architectural change: extractors now run serially in a predefined
order with state sharing via a .env file in each snapshot directory.
Key changes:
1. New puppeteer extractor:
- Launches Chrome with user data dir
- Writes CDP URL and page target ID to .env
- Leaves browser running for other extractors to reuse
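A rough sketch of this extractor; the profile directory and goto options are illustrative assumptions:

```ts
import { appendFileSync } from "node:fs";
import puppeteer from "puppeteer";

const url = process.argv[2];
const browser = await puppeteer.launch({
  headless: true,
  userDataDir: "./chrome_profile", // assumption: a snapshot-local profile directory
});
const page = await browser.newPage();
await page.goto(url, { waitUntil: "networkidle2" });

// Ask CDP for this tab's target id so later extractors can find the same tab.
const session = await page.createCDPSession();
const { targetInfo } = await session.send("Target.getTargetInfo");
await session.detach();

appendFileSync(
  ".env",
  `CHROME_CDP_URL=${browser.wsEndpoint()}\n` +
    `CHROME_PAGE_TARGET_ID=${targetInfo.targetId}\n`,
);

browser.disconnect(); // detach without killing Chrome so the next extractor can reuse it
```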
2. Serial execution (src/extractors.ts):
- Hardcoded EXTRACTOR_ORDER array defines execution sequence
- New runExtractorsSerial() method runs the extractors one at a time
- .env is re-read before each extractor runs, and its variables are passed into the extractor's environment
- Extractors can append to .env for later extractors
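A condensed sketch of the runner; EXTRACTOR_ORDER is truncated here, and parseDotEnv is the .env parser sketched under point 4 below:

```ts
import { spawnSync } from "node:child_process";
import { existsSync, readFileSync } from "node:fs";
import * as path from "node:path";

// Truncated for illustration; the real array lists all extractors.
const EXTRACTOR_ORDER = ["puppeteer", "title", "headers", "screenshot"];

export function runExtractorsSerial(url: string, snapshotDir: string, extractorsDir: string): void {
  for (const name of EXTRACTOR_ORDER) {
    // Re-read .env before every extractor so state written by earlier ones is visible.
    const envFile = path.join(snapshotDir, ".env");
    const shared = existsSync(envFile) ? parseDotEnv(readFileSync(envFile, "utf8")) : {};
    const result = spawnSync(path.join(extractorsDir, name), [url], {
      cwd: snapshotDir, // extractors write their output to CWD
      env: { ...process.env, ...shared },
      stdio: "inherit",
    });
    if (result.status !== 0) console.error(`${name} failed (exit ${result.status})`);
  }
}
```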
3. Updated browser-based extractors (title, headers, screenshot):
- Now reuse existing Chrome tab from puppeteer extractor
- Read CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from env
- Connect to existing browser instead of launching new one
- Leave page open after extraction for next extractor
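A sketch of how a browser-based extractor could locate the shared tab; matching by CDP target id via Target.getTargetInfo is one way to do it, not necessarily the exact code used here:

```ts
import puppeteer from "puppeteer-core";
import type { Page } from "puppeteer-core";

const cdpUrl = process.env.CHROME_CDP_URL;
const targetId = process.env.CHROME_PAGE_TARGET_ID;
if (!cdpUrl || !targetId) {
  console.error("CHROME_CDP_URL / CHROME_PAGE_TARGET_ID not set; did the puppeteer extractor run?");
  process.exit(1);
}

const browser = await puppeteer.connect({ browserWSEndpoint: cdpUrl });

// Find the tab the puppeteer extractor opened by matching its CDP target id.
let page: Page | undefined;
for (const candidate of await browser.pages()) {
  const session = await candidate.createCDPSession();
  const { targetInfo } = await session.send("Target.getTargetInfo");
  await session.detach();
  if (targetInfo.targetId === targetId) {
    page = candidate;
    break;
  }
}
if (!page) process.exit(1);

// ... extract from `page` here, then disconnect without closing anything ...
browser.disconnect();
```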
4. .env file management:
- Created in snapshot dir before first extractor runs
- Loaded before each extractor execution
- Simple KEY=VALUE parser with quote handling
- Enables state passing between extractors
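A minimal sketch of such a parser; the real one may handle more edge cases:

```ts
// Minimal KEY=VALUE parser; the real implementation may handle more edge cases.
function parseDotEnv(contents: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const raw of contents.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq < 1) continue; // no key before "="
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one layer of matching single or double quotes.
    if (
      value.length >= 2 &&
      ((value.startsWith('"') && value.endsWith('"')) ||
        (value.startsWith("'") && value.endsWith("'")))
    ) {
      value = value.slice(1, -1);
    }
    env[key] = value;
  }
  return env;
}
```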
5. Updated CLI (src/cli.ts):
- Use runExtractorsSerial instead of parallel execution
- Better output formatting for serial execution
- Show extractor execution order
Benefits:
- Single Chrome instance shared across all extractors (faster)
- Predictable execution order (easier to debug)
- State sharing enables complex workflows
- Browser stays on same page (more efficient)
- No need for separate Chrome remote debugging setup
Breaking changes:
- Extractors now run serially (not in parallel)
- puppeteer extractor must run first for browser-based extractors
- Added puppeteer package dependency
Updated documentation:
- README.md with new architecture details
- Added section on .env state sharing
- Updated extractor documentation
- Added execution order section
Changed the screenshot, title, and headers extractors from assorted one-off implementations to a single approach: Puppeteer connecting to Chrome via CDP.
Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via the CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency
- Added puppeteer-core as a dependency (npm install)
- Removed auto-install logic (cleaner, more predictable)
- Better error messages when CHROME_CDP_URL is not set (see the sketch after this list)
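A sketch of the guard and connection, assuming CHROME_CDP_URL holds the http endpoint exposed by --remote-debugging-port; the error text is illustrative:

```ts
import puppeteer from "puppeteer-core";

const cdpUrl = process.env.CHROME_CDP_URL;
if (!cdpUrl) {
  // Error text is illustrative, not the exact message used by the extractors.
  console.error(
    "CHROME_CDP_URL is not set. Start Chrome with:\n" +
      "  chrome --remote-debugging-port=9222 --headless\n" +
      "then export CHROME_CDP_URL=http://localhost:9222",
  );
  process.exit(1);
}

// browserURL takes the http endpoint from --remote-debugging-port;
// a ws:// endpoint would go in browserWSEndpoint instead.
const browser = await puppeteer.connect({ browserURL: cdpUrl });
```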
Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Can use remote/containerized Chrome
- Better for production deployments
Breaking changes:
- CHROME_CDP_URL environment variable now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
  chrome --remote-debugging-port=9222 --headless
Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
# Summary
Patch submitted by @pcrockett to fix a circular import of CONSTANTS in archivebox/__init__.py (see the traceback below).
# Related issues
- Fixes https://github.com/ArchiveBox/ArchiveBox/issues/1657#issue-2856003985
# Changes these areas
- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
I was getting:

    ImportError: cannot import name 'CONSTANTS' from partially initialized module 'archivebox' (most likely due to a circular import)
    (/nix/store/6fy0wgy7r3ld3k590kxgxrc0r1cca347-archivebox-0.8.6rc3/lib/python3.12/site-packages/archivebox/__init__.py)