Commit Graph

4653 Commits

Author SHA1 Message Date
Claude
8a0dfa9b5f Add comprehensive testing documentation for Chrome extension support
Documents:
- Implementation status (all complete)
- Code validation results (syntax, TypeScript, dependencies)
- Test attempt results and blockers
- Test plan for network-enabled environment
- Workarounds for offline testing
2025-11-04 01:53:13 +00:00
Claude
891409a1cc Add Chrome extension support with 2captcha extractor and update singlefile
Implements a Chrome extension management system that allows extractors to use browser extensions:

New 2captcha extractor (runs BEFORE puppeteer):
- Downloads Chrome extensions from Web Store (.crx files)
- Unpacks extensions to ./extensions/ directory
- Writes CHROME_EXTENSIONS_PATHS and CHROME_EXTENSIONS_IDS to .env
- Supports 2captcha (CAPTCHA solving), singlefile, uBlock, cookie consent blocker
- Configurable via API_KEY_2CAPTCHA and EXTENSIONS_ENABLED env vars

Updated puppeteer extractor:
- Reads CHROME_EXTENSIONS_PATHS from .env
- Loads extensions when launching Chrome
- Runs in headed mode when extensions are present (extensions require visible browser)
- Passes extension paths to Chrome via --load-extension and extension IDs via --allowlisted-extension-id

Updated singlefile extractor (now uses extension instead of CLI):
- Connects to existing Chrome browser via CDP
- Triggers SingleFile extension via Ctrl+Shift+Y keyboard shortcut
- Waits for downloaded file to appear in Chrome downloads directory
- More reliable than single-file-cli, with better-quality output
- Fully integrates with Chrome's extension ecosystem

Benefits:
- Automatic CAPTCHA solving via 2captcha extension
- Better ad/cookie blocking via uBlock and cookie consent extensions
- Higher quality single-file archives using official SingleFile extension
- Extensions share browser state (cookies, local storage, etc.)
- Foundation for adding more browser extensions in the future

Dependencies:
- Added unzip-crx-3 for unpacking .crx extension files
- Updated extractors to use puppeteer-core for CDP connections

Execution order:
1. 2captcha downloads/configures extensions
2. puppeteer launches Chrome with extensions loaded
3. All other extractors reuse the same Chrome instance with extensions active
2025-11-03 21:03:18 +00:00
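The way the puppeteer extractor might turn the extension settings into Chrome launch options can be sketched as follows. This is a minimal sketch, not the code from the commit: the helper name `chromeLaunchOptions`, the comma-separated format of `CHROME_EXTENSIONS_PATHS`, and the returned shape are all assumptions. Only `--load-extension` is built here; `--allowlisted-extension-id` is left out because the commit message does not say how its value is formed.

```typescript
// Hypothetical helper: the names and the comma-separated env format are assumptions.
interface ChromeLaunchOptions {
  headless: boolean;
  args: string[];
}

function chromeLaunchOptions(env: Record<string, string | undefined>): ChromeLaunchOptions {
  // Assumed format: CHROME_EXTENSIONS_PATHS="./extensions/a,./extensions/b"
  const extensionPaths = (env["CHROME_EXTENSIONS_PATHS"] ?? "")
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  if (extensionPaths.length === 0) {
    // No extensions configured: stay headless and add no extra flags.
    return { headless: true, args: [] };
  }

  return {
    // Headed mode when extensions are present, as the commit message describes.
    headless: false,
    // --load-extension accepts a comma-separated list of unpacked extension directories.
    args: [`--load-extension=${extensionPaths.join(",")}`],
  };
}
```

The returned object could be spread into a browser launch call. Keeping the decision in a pure function makes the headed/headless switch easy to check without starting Chrome.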
Claude
b71660e4b7 Add downloads, images, and infiniscroll extractors
Implements three new extractors that run at the beginning of the extraction process to capture dynamic content:

New extractors:
- downloads: Catches file downloads triggered by the page using CDP download handlers
- images: Catches all image HTTP responses based on MIME type, saves to images/ directory
- infiniscroll: Scrolls page up to 10 times to load lazy content, then scrolls back to top

These extractors run after puppeteer (which launches Chrome) but before other extractors that capture static content. They ensure dynamic and lazy-loaded content is captured.

Extractor order:
1. puppeteer - launches Chrome
2. downloads - catches downloads (reloads page with listeners)
3. images - catches images (reloads page with listeners)
4. infiniscroll - scrolls to load lazy content
5. favicon, title, headers, screenshot, etc. - capture final state

All three extractors:
- Use Puppeteer to connect to existing Chrome tab via CDP
- Reuse CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from .env
- Are fully configurable via environment variables
- Follow the extractor contract (executable, URL as $1, output to CWD)

Updates:
- Added ExtractorName types to models.ts
- Updated EXTRACTOR_ORDER in extractors.ts
- Made all new extractors executable
- Updated README with complete documentation
- Verified all 17 extractors are discovered correctly
2025-11-03 20:47:12 +00:00
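The MIME-type check described for the images extractor can be illustrated with a small pure function. This is a sketch under stated assumptions, not the extractor's actual code: the name `imageExtensionFor`, the extension table, and the `bin` fallback are hypothetical.

```typescript
// Hypothetical mapping: this extension table is an assumption for illustration.
const IMAGE_EXTENSIONS: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/gif": "gif",
  "image/webp": "webp",
  "image/svg+xml": "svg",
  "image/x-icon": "ico",
};

// Returns a file extension for image responses, or null for anything else.
function imageExtensionFor(contentType: string | undefined): string | null {
  if (!contentType) return null;
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  if (!mime.startsWith("image/")) return null;
  // Unknown image subtypes fall back to a generic extension (an assumption).
  return IMAGE_EXTENSIONS[mime] ?? "bin";
}
```

A response listener would call this with the `content-type` header of each HTTP response and write the body to images/ only when the result is not null.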
Claude
3e6e8cb111 Add remaining extractors (dom, pdf, htmltotext, readability, singlefile, git, media, archive_org)
Implements all remaining extractors from original ArchiveBox following the serial execution pattern:

Browser-based extractors (using Puppeteer + CDP):
- dom: Extract full DOM HTML
- pdf: Generate PDF of page
- htmltotext: Extract plain text content
- readability: Extract article content using Mozilla Readability algorithm

Binary-based extractors (using native tools):
- singlefile: Create single-file archive using single-file-cli
- git: Clone git repositories
- media: Download media using yt-dlp
- archive_org: Submit to Internet Archive Wayback Machine

All extractors:
- Auto-install dependencies if needed
- Accept URL as $1 argument
- Output to current working directory
- Configure via environment variables only
- Read from .env file for shared state
- Follow exit code contract (0 = success)

Updates:
- Added all extractor types to ExtractorName union in models.ts
- Updated EXTRACTOR_ORDER with complete 14-extractor sequence
- Installed jsdom and @mozilla/readability dependencies
- Made all extractors executable
- Updated README with complete documentation for all extractors
2025-11-03 20:31:51 +00:00
Claude
83fdc51b45 Implement serial extractor execution with .env state sharing
Major architectural change: extractors now run serially in a predefined
order with state sharing via a .env file in each snapshot directory.

Key changes:

1. New puppeteer extractor:
   - Launches Chrome with user data dir
   - Writes CDP URL and page target ID to .env
   - Leaves browser running for other extractors to reuse

2. Serial execution (src/extractors.ts):
   - Hardcoded EXTRACTOR_ORDER array defines execution sequence
   - Created runExtractorsSerial() method
   - Each extractor reads .env before running
   - Environment variables from .env passed to extractors
   - Extractors can append to .env for later extractors

3. Updated browser-based extractors (title, headers, screenshot):
   - Now reuse existing Chrome tab from puppeteer extractor
   - Read CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from env
   - Connect to existing browser instead of launching new one
   - Leave page open after extraction for next extractor

4. .env file management:
   - Created in snapshot dir before first extractor runs
   - Loaded before each extractor execution
   - Simple KEY=VALUE parser with quote handling
   - Enables state passing between extractors

5. Updated CLI (src/cli.ts):
   - Use runExtractorsSerial instead of parallel execution
   - Better output formatting for serial execution
   - Show extractor execution order

Benefits:
- Single Chrome instance shared across all extractors (faster)
- Predictable execution order (easier to debug)
- State sharing enables complex workflows
- Browser stays on same page (more efficient)
- No need for separate Chrome remote debugging setup

Breaking changes:
- Extractors now run serially (not in parallel)
- puppeteer extractor must run first for browser-based extractors
- Added puppeteer package dependency

Updated documentation:
- README.md with new architecture details
- Added section on .env state sharing
- Updated extractor documentation
- Added execution order section
2025-11-03 20:12:01 +00:00
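The serial execution and .env state sharing described in this commit can be sketched roughly as below. This is an illustrative sketch only: an in-memory string stands in for the .env file in the snapshot directory, extractors are modelled as plain functions instead of executables, and every name (`parseEnvText`, `runSerially`, `SnapshotState`, `ExtractorStep`) is hypothetical and not taken from src/extractors.ts.

```typescript
// Sketch only: an in-memory string stands in for the snapshot's .env file.
type EnvVars = Record<string, string>;

// Simple KEY=VALUE parser with quote handling.
function parseEnvText(text: string): EnvVars {
  const vars: EnvVars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=": ignore the line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quote = value.charAt(0);
    // Strip one pair of matching single or double quotes.
    if (value.length >= 2 && (quote === '"' || quote === "'") && value.endsWith(quote)) {
      value = value.slice(1, -1);
    }
    vars[key] = value;
  }
  return vars;
}

interface SnapshotState {
  envText: string; // contents of the shared .env file
}

// An extractor receives the current variables, may append to the .env,
// and returns an exit code (0 = success).
type ExtractorStep = (vars: EnvVars, snapshot: SnapshotState) => number;

function runSerially(
  order: string[],
  steps: Record<string, ExtractorStep>,
  snapshot: SnapshotState,
): Record<string, number> {
  const exitCodes: Record<string, number> = {};
  for (const extractorName of order) {
    const step = steps[extractorName];
    if (!step) continue; // listed in the order but not available
    // Re-read the .env before every extractor so earlier writes are visible.
    exitCodes[extractorName] = step(parseEnvText(snapshot.envText), snapshot);
  }
  return exitCodes;
}
```

In this model a first step that appends `CHROME_CDP_URL="..."` to `snapshot.envText` makes that value visible to every later step, which is the property the serial order relies on.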
Claude
7d92a2079a Add comprehensive Chrome CDP setup guide 2025-11-03 19:11:57 +00:00
Claude
ee1db04b73 Update extractors to use Puppeteer with Chrome DevTools Protocol
Changed screenshot, title, and headers extractors from various
implementations to use Puppeteer connecting to Chrome via CDP.

Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency
- Added puppeteer-core as dependency (npm install)
- Removed auto-install logic (cleaner, more predictable)
- Better error messages when CHROME_CDP_URL not set

Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Can use remote/containerized Chrome
- Better for production deployments

Breaking changes:
- CHROME_CDP_URL environment variable now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
  chrome --remote-debugging-port=9222 --headless

Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
2025-11-03 19:10:55 +00:00
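Reading `CHROME_CDP_URL` and failing with a clear message when it is missing could look roughly like this. It is a hedged sketch, not the extractors' actual code: the function name `cdpConnectOptions` and the error wording are assumptions. The two option names, `browserWSEndpoint` and `browserURL`, are the ones accepted by Puppeteer's `connect()`.

```typescript
// Hypothetical helper: the name and error wording are assumptions.
type CdpConnectOptions = { browserWSEndpoint: string } | { browserURL: string };

function cdpConnectOptions(env: Record<string, string | undefined>): CdpConnectOptions {
  const raw = (env["CHROME_CDP_URL"] ?? "").trim();
  if (raw === "") {
    throw new Error(
      "CHROME_CDP_URL is not set. Start Chrome with --remote-debugging-port=9222 " +
        "and set CHROME_CDP_URL, e.g. http://localhost:9222",
    );
  }
  // A WebSocket URL points directly at the browser's debugger endpoint.
  if (raw.startsWith("ws://") || raw.startsWith("wss://")) {
    return { browserWSEndpoint: raw };
  }
  // An HTTP URL lets Puppeteer discover the WebSocket endpoint itself.
  if (raw.startsWith("http://") || raw.startsWith("https://")) {
    return { browserURL: raw };
  }
  throw new Error(`CHROME_CDP_URL must start with ws(s):// or http(s)://, got: ${raw}`);
}
```

With puppeteer-core installed, the result could be passed as `puppeteer.connect(cdpConnectOptions(process.env))`. Validating the variable up front keeps the failure message readable instead of surfacing a low-level connection error.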
Claude
f4bb10bdae Add TypeScript-based ArchiveBox implementation with simplified architecture
This commit introduces archivebox-ts, a TypeScript reimplementation of
ArchiveBox with a simplified, modular architecture.

Key features:
- Standalone executable extractors (bash, Node.js, Python with shebang)
- Auto-installing dependencies per extractor
- Simple interface: URL as $1 CLI arg, output to current directory
- Environment variable-based configuration only
- SQLite database with schema matching original ArchiveBox
- Language-agnostic extractor system

Core components:
- src/cli.ts: Main CLI with Commander.js (init, add, list, status, extractors)
- src/db.ts: SQLite operations using better-sqlite3
- src/models.ts: TypeScript interfaces matching database schema
- src/extractors.ts: Extractor discovery and orchestration

Sample extractors included:
- favicon: Download site favicon (bash + curl)
- title: Extract page title (Node.js)
- headers: Extract HTTP headers (bash + curl)
- wget: Full page download with WARC (bash + wget)
- screenshot: Capture screenshot (Python + Playwright)

Documentation:
- README.md: Architecture overview and usage
- QUICKSTART.md: 5-minute getting started guide
- EXTRACTOR_GUIDE.md: Comprehensive extractor development guide
- ARCHITECTURE.md: Design decisions and implementation details

Tested and working:
- Database initialization
- URL archiving with multiple extractors
- Parallel extractor execution
- Result tracking and status reporting
- All CLI commands functional
2025-11-03 18:16:57 +00:00
Nick Sweeting
c3024815f3 Add link to Proxmox installer (#1682) 2025-05-19 15:29:45 -07:00
Nelson Minar
f72f04768c Add link to Proxmox installer 2025-05-11 11:10:20 -07:00
Nick Sweeting
d93f32ab24 fix(export_browser_history): tilde doesn't expand in quotes (#1661)

# Summary

Patch submitted by @pcrockett

# Related issues

- Fixes
https://github.com/ArchiveBox/ArchiveBox/issues/1657#issue-2856003985

# Changes these areas

- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-03-20 16:09:40 -07:00
Nick Sweeting
8b67186c93 make sure uv is using the right python binary 2025-03-20 16:04:58 -07:00
Nick Sweeting
26eb75e4e6 archivebox swag is now available! 2025-03-20 15:52:56 -07:00
Nick Sweeting
d9d67e9864 add swag link to funding links 2025-03-20 15:51:20 -07:00
Nick Sweeting
1ab4e06a15 remove dead competitor links 2025-03-19 19:22:35 -07:00
Philip Crockett
ba6a8c2da5 support XDG standard, search for chrome and chromium DBs 2025-02-18 21:38:52 +01:00
Philip Crockett
639aa7242b fix typo 2025-02-18 21:22:52 +01:00
Phil Crockett
9fbc2d3818 fix chrome browser history export on Linux 2025-02-18 21:08:56 +01:00
Phil Crockett
58bf8d07e1 feat(export_browser_history): add linux support for firefox 2025-02-16 10:24:37 +01:00
Phil Crockett
feded9e3d4 fix(export_browser_history): fix sqlite quote syntax error 2025-02-16 10:24:13 +01:00
Phil Crockett
2e1ac0409d feat(export_browser_history): fail script when errors occur 2025-02-16 08:34:41 +01:00
Phil Crockett
2ff3fc434e feat(export_browser_history): basic arg parsing error message 2025-02-16 08:31:21 +01:00
Phil Crockett
0043b59bc8 fix(export_browser_history): tilde doesn't expand in quotes 2025-02-16 08:22:17 +01:00
Nick Sweeting
a27a91bbaa Update README.md 2025-02-13 02:45:52 -05:00
Nick Sweeting
3ae30c43a9 Update README.md 2025-02-13 02:37:41 -05:00
Nick Sweeting
37c0ea7eba Kill the timer process if it doesn't properly terminate. (#1649) 2025-02-05 19:06:19 -05:00
Ben Muthalaly
71c02ca4eb Update archivebox/misc/logging_util.py
Co-authored-by: Nick Sweeting <git@sweeting.me>
2025-02-05 17:55:45 -06:00
Ben Muthalaly
9f4cf0a8e1 Kill the timer process if it doesn't properly terminate. 2025-02-03 02:47:33 -06:00
Nick Sweeting
12f109b1be Update docker-compose.yml minor tweaks 2025-01-18 04:20:21 -05:00
Nick Sweeting
6edcac6a40 Fix two small errors in abx-{readwise,spec-config} (#1635) 2025-01-17 17:17:36 -05:00
ckie
952bde6cfa spec-config: fix CONSTANTS import
I was getting:
ImportError: cannot import name 'CONSTANTS' from partially initialized module 'archivebox' (most likely due to a circular import)
(/nix/store/6fy0wgy7r3ld3k590kxgxrc0r1cca347-archivebox-0.8.6rc3/lib/python3.12/site-packages/archivebox/__init__.py)
2025-01-17 21:02:53 +02:00
ckie
58fc6d9cf8 readwise: fix SOURCES_DIR syntax
Fixes: AttributeError: 'list' object has no attribute 'SOURCES_DIR'
2025-01-17 21:02:27 +02:00
Nick Sweeting
aa55e0d02e Update 2-feature_request.yml 2025-01-08 19:20:50 -05:00
Nick Sweeting
e1c443aac4 Update 2-feature_request.yml 2025-01-08 19:19:04 -05:00
Nick Sweeting
d1c8acd3ff Update 1-bug_report.yml 2025-01-08 19:15:21 -05:00
Nick Sweeting
fd21728732 Update 1-bug_report.yml 2025-01-08 19:12:46 -05:00
Nick Sweeting
b93918f926 Update 1-bug_report.yml 2025-01-08 19:12:18 -05:00
Nick Sweeting
ba5380f60b Update 1-bug_report.yml 2025-01-08 19:11:23 -05:00
Nick Sweeting
7ba7ad6b3e Update 1-bug_report.yml 2025-01-08 19:10:47 -05:00
Nick Sweeting
91eb3472e3 Update 1-bug_report.yml 2025-01-08 19:09:12 -05:00
Nick Sweeting
b28f2e704c Update 1-bug_report.yml fix markdown formatting 2025-01-08 19:07:38 -05:00
Nick Sweeting
62a99c88d2 clarify filesystems selections in bug report github template 2025-01-08 19:05:41 -05:00
Nick Sweeting
765abc9d5a Update pip.yml 2025-01-08 18:53:13 -05:00
Nick Sweeting
83bb8a211a Remove outdated architecture diagram 2025-01-08 18:50:46 -05:00
Nick Sweeting
55a347c32e Update file_migrations.py 2025-01-02 23:58:59 -08:00
Nick Sweeting
a851ad4c87 Update models.py 2025-01-02 23:58:45 -08:00
Nick Sweeting
96c5d2f7de Update statemachines.py 2025-01-02 23:58:32 -08:00
Nick Sweeting
b74b0d23b4 Fix typo in timestamp scale factor (#1627) 2024-12-26 01:57:08 -05:00
1over137
3312a34b39 Fix typo in timestamp scale factor 2024-12-25 11:50:40 +00:00
Nick Sweeting
1fb5ecf13d change pip flow to use PAT 2024-12-18 18:55:29 -08:00