Documents:
- Implementation status (all complete)
- Code validation results (syntax, TypeScript, dependencies)
- Test attempt results and blockers
- Test plan for network-enabled environment
- Workarounds for offline testing
Implements a Chrome extension management system that allows extractors to use browser extensions:
New 2captcha extractor (runs BEFORE puppeteer):
- Downloads Chrome extensions from Web Store (.crx files)
- Unpacks extensions to ./extensions/ directory
- Writes CHROME_EXTENSIONS_PATHS and CHROME_EXTENSIONS_IDS to .env
- Supports 2captcha (CAPTCHA solving), SingleFile, uBlock, and a cookie consent blocker
- Configurable via API_KEY_2CAPTCHA and EXTENSIONS_ENABLED env vars
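A minimal sketch of that download/unpack flow (the clients2.google.com URL is the standard Web Store CRX endpoint, and unzip-crx-3 exposes a single (crxFile, destination) function; paths, the helper name, and the placeholder ID are illustrative):

```typescript
import { appendFile, mkdir, writeFile } from "node:fs/promises";
import path from "node:path";
import unzipCrx from "unzip-crx-3";

// Download one extension's .crx from the Web Store and unpack it.
async function installExtension(id: string): Promise<string> {
  const url =
    "https://clients2.google.com/service/update2/crx?response=redirect" +
    `&prodversion=120.0&acceptformat=crx2,crx3&x=id%3D${id}%26uc`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`download failed for ${id}: ${res.status}`);
  const crxPath = path.join("extensions", `${id}.crx`);
  const unpackedDir = path.join("extensions", id);
  await mkdir("extensions", { recursive: true });
  await writeFile(crxPath, Buffer.from(await res.arrayBuffer()));
  await unzipCrx(crxPath, unpackedDir); // strips the CRX header, unzips the rest
  return unpackedDir;
}

const ids = ["<web-store-id>"]; // taken from each extension's store listing URL
const dirs = await Promise.all(ids.map(installExtension));
await appendFile(
  ".env",
  `CHROME_EXTENSIONS_PATHS=${dirs.map((d) => path.resolve(d)).join(",")}\n` +
    `CHROME_EXTENSIONS_IDS=${ids.join(",")}\n`,
);
```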
Updated puppeteer extractor:
- Reads CHROME_EXTENSIONS_PATHS from .env
- Loads extensions when launching Chrome
- Runs in headed mode when extensions are present (extensions require a visible browser)
- Passes extension paths and IDs to Chrome via --load-extension and --allowlisted-extension-id
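Roughly how those flags fit together at launch (--disable-extensions-except is the usual companion flag; parseDotenv stands in for the project's .env loader, sketched further down):

```typescript
import { readFile } from "node:fs/promises";
import puppeteer from "puppeteer";

const env = parseDotenv(await readFile(".env", "utf8"));
const paths = (env.CHROME_EXTENSIONS_PATHS ?? "").split(",").filter(Boolean);
const ids = (env.CHROME_EXTENSIONS_IDS ?? "").split(",").filter(Boolean);

const browser = await puppeteer.launch({
  headless: paths.length === 0, // extensions require a visible browser
  userDataDir: "./chrome-profile",
  args: paths.length
    ? [
        `--disable-extensions-except=${paths.join(",")}`,
        `--load-extension=${paths.join(",")}`, // comma-separated unpacked dirs
        ...ids.map((id) => `--allowlisted-extension-id=${id}`),
      ]
    : [],
});
```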
Updated singlefile extractor (now uses extension instead of CLI):
- Connects to existing Chrome browser via CDP
- Triggers SingleFile extension via Ctrl+Shift+Y keyboard shortcut
- Waits for downloaded file to appear in Chrome downloads directory
- More reliable than single-file-cli, with better-quality output
- Fully integrates with Chrome's extension ecosystem
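A sketch of that trigger-and-wait flow (CHROME_DOWNLOADS_DIR is an assumed variable name, and the 60-second timeout is arbitrary):

```typescript
import { readdir } from "node:fs/promises";
import puppeteer from "puppeteer-core";

const browser = await puppeteer.connect({ browserURL: process.env.CHROME_CDP_URL! });
const [page] = await browser.pages(); // target-ID matching elided; see the helper below
await page.bringToFront();

// Fire SingleFile's save-page shortcut as synthetic key events.
await page.keyboard.down("Control");
await page.keyboard.down("Shift");
await page.keyboard.press("KeyY");
await page.keyboard.up("Shift");
await page.keyboard.up("Control");

// Poll the Chrome downloads directory until the archive appears.
const dir = process.env.CHROME_DOWNLOADS_DIR ?? "./downloads";
for (let tries = 0; tries < 60; tries++) {
  const saved = (await readdir(dir).catch(() => [])).filter((f) => f.endsWith(".html"));
  if (saved.length > 0) break;
  await new Promise((r) => setTimeout(r, 1000));
}
browser.disconnect();
```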
Benefits:
- Automatic CAPTCHA solving via 2captcha extension
- Better ad/cookie blocking via uBlock and cookie consent extensions
- Higher quality single-file archives using official SingleFile extension
- Extensions share browser state (cookies, local storage, etc.)
- Foundation for adding more browser extensions in the future
Dependencies:
- Added unzip-crx-3 for unpacking .crx extension files
- Updated extractors to use puppeteer-core for CDP connections
Execution order:
1. 2captcha downloads/configures extensions
2. puppeteer launches Chrome with extensions loaded
3. All other extractors reuse the same Chrome instance with extensions active
Implements three new extractors that run at the beginning of the extraction process to capture dynamic content:
New extractors:
- downloads: Catches file downloads triggered by the page using CDP download handlers
- images: Catches all image HTTP responses based on MIME type, saves to images/ directory
- infiniscroll: Scrolls page up to 10 times to load lazy content, then scrolls back to top
These extractors run after puppeteer (which launches Chrome) but before other extractors that capture static content. They ensure dynamic and lazy-loaded content is captured.
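For a sense of the pattern, the images extractor might reduce to the following (file naming and the wait condition are illustrative):

```typescript
import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";
import puppeteer from "puppeteer-core";

const browser = await puppeteer.connect({ browserURL: process.env.CHROME_CDP_URL! });
const [page] = await browser.pages();
await mkdir("images", { recursive: true });

// Save every response whose MIME type identifies it as an image.
page.on("response", async (res) => {
  if (!(res.headers()["content-type"] ?? "").startsWith("image/")) return;
  try {
    const name = path.basename(new URL(res.url()).pathname) || "index";
    await writeFile(path.join("images", name), await res.buffer());
  } catch {
    /* body no longer available (redirects, cached responses) */
  }
});

// Reload so every request replays with the listener attached.
await page.reload({ waitUntil: "networkidle2" });
browser.disconnect();
```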
Extractor order:
1. puppeteer - launches Chrome
2. downloads - catches downloads (reloads page with listeners)
3. images - catches images (reloads page with listeners)
4. infiniscroll - scrolls to load lazy content
5. favicon, title, headers, screenshot, etc. - capture final state
All three extractors:
- Use Puppeteer to connect to existing Chrome tab via CDP
- Reuse CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from .env
- Are fully configurable via environment variables
- Follow the extractor contract (executable, URL as $1, output to CWD)
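That shared connect pattern, roughly (matching on the internal _targetId field is an assumption: puppeteer-core exposes no public target-ID accessor, so the real code may differ):

```typescript
import puppeteer, { Browser, Page } from "puppeteer-core";

export async function connectToSharedPage(): Promise<{ browser: Browser; page: Page }> {
  const browser = await puppeteer.connect({
    browserURL: process.env.CHROME_CDP_URL!, // e.g. http://127.0.0.1:9222
  });
  const wanted = process.env.CHROME_PAGE_TARGET_ID;
  const target = browser
    .targets()
    .find((t) => t.type() === "page" && (t as any)._targetId === wanted);
  const page = (await target?.page()) ?? (await browser.pages())[0];
  return { browser, page };
}
```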
Updates:
- Added ExtractorName types to models.ts
- Updated EXTRACTOR_ORDER in extractors.ts
- Made all new extractors executable
- Updated README with complete documentation
- Verified all 17 extractors are discovered correctly
Implements all remaining extractors from the original ArchiveBox, following the serial execution pattern:
Browser-based extractors (using Puppeteer + CDP):
- dom: Extract full DOM HTML
- pdf: Generate PDF of page
- htmltotext: Extract plain text content
- readability: Extract article content using Mozilla Readability algorithm
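For example, the readability extractor can be little more than this (output file names are assumptions; connectToSharedPage is the helper sketched earlier):

```typescript
import { writeFile } from "node:fs/promises";
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const { browser, page } = await connectToSharedPage();
const dom = new JSDOM(await page.content(), { url: page.url() });
const article = new Readability(dom.window.document).parse();
if (article) {
  await writeFile("readability.html", article.content ?? "");
  await writeFile("readability.txt", article.textContent ?? "");
}
browser.disconnect();
process.exit(article ? 0 : 1); // exit-code contract: 0 = success
```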
Binary-based extractors (using native tools):
- singlefile: Create single-file archive using single-file-cli
- git: Clone git repositories
- media: Download media using yt-dlp
- archive_org: Submit to Internet Archive Wayback Machine
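These mostly reduce to shelling out to the tool; e.g. the core of the media extractor around yt-dlp (the flag selection here is illustrative, not the exact set used):

```typescript
import { spawnSync } from "node:child_process";

const url = process.argv[2]; // the URL arrives as $1 per the extractor contract
const result = spawnSync(
  "yt-dlp",
  ["--no-playlist", "-o", "media/%(title)s.%(ext)s", url],
  { stdio: "inherit" }, // stream yt-dlp's own progress output through
);
process.exit(result.status ?? 1);
```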
All extractors:
- Auto-install dependencies if needed
- Accept URL as $1 argument
- Output to current working directory
- Configure via environment variables only
- Read from .env file for shared state
- Follow exit code contract (0 = success)
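Put together, that contract makes every extractor look roughly like this skeleton (assuming the scripts run under a Node-compatible TypeScript loader such as tsx; parseDotenv is the .env helper sketched later):

```typescript
#!/usr/bin/env -S npx tsx
import { readFile } from "node:fs/promises";

const url = process.argv[2]; // $1: the URL to archive
if (!url) {
  console.error("usage: extractor <url>");
  process.exit(1);
}

// Pick up shared state written to .env by earlier extractors in the chain.
const shared = parseDotenv(await readFile(".env", "utf8").catch(() => ""));
Object.assign(process.env, shared);

try {
  // ...do the actual extraction, writing output files into process.cwd()...
  process.exit(0); // 0 = success; anything else marks this extractor failed
} catch (err) {
  console.error(err);
  process.exit(1);
}
```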
Updates:
- Added all extractor types to ExtractorName union in models.ts
- Updated EXTRACTOR_ORDER with complete 14-extractor sequence
- Installed jsdom and @mozilla/readability dependencies
- Made all extractors executable
- Updated README with complete documentation for all extractors
Major architectural change: extractors now run serially in a predefined
order with state sharing via a .env file in each snapshot directory.
Key changes:
1. New puppeteer extractor:
- Launches Chrome with user data dir
- Writes CDP URL and page target ID to .env
- Leaves browser running for other extractors to reuse
2. Serial execution (src/extractors.ts):
- Hardcoded EXTRACTOR_ORDER array defines execution sequence
- Created runExtractorsSerial() method (sketched after this list)
- Each extractor reads .env before running
- Environment variables from .env passed to extractors
- Extractors can append to .env for later extractors
3. Updated browser-based extractors (title, headers, screenshot):
- Now reuse existing Chrome tab from puppeteer extractor
- Read CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from env
- Connect to existing browser instead of launching new one
- Leave page open after extraction for next extractor
4. .env file management:
- Created in snapshot dir before first extractor runs
- Loaded before each extractor execution
- Simple KEY=VALUE parser with quote handling (sketched after this list)
- Enables state passing between extractors
5. Updated CLI (src/cli.ts):
- Use runExtractorsSerial instead of parallel execution
- Better output formatting for serial execution
- Show extractor execution order
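The serial runner from item 2 might look like the following (runExtractorsSerial and EXTRACTOR_ORDER mirror the names in src/extractors.ts; EXTRACTORS_DIR and the error handling are illustrative):

```typescript
import { execFile } from "node:child_process";
import { readFile } from "node:fs/promises";
import path from "node:path";
import { promisify } from "node:util";

const run = promisify(execFile);

export async function runExtractorsSerial(url: string, snapshotDir: string) {
  for (const name of EXTRACTOR_ORDER) {
    // Re-read .env on every iteration so values appended by earlier
    // extractors are visible to later ones.
    const dotenv = await readFile(path.join(snapshotDir, ".env"), "utf8").catch(() => "");
    await run(path.join(EXTRACTORS_DIR, name), [url], {
      cwd: snapshotDir, // extractors write their output to CWD
      env: { ...process.env, ...parseDotenv(dotenv) },
    }).catch((err) => console.error(`[${name}] failed:`, err.message));
  }
}
```

And the .env parser from item 4, with the quote handling described above:

```typescript
export function parseDotenv(text: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const m = line.match(/^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$/);
    if (!m) continue; // skip blanks, comments, malformed lines
    let value = m[2];
    if (
      (value.startsWith('"') && value.endsWith('"') && value.length >= 2) ||
      (value.startsWith("'") && value.endsWith("'") && value.length >= 2)
    ) {
      value = value.slice(1, -1); // drop surrounding quotes
    }
    env[m[1]] = value;
  }
  return env;
}
```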
Benefits:
- Single Chrome instance shared across all extractors (faster)
- Predictable execution order (easier to debug)
- State sharing enables complex workflows
- Browser stays on same page (more efficient)
- No need for separate Chrome remote debugging setup
Breaking changes:
- Extractors now run serially (not in parallel)
- puppeteer extractor must run first for browser-based extractors
- Added puppeteer package dependency
Updated documentation:
- README.md with new architecture details
- Added section on .env state sharing
- Updated extractor documentation
- Added execution order section
Changed the screenshot, title, and headers extractors to connect to Chrome
via CDP using Puppeteer, replacing their previous assorted implementations.
Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency
- Added puppeteer-core as dependency (npm install)
- Removed auto-install logic (cleaner, more predictable)
- Better error messages when CHROME_CDP_URL is not set
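In this earlier form each extractor connected straight to the user-started Chrome; e.g. the headers extractor, roughly (the output file name is assumed):

```typescript
import { writeFile } from "node:fs/promises";
import puppeteer from "puppeteer-core";

const browser = await puppeteer.connect({ browserURL: process.env.CHROME_CDP_URL! });
const page = await browser.newPage();
const response = await page.goto(process.argv[2], { waitUntil: "domcontentloaded" });
await writeFile("headers.json", JSON.stringify(response?.headers() ?? {}, null, 2));
await page.close();
browser.disconnect();
```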
Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Can use remote/containerized Chrome
- Better for production deployments
Breaking changes:
- CHROME_CDP_URL environment variable now required for:
- screenshot extractor
- title extractor
- headers extractor
- Users must start Chrome with remote debugging:
chrome --remote-debugging-port=9222 --headless
Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
# Summary
Patch submitted by @pcrockett
# Related issues
- Fixes https://github.com/ArchiveBox/ArchiveBox/issues/1657#issue-2856003985
# Changes these areas
- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
I was getting:

```
ImportError: cannot import name 'CONSTANTS' from partially initialized module 'archivebox' (most likely due to a circular import)
(/nix/store/6fy0wgy7r3ld3k590kxgxrc0r1cca347-archivebox-0.8.6rc3/lib/python3.12/site-packages/archivebox/__init__.py)
```