Implements three new extractors that run early in the extraction process to capture dynamic content:
New extractors:
- downloads: Catches file downloads triggered by the page using CDP download handlers
- images: Catches all image HTTP responses (matched by MIME type) and saves them to the images/ directory
- infiniscroll: Scrolls the page up to 10 times to load lazy content, then scrolls back to the top (see the sketch below)
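A minimal sketch of the infiniscroll loop, assuming a Puppeteer `page` handle already connected to the shared Chrome tab; the pass count and delay are illustrative:

```ts
import type { Page } from "puppeteer-core";

// Scroll until the page stops growing, capped at 10 passes (values are illustrative).
async function infiniscroll(page: Page): Promise<void> {
  for (let pass = 0; pass < 10; pass++) {
    const before = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((resolve) => setTimeout(resolve, 1000)); // give lazy loaders time to fire
    const after = await page.evaluate(() => document.body.scrollHeight);
    if (after === before) break; // nothing new loaded, stop early
  }
  await page.evaluate(() => window.scrollTo(0, 0)); // return to the top for later extractors
}
```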
These extractors run after puppeteer (which launches Chrome) but before other extractors that capture static content. They ensure dynamic and lazy-loaded content is captured.
Extractor order:
1. puppeteer - launches Chrome
2. downloads - catches downloads (reloads page with listeners)
3. images - catches images (reloads page with listeners)
4. infiniscroll - scrolls to load lazy content
5. favicon, title, headers, screenshot, etc. - capture final state
All three extractors:
- Use Puppeteer to connect to existing Chrome tab via CDP
- Reuse CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from .env
- Are fully configurable via environment variables
- Follow the extractor contract (executable, URL as $1, output to CWD)
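As a concrete illustration of that contract, a hypothetical do-nothing extractor might look like this (the file name and output are made up for the example):

```ts
#!/usr/bin/env node
// Hypothetical skeleton showing the contract; "example.txt" is a made-up output name.
import { writeFileSync } from "node:fs";

const url = process.argv[2]; // the URL arrives as $1
if (!url) {
  console.error("usage: example-extractor <url>");
  process.exit(1); // non-zero exit tells the runner we failed
}

// Shared state (e.g. CHROME_CDP_URL) arrives via environment variables.
// ... real extraction work would happen here ...
writeFileSync("example.txt", `placeholder output for ${url}\n`); // outputs go to CWD
process.exit(0); // 0 = success
```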
Updates:
- Added ExtractorName types to models.ts
- Updated EXTRACTOR_ORDER in extractors.ts
- Made all new extractors executable
- Updated README with complete documentation
- Verified all 17 extractors are discovered correctly
Implements all remaining extractors from the original ArchiveBox, following the serial execution pattern:
Browser-based extractors (using Puppeteer + CDP):
- dom: Extract full DOM HTML
- pdf: Generate PDF of page
- htmltotext: Extract plain text content
- readability: Extract article content using Mozilla's Readability algorithm (sketched below)
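A rough sketch of the readability extractor, using the jsdom and @mozilla/readability dependencies noted below; fetching the HTML directly here is a simplification, since the real extractor reads it from the shared Chrome tab:

```ts
import { writeFileSync } from "node:fs";
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

const url = process.argv[2];
// Simplification: fetch the HTML directly; the real extractor reads it from the shared tab.
const html = await fetch(url).then((res) => res.text());

const dom = new JSDOM(html, { url }); // passing the URL lets relative links resolve
const article = new Readability(dom.window.document).parse();
if (article === null) {
  console.error("Readability could not extract an article");
  process.exit(1);
}
writeFileSync("readability.html", article.content ?? "");
writeFileSync("readability.txt", article.textContent ?? "");
```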
Binary-based extractors (using native tools):
- singlefile: Create single-file archive using single-file-cli
- git: Clone git repositories
- media: Download media using yt-dlp (see the sketch after this list)
- archive_org: Submit to Internet Archive Wayback Machine
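The binary-based extractors are thin wrappers around their tools. A rough sketch of the media extractor; the yt-dlp flags shown are illustrative, not necessarily the exact invocation used:

```ts
import { spawnSync } from "node:child_process";

const url = process.argv[2];
// Flags are illustrative; the real extractor's yt-dlp invocation may differ.
const result = spawnSync(
  "yt-dlp",
  ["--no-playlist", "--output", "media/%(title)s.%(ext)s", url],
  { stdio: "inherit" }, // stream yt-dlp progress straight to the console
);
process.exit(result.status ?? 1); // propagate the tool's exit code
```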
All extractors:
- Auto-install dependencies if needed (see the sketch after this list)
- Accept the URL as the $1 argument
- Output to the current working directory
- Are configured via environment variables only
- Read from the .env file for shared state
- Follow the exit code contract (0 = success)
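The auto-install step could look roughly like this; the installer command is an assumption and differs per tool:

```ts
import { spawnSync } from "node:child_process";

// True if the binary resolves on PATH ("which" assumes a POSIX environment).
function hasBinary(name: string): boolean {
  return spawnSync("which", [name]).status === 0;
}

if (!hasBinary("yt-dlp")) {
  // Assumption: pip is the installer here; each extractor knows its own tool's install command.
  const install = spawnSync("pip", ["install", "--user", "yt-dlp"], { stdio: "inherit" });
  if (install.status !== 0) process.exit(1); // can't proceed without the dependency
}
```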
Updates:
- Added all extractor types to ExtractorName union in models.ts
- Updated EXTRACTOR_ORDER with complete 14-extractor sequence
- Installed jsdom and @mozilla/readability dependencies
- Made all extractors executable
- Updated README with complete documentation for all extractors
Major architectural change: extractors now run serially in a predefined
order with state sharing via a .env file in each snapshot directory.
Key changes:
1. New puppeteer extractor:
- Launches Chrome with user data dir
- Writes CDP URL and page target ID to .env
- Leaves browser running for other extractors to reuse
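A rough sketch of this extractor; the profile directory and goto options are illustrative assumptions:

```ts
import { appendFileSync } from "node:fs";
import puppeteer from "puppeteer";

const url = process.argv[2];
const browser = await puppeteer.launch({
  headless: true,
  userDataDir: "./chrome_profile", // assumption: a snapshot-local profile directory
});
const page = await browser.newPage();
await page.goto(url, { waitUntil: "networkidle2" });

// Ask CDP for this tab's target id so later extractors can find the same tab.
const session = await page.createCDPSession();
const { targetInfo } = await session.send("Target.getTargetInfo");
await session.detach();

appendFileSync(
  ".env",
  `CHROME_CDP_URL=${browser.wsEndpoint()}\n` +
    `CHROME_PAGE_TARGET_ID=${targetInfo.targetId}\n`,
);

browser.disconnect(); // detach without killing Chrome so the next extractor can reuse it
```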
2. Serial execution (src/extractors.ts):
- Hardcoded EXTRACTOR_ORDER array defines execution sequence
- New runExtractorsSerial() method runs the extractors one at a time
- .env is re-read before each extractor runs, and its variables are passed into the extractor's environment
- Extractors can append to .env for later extractors
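A condensed sketch of the runner; EXTRACTOR_ORDER is truncated here, and parseDotEnv is the .env parser sketched under point 4 below:

```ts
import { spawnSync } from "node:child_process";
import { existsSync, readFileSync } from "node:fs";
import * as path from "node:path";

// Truncated for illustration; the real array lists all extractors.
const EXTRACTOR_ORDER = ["puppeteer", "title", "headers", "screenshot"];

export function runExtractorsSerial(url: string, snapshotDir: string, extractorsDir: string): void {
  for (const name of EXTRACTOR_ORDER) {
    // Re-read .env before every extractor so state written by earlier ones is visible.
    const envFile = path.join(snapshotDir, ".env");
    const shared = existsSync(envFile) ? parseDotEnv(readFileSync(envFile, "utf8")) : {};
    const result = spawnSync(path.join(extractorsDir, name), [url], {
      cwd: snapshotDir, // extractors write their output to CWD
      env: { ...process.env, ...shared },
      stdio: "inherit",
    });
    if (result.status !== 0) console.error(`${name} failed (exit ${result.status})`);
  }
}
```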
3. Updated browser-based extractors (title, headers, screenshot):
- Now reuse existing Chrome tab from puppeteer extractor
- Read CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from env
- Connect to existing browser instead of launching new one
- Leave page open after extraction for next extractor
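A sketch of how a browser-based extractor could locate the shared tab; matching by CDP target id via Target.getTargetInfo is one way to do it, not necessarily the exact code used here:

```ts
import puppeteer from "puppeteer-core";
import type { Page } from "puppeteer-core";

const cdpUrl = process.env.CHROME_CDP_URL;
const targetId = process.env.CHROME_PAGE_TARGET_ID;
if (!cdpUrl || !targetId) {
  console.error("CHROME_CDP_URL / CHROME_PAGE_TARGET_ID not set; did the puppeteer extractor run?");
  process.exit(1);
}

const browser = await puppeteer.connect({ browserWSEndpoint: cdpUrl });

// Find the tab the puppeteer extractor opened by matching its CDP target id.
let page: Page | undefined;
for (const candidate of await browser.pages()) {
  const session = await candidate.createCDPSession();
  const { targetInfo } = await session.send("Target.getTargetInfo");
  await session.detach();
  if (targetInfo.targetId === targetId) {
    page = candidate;
    break;
  }
}
if (!page) process.exit(1);

// ... extract from `page` here, then disconnect without closing anything ...
browser.disconnect();
```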
4. .env file management:
- Created in snapshot dir before first extractor runs
- Loaded before each extractor execution
- Simple KEY=VALUE parser with quote handling
- Enables state passing between extractors
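A minimal sketch of such a parser; the real one may handle more edge cases:

```ts
// Minimal KEY=VALUE parser; the real implementation may handle more edge cases.
function parseDotEnv(contents: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const raw of contents.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq < 1) continue; // no key before "="
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one layer of matching single or double quotes.
    if (
      value.length >= 2 &&
      ((value.startsWith('"') && value.endsWith('"')) ||
        (value.startsWith("'") && value.endsWith("'")))
    ) {
      value = value.slice(1, -1);
    }
    env[key] = value;
  }
  return env;
}
```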
5. Updated CLI (src/cli.ts):
- Use runExtractorsSerial instead of parallel execution
- Better output formatting for serial execution
- Show extractor execution order
Benefits:
- Single Chrome instance shared across all extractors (faster)
- Predictable execution order (easier to debug)
- State sharing enables complex workflows
- Browser stays on same page (more efficient)
- No need for separate Chrome remote debugging setup
Breaking changes:
- Extractors now run serially (not in parallel)
- puppeteer extractor must run first for browser-based extractors
- Added puppeteer package dependency
Updated documentation:
- README.md with new architecture details
- Added section on .env state sharing
- Updated extractor documentation
- Added execution order section
Changed the screenshot, title, and headers extractors from assorted one-off implementations to a single approach: Puppeteer connecting to Chrome via CDP.
Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via the CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency
- Added puppeteer-core as a dependency (npm install)
- Removed auto-install logic (cleaner, more predictable)
- Better error messages when CHROME_CDP_URL is not set (see the sketch after this list)
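A sketch of the guard and connection, assuming CHROME_CDP_URL holds the http endpoint exposed by --remote-debugging-port; the error text is illustrative:

```ts
import puppeteer from "puppeteer-core";

const cdpUrl = process.env.CHROME_CDP_URL;
if (!cdpUrl) {
  // Error text is illustrative, not the exact message used by the extractors.
  console.error(
    "CHROME_CDP_URL is not set. Start Chrome with:\n" +
      "  chrome --remote-debugging-port=9222 --headless\n" +
      "then export CHROME_CDP_URL=http://localhost:9222",
  );
  process.exit(1);
}

// browserURL takes the http endpoint from --remote-debugging-port;
// a ws:// endpoint would go in browserWSEndpoint instead.
const browser = await puppeteer.connect({ browserURL: cdpUrl });
```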
Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Can use remote/containerized Chrome
- Better for production deployments
Breaking changes:
- CHROME_CDP_URL environment variable now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
  chrome --remote-debugging-port=9222 --headless
Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
# Summary
Patch submitted by @pcrockett to fix a circular import of CONSTANTS in archivebox/__init__.py (see the traceback below).
# Related issues
- Fixes https://github.com/ArchiveBox/ArchiveBox/issues/1657#issue-2856003985
# Changes these areas
- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
I was getting:

    ImportError: cannot import name 'CONSTANTS' from partially initialized module 'archivebox' (most likely due to a circular import)
    (/nix/store/6fy0wgy7r3ld3k590kxgxrc0r1cca347-archivebox-0.8.6rc3/lib/python3.12/site-packages/archivebox/__init__.py)