Major architectural change: extractors now run serially in a predefined
order with state sharing via a .env file in each snapshot directory.
Key changes:
1. New puppeteer extractor:
   - Launches Chrome with user data dir
   - Writes CDP URL and page target ID to .env
   - Leaves browser running for other extractors to reuse
2. Serial execution (src/extractors.ts):
   - Hardcoded EXTRACTOR_ORDER array defines execution sequence
   - Created runExtractorsSerial() method
   - Each extractor reads .env before running
   - Environment variables from .env passed to extractors
   - Extractors can append to .env for later extractors
3. Updated browser-based extractors (title, headers, screenshot):
   - Now reuse existing Chrome tab from puppeteer extractor
   - Read CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from env
   - Connect to existing browser instead of launching new one
   - Leave page open after extraction for next extractor
4. .env file management:
   - Created in snapshot dir before first extractor runs
   - Loaded before each extractor execution
   - Simple KEY=VALUE parser with quote handling
   - Enables state passing between extractors
5. Updated CLI (src/cli.ts):
   - Use runExtractorsSerial instead of parallel execution
   - Better output formatting for serial execution
   - Show extractor execution order
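
For illustration, the serial run and .env hand-off described above could look roughly like the sketch below. It is not the code in src/extractors.ts: `parseEnv`, `Extractor`, and `runSerialSketch` are hypothetical names, the contents of `EXTRACTOR_ORDER` are assumed (the commit only states that puppeteer runs first), and an in-memory string stands in for the .env file on disk.

```typescript
// Illustrative sketch only; names and order contents are assumptions.
type Env = Record<string, string>;

// Parse simple KEY=VALUE lines: skip blanks and '#' comments, and strip
// one pair of matching single or double quotes around the value.
function parseEnv(text: string): Env {
  const env: Env = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no '=' or empty key: ignore the line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quote = value.charAt(0);
    const last = value.charAt(value.length - 1);
    if (value.length >= 2 && (quote === '"' || quote === "'") && last === quote) {
      value = value.slice(1, -1);
    }
    env[key] = value;
  }
  return env;
}

interface Extractor {
  name: string;
  // Receives the current env; returns .env lines to append ("" for none).
  run: (env: Env) => string;
}

// Assumed order: only "puppeteer first" is stated in the commit message.
const EXTRACTOR_ORDER: string[] = ["puppeteer", "title", "headers", "screenshot"];

// Run extractors one at a time in EXTRACTOR_ORDER. The env text is
// re-parsed before each extractor so later ones see earlier appends.
function runSerialSketch(extractors: Extractor[], initialEnvFile: string): string {
  const byName: Record<string, Extractor> = {};
  for (const ex of extractors) byName[ex.name] = ex;

  let envFile = initialEnvFile;
  for (const name of EXTRACTOR_ORDER) {
    const ex = byName[name];
    if (ex === undefined) continue; // extractor not enabled for this run
    const appended = ex.run(parseEnv(envFile));
    if (appended !== "") envFile += appended + "\n";
  }
  return envFile;
}
```

Re-parsing before every extractor keeps the hand-off simple: an extractor only has to append lines, and it never needs to know which extractors ran before it.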
Benefits:
- Single Chrome instance shared across all extractors (faster)
- Predictable execution order (easier to debug)
- State sharing enables complex workflows
- Browser stays on same page (more efficient)
- No need for separate Chrome remote debugging setup
Breaking changes:
- Extractors now run serially (not in parallel)
- puppeteer extractor must run first for browser-based extractors
- Added puppeteer package dependency
Updated documentation:
- README.md with new architecture details
- Added section on .env state sharing
- Updated extractor documentation
- Added execution order section
Switched the screenshot, title, and headers extractors from their
various implementations to Puppeteer, connecting to Chrome over CDP.
Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency
- Added puppeteer-core as dependency (npm install)
- Removed auto-install logic (cleaner, more predictable)
- Better error messages when CHROME_CDP_URL not set
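
As a rough sketch of the CHROME_CDP_URL handling listed above, an extractor might validate the variable and build connection options before connecting. The helper name `cdpConnectOptions`, the error wording, and the acceptance of both ws:// and http:// URLs are assumptions, not taken from the commit. The returned object uses the `browserWSEndpoint` and `browserURL` fields that `puppeteer.connect()` in puppeteer-core accepts.

```typescript
// Illustrative sketch only; helper name and messages are hypothetical.
interface CdpConnectOptions {
  browserWSEndpoint?: string; // for ws:// or wss:// endpoints
  browserURL?: string;        // for http:// or https:// debugging URLs
}

// Turn CHROME_CDP_URL into connect options, failing early with an
// actionable message when the variable is missing or malformed.
function cdpConnectOptions(env: Record<string, string | undefined>): CdpConnectOptions {
  const raw = (env["CHROME_CDP_URL"] || "").trim();
  if (raw === "") {
    throw new Error(
      "CHROME_CDP_URL is not set. Start Chrome with " +
        "--remote-debugging-port=9222 and set CHROME_CDP_URL, " +
        "for example http://localhost:9222"
    );
  }
  if (/^wss?:\/\//.test(raw)) return { browserWSEndpoint: raw };
  if (/^https?:\/\//.test(raw)) return { browserURL: raw };
  throw new Error(
    "CHROME_CDP_URL must start with ws://, wss://, http:// or https://; got: " + raw
  );
}
```

In an extractor this would typically be called as `cdpConnectOptions(process.env)` and the result passed to `puppeteer.connect()`.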
Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Can use remote/containerized Chrome
- Better for production deployments
Breaking changes:
- CHROME_CDP_URL environment variable now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
    chrome --remote-debugging-port=9222 --headless
Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
# Summary
Patch submitted by @pcrockett
# Related issues
- Fixes https://github.com/ArchiveBox/ArchiveBox/issues/1657#issue-2856003985
# Areas changed
- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
I was getting:
ImportError: cannot import name 'CONSTANTS' from partially initialized module 'archivebox' (most likely due to a circular import) (/nix/store/6fy0wgy7r3ld3k590kxgxrc0r1cca347-archivebox-0.8.6rc3/lib/python3.12/site-packages/archivebox/__init__.py)