Chrome Plugin Consolidation - COMPLETED ✓
Core Principle: One ArchiveResult Per Plugin
Critical Realization: Each plugin produces at most ONE ArchiveResult. Output plugins produce exactly one; infrastructure plugins such as chrome produce none. This is fundamental to ArchiveBox's architecture - you cannot have multiple outputs from a single plugin.
CRITICAL ARCHITECTURE CLARIFICATION
DO NOT CONFUSE THESE CONCEPTS:
- Plugin = Directory name (e.g., chrome, consolelog, screenshot)
  - Lives in archivebox/plugins/<plugin_name>/
  - Can contain MULTIPLE hook files
  - Produces ONE output directory: users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/
  - Creates ONE ArchiveResult record per snapshot
- Hook = Individual script file (e.g., on_Snapshot__20_chrome_tab.bg.js)
  - Lives inside a plugin directory
  - One plugin can have MANY hooks
  - All hooks in a plugin run sequentially when that plugin's ArchiveResult is processed
  - All hooks write to the SAME output directory (the plugin directory)
- Extractor = ArchiveResult.extractor field = PLUGIN NAME (not hook name)
  - ArchiveResult.extractor = 'chrome' (plugin name)
  - NOT ArchiveResult.extractor = '20_chrome_tab.bg' (hook name)
- Output Directory = users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/
  - One output directory per plugin (0.9.x structure)
  - ALL hooks in that plugin write to this same directory
  - Example: users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/ contains outputs from ALL chrome hooks
  - Legacy: archive/{timestamp}/ with symlink for backwards compatibility
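The relationship between hook files, plugin names, and output directories can be sketched as follows. This is a minimal illustration rather than ArchiveBox's actual code: the helper names plugin_name_for_hook and plugin_output_dir are hypothetical, and the path layout is copied from the definitions above.

```python
from pathlib import PurePosixPath


def plugin_name_for_hook(hook_path: str) -> str:
    """Return the plugin name (the hook's parent directory), never the hook name."""
    return PurePosixPath(hook_path).parent.name


def plugin_output_dir(username: str, date: str, domain: str, snap_id: str, plugin: str) -> str:
    """Build the per-plugin output directory in the 0.9.x layout described above."""
    return f"users/{username}/snapshots/{date}/{domain}/{snap_id}/{plugin}/"


hooks = [
    "archivebox/plugins/chrome/on_Snapshot__20_chrome_tab.bg.js",
    "archivebox/plugins/chrome/on_Snapshot__30_chrome_navigate.js",
    "archivebox/plugins/screenshot/on_Snapshot__34_screenshot.js",
]

# Several hooks collapse to one plugin name, and therefore to one output directory.
plugins = sorted({plugin_name_for_hook(h) for h in hooks})
print(plugins)
print(plugin_output_dir("default", "20251227", "example.com", "019b-6397-6a5b", "chrome"))
```

Three hook files map to two plugin names here, which is the point of the distinction: the extractor value and the output directory are keyed by plugin, not by hook.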
Example 1: Chrome Plugin (Infrastructure - NO ArchiveResult)
Plugin name: 'chrome'
ArchiveResult: NONE (infrastructure only)
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/
Hooks:
- on_Snapshot__20_chrome_tab.bg.js # Launches Chrome, opens tab
- on_Snapshot__30_chrome_navigate.js # Navigates to URL
- on_Snapshot__45_chrome_tab_cleanup.py # Kills Chrome on cleanup
Writes (temporary infrastructure files, deleted on cleanup):
- chrome/cdp_url.txt # Other plugins read this to connect
- chrome/target_id.txt # Tab ID for CDP connection
- chrome/page_loaded.txt # Navigation completion marker
- chrome/navigation.json # Navigation state
- chrome/hook.pid # For cleanup
NO ArchiveResult JSON is produced - this is pure infrastructure.
On SIGTERM: Chrome exits, chrome/ directory is deleted.
Example 2: Screenshot Plugin (Output Plugin - CREATES ArchiveResult)
Plugin name: 'screenshot'
ArchiveResult.extractor: 'screenshot'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/screenshot/
Hooks:
- on_Snapshot__34_screenshot.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Takes screenshot
4. Writes to: screenshot/screenshot.png
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'screenshot.png': {}}
Example 3: PDF Plugin (Output Plugin - CREATES ArchiveResult)
Plugin name: 'pdf'
ArchiveResult.extractor: 'pdf'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/pdf/
Hooks:
- on_Snapshot__35_pdf.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Generates PDF
4. Writes to: pdf/output.pdf
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'output.pdf': {}}
Lifecycle:
1. Chrome hooks run → create chrome/ dir with infrastructure files
2. Screenshot/PDF/etc hooks run → read chrome/cdp_url.txt, write to their own dirs
3. Snapshot.cleanup() called → sends SIGTERM to background hooks
4. Chrome receives SIGTERM → exits, deletes chrome/ dir
5. Screenshot/PDF/etc dirs remain with their outputs
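The output-plugin flow in Examples 2 and 3 (read ../chrome/cdp_url.txt, do the work, emit one result line) can be sketched like this. It is an illustration under assumptions: the real hooks are JavaScript, the helper names are hypothetical, and the JSON record only carries the two fields named above (status and output_files); the actual ArchiveResult JSON may contain more.

```python
import json
import tempfile
from pathlib import Path


def read_cdp_url(plugin_dir: Path) -> str:
    """Read the CDP WebSocket URL that the chrome plugin wrote to ../chrome/cdp_url.txt."""
    return (plugin_dir / ".." / "chrome" / "cdp_url.txt").read_text().strip()


def archive_result_line(status: str, output_files: dict) -> str:
    """Serialize one result record as a single JSON line (field set is assumed)."""
    return json.dumps({"status": status, "output_files": output_files})


# Simulate a snapshot directory with a chrome/ dir and a screenshot/ dir.
with tempfile.TemporaryDirectory() as snap:
    snap_dir = Path(snap)
    (snap_dir / "chrome").mkdir()
    # Placeholder URL; a real chrome hook would write the browser's WebSocket URL.
    (snap_dir / "chrome" / "cdp_url.txt").write_text("ws://127.0.0.1:9222/devtools/browser/example\n")
    screenshot_dir = snap_dir / "screenshot"
    screenshot_dir.mkdir()

    cdp_url = read_cdp_url(screenshot_dir)
    # A real hook would connect to cdp_url and write screenshot.png here.
    line = archive_result_line("succeeded", {"screenshot.png": {}})

print(cdp_url)
print(line)
```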
DO NOT:
- Create one ArchiveResult per hook
- Use hook names as extractor values
- Create separate output directories per hook
DO:
- Create one ArchiveResult per plugin
- Use plugin directory name as extractor value
- Run all hooks in a plugin when processing its ArchiveResult
- Write all hook outputs to the same plugin directory
This principle drove the entire consolidation strategy:
- Chrome plugin = Infrastructure only (NO ArchiveResult)
- Output plugins = Each produces ONE distinct ArchiveResult (kept separate)
Final Structure
1. Chrome Plugin (Infrastructure - No Output)
Location: archivebox/plugins/chrome/
This plugin provides shared Chrome infrastructure for other plugins. It manages the browser lifecycle but produces NO ArchiveResult - only infrastructure files in a single chrome/ output directory.
Consolidates these former plugins:
- chrome_session/ → Merged
- chrome_navigate/ → Merged
- chrome_cleanup/ → Merged
- chrome_extensions/ → Utilities merged
Hook Files:
chrome/
├── on_Crawl__00_chrome_install_config.py # Configure Chrome settings
├── on_Crawl__00_chrome_install.py # Install Chrome binary
├── on_Crawl__30_chrome_launch.bg.js # Launch Chrome (Crawl-level, bg)
├── on_Snapshot__20_chrome_tab.bg.js # Open tab (Snapshot-level, bg)
├── on_Snapshot__30_chrome_navigate.js # Navigate to URL (foreground)
├── on_Snapshot__45_chrome_tab_cleanup.py # Close tab, kill bg hooks
├── chrome_extension_utils.js # Extension utilities
├── config.json # Configuration
└── tests/test_chrome.py # Tests
Output Directory (Infrastructure Only):
chrome/
├── cdp_url.txt # WebSocket URL for CDP connection
├── pid.txt # Chrome process PID
├── target_id.txt # Current tab target ID
├── page_loaded.txt # Navigation completion marker
├── final_url.txt # Final URL after redirects
├── navigation.json # Navigation state (NEW)
└── hook.pid # Background hook PIDs (for cleanup)
New: navigation.json
Tracks navigation state with wait condition and timing:
{
  "waitUntil": "networkidle2",
  "elapsed": 1523,
  "url": "https://example.com",
  "finalUrl": "https://example.com/",
  "status": 200,
  "timestamp": "2025-12-27T22:15:30.123Z"
}
Fields:
- waitUntil - Wait condition: networkidle0, networkidle2, domcontentloaded, or load
- elapsed - Navigation time in milliseconds
- url - Original requested URL
- finalUrl - Final URL after redirects (success only)
- status - HTTP status code (success only)
- error - Error message (failure only)
- timestamp - ISO 8601 completion timestamp
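A consumer of navigation.json might parse it along these lines. This is a sketch, not part of ArchiveBox: the NavigationState class and parse_navigation helper are hypothetical, and treating "no error field" as success is an assumption based on the success-only and failure-only notes in the field list.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class NavigationState:
    wait_until: str
    elapsed_ms: int
    url: str
    timestamp: str
    final_url: Optional[str] = None  # success only
    status: Optional[int] = None     # success only
    error: Optional[str] = None      # failure only

    @property
    def succeeded(self) -> bool:
        # Assumption: a record without an error field describes a successful navigation.
        return self.error is None


def parse_navigation(text: str) -> NavigationState:
    """Map the camelCase JSON fields onto the dataclass, tolerating optional fields."""
    data = json.loads(text)
    return NavigationState(
        wait_until=data["waitUntil"],
        elapsed_ms=data["elapsed"],
        url=data["url"],
        timestamp=data["timestamp"],
        final_url=data.get("finalUrl"),
        status=data.get("status"),
        error=data.get("error"),
    )


example = (
    '{"waitUntil": "networkidle2", "elapsed": 1523, "url": "https://example.com", '
    '"finalUrl": "https://example.com/", "status": 200, '
    '"timestamp": "2025-12-27T22:15:30.123Z"}'
)
nav = parse_navigation(example)
print(nav.succeeded, nav.final_url, nav.elapsed_ms)
```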
2. Output Plugins (Each = One ArchiveResult)
These remain SEPARATE plugins because each produces a distinct output/ArchiveResult. Each plugin references ../chrome for infrastructure.
consolelog Plugin
archivebox/plugins/consolelog/
└── on_Snapshot__21_consolelog.bg.js
- Output: console.jsonl (browser console messages)
- Type: Background hook (CDP listener)
- References: ../chrome for CDP URL
ssl Plugin
archivebox/plugins/ssl/
└── on_Snapshot__23_ssl.bg.js
- Output: ssl.jsonl (SSL/TLS certificate details)
- Type: Background hook (CDP listener)
- References: ../chrome for CDP URL
responses Plugin
archivebox/plugins/responses/
└── on_Snapshot__24_responses.bg.js
- Output: responses/ directory with index.jsonl (network responses)
- Type: Background hook (CDP listener)
- References: ../chrome for CDP URL
redirects Plugin
archivebox/plugins/redirects/
└── on_Snapshot__31_redirects.bg.js
- Output: redirects.jsonl (redirect chain)
- Type: Background hook (CDP listener)
- References: ../chrome for CDP URL
- Changed: Converted to background hook, now uses CDP Network.requestWillBeSent to capture redirects from initial request
staticfile Plugin
archivebox/plugins/staticfile/
└── on_Snapshot__31_staticfile.bg.js
- Output: Downloaded static file (PDF, image, video, etc.)
- Type: Background hook (CDP listener)
- References: ../chrome for CDP URL
- Changed: Converted from Python to JavaScript, now uses CDP to detect Content-Type from initial response and download via CDP
What Changed
1. Plugin Consolidation
- Merged chrome_session, chrome_navigate, chrome_cleanup, chrome_extensions → chrome/
- Chrome plugin now has single output directory: chrome/
- All Chrome infrastructure hooks reference . (same directory)
2. Background Hook Conversions
redirects Plugin:
- Before: Ran AFTER navigation, reconnected to Chrome to check for redirects
- After: Background hook that sets up CDP listeners BEFORE navigation to capture redirects from initial request
- Method: Uses CDP Network.requestWillBeSent event with redirectResponse parameter
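To show how the redirectResponse parameter yields a redirect chain, here is a small sketch that processes already-captured Network.requestWillBeSent event payloads. The real hook is JavaScript and listens live; the redirect_chain helper and the from/to/status record shape are hypothetical and do not claim to match the actual redirects.jsonl format. The payload keys used (request.url, redirectResponse.url, redirectResponse.status) are part of the CDP event.

```python
def redirect_chain(events):
    """Build an ordered list of redirect hops from requestWillBeSent payloads.

    When a request is redirected, CDP fires the event again with
    redirectResponse describing the response that caused the redirect,
    and request.url holding the new target.
    """
    chain = []
    for params in events:
        redirect = params.get("redirectResponse")
        if redirect is None:
            continue  # initial request or an unrelated subresource
        chain.append({
            "from": redirect["url"],
            "to": params["request"]["url"],
            "status": redirect["status"],
        })
    return chain


# Illustrative payloads, trimmed to the fields used above.
events = [
    {"requestId": "1", "request": {"url": "http://example.com/"}},
    {
        "requestId": "1",
        "request": {"url": "https://example.com/"},
        "redirectResponse": {"url": "http://example.com/", "status": 301},
    },
]
print(redirect_chain(events))
```

Because the listener must already be attached when the first request is sent, this only works if the hook starts before navigation, which is why the plugin became a background hook.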
staticfile Plugin:
- Before: Python script that ran AFTER navigation, checked response headers
- After: Background JavaScript hook that sets up CDP listeners BEFORE navigation
- Method: Uses a page.on('response') listener (a Puppeteer page event driven by CDP network events) to capture Content-Type from the initial response
- Language: Converted from Python to JavaScript/Node.js for consistency
3. Navigation State Tracking
- Added: navigation.json file in chrome/ output directory
- Contains: waitUntil condition and elapsed milliseconds
- Purpose: Track navigation performance and wait conditions for analysis
4. Cleanup
- Deleted: chrome_session/on_CrawlEnd__99_chrome_cleanup.py (manual cleanup hook)
- Reason: Automatic cleanup via state machines is sufficient
- Verified: Cleanup mechanisms in core/models.py and crawls/models.py work correctly
Hook Execution Order
═══ CRAWL LEVEL ═══
00. chrome_install_config.py Configure Chrome settings
00. chrome_install.py Install Chrome binary
20. chrome_launch.bg.js Launch Chrome browser (STAYS RUNNING)
═══ PER-SNAPSHOT LEVEL ═══
Phase 1: PRE-NAVIGATION (Background hooks setup)
20. chrome_tab.bg.js Open new tab (STAYS ALIVE)
21. consolelog.bg.js Setup console listener (STAYS ALIVE)
23. ssl.bg.js Setup SSL listener (STAYS ALIVE)
24. responses.bg.js Setup network response listener (STAYS ALIVE)
31. redirects.bg.js Setup redirect listener (STAYS ALIVE)
31. staticfile.bg.js Setup staticfile detector (STAYS ALIVE)
Phase 2: NAVIGATION (Foreground - synchronization point)
30. chrome_navigate.js Navigate to URL (BLOCKS until page loaded)
↓
Writes navigation.json with waitUntil & elapsed
Writes page_loaded.txt marker
↓
All background hooks can now finalize
Phase 3: POST-NAVIGATION (Background hooks finalize)
(All .bg hooks save their data and wait for cleanup signal)
Phase 4: OTHER EXTRACTORS (use loaded page)
34. screenshot.js
37. singlefile.js
... (other extractors that need loaded page)
Phase 5: CLEANUP
45. chrome_tab_cleanup.py Close tab
Kill background hooks (SIGTERM → SIGKILL)
Update ArchiveResults
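The naming convention behind this ordering (on_<Event>__<NN>_<name>[.bg].<ext>) can be made concrete with a small parser. This is a sketch inferred from the filenames in this document, not ArchiveBox's scheduler: the regex, the Hook tuple, and sorting by (order, name) are assumptions.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from names such as on_Snapshot__20_chrome_tab.bg.js
HOOK_RE = re.compile(
    r"^on_(?P<event>[A-Za-z]+)__(?P<order>\d+)_(?P<name>.+?)(?P<bg>\.bg)?\.(?P<ext>js|py|sh)$"
)


class Hook(NamedTuple):
    event: str
    order: int
    name: str
    background: bool


def parse_hook(filename: str) -> Optional[Hook]:
    """Return the parsed hook, or None for non-hook files such as config.json."""
    m = HOOK_RE.match(filename)
    if m is None:
        return None
    return Hook(m["event"], int(m["order"]), m["name"], m["bg"] is not None)


filenames = [
    "on_Snapshot__30_chrome_navigate.js",
    "on_Snapshot__21_consolelog.bg.js",
    "on_Snapshot__45_chrome_tab_cleanup.py",
    "on_Snapshot__20_chrome_tab.bg.js",
    "config.json",
]
hooks = sorted(
    (h for h in map(parse_hook, filenames) if h is not None),
    key=lambda h: (h.order, h.name),
)
for h in hooks:
    print(f"{h.order:02d} {h.name:<20} {'background' if h.background else 'foreground'}")
```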
Background Hook Pattern
All .bg.js hooks follow this pattern:
- Setup: Create CDP listeners BEFORE navigation
- Capture: Collect data incrementally as events occur
- Write: Save data to filesystem continuously
- Wait: Keep process alive until SIGTERM
- Finalize: On SIGTERM, emit final JSONL result to stdout
- Exit: Clean exit with status code
Key files written:
- hook.pid - Process ID for cleanup mechanism
- Output files (e.g., console.jsonl, ssl.jsonl, etc.)
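The six steps above can be sketched as a skeleton. The real hooks are Node.js scripts, so this Python version only illustrates the control flow on a POSIX system; the function names, the placeholder event, and the shape of the final result line are all assumptions.

```python
import json
import os
import signal
import sys
import threading
from pathlib import Path


def install_sigterm_handler(stop: threading.Event) -> None:
    """SIGTERM only sets a flag, so the hook can finalize instead of dying mid-write."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())


def run_background_hook(output_dir: str, stop: threading.Event, out=sys.stdout) -> int:
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Record our PID so the cleanup mechanism can find and signal this process.
    (out_dir / "hook.pid").write_text(str(os.getpid()))

    # Setup + Capture + Write: a real hook would attach CDP listeners here and
    # append each event as it arrives. One placeholder record stands in for that.
    with open(out_dir / "console.jsonl", "a") as f:
        f.write(json.dumps({"type": "log", "text": "placeholder event"}) + "\n")

    # Wait: block until cleanup signals us.
    stop.wait()

    # Finalize: emit one JSONL result line to stdout (field set is assumed).
    result = {"status": "succeeded", "output_files": {"console.jsonl": {}}}
    out.write(json.dumps(result) + "\n")

    # Exit: return the status code for the caller to pass to sys.exit().
    return 0
```

A real entry point would create a threading.Event, call install_sigterm_handler(stop), and then sys.exit(run_background_hook(".", stop)). Passing the event in explicitly keeps the skeleton testable without sending signals.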
Automatic Cleanup Mechanism
Snapshot-level cleanup (core/models.py):
def cleanup(self):
    """Kill background hooks and close resources."""
    # Scan OUTPUT_DIR for hook.pid files
    # Send SIGTERM to processes
    # Wait for graceful exit
    # Send SIGKILL if process still alive
    # Update ArchiveResults to FAILED if needed
Crawl-level cleanup (crawls/models.py):
def cleanup(self):
    """Kill Crawl-level background hooks (Chrome browser)."""
    # Similar pattern for Crawl-level resources
    # Kills Chrome launch process
State machine integration:
- Both SnapshotMachine and CrawlMachine call cleanup() when entering sealed state
- Ensures all background processes are cleaned up properly
- No manual cleanup hooks needed
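The scan, SIGTERM, wait, SIGKILL sequence outlined in the cleanup() comments could look roughly like this as a standalone function. It is a POSIX-only sketch, not the code in core/models.py: the function names, the */hook.pid glob, and the grace period are assumptions, and updating ArchiveResult records is left out.

```python
import os
import signal
import time
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Signal 0 checks for existence without delivering anything."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def kill_background_hooks(output_dir, grace_seconds: float = 5.0) -> list:
    """Terminate every process named by a <plugin>/hook.pid file under output_dir."""
    signaled = []
    for pid_file in sorted(Path(output_dir).glob("*/hook.pid")):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            continue  # unreadable PID file, nothing to signal
        if not pid_alive(pid):
            continue
        try:
            os.kill(pid, signal.SIGTERM)  # ask for a graceful exit first
        except ProcessLookupError:
            continue
        deadline = time.monotonic() + grace_seconds
        while pid_alive(pid) and time.monotonic() < deadline:
            time.sleep(0.05)
        if pid_alive(pid):
            try:
                os.kill(pid, signal.SIGKILL)  # backup if it ignored SIGTERM
            except ProcessLookupError:
                pass
        signaled.append(pid)
    return signaled
```

The grace period matters for the background hook pattern: hooks emit their final result on SIGTERM, so killing them immediately would lose that output.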
Directory References
Crawl output structure:
- Crawls output to: users/{user_id}/crawls/{YYYYMMDD}/{crawl_id}/
- Example: users/1/crawls/20251227/abc-def-123/
- Crawl-level plugins create subdirectories: users/1/crawls/20251227/abc-def-123/chrome/
Snapshot output structure:
- Snapshots output to: archive/{timestamp}/
- Snapshot-level plugins create subdirectories: archive/{timestamp}/chrome/, archive/{timestamp}/consolelog/, etc.
Within chrome plugin:
- Hooks use . or OUTPUT_DIR to reference the chrome/ directory they're running in
- Example: fs.writeFileSync(path.join(OUTPUT_DIR, 'navigation.json'), ...)
From output plugins to chrome (same snapshot):
- Hooks use ../chrome to reference Chrome infrastructure in same snapshot
- Example: const CHROME_SESSION_DIR = '../chrome';
- Used to read: cdp_url.txt, target_id.txt, page_loaded.txt
From snapshot hooks to crawl chrome:
- Snapshot hooks receive CRAWL_OUTPUT_DIR environment variable (set by hooks.py)
- Use: path.join(process.env.CRAWL_OUTPUT_DIR, 'chrome') to find crawl-level Chrome
- This allows snapshots to reuse the crawl's shared Chrome browser
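For a Python hook, the equivalent lookup might be written as below. This is a sketch: crawl_chrome_dir is a hypothetical helper, and returning None when the variable is unset is an assumption about how a caller would want to handle a missing crawl context.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def crawl_chrome_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Locate the crawl-level chrome/ directory via CRAWL_OUTPUT_DIR, if set."""
    crawl_dir = env.get("CRAWL_OUTPUT_DIR")
    return Path(crawl_dir) / "chrome" if crawl_dir else None


print(crawl_chrome_dir({"CRAWL_OUTPUT_DIR": "users/1/crawls/20251227/abc-def-123"}))
```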
Navigation synchronization:
- All hooks wait for ../chrome/page_loaded.txt before finalizing
- This file is written by chrome_navigate.js after navigation completes
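Waiting on the marker file amounts to polling with a timeout. The following is a minimal sketch under assumptions: wait_for_page_loaded is a hypothetical helper, and the timeout and poll interval are illustrative defaults rather than values taken from ArchiveBox.

```python
import time
from pathlib import Path


def wait_for_page_loaded(chrome_dir, timeout: float = 30.0, poll_interval: float = 0.1) -> bool:
    """Poll for chrome/page_loaded.txt; return False if it never appears in time."""
    marker = Path(chrome_dir) / "page_loaded.txt"
    deadline = time.monotonic() + timeout
    while True:
        if marker.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

An output hook would call wait_for_page_loaded("../chrome") before finalizing and treat a False return as a navigation failure.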
Design Principles
- One ArchiveResult Per Plugin
  - Each plugin produces exactly ONE output/ArchiveResult
  - Infrastructure plugins (like chrome) produce NO ArchiveResult
- Chrome as Infrastructure
  - Provides shared CDP connection, PIDs, navigation state
  - No ArchiveResult output of its own
  - Single output directory for all infrastructure files
- Background Hooks for CDP
  - Hooks that need CDP listeners BEFORE navigation are background (.bg.js)
  - They capture events from the initial request/response
  - Stay alive through navigation and cleanup
- Foreground for Synchronization
  - chrome_navigate.js is foreground (not .bg)
  - Provides synchronization point - blocks until page loaded
  - All other hooks wait for its completion marker
- Automatic Cleanup
  - State machines handle background hook cleanup
  - No manual cleanup hooks needed
  - SIGTERM for graceful exit, SIGKILL as backup
- Clear Separation
  - Infrastructure vs outputs
  - One output directory per plugin
  - Predictable, maintainable architecture
Benefits
✓ Architectural Clarity - Clear separation between infrastructure and outputs
✓ Correct Output Model - One ArchiveResult per plugin
✓ Better Performance - CDP listeners capture data from initial request
✓ No Duplication - Single Chrome infrastructure used by all
✓ Proper Lifecycle - Background hooks cleaned up automatically
✓ Maintainable - Easy to understand, debug, and extend
✓ Consistent - All background hooks follow same pattern
✓ Observable - Navigation state tracked for debugging
Testing
Run tests:
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/plugins/chrome/tests/ -v'
Migration Notes
For developers:
- Chrome infrastructure is now in chrome/ output dir (not chrome_session/)
- Reference ../chrome/cdp_url.txt from output plugins
- Navigation marker is ../chrome/page_loaded.txt
- Navigation details in ../chrome/navigation.json
For users:
- No user-facing changes
- Output structure remains the same
- All extractors continue to work