Update extractors to use Puppeteer with Chrome DevTools Protocol

Switched the screenshot, title, and headers extractors from their previous implementations
(Python/Playwright, plain Node.js HTTP, and Bash/curl, respectively) to Puppeteer connected
to Chrome over the Chrome DevTools Protocol (CDP).

Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via the CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency (see the sketch below)
- Added puppeteer-core as a dependency
- Removed auto-install logic (cleaner, more predictable)
- Clearer error messages when CHROME_CDP_URL is not set
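
A minimal sketch of the connection pattern the three extractors now share (the `withSharedBrowser` helper name is illustrative and not part of the codebase):

```js
// Sketch only: each extractor connects to the shared Chrome over CDP, works in
// its own page, then disconnects so the browser stays alive for the others.
const puppeteer = require('puppeteer-core');

async function withSharedBrowser(work) {
  const cdpUrl = process.env.CHROME_CDP_URL;
  if (!cdpUrl) {
    throw new Error('CHROME_CDP_URL is not set; start Chrome with --remote-debugging-port and export the WebSocket URL');
  }
  const browser = await puppeteer.connect({ browserWSEndpoint: cdpUrl });
  try {
    const page = await browser.newPage();
    try {
      return await work(page);
    } finally {
      await page.close();
    }
  } finally {
    await browser.disconnect(); // disconnect, not close(): other extractors share this Chrome
  }
}
```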

Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Supports remote or containerized Chrome
- Better suited to production deployments

Breaking changes:
- The CHROME_CDP_URL environment variable is now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
  chrome --remote-debugging-port=9222 --headless

Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
Claude
2025-11-03 19:10:55 +00:00
parent f4bb10bdae
commit ee1db04b73
6 changed files with 1290 additions and 213 deletions


@@ -40,9 +40,10 @@ archivebox-ts/
### Prerequisites
- Node.js 18+ and npm
- Chrome or Chromium browser (for screenshot, title, and headers extractors)
- For specific extractors:
- `wget` extractor: wget
- `screenshot` extractor: Python 3 + Playwright
- `screenshot`, `title`, `headers` extractors: puppeteer-core + Chrome with remote debugging
### Setup
@@ -57,6 +58,15 @@ npm run build
# Initialize ArchiveBox
node dist/cli.js init
# Start Chrome with remote debugging (required for screenshot, title, headers extractors)
# In a separate terminal:
chrome --remote-debugging-port=9222 --headless
# Or on Linux:
chromium --remote-debugging-port=9222 --headless
# Set the CDP URL environment variable to Chrome's WebSocket endpoint
export CHROME_CDP_URL=$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)
```
## Usage
@@ -71,16 +81,35 @@ node dist/cli.js init
### Add a URL
First, make sure Chrome is running with remote debugging and CHROME_CDP_URL is set:
```bash
# Terminal 1: Start Chrome
chrome --remote-debugging-port=9222 --headless
# Terminal 2: Get the WebSocket URL
curl http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl
# Set the environment variable (use the URL from above)
export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."
```
Archive a URL with all available extractors:
```bash
node dist/cli.js add https://example.com
```
Archive with specific extractors:
Archive with specific extractors (favicon and wget don't need Chrome):
```bash
node dist/cli.js add https://example.com --extractors favicon,title,headers
node dist/cli.js add https://example.com --extractors favicon,wget
```
Archive with Chrome-based extractors:
```bash
node dist/cli.js add https://example.com --extractors title,headers,screenshot
```
Add with custom title:
@@ -285,44 +314,116 @@ print("output.txt")
- **Language**: Bash
- **Dependencies**: curl (auto-installed)
- **Output**: `favicon.ico` or `favicon.png`
- **Requires Chrome**: No
- **Config**:
- `FAVICON_TIMEOUT` - Timeout in seconds (default: 10)
### title
- **Language**: Node.js
- **Dependencies**: Built-in Node.js modules
- **Language**: Node.js + Puppeteer
- **Dependencies**: puppeteer-core, Chrome browser via CDP
- **Output**: `title.txt`
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
- **Config**:
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
- `TITLE_TIMEOUT` - Timeout in milliseconds (default: 10000)
- `TITLE_USER_AGENT` - User agent string
### headers
- **Language**: Bash
- **Dependencies**: curl (auto-installed)
- **Language**: Node.js + Puppeteer
- **Dependencies**: puppeteer-core, Chrome browser via CDP
- **Output**: `headers.json`
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
- **Config**:
- `HEADERS_TIMEOUT` - Timeout in seconds (default: 10)
- `HEADERS_USER_AGENT` - User agent string
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
- `HEADERS_TIMEOUT` - Timeout in milliseconds (default: 10000)
### wget
- **Language**: Bash
- **Dependencies**: wget (auto-installed)
- **Output**: `warc/archive.warc.gz` and downloaded files
- **Requires Chrome**: No
- **Config**:
- `WGET_TIMEOUT` - Timeout in seconds (default: 60)
- `WGET_USER_AGENT` - User agent string
- `WGET_ARGS` - Additional wget arguments
### screenshot
- **Language**: Python
- **Dependencies**: playwright (auto-installed)
- **Language**: Node.js + Puppeteer
- **Dependencies**: puppeteer-core, Chrome browser via CDP
- **Output**: `screenshot.png`
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
- **Config**:
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
- `SCREENSHOT_TIMEOUT` - Timeout in milliseconds (default: 30000)
- `SCREENSHOT_WIDTH` - Viewport width (default: 1920)
- `SCREENSHOT_HEIGHT` - Viewport height (default: 1080)
- `SCREENSHOT_WAIT` - Wait time before screenshot in ms (default: 1000)
## Setting up Chrome for Remote Debugging
The `screenshot`, `title`, and `headers` extractors require a Chrome browser accessible via the Chrome DevTools Protocol (CDP). This allows multiple extractors to share a single browser instance.
### Start Chrome with Remote Debugging
```bash
# Linux/Mac
chromium --remote-debugging-port=9222 --headless --disable-gpu
# Or with Chrome
chrome --remote-debugging-port=9222 --headless --disable-gpu
# Windows
chrome.exe --remote-debugging-port=9222 --headless --disable-gpu
```
### Get the WebSocket URL
```bash
# Query the Chrome instance for the WebSocket URL
curl http://localhost:9222/json/version
# Example output:
# {
# "Browser": "Chrome/120.0.0.0",
# "Protocol-Version": "1.3",
# "User-Agent": "Mozilla/5.0...",
# "V8-Version": "12.0.267.8",
# "WebKit-Version": "537.36",
# "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/..."
# }
```
### Set the Environment Variable
```bash
# Extract just the WebSocket URL
export CHROME_CDP_URL=$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)
# Or set it manually
export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/12345678-1234-1234-1234-123456789abc"
# Verify it's set
echo $CHROME_CDP_URL
```
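
If `jq` is not available, the same lookup works with Node 18's built-in `fetch`. A small sketch (the `resolve-cdp-url.js` filename is just an example):

```js
// resolve-cdp-url.js: print the WebSocket endpoint of a Chrome instance started
// with --remote-debugging-port=9222 (adjust host/port to match your setup).
(async () => {
  const res = await fetch('http://localhost:9222/json/version');
  const { webSocketDebuggerUrl } = await res.json();
  console.log(webSocketDebuggerUrl);
})();
```

Then `export CHROME_CDP_URL=$(node resolve-cdp-url.js)` works the same way as the `jq` version above.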
### Docker Setup
For running in Docker, you can use a separate Chrome container:
```bash
# Start Chrome in a container
docker run -d --name chrome \
-p 9222:9222 \
browserless/chrome:latest \
--remote-debugging-port=9222 \
--remote-debugging-address=0.0.0.0
# Get the CDP URL from the container
export CHROME_CDP_URL=$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)
# Run archivebox-ts
node dist/cli.js add https://example.com
```
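
To confirm the containerized Chrome is reachable before archiving, a quick connectivity check with puppeteer-core (a sketch; the `check-cdp.js` name is illustrative):

```js
// check-cdp.js: verify that CHROME_CDP_URL points at a live Chrome instance.
const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: process.env.CHROME_CDP_URL,
  });
  console.log('Connected to', await browser.version()); // prints the Chrome build string
  await browser.disconnect();
})();
```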
## Development
### Build


@@ -1,85 +1,129 @@
#!/bin/bash
#
# Headers Extractor
# Extracts HTTP headers from a given URL
#
# Usage: headers <url>
# Output: headers.json in current directory
# Config: All configuration via environment variables
# HEADERS_TIMEOUT - Timeout in seconds (default: 10)
# HEADERS_USER_AGENT - User agent string
#
#!/usr/bin/env node
//
// Headers Extractor
// Extracts HTTP headers from a given URL using Puppeteer
//
// Usage: headers <url>
// Output: headers.json in current directory
// Config: All configuration via environment variables
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
// Required: the extractor exits with an error if it is not set
// HEADERS_TIMEOUT - Timeout in milliseconds (default: 10000)
//
set -e
const { spawn } = require('child_process');
const fs = require('fs');
URL="$1"
// Check if puppeteer is available
function checkPuppeteer() {
try {
require.resolve('puppeteer-core');
return true;
} catch (e) {
console.error('Error: puppeteer-core is not installed.');
console.error('Please install it with: npm install puppeteer-core');
console.error('Or install chromium: npm install puppeteer');
return false;
}
}
if [ -z "$URL" ]; then
echo "Error: URL argument required" >&2
exit 1
fi
async function main() {
const url = process.argv[2];
# Auto-install dependencies
if ! command -v curl &> /dev/null; then
echo "Installing curl..." >&2
if command -v apt-get &> /dev/null; then
sudo apt-get update && sudo apt-get install -y curl
elif command -v yum &> /dev/null; then
sudo yum install -y curl
elif command -v brew &> /dev/null; then
brew install curl
else
echo "Error: Cannot install curl. Please install manually." >&2
exit 1
fi
fi
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
}
# Configuration from environment
TIMEOUT="${HEADERS_TIMEOUT:-10}"
USER_AGENT="${HEADERS_USER_AGENT:-Mozilla/5.0 (compatible; ArchiveBox-TS/0.1)}"
// Configuration from environment
const cdpUrl = process.env.CHROME_CDP_URL;
const timeout = parseInt(process.env.HEADERS_TIMEOUT || '10000', 10);
echo "Extracting headers from: $URL" >&2
console.error(`Extracting headers from: ${url}`);
if (cdpUrl) {
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
}
# Get headers using curl
HEADERS=$(curl -I -L -s --max-time "$TIMEOUT" --user-agent "$USER_AGENT" "$URL" 2>&1 || echo "")
// Check puppeteer is installed
if (!checkPuppeteer()) {
process.exit(1);
}
if [ -z "$HEADERS" ]; then
echo "Error: Failed to fetch headers" >&2
exit 1
fi
const puppeteer = require('puppeteer-core');
# Convert headers to JSON format (simple key-value pairs)
echo "{" > headers.json
let browser = null;
let shouldCloseBrowser = false;
# Parse headers line by line
FIRST=1
while IFS=: read -r key value; do
# Skip empty lines and HTTP status line
if [ -z "$key" ] || [[ "$key" =~ ^HTTP ]]; then
continue
fi
try {
// Connect to the shared Chrome instance via CDP (required)
if (cdpUrl) {
browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl
});
} else {
console.error('Error: CHROME_CDP_URL environment variable not set.');
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
console.error('');
console.error('To start Chrome with remote debugging:');
console.error(' chrome --remote-debugging-port=9222 --headless');
console.error(' chromium --remote-debugging-port=9222 --headless');
process.exit(1);
}
# Clean up key and value
key=$(echo "$key" | tr -d '\r\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
value=$(echo "$value" | tr -d '\r\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
const page = await browser.newPage();
if [ -n "$key" ] && [ -n "$value" ]; then
# Escape quotes in value
value=$(echo "$value" | sed 's/"/\\"/g')
let capturedHeaders = null;
# Add comma if not first entry
if [ "$FIRST" -eq 0 ]; then
echo "," >> headers.json
fi
// Listen for response to capture headers
page.on('response', async (response) => {
if (response.url() === url) {
capturedHeaders = response.headers();
}
});
echo -n " \"$key\": \"$value\"" >> headers.json
FIRST=0
fi
done <<< "$HEADERS"
// Navigate to URL
await page.goto(url, {
timeout,
waitUntil: 'domcontentloaded'
});
echo "" >> headers.json
echo "}" >> headers.json
await page.close();
echo "✓ Extracted headers to headers.json" >&2
echo "headers.json"
exit 0
if (capturedHeaders) {
// Write to file
fs.writeFileSync('headers.json', JSON.stringify(capturedHeaders, null, 2), 'utf8');
console.error('✓ Extracted headers to headers.json');
console.log('headers.json');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(0);
} else {
console.error('Warning: Could not capture headers');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(1);
}
} catch (err) {
console.error(`Error: ${err.message}`);
if (browser && shouldCloseBrowser) {
try {
await browser.close();
} catch (closeErr) {
// Ignore close errors
}
}
process.exit(1);
}
}
main();
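
One caveat with the header capture above: the `response.url() === url` comparison is fragile, because Chrome normalizes URLs (for example `https://example.com` becomes `https://example.com/`) and redirects change the final URL, so the listener can miss the response entirely. A more robust approach, sketched here under the same puppeteer-core setup, is to use the response object that `page.goto()` itself returns:

```js
// Sketch: page.goto() resolves with the response for the main resource, even
// after redirects or URL normalization, so no response listener is needed.
const response = await page.goto(url, { timeout, waitUntil: 'domcontentloaded' });
const capturedHeaders = response ? response.headers() : null;
```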


@@ -1,77 +1,125 @@
#!/usr/bin/env python3
#
# Screenshot Extractor
# Captures a screenshot of a given URL using Playwright
#
# Usage: screenshot <url>
# Output: screenshot.png in current directory
# Config: All configuration via environment variables
# SCREENSHOT_TIMEOUT - Timeout in milliseconds (default: 30000)
# SCREENSHOT_WIDTH - Viewport width (default: 1920)
# SCREENSHOT_HEIGHT - Viewport height (default: 1080)
# SCREENSHOT_WAIT - Time to wait before screenshot in ms (default: 1000)
#
#!/usr/bin/env node
//
// Screenshot Extractor
// Captures a screenshot of a given URL using Puppeteer
//
// Usage: screenshot <url>
// Output: screenshot.png in current directory
// Config: All configuration via environment variables
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
// Required: the extractor exits with an error if it is not set
// SCREENSHOT_TIMEOUT - Timeout in milliseconds (default: 30000)
// SCREENSHOT_WIDTH - Viewport width (default: 1920)
// SCREENSHOT_HEIGHT - Viewport height (default: 1080)
// SCREENSHOT_WAIT - Time to wait before screenshot in ms (default: 1000)
//
import sys
import os
import subprocess
const { spawn } = require('child_process');
const fs = require('fs');
def ensure_playwright():
"""Auto-install playwright if not available"""
try:
from playwright.sync_api import sync_playwright
return True
except ImportError:
print("Installing playwright...", file=sys.stderr)
try:
subprocess.check_call([sys.executable, "-m", "pip", "install", "playwright"])
subprocess.check_call([sys.executable, "-m", "playwright", "install", "chromium"])
from playwright.sync_api import sync_playwright
return True
except Exception as e:
print(f"Error installing playwright: {e}", file=sys.stderr)
return False
// Check if puppeteer is available
function checkPuppeteer() {
try {
require.resolve('puppeteer-core');
return true;
} catch (e) {
console.error('Error: puppeteer-core is not installed.');
console.error('Please install it with: npm install puppeteer-core');
console.error('Or install chromium: npm install puppeteer');
return false;
}
}
def main():
if len(sys.argv) < 2:
print("Error: URL argument required", file=sys.stderr)
sys.exit(1)
async function main() {
const url = process.argv[2];
url = sys.argv[1]
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
}
# Configuration from environment
timeout = int(os.environ.get('SCREENSHOT_TIMEOUT', '30000'))
width = int(os.environ.get('SCREENSHOT_WIDTH', '1920'))
height = int(os.environ.get('SCREENSHOT_HEIGHT', '1080'))
wait = int(os.environ.get('SCREENSHOT_WAIT', '1000'))
// Configuration from environment
const cdpUrl = process.env.CHROME_CDP_URL;
const timeout = parseInt(process.env.SCREENSHOT_TIMEOUT || '30000', 10);
const width = parseInt(process.env.SCREENSHOT_WIDTH || '1920', 10);
const height = parseInt(process.env.SCREENSHOT_HEIGHT || '1080', 10);
const wait = parseInt(process.env.SCREENSHOT_WAIT || '1000', 10);
print(f"Capturing screenshot of: {url}", file=sys.stderr)
console.error(`Capturing screenshot of: ${url}`);
if (cdpUrl) {
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
}
# Ensure playwright is installed
if not ensure_playwright():
sys.exit(1)
// Check puppeteer is installed
if (!checkPuppeteer()) {
process.exit(1);
}
from playwright.sync_api import sync_playwright
const puppeteer = require('puppeteer-core');
try:
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page(viewport={'width': width, 'height': height})
page.goto(url, timeout=timeout, wait_until='networkidle')
let browser = null;
let shouldCloseBrowser = false;
# Wait a bit for any dynamic content
page.wait_for_timeout(wait)
try {
// Connect to the shared Chrome instance via CDP (required)
if (cdpUrl) {
browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl,
defaultViewport: { width, height }
});
} else {
console.error('Error: CHROME_CDP_URL environment variable not set.');
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
console.error('');
console.error('To start Chrome with remote debugging:');
console.error(' chrome --remote-debugging-port=9222 --headless');
console.error(' chromium --remote-debugging-port=9222 --headless');
process.exit(1);
}
page.screenshot(path='screenshot.png', full_page=True)
browser.close()
const page = await browser.newPage();
await page.setViewport({ width, height });
print("✓ Captured screenshot: screenshot.png", file=sys.stderr)
print("screenshot.png")
sys.exit(0)
// Navigate to URL
await page.goto(url, {
timeout,
waitUntil: 'networkidle2'
});
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
// Wait a bit for any dynamic content
await new Promise((resolve) => setTimeout(resolve, wait)); // page.waitForTimeout was removed in newer Puppeteer versions
if __name__ == '__main__':
main()
// Take screenshot
await page.screenshot({
path: 'screenshot.png',
fullPage: true
});
await page.close();
console.error('✓ Captured screenshot: screenshot.png');
console.log('screenshot.png');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(0);
} catch (err) {
console.error(`Error: ${err.message}`);
if (browser && shouldCloseBrowser) {
try {
await browser.close();
} catch (closeErr) {
// Ignore close errors
}
}
process.exit(1);
}
}
main();


@@ -1,89 +1,123 @@
#!/usr/bin/env node
//
// Title Extractor
// Extracts the page title from a given URL
// Extracts the page title from a given URL using Puppeteer
//
// Usage: title <url>
// Output: title.txt in current directory
// Config: All configuration via environment variables
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
// Required: the extractor exits with an error if it is not set
// TITLE_TIMEOUT - Timeout in milliseconds (default: 10000)
// TITLE_USER_AGENT - User agent string
//
const https = require('https');
const http = require('http');
const { spawn } = require('child_process');
const fs = require('fs');
const { URL } = require('url');
const url = process.argv[2];
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
// Check if puppeteer is available
function checkPuppeteer() {
try {
require.resolve('puppeteer-core');
return true;
} catch (e) {
console.error('Error: puppeteer-core is not installed.');
console.error('Please install it with: npm install puppeteer-core');
console.error('Or install chromium: npm install puppeteer');
return false;
}
}
// Configuration from environment
const TIMEOUT = parseInt(process.env.TITLE_TIMEOUT || '10000', 10);
const USER_AGENT = process.env.TITLE_USER_AGENT || 'Mozilla/5.0 (compatible; ArchiveBox-TS/0.1)';
async function main() {
const url = process.argv[2];
console.error(`Extracting title from: ${url}`);
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
}
// Parse URL
let parsedUrl;
try {
parsedUrl = new URL(url);
} catch (err) {
console.error(`Error: Invalid URL: ${err.message}`);
process.exit(1);
}
// Configuration from environment
const cdpUrl = process.env.CHROME_CDP_URL;
const timeout = parseInt(process.env.TITLE_TIMEOUT || '10000', 10);
// Choose http or https module
const client = parsedUrl.protocol === 'https:' ? https : http;
console.error(`Extracting title from: ${url}`);
if (cdpUrl) {
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
}
// Make request
const options = {
headers: {
'User-Agent': USER_AGENT,
},
timeout: TIMEOUT,
};
// Check puppeteer is installed
if (!checkPuppeteer()) {
process.exit(1);
}
client.get(url, options, (res) => {
let html = '';
const puppeteer = require('puppeteer-core');
res.on('data', (chunk) => {
html += chunk;
let browser = null;
let shouldCloseBrowser = false;
// Early exit if we found the title (optimization)
if (html.includes('</title>')) {
res.destroy();
}
});
res.on('end', () => {
// Extract title using regex
const titleMatch = html.match(/<title[^>]*>(.*?)<\/title>/is);
if (titleMatch && titleMatch[1]) {
const title = titleMatch[1]
.replace(/<[^>]*>/g, '') // Remove any HTML tags
.replace(/\s+/g, ' ') // Normalize whitespace
.trim();
// Write to file
fs.writeFileSync('title.txt', title, 'utf8');
console.error(`✓ Extracted title: ${title}`);
console.log('title.txt');
process.exit(0);
try {
// Connect to the shared Chrome instance via CDP (required)
if (cdpUrl) {
browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl
});
} else {
console.error('Warning: Could not find title tag');
console.error('Error: CHROME_CDP_URL environment variable not set.');
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
console.error('');
console.error('To start Chrome with remote debugging:');
console.error(' chrome --remote-debugging-port=9222 --headless');
console.error(' chromium --remote-debugging-port=9222 --headless');
process.exit(1);
}
});
}).on('error', (err) => {
console.error(`Error: ${err.message}`);
process.exit(1);
}).on('timeout', () => {
console.error('Error: Request timeout');
process.exit(1);
});
const page = await browser.newPage();
// Navigate to URL
await page.goto(url, {
timeout,
waitUntil: 'domcontentloaded'
});
// Get the title
const title = await page.title();
await page.close();
if (title && title.trim()) {
// Write to file
fs.writeFileSync('title.txt', title.trim(), 'utf8');
console.error(`✓ Extracted title: ${title.trim()}`);
console.log('title.txt');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(0);
} else {
console.error('Warning: Could not find title');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(1);
}
} catch (err) {
console.error(`Error: ${err.message}`);
if (browser && shouldCloseBrowser) {
try {
await browser.close();
} catch (closeErr) {
// Ignore close errors
}
}
process.exit(1);
}
}
main();

File diff suppressed because it is too large


@@ -22,7 +22,8 @@
"dependencies": {
"better-sqlite3": "^11.0.0",
"commander": "^12.0.0",
"nanoid": "^3.3.7"
"nanoid": "^3.3.7",
"puppeteer-core": "^24.28.0"
},
"devDependencies": {
"@types/better-sqlite3": "^7.6.9",