mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-01-05 18:35:50 +10:00
Update extractors to use Puppeteer with Chrome DevTools Protocol
Changed the screenshot, title, and headers extractors from various implementations to use Puppeteer connecting to Chrome via CDP.

Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency
- Added puppeteer-core as dependency (npm install)
- Removed auto-install logic (cleaner, more predictable)
- Better error messages when CHROME_CDP_URL not set

Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Can use remote/containerized Chrome
- Better for production deployments

Breaking changes:
- CHROME_CDP_URL environment variable now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
  chrome --remote-debugging-port=9222 --headless

Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
@@ -40,9 +40,10 @@ archivebox-ts/
|
||||
### Prerequisites
|
||||
|
||||
- Node.js 18+ and npm
|
||||
- Chrome or Chromium browser (for screenshot, title, and headers extractors)
|
||||
- For specific extractors:
|
||||
- `wget` extractor: wget
|
||||
- `screenshot` extractor: Python 3 + Playwright
|
||||
- `screenshot`, `title`, `headers` extractors: puppeteer-core + Chrome with remote debugging
|
||||
|
||||
### Setup
|
||||
|
||||
@@ -57,6 +58,15 @@ npm run build
|
||||
|
||||
# Initialize ArchiveBox
|
||||
node dist/cli.js init
|
||||
|
||||
# Start Chrome with remote debugging (required for screenshot, title, headers extractors)
|
||||
# In a separate terminal:
|
||||
chrome --remote-debugging-port=9222 --headless
|
||||
# Or on Linux:
|
||||
chromium --remote-debugging-port=9222 --headless
|
||||
|
||||
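# Set the CDP URL environment variable (must be the ws:// WebSocket URL, not the http:// address)
export CHROME_CDP_URL="$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)"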
|
||||
```
|
||||
|
||||
## Usage
|
||||
@@ -71,16 +81,35 @@ node dist/cli.js init
|
||||
|
||||
### Add a URL
|
||||
|
||||
First, make sure Chrome is running with remote debugging and CHROME_CDP_URL is set:
|
||||
|
||||
```bash
|
||||
# Terminal 1: Start Chrome
|
||||
chrome --remote-debugging-port=9222 --headless
|
||||
|
||||
# Terminal 2: Get the WebSocket URL
|
||||
curl http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl
|
||||
|
||||
# Set the environment variable (use the URL from above)
|
||||
export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."
|
||||
```
|
||||
|
||||
Archive a URL with all available extractors:
|
||||
|
||||
```bash
|
||||
node dist/cli.js add https://example.com
|
||||
```
|
||||
|
||||
Archive with specific extractors:
|
||||
Archive with specific extractors (favicon and wget don't need Chrome):
|
||||
|
||||
```bash
|
||||
node dist/cli.js add https://example.com --extractors favicon,title,headers
|
||||
node dist/cli.js add https://example.com --extractors favicon,wget
|
||||
```
|
||||
|
||||
Archive with Chrome-based extractors:
|
||||
|
||||
```bash
|
||||
node dist/cli.js add https://example.com --extractors title,headers,screenshot
|
||||
```
|
||||
|
||||
Add with custom title:
|
||||
@@ -285,44 +314,116 @@ print("output.txt")
|
||||
- **Language**: Bash
|
||||
- **Dependencies**: curl (auto-installed)
|
||||
- **Output**: `favicon.ico` or `favicon.png`
|
||||
- **Requires Chrome**: No
|
||||
- **Config**:
|
||||
- `FAVICON_TIMEOUT` - Timeout in seconds (default: 10)
|
||||
|
||||
### title
|
||||
- **Language**: Node.js
|
||||
- **Dependencies**: Built-in Node.js modules
|
||||
- **Language**: Node.js + Puppeteer
|
||||
- **Dependencies**: puppeteer-core, Chrome browser via CDP
|
||||
- **Output**: `title.txt`
|
||||
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
|
||||
- **Config**:
|
||||
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
|
||||
- `TITLE_TIMEOUT` - Timeout in milliseconds (default: 10000)
|
||||
- `TITLE_USER_AGENT` - User agent string
|
||||
|
||||
### headers
|
||||
- **Language**: Bash
|
||||
- **Dependencies**: curl (auto-installed)
|
||||
- **Language**: Node.js + Puppeteer
|
||||
- **Dependencies**: puppeteer-core, Chrome browser via CDP
|
||||
- **Output**: `headers.json`
|
||||
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
|
||||
- **Config**:
|
||||
- `HEADERS_TIMEOUT` - Timeout in seconds (default: 10)
|
||||
- `HEADERS_USER_AGENT` - User agent string
|
||||
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
|
||||
- `HEADERS_TIMEOUT` - Timeout in milliseconds (default: 10000)
|
||||
|
||||
### wget
|
||||
- **Language**: Bash
|
||||
- **Dependencies**: wget (auto-installed)
|
||||
- **Output**: `warc/archive.warc.gz` and downloaded files
|
||||
- **Requires Chrome**: No
|
||||
- **Config**:
|
||||
- `WGET_TIMEOUT` - Timeout in seconds (default: 60)
|
||||
- `WGET_USER_AGENT` - User agent string
|
||||
- `WGET_ARGS` - Additional wget arguments
|
||||
|
||||
### screenshot
|
||||
- **Language**: Python
|
||||
- **Dependencies**: playwright (auto-installed)
|
||||
- **Language**: Node.js + Puppeteer
|
||||
- **Dependencies**: puppeteer-core, Chrome browser via CDP
|
||||
- **Output**: `screenshot.png`
|
||||
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
|
||||
- **Config**:
|
||||
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
|
||||
- `SCREENSHOT_TIMEOUT` - Timeout in milliseconds (default: 30000)
|
||||
- `SCREENSHOT_WIDTH` - Viewport width (default: 1920)
|
||||
- `SCREENSHOT_HEIGHT` - Viewport height (default: 1080)
|
||||
- `SCREENSHOT_WAIT` - Wait time before screenshot in ms (default: 1000)
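
The screenshot settings above are all integer environment variables with defaults. As a minimal sketch of how they could be read with validation (the helper names `readIntEnv` and `readScreenshotConfig` are hypothetical and not part of this commit; the extractor itself uses plain `parseInt`):

```javascript
// Hypothetical helper: read an integer setting from an env-like object,
// falling back to a default when the variable is unset or empty.
function readIntEnv(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') {
    return fallback;
  }
  const value = Number.parseInt(raw, 10);
  if (Number.isNaN(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Defaults mirror the documented values for the screenshot extractor.
function readScreenshotConfig(env) {
  return {
    timeout: readIntEnv(env, 'SCREENSHOT_TIMEOUT', 30000),
    width: readIntEnv(env, 'SCREENSHOT_WIDTH', 1920),
    height: readIntEnv(env, 'SCREENSHOT_HEIGHT', 1080),
    wait: readIntEnv(env, 'SCREENSHOT_WAIT', 1000),
  };
}

// Usage: const config = readScreenshotConfig(process.env);
```

Failing early on a non-numeric value avoids passing `NaN` to the viewport or timeout options, which is what a bare `parseInt` would do.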
|
||||
|
||||
## Setting up Chrome for Remote Debugging

The `screenshot`, `title`, and `headers` extractors require a Chrome browser accessible via the Chrome DevTools Protocol (CDP). This allows multiple extractors to share a single browser instance.
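
Because the extractors pass `CHROME_CDP_URL` to Puppeteer as a WebSocket endpoint, an `http://` address will not work there. The following is a minimal sketch of a pre-flight check; `checkCdpUrl` is a hypothetical helper, not something this commit adds:

```javascript
// Hypothetical helper: verify that a CHROME_CDP_URL value is a WebSocket URL
// before attempting to connect. Returns { ok, reason }.
function checkCdpUrl(value) {
  if (!value) {
    return { ok: false, reason: 'CHROME_CDP_URL is not set' };
  }
  let parsed;
  try {
    parsed = new URL(value);
  } catch (err) {
    return { ok: false, reason: `not a valid URL: ${value}` };
  }
  if (parsed.protocol !== 'ws:' && parsed.protocol !== 'wss:') {
    return { ok: false, reason: `expected a ws:// or wss:// URL, got ${parsed.protocol}//` };
  }
  return { ok: true, reason: '' };
}

// Usage: const { ok, reason } = checkCdpUrl(process.env.CHROME_CDP_URL);
```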

### Start Chrome with Remote Debugging

```bash
# Linux/Mac
chromium --remote-debugging-port=9222 --headless --disable-gpu

# Or with Chrome
chrome --remote-debugging-port=9222 --headless --disable-gpu

# Windows
chrome.exe --remote-debugging-port=9222 --headless --disable-gpu
```

### Get the WebSocket URL

```bash
# Query the Chrome instance for the WebSocket URL
curl http://localhost:9222/json/version

# Example output:
# {
#   "Browser": "Chrome/120.0.0.0",
#   "Protocol-Version": "1.3",
#   "User-Agent": "Mozilla/5.0...",
#   "V8-Version": "12.0.267.8",
#   "WebKit-Version": "537.36",
#   "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/..."
# }
```

### Set the Environment Variable

```bash
# Extract just the WebSocket URL
export CHROME_CDP_URL=$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)

# Or set it manually
export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/12345678-1234-1234-1234-123456789abc"

# Verify it's set
echo $CHROME_CDP_URL
```
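
The same lookup can be done from Node.js without `curl` and `jq`. This is a sketch only, assuming Node.js 18+ (for the built-in `fetch`) and Chrome listening on `http://localhost:9222`; the function names are hypothetical:

```javascript
// Hypothetical helper: pull the WebSocket URL out of a parsed
// /json/version response.
function extractWebSocketUrl(versionInfo) {
  const wsUrl = versionInfo && versionInfo.webSocketDebuggerUrl;
  if (typeof wsUrl !== 'string' || wsUrl === '') {
    throw new Error('webSocketDebuggerUrl missing from /json/version response');
  }
  return wsUrl;
}

// Hypothetical helper: ask a running Chrome for its browser-level
// WebSocket endpoint. The default address is an assumption.
async function discoverCdpUrl(debugHttpUrl = 'http://localhost:9222') {
  const res = await fetch(new URL('/json/version', debugHttpUrl));
  if (!res.ok) {
    throw new Error(`Chrome debug endpoint returned HTTP ${res.status}`);
  }
  return extractWebSocketUrl(await res.json());
}

// Usage: discoverCdpUrl().then((wsUrl) => console.log(wsUrl));
```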

### Docker Setup

For running in Docker, you can use a separate Chrome container:

```bash
# Start Chrome in a container
docker run -d --name chrome \
  -p 9222:9222 \
  browserless/chrome:latest \
  --remote-debugging-port=9222 \
  --remote-debugging-address=0.0.0.0

# Get the CDP URL (the sixth "/"-separated field of the WebSocket URL is the browser ID)
export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl | cut -d'/' -f6-)"

# Run archivebox-ts
node dist/cli.js add https://example.com
```
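
The Docker step keeps the browser ID but swaps the host, since the address Chrome reports from inside the container may not be reachable from the host. A minimal sketch of that rewrite using the WHATWG `URL` class (the helper name `rewriteWsHost` is hypothetical):

```javascript
// Hypothetical helper: keep the path (including the browser ID) of a
// WebSocket debugger URL, but replace its host and port.
function rewriteWsHost(wsUrl, reachableHost) {
  const url = new URL(wsUrl);
  url.host = reachableHost; // e.g. "localhost:9222"
  return url.href;
}

// Usage:
// rewriteWsHost('ws://0.0.0.0:9222/devtools/browser/abc', 'localhost:9222')
```

Parsing the URL avoids depending on the position of `/`-separated fields, which is easy to get wrong by one.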
|
||||
|
||||
## Development
|
||||
|
||||
### Build
|
||||
|
||||
@@ -1,85 +1,129 @@
|
||||
#!/bin/bash
|
||||
#
|
||||
# Headers Extractor
|
||||
# Extracts HTTP headers from a given URL
|
||||
#
|
||||
# Usage: headers <url>
|
||||
# Output: headers.json in current directory
|
||||
# Config: All configuration via environment variables
|
||||
# HEADERS_TIMEOUT - Timeout in seconds (default: 10)
|
||||
# HEADERS_USER_AGENT - User agent string
|
||||
#
|
||||
#!/usr/bin/env node
|
||||
//
|
||||
// Headers Extractor
|
||||
// Extracts HTTP headers from a given URL using Puppeteer
|
||||
//
|
||||
// Usage: headers <url>
|
||||
// Output: headers.json in current directory
|
||||
// Config: All configuration via environment variables
|
||||
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
|
||||
// Required: the extractor exits with an error if not set
|
||||
// HEADERS_TIMEOUT - Timeout in milliseconds (default: 10000)
|
||||
//
|
||||
|
||||
set -e
|
||||
const { spawn } = require('child_process');
|
||||
const fs = require('fs');
|
||||
|
||||
URL="$1"
|
||||
// Check if puppeteer is available
|
||||
function checkPuppeteer() {
|
||||
try {
|
||||
require.resolve('puppeteer-core');
|
||||
return true;
|
||||
} catch (e) {
|
||||
console.error('Error: puppeteer-core is not installed.');
|
||||
console.error('Please install it with: npm install puppeteer-core');
|
||||
console.error('Or install chromium: npm install puppeteer');
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
if [ -z "$URL" ]; then
|
||||
echo "Error: URL argument required" >&2
|
||||
exit 1
|
||||
fi
|
||||
async function main() {
|
||||
const url = process.argv[2];
|
||||
|
||||
# Auto-install dependencies
|
||||
if ! command -v curl &> /dev/null; then
|
||||
echo "Installing curl..." >&2
|
||||
if command -v apt-get &> /dev/null; then
|
||||
sudo apt-get update && sudo apt-get install -y curl
|
||||
elif command -v yum &> /dev/null; then
|
||||
sudo yum install -y curl
|
||||
elif command -v brew &> /dev/null; then
|
||||
brew install curl
|
||||
else
|
||||
echo "Error: Cannot install curl. Please install manually." >&2
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
if (!url) {
|
||||
console.error('Error: URL argument required');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
# Configuration from environment
|
||||
TIMEOUT="${HEADERS_TIMEOUT:-10}"
|
||||
USER_AGENT="${HEADERS_USER_AGENT:-Mozilla/5.0 (compatible; ArchiveBox-TS/0.1)}"
|
||||
// Configuration from environment
|
||||
const cdpUrl = process.env.CHROME_CDP_URL;
|
||||
const timeout = parseInt(process.env.HEADERS_TIMEOUT || '10000', 10);
|
||||
|
||||
echo "Extracting headers from: $URL" >&2
|
||||
console.error(`Extracting headers from: ${url}`);
|
||||
if (cdpUrl) {
|
||||
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
|
||||
}
|
||||
|
||||
# Get headers using curl
|
||||
HEADERS=$(curl -I -L -s --max-time "$TIMEOUT" --user-agent "$USER_AGENT" "$URL" 2>&1 || echo "")
|
||||
// Check puppeteer is installed
|
||||
if (!checkPuppeteer()) {
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
if [ -z "$HEADERS" ]; then
|
||||
echo "Error: Failed to fetch headers" >&2
|
||||
exit 1
|
||||
fi
|
||||
const puppeteer = require('puppeteer-core');
|
||||
|
||||
# Convert headers to JSON format (simple key-value pairs)
|
||||
echo "{" > headers.json
|
||||
let browser = null;
|
||||
let shouldCloseBrowser = false;
|
||||
|
||||
# Parse headers line by line
|
||||
FIRST=1
|
||||
while IFS=: read -r key value; do
|
||||
# Skip empty lines and HTTP status line
|
||||
if [ -z "$key" ] || [[ "$key" =~ ^HTTP ]]; then
|
||||
continue
|
||||
fi
|
||||
try {
|
||||
// Connect to the CDP browser (CHROME_CDP_URL is required; no local launch)
|
||||
if (cdpUrl) {
|
||||
browser = await puppeteer.connect({
|
||||
browserWSEndpoint: cdpUrl
|
||||
});
|
||||
} else {
|
||||
console.error('Error: CHROME_CDP_URL environment variable not set.');
|
||||
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
|
||||
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
|
||||
console.error('');
|
||||
console.error('To start Chrome with remote debugging:');
|
||||
console.error(' chrome --remote-debugging-port=9222 --headless');
|
||||
console.error(' chromium --remote-debugging-port=9222 --headless');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
# Clean up key and value
|
||||
key=$(echo "$key" | tr -d '\r\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
|
||||
value=$(echo "$value" | tr -d '\r\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
|
||||
const page = await browser.newPage();
|
||||
|
||||
if [ -n "$key" ] && [ -n "$value" ]; then
|
||||
# Escape quotes in value
|
||||
value=$(echo "$value" | sed 's/"/\\"/g')
|
||||
let capturedHeaders = null;
|
||||
|
||||
# Add comma if not first entry
|
||||
if [ "$FIRST" -eq 0 ]; then
|
||||
echo "," >> headers.json
|
||||
fi
|
||||
// Listen for response to capture headers
|
||||
page.on('response', async (response) => {
|
||||
if (response.url() === url) {
|
||||
capturedHeaders = response.headers();
|
||||
}
|
||||
});
|
||||
|
||||
echo -n " \"$key\": \"$value\"" >> headers.json
|
||||
FIRST=0
|
||||
fi
|
||||
done <<< "$HEADERS"
|
||||
// Navigate to URL
|
||||
await page.goto(url, {
|
||||
timeout,
|
||||
waitUntil: 'domcontentloaded'
|
||||
});
|
||||
|
||||
echo "" >> headers.json
|
||||
echo "}" >> headers.json
|
||||
await page.close();
|
||||
|
||||
echo "✓ Extracted headers to headers.json" >&2
|
||||
echo "headers.json"
|
||||
exit 0
|
||||
if (capturedHeaders) {
|
||||
// Write to file
|
||||
fs.writeFileSync('headers.json', JSON.stringify(capturedHeaders, null, 2), 'utf8');
|
||||
console.error('✓ Extracted headers to headers.json');
|
||||
console.log('headers.json');
|
||||
|
||||
if (shouldCloseBrowser) {
|
||||
await browser.close();
|
||||
}
|
||||
|
||||
process.exit(0);
|
||||
} else {
|
||||
console.error('Warning: Could not capture headers');
|
||||
|
||||
if (shouldCloseBrowser) {
|
||||
await browser.close();
|
||||
}
|
||||
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
} catch (err) {
|
||||
console.error(`Error: ${err.message}`);
|
||||
|
||||
if (browser && shouldCloseBrowser) {
|
||||
try {
|
||||
await browser.close();
|
||||
} catch (closeErr) {
|
||||
// Ignore close errors
|
||||
}
|
||||
}
|
||||
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
main();
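
A note on the `response.url() === url` check in the headers extractor: Chrome reports normalized URLs, so an argument such as `https://example.com` may never equal the response URL `https://example.com/`, and redirects change the URL as well. A minimal sketch of a normalized comparison follows; `sameUrl` is a hypothetical helper. An alternative, also not used in this commit, is to read the headers from the response that `page.goto()` resolves to.

```javascript
// Hypothetical helper: compare two URLs after WHATWG normalization
// (lowercased host, "/" added for an empty path, default ports dropped).
function sameUrl(a, b) {
  try {
    return new URL(a).href === new URL(b).href;
  } catch (err) {
    // Anything that does not parse as a URL never matches.
    return false;
  }
}

// Usage inside a response listener:
// if (sameUrl(response.url(), url)) { capturedHeaders = response.headers(); }
```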
|
||||
|
||||
@@ -1,77 +1,125 @@
|
||||
#!/usr/bin/env python3
|
||||
#
|
||||
# Screenshot Extractor
|
||||
# Captures a screenshot of a given URL using Playwright
|
||||
#
|
||||
# Usage: screenshot <url>
|
||||
# Output: screenshot.png in current directory
|
||||
# Config: All configuration via environment variables
|
||||
# SCREENSHOT_TIMEOUT - Timeout in milliseconds (default: 30000)
|
||||
# SCREENSHOT_WIDTH - Viewport width (default: 1920)
|
||||
# SCREENSHOT_HEIGHT - Viewport height (default: 1080)
|
||||
# SCREENSHOT_WAIT - Time to wait before screenshot in ms (default: 1000)
|
||||
#
|
||||
#!/usr/bin/env node
|
||||
//
|
||||
// Screenshot Extractor
|
||||
// Captures a screenshot of a given URL using Puppeteer
|
||||
//
|
||||
// Usage: screenshot <url>
|
||||
// Output: screenshot.png in current directory
|
||||
// Config: All configuration via environment variables
|
||||
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
|
||||
// Required: the extractor exits with an error if not set
|
||||
// SCREENSHOT_TIMEOUT - Timeout in milliseconds (default: 30000)
|
||||
// SCREENSHOT_WIDTH - Viewport width (default: 1920)
|
||||
// SCREENSHOT_HEIGHT - Viewport height (default: 1080)
|
||||
// SCREENSHOT_WAIT - Time to wait before screenshot in ms (default: 1000)
|
||||
//
|
||||
|
||||
import sys
|
||||
import os
|
||||
import subprocess
|
||||
const { spawn } = require('child_process');
|
||||
const fs = require('fs');
|
||||
|
||||
def ensure_playwright():
|
||||
"""Auto-install playwright if not available"""
|
||||
try:
|
||||
from playwright.sync_api import sync_playwright
|
||||
return True
|
||||
except ImportError:
|
||||
print("Installing playwright...", file=sys.stderr)
|
||||
try:
|
||||
subprocess.check_call([sys.executable, "-m", "pip", "install", "playwright"])
|
||||
subprocess.check_call([sys.executable, "-m", "playwright", "install", "chromium"])
|
||||
from playwright.sync_api import sync_playwright
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"Error installing playwright: {e}", file=sys.stderr)
|
||||
return False
|
||||
// Check if puppeteer is available
|
||||
function checkPuppeteer() {
|
||||
try {
|
||||
require.resolve('puppeteer-core');
|
||||
return true;
|
||||
} catch (e) {
|
||||
console.error('Error: puppeteer-core is not installed.');
|
||||
console.error('Please install it with: npm install puppeteer-core');
|
||||
console.error('Or install chromium: npm install puppeteer');
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
def main():
|
||||
if len(sys.argv) < 2:
|
||||
print("Error: URL argument required", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
async function main() {
|
||||
const url = process.argv[2];
|
||||
|
||||
url = sys.argv[1]
|
||||
if (!url) {
|
||||
console.error('Error: URL argument required');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
# Configuration from environment
|
||||
timeout = int(os.environ.get('SCREENSHOT_TIMEOUT', '30000'))
|
||||
width = int(os.environ.get('SCREENSHOT_WIDTH', '1920'))
|
||||
height = int(os.environ.get('SCREENSHOT_HEIGHT', '1080'))
|
||||
wait = int(os.environ.get('SCREENSHOT_WAIT', '1000'))
|
||||
// Configuration from environment
|
||||
const cdpUrl = process.env.CHROME_CDP_URL;
|
||||
const timeout = parseInt(process.env.SCREENSHOT_TIMEOUT || '30000', 10);
|
||||
const width = parseInt(process.env.SCREENSHOT_WIDTH || '1920', 10);
|
||||
const height = parseInt(process.env.SCREENSHOT_HEIGHT || '1080', 10);
|
||||
const wait = parseInt(process.env.SCREENSHOT_WAIT || '1000', 10);
|
||||
|
||||
print(f"Capturing screenshot of: {url}", file=sys.stderr)
|
||||
console.error(`Capturing screenshot of: ${url}`);
|
||||
if (cdpUrl) {
|
||||
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
|
||||
}
|
||||
|
||||
# Ensure playwright is installed
|
||||
if not ensure_playwright():
|
||||
sys.exit(1)
|
||||
// Check puppeteer is installed
|
||||
if (!checkPuppeteer()) {
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
from playwright.sync_api import sync_playwright
|
||||
const puppeteer = require('puppeteer-core');
|
||||
|
||||
try:
|
||||
with sync_playwright() as p:
|
||||
browser = p.chromium.launch()
|
||||
page = browser.new_page(viewport={'width': width, 'height': height})
|
||||
page.goto(url, timeout=timeout, wait_until='networkidle')
|
||||
let browser = null;
|
||||
let shouldCloseBrowser = false;
|
||||
|
||||
# Wait a bit for any dynamic content
|
||||
page.wait_for_timeout(wait)
|
||||
try {
|
||||
// Connect to the CDP browser (CHROME_CDP_URL is required; no local launch)
|
||||
if (cdpUrl) {
|
||||
browser = await puppeteer.connect({
|
||||
browserWSEndpoint: cdpUrl,
|
||||
defaultViewport: { width, height }
|
||||
});
|
||||
} else {
|
||||
console.error('Error: CHROME_CDP_URL environment variable not set.');
|
||||
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
|
||||
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
|
||||
console.error('');
|
||||
console.error('To start Chrome with remote debugging:');
|
||||
console.error(' chrome --remote-debugging-port=9222 --headless');
|
||||
console.error(' chromium --remote-debugging-port=9222 --headless');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
page.screenshot(path='screenshot.png', full_page=True)
|
||||
browser.close()
|
||||
const page = await browser.newPage();
|
||||
await page.setViewport({ width, height });
|
||||
|
||||
print("✓ Captured screenshot: screenshot.png", file=sys.stderr)
|
||||
print("screenshot.png")
|
||||
sys.exit(0)
|
||||
// Navigate to URL
|
||||
await page.goto(url, {
|
||||
timeout,
|
||||
waitUntil: 'networkidle2'
|
||||
});
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
// Wait a bit for any dynamic content
|
||||
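// page.waitForTimeout() was removed in Puppeteer v22, so use a plain timer
await new Promise((resolve) => setTimeout(resolve, wait));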
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
// Take screenshot
|
||||
await page.screenshot({
|
||||
path: 'screenshot.png',
|
||||
fullPage: true
|
||||
});
|
||||
|
||||
await page.close();
|
||||
|
||||
console.error('✓ Captured screenshot: screenshot.png');
|
||||
console.log('screenshot.png');
|
||||
|
||||
if (shouldCloseBrowser) {
|
||||
await browser.close();
|
||||
}
|
||||
|
||||
process.exit(0);
|
||||
|
||||
} catch (err) {
|
||||
console.error(`Error: ${err.message}`);
|
||||
|
||||
if (browser && shouldCloseBrowser) {
|
||||
try {
|
||||
await browser.close();
|
||||
} catch (closeErr) {
|
||||
// Ignore close errors
|
||||
}
|
||||
}
|
||||
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
main();
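
The fixed delay before the screenshot does not need a Puppeteer API; a plain timer does the same job and keeps working across Puppeteer versions. A minimal sketch, with `sleep` as a hypothetical helper name:

```javascript
// Hypothetical helper: resolve after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage inside an async function, before taking the screenshot:
// await sleep(wait);
```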
|
||||
|
||||
@@ -1,89 +1,123 @@
|
||||
#!/usr/bin/env node
|
||||
//
|
||||
// Title Extractor
|
||||
// Extracts the page title from a given URL
|
||||
// Extracts the page title from a given URL using Puppeteer
|
||||
//
|
||||
// Usage: title <url>
|
||||
// Output: title.txt in current directory
|
||||
// Config: All configuration via environment variables
|
||||
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
|
||||
// Required: the extractor exits with an error if not set
|
||||
// TITLE_TIMEOUT - Timeout in milliseconds (default: 10000)
|
||||
// TITLE_USER_AGENT - User agent string
|
||||
//
|
||||
|
||||
const https = require('https');
|
||||
const http = require('http');
|
||||
const { spawn } = require('child_process');
|
||||
const fs = require('fs');
|
||||
const { URL } = require('url');
|
||||
|
||||
const url = process.argv[2];
|
||||
|
||||
if (!url) {
|
||||
console.error('Error: URL argument required');
|
||||
process.exit(1);
|
||||
// Check if puppeteer is available
|
||||
function checkPuppeteer() {
|
||||
try {
|
||||
require.resolve('puppeteer-core');
|
||||
return true;
|
||||
} catch (e) {
|
||||
console.error('Error: puppeteer-core is not installed.');
|
||||
console.error('Please install it with: npm install puppeteer-core');
|
||||
console.error('Or install chromium: npm install puppeteer');
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
// Configuration from environment
|
||||
const TIMEOUT = parseInt(process.env.TITLE_TIMEOUT || '10000', 10);
|
||||
const USER_AGENT = process.env.TITLE_USER_AGENT || 'Mozilla/5.0 (compatible; ArchiveBox-TS/0.1)';
|
||||
async function main() {
|
||||
const url = process.argv[2];
|
||||
|
||||
console.error(`Extracting title from: ${url}`);
|
||||
if (!url) {
|
||||
console.error('Error: URL argument required');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
// Parse URL
|
||||
let parsedUrl;
|
||||
try {
|
||||
parsedUrl = new URL(url);
|
||||
} catch (err) {
|
||||
console.error(`Error: Invalid URL: ${err.message}`);
|
||||
process.exit(1);
|
||||
}
|
||||
// Configuration from environment
|
||||
const cdpUrl = process.env.CHROME_CDP_URL;
|
||||
const timeout = parseInt(process.env.TITLE_TIMEOUT || '10000', 10);
|
||||
|
||||
// Choose http or https module
|
||||
const client = parsedUrl.protocol === 'https:' ? https : http;
|
||||
console.error(`Extracting title from: ${url}`);
|
||||
if (cdpUrl) {
|
||||
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
|
||||
}
|
||||
|
||||
// Make request
|
||||
const options = {
|
||||
headers: {
|
||||
'User-Agent': USER_AGENT,
|
||||
},
|
||||
timeout: TIMEOUT,
|
||||
};
|
||||
// Check puppeteer is installed
|
||||
if (!checkPuppeteer()) {
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
client.get(url, options, (res) => {
|
||||
let html = '';
|
||||
const puppeteer = require('puppeteer-core');
|
||||
|
||||
res.on('data', (chunk) => {
|
||||
html += chunk;
|
||||
let browser = null;
|
||||
let shouldCloseBrowser = false;
|
||||
|
||||
// Early exit if we found the title (optimization)
|
||||
if (html.includes('</title>')) {
|
||||
res.destroy();
|
||||
}
|
||||
});
|
||||
|
||||
res.on('end', () => {
|
||||
// Extract title using regex
|
||||
const titleMatch = html.match(/<title[^>]*>(.*?)<\/title>/is);
|
||||
|
||||
if (titleMatch && titleMatch[1]) {
|
||||
const title = titleMatch[1]
|
||||
.replace(/<[^>]*>/g, '') // Remove any HTML tags
|
||||
.replace(/\s+/g, ' ') // Normalize whitespace
|
||||
.trim();
|
||||
|
||||
// Write to file
|
||||
fs.writeFileSync('title.txt', title, 'utf8');
|
||||
console.error(`✓ Extracted title: ${title}`);
|
||||
console.log('title.txt');
|
||||
process.exit(0);
|
||||
try {
|
||||
// Connect to the CDP browser (CHROME_CDP_URL is required; no local launch)
|
||||
if (cdpUrl) {
|
||||
browser = await puppeteer.connect({
|
||||
browserWSEndpoint: cdpUrl
|
||||
});
|
||||
} else {
|
||||
console.error('Warning: Could not find title tag');
|
||||
console.error('Error: CHROME_CDP_URL environment variable not set.');
|
||||
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
|
||||
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
|
||||
console.error('');
|
||||
console.error('To start Chrome with remote debugging:');
|
||||
console.error(' chrome --remote-debugging-port=9222 --headless');
|
||||
console.error(' chromium --remote-debugging-port=9222 --headless');
|
||||
process.exit(1);
|
||||
}
|
||||
});
|
||||
}).on('error', (err) => {
|
||||
console.error(`Error: ${err.message}`);
|
||||
process.exit(1);
|
||||
}).on('timeout', () => {
|
||||
console.error('Error: Request timeout');
|
||||
process.exit(1);
|
||||
});
|
||||
|
||||
const page = await browser.newPage();
|
||||
|
||||
// Navigate to URL
|
||||
await page.goto(url, {
|
||||
timeout,
|
||||
waitUntil: 'domcontentloaded'
|
||||
});
|
||||
|
||||
// Get the title
|
||||
const title = await page.title();
|
||||
|
||||
await page.close();
|
||||
|
||||
if (title && title.trim()) {
|
||||
// Write to file
|
||||
fs.writeFileSync('title.txt', title.trim(), 'utf8');
|
||||
console.error(`✓ Extracted title: ${title.trim()}`);
|
||||
console.log('title.txt');
|
||||
|
||||
if (shouldCloseBrowser) {
|
||||
await browser.close();
|
||||
}
|
||||
|
||||
process.exit(0);
|
||||
} else {
|
||||
console.error('Warning: Could not find title');
|
||||
|
||||
if (shouldCloseBrowser) {
|
||||
await browser.close();
|
||||
}
|
||||
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
} catch (err) {
|
||||
console.error(`Error: ${err.message}`);
|
||||
|
||||
if (browser && shouldCloseBrowser) {
|
||||
try {
|
||||
await browser.close();
|
||||
} catch (closeErr) {
|
||||
// Ignore close errors
|
||||
}
|
||||
}
|
||||
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
main();
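
In all three extractors `shouldCloseBrowser` is never set to `true`, so the shared browser is left running and the connection is dropped only when the process exits. One way to make that explicit is sketched below; `releaseBrowser` is a hypothetical helper, and it assumes a Puppeteer `Browser` object, which provides both `close()` and `disconnect()`:

```javascript
// Hypothetical helper: close a browser this process launched, but only
// disconnect from a shared browser reached over CDP so that other
// extractors can keep using it.
async function releaseBrowser(browser, launchedLocally) {
  if (!browser) {
    return 'none';
  }
  if (launchedLocally) {
    await browser.close();
    return 'closed';
  }
  await browser.disconnect();
  return 'disconnected';
}

// Usage in a finally block: await releaseBrowser(browser, shouldCloseBrowser);
```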
|
||||
|
||||
855
archivebox-ts/package-lock.json
generated
File diff suppressed because it is too large
@@ -22,7 +22,8 @@
|
||||
"dependencies": {
|
||||
"better-sqlite3": "^11.0.0",
|
||||
"commander": "^12.0.0",
|
||||
"nanoid": "^3.3.7"
|
||||
"nanoid": "^3.3.7",
|
||||
"puppeteer-core": "^24.28.0"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@types/better-sqlite3": "^7.6.9",
|
||||
|
||||