Update extractors to use Puppeteer with Chrome DevTools Protocol

Switched the screenshot, title, and headers extractors from their previous implementations
(Python/Playwright, plain Node.js HTTP, and Bash/curl, respectively) to Puppeteer connected
to Chrome over the Chrome DevTools Protocol (CDP).

Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via the CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency (see the sketch below)
- Added puppeteer-core as a dependency
- Removed auto-install logic (cleaner, more predictable)
- Clearer error messages when CHROME_CDP_URL is not set
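
A minimal sketch of the connection pattern the three extractors now share (the `withSharedBrowser` helper name is illustrative and not part of the codebase):

```js
// Sketch only: each extractor connects to the shared Chrome over CDP, works in
// its own page, then disconnects so the browser stays alive for the others.
const puppeteer = require('puppeteer-core');

async function withSharedBrowser(work) {
  const cdpUrl = process.env.CHROME_CDP_URL;
  if (!cdpUrl) {
    throw new Error('CHROME_CDP_URL is not set; start Chrome with --remote-debugging-port and export the WebSocket URL');
  }
  const browser = await puppeteer.connect({ browserWSEndpoint: cdpUrl });
  try {
    const page = await browser.newPage();
    try {
      return await work(page);
    } finally {
      await page.close();
    }
  } finally {
    await browser.disconnect(); // disconnect, not close(): other extractors share this Chrome
  }
}
```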

Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Supports remote or containerized Chrome
- Better suited to production deployments

Breaking changes:
- The CHROME_CDP_URL environment variable is now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
  chrome --remote-debugging-port=9222 --headless

Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
Claude
2025-11-03 19:10:55 +00:00
parent f4bb10bdae
commit ee1db04b73
6 changed files with 1290 additions and 213 deletions


@@ -40,9 +40,10 @@ archivebox-ts/
### Prerequisites
- Node.js 18+ and npm
- Chrome or Chromium browser (for screenshot, title, and headers extractors)
- For specific extractors:
- `wget` extractor: wget
- `screenshot` extractor: Python 3 + Playwright
- `screenshot`, `title`, `headers` extractors: puppeteer-core + Chrome with remote debugging
### Setup
@@ -57,6 +58,15 @@ npm run build
# Initialize ArchiveBox
node dist/cli.js init
# Start Chrome with remote debugging (required for screenshot, title, headers extractors)
# In a separate terminal:
chrome --remote-debugging-port=9222 --headless
# Or on Linux:
chromium --remote-debugging-port=9222 --headless
# Set the CDP URL environment variable to Chrome's WebSocket endpoint
export CHROME_CDP_URL=$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)
```
## Usage
@@ -71,16 +81,35 @@ node dist/cli.js init
### Add a URL
First, make sure Chrome is running with remote debugging and CHROME_CDP_URL is set:
```bash
# Terminal 1: Start Chrome
chrome --remote-debugging-port=9222 --headless
# Terminal 2: Get the WebSocket URL
curl http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl
# Set the environment variable (use the URL from above)
export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."
```
Archive a URL with all available extractors:
```bash
node dist/cli.js add https://example.com
```
Archive with specific extractors:
Archive with specific extractors (favicon and wget don't need Chrome):
```bash
node dist/cli.js add https://example.com --extractors favicon,title,headers
node dist/cli.js add https://example.com --extractors favicon,wget
```
Archive with Chrome-based extractors:
```bash
node dist/cli.js add https://example.com --extractors title,headers,screenshot
```
Add with custom title:
@@ -285,44 +314,116 @@ print("output.txt")
- **Language**: Bash
- **Dependencies**: curl (auto-installed)
- **Output**: `favicon.ico` or `favicon.png`
- **Requires Chrome**: No
- **Config**:
- `FAVICON_TIMEOUT` - Timeout in seconds (default: 10)
### title
- **Language**: Node.js
- **Dependencies**: Built-in Node.js modules
- **Language**: Node.js + Puppeteer
- **Dependencies**: puppeteer-core, Chrome browser via CDP
- **Output**: `title.txt`
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
- **Config**:
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
- `TITLE_TIMEOUT` - Timeout in milliseconds (default: 10000)
- `TITLE_USER_AGENT` - User agent string
### headers
- **Language**: Bash
- **Dependencies**: curl (auto-installed)
- **Language**: Node.js + Puppeteer
- **Dependencies**: puppeteer-core, Chrome browser via CDP
- **Output**: `headers.json`
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
- **Config**:
- `HEADERS_TIMEOUT` - Timeout in seconds (default: 10)
- `HEADERS_USER_AGENT` - User agent string
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
- `HEADERS_TIMEOUT` - Timeout in milliseconds (default: 10000)
### wget
- **Language**: Bash
- **Dependencies**: wget (auto-installed)
- **Output**: `warc/archive.warc.gz` and downloaded files
- **Requires Chrome**: No
- **Config**:
- `WGET_TIMEOUT` - Timeout in seconds (default: 60)
- `WGET_USER_AGENT` - User agent string
- `WGET_ARGS` - Additional wget arguments
### screenshot
- **Language**: Python
- **Dependencies**: playwright (auto-installed)
- **Language**: Node.js + Puppeteer
- **Dependencies**: puppeteer-core, Chrome browser via CDP
- **Output**: `screenshot.png`
- **Requires Chrome**: Yes (via CHROME_CDP_URL)
- **Config**:
- `CHROME_CDP_URL` - Chrome DevTools Protocol WebSocket URL (required)
- `SCREENSHOT_TIMEOUT` - Timeout in milliseconds (default: 30000)
- `SCREENSHOT_WIDTH` - Viewport width (default: 1920)
- `SCREENSHOT_HEIGHT` - Viewport height (default: 1080)
- `SCREENSHOT_WAIT` - Wait time before screenshot in ms (default: 1000)
## Setting up Chrome for Remote Debugging
The `screenshot`, `title`, and `headers` extractors require a Chrome browser accessible via the Chrome DevTools Protocol (CDP). This allows multiple extractors to share a single browser instance.
### Start Chrome with Remote Debugging
```bash
# Linux/Mac
chromium --remote-debugging-port=9222 --headless --disable-gpu
# Or with Chrome
chrome --remote-debugging-port=9222 --headless --disable-gpu
# Windows
chrome.exe --remote-debugging-port=9222 --headless --disable-gpu
```
### Get the WebSocket URL
```bash
# Query the Chrome instance for the WebSocket URL
curl http://localhost:9222/json/version
# Example output:
# {
# "Browser": "Chrome/120.0.0.0",
# "Protocol-Version": "1.3",
# "User-Agent": "Mozilla/5.0...",
# "V8-Version": "12.0.267.8",
# "WebKit-Version": "537.36",
# "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/..."
# }
```
### Set the Environment Variable
```bash
# Extract just the WebSocket URL
export CHROME_CDP_URL=$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)
# Or set it manually
export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/12345678-1234-1234-1234-123456789abc"
# Verify it's set
echo $CHROME_CDP_URL
```
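
If `jq` is not available, the same lookup works with Node 18's built-in `fetch`. A small sketch (the `resolve-cdp-url.js` filename is just an example):

```js
// resolve-cdp-url.js: print the WebSocket endpoint of a Chrome instance started
// with --remote-debugging-port=9222 (adjust host/port to match your setup).
(async () => {
  const res = await fetch('http://localhost:9222/json/version');
  const { webSocketDebuggerUrl } = await res.json();
  console.log(webSocketDebuggerUrl);
})();
```

Then `export CHROME_CDP_URL=$(node resolve-cdp-url.js)` works the same way as the `jq` version above.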
### Docker Setup
For running in Docker, you can use a separate Chrome container:
```bash
# Start Chrome in a container
docker run -d --name chrome \
-p 9222:9222 \
browserless/chrome:latest \
--remote-debugging-port=9222 \
--remote-debugging-address=0.0.0.0
# Get the CDP URL from the container
export CHROME_CDP_URL=$(curl -s http://localhost:9222/json/version | jq -r .webSocketDebuggerUrl)
# Run archivebox-ts
node dist/cli.js add https://example.com
```
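
To confirm the containerized Chrome is reachable before archiving, a quick connectivity check with puppeteer-core (a sketch; the `check-cdp.js` name is illustrative):

```js
// check-cdp.js: verify that CHROME_CDP_URL points at a live Chrome instance.
const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: process.env.CHROME_CDP_URL,
  });
  console.log('Connected to', await browser.version()); // prints the Chrome build string
  await browser.disconnect();
})();
```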
## Development
### Build


@@ -1,85 +1,129 @@
#!/bin/bash
#
# Headers Extractor
# Extracts HTTP headers from a given URL
#
# Usage: headers <url>
# Output: headers.json in current directory
# Config: All configuration via environment variables
# HEADERS_TIMEOUT - Timeout in seconds (default: 10)
# HEADERS_USER_AGENT - User agent string
#
#!/usr/bin/env node
//
// Headers Extractor
// Extracts HTTP headers from a given URL using Puppeteer
//
// Usage: headers <url>
// Output: headers.json in current directory
// Config: All configuration via environment variables
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
// Required: the extractor exits with an error if it is not set
// HEADERS_TIMEOUT - Timeout in milliseconds (default: 10000)
//
set -e
const { spawn } = require('child_process');
const fs = require('fs');
URL="$1"
// Check if puppeteer is available
function checkPuppeteer() {
try {
require.resolve('puppeteer-core');
return true;
} catch (e) {
console.error('Error: puppeteer-core is not installed.');
console.error('Please install it with: npm install puppeteer-core');
console.error('Or install chromium: npm install puppeteer');
return false;
}
}
if [ -z "$URL" ]; then
echo "Error: URL argument required" >&2
exit 1
fi
async function main() {
const url = process.argv[2];
# Auto-install dependencies
if ! command -v curl &> /dev/null; then
echo "Installing curl..." >&2
if command -v apt-get &> /dev/null; then
sudo apt-get update && sudo apt-get install -y curl
elif command -v yum &> /dev/null; then
sudo yum install -y curl
elif command -v brew &> /dev/null; then
brew install curl
else
echo "Error: Cannot install curl. Please install manually." >&2
exit 1
fi
fi
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
}
# Configuration from environment
TIMEOUT="${HEADERS_TIMEOUT:-10}"
USER_AGENT="${HEADERS_USER_AGENT:-Mozilla/5.0 (compatible; ArchiveBox-TS/0.1)}"
// Configuration from environment
const cdpUrl = process.env.CHROME_CDP_URL;
const timeout = parseInt(process.env.HEADERS_TIMEOUT || '10000', 10);
echo "Extracting headers from: $URL" >&2
console.error(`Extracting headers from: ${url}`);
if (cdpUrl) {
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
}
# Get headers using curl
HEADERS=$(curl -I -L -s --max-time "$TIMEOUT" --user-agent "$USER_AGENT" "$URL" 2>&1 || echo "")
// Check puppeteer is installed
if (!checkPuppeteer()) {
process.exit(1);
}
if [ -z "$HEADERS" ]; then
echo "Error: Failed to fetch headers" >&2
exit 1
fi
const puppeteer = require('puppeteer-core');
# Convert headers to JSON format (simple key-value pairs)
echo "{" > headers.json
let browser = null;
let shouldCloseBrowser = false;
# Parse headers line by line
FIRST=1
while IFS=: read -r key value; do
# Skip empty lines and HTTP status line
if [ -z "$key" ] || [[ "$key" =~ ^HTTP ]]; then
continue
fi
try {
// Connect to the shared Chrome instance via CDP (required)
if (cdpUrl) {
browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl
});
} else {
console.error('Error: CHROME_CDP_URL environment variable not set.');
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
console.error('');
console.error('To start Chrome with remote debugging:');
console.error(' chrome --remote-debugging-port=9222 --headless');
console.error(' chromium --remote-debugging-port=9222 --headless');
process.exit(1);
}
# Clean up key and value
key=$(echo "$key" | tr -d '\r\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
value=$(echo "$value" | tr -d '\r\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
const page = await browser.newPage();
if [ -n "$key" ] && [ -n "$value" ]; then
# Escape quotes in value
value=$(echo "$value" | sed 's/"/\\"/g')
let capturedHeaders = null;
# Add comma if not first entry
if [ "$FIRST" -eq 0 ]; then
echo "," >> headers.json
fi
// Listen for response to capture headers
page.on('response', async (response) => {
if (response.url() === url) {
capturedHeaders = response.headers();
}
});
echo -n " \"$key\": \"$value\"" >> headers.json
FIRST=0
fi
done <<< "$HEADERS"
// Navigate to URL
await page.goto(url, {
timeout,
waitUntil: 'domcontentloaded'
});
echo "" >> headers.json
echo "}" >> headers.json
await page.close();
echo "✓ Extracted headers to headers.json" >&2
echo "headers.json"
exit 0
if (capturedHeaders) {
// Write to file
fs.writeFileSync('headers.json', JSON.stringify(capturedHeaders, null, 2), 'utf8');
console.error('✓ Extracted headers to headers.json');
console.log('headers.json');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(0);
} else {
console.error('Warning: Could not capture headers');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(1);
}
} catch (err) {
console.error(`Error: ${err.message}`);
if (browser && shouldCloseBrowser) {
try {
await browser.close();
} catch (closeErr) {
// Ignore close errors
}
}
process.exit(1);
}
}
main();
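
One caveat with the header capture above: the `response.url() === url` comparison is fragile, because Chrome normalizes URLs (for example `https://example.com` becomes `https://example.com/`) and redirects change the final URL, so the listener can miss the response entirely. A more robust approach, sketched here under the same puppeteer-core setup, is to use the response object that `page.goto()` itself returns:

```js
// Sketch: page.goto() resolves with the response for the main resource, even
// after redirects or URL normalization, so no response listener is needed.
const response = await page.goto(url, { timeout, waitUntil: 'domcontentloaded' });
const capturedHeaders = response ? response.headers() : null;
```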


@@ -1,77 +1,125 @@
#!/usr/bin/env python3
#
# Screenshot Extractor
# Captures a screenshot of a given URL using Playwright
#
# Usage: screenshot <url>
# Output: screenshot.png in current directory
# Config: All configuration via environment variables
# SCREENSHOT_TIMEOUT - Timeout in milliseconds (default: 30000)
# SCREENSHOT_WIDTH - Viewport width (default: 1920)
# SCREENSHOT_HEIGHT - Viewport height (default: 1080)
# SCREENSHOT_WAIT - Time to wait before screenshot in ms (default: 1000)
#
#!/usr/bin/env node
//
// Screenshot Extractor
// Captures a screenshot of a given URL using Puppeteer
//
// Usage: screenshot <url>
// Output: screenshot.png in current directory
// Config: All configuration via environment variables
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
// Required: the extractor exits with an error if it is not set
// SCREENSHOT_TIMEOUT - Timeout in milliseconds (default: 30000)
// SCREENSHOT_WIDTH - Viewport width (default: 1920)
// SCREENSHOT_HEIGHT - Viewport height (default: 1080)
// SCREENSHOT_WAIT - Time to wait before screenshot in ms (default: 1000)
//
import sys
import os
import subprocess
const { spawn } = require('child_process');
const fs = require('fs');
def ensure_playwright():
"""Auto-install playwright if not available"""
try:
from playwright.sync_api import sync_playwright
return True
except ImportError:
print("Installing playwright...", file=sys.stderr)
try:
subprocess.check_call([sys.executable, "-m", "pip", "install", "playwright"])
subprocess.check_call([sys.executable, "-m", "playwright", "install", "chromium"])
from playwright.sync_api import sync_playwright
return True
except Exception as e:
print(f"Error installing playwright: {e}", file=sys.stderr)
return False
// Check if puppeteer is available
function checkPuppeteer() {
try {
require.resolve('puppeteer-core');
return true;
} catch (e) {
console.error('Error: puppeteer-core is not installed.');
console.error('Please install it with: npm install puppeteer-core');
console.error('Or install chromium: npm install puppeteer');
return false;
}
}
def main():
if len(sys.argv) < 2:
print("Error: URL argument required", file=sys.stderr)
sys.exit(1)
async function main() {
const url = process.argv[2];
url = sys.argv[1]
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
}
# Configuration from environment
timeout = int(os.environ.get('SCREENSHOT_TIMEOUT', '30000'))
width = int(os.environ.get('SCREENSHOT_WIDTH', '1920'))
height = int(os.environ.get('SCREENSHOT_HEIGHT', '1080'))
wait = int(os.environ.get('SCREENSHOT_WAIT', '1000'))
// Configuration from environment
const cdpUrl = process.env.CHROME_CDP_URL;
const timeout = parseInt(process.env.SCREENSHOT_TIMEOUT || '30000', 10);
const width = parseInt(process.env.SCREENSHOT_WIDTH || '1920', 10);
const height = parseInt(process.env.SCREENSHOT_HEIGHT || '1080', 10);
const wait = parseInt(process.env.SCREENSHOT_WAIT || '1000', 10);
print(f"Capturing screenshot of: {url}", file=sys.stderr)
console.error(`Capturing screenshot of: ${url}`);
if (cdpUrl) {
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
}
# Ensure playwright is installed
if not ensure_playwright():
sys.exit(1)
// Check puppeteer is installed
if (!checkPuppeteer()) {
process.exit(1);
}
from playwright.sync_api import sync_playwright
const puppeteer = require('puppeteer-core');
try:
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page(viewport={'width': width, 'height': height})
page.goto(url, timeout=timeout, wait_until='networkidle')
let browser = null;
let shouldCloseBrowser = false;
# Wait a bit for any dynamic content
page.wait_for_timeout(wait)
try {
// Connect to the shared Chrome instance via CDP (required)
if (cdpUrl) {
browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl,
defaultViewport: { width, height }
});
} else {
console.error('Error: CHROME_CDP_URL environment variable not set.');
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
console.error('');
console.error('To start Chrome with remote debugging:');
console.error(' chrome --remote-debugging-port=9222 --headless');
console.error(' chromium --remote-debugging-port=9222 --headless');
process.exit(1);
}
page.screenshot(path='screenshot.png', full_page=True)
browser.close()
const page = await browser.newPage();
await page.setViewport({ width, height });
print("✓ Captured screenshot: screenshot.png", file=sys.stderr)
print("screenshot.png")
sys.exit(0)
// Navigate to URL
await page.goto(url, {
timeout,
waitUntil: 'networkidle2'
});
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
// Wait a bit for any dynamic content
await new Promise((resolve) => setTimeout(resolve, wait)); // page.waitForTimeout was removed in newer Puppeteer versions
if __name__ == '__main__':
main()
// Take screenshot
await page.screenshot({
path: 'screenshot.png',
fullPage: true
});
await page.close();
console.error('✓ Captured screenshot: screenshot.png');
console.log('screenshot.png');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(0);
} catch (err) {
console.error(`Error: ${err.message}`);
if (browser && shouldCloseBrowser) {
try {
await browser.close();
} catch (closeErr) {
// Ignore close errors
}
}
process.exit(1);
}
}
main();


@@ -1,89 +1,123 @@
#!/usr/bin/env node
//
// Title Extractor
// Extracts the page title from a given URL
// Extracts the page title from a given URL using Puppeteer
//
// Usage: title <url>
// Output: title.txt in current directory
// Config: All configuration via environment variables
// CHROME_CDP_URL - Chrome DevTools Protocol URL (e.g., ws://localhost:9222/devtools/browser/...)
// Required: the extractor exits with an error if it is not set
// TITLE_TIMEOUT - Timeout in milliseconds (default: 10000)
// TITLE_USER_AGENT - User agent string
//
const https = require('https');
const http = require('http');
const { spawn } = require('child_process');
const fs = require('fs');
const { URL } = require('url');
const url = process.argv[2];
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
// Check if puppeteer is available
function checkPuppeteer() {
try {
require.resolve('puppeteer-core');
return true;
} catch (e) {
console.error('Error: puppeteer-core is not installed.');
console.error('Please install it with: npm install puppeteer-core');
console.error('Or install chromium: npm install puppeteer');
return false;
}
}
// Configuration from environment
const TIMEOUT = parseInt(process.env.TITLE_TIMEOUT || '10000', 10);
const USER_AGENT = process.env.TITLE_USER_AGENT || 'Mozilla/5.0 (compatible; ArchiveBox-TS/0.1)';
async function main() {
const url = process.argv[2];
console.error(`Extracting title from: ${url}`);
if (!url) {
console.error('Error: URL argument required');
process.exit(1);
}
// Parse URL
let parsedUrl;
try {
parsedUrl = new URL(url);
} catch (err) {
console.error(`Error: Invalid URL: ${err.message}`);
process.exit(1);
}
// Configuration from environment
const cdpUrl = process.env.CHROME_CDP_URL;
const timeout = parseInt(process.env.TITLE_TIMEOUT || '10000', 10);
// Choose http or https module
const client = parsedUrl.protocol === 'https:' ? https : http;
console.error(`Extracting title from: ${url}`);
if (cdpUrl) {
console.error(`Connecting to browser via CDP: ${cdpUrl}`);
}
// Make request
const options = {
headers: {
'User-Agent': USER_AGENT,
},
timeout: TIMEOUT,
};
// Check puppeteer is installed
if (!checkPuppeteer()) {
process.exit(1);
}
client.get(url, options, (res) => {
let html = '';
const puppeteer = require('puppeteer-core');
res.on('data', (chunk) => {
html += chunk;
let browser = null;
let shouldCloseBrowser = false;
// Early exit if we found the title (optimization)
if (html.includes('</title>')) {
res.destroy();
}
});
res.on('end', () => {
// Extract title using regex
const titleMatch = html.match(/<title[^>]*>(.*?)<\/title>/is);
if (titleMatch && titleMatch[1]) {
const title = titleMatch[1]
.replace(/<[^>]*>/g, '') // Remove any HTML tags
.replace(/\s+/g, ' ') // Normalize whitespace
.trim();
// Write to file
fs.writeFileSync('title.txt', title, 'utf8');
console.error(`✓ Extracted title: ${title}`);
console.log('title.txt');
process.exit(0);
try {
// Connect to the shared Chrome instance via CDP (required)
if (cdpUrl) {
browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl
});
} else {
console.error('Warning: Could not find title tag');
console.error('Error: CHROME_CDP_URL environment variable not set.');
console.error('Please set CHROME_CDP_URL to connect to a Chrome browser via CDP.');
console.error('Example: export CHROME_CDP_URL="ws://localhost:9222/devtools/browser/..."');
console.error('');
console.error('To start Chrome with remote debugging:');
console.error(' chrome --remote-debugging-port=9222 --headless');
console.error(' chromium --remote-debugging-port=9222 --headless');
process.exit(1);
}
});
}).on('error', (err) => {
console.error(`Error: ${err.message}`);
process.exit(1);
}).on('timeout', () => {
console.error('Error: Request timeout');
process.exit(1);
});
const page = await browser.newPage();
// Navigate to URL
await page.goto(url, {
timeout,
waitUntil: 'domcontentloaded'
});
// Get the title
const title = await page.title();
await page.close();
if (title && title.trim()) {
// Write to file
fs.writeFileSync('title.txt', title.trim(), 'utf8');
console.error(`✓ Extracted title: ${title.trim()}`);
console.log('title.txt');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(0);
} else {
console.error('Warning: Could not find title');
if (shouldCloseBrowser) {
await browser.close();
}
process.exit(1);
}
} catch (err) {
console.error(`Error: ${err.message}`);
if (browser && shouldCloseBrowser) {
try {
await browser.close();
} catch (closeErr) {
// Ignore close errors
}
}
process.exit(1);
}
}
main();

File diff suppressed because it is too large


@@ -22,7 +22,8 @@
"dependencies": {
"better-sqlite3": "^11.0.0",
"commander": "^12.0.0",
"nanoid": "^3.3.7"
"nanoid": "^3.3.7",
"puppeteer-core": "^24.28.0"
},
"devDependencies": {
"@types/better-sqlite3": "^7.6.9",