mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-08 20:05:49 +10:00

Files

Claude f4bb10bdae Add TypeScript-based ArchiveBox implementation with simplified architecture

This commit introduces archivebox-ts, a TypeScript reimplementation of
ArchiveBox with a simplified, modular architecture.

Key features:
- Standalone executable extractors (bash, Node.js, Python with shebang)
- Auto-installing dependencies per extractor
- Simple interface: URL as $1 CLI arg, output to current directory
- Environment variable-based configuration only
- SQLite database with schema matching original ArchiveBox
- Language-agnostic extractor system

Core components:
- src/cli.ts: Main CLI with Commander.js (init, add, list, status, extractors)
- src/db.ts: SQLite operations using better-sqlite3
- src/models.ts: TypeScript interfaces matching database schema
- src/extractors.ts: Extractor discovery and orchestration

Sample extractors included:
- favicon: Download site favicon (bash + curl)
- title: Extract page title (Node.js)
- headers: Extract HTTP headers (bash + curl)
- wget: Full page download with WARC (bash + wget)
- screenshot: Capture screenshot (Python + Playwright)

Documentation:
- README.md: Architecture overview and usage
- QUICKSTART.md: 5-minute getting started guide
- EXTRACTOR_GUIDE.md: Comprehensive extractor development guide
- ARCHITECTURE.md: Design decisions and implementation details

Tested and working:
- Database initialization
- URL archiving with multiple extractors
- Parallel extractor execution
- Result tracking and status reporting
- All CLI commands functional

2025-11-03 18:16:57 +00:00

extractors

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

src

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

.gitignore

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

ARCHITECTURE.md

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

EXTRACTOR_GUIDE.md

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

package-lock.json

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

package.json

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

QUICKSTART.md

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

README.md

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

tsconfig.json

Add TypeScript-based ArchiveBox implementation with simplified architecture

2025-11-03 18:16:57 +00:00

README.md

ArchiveBox TypeScript

A TypeScript-based version of ArchiveBox with a simplified, modular architecture.

Overview

This is a reimplementation of ArchiveBox using TypeScript with a focus on simplicity and modularity. The key architectural changes are:

Standalone Extractors: Each extractor is a standalone executable (bash, Node.js, or Python with shebang) that can run independently
Auto-Installing Dependencies: Extractors automatically install their own dependencies when first run
Simple Interface: Extractors receive URL as $1 CLI argument and output files to current working directory
Environment-Based Config: All configuration passed via environment variables, no CLI flags
SQLite Database: Uses SQLite with schema matching the original ArchiveBox

Directory Structure

archivebox-ts/
├── src/
│   ├── cli.ts           # Main CLI entry point
│   ├── db.ts            # SQLite database operations
│   ├── models.ts        # TypeScript interfaces
│   └── extractors.ts    # Extractor orchestration
├── extractors/          # Standalone extractor executables
│   ├── favicon          # Bash script to download favicon
│   ├── title            # Node.js script to extract title
│   ├── headers          # Bash script to extract HTTP headers
│   ├── wget             # Bash script for full page download
│   └── screenshot       # Python script for screenshots
├── data/                # Created on init
│   ├── index.sqlite3    # SQLite database
│   └── archive/         # Archived snapshots
├── package.json
├── tsconfig.json
└── README.md

Installation

Prerequisites

Node.js 18+ and npm
For specific extractors:
- wget extractor: wget
- screenshot extractor: Python 3 + Playwright

Setup

cd archivebox-ts

# Install dependencies
npm install

# Build TypeScript
npm run build

# Initialize ArchiveBox
node dist/cli.js init

Usage

Initialize

Create the data directory and database:

node dist/cli.js init

Add a URL

Archive a URL with all available extractors:

node dist/cli.js add https://example.com

Archive with specific extractors:

node dist/cli.js add https://example.com --extractors favicon,title,headers

Add with custom title:

node dist/cli.js add https://example.com --title "Example Domain"

List Snapshots

List all archived snapshots:

node dist/cli.js list

With pagination:

node dist/cli.js list --limit 10 --offset 20

Check Status

View detailed status of a snapshot:

node dist/cli.js status <snapshot-id>

List Extractors

See all available extractors:

node dist/cli.js extractors

Database Schema

The SQLite database uses a schema compatible with ArchiveBox:

Snapshots Table

Represents a single URL being archived.

Column	Type	Description
id	TEXT (UUID)	Primary key
abid	TEXT	ArchiveBox ID (snp_...)
url	TEXT	URL being archived (unique)
timestamp	TEXT	Unix timestamp string
title	TEXT	Page title
created_at	TEXT	ISO datetime
bookmarked_at	TEXT	ISO datetime
downloaded_at	TEXT	ISO datetime when complete
modified_at	TEXT	ISO datetime
status	TEXT	queued, started, sealed
retry_at	TEXT	ISO datetime for retry
config	TEXT (JSON)	Configuration
notes	TEXT	Extra notes
output_dir	TEXT	Path to output directory

Archive Results Table

Represents the result of running one extractor on one snapshot.

Column	Type	Description
id	TEXT (UUID)	Primary key
abid	TEXT	ArchiveBox ID (res_...)
snapshot_id	TEXT	Foreign key to snapshot
extractor	TEXT	Extractor name
status	TEXT	queued, started, succeeded, failed, skipped, backoff
created_at	TEXT	ISO datetime
modified_at	TEXT	ISO datetime
start_ts	TEXT	ISO datetime when started
end_ts	TEXT	ISO datetime when finished
cmd	TEXT (JSON)	Command executed
pwd	TEXT	Working directory
cmd_version	TEXT	Binary version
output	TEXT	Output file path or result
retry_at	TEXT	ISO datetime for retry
config	TEXT (JSON)	Configuration
notes	TEXT	Extra notes

Creating Custom Extractors

Extractors are standalone executable files in the extractors/ directory.

Extractor Contract

Executable: File must have execute permissions (chmod +x)
Shebang: Must start with shebang (e.g., #!/bin/bash, #!/usr/bin/env node)
First Argument: Receives URL as $1 (bash) or process.argv[2] (Node.js) or sys.argv[1] (Python)
Working Directory: Run in the output directory, write files there
Environment Config: Read all config from environment variables
Exit Code: Return 0 for success, non-zero for failure
Output: Print the main output file path to stdout
Logging: Print progress/errors to stderr
Auto-Install: Optionally auto-install dependencies on first run

Example Bash Extractor

#!/bin/bash
#
# My Custom Extractor
# Description of what it does
#
# Config via environment variables:
#   MY_TIMEOUT - Timeout in seconds (default: 30)
#

set -e

URL="$1"

if [ -z "$URL" ]; then
  echo "Error: URL argument required" >&2
  exit 1
fi

# Auto-install dependencies (optional)
if ! command -v some-tool &> /dev/null; then
  echo "Installing some-tool..." >&2
  sudo apt-get install -y some-tool
fi

# Get config from environment
TIMEOUT="${MY_TIMEOUT:-30}"

echo "Processing $URL..." >&2

# Do the extraction work
some-tool --timeout "$TIMEOUT" "$URL" > output.txt

echo "✓ Done" >&2
echo "output.txt"
exit 0

Example Node.js Extractor

#!/usr/bin/env node
//
// My Custom Extractor
// Config via environment variables:
//   MY_TIMEOUT - Timeout in ms
//

const url = process.argv[2];
if (!url) {
  console.error('Error: URL argument required');
  process.exit(1);
}

const timeout = parseInt(process.env.MY_TIMEOUT || '10000', 10);

console.error(`Processing ${url}...`);

// Do extraction work
// Write files to current directory

console.error('✓ Done');
console.log('output.txt');

Example Python Extractor

#!/usr/bin/env python3
#
# My Custom Extractor
# Config via environment variables:
#   MY_TIMEOUT - Timeout in seconds
#

import sys
import os

url = sys.argv[1] if len(sys.argv) > 1 else None
if not url:
    print("Error: URL argument required", file=sys.stderr)
    sys.exit(1)

timeout = int(os.environ.get('MY_TIMEOUT', '30'))

print(f"Processing {url}...", file=sys.stderr)

# Do extraction work
# Write files to current directory

print("✓ Done", file=sys.stderr)
print("output.txt")

Available Extractors

favicon

Language: Bash
Dependencies: curl (auto-installed)
Output: favicon.ico or favicon.png
Config:
- FAVICON_TIMEOUT - Timeout in seconds (default: 10)

title

Language: Node.js
Dependencies: Built-in Node.js modules
Output: title.txt
Config:
- TITLE_TIMEOUT - Timeout in milliseconds (default: 10000)
- TITLE_USER_AGENT - User agent string

headers

Language: Bash
Dependencies: curl (auto-installed)
Output: headers.json
Config:
- HEADERS_TIMEOUT - Timeout in seconds (default: 10)
- HEADERS_USER_AGENT - User agent string

wget

Language: Bash
Dependencies: wget (auto-installed)
Output: warc/archive.warc.gz and downloaded files
Config:
- WGET_TIMEOUT - Timeout in seconds (default: 60)
- WGET_USER_AGENT - User agent string
- WGET_ARGS - Additional wget arguments

screenshot

Language: Python
Dependencies: playwright (auto-installed)
Output: screenshot.png
Config:
- SCREENSHOT_TIMEOUT - Timeout in milliseconds (default: 30000)
- SCREENSHOT_WIDTH - Viewport width (default: 1920)
- SCREENSHOT_HEIGHT - Viewport height (default: 1080)
- SCREENSHOT_WAIT - Wait time before screenshot in ms (default: 1000)

Development

Build

npm run build

Watch Mode

npm run dev

Project Structure

src/models.ts - TypeScript interfaces matching the database schema
src/db.ts - Database layer with SQLite operations
src/extractors.ts - Extractor discovery and orchestration
src/cli.ts - CLI commands and application logic

Differences from Original ArchiveBox

Simplified

No Plugin System: Instead of a complex ABX plugin framework, extractors are simple executable files
Simpler Config: Only environment variables, no configuration file parsing
No Web UI: Command-line only (for now)
No Background Workers: Direct execution (could be added)
No User System: Single-user mode

Architecture Improvements

Extractors are Standalone: Each extractor can be tested independently
Language Agnostic: Write extractors in any language (bash, Python, Node.js, Go, etc.)
Easy to Extend: Just drop an executable file in extractors/ directory
Minimal Dependencies: Core system only needs Node.js and SQLite

Future Enhancements

Background job queue for processing
Web UI for browsing archives
Search functionality
More extractors (pdf, dom, singlefile, readability, etc.)
Import/export functionality
Schedule automatic archiving
Browser extension integration

License

MIT

Credits

Based on ArchiveBox by Nick Sweeting and contributors.