Commit Graph

4649 Commits

Author SHA1 Message Date
Claude
83fdc51b45 Implement serial extractor execution with .env state sharing
Major architectural change: extractors now run serially in a predefined
order with state sharing via a .env file in each snapshot directory.

Key changes:

1. New puppeteer extractor:
   - Launches Chrome with user data dir
   - Writes CDP URL and page target ID to .env
   - Leaves browser running for other extractors to reuse

2. Serial execution (src/extractors.ts):
   - Hardcoded EXTRACTOR_ORDER array defines execution sequence
   - Created runExtractorsSerial() method
   - Each extractor reads .env before running
   - Environment variables from .env passed to extractors
   - Extractors can append to .env for later extractors

3. Updated browser-based extractors (title, headers, screenshot):
   - Now reuse existing Chrome tab from puppeteer extractor
   - Read CHROME_CDP_URL and CHROME_PAGE_TARGET_ID from env
   - Connect to existing browser instead of launching new one
   - Leave page open after extraction for next extractor

4. .env file management:
   - Created in snapshot dir before first extractor runs
   - Loaded before each extractor execution
   - Simple KEY=VALUE parser with quote handling
   - Enables state passing between extractors

5. Updated CLI (src/cli.ts):
   - Use runExtractorsSerial instead of parallel execution
   - Better output formatting for serial execution
   - Show extractor execution order
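
The ".env file management" item above describes a simple KEY=VALUE parser with quote handling. The commit does not show that parser, so the following is only a minimal sketch of what one could look like. The names `parseEnv` and `formatEnvLine` are hypothetical, and skipping `#` comment lines is an assumption rather than documented behavior.

```typescript
// Hypothetical sketch of a KEY=VALUE parser with quote handling.
// Not the actual archivebox-ts implementation.
type EnvMap = Record<string, string>;

function parseEnv(text: string): EnvMap {
  const env: EnvMap = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Assumption: blank lines and "#" comments are ignored.
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no "=" at all, or an empty key
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching single or double quotes.
    const first = value.charAt(0);
    if (value.length >= 2 && (first === '"' || first === "'") && value.endsWith(first)) {
      value = value.slice(1, -1);
    }
    env[key] = value; // later lines override earlier ones
  }
  return env;
}

// Format a line that an extractor could append to .env for later extractors.
// Assumes the value contains no double quotes or newlines.
function formatEnvLine(key: string, value: string): string {
  return `${key}="${value}"`;
}

const sampleEnvText = [
  "# written by the puppeteer extractor (illustrative values)",
  'CHROME_CDP_URL="ws://127.0.0.1:9222/devtools/browser/example"',
  "CHROME_PAGE_TARGET_ID=example-target",
].join("\n");

const sampleEnv = parseEnv(sampleEnvText);
console.log(sampleEnv.CHROME_CDP_URL, sampleEnv.CHROME_PAGE_TARGET_ID);
```

Because later assignments override earlier ones, an extractor that appends a key a second time replaces the earlier value on the next load.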

Benefits:
- Single Chrome instance shared across all extractors (faster)
- Predictable execution order (easier to debug)
- State sharing enables complex workflows
- Browser stays on same page (more efficient)
- No need for separate Chrome remote debugging setup

Breaking changes:
- Extractors now run serially (not in parallel)
- puppeteer extractor must run first for browser-based extractors
- Added puppeteer package dependency
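
The ordering requirement above (puppeteer first, then the browser-based extractors) can be illustrated with a small in-memory model. This is a sketch under stated assumptions: the real extractors are standalone executables and the real `runExtractorsSerial()` lives in src/extractors.ts, whereas here extractors are plain functions, the order array contents are guessed from the commit text, and the `.env` file is modeled as an object.

```typescript
// Hypothetical in-memory model of serial execution with shared state.
// Not the actual src/extractors.ts code.
type SharedEnv = Record<string, string>;
// Each extractor receives the current state and returns the keys it appends.
type Extractor = (env: SharedEnv) => SharedEnv;

// Assumed order; the commit only states that puppeteer must run first.
const EXTRACTOR_ORDER: string[] = ["puppeteer", "title", "headers", "screenshot"];

function runExtractorsSerial(
  extractors: Record<string, Extractor>,
  order: string[] = EXTRACTOR_ORDER,
): { ran: string[]; env: SharedEnv } {
  let env: SharedEnv = {};
  const ran: string[] = [];
  for (const name of order) {
    const extractor = extractors[name];
    if (!extractor) continue; // not installed or not selected
    // Pass a copy so an extractor can only change state by returning keys.
    const appended = extractor({ ...env });
    env = { ...env, ...appended };
    ran.push(name);
  }
  return { ran, env };
}

const serialResult = runExtractorsSerial({
  // Registered out of order on purpose: the order array decides the sequence.
  title: (env) => {
    if (!env.CHROME_CDP_URL) throw new Error("puppeteer extractor must run first");
    return {};
  },
  puppeteer: () => ({
    CHROME_CDP_URL: "ws://127.0.0.1:9222/devtools/browser/example",
    CHROME_PAGE_TARGET_ID: "example-target",
  }),
});

console.log(serialResult.ran.join(" -> ")); // prints "puppeteer -> title"
```

Running the same extractors with an order that omits puppeteer makes the title extractor fail, which is the breaking change described above.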

Updated documentation:
- README.md with new architecture details
- Added section on .env state sharing
- Updated extractor documentation
- Added execution order section
2025-11-03 20:12:01 +00:00
Claude
7d92a2079a Add comprehensive Chrome CDP setup guide 2025-11-03 19:11:57 +00:00
Claude
ee1db04b73 Update extractors to use Puppeteer with Chrome DevTools Protocol
Changed screenshot, title, and headers extractors from various
implementations to use Puppeteer connecting to Chrome via CDP.

Key changes:
- All three extractors now use puppeteer-core
- Connect to Chrome via CHROME_CDP_URL environment variable
- Shared browser instance across all extractors for efficiency
- Added puppeteer-core as dependency (npm install)
- Removed auto-install logic (cleaner, more predictable)
- Better error messages when CHROME_CDP_URL not set

Benefits:
- Single Chrome instance for all extractors (better performance)
- Consistent browser environment across extractors
- Can use remote/containerized Chrome
- Better for production deployments

Breaking changes:
- CHROME_CDP_URL environment variable now required for:
  - screenshot extractor
  - title extractor
  - headers extractor
- Users must start Chrome with remote debugging:
  chrome --remote-debugging-port=9222 --headless
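
As a rough illustration of the CHROME_CDP_URL requirement and the clearer error message mentioned above, an extractor could validate the variable before connecting. This is a sketch, not the extractors' actual code: `cdpConnectTarget` is a hypothetical helper, and the commit does not say whether the variable holds an HTTP debugging URL or a WebSocket endpoint, so both are accepted here. The returned keys match the `browserURL` and `browserWSEndpoint` options of `puppeteer.connect()`.

```typescript
// Hypothetical helper: turn CHROME_CDP_URL into options for puppeteer.connect().
type ConnectTarget = { browserWSEndpoint: string } | { browserURL: string };

function cdpConnectTarget(
  env: Record<string, string | undefined>,
  extractorName: string,
): ConnectTarget {
  const url = (env.CHROME_CDP_URL ?? "").trim();
  if (url === "") {
    throw new Error(
      `${extractorName}: CHROME_CDP_URL is not set. ` +
        "Start Chrome with --remote-debugging-port=9222 and export CHROME_CDP_URL.",
    );
  }
  // ws:// or wss:// is treated as a browser WebSocket endpoint.
  if (/^wss?:\/\//.test(url)) return { browserWSEndpoint: url };
  // http:// or https:// is treated as the debugging HTTP address.
  if (/^https?:\/\//.test(url)) return { browserURL: url };
  throw new Error(
    `${extractorName}: CHROME_CDP_URL must start with ws://, wss://, http:// or https:// (got "${url}")`,
  );
}

// Illustrative value only; port 9222 comes from the command above.
const exampleTarget = cdpConnectTarget({ CHROME_CDP_URL: "http://localhost:9222" }, "title");
console.log(exampleTarget);
```

With puppeteer-core installed, the result could be passed directly as `puppeteer.connect(exampleTarget)`.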

Updated documentation:
- README.md with Chrome setup instructions
- Added section on Chrome DevTools Protocol setup
- Added Docker setup example
- Updated extractor documentation with CDP requirements
2025-11-03 19:10:55 +00:00
Claude
f4bb10bdae Add TypeScript-based ArchiveBox implementation with simplified architecture
This commit introduces archivebox-ts, a TypeScript reimplementation of
ArchiveBox with a simplified, modular architecture.

Key features:
- Standalone executable extractors (bash, Node.js, Python with shebang)
- Auto-installing dependencies per extractor
- Simple interface: URL as $1 CLI arg, output to current directory
- Environment variable-based configuration only
- SQLite database with schema matching original ArchiveBox
- Language-agnostic extractor system
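
The extractor interface listed above (URL as `$1`, output written to the current directory, configuration through environment variables) maps naturally onto a process spawn. The sketch below only builds the invocation description rather than running anything. `buildInvocation`, the file paths, and the `TIMEOUT` key are hypothetical and are not taken from src/extractors.ts.

```typescript
// Hypothetical description of how an orchestrator could invoke one extractor.
interface ExtractorInvocation {
  command: string;             // path to the executable extractor (any language, via shebang)
  args: string[];              // the URL is the only positional argument ($1)
  cwd: string;                 // output lands in the current directory
  env: Record<string, string>; // configuration is passed only through the environment
}

function buildInvocation(
  extractorPath: string,
  url: string,
  outputDir: string,
  config: Record<string, string>,
): ExtractorInvocation {
  return {
    command: extractorPath,
    args: [url],
    cwd: outputDir,
    env: { ...config }, // copy so later changes to config do not leak in
  };
}

// Illustrative paths and settings only.
const invocation = buildInvocation(
  "./extractors/favicon",
  "https://example.com",
  "./data/archive/example/favicon",
  { TIMEOUT: "30" },
);
console.log(invocation.command, invocation.args[0], invocation.cwd);
```

In Node.js these fields correspond to `child_process.spawn(command, args, { cwd, env })`, which keeps the orchestrator independent of the language each extractor is written in.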

Core components:
- src/cli.ts: Main CLI with Commander.js (init, add, list, status, extractors)
- src/db.ts: SQLite operations using better-sqlite3
- src/models.ts: TypeScript interfaces matching database schema
- src/extractors.ts: Extractor discovery and orchestration

Sample extractors included:
- favicon: Download site favicon (bash + curl)
- title: Extract page title (Node.js)
- headers: Extract HTTP headers (bash + curl)
- wget: Full page download with WARC (bash + wget)
- screenshot: Capture screenshot (Python + Playwright)

Documentation:
- README.md: Architecture overview and usage
- QUICKSTART.md: 5-minute getting started guide
- EXTRACTOR_GUIDE.md: Comprehensive extractor development guide
- ARCHITECTURE.md: Design decisions and implementation details

Tested and working:
- Database initialization
- URL archiving with multiple extractors
- Parallel extractor execution
- Result tracking and status reporting
- All CLI commands functional
2025-11-03 18:16:57 +00:00
Nick Sweeting
c3024815f3 Add link to Proxmox installer (#1682) 2025-05-19 15:29:45 -07:00
Nelson Minar
f72f04768c Add link to Proxmox installer 2025-05-11 11:10:20 -07:00
Nick Sweeting
d93f32ab24 fix(export_browser_history): tilde doesn't expand in quotes (#1661)

# Summary

Patch submitted by @pcrockett

# Related issues

- Fixes
https://github.com/ArchiveBox/ArchiveBox/issues/1657#issue-2856003985

# Changes these areas

- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
2025-03-20 16:09:40 -07:00
Nick Sweeting
8b67186c93 make sure uv is using the right python binary 2025-03-20 16:04:58 -07:00
Nick Sweeting
26eb75e4e6 archivebox swag is now available! 2025-03-20 15:52:56 -07:00
Nick Sweeting
d9d67e9864 add swag link to funding links 2025-03-20 15:51:20 -07:00
Nick Sweeting
1ab4e06a15 remove dead competitor links 2025-03-19 19:22:35 -07:00
Philip Crockett
ba6a8c2da5 support XDG standard, search for chrome and chromium DBs 2025-02-18 21:38:52 +01:00
Philip Crockett
639aa7242b fix typo 2025-02-18 21:22:52 +01:00
Phil Crockett
9fbc2d3818 fix chrome browser history export on Linux 2025-02-18 21:08:56 +01:00
Phil Crockett
58bf8d07e1 feat(export_browser_history): add linux support for firefox 2025-02-16 10:24:37 +01:00
Phil Crockett
feded9e3d4 fix(export_browser_history): fix sqlite quote syntax error 2025-02-16 10:24:13 +01:00
Phil Crockett
2e1ac0409d feat(export_browser_history): fail script when errors occur 2025-02-16 08:34:41 +01:00
Phil Crockett
2ff3fc434e feat(export_browser_history): basic arg parsing error message 2025-02-16 08:31:21 +01:00
Phil Crockett
0043b59bc8 fix(export_browser_history): tilde doesn't expand in quotes 2025-02-16 08:22:17 +01:00
Nick Sweeting
a27a91bbaa Update README.md 2025-02-13 02:45:52 -05:00
Nick Sweeting
3ae30c43a9 Update README.md 2025-02-13 02:37:41 -05:00
Nick Sweeting
37c0ea7eba Kill the timer process if it doesn't properly terminate. (#1649) 2025-02-05 19:06:19 -05:00
Ben Muthalaly
71c02ca4eb Update archivebox/misc/logging_util.py
Co-authored-by: Nick Sweeting <git@sweeting.me>
2025-02-05 17:55:45 -06:00
Ben Muthalaly
9f4cf0a8e1 Kill the timer process if it doesn't properly terminate. 2025-02-03 02:47:33 -06:00
Nick Sweeting
12f109b1be Update docker-compose.yml minor tweaks 2025-01-18 04:20:21 -05:00
Nick Sweeting
6edcac6a40 Fix two small errors in abx-{readwise,spec-config} (#1635) 2025-01-17 17:17:36 -05:00
ckie
952bde6cfa spec-config: fix CONSTANTS import
I was getting:
ImportError: cannot import name 'CONSTANTS' from partially initialized module 'archivebox' (most likely due to a circular import)
(/nix/store/6fy0wgy7r3ld3k590kxgxrc0r1cca347-archivebox-0.8.6rc3/lib/python3.12/site-packages/archivebox/__init__.py)
2025-01-17 21:02:53 +02:00
ckie
58fc6d9cf8 readwise: fix SOURCES_DIR syntax
Fixes: AttributeError: 'list' object has no attribute 'SOURCES_DIR'
2025-01-17 21:02:27 +02:00
Nick Sweeting
aa55e0d02e Update 2-feature_request.yml 2025-01-08 19:20:50 -05:00
Nick Sweeting
e1c443aac4 Update 2-feature_request.yml 2025-01-08 19:19:04 -05:00
Nick Sweeting
d1c8acd3ff Update 1-bug_report.yml 2025-01-08 19:15:21 -05:00
Nick Sweeting
fd21728732 Update 1-bug_report.yml 2025-01-08 19:12:46 -05:00
Nick Sweeting
b93918f926 Update 1-bug_report.yml 2025-01-08 19:12:18 -05:00
Nick Sweeting
ba5380f60b Update 1-bug_report.yml 2025-01-08 19:11:23 -05:00
Nick Sweeting
7ba7ad6b3e Update 1-bug_report.yml 2025-01-08 19:10:47 -05:00
Nick Sweeting
91eb3472e3 Update 1-bug_report.yml 2025-01-08 19:09:12 -05:00
Nick Sweeting
b28f2e704c Update 1-bug_report.yml fix markdown formatting 2025-01-08 19:07:38 -05:00
Nick Sweeting
62a99c88d2 clarify filesystems selections in bug report github template 2025-01-08 19:05:41 -05:00
Nick Sweeting
765abc9d5a Update pip.yml 2025-01-08 18:53:13 -05:00
Nick Sweeting
83bb8a211a Remove outdated architecture diagram 2025-01-08 18:50:46 -05:00
Nick Sweeting
55a347c32e Update file_migrations.py 2025-01-02 23:58:59 -08:00
Nick Sweeting
a851ad4c87 Update models.py 2025-01-02 23:58:45 -08:00
Nick Sweeting
96c5d2f7de Update statemachines.py 2025-01-02 23:58:32 -08:00
Nick Sweeting
b74b0d23b4 Fix typo in timestamp scale factor (#1627) 2024-12-26 01:57:08 -05:00
1over137
3312a34b39 Fix typo in timestamp scale factor 2024-12-25 11:50:40 +00:00
Nick Sweeting
1fb5ecf13d change pip flow to use PAT 2024-12-18 18:55:29 -08:00
Nick Sweeting
46f4a90a2a install needed packages to run archivebox during pip build 2024-12-18 18:39:58 -08:00
Nick Sweeting
e862031981 use uv to build pip package in github actions instead of pdm 2024-12-18 18:38:25 -08:00
Nick Sweeting
b78e892bf8 update github actions to build docker image 2024-12-18 18:19:35 -08:00
Nick Sweeting
baa3be7525 ignore requirements.txt now that it's not needed 2024-12-18 18:09:56 -08:00