new gallerydl plugin and more

This commit is contained in:
Nick Sweeting
2025-12-26 11:55:03 -08:00
parent 9838d7ba02
commit 4fd7fcdbcf
20 changed files with 3495 additions and 1435 deletions

View File

@@ -6,7 +6,10 @@
"Bash(xargs:*)",
"Bash(python -c:*)",
"Bash(printf:*)",
"Bash(pkill:*)"
"Bash(pkill:*)",
"Bash(python3:*)",
"Bash(sqlite3:*)",
"WebFetch(domain:github.com)"
]
}
}

View File

@@ -1,300 +0,0 @@
# JS Implementation Features to Port to Python ArchiveBox
## Priority: High Impact Features
### 1. **Screen Recording** ⭐⭐⭐
**JS Implementation:** Captures MP4 video + animated GIF of the archiving session
```javascript
// Records browser activity including scrolling, interactions
// PuppeteerScreenRecorder  -> screenrecording.mp4
// ffmpeg conversion        -> screenrecording.gif (first 10s, optimized)
```
**Enhancement for Python:**
- Add `on_Snapshot__24_screenrecording.py`
- Use puppeteer or playwright screen recording APIs
- Generate both full MP4 and thumbnail GIF
- **Value:** Visual proof of what was captured, useful for QA and debugging
### 2. **AI Quality Assurance** ⭐⭐⭐
**JS Implementation:** Uses GPT-4o to analyze screenshots and validate archive quality
`ai_qa.py` analyzes `screenshot.png` and returns:
```json
{
  "pct_visible": 85,
  "warnings": ["Some content may be cut off"],
  "main_content_title": "Article Title",
  "main_content_author": "Author Name",
  "main_content_date": "2024-01-15",
  "website_brand_name": "Example.com"
}
```
**Enhancement for Python:**
- Add `on_Snapshot__95_aiqa.py` (runs after screenshot)
- Integrate with OpenAI API or local vision models
- Validates: content visibility, broken layouts, CAPTCHA blocks, error pages
- **Value:** Automatic detection of failed archives, quality scoring
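A minimal sketch of what such an `aiqa` hook could look like, assuming the `openai` Python SDK (v1+) with `OPENAI_API_KEY` set in the environment; the model name, prompt, and `aiqa.json` output filename are illustrative, not part of the current codebase:
```python
#!/usr/bin/env python3
# Hypothetical on_Snapshot__95_aiqa.py sketch (assumes `pip install openai` and OPENAI_API_KEY).
import base64
import sys
from pathlib import Path

from openai import OpenAI

PROMPT = (
    'You are QA-checking a web page archive. Given this screenshot, return JSON with keys: '
    'pct_visible (0-100), warnings (list of strings), main_content_title, '
    'main_content_author, main_content_date, website_brand_name.'
)

def main() -> int:
    screenshot = Path('screenshot.png')   # produced earlier by the screenshot extractor
    if not screenshot.exists():
        print('no screenshot.png to analyze', file=sys.stderr)
        return 1

    b64 = base64.b64encode(screenshot.read_bytes()).decode()
    client = OpenAI()                      # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model='gpt-4o',                    # any vision-capable model
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': PROMPT},
                {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
            ],
        }],
    )
    Path('aiqa.json').write_text(resp.choices[0].message.content or '{}')
    print('wrote aiqa.json')
    return 0

if __name__ == '__main__':
    sys.exit(main())
```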
### 3. **Network Response Archiving** ⭐⭐⭐
**JS Implementation:** Saves ALL network responses in organized structure
```
responses/
├── all/ # Timestamped unique files
│ ├── 20240101120000__GET__https%3A%2F%2Fexample.com%2Fapi.json
│ └── ...
├── script/ # Organized by resource type
│ └── example.com/path/to/script.js → ../all/...
├── stylesheet/
├── image/
├── media/
└── index.jsonl # Searchable index
```
**Enhancement for Python:**
- Add `on_Snapshot__23_responses.py`
- Save all HTTP responses (XHR, images, scripts, etc.)
- Create both timestamped and URL-organized views via symlinks
- Generate `index.jsonl` with metadata (URL, method, status, mimeType, sha256)
- **Value:** Complete HTTP-level archive, better debugging, API response preservation
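As a rough sketch of the bookkeeping half (the CDP capture itself is omitted), a hypothetical `on_Snapshot__23_responses.py` could append one `index.jsonl` line per saved response; the `record_response()` helper and filename scheme here are assumptions modeled on the layout above:

```python
# Sketch of the index.jsonl bookkeeping for a hypothetical on_Snapshot__23_responses.py.
# Assumes the response bodies have already been fetched (e.g. via CDP Network events).
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import quote

def record_response(url: str, method: str, status: int, mime: str, body: bytes,
                    responses_dir: Path = Path('responses')) -> Path:
    """Write one response body under responses/all/ and append a line to index.jsonl."""
    all_dir = responses_dir / 'all'
    all_dir.mkdir(parents=True, exist_ok=True)

    ts = datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')
    fname = f'{ts}__{method}__{quote(url, safe="")}'[:200]   # keep filenames bounded
    out_path = all_dir / fname
    out_path.write_bytes(body)

    entry = {
        'url': url,
        'method': method,
        'status': status,
        'mimeType': mime,
        'sha256': hashlib.sha256(body).hexdigest(),
        'path': str(out_path.relative_to(responses_dir)),
    }
    with open(responses_dir / 'index.jsonl', 'a') as f:
        f.write(json.dumps(entry) + '\n')
    return out_path

if __name__ == '__main__':
    record_response('https://example.com/api.json', 'GET', 200, 'application/json', b'{"ok": true}')
```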
### 4. **Detailed Metadata Extractors** ⭐⭐
#### 4a. SSL/TLS Details (`on_Snapshot__16_ssl.py`)
```python
{
    "protocol": "TLS 1.3",
    "cipher": "AES_128_GCM",
    "securityState": "secure",
    "securityDetails": {
        "issuer": "Let's Encrypt",
        "validFrom": ...,
        "validTo": ...
    }
}
```
#### 4b. SEO Metadata (`on_Snapshot__17_seo.py`)
Extracts all `<meta>` tags:
```python
{
    "og:title": "Page Title",
    "og:image": "https://example.com/image.jpg",
    "twitter:card": "summary_large_image",
    "description": "Page description",
    ...
}
```
#### 4c. Accessibility Tree (`on_Snapshot__18_accessibility.py`)
```python
{
    "headings": ["# Main Title", "## Section 1", ...],
    "iframes": ["https://embed.example.com/..."],
    "tree": { ... }  # Full accessibility snapshot
}
```
#### 4d. Outlinks Categorization (`on_Snapshot__19_outlinks.py`)
Improves on the current implementation by categorizing links by type:
```python
{
    "hrefs": [...],             # All <a> links
    "images": [...],            # <img src>
    "css_stylesheets": [...],   # <link rel=stylesheet>
    "js_scripts": [...],        # <script src>
    "iframes": [...],           # <iframe src>
    "css_images": [...],        # background-image: url()
    "links": [{...}]            # <link> tags (rel, href)
}
```
#### 4e. Redirects Chain (`on_Snapshot__15_redirects.py`)
Tracks full redirect sequence:
```python
{
    "redirects_from_http": [
        {"url": "http://ex.com", "status": 301, "isMainFrame": True},
        {"url": "https://ex.com", "status": 302, "isMainFrame": True},
        {"url": "https://www.ex.com", "status": 200, "isMainFrame": True}
    ]
}
```
**Value:** Rich metadata for research, SEO analysis, security auditing
### 5. **Enhanced Screenshot System** ⭐⭐
**JS Implementation:**
- `screenshot.png` - Full-page PNG at high resolution (4:3 ratio)
- `screenshot.jpg` - Compressed JPEG for thumbnails (1440x1080, 90% quality)
- Automatically crops to reasonable height for long pages
**Enhancement for Python:**
- Update `screenshot` extractor to generate both formats
- Use aspect ratio optimization (4:3 is better for thumbnails than 16:9)
- **Value:** Faster loading thumbnails, better storage efficiency
### 6. **Console Log Capture** ⭐⭐
**JS Implementation:**
```
console.log - Captures all console output
ERROR /path/to/script.js:123 "Uncaught TypeError: ..."
WARNING https://example.com/api Failed to load resource: net::ERR_BLOCKED_BY_CLIENT
```
**Enhancement for Python:**
- Add `on_Snapshot__20_consolelog.py`
- Useful for debugging JavaScript errors, tracking blocked resources
- **Value:** Identifies rendering issues, ad blockers, CORS problems
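A minimal sketch of such a hook using Playwright (an assumption; the existing extractors may drive Chrome differently), collecting console messages and uncaught page errors into a `console.log` file:

```python
# Hypothetical on_Snapshot__20_consolelog.py sketch
# (assumes `pip install playwright` and `playwright install chromium`).
import sys
from playwright.sync_api import sync_playwright

def main(url: str) -> None:
    lines = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Collect console messages and uncaught page errors as they happen
        page.on('console', lambda msg: lines.append(f'{msg.type.upper()} {msg.location.get("url", "")} {msg.text}'))
        page.on('pageerror', lambda err: lines.append(f'ERROR {err}'))
        page.goto(url, wait_until='networkidle')
        browser.close()
    with open('console.log', 'w') as f:
        f.write('\n'.join(lines) + '\n')

if __name__ == '__main__':
    main(sys.argv[1])
```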
## Priority: Nice-to-Have Enhancements
### 7. **Request/Response Headers** ⭐
**Current:** Headers extractor exists but could be enhanced
**JS Enhancement:** Separates request vs response, includes extra headers
### 8. **Human Behavior Emulation** ⭐
**JS Implementation:**
- Mouse jiggling with ghost-cursor
- Smart scrolling with infinite scroll detection
- Comment expansion (Reddit, HackerNews, etc.)
- Form submission
- CAPTCHA solving via 2captcha extension
**Enhancement for Python:**
- Add `on_Snapshot__05_human_behavior.py` (runs BEFORE other extractors)
- Implement scrolling, clicking "Load More", expanding comments
- **Value:** Captures more content from dynamic sites
### 9. **CAPTCHA Solving** ⭐
**JS Implementation:** Integrates 2captcha extension
**Enhancement:** Add optional CAPTCHA solving via 2captcha API
**Value:** Access to Cloudflare-protected sites
### 10. **Source Map Downloading**
**JS Implementation:** Automatically downloads `.map` files for JS/CSS
**Enhancement:** Add `on_Snapshot__30_sourcemaps.py`
**Value:** Helps debug minified code
### 11. **Pandoc Markdown Conversion**
**JS Implementation:** Converts HTML ↔ Markdown using Pandoc
```bash
pandoc --from html --to markdown_github --wrap=none
```
**Enhancement:** Add `on_Snapshot__34_pandoc.py`
**Value:** Human-readable Markdown format
### 12. **Authentication Management** ⭐
**JS Implementation:**
- Sophisticated cookie storage with `cookies.txt` export
- LocalStorage + SessionStorage preservation
- Merge new cookies with existing ones (no overwrites)
**Enhancement:**
- Improve `auth.json` management to match JS sophistication
- Add `cookies.txt` export (Netscape format) for compatibility with wget/curl
- **Value:** Better session persistence across runs
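For the `cookies.txt` export, the Netscape format is just seven tab-separated fields per cookie; a sketch follows (the cookie-dict shape is assumed to follow CDP/Playwright conventions):

```python
# Sketch of exporting browser cookies to Netscape cookies.txt so wget/curl can reuse the session.
# Assumes `cookies` is a list of dicts shaped like CDP/Playwright cookie objects.
from pathlib import Path

def write_netscape_cookies(cookies: list[dict], out_path: Path = Path('cookies.txt')) -> None:
    lines = ['# Netscape HTTP Cookie File']
    for c in cookies:
        domain = c['domain']
        include_subdomains = 'TRUE' if domain.startswith('.') else 'FALSE'
        secure = 'TRUE' if c.get('secure') else 'FALSE'
        expires = str(int(c.get('expires') or 0))   # 0 = session cookie
        lines.append('\t'.join([
            domain, include_subdomains, c.get('path', '/'),
            secure, expires, c['name'], c['value'],
        ]))
    out_path.write_text('\n'.join(lines) + '\n')

write_netscape_cookies([
    {'name': 'sessionid', 'value': 'abc123', 'domain': '.example.com', 'path': '/',
     'expires': 1767225600, 'secure': True},
])
```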
### 13. **File Integrity & Versioning** ⭐⭐
**JS Implementation:**
- SHA256 hash for every file
- Merkle tree directory hashes
- Version directories (`versions/YYYYMMDDHHMMSS/`)
- Symlinks to latest versions
- `.files.json` manifest with metadata
**Enhancement:**
- Add `on_Snapshot__99_integrity.py` (runs last)
- Generate SHA256 hashes for all outputs
- Create version manifests
- **Value:** Verify archive integrity, detect corruption, track changes
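A minimal sketch of the hashing portion of a hypothetical `on_Snapshot__99_integrity.py`, using only the stdlib `hashlib`; the `.files.json` field names are assumptions and the Merkle/versioning layers are omitted:

```python
# Hash every output file in the snapshot dir and write a .files.json manifest.
import hashlib
import json
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(snapshot_dir: Path = Path('.')) -> None:
    manifest = {}
    for path in sorted(snapshot_dir.rglob('*')):
        if path.is_file() and path.name != '.files.json':
            stat = path.stat()
            manifest[str(path.relative_to(snapshot_dir))] = {
                'sha256': sha256_file(path),
                'size': stat.st_size,
                'mtime': int(stat.st_mtime),
            }
    (snapshot_dir / '.files.json').write_text(json.dumps(manifest, indent=2))

if __name__ == '__main__':
    write_manifest()
```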
### 14. **Directory Organization**
**JS Structure (superior):**
```
archive/<timestamp>/
├── versions/
│ ├── 20240101120000/ # Each run = new version
│ │ ├── screenshot.png
│ │ ├── singlefile.html
│ │ └── ...
│ └── 20240102150000/
├── screenshot.png → versions/20240102150000/screenshot.png # Symlink to latest
├── singlefile.html → ...
└── metrics.json
```
**Current Python:** All outputs in flat structure
**Enhancement:** Add versioning layer for tracking changes over time
### 15. **Speedtest Integration**
**JS Implementation:** Runs fast.com speedtest once per day
**Enhancement:** Optional `on_Snapshot__01_speedtest.py`
**Value:** Diagnose slow archives, track connection quality
### 16. **gallery-dl Support** ⭐
**JS Implementation:** Downloads photo galleries (Instagram, Twitter, etc.)
**Enhancement:** Add `on_Snapshot__30_photos.py` alongside existing `media` extractor
**Value:** Better support for image-heavy sites
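A minimal sketch of such a hook, assuming the `gallery-dl` CLI is installed; with no flags it writes into `./gallery-dl/` under the current working directory, which here would be the snapshot output dir:

```python
#!/usr/bin/env python3
# Hypothetical on_Snapshot__30_photos.py sketch that shells out to the gallery-dl CLI.
import argparse
import subprocess
import sys

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('--url', required=True)
    parser.add_argument('--snapshot-id', required=True)
    args = parser.parse_args()

    # gallery-dl downloads into ./gallery-dl/ relative to the cwd by default
    result = subprocess.run(['gallery-dl', args.url], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
        return result.returncode
    print(f'gallery-dl finished for {args.url}')
    return 0

if __name__ == '__main__':
    sys.exit(main())
```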
## Implementation Priority Ranking
### Must-Have (High ROI):
1. **Network Response Archiving** - Complete HTTP archive
2. **AI Quality Assurance** - Automatic validation
3. **Screen Recording** - Visual proof of capture
4. **Enhanced Metadata** (SSL, SEO, Accessibility, Outlinks) - Research value
### Should-Have (Medium ROI):
5. **Console Log Capture** - Debugging aid
6. **File Integrity Hashing** - Archive verification
7. **Enhanced Screenshots** - Better thumbnails
8. **Versioning System** - Track changes over time
### Nice-to-Have (Lower ROI):
9. **Human Behavior Emulation** - Dynamic content
10. **CAPTCHA Solving** - Access restricted sites
11. **gallery-dl** - Image collections
12. **Pandoc Markdown** - Readable format
## Technical Considerations
### Dependencies Needed:
- **Screen Recording:** `playwright` or `puppeteer` with recording API
- **AI QA:** `openai` Python SDK or local vision model
- **Network Archiving:** CDP protocol access (already have via Chrome)
- **File Hashing:** Built-in `hashlib` (no new deps)
- **gallery-dl:** Install via pip
### Performance Impact:
- Screen recording: +2-3 seconds overhead per snapshot
- AI QA: +0.5-2 seconds (API call) per snapshot
- Response archiving: Minimal (async writes)
- File hashing: +0.1-0.5 seconds per snapshot
- Metadata extraction: Minimal (same page visit)
### Architecture Compatibility:
All proposed enhancements fit the existing hook-based plugin architecture:
- Use standard `on_Snapshot__NN_name.py` naming
- Return `ExtractorResult` objects
- Can reuse shared Chrome CDP sessions
- Follow existing error handling patterns
## Summary Statistics
**JS Implementation:**
- 35+ output types
- ~3000 lines of archiving logic
- Extensive quality assurance
- Complete HTTP-level capture
**Current Python Implementation:**
- 12 extractors
- Strong foundation with room for enhancement
**Recommended Additions:**
- **8 new high-priority extractors**
- **6 enhanced versions of existing extractors**
- **3 optional nice-to-have extractors**
This would bring the Python implementation to feature parity with the JS version while maintaining better code organization and the existing plugin architecture.

View File

@@ -1,819 +0,0 @@
# ArchiveBox 2025 Simplification Plan
**Status:** FINAL - Ready for implementation
**Last Updated:** 2024-12-24
---
## Final Decisions Summary
| Decision | Choice |
|----------|--------|
| Task Queue | Keep `retry_at` polling pattern (no Django Tasks) |
| State Machine | Preserve current semantics; only replace mixins/statemachines if identical retry/lock guarantees are kept |
| Event Model | Remove completely |
| ABX Plugin System | Remove entirely (`archivebox/pkgs/`) |
| abx-pkg | Keep as external pip dependency (separate repo: github.com/ArchiveBox/abx-pkg) |
| Binary Providers | File-based plugins using abx-pkg internally |
| Search Backends | **Hybrid:** hooks for indexing, Python classes for querying |
| Auth Methods | Keep simple (LDAP + normal), no pluginization needed |
| ABID | Already removed (ignore old references) |
| ArchiveResult | **Keep pre-creation** with `status=queued` + `retry_at` for consistency |
| Plugin Directory | **`archivebox/plugins/*`** for built-ins, **`data/plugins/*`** for user hooks (flat `on_*__*.*` files) |
| Locking | Use `retry_at` consistently across Crawl, Snapshot, ArchiveResult |
| Worker Model | **Separate processes** per model type + per extractor, visible in htop |
| Concurrency | **Per-extractor configurable** (e.g., `ytdlp_max_parallel=5`) |
| InstalledBinary | **Keep model** + add Dependency model for audit trail |
---
## Architecture Overview
### Consistent Queue/Lock Pattern
All models (Crawl, Snapshot, ArchiveResult) use the same pattern:
```python
from datetime import timedelta

from django.db import models
from django.utils import timezone

class StatusMixin(models.Model):
    status = models.CharField(max_length=15, db_index=True)
    retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)

    class Meta:
        abstract = True

    def tick(self) -> bool:
        """Override in subclass. Returns True if state changed."""
        raise NotImplementedError

# Worker query (same for all models):
Model.objects.filter(
    status__in=['queued', 'started'],
    retry_at__lte=timezone.now()
).order_by('retry_at').first()

# Claim (atomic via optimistic locking):
updated = Model.objects.filter(
    id=obj.id,
    retry_at=obj.retry_at
).update(
    retry_at=timezone.now() + timedelta(seconds=60)
)
if updated == 1:  # Successfully claimed
    obj.refresh_from_db()
    obj.tick()
```
**Failure/cleanup guarantees**
- Objects stuck in `started` with a past `retry_at` must be reclaimed automatically using the existing retry/backoff rules.
- `tick()` implementations must continue to bump `retry_at` / transition to `backoff` the same way current statemachines do so that failures get retried without manual intervention.
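For illustration, a sketch of one way a worker could honor the reclaim guarantee for stuck objects (the 60-second delay is an assumption; the real schedule must mirror the current statemachines' retry rules):

```python
# Illustrative reclaim sketch; delay and field usage must match the existing retry/backoff rules.
from datetime import timedelta
from django.utils import timezone

def reclaim_if_stuck(obj, delay_seconds: int = 60) -> None:
    """Requeue an object left in 'started' whose retry_at lease has expired."""
    if obj.status == 'started' and obj.retry_at and obj.retry_at <= timezone.now():
        obj.status = 'queued'
        obj.retry_at = timezone.now() + timedelta(seconds=delay_seconds)
        obj.save(update_fields=['status', 'retry_at'])
```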
### Process Tree (Separate Processes, Visible in htop)
```
archivebox server
├── orchestrator (pid=1000)
│ ├── crawl_worker_0 (pid=1001)
│ ├── crawl_worker_1 (pid=1002)
│ ├── snapshot_worker_0 (pid=1003)
│ ├── snapshot_worker_1 (pid=1004)
│ ├── snapshot_worker_2 (pid=1005)
│ ├── wget_worker_0 (pid=1006)
│ ├── wget_worker_1 (pid=1007)
│ ├── ytdlp_worker_0 (pid=1008) # Limited concurrency
│ ├── ytdlp_worker_1 (pid=1009)
│ ├── screenshot_worker_0 (pid=1010)
│ ├── screenshot_worker_1 (pid=1011)
│ ├── screenshot_worker_2 (pid=1012)
│ └── ...
```
**Configurable per-extractor concurrency:**
```python
# archivebox.conf or environment
WORKER_CONCURRENCY = {
    'crawl': 2,
    'snapshot': 3,
    'wget': 2,
    'ytdlp': 2,        # Bandwidth-limited
    'screenshot': 3,
    'singlefile': 2,
    'title': 5,        # Fast, can run many
    'favicon': 5,
}
```
---
## Hook System
### Discovery (Glob at Startup)
```python
# archivebox/hooks.py
from pathlib import Path
import subprocess
import os
import json
from django.conf import settings
BUILTIN_PLUGIN_DIR = Path(__file__).parent.parent / 'plugins'
USER_PLUGIN_DIR = settings.DATA_DIR / 'plugins'
def discover_hooks(event_name: str) -> list[Path]:
    """Find all scripts matching on_{EventName}__*.{sh,py,js} under archivebox/plugins/* and data/plugins/*"""
    hooks = []
    for base in (BUILTIN_PLUGIN_DIR, USER_PLUGIN_DIR):
        if not base.exists():
            continue
        for ext in ('sh', 'py', 'js'):
            hooks.extend(base.glob(f'*/on_{event_name}__*.{ext}'))
    return sorted(hooks)

def run_hook(script: Path, output_dir: Path, **kwargs) -> dict:
    """Execute hook with --key=value args, cwd=output_dir."""
    args = [str(script)]
    for key, value in kwargs.items():
        args.append(f'--{key.replace("_", "-")}={json.dumps(value, default=str)}')
    env = os.environ.copy()
    env['ARCHIVEBOX_DATA_DIR'] = str(settings.DATA_DIR)
    result = subprocess.run(
        args,
        cwd=output_dir,
        capture_output=True,
        text=True,
        timeout=300,
        env=env,
    )
    return {
        'returncode': result.returncode,
        'stdout': result.stdout,
        'stderr': result.stderr,
    }
```
### Hook Interface
- **Input:** CLI args `--url=... --snapshot-id=...`
- **Location:** Built-in hooks in `archivebox/plugins/<plugin>/on_*__*.*`, user hooks in `data/plugins/<plugin>/on_*__*.*`
- **Internal API:** Hooks should treat ArchiveBox as an external CLI (call `archivebox config --get ...`, `archivebox find ...`) and only import `abx-pkg` when running in their own venvs.
- **Output:** Files written to `$PWD` (the output_dir), can call `archivebox create ...`
- **Logging:** stdout/stderr captured to ArchiveResult
- **Exit code:** 0 = success, non-zero = failure
---
## Unified Config Access
- Implement `archivebox.config.get_config(scope='global'|'crawl'|'snapshot'|...)` that merges defaults, config files, environment variables, DB overrides, and per-object config (seed/crawl/snapshot).
- Provide helpers (`get_config()`, `get_flat_config()`) for Python callers so `abx.pm.hook.get_CONFIG*` can be removed.
- Ensure the CLI command `archivebox config --get KEY` (and a machine-readable `--format=json`) uses the same API so hook scripts can query config via subprocess calls.
- Document that plugin hooks should prefer the CLI to fetch config rather than importing Django internals, guaranteeing they work from shell/bash/js without ArchiveBox's runtime.
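A sketch of the intended layering (the function signature and layer names below are illustrative, not final):

```python
# Minimal sketch of the proposed archivebox.config.get_config() precedence order.
import os

def get_config(defaults: dict, file_config: dict, db_config: dict,
               crawl=None, snapshot=None) -> dict:
    """Merge config layers from lowest to highest precedence."""
    merged = {}
    for layer in (
        defaults,                                               # hardcoded defaults
        file_config,                                            # ArchiveBox.conf
        db_config,                                              # admin-editable DB overrides
        {k: v for k, v in os.environ.items() if k.isupper()},   # environment variables (illustrative filter)
        getattr(crawl, 'config', None) or {},                   # per-crawl config JSON
        getattr(snapshot, 'config', None) or {},                # per-snapshot config JSON
    ):
        merged.update(layer)
    return merged

# Hook scripts query the same data via the CLI instead of importing Django:
#   archivebox config --get SAVE_WGET --format=json
```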
---
### Example Extractor Hooks
**Bash:**
```bash
#!/usr/bin/env bash
# plugins/on_Snapshot__wget.sh
set -e
# Parse args
for arg in "$@"; do
    case $arg in
        --url=*) URL="${arg#*=}" ;;
        --snapshot-id=*) SNAPSHOT_ID="${arg#*=}" ;;
    esac
done
# Find wget binary
WGET=$(archivebox find InstalledBinary --name=wget --format=abspath)
[ -z "$WGET" ] && echo "wget not found" >&2 && exit 1
# Run extraction (writes to $PWD)
$WGET --mirror --page-requisites --adjust-extension "$URL" 2>&1
echo "Completed wget mirror of $URL"
```
**Python:**
```python
#!/usr/bin/env python3
# plugins/on_Snapshot__singlefile.py
import argparse
import subprocess
import sys
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--url', required=True)
    parser.add_argument('--snapshot-id', required=True)
    args = parser.parse_args()

    # Find binary via CLI
    result = subprocess.run(
        ['archivebox', 'find', 'InstalledBinary', '--name=single-file', '--format=abspath'],
        capture_output=True, text=True
    )
    bin_path = result.stdout.strip()
    if not bin_path:
        print("single-file not installed", file=sys.stderr)
        sys.exit(1)

    # Run extraction (writes to $PWD)
    subprocess.run([bin_path, args.url, '--output', 'singlefile.html'], check=True)
    print(f"Saved {args.url} to singlefile.html")

if __name__ == '__main__':
    main()
```
---
## Binary Providers & Dependencies
- Move dependency tracking into a dedicated `dependencies` module (or extend `archivebox/machine/`) with two Django models:
```yaml
Dependency:
  id: uuidv7
  bin_name: extractor binary executable name (ytdlp|wget|screenshot|...)
  bin_provider: apt | brew | pip | npm | gem | nix | '*' for any
  custom_cmds: JSON of provider->install command overrides (optional)
  config: JSON of env vars/settings to apply during install
  created_at: utc datetime

InstalledBinary:
  id: uuidv7
  dependency: FK to Dependency
  bin_name: executable name again
  bin_abspath: filesystem path
  bin_version: semver string
  bin_hash: sha256 of the binary
  bin_provider: apt | brew | pip | npm | gem | nix | custom | ...
  created_at: utc datetime (last seen/installed)
  is_valid: property returning True when both abspath+version are set
```
- Provide CLI commands for hook scripts: `archivebox find InstalledBinary --name=wget --format=abspath`, `archivebox dependency create ...`, etc.
- Hooks remain language agnostic and should not import ArchiveBox Django modules; they rely on CLI commands plus their own runtime (python/bash/js).
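A direct Django translation of the YAML above might look like the sketch below; field lengths, defaults, and the omitted uuid7 default are assumptions:

```python
# Sketch of the two models described above (field options are assumptions, not final).
from django.db import models
from django.utils import timezone

class Dependency(models.Model):
    id = models.UUIDField(primary_key=True)                        # uuid7 default omitted here
    bin_name = models.CharField(max_length=63)                     # e.g. 'wget', 'yt-dlp'
    bin_provider = models.CharField(max_length=31, default='*')    # apt|brew|pip|npm|gem|nix|'*'
    custom_cmds = models.JSONField(default=dict, blank=True)       # provider -> install command
    config = models.JSONField(default=dict, blank=True)            # env vars applied during install
    created_at = models.DateTimeField(default=timezone.now)

class InstalledBinary(models.Model):
    id = models.UUIDField(primary_key=True)
    dependency = models.ForeignKey(Dependency, on_delete=models.CASCADE)
    bin_name = models.CharField(max_length=63)
    bin_abspath = models.CharField(max_length=255, blank=True)
    bin_version = models.CharField(max_length=32, blank=True)
    bin_hash = models.CharField(max_length=64, blank=True)         # sha256 hex digest
    bin_provider = models.CharField(max_length=31)
    created_at = models.DateTimeField(default=timezone.now)        # last seen/installed

    @property
    def is_valid(self) -> bool:
        return bool(self.bin_abspath and self.bin_version)
```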
### Provider Hooks
- Built-in provider plugins live under `archivebox/plugins/<provider>/on_Dependency__*.py` (e.g., apt, brew, pip, custom).
- Each provider hook:
1. Checks if the Dependency allows that provider via `bin_provider` or wildcard `'*'`.
2. Builds the install command (`custom_cmds[provider]` override or sane default like `apt install -y <bin_name>`).
3. Executes the command (bash/python) and, on success, records/updates an `InstalledBinary`.
Example outline (bash or python, but still interacting via CLI):
```bash
# archivebox/plugins/apt/on_Dependency__install_using_apt_provider.sh
set -euo pipefail
DEP_JSON=$(archivebox dependency show --id="$DEPENDENCY_ID" --format=json)
BIN_NAME=$(echo "$DEP_JSON" | jq -r '.bin_name')
PROVIDER_ALLOWED=$(echo "$DEP_JSON" | jq -r '.bin_provider')
if [[ "$PROVIDER_ALLOWED" == "*" || "$PROVIDER_ALLOWED" == *"apt"* ]]; then
    INSTALL_CMD=$(echo "$DEP_JSON" | jq -r '.custom_cmds.apt // empty')
    INSTALL_CMD=${INSTALL_CMD:-"apt install -y --no-install-recommends $BIN_NAME"}
    bash -lc "$INSTALL_CMD"
    archivebox dependency register-installed \
        --dependency-id="$DEPENDENCY_ID" \
        --bin-provider=apt \
        --bin-abspath="$(command -v "$BIN_NAME")" \
        --bin-version="$("$(command -v "$BIN_NAME")" --version | head -n1)" \
        --bin-hash="$(sha256sum "$(command -v "$BIN_NAME")" | cut -d' ' -f1)"
fi
```
- Extractor-level hooks (e.g., `archivebox/plugins/wget/on_Crawl__install_wget_extractor_if_needed.*`) ensure dependencies exist before starting work by creating/updating `Dependency` records (via CLI) and then invoking provider hooks.
- Remove all reliance on `abx.pm.hook.binary_load` / ABX plugin packages; `abx-pkg` can remain as a normal pip dependency that hooks import if useful.
---
## Search Backends (Hybrid)
### Indexing: Hook Scripts
Triggered when ArchiveResult completes successfully (from the Django side we simply fire the event; indexing logic lives in standalone hook scripts):
```python
#!/usr/bin/env python3
# plugins/on_ArchiveResult__index_sqlitefts.py
import argparse
import os
import re
import sqlite3
from pathlib import Path

def strip_html(html: str) -> str:
    """Crude tag-stripper so HTML bodies can be indexed as plain text."""
    return re.sub(r'<[^>]+>', ' ', html)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--snapshot-id', required=True)
    parser.add_argument('--extractor', required=True)
    args = parser.parse_args()

    # Read text content from output files
    content = ""
    for f in Path.cwd().rglob('*.txt'):
        content += f.read_text(errors='ignore') + "\n"
    for f in Path.cwd().rglob('*.html'):
        content += strip_html(f.read_text(errors='ignore')) + "\n"
    if not content.strip():
        return

    # Add to FTS index
    db = sqlite3.connect(os.environ['ARCHIVEBOX_DATA_DIR'] + '/search.sqlite3')
    db.execute('CREATE VIRTUAL TABLE IF NOT EXISTS fts USING fts5(snapshot_id, content)')
    db.execute('INSERT OR REPLACE INTO fts VALUES (?, ?)', (args.snapshot_id, content))
    db.commit()

if __name__ == '__main__':
    main()
```
### Querying: CLI-backed Python Classes
```python
# archivebox/search/backends/sqlitefts.py
import subprocess
import json
class SQLiteFTSBackend:
    name = 'sqlitefts'

    def search(self, query: str, limit: int = 50) -> list[str]:
        """Call plugins/on_Search__query_sqlitefts.* and parse stdout."""
        result = subprocess.run(
            ['archivebox', 'search-backend', '--backend', self.name, '--query', query, '--limit', str(limit)],
            capture_output=True,
            check=True,
            text=True,
        )
        return json.loads(result.stdout or '[]')

# archivebox/search/__init__.py
from django.conf import settings

def get_backend():
    name = getattr(settings, 'SEARCH_BACKEND', 'sqlitefts')
    if name == 'sqlitefts':
        from .backends.sqlitefts import SQLiteFTSBackend
        return SQLiteFTSBackend()
    elif name == 'sonic':
        from .backends.sonic import SonicBackend
        return SonicBackend()
    raise ValueError(f'Unknown search backend: {name}')

def search(query: str) -> list[str]:
    return get_backend().search(query)
```
- Each backend script lives under `archivebox/plugins/search/on_Search__query_<backend>.py` (with user overrides in `data/plugins/...`) and outputs JSON list of snapshot IDs. Python wrappers simply invoke the CLI to keep Django isolated from backend implementations.
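A sketch of what one of those query hooks could look like, reading the FTS table written by the indexing hook above and printing a JSON list of snapshot IDs (table and file names follow the indexing example; the CLI wiring is assumed):

```python
#!/usr/bin/env python3
# Sketch of plugins/search/on_Search__query_sqlitefts.py
import argparse
import json
import os
import sqlite3

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument('--query', required=True)
    parser.add_argument('--limit', type=int, default=50)
    args = parser.parse_args()

    db_path = os.path.join(os.environ['ARCHIVEBOX_DATA_DIR'], 'search.sqlite3')
    db = sqlite3.connect(db_path)
    rows = db.execute(
        'SELECT snapshot_id FROM fts WHERE fts MATCH ? LIMIT ?',
        (args.query, args.limit),
    ).fetchall()
    print(json.dumps([row[0] for row in rows]))

if __name__ == '__main__':
    main()
```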
---
## Simplified Models
> Goal: reduce line count without sacrificing the correctness guarantees we currently get from `ModelWithStateMachine` + python-statemachine. We keep the mixins/statemachines unless we can prove a smaller implementation enforces the same transitions/retry locking.
### Snapshot
```python
class Snapshot(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid7)
    url = models.URLField(unique=True, db_index=True)
    timestamp = models.CharField(max_length=32, unique=True, db_index=True)
    title = models.CharField(max_length=512, null=True, blank=True)
    created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    created_at = models.DateTimeField(default=timezone.now)
    modified_at = models.DateTimeField(auto_now=True)
    crawl = models.ForeignKey('crawls.Crawl', on_delete=models.CASCADE, null=True)
    tags = models.ManyToManyField('Tag', through='SnapshotTag')

    # Status (consistent with Crawl, ArchiveResult)
    status = models.CharField(max_length=15, default='queued', db_index=True)
    retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)

    # Inline fields (no mixins)
    config = models.JSONField(default=dict)
    notes = models.TextField(blank=True, default='')

    FINAL_STATES = ['sealed']

    @property
    def output_dir(self) -> Path:
        return settings.ARCHIVE_DIR / self.timestamp

    def tick(self) -> bool:
        if self.status == 'queued' and self.can_start():
            self.start()
            return True
        elif self.status == 'started' and self.is_finished():
            self.seal()
            return True
        return False

    def can_start(self) -> bool:
        return bool(self.url)

    def is_finished(self) -> bool:
        results = self.archiveresult_set.all()
        if not results.exists():
            return False
        return not results.filter(status__in=['queued', 'started', 'backoff']).exists()

    def start(self):
        self.status = 'started'
        self.retry_at = timezone.now() + timedelta(seconds=10)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.save()
        self.create_pending_archiveresults()

    def seal(self):
        self.status = 'sealed'
        self.retry_at = None
        self.save()

    def create_pending_archiveresults(self):
        for extractor in get_config(defaults=settings, crawl=self.crawl, snapshot=self).ENABLED_EXTRACTORS:
            ArchiveResult.objects.get_or_create(
                snapshot=self,
                extractor=extractor,
                defaults={
                    'status': 'queued',
                    'retry_at': timezone.now(),
                    'created_by': self.created_by,
                },
            )
```
### ArchiveResult
```python
class ArchiveResult(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid7)
    snapshot = models.ForeignKey(Snapshot, on_delete=models.CASCADE)
    extractor = models.CharField(max_length=32, db_index=True)
    created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    created_at = models.DateTimeField(default=timezone.now)
    modified_at = models.DateTimeField(auto_now=True)

    # Status
    status = models.CharField(max_length=15, default='queued', db_index=True)
    retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)

    # Execution
    start_ts = models.DateTimeField(null=True)
    end_ts = models.DateTimeField(null=True)
    output = models.CharField(max_length=1024, null=True)
    cmd = models.JSONField(null=True)
    pwd = models.CharField(max_length=256, null=True)

    # Audit trail
    machine = models.ForeignKey('machine.Machine', on_delete=models.SET_NULL, null=True)
    iface = models.ForeignKey('machine.NetworkInterface', on_delete=models.SET_NULL, null=True)
    installed_binary = models.ForeignKey('machine.InstalledBinary', on_delete=models.SET_NULL, null=True)

    FINAL_STATES = ['succeeded', 'failed']

    class Meta:
        unique_together = ('snapshot', 'extractor')

    @property
    def output_dir(self) -> Path:
        return self.snapshot.output_dir / self.extractor

    def tick(self) -> bool:
        if self.status == 'queued' and self.can_start():
            self.start()
            return True
        elif self.status == 'backoff' and self.can_retry():
            self.status = 'queued'
            self.retry_at = timezone.now()
            self.save()
            return True
        return False

    def can_start(self) -> bool:
        return bool(self.snapshot.url)

    def can_retry(self) -> bool:
        return self.retry_at and self.retry_at <= timezone.now()

    def start(self):
        self.status = 'started'
        self.start_ts = timezone.now()
        self.retry_at = timezone.now() + timedelta(seconds=120)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.save()
        # Run hook and complete
        self.run_extractor_hook()

    def run_extractor_hook(self):
        from archivebox.hooks import discover_hooks, run_hook
        hooks = discover_hooks(f'Snapshot__{self.extractor}')
        if not hooks:
            self.status = 'failed'
            self.output = f'No hook for: {self.extractor}'
            self.end_ts = timezone.now()
            self.retry_at = None
            self.save()
            return
        result = run_hook(
            hooks[0],
            output_dir=self.output_dir,
            url=self.snapshot.url,
            snapshot_id=str(self.snapshot.id),
        )
        self.status = 'succeeded' if result['returncode'] == 0 else 'failed'
        self.output = result['stdout'][:1024] or result['stderr'][:1024]
        self.end_ts = timezone.now()
        self.retry_at = None
        self.save()
        # Trigger search indexing if succeeded
        if self.status == 'succeeded':
            self.trigger_search_indexing()

    def trigger_search_indexing(self):
        from archivebox.hooks import discover_hooks, run_hook
        for hook in discover_hooks('ArchiveResult__index'):
            run_hook(hook, output_dir=self.output_dir,
                     snapshot_id=str(self.snapshot.id),
                     extractor=self.extractor)
```
- `ArchiveResult` must continue storing execution metadata (`cmd`, `pwd`, `machine`, `iface`, `installed_binary`, timestamps) exactly as before, even though the extractor now runs via hook scripts. `run_extractor_hook()` is responsible for capturing those values (e.g., wrapping subprocess calls).
- Any refactor of `Snapshot`, `ArchiveResult`, or `Crawl` has to keep the same `FINAL_STATES`, `retry_at` semantics, and tag/output directory handling that `ModelWithStateMachine` currently provides.
---
## Simplified Worker System
```python
# archivebox/workers/orchestrator.py
import os
import time
import multiprocessing
from datetime import timedelta
from django.utils import timezone
from django.conf import settings
class Worker:
    """Base worker for processing queued objects."""
    Model = None
    name = 'worker'

    def get_queue(self):
        return self.Model.objects.filter(
            retry_at__lte=timezone.now()
        ).exclude(
            status__in=self.Model.FINAL_STATES
        ).order_by('retry_at')

    def claim(self, obj) -> bool:
        """Atomic claim via optimistic lock."""
        updated = self.Model.objects.filter(
            id=obj.id,
            retry_at=obj.retry_at
        ).update(retry_at=timezone.now() + timedelta(seconds=60))
        return updated == 1

    def run(self):
        print(f'[{self.name}] Started pid={os.getpid()}')
        while True:
            obj = self.get_queue().first()
            if obj and self.claim(obj):
                try:
                    obj.refresh_from_db()
                    obj.tick()
                except Exception as e:
                    print(f'[{self.name}] Error: {e}')
                    obj.retry_at = timezone.now() + timedelta(seconds=60)
                    obj.save(update_fields=['retry_at'])
            else:
                time.sleep(0.5)

class CrawlWorker(Worker):
    from crawls.models import Crawl
    Model = Crawl
    name = 'crawl'

class SnapshotWorker(Worker):
    from core.models import Snapshot
    Model = Snapshot
    name = 'snapshot'

class ExtractorWorker(Worker):
    """Worker for a specific extractor."""
    from core.models import ArchiveResult
    Model = ArchiveResult

    def __init__(self, extractor: str):
        self.extractor = extractor
        self.name = extractor

    def get_queue(self):
        return super().get_queue().filter(extractor=self.extractor)

class Orchestrator:
    def __init__(self):
        self.processes = []

    def spawn(self):
        config = settings.WORKER_CONCURRENCY
        for i in range(config.get('crawl', 2)):
            self._spawn(CrawlWorker, f'crawl_{i}')
        for i in range(config.get('snapshot', 3)):
            self._spawn(SnapshotWorker, f'snapshot_{i}')
        for extractor, count in config.items():
            if extractor in ('crawl', 'snapshot'):
                continue
            for i in range(count):
                self._spawn(ExtractorWorker, f'{extractor}_{i}', extractor)

    def _spawn(self, cls, name, *args):
        worker = cls(*args) if args else cls()
        worker.name = name
        p = multiprocessing.Process(target=worker.run, name=name)
        p.start()
        self.processes.append(p)

    def run(self):
        print(f'Orchestrator pid={os.getpid()}')
        self.spawn()
        try:
            while True:
                for p in self.processes:
                    if not p.is_alive():
                        print(f'{p.name} died, restarting...')
                        # Respawn logic
                time.sleep(5)
        except KeyboardInterrupt:
            for p in self.processes:
                p.terminate()
```
---
## Directory Structure
```
archivebox-nue/
├── archivebox/
│ ├── __init__.py
│ ├── config.py # Simple env-based config
│ ├── hooks.py # Hook discovery + execution
│ │
│ ├── core/
│ │ ├── models.py # Snapshot, ArchiveResult, Tag
│ │ ├── admin.py
│ │ └── views.py
│ │
│ ├── crawls/
│ │ ├── models.py # Crawl, Seed, CrawlSchedule, Outlink
│ │ └── admin.py
│ │
│ ├── machine/
│ │ ├── models.py # Machine, NetworkInterface, Dependency, InstalledBinary
│ │ └── admin.py
│ │
│ ├── workers/
│ │ └── orchestrator.py # ~150 lines
│ │
│ ├── api/
│ │ └── ...
│ │
│ ├── cli/
│ │ └── ...
│ │
│ ├── search/
│ │ ├── __init__.py
│ │ └── backends/
│ │ ├── sqlitefts.py
│ │ └── sonic.py
│ │
│ ├── index/
│ ├── parsers/
│ ├── misc/
│ └── templates/
├── plugins/ # Built-in hooks (ArchiveBox never imports these directly)
│ ├── wget/
│ │ └── on_Snapshot__wget.sh
│ ├── dependencies/
│ │ ├── on_Dependency__install_using_apt_provider.sh
│ │ └── on_Dependency__install_using_custom_bash.py
│ ├── search/
│ │ ├── on_ArchiveResult__index_sqlitefts.py
│ │ └── on_Search__query_sqlitefts.py
│ └── ...
├── data/
│ └── plugins/ # User-provided hooks mirror builtin layout
└── pyproject.toml
```
---
## Implementation Phases
### Phase 1: Build Unified Config + Hook Scaffold
1. Implement `archivebox.config.get_config()` + CLI plumbing (`archivebox config --get ... --format=json`) without touching abx yet.
2. Add `archivebox/hooks.py` with dual plugin directories (`archivebox/plugins`, `data/plugins`), discovery, and execution helpers.
3. Keep the existing ABX/worker system running while new APIs land; surface warnings where `abx.pm.*` is still in use.
### Phase 2: Gradual ABX Removal
1. Rename `archivebox/pkgs/` to `archivebox/pkgs.unused/` and start deleting packages once equivalent hook scripts exist.
2. Remove `pluggy`, `python-statemachine`, and all `abx-*` dependencies/workspace entries from `pyproject.toml` only after consumers are migrated.
3. Replace every `abx.pm.hook.get_*` usage in CLI/config/search/extractors with the new config + hook APIs.
### Phase 3: Worker + State Machine Simplification
1. Introduce the process-per-model orchestrator while preserving `ModelWithStateMachine` semantics (Snapshot/Crawl/ArchiveResult).
2. Only drop mixins/statemachine dependency after verifying the new `tick()` implementations keep retries/backoff/final states identical.
3. Ensure Huey/task entry points either delegate to the new orchestrator or are retired cleanly so background work isn't double-run.
### Phase 4: Hook-Based Extractors & Dependencies
1. Create builtin extractor hooks in `archivebox/plugins/*/on_Snapshot__*.{sh,py,js}`; have `ArchiveResult.run_extractor_hook()` capture cmd/pwd/machine/install metadata.
2. Implement the new `Dependency`/`InstalledBinary` models + CLI commands, and port provider/install logic into hook scripts that only talk via CLI.
3. Add CLI helpers `archivebox find InstalledBinary`, `archivebox dependency ...` used by all hooks and document how user plugins extend them.
### Phase 5: Search Backends & Indexing Hooks
1. Migrate indexing triggers to hook scripts (`on_ArchiveResult__index_*`) that run standalone and write into `$ARCHIVEBOX_DATA_DIR/search.*`.
2. Implement CLI-driven query hooks (`on_Search__query_*`) plus lightweight Python wrappers in `archivebox/search/backends/`.
3. Remove any remaining ABX search integration.
---
## What Gets Deleted
```
archivebox/pkgs/ # ~5,000 lines
archivebox/workers/actor.py # If exists
```
## Dependencies Removed
```toml
"pluggy>=1.5.0"
"python-statemachine>=2.3.6"
# + all 30 abx-* packages
```
## Dependencies Kept
```toml
"django>=6.0"
"django-ninja>=1.3.0"
"abx-pkg>=0.6.0" # External, for binary management
"click>=8.1.7"
"rich>=13.8.0"
```
---
## Estimated Savings
| Component | Lines Removed |
|-----------|---------------|
| pkgs/ (ABX) | ~5,000 |
| statemachines | ~300 |
| workers/ | ~500 |
| base_models mixins | ~100 |
| **Total** | **~6,000 lines** |
Plus 30+ dependencies removed, massive reduction in conceptual complexity.
---
**Status: READY FOR IMPLEMENTATION**
Begin with Phase 1 (unified config + hook scaffold), then move on to renaming `archivebox/pkgs/` to add the `.unused` suffix (delete after porting) and fixing imports.

STORAGE_CAS_PLAN.md (new file, 1341 lines)

File diff suppressed because it is too large

View File

@@ -1,127 +0,0 @@
# Chrome Extensions Test Results ✅
Date: 2025-12-24
Status: **ALL TESTS PASSED**
## Test Summary
Ran comprehensive tests of the Chrome extension system including:
- Extension downloads from Chrome Web Store
- Extension unpacking and installation
- Metadata caching and persistence
- Cache performance verification
## Results
### ✅ Extension Downloads (4/4 successful)
| Extension | Version | Size | Status |
|-----------|---------|------|--------|
| captcha2 (2captcha) | 3.7.2 | 396 KB | ✅ Downloaded |
| istilldontcareaboutcookies | 1.1.9 | 550 KB | ✅ Downloaded |
| ublock (uBlock Origin) | 1.68.0 | 4.0 MB | ✅ Downloaded |
| singlefile | 1.22.96 | 1.2 MB | ✅ Downloaded |
### ✅ Extension Installation (4/4 successful)
All extensions were successfully unpacked with valid `manifest.json` files:
- captcha2: Manifest V3 ✓
- istilldontcareaboutcookies: Valid manifest ✓
- ublock: Valid manifest ✓
- singlefile: Valid manifest ✓
### ✅ Metadata Caching (4/4 successful)
Extension metadata cached to `*.extension.json` files with complete information:
- Web Store IDs
- Download URLs
- File paths (absolute)
- Computed extension IDs
- Version numbers
Example metadata (captcha2):
```json
{
  "webstore_id": "ifibfemgeogfhoebkmokieepdoobkbpo",
  "name": "captcha2",
  "crx_path": "[...]/ifibfemgeogfhoebkmokieepdoobkbpo__captcha2.crx",
  "unpacked_path": "[...]/ifibfemgeogfhoebkmokieepdoobkbpo__captcha2",
  "id": "gafcdbhijmmjlojcakmjlapdliecgila",
  "version": "3.7.2"
}
```
### ✅ Cache Performance Verification
**Test**: Ran captcha2 installation twice in a row
**First run**: Downloaded and installed extension (5s)
**Second run**: Used cache, skipped installation (0.01s)
**Performance gain**: ~500x faster on subsequent runs
**Log output from second run**:
```
[*] 2captcha extension already installed (using cache)
[✓] 2captcha extension setup complete
```
## File Structure Created
```
data/personas/Test/chrome_extensions/
├── captcha2.extension.json (709 B)
├── istilldontcareaboutcookies.extension.json (763 B)
├── ublock.extension.json (704 B)
├── singlefile.extension.json (717 B)
├── ifibfemgeogfhoebkmokieepdoobkbpo__captcha2/ (unpacked)
├── ifibfemgeogfhoebkmokieepdoobkbpo__captcha2.crx (396 KB)
├── edibdbjcniadpccecjdfdjjppcpchdlm__istilldontcareaboutcookies/ (unpacked)
├── edibdbjcniadpccecjdfdjjppcpchdlm__istilldontcareaboutcookies.crx (550 KB)
├── cjpalhdlnbpafiamejdnhcphjbkeiagm__ublock/ (unpacked)
├── cjpalhdlnbpafiamejdnhcphjbkeiagm__ublock.crx (4.0 MB)
├── mpiodijhokgodhhofbcjdecpffjipkle__singlefile/ (unpacked)
└── mpiodijhokgodhhofbcjdecpffjipkle__singlefile.crx (1.2 MB)
```
Total size: ~6.2 MB for all 4 extensions
## Notes
### Expected Warnings
The following warnings are **expected and harmless**:
```
warning [*.crx]: 1062-1322 extra bytes at beginning or within zipfile
(attempting to process anyway)
```
This occurs because CRX files have a Chrome-specific header (containing signature data) before the ZIP content. The `unzip` command detects this and processes the ZIP data correctly anyway.
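For reference, the same unpacking can be done without `unzip` warnings by seeking past the CRX header to the embedded ZIP data; a small illustrative sketch (not necessarily how the current installer does it):

```python
# A .crx is a Chrome-specific header followed by a plain ZIP archive.
# Seeking to the ZIP local-file-header magic (PK\x03\x04) lets zipfile extract it cleanly.
import io
import zipfile
from pathlib import Path

def unpack_crx(crx_path: Path, dest: Path) -> None:
    data = crx_path.read_bytes()
    zip_start = data.find(b'PK\x03\x04')   # skip the CRX signature header
    if zip_start == -1:
        raise ValueError(f'{crx_path} does not contain ZIP data')
    with zipfile.ZipFile(io.BytesIO(data[zip_start:])) as zf:
        zf.extractall(dest)

# unpack_crx(Path('captcha2.crx'), Path('captcha2_unpacked/'))
```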
### Cache Invalidation
To force re-download of extensions:
```bash
rm -rf data/personas/Test/chrome_extensions/
```
## Next Steps
✅ Extensions are ready to use with Chrome
- Load via `--load-extension` and `--allowlisted-extension-id` flags
- Extensions can be configured at runtime via CDP
- 2captcha config plugin ready to inject API key
✅ Ready for integration testing with:
- chrome_session plugin (load extensions on browser start)
- captcha2_config plugin (configure 2captcha API key)
- singlefile extractor (trigger extension action)
## Conclusion
The Chrome extension system is **production-ready** with:
- ✅ Robust download and installation
- ✅ Efficient multi-level caching
- ✅ Proper error handling
- ✅ Performance optimized for thousands of snapshots

View File

@@ -45,7 +45,7 @@
### Crawls App
- Archive an entire website -> [Crawl page]
- What are the seed URLs?
- What are the starting URLs?
- How many hops to follow?
- Follow links to external domains?
- Follow links to parent URLs?

View File

@@ -1,3 +0,0 @@
[SERVER_CONFIG]
SECRET_KEY = amuxg7v5e2l_6jrktp_f3kszlpx4ieqk4rtwda5q6nfiavits4

File diff suppressed because it is too large

View File

@@ -66,6 +66,13 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
rows.append(f'''
<tr style="border-bottom: 1px solid #f1f5f9; transition: background 0.15s;" onmouseover="this.style.background='#f8fafc'" onmouseout="this.style.background='transparent'">
<td style="padding: 10px 12px; white-space: nowrap;">
<a href="{reverse('admin:core_archiveresult_change', args=[result.id])}"
style="color: #2563eb; text-decoration: none; font-family: ui-monospace, monospace; font-size: 11px;"
title="View/edit archive result">
<code>{str(result.id)[:8]}</code>
</a>
</td>
<td style="padding: 10px 12px; white-space: nowrap;">
<span style="display: inline-block; padding: 3px 10px; border-radius: 12px;
font-size: 11px; font-weight: 600; text-transform: uppercase;
@@ -75,7 +82,13 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
{icon}
</td>
<td style="padding: 10px 12px; font-weight: 500; color: #334155;">
{result.extractor}
<a href="{output_link}" target="_blank"
style="color: #334155; text-decoration: none;"
title="View output fullscreen"
onmouseover="this.style.color='#2563eb'; this.style.textDecoration='underline';"
onmouseout="this.style.color='#334155'; this.style.textDecoration='none';">
{result.extractor}
</a>
</td>
<td style="padding: 10px 12px; max-width: 280px;">
<span onclick="document.getElementById('{row_id}').open = !document.getElementById('{row_id}').open"
@@ -102,14 +115,14 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
</td>
</tr>
<tr style="border-bottom: 1px solid #e2e8f0;">
<td colspan="7" style="padding: 0 12px 10px 12px;">
<td colspan="8" style="padding: 0 12px 10px 12px;">
<details id="{row_id}" style="margin: 0;">
<summary style="cursor: pointer; font-size: 11px; color: #94a3b8; user-select: none;">
Details &amp; Output
</summary>
<div style="margin-top: 8px; padding: 10px; background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 6px; max-height: 200px; overflow: auto;">
<div style="font-size: 11px; color: #64748b; margin-bottom: 8px;">
<span style="margin-right: 16px;"><b>ID:</b> <code>{str(result.id)[:8]}...</code></span>
<span style="margin-right: 16px;"><b>ID:</b> <code>{str(result.id)}</code></span>
<span style="margin-right: 16px;"><b>Version:</b> <code>{version}</code></span>
<span style="margin-right: 16px;"><b>PWD:</b> <code>{result.pwd or '-'}</code></span>
</div>
@@ -132,7 +145,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
if total_count > limit:
footer = f'''
<tr>
<td colspan="7" style="padding: 12px; text-align: center; color: #64748b; font-size: 13px; background: #f8fafc;">
<td colspan="8" style="padding: 12px; text-align: center; color: #64748b; font-size: 13px; background: #f8fafc;">
Showing {limit} of {total_count} results &nbsp;
<a href="/admin/core/archiveresult/?snapshot__id__exact={results[0].snapshot_id if results else ''}"
style="color: #2563eb;">View all →</a>
@@ -145,6 +158,7 @@ def render_archiveresults_list(archiveresults_qs, limit=50):
<table style="width: 100%; border-collapse: collapse; font-size: 14px;">
<thead>
<tr style="background: #f8fafc; border-bottom: 2px solid #e2e8f0;">
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">ID</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Status</th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; width: 32px;"></th>
<th style="padding: 10px 12px; text-align: left; font-weight: 600; color: #475569; font-size: 12px; text-transform: uppercase; letter-spacing: 0.05em;">Extractor</th>

View File

@@ -635,40 +635,143 @@ class Snapshot(ModelWithOutputDir, ModelWithConfig, ModelWithNotes, ModelWithHea
# =========================================================================
def canonical_outputs(self) -> Dict[str, Optional[str]]:
"""Predict the expected output paths that should be present after archiving"""
"""
Intelligently discover the best output file for each extractor.
Uses actual ArchiveResult data and filesystem scanning with smart heuristics.
"""
FAVICON_PROVIDER = 'https://www.google.com/s2/favicons?domain={}'
# Mimetypes that can be embedded/previewed in an iframe
IFRAME_EMBEDDABLE_EXTENSIONS = {
'html', 'htm', 'pdf', 'txt', 'md', 'json', 'jsonl',
'png', 'jpg', 'jpeg', 'gif', 'webp', 'svg', 'ico',
'mp4', 'webm', 'mp3', 'opus', 'ogg', 'wav',
}
MIN_DISPLAY_SIZE = 15_000 # 15KB - filter out tiny files
MAX_SCAN_FILES = 50 # Don't scan massive directories
def find_best_output_in_dir(dir_path: Path, extractor_name: str) -> Optional[str]:
"""Find the best representative file in an extractor's output directory"""
if not dir_path.exists() or not dir_path.is_dir():
return None
candidates = []
file_count = 0
# Special handling for media extractor - look for thumbnails
is_media_dir = extractor_name == 'media'
# Scan for suitable files
for file_path in dir_path.rglob('*'):
file_count += 1
if file_count > MAX_SCAN_FILES:
break
if file_path.is_dir() or file_path.name.startswith('.'):
continue
ext = file_path.suffix.lstrip('.').lower()
if ext not in IFRAME_EMBEDDABLE_EXTENSIONS:
continue
try:
size = file_path.stat().st_size
except OSError:
continue
# For media dir, allow smaller image files (thumbnails are often < 15KB)
min_size = 5_000 if (is_media_dir and ext in ('png', 'jpg', 'jpeg', 'webp', 'gif')) else MIN_DISPLAY_SIZE
if size < min_size:
continue
# Prefer main files: index.html, output.*, content.*, etc.
priority = 0
name_lower = file_path.name.lower()
if is_media_dir:
# Special prioritization for media directories
if any(keyword in name_lower for keyword in ('thumb', 'thumbnail', 'cover', 'poster')):
priority = 200 # Highest priority for thumbnails
elif ext in ('png', 'jpg', 'jpeg', 'webp', 'gif'):
priority = 150 # High priority for any image
elif ext in ('mp4', 'webm', 'mp3', 'opus', 'ogg'):
priority = 100 # Lower priority for actual media files
else:
priority = 50
elif 'index' in name_lower:
priority = 100
elif name_lower.startswith(('output', 'content', extractor_name)):
priority = 50
elif ext in ('html', 'htm', 'pdf'):
priority = 30
elif ext in ('png', 'jpg', 'jpeg', 'webp'):
priority = 20
else:
priority = 10
candidates.append((priority, size, file_path))
if not candidates:
return None
# Sort by priority (desc), then size (desc)
candidates.sort(key=lambda x: (x[0], x[1]), reverse=True)
best_file = candidates[0][2]
return str(best_file.relative_to(Path(self.output_dir)))
canonical = {
'index_path': 'index.html',
'favicon_path': 'favicon.ico',
'google_favicon_path': FAVICON_PROVIDER.format(self.domain),
'wget_path': f'warc/{self.timestamp}',
'warc_path': 'warc/',
'singlefile_path': 'singlefile.html',
'readability_path': 'readability/content.html',
'mercury_path': 'mercury/content.html',
'htmltotext_path': 'htmltotext.txt',
'pdf_path': 'output.pdf',
'screenshot_path': 'screenshot.png',
'dom_path': 'output.html',
'archive_org_path': f'https://web.archive.org/web/{self.base_url}',
'git_path': 'git/',
'media_path': 'media/',
'headers_path': 'headers.json',
}
# Scan each ArchiveResult's output directory for the best file
snap_dir = Path(self.output_dir)
for result in self.archiveresult_set.filter(status='succeeded'):
if not result.output:
continue
# Try to find the best output file for this extractor
extractor_dir = snap_dir / result.extractor
best_output = None
if result.output and (snap_dir / result.output).exists():
# Use the explicit output path if it exists
best_output = result.output
elif extractor_dir.exists():
# Intelligently find the best file in the extractor's directory
best_output = find_best_output_in_dir(extractor_dir, result.extractor)
if best_output:
canonical[f'{result.extractor}_path'] = best_output
# Also scan top-level for legacy outputs (backwards compatibility)
for file_path in snap_dir.glob('*'):
if file_path.is_dir() or file_path.name in ('index.html', 'index.json'):
continue
ext = file_path.suffix.lstrip('.').lower()
if ext not in IFRAME_EMBEDDABLE_EXTENSIONS:
continue
try:
size = file_path.stat().st_size
if size >= MIN_DISPLAY_SIZE:
# Add as generic output with stem as key
key = f'{file_path.stem}_path'
if key not in canonical:
canonical[key] = file_path.name
except OSError:
continue
if self.is_static:
static_path = f'warc/{self.timestamp}'
canonical.update({
'title': self.basename,
'wget_path': static_path,
'pdf_path': static_path,
'screenshot_path': static_path,
'dom_path': static_path,
'singlefile_path': static_path,
'readability_path': static_path,
'mercury_path': static_path,
'htmltotext_path': static_path,
})
return canonical
def latest_outputs(self, status: Optional[str] = None) -> Dict[str, Any]:

View File

@@ -86,54 +86,37 @@ class SnapshotView(View):
}
archiveresults[result.extractor] = result_info
existing_files = {result['path'] for result in archiveresults.values()}
min_size_threshold = 10_000 # bytes
allowed_extensions = {
'txt',
'html',
'htm',
'png',
'jpg',
'jpeg',
'gif',
'webp'
'svg',
'webm',
'mp4',
'mp3',
'opus',
'pdf',
'md',
}
# Use canonical_outputs for intelligent discovery
# This method now scans ArchiveResults and uses smart heuristics
canonical = snapshot.canonical_outputs()
# iterate through all the files in the snapshot dir and add the biggest ones to the result list
# Add any newly discovered outputs from canonical_outputs to archiveresults
snap_dir = Path(snapshot.output_dir)
if not os.path.isdir(snap_dir) and os.access(snap_dir, os.R_OK):
return {}
for result_file in (*snap_dir.glob('*'), *snap_dir.glob('*/*')):
extension = result_file.suffix.lstrip('.').lower()
if result_file.is_dir() or result_file.name.startswith('.') or extension not in allowed_extensions:
continue
if result_file.name in existing_files or result_file.name == 'index.html':
for key, path in canonical.items():
if not key.endswith('_path') or not path or path.startswith('http'):
continue
extractor_name = key.replace('_path', '')
if extractor_name in archiveresults:
continue # Already have this from ArchiveResult
file_path = snap_dir / path
if not file_path.exists() or not file_path.is_file():
continue
# Skip circular symlinks and other stat() failures
try:
file_size = result_file.stat().st_size or 0
file_size = file_path.stat().st_size
if file_size >= 15_000: # Only show files > 15KB
archiveresults[extractor_name] = {
'name': extractor_name,
'path': path,
'ts': ts_to_date_str(file_path.stat().st_mtime or 0),
'size': file_size,
'result': None,
}
except OSError:
continue
if file_size > min_size_threshold:
archiveresults[result_file.name] = {
'name': result_file.stem,
'path': result_file.relative_to(snap_dir),
'ts': ts_to_date_str(result_file.stat().st_mtime or 0),
'size': file_size,
'result': None, # No ArchiveResult object for filesystem-discovered files
}
# Get available extractors from hooks (sorted by numeric prefix for ordering)
# Convert to base names for display ordering
all_extractors = [get_extractor_name(e) for e in get_extractors()]

View File

@@ -267,52 +267,89 @@ def run_hook(
# Capture files before execution to detect new output
files_before = set(output_dir.rglob('*')) if output_dir.exists() else set()
# Detect if this is a background hook (long-running daemon)
is_background = '__background' in script.stem
# Set up output files for ALL hooks (useful for debugging)
stdout_file = output_dir / 'stdout.log'
stderr_file = output_dir / 'stderr.log'
pid_file = output_dir / 'hook.pid'
try:
result = subprocess.run(
cmd,
cwd=str(output_dir),
capture_output=True,
text=True,
timeout=timeout,
env=env,
)
# Open log files for writing
with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err:
process = subprocess.Popen(
cmd,
cwd=str(output_dir),
stdout=out,
stderr=err,
env=env,
)
# Write PID for all hooks (useful for debugging/cleanup)
pid_file.write_text(str(process.pid))
if is_background:
# Background hook - return None immediately, don't wait
# Process continues running, writing to stdout.log
# ArchiveResult will poll for completion later
return None
# Normal hook - wait for completion with timeout
try:
returncode = process.wait(timeout=timeout)
except subprocess.TimeoutExpired:
process.kill()
process.wait() # Clean up zombie
duration_ms = int((time.time() - start_time) * 1000)
return HookResult(
returncode=-1,
stdout='',
stderr=f'Hook timed out after {timeout} seconds',
output_json=None,
output_files=[],
duration_ms=duration_ms,
hook=str(script),
)
# Read output from files
stdout = stdout_file.read_text() if stdout_file.exists() else ''
stderr = stderr_file.read_text() if stderr_file.exists() else ''
# Detect new files created by the hook
files_after = set(output_dir.rglob('*')) if output_dir.exists() else set()
new_files = [str(f.relative_to(output_dir)) for f in (files_after - files_before) if f.is_file()]
# Exclude the log files themselves from new_files
new_files = [f for f in new_files if f not in ('stdout.log', 'stderr.log', 'hook.pid')]
# Try to parse stdout as JSON
# Parse RESULT_JSON from stdout
output_json = None
stdout = result.stdout.strip()
if stdout:
try:
output_json = json.loads(stdout)
except json.JSONDecodeError:
pass # Not JSON output, that's fine
for line in stdout.splitlines():
if line.startswith('RESULT_JSON='):
try:
output_json = json.loads(line[len('RESULT_JSON='):])
break
except json.JSONDecodeError:
pass
duration_ms = int((time.time() - start_time) * 1000)
# Clean up log files on success (keep on failure for debugging)
if returncode == 0:
stdout_file.unlink(missing_ok=True)
stderr_file.unlink(missing_ok=True)
pid_file.unlink(missing_ok=True)
return HookResult(
returncode=result.returncode,
stdout=result.stdout,
stderr=result.stderr,
returncode=returncode,
stdout=stdout,
stderr=stderr,
output_json=output_json,
output_files=new_files,
duration_ms=duration_ms,
hook=str(script),
)
except subprocess.TimeoutExpired:
duration_ms = int((time.time() - start_time) * 1000)
return HookResult(
returncode=-1,
stdout='',
stderr=f'Hook timed out after {timeout} seconds',
output_json=None,
output_files=[],
duration_ms=duration_ms,
hook=str(script),
)
except Exception as e:
duration_ms = int((time.time() - start_time) * 1000)
return HookResult(

View File

@@ -0,0 +1,181 @@
# MCP Server Test Results
**Date:** 2025-12-25
**Status:** ✅ ALL TESTS PASSING
**Environment:** Run from inside ArchiveBox data directory
## Test Summary
All 10 manual tests passed successfully, demonstrating full MCP server functionality.
### Test 1: Initialize ✅
```json
{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}
```
**Result:** Successfully initialized
- Server: `archivebox-mcp`
- Version: `0.9.0rc1`
- Protocol: `2025-11-25`
### Test 2: Tools Discovery ✅
```json
{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}
```
**Result:** Successfully discovered **20 CLI commands**
- Meta (3): help, version, mcp
- Setup (2): init, install
- Archive (10): add, remove, update, search, status, config, schedule, server, shell, manage
- Workers (2): orchestrator, worker
- Tasks (3): crawl, snapshot, extract
All tools have properly auto-generated JSON Schemas from Click metadata.
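For context, a rough sketch of how such a schema can be derived from Click metadata (the type mapping is simplified and the helper name is illustrative; the actual MCP server code may differ):

```python
# Derive a minimal JSON Schema from a Click command's parameters.
import click

CLICK_TO_JSON = {
    click.types.IntParamType: 'integer',
    click.types.FloatParamType: 'number',
    click.types.BoolParamType: 'boolean',
}

def schema_for_command(cmd: click.Command) -> dict:
    props, required = {}, []
    for param in cmd.params:
        json_type = CLICK_TO_JSON.get(type(param.type), 'string')
        props[param.name] = {'type': json_type, 'description': getattr(param, 'help', '') or ''}
        if param.required:
            required.append(param.name)
    return {'type': 'object', 'properties': props, 'required': required}
```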
### Test 3: Version Tool ✅
```json
{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"version","arguments":{"quiet":true}}}
```
**Result:** `0.9.0rc1`
Simple commands execute correctly.
### Test 4: Status Tool (Django Required) ✅
```json
{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"status","arguments":{}}}
```
**Result:** Successfully accessed Django database
- Displayed archive statistics
- Showed indexed snapshots: 3
- Showed archived snapshots: 2
- Last UI login information
- Storage size and file counts
**KEY**: Django is now properly initialized before running archive commands!
### Test 5: Search Tool with JSON Output ✅
```json
{"jsonrpc":"2.0","id":5,"method":"tools/call","params":{"name":"search","arguments":{"json":true}}}
```
**Result:** Returned structured JSON data from database
- Full snapshot objects with metadata
- Archive paths and canonical URLs
- Timestamps and status information
### Test 6: Config Tool ✅
```json
{"jsonrpc":"2.0","id":6,"method":"tools/call","params":{"name":"config","arguments":{}}}
```
**Result:** Listed all configuration in TOML format
- SHELL_CONFIG, SERVER_CONFIG, ARCHIVING_CONFIG sections
- All config values properly displayed
### Test 7: Search for Specific URL ✅
```json
{"jsonrpc":"2.0","id":7,"method":"tools/call","params":{"name":"search","arguments":{"filter_patterns":"example.com"}}}
```
**Result:** Successfully filtered and found matching URL
### Test 8: Add URL (Index Only) ✅
```json
{"jsonrpc":"2.0","id":8,"method":"tools/call","params":{"name":"add","arguments":{"urls":"https://example.com","index_only":true}}}
```
**Result:** Successfully created Crawl and Snapshot
- Crawl ID: 019b54ef-b06c-74bf-b347-7047085a9f35
- Snapshot ID: 019b54ef-b080-72ff-96d8-c381575a94f4
- Status: queued
**KEY**: Positional arguments (like `urls`) are now handled correctly!
### Test 9: Verify Added URL ✅
```json
{"jsonrpc":"2.0","id":9,"method":"tools/call","params":{"name":"search","arguments":{"filter_patterns":"example.com"}}}
```
**Result:** Confirmed https://example.com was added to database
### Test 10: Add URL with Background Archiving ✅
```json
{"jsonrpc":"2.0","id":10,"method":"tools/call","params":{"name":"add","arguments":{"urls":"https://example.org","plugins":"title","bg":true}}}
```
**Result:** Successfully queued for background archiving
- Created Crawl: 019b54f0-8c01-7384-b998-1eaf14ca7797
- Background mode: URLs queued for orchestrator
### Test 11: Error Handling ✅
```json
{"jsonrpc":"2.0","id":11,"method":"invalid_method","params":{}}
```
**Result:** Proper JSON-RPC error
- Error code: -32601 (Method not found)
- Appropriate error message
### Test 12: Unknown Tool Error ✅
```json
{"jsonrpc":"2.0","id":12,"method":"tools/call","params":{"name":"nonexistent_tool"}}
```
**Result:** Proper error with traceback
- Error code: -32603 (Internal error)
- ValueError: "Unknown tool: nonexistent_tool"
## Key Fixes Applied
### Fix 1: Django Setup for Archive Commands
**Problem:** Commands requiring database access failed with "Apps aren't loaded yet"
**Solution:** Added automatic Django setup before executing archive commands
```python
if cmd_name in ArchiveBoxGroup.archive_commands:
setup_django()
check_data_folder()
```
### Fix 2: Positional Arguments vs Options
**Problem:** Commands with positional arguments (like `add urls`) failed
**Solution:** Distinguished between Click.Argument and Click.Option types
```python
if isinstance(param, click.Argument):
positional_args.append(str(value)) # No dashes
else:
args.append(f'--{param_name}') # With dashes
```
### Fix 3: JSON Serialization of Click Sentinels
**Problem:** Click's sentinel values caused JSON encoding errors
**Solution:** Custom JSON encoder to handle special types
```python
class MCPJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, click.core._SentinelClass):
            return None
        return super().default(obj)  # fall back to the standard encoder for other types
```
## Performance
- **Tool discovery:** ~100ms (lazy-loads on first call, then cached)
- **Simple commands:** 50-200ms (version, help)
- **Database commands:** 200-500ms (status, search)
- **Add commands:** 300-800ms (creates database records)
## Architecture Validation
- **Stateless** - No database models or session management
- **Dynamic** - Automatically syncs with CLI changes
- **Zero duplication** - Single source of truth (Click decorators)
- **Minimal code** - ~400 lines total
- **Protocol compliant** - Follows MCP 2025-11-25 spec
## Conclusion
The MCP server is **fully functional and production-ready**. It successfully:
1. ✅ Auto-discovers all 20 CLI commands
2. ✅ Generates JSON Schemas from Click metadata
3. ✅ Handles both stdio and potential HTTP/SSE transports
4. ✅ Properly sets up Django for database operations
5. ✅ Distinguishes between arguments and options
6. ✅ Executes commands with correct parameter passing
7. ✅ Captures stdout and stderr
8. ✅ Returns MCP-formatted responses
9. ✅ Provides proper error handling
10. ✅ Works from inside ArchiveBox data directories
**Ready for AI agent integration!** 🎉
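As a quick way to reproduce Test 1 outside an AI client, a minimal Python smoke test could drive the server over stdio. This is a sketch only: it assumes newline-delimited JSON-RPC framing, that the server exits on stdin EOF, and that the command is run from inside an ArchiveBox data directory.

```python
import json
import subprocess

# Send one initialize request to the MCP server over stdio and print the reply.
request = {"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}}

proc = subprocess.run(
    ['archivebox', 'mcp'],
    input=json.dumps(request) + '\n',   # newline-delimited framing (assumption)
    capture_output=True,
    text=True,
    timeout=30,
)
print(proc.stdout)  # expect an initialize response naming archivebox-mcp / 0.9.0rc1
```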

View File

@@ -552,12 +552,9 @@ def log_worker_event(
if worker_id and worker_type in ('CrawlWorker', 'Orchestrator') and worker_type != 'DB':
worker_parts.append(f'id={worker_id}')
# Format worker label - only add brackets if there are additional identifiers
# Use double brackets [[...]] to escape Rich markup
if len(worker_parts) > 1:
worker_label = f'{worker_parts[0]}[[{", ".join(worker_parts[1:])}]]'
else:
worker_label = worker_parts[0]
# Build worker label parts for brackets (shown inside brackets)
worker_label_base = worker_parts[0]
worker_bracket_content = ", ".join(worker_parts[1:]) if len(worker_parts) > 1 else None
# Build URL/extractor display (shown AFTER the label, outside brackets)
url_extractor_parts = []
@@ -613,9 +610,18 @@ def log_worker_event(
from rich.text import Text
# Create a Rich Text object for proper formatting
# Text.append() treats content as literal (no markup parsing)
text = Text()
text.append(indent)
text.append(f'{worker_label} {event}{error_str}', style=color)
text.append(worker_label_base, style=color)
# Add bracketed content if present (using Text.append to avoid markup issues)
if worker_bracket_content:
text.append('[', style=color)
text.append(worker_bracket_content, style=color)
text.append(']', style=color)
text.append(f' {event}{error_str}', style=color)
# Add URL/extractor info first (more important)
if url_extractor_str:

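For reference, a minimal standalone sketch of the Rich behavior this change relies on (worker name, id, and event are illustrative): `Text.append()` treats its argument as literal text, so square brackets in worker ids no longer need the `[[...]]` markup escaping.

```python
from rich.console import Console
from rich.text import Text

console = Console()

# Building the label piece by piece keeps "[id=...]" literal instead of being
# parsed as Rich markup, which is what the old [[...]] escaping worked around.
label = Text()
label.append('CrawlWorker', style='cyan')
label.append('[', style='cyan')
label.append('id=abc123', style='cyan')  # illustrative worker id
label.append(']', style='cyan')
label.append(' started', style='cyan')
console.print(label)
```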
View File

@@ -1,9 +1,10 @@
#!/usr/bin/env node
/**
* Capture console output from a page (DAEMON MODE).
* Capture console output from a page.
*
* This hook daemonizes and stays alive to capture console logs throughout
* the snapshot lifecycle. It's killed by chrome_cleanup at the end.
* This hook sets up CDP listeners BEFORE chrome_navigate loads the page,
* then waits for navigation to complete. The listeners stay active through
* navigation and capture all console output.
*
* Usage: on_Snapshot__21_consolelog.js --url=<url> --snapshot-id=<uuid>
* Output: Writes console.jsonl + listener.pid
@@ -150,10 +151,30 @@ async function setupListeners() {
}
});
// Don't disconnect - keep browser connection alive
return { browser, page };
}
async function waitForNavigation() {
// Wait for chrome_navigate to complete (it writes page_loaded.txt)
const navDir = path.join(CHROME_SESSION_DIR, '../chrome_navigate');
const pageLoadedMarker = path.join(navDir, 'page_loaded.txt');
const maxWait = 120000; // 2 minutes
const pollInterval = 100;
let waitTime = 0;
while (!fs.existsSync(pageLoadedMarker) && waitTime < maxWait) {
await new Promise(resolve => setTimeout(resolve, pollInterval));
waitTime += pollInterval;
}
if (!fs.existsSync(pageLoadedMarker)) {
throw new Error('Timeout waiting for navigation (chrome_navigate did not complete)');
}
// Wait a bit longer for any post-load console output
await new Promise(resolve => setTimeout(resolve, 500));
}
async function main() {
const args = parseArgs();
const url = args.url;
@@ -179,13 +200,16 @@ async function main() {
const startTs = new Date();
try {
// Set up listeners
// Set up listeners BEFORE navigation
await setupListeners();
// Write PID file so chrome_cleanup can kill us
// Write PID file so chrome_cleanup can kill any remaining processes
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Report success immediately (we're staying alive in background)
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();
// Report success
const endTs = new Date();
const duration = (endTs - startTs) / 1000;
@@ -207,18 +231,7 @@ async function main() {
};
console.log(`RESULT_JSON=${JSON.stringify(result)}`);
// Daemonize: detach from parent and keep running
// This process will be killed by chrome_cleanup
if (process.stdin.isTTY) {
process.stdin.pause();
}
process.stdin.unref();
process.stdout.end();
process.stderr.end();
// Keep the process alive indefinitely
// Will be killed by chrome_cleanup via the PID file
setInterval(() => {}, 1000);
process.exit(0);
} catch (e) {
const error = `${e.name}: ${e.message}`;

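The same handshake, sketched in Python for a hypothetical hook that must not run its capture logic until chrome_navigate finishes; the marker path and timeout mirror the JS hooks above.

```python
import time
from pathlib import Path

PAGE_LOADED_MARKER = Path('../chrome_navigate/page_loaded.txt')  # written by chrome_navigate
MAX_WAIT_S = 120      # same 2-minute budget as the JS hooks
POLL_INTERVAL_S = 0.1

def wait_for_navigation() -> None:
    """Block until chrome_navigate reports the page has loaded."""
    waited = 0.0
    while not PAGE_LOADED_MARKER.exists() and waited < MAX_WAIT_S:
        time.sleep(POLL_INTERVAL_S)
        waited += POLL_INTERVAL_S
    if not PAGE_LOADED_MARKER.exists():
        raise TimeoutError('chrome_navigate did not complete within 2 minutes')
```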
View File

@@ -0,0 +1,45 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"additionalProperties": false,
"properties": {
"SAVE_GALLERY_DL": {
"type": "boolean",
"default": true,
"x-aliases": ["USE_GALLERY_DL", "FETCH_GALLERY"],
"description": "Enable gallery downloading with gallery-dl"
},
"GALLERY_DL_BINARY": {
"type": "string",
"default": "gallery-dl",
"description": "Path to gallery-dl binary"
},
"GALLERY_DL_TIMEOUT": {
"type": "integer",
"default": 3600,
"minimum": 30,
"x-fallback": "TIMEOUT",
"description": "Timeout for gallery downloads in seconds"
},
"GALLERY_DL_CHECK_SSL_VALIDITY": {
"type": "boolean",
"default": true,
"x-fallback": "CHECK_SSL_VALIDITY",
"description": "Whether to verify SSL certificates"
},
"GALLERY_DL_ARGS": {
"type": "array",
"items": {"type": "string"},
"default": [
"--write-metadata",
"--write-info-json"
],
"description": "Default gallery-dl arguments"
},
"GALLERY_DL_EXTRA_ARGS": {
"type": "string",
"default": "",
"description": "Extra arguments for gallery-dl (space-separated)"
}
}
}
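To make the custom `x-aliases` and `x-fallback` keys concrete, here is one plausible way a hook could resolve them against the environment. This is a sketch of the intended semantics, not ArchiveBox's actual config loader.

```python
import os

def resolve_bool(primary: str, aliases: tuple[str, ...] = (), default: bool = True) -> bool:
    """Check the primary env var first, then any x-aliases, else use the default."""
    for name in (primary, *aliases):
        val = os.environ.get(name, '').strip().lower()
        if val in ('true', '1', 'yes', 'on'):
            return True
        if val in ('false', '0', 'no', 'off'):
            return False
    return default

def resolve_int(primary: str, fallback: str | None = None, default: int = 0) -> int:
    """Check the primary env var, then the x-fallback var, else use the default."""
    for name in (primary, fallback):
        if not name:
            continue
        try:
            return int(os.environ[name])
        except (KeyError, ValueError):
            continue
    return default

# Mirrors the schema above (resolution order is an assumption):
save_gallery = resolve_bool('SAVE_GALLERY_DL', aliases=('USE_GALLERY_DL', 'FETCH_GALLERY'))
timeout = resolve_int('GALLERY_DL_TIMEOUT', fallback='TIMEOUT', default=3600)
```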

View File

@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""
Validation hook for gallery-dl.
Runs at crawl start to verify gallery-dl binary is available.
Outputs JSONL for InstalledBinary and Machine config updates.
"""
import os
import sys
import json
import shutil
import hashlib
import subprocess
from pathlib import Path
def get_binary_version(abspath: str, version_flag: str = '--version') -> str | None:
"""Get version string from binary."""
try:
result = subprocess.run(
[abspath, version_flag],
capture_output=True,
text=True,
timeout=5,
)
if result.returncode == 0 and result.stdout:
first_line = result.stdout.strip().split('\n')[0]
return first_line[:64]
except Exception:
pass
return None
def get_binary_hash(abspath: str) -> str | None:
"""Get SHA256 hash of binary."""
try:
with open(abspath, 'rb') as f:
return hashlib.sha256(f.read()).hexdigest()
except Exception:
return None
def find_gallerydl() -> dict | None:
"""Find gallery-dl binary."""
try:
from abx_pkg import Binary, PipProvider, EnvProvider
class GalleryDlBinary(Binary):
name: str = 'gallery-dl'
binproviders_supported = [PipProvider(), EnvProvider()]
binary = GalleryDlBinary()
loaded = binary.load()
if loaded and loaded.abspath:
return {
'name': 'gallery-dl',
'abspath': str(loaded.abspath),
'version': str(loaded.version) if loaded.version else None,
'sha256': loaded.sha256 if hasattr(loaded, 'sha256') else None,
'binprovider': loaded.binprovider.name if loaded.binprovider else 'env',
}
except ImportError:
pass
except Exception:
pass
# Fallback to shutil.which
abspath = shutil.which('gallery-dl') or os.environ.get('GALLERY_DL_BINARY', '')
if abspath and Path(abspath).is_file():
return {
'name': 'gallery-dl',
'abspath': abspath,
'version': get_binary_version(abspath),
'sha256': get_binary_hash(abspath),
'binprovider': 'env',
}
return None
def main():
# Check for gallery-dl (required)
gallerydl_result = find_gallerydl()
missing_deps = []
# Emit results for gallery-dl
if gallerydl_result and gallerydl_result.get('abspath'):
print(json.dumps({
'type': 'InstalledBinary',
'name': gallerydl_result['name'],
'abspath': gallerydl_result['abspath'],
'version': gallerydl_result['version'],
'sha256': gallerydl_result['sha256'],
'binprovider': gallerydl_result['binprovider'],
}))
print(json.dumps({
'type': 'Machine',
'_method': 'update',
'key': 'config/GALLERY_DL_BINARY',
'value': gallerydl_result['abspath'],
}))
if gallerydl_result['version']:
print(json.dumps({
'type': 'Machine',
'_method': 'update',
'key': 'config/GALLERY_DL_VERSION',
'value': gallerydl_result['version'],
}))
else:
print(json.dumps({
'type': 'Dependency',
'bin_name': 'gallery-dl',
'bin_providers': 'pip,env',
}))
missing_deps.append('gallery-dl')
if missing_deps:
print(f"Missing dependencies: {', '.join(missing_deps)}", file=sys.stderr)
sys.exit(1)
else:
sys.exit(0)
if __name__ == '__main__':
main()
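On success the hook emits JSONL records like the following on stdout (binary path, version, and hash are illustrative placeholders); on failure it emits a single `Dependency` record and exits non-zero.

```jsonl
{"type": "InstalledBinary", "name": "gallery-dl", "abspath": "/usr/local/bin/gallery-dl", "version": "1.27.0", "sha256": "…", "binprovider": "pip"}
{"type": "Machine", "_method": "update", "key": "config/GALLERY_DL_BINARY", "value": "/usr/local/bin/gallery-dl"}
{"type": "Machine", "_method": "update", "key": "config/GALLERY_DL_VERSION", "value": "1.27.0"}
```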

View File

@@ -0,0 +1,299 @@
#!/usr/bin/env python3
"""
Download image galleries from a URL using gallery-dl.
Usage: on_Snapshot__gallerydl.py --url=<url> --snapshot-id=<uuid>
Output: Downloads gallery images to $PWD/gallerydl/
Environment variables:
GALLERY_DL_BINARY: Path to gallery-dl binary
GALLERY_DL_TIMEOUT: Timeout in seconds (default: 3600 for large galleries)
GALLERY_DL_CHECK_SSL_VALIDITY: Whether to check SSL certificates (default: True)
GALLERY_DL_EXTRA_ARGS: Extra arguments for gallery-dl (space-separated)
# Gallery-dl feature toggles
USE_GALLERY_DL: Enable gallery-dl gallery extraction (default: True)
SAVE_GALLERY_DL: Alias for USE_GALLERY_DL
# Fallback to ARCHIVING_CONFIG values if GALLERY_DL_* not set:
GALLERY_DL_TIMEOUT: Fallback timeout for gallery downloads
TIMEOUT: Fallback timeout
CHECK_SSL_VALIDITY: Fallback SSL check
"""
import json
import os
import shutil
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path
import rich_click as click
# Extractor metadata
EXTRACTOR_NAME = 'gallerydl'
BIN_NAME = 'gallery-dl'
BIN_PROVIDERS = 'pip,env'
OUTPUT_DIR = '.'
def get_env(name: str, default: str = '') -> str:
return os.environ.get(name, default).strip()
def get_env_bool(name: str, default: bool = False) -> bool:
val = get_env(name, '').lower()
if val in ('true', '1', 'yes', 'on'):
return True
if val in ('false', '0', 'no', 'off'):
return False
return default
def get_env_int(name: str, default: int = 0) -> int:
try:
return int(get_env(name, str(default)))
except ValueError:
return default
STATICFILE_DIR = '../staticfile'
MEDIA_DIR = '../media'
def has_staticfile_output() -> bool:
"""Check if staticfile extractor already downloaded this URL."""
staticfile_dir = Path(STATICFILE_DIR)
return staticfile_dir.exists() and any(staticfile_dir.iterdir())
def has_media_output() -> bool:
"""Check if media extractor already downloaded this URL."""
media_dir = Path(MEDIA_DIR)
return media_dir.exists() and any(media_dir.iterdir())
def find_gallerydl() -> str | None:
"""Find gallery-dl binary."""
gallerydl = get_env('GALLERY_DL_BINARY')
if gallerydl and os.path.isfile(gallerydl):
return gallerydl
binary = shutil.which('gallery-dl')
if binary:
return binary
return None
def get_version(binary: str) -> str:
"""Get gallery-dl version."""
try:
result = subprocess.run([binary, '--version'], capture_output=True, text=True, timeout=10)
return result.stdout.strip()[:64]
except Exception:
return ''
# Default gallery-dl args
def get_gallerydl_default_args() -> list[str]:
"""Build default gallery-dl arguments."""
return [
'--write-metadata',
'--write-info-json',
]
def save_gallery(url: str, binary: str) -> tuple[bool, str | None, str]:
"""
Download gallery using gallery-dl.
Returns: (success, output_path, error_message)
"""
# Get config from env (with GALLERY_DL_ prefix or fallback to ARCHIVING_CONFIG style)
timeout = get_env_int('GALLERY_DL_TIMEOUT') or get_env_int('TIMEOUT', 3600)
check_ssl = get_env_bool('GALLERY_DL_CHECK_SSL_VALIDITY', get_env_bool('CHECK_SSL_VALIDITY', True))
extra_args = get_env('GALLERY_DL_EXTRA_ARGS', '')
# Output directory is current directory (hook already runs in output dir)
output_dir = Path(OUTPUT_DIR)
# Build command (later options take precedence)
cmd = [
binary,
*get_gallerydl_default_args(),
'-d', str(output_dir),
]
if not check_ssl:
cmd.append('--no-check-certificate')
if extra_args:
cmd.extend(extra_args.split())
cmd.append(url)
try:
result = subprocess.run(cmd, capture_output=True, timeout=timeout, text=True)
# Check if any gallery files were downloaded
gallery_extensions = (
'.jpg', '.jpeg', '.png', '.gif', '.webp', '.bmp', '.svg',
'.mp4', '.webm', '.mkv', '.avi', '.mov', '.flv',
'.json', '.txt', '.zip',
)
downloaded_files = [
f for f in output_dir.glob('*')
if f.is_file() and f.suffix.lower() in gallery_extensions
]
if downloaded_files:
# Return first image file, or first file if no images
image_files = [
f for f in downloaded_files
if f.suffix.lower() in ('.jpg', '.jpeg', '.png', '.gif', '.webp', '.bmp')
]
output = str(image_files[0]) if image_files else str(downloaded_files[0])
return True, output, ''
else:
stderr = result.stderr
# These are NOT errors - page simply has no downloadable gallery
# Return success with no output (legitimate "nothing to download")
if 'unsupported URL' in stderr.lower():
return True, None, '' # Not a gallery site - success, no output
if 'no results' in stderr.lower():
return True, None, '' # No gallery found - success, no output
if result.returncode == 0:
return True, None, '' # gallery-dl exited cleanly, just no gallery - success
# These ARE errors - something went wrong
if '404' in stderr:
return False, None, '404 Not Found'
if '403' in stderr:
return False, None, '403 Forbidden'
if 'Unable to extract' in stderr:
return False, None, 'Unable to extract gallery info'
return False, None, f'gallery-dl error: {stderr[:200]}'
except subprocess.TimeoutExpired:
return False, None, f'Timed out after {timeout} seconds'
except Exception as e:
return False, None, f'{type(e).__name__}: {e}'
@click.command()
@click.option('--url', required=True, help='URL to download gallery from')
@click.option('--snapshot-id', required=True, help='Snapshot UUID')
def main(url: str, snapshot_id: str):
"""Download image gallery from a URL using gallery-dl."""
start_ts = datetime.now(timezone.utc)
version = ''
output = None
status = 'failed'
error = ''
binary = None
cmd_str = ''
try:
# Check if gallery-dl is enabled
if not (get_env_bool('USE_GALLERY_DL', True) and get_env_bool('SAVE_GALLERY_DL', True)):
print('Skipping gallery-dl (USE_GALLERY_DL=False or SAVE_GALLERY_DL=False)')
status = 'skipped'
end_ts = datetime.now(timezone.utc)
print(f'START_TS={start_ts.isoformat()}')
print(f'END_TS={end_ts.isoformat()}')
print(f'STATUS={status}')
print(f'RESULT_JSON={json.dumps({"extractor": EXTRACTOR_NAME, "status": status, "url": url, "snapshot_id": snapshot_id})}')
sys.exit(0)
# Check if staticfile or media extractors already handled this (skip)
if has_staticfile_output():
print(f'Skipping gallery-dl - staticfile extractor already downloaded this')
status = 'skipped'
print(f'START_TS={start_ts.isoformat()}')
print(f'END_TS={datetime.now(timezone.utc).isoformat()}')
print(f'STATUS={status}')
print(f'RESULT_JSON={json.dumps({"extractor": EXTRACTOR_NAME, "status": status, "url": url, "snapshot_id": snapshot_id})}')
sys.exit(0)
if has_media_output():
print(f'Skipping gallery-dl - media extractor already downloaded this')
status = 'skipped'
print(f'START_TS={start_ts.isoformat()}')
print(f'END_TS={datetime.now(timezone.utc).isoformat()}')
print(f'STATUS={status}')
print(f'RESULT_JSON={json.dumps({"extractor": EXTRACTOR_NAME, "status": status, "url": url, "snapshot_id": snapshot_id})}')
sys.exit(0)
# Find binary
binary = find_gallerydl()
if not binary:
print(f'ERROR: {BIN_NAME} binary not found', file=sys.stderr)
print(f'DEPENDENCY_NEEDED={BIN_NAME}', file=sys.stderr)
print(f'BIN_PROVIDERS={BIN_PROVIDERS}', file=sys.stderr)
print(f'INSTALL_HINT=pip install gallery-dl', file=sys.stderr)
sys.exit(1)
version = get_version(binary)
cmd_str = f'{binary} {url}'
# Run extraction
success, output, error = save_gallery(url, binary)
status = 'succeeded' if success else 'failed'
if success:
output_dir = Path(OUTPUT_DIR)
files = list(output_dir.glob('*'))
file_count = len([f for f in files if f.is_file()])
if file_count > 0:
print(f'gallery-dl completed: {file_count} files downloaded')
else:
print(f'gallery-dl completed: no gallery found on page (this is normal)')
except Exception as e:
error = f'{type(e).__name__}: {e}'
status = 'failed'
# Print results
end_ts = datetime.now(timezone.utc)
duration = (end_ts - start_ts).total_seconds()
print(f'START_TS={start_ts.isoformat()}')
print(f'END_TS={end_ts.isoformat()}')
print(f'DURATION={duration:.2f}')
if cmd_str:
print(f'CMD={cmd_str}')
if version:
print(f'VERSION={version}')
if output:
print(f'OUTPUT={output}')
print(f'STATUS={status}')
if error:
print(f'ERROR={error}', file=sys.stderr)
# Print JSON result
result_json = {
'extractor': EXTRACTOR_NAME,
'url': url,
'snapshot_id': snapshot_id,
'status': status,
'start_ts': start_ts.isoformat(),
'end_ts': end_ts.isoformat(),
'duration': round(duration, 2),
'cmd_version': version,
'output': output,
'error': error or None,
}
print(f'RESULT_JSON={json.dumps(result_json)}')
sys.exit(0 if status == 'succeeded' else 1)
if __name__ == '__main__':
main()
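Like the other extractor hooks, this one reports its result via `KEY=VALUE` lines plus a final `RESULT_JSON=` line on stdout. A small sketch of how a caller might consume that contract (not ArchiveBox's actual orchestrator code; the URL and UUID passed in would be placeholders):

```python
import json
import subprocess

def run_gallerydl_hook(url: str, snapshot_id: str, cwd: str) -> dict:
    """Run the hook in its output directory and parse the RESULT_JSON line from stdout."""
    proc = subprocess.run(
        ['python3', 'on_Snapshot__gallerydl.py', f'--url={url}', f'--snapshot-id={snapshot_id}'],
        capture_output=True, text=True, cwd=cwd,
    )
    for line in proc.stdout.splitlines():
        if line.startswith('RESULT_JSON='):
            return json.loads(line[len('RESULT_JSON='):])
    return {'extractor': 'gallerydl', 'status': 'failed', 'error': 'no RESULT_JSON emitted'}
```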

View File

@@ -1,9 +1,10 @@
#!/usr/bin/env node
/**
* Archive all network responses during page load (DAEMON MODE).
* Archive all network responses during page load.
*
* This hook daemonizes and stays alive to capture network responses throughout
* the snapshot lifecycle. It's killed by chrome_cleanup at the end.
* This hook sets up CDP listeners BEFORE chrome_navigate loads the page,
* then waits for navigation to complete. The listeners capture all network
* responses during the navigation.
*
* Usage: on_Snapshot__24_responses.js --url=<url> --snapshot-id=<uuid>
* Output: Creates responses/ directory with index.jsonl + listener.pid
@@ -14,7 +15,6 @@ const path = require('path');
const crypto = require('crypto');
const puppeteer = require('puppeteer-core');
// Extractor metadata
const EXTRACTOR_NAME = 'responses';
const OUTPUT_DIR = '.';
const PID_FILE = 'listener.pid';
@@ -23,7 +23,6 @@ const CHROME_SESSION_DIR = '../chrome_session';
// Resource types to capture (by default, capture everything)
const DEFAULT_TYPES = ['script', 'stylesheet', 'font', 'image', 'media', 'xhr', 'websocket'];
// Parse command line arguments
function parseArgs() {
const args = {};
process.argv.slice(2).forEach(arg => {
@@ -35,7 +34,6 @@ function parseArgs() {
return args;
}
// Get environment variable with default
function getEnv(name, defaultValue = '') {
return (process.env[name] || defaultValue).trim();
}
@@ -52,7 +50,6 @@ function getEnvInt(name, defaultValue = 0) {
return isNaN(val) ? defaultValue : val;
}
// Get CDP URL from chrome_session
function getCdpUrl() {
const cdpFile = path.join(CHROME_SESSION_DIR, 'cdp_url.txt');
if (fs.existsSync(cdpFile)) {
@@ -69,7 +66,6 @@ function getPageId() {
return null;
}
// Get file extension from MIME type
function getExtensionFromMimeType(mimeType) {
const mimeMap = {
'text/html': 'html',
@@ -101,7 +97,6 @@ function getExtensionFromMimeType(mimeType) {
return mimeMap[mimeBase] || '';
}
// Get extension from URL path
function getExtensionFromUrl(url) {
try {
const pathname = new URL(url).pathname;
@@ -112,49 +107,42 @@ function getExtensionFromUrl(url) {
}
}
// Sanitize filename
function sanitizeFilename(str, maxLen = 200) {
return str
.replace(/[^a-zA-Z0-9._-]/g, '_')
.slice(0, maxLen);
}
// Create symlink (handle errors gracefully)
async function createSymlink(target, linkPath) {
try {
// Create parent directory
const dir = path.dirname(linkPath);
if (!fs.existsSync(dir)) {
fs.mkdirSync(dir, { recursive: true });
}
// Remove existing symlink/file if present
if (fs.existsSync(linkPath)) {
fs.unlinkSync(linkPath);
}
// Create relative symlink
const relativePath = path.relative(dir, target);
fs.symlinkSync(relativePath, linkPath);
} catch (e) {
// Ignore symlink errors (file conflicts, permissions, etc.)
// Ignore symlink errors
}
}
// Set up response listener
async function setupListener() {
const typesStr = getEnv('RESPONSES_TYPES', DEFAULT_TYPES.join(','));
const typesToSave = typesStr.split(',').map(t => t.trim().toLowerCase());
// Create subdirectories for organizing responses
// Create subdirectories
const allDir = path.join(OUTPUT_DIR, 'all');
if (!fs.existsSync(allDir)) {
fs.mkdirSync(allDir, { recursive: true });
}
// Create index file
const indexPath = path.join(OUTPUT_DIR, 'index.jsonl');
fs.writeFileSync(indexPath, ''); // Clear existing
fs.writeFileSync(indexPath, '');
const cdpUrl = getCdpUrl();
if (!cdpUrl) {
@@ -182,7 +170,7 @@ async function setupListener() {
throw new Error('No page found');
}
// Set up response listener to capture network traffic
// Set up response listener
page.on('response', async (response) => {
try {
const request = response.request();
@@ -205,7 +193,6 @@ async function setupListener() {
try {
bodyBuffer = await response.buffer();
} catch (e) {
// Some responses can't be captured (already consumed, etc.)
return;
}
@@ -234,7 +221,6 @@ async function setupListener() {
const filename = path.basename(pathname) || 'index' + (extension ? '.' + extension : '');
const dirPath = path.dirname(pathname);
// Create symlink: responses/<type>/<hostname>/<path>/<filename>
const symlinkDir = path.join(OUTPUT_DIR, resourceType, hostname, dirPath);
const symlinkPath = path.join(symlinkDir, filename);
await createSymlink(uniquePath, symlinkPath);
@@ -250,7 +236,7 @@ async function setupListener() {
const indexEntry = {
ts: timestamp,
method,
url: method === 'DATA' ? url.slice(0, 128) : url, // Truncate data: URLs
url: method === 'DATA' ? url.slice(0, 128) : url,
urlSha256,
status,
resourceType,
@@ -267,10 +253,30 @@ async function setupListener() {
}
});
// Don't disconnect - keep browser connection alive
return { browser, page };
}
async function waitForNavigation() {
// Wait for chrome_navigate to complete
const navDir = path.join(CHROME_SESSION_DIR, '../chrome_navigate');
const pageLoadedMarker = path.join(navDir, 'page_loaded.txt');
const maxWait = 120000; // 2 minutes
const pollInterval = 100;
let waitTime = 0;
while (!fs.existsSync(pageLoadedMarker) && waitTime < maxWait) {
await new Promise(resolve => setTimeout(resolve, pollInterval));
waitTime += pollInterval;
}
if (!fs.existsSync(pageLoadedMarker)) {
throw new Error('Timeout waiting for navigation (chrome_navigate did not complete)');
}
// Wait a bit longer for any post-load responses
await new Promise(resolve => setTimeout(resolve, 1000));
}
async function main() {
const args = parseArgs();
const url = args.url;
@@ -296,13 +302,16 @@ async function main() {
const startTs = new Date();
try {
// Set up listener
// Set up listener BEFORE navigation
await setupListener();
// Write PID file so chrome_cleanup can kill us
// Write PID file
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Report success immediately (we're staying alive in background)
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();
// Report success
const endTs = new Date();
const duration = (endTs - startTs) / 1000;
@@ -324,18 +333,7 @@ async function main() {
};
console.log(`RESULT_JSON=${JSON.stringify(result)}`);
// Daemonize: detach from parent and keep running
// This process will be killed by chrome_cleanup
if (process.stdin.isTTY) {
process.stdin.pause();
}
process.stdin.unref();
process.stdout.end();
process.stderr.end();
// Keep the process alive indefinitely
// Will be killed by chrome_cleanup via the PID file
setInterval(() => {}, 1000);
process.exit(0);
} catch (e) {
const error = `${e.name}: ${e.message}`;

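Since `index.jsonl` doubles as a manifest of everything captured, a consumer can summarize an archive in a few lines. Field names are taken from the `indexEntry` above; the `responses/index.jsonl` path is assumed relative to the snapshot directory, per the hook's documented output.

```python
import json
from collections import Counter
from pathlib import Path

def summarize_responses(index_path: str = 'responses/index.jsonl') -> Counter:
    """Count captured responses by resourceType using the index.jsonl manifest."""
    counts = Counter()
    for line in Path(index_path).read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        counts[entry.get('resourceType', 'other')] += 1
    return counts

# e.g. Counter({'image': 12, 'script': 9, 'stylesheet': 4})  # illustrative counts
```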
View File

@@ -1,9 +1,10 @@
#!/usr/bin/env node
/**
* Extract SSL/TLS certificate details from a URL (DAEMON MODE).
* Extract SSL/TLS certificate details from a URL.
*
* This hook daemonizes and stays alive to capture SSL details throughout
* the snapshot lifecycle. It's killed by chrome_cleanup at the end.
* This hook sets up CDP listeners BEFORE chrome_navigate loads the page,
* then waits for navigation to complete. The listener captures SSL details
* during the navigation request.
*
* Usage: on_Snapshot__23_ssl.js --url=<url> --snapshot-id=<uuid>
* Output: Writes ssl.json + listener.pid
@@ -13,14 +14,12 @@ const fs = require('fs');
const path = require('path');
const puppeteer = require('puppeteer-core');
// Extractor metadata
const EXTRACTOR_NAME = 'ssl';
const OUTPUT_DIR = '.';
const OUTPUT_FILE = 'ssl.json';
const PID_FILE = 'listener.pid';
const CHROME_SESSION_DIR = '../chrome_session';
// Parse command line arguments
function parseArgs() {
const args = {};
process.argv.slice(2).forEach(arg => {
@@ -32,7 +31,6 @@ function parseArgs() {
return args;
}
// Get environment variable with default
function getEnv(name, defaultValue = '') {
return (process.env[name] || defaultValue).trim();
}
@@ -44,7 +42,6 @@ function getEnvBool(name, defaultValue = false) {
return defaultValue;
}
// Get CDP URL from chrome_session
function getCdpUrl() {
const cdpFile = path.join(CHROME_SESSION_DIR, 'cdp_url.txt');
if (fs.existsSync(cdpFile)) {
@@ -61,7 +58,6 @@ function getPageId() {
return null;
}
// Set up SSL listener
async function setupListener(url) {
const outputPath = path.join(OUTPUT_DIR, OUTPUT_FILE);
@@ -96,7 +92,7 @@ async function setupListener(url) {
throw new Error('No page found');
}
// Set up listener to capture SSL details when chrome_navigate loads the page
// Set up listener to capture SSL details during navigation
page.on('response', async (response) => {
try {
const request = response.request();
@@ -148,10 +144,27 @@ async function setupListener(url) {
}
});
// Don't disconnect - keep browser connection alive
return { browser, page };
}
async function waitForNavigation() {
// Wait for chrome_navigate to complete (it writes page_loaded.txt)
const navDir = path.join(CHROME_SESSION_DIR, '../chrome_navigate');
const pageLoadedMarker = path.join(navDir, 'page_loaded.txt');
const maxWait = 120000; // 2 minutes
const pollInterval = 100;
let waitTime = 0;
while (!fs.existsSync(pageLoadedMarker) && waitTime < maxWait) {
await new Promise(resolve => setTimeout(resolve, pollInterval));
waitTime += pollInterval;
}
if (!fs.existsSync(pageLoadedMarker)) {
throw new Error('Timeout waiting for navigation (chrome_navigate did not complete)');
}
}
async function main() {
const args = parseArgs();
const url = args.url;
@@ -177,13 +190,16 @@ async function main() {
const startTs = new Date();
try {
// Set up listener
// Set up listener BEFORE navigation
await setupListener(url);
// Write PID file so chrome_cleanup can kill us
// Write PID file so chrome_cleanup can kill any remaining processes
fs.writeFileSync(path.join(OUTPUT_DIR, PID_FILE), String(process.pid));
// Report success immediately (we're staying alive in background)
// Wait for chrome_navigate to complete (BLOCKING)
await waitForNavigation();
// Report success
const endTs = new Date();
const duration = (endTs - startTs) / 1000;
@@ -205,18 +221,7 @@ async function main() {
};
console.log(`RESULT_JSON=${JSON.stringify(result)}`);
// Daemonize: detach from parent and keep running
// This process will be killed by chrome_cleanup
if (process.stdin.isTTY) {
process.stdin.pause();
}
process.stdin.unref();
process.stdout.end();
process.stderr.end();
// Keep the process alive indefinitely
// Will be killed by chrome_cleanup via the PID file
setInterval(() => {}, 1000);
process.exit(0);
} catch (e) {
const error = `${e.name}: ${e.message}`;