wip major changes

This commit is contained in:
Nick Sweeting
2025-12-24 20:09:51 -08:00
parent c1335fed37
commit 1915333b81
450 changed files with 35814 additions and 19015 deletions

View File

@@ -0,0 +1,9 @@
{
"permissions": {
"allow": [
"Bash(python -m archivebox:*)",
"Bash(ls:*)",
"Bash(xargs:*)"
]
}
}

3
ArchiveBox.conf Normal file
View File

@@ -0,0 +1,3 @@
[SERVER_CONFIG]
SECRET_KEY = y6fw9wcaqls9sx_dze6ahky9ggpkpzoaw5g5v98_u3ro5j0_4f

300
PLUGIN_ENHANCEMENTS.md Normal file
View File

@@ -0,0 +1,300 @@
# JS Implementation Features to Port to Python ArchiveBox
## Priority: High Impact Features
### 1. **Screen Recording** ⭐⭐⭐
**JS Implementation:** Captures MP4 video + animated GIF of the archiving session
```javascript
// Records browser activity including scrolling, interactions
PuppeteerScreenRecorder screenrecording.mp4
ffmpeg conversion screenrecording.gif (first 10s, optimized)
```
**Enhancement for Python:**
- Add `on_Snapshot__24_screenrecording.py`
- Use puppeteer or playwright screen recording APIs
- Generate both full MP4 and thumbnail GIF
- **Value:** Visual proof of what was captured, useful for QA and debugging
### 2. **AI Quality Assurance** ⭐⭐⭐
**JS Implementation:** Uses GPT-4o to analyze screenshots and validate archive quality
```javascript
// ai_qa.py analyzes screenshot.png and returns:
{
"pct_visible": 85,
"warnings": ["Some content may be cut off"],
"main_content_title": "Article Title",
"main_content_author": "Author Name",
"main_content_date": "2024-01-15",
"website_brand_name": "Example.com"
}
```
**Enhancement for Python:**
- Add `on_Snapshot__95_aiqa.py` (runs after screenshot)
- Integrate with OpenAI API or local vision models
- Validates: content visibility, broken layouts, CAPTCHA blocks, error pages
- **Value:** Automatic detection of failed archives, quality scoring
### 3. **Network Response Archiving** ⭐⭐⭐
**JS Implementation:** Saves ALL network responses in organized structure
```
responses/
├── all/ # Timestamped unique files
│ ├── 20240101120000__GET__https%3A%2F%2Fexample.com%2Fapi.json
│ └── ...
├── script/ # Organized by resource type
│ └── example.com/path/to/script.js → ../all/...
├── stylesheet/
├── image/
├── media/
└── index.jsonl # Searchable index
```
**Enhancement for Python:**
- Add `on_Snapshot__23_responses.py`
- Save all HTTP responses (XHR, images, scripts, etc.)
- Create both timestamped and URL-organized views via symlinks
- Generate `index.jsonl` with metadata (URL, method, status, mimeType, sha256)
- **Value:** Complete HTTP-level archive, better debugging, API response preservation
### 4. **Detailed Metadata Extractors** ⭐⭐
#### 4a. SSL/TLS Details (`on_Snapshot__16_ssl.py`)
```python
{
"protocol": "TLS 1.3",
"cipher": "AES_128_GCM",
"securityState": "secure",
"securityDetails": {
"issuer": "Let's Encrypt",
"validFrom": ...,
"validTo": ...
}
}
```
#### 4b. SEO Metadata (`on_Snapshot__17_seo.py`)
Extracts all `<meta>` tags:
```python
{
"og:title": "Page Title",
"og:image": "https://example.com/image.jpg",
"twitter:card": "summary_large_image",
"description": "Page description",
...
}
```
#### 4c. Accessibility Tree (`on_Snapshot__18_accessibility.py`)
```python
{
"headings": ["# Main Title", "## Section 1", ...],
"iframes": ["https://embed.example.com/..."],
"tree": { ... } # Full accessibility snapshot
}
```
#### 4d. Outlinks Categorization (`on_Snapshot__19_outlinks.py`)
Better than current implementation - categorizes by type:
```python
{
"hrefs": [...], # All <a> links
"images": [...], # <img src>
"css_stylesheets": [...], # <link rel=stylesheet>
"js_scripts": [...], # <script src>
"iframes": [...], # <iframe src>
"css_images": [...], # background-image: url()
"links": [{...}] # <link> tags (rel, href)
}
```
#### 4e. Redirects Chain (`on_Snapshot__15_redirects.py`)
Tracks full redirect sequence:
```python
{
"redirects_from_http": [
{"url": "http://ex.com", "status": 301, "isMainFrame": True},
{"url": "https://ex.com", "status": 302, "isMainFrame": True},
{"url": "https://www.ex.com", "status": 200, "isMainFrame": True}
]
}
```
**Value:** Rich metadata for research, SEO analysis, security auditing
### 5. **Enhanced Screenshot System** ⭐⭐
**JS Implementation:**
- `screenshot.png` - Full-page PNG at high resolution (4:3 ratio)
- `screenshot.jpg` - Compressed JPEG for thumbnails (1440x1080, 90% quality)
- Automatically crops to reasonable height for long pages
**Enhancement for Python:**
- Update `screenshot` extractor to generate both formats
- Use aspect ratio optimization (4:3 is better for thumbnails than 16:9)
- **Value:** Faster loading thumbnails, better storage efficiency
### 6. **Console Log Capture** ⭐⭐
**JS Implementation:**
```
console.log - Captures all console output
ERROR /path/to/script.js:123 "Uncaught TypeError: ..."
WARNING https://example.com/api Failed to load resource: net::ERR_BLOCKED_BY_CLIENT
```
**Enhancement for Python:**
- Add `on_Snapshot__20_consolelog.py`
- Useful for debugging JavaScript errors, tracking blocked resources
- **Value:** Identifies rendering issues, ad blockers, CORS problems
## Priority: Nice-to-Have Enhancements
### 7. **Request/Response Headers** ⭐
**Current:** Headers extractor exists but could be enhanced
**JS Enhancement:** Separates request vs response, includes extra headers
### 8. **Human Behavior Emulation** ⭐
**JS Implementation:**
- Mouse jiggling with ghost-cursor
- Smart scrolling with infinite scroll detection
- Comment expansion (Reddit, HackerNews, etc.)
- Form submission
- CAPTCHA solving via 2captcha extension
**Enhancement for Python:**
- Add `on_Snapshot__05_human_behavior.py` (runs BEFORE other extractors)
- Implement scrolling, clicking "Load More", expanding comments
- **Value:** Captures more content from dynamic sites
### 9. **CAPTCHA Solving** ⭐
**JS Implementation:** Integrates 2captcha extension
**Enhancement:** Add optional CAPTCHA solving via 2captcha API
**Value:** Access to Cloudflare-protected sites
### 10. **Source Map Downloading**
**JS Implementation:** Automatically downloads `.map` files for JS/CSS
**Enhancement:** Add `on_Snapshot__30_sourcemaps.py`
**Value:** Helps debug minified code
### 11. **Pandoc Markdown Conversion**
**JS Implementation:** Converts HTML ↔ Markdown using Pandoc
```bash
pandoc --from html --to markdown_github --wrap=none
```
**Enhancement:** Add `on_Snapshot__34_pandoc.py`
**Value:** Human-readable Markdown format
### 12. **Authentication Management** ⭐
**JS Implementation:**
- Sophisticated cookie storage with `cookies.txt` export
- LocalStorage + SessionStorage preservation
- Merge new cookies with existing ones (no overwrites)
**Enhancement:**
- Improve `auth.json` management to match JS sophistication
- Add `cookies.txt` export (Netscape format) for compatibility with wget/curl
- **Value:** Better session persistence across runs
### 13. **File Integrity & Versioning** ⭐⭐
**JS Implementation:**
- SHA256 hash for every file
- Merkle tree directory hashes
- Version directories (`versions/YYYYMMDDHHMMSS/`)
- Symlinks to latest versions
- `.files.json` manifest with metadata
**Enhancement:**
- Add `on_Snapshot__99_integrity.py` (runs last)
- Generate SHA256 hashes for all outputs
- Create version manifests
- **Value:** Verify archive integrity, detect corruption, track changes
### 14. **Directory Organization**
**JS Structure (superior):**
```
archive/<timestamp>/
├── versions/
│ ├── 20240101120000/ # Each run = new version
│ │ ├── screenshot.png
│ │ ├── singlefile.html
│ │ └── ...
│ └── 20240102150000/
├── screenshot.png → versions/20240102150000/screenshot.png # Symlink to latest
├── singlefile.html → ...
└── metrics.json
```
**Current Python:** All outputs in flat structure
**Enhancement:** Add versioning layer for tracking changes over time
### 15. **Speedtest Integration**
**JS Implementation:** Runs fast.com speedtest once per day
**Enhancement:** Optional `on_Snapshot__01_speedtest.py`
**Value:** Diagnose slow archives, track connection quality
### 16. **gallery-dl Support** ⭐
**JS Implementation:** Downloads photo galleries (Instagram, Twitter, etc.)
**Enhancement:** Add `on_Snapshot__30_photos.py` alongside existing `media` extractor
**Value:** Better support for image-heavy sites
## Implementation Priority Ranking
### Must-Have (High ROI):
1. **Network Response Archiving** - Complete HTTP archive
2. **AI Quality Assurance** - Automatic validation
3. **Screen Recording** - Visual proof of capture
4. **Enhanced Metadata** (SSL, SEO, Accessibility, Outlinks) - Research value
### Should-Have (Medium ROI):
5. **Console Log Capture** - Debugging aid
6. **File Integrity Hashing** - Archive verification
7. **Enhanced Screenshots** - Better thumbnails
8. **Versioning System** - Track changes over time
### Nice-to-Have (Lower ROI):
9. **Human Behavior Emulation** - Dynamic content
10. **CAPTCHA Solving** - Access restricted sites
11. **gallery-dl** - Image collections
12. **Pandoc Markdown** - Readable format
## Technical Considerations
### Dependencies Needed:
- **Screen Recording:** `playwright` or `puppeteer` with recording API
- **AI QA:** `openai` Python SDK or local vision model
- **Network Archiving:** CDP protocol access (already have via Chrome)
- **File Hashing:** Built-in `hashlib` (no new deps)
- **gallery-dl:** Install via pip
### Performance Impact:
- Screen recording: +2-3 seconds overhead per snapshot
- AI QA: +0.5-2 seconds (API call) per snapshot
- Response archiving: Minimal (async writes)
- File hashing: +0.1-0.5 seconds per snapshot
- Metadata extraction: Minimal (same page visit)
### Architecture Compatibility:
All proposed enhancements fit the existing hook-based plugin architecture:
- Use standard `on_Snapshot__NN_name.py` naming
- Return `ExtractorResult` objects
- Can reuse shared Chrome CDP sessions
- Follow existing error handling patterns
## Summary Statistics
**JS Implementation:**
- 35+ output types
- ~3000 lines of archiving logic
- Extensive quality assurance
- Complete HTTP-level capture
**Current Python Implementation:**
- 12 extractors
- Strong foundation with room for enhancement
**Recommended Additions:**
- **8 new high-priority extractors**
- **6 enhanced versions of existing extractors**
- **3 optional nice-to-have extractors**
This would bring the Python implementation to feature parity with the JS version while maintaining better code organization and the existing plugin architecture.

819
SIMPLIFICATION_PLAN.md Normal file
View File

@@ -0,0 +1,819 @@
# ArchiveBox 2025 Simplification Plan
**Status:** FINAL - Ready for implementation
**Last Updated:** 2024-12-24
---
## Final Decisions Summary
| Decision | Choice |
|----------|--------|
| Task Queue | Keep `retry_at` polling pattern (no Django Tasks) |
| State Machine | Preserve current semantics; only replace mixins/statemachines if identical retry/lock guarantees are kept |
| Event Model | Remove completely |
| ABX Plugin System | Remove entirely (`archivebox/pkgs/`) |
| abx-pkg | Keep as external pip dependency (separate repo: github.com/ArchiveBox/abx-pkg) |
| Binary Providers | File-based plugins using abx-pkg internally |
| Search Backends | **Hybrid:** hooks for indexing, Python classes for querying |
| Auth Methods | Keep simple (LDAP + normal), no pluginization needed |
| ABID | Already removed (ignore old references) |
| ArchiveResult | **Keep pre-creation** with `status=queued` + `retry_at` for consistency |
| Plugin Directory | **`archivebox/plugins/*`** for built-ins, **`data/plugins/*`** for user hooks (flat `on_*__*.*` files) |
| Locking | Use `retry_at` consistently across Crawl, Snapshot, ArchiveResult |
| Worker Model | **Separate processes** per model type + per extractor, visible in htop |
| Concurrency | **Per-extractor configurable** (e.g., `ytdlp_max_parallel=5`) |
| InstalledBinary | **Keep model** + add Dependency model for audit trail |
---
## Architecture Overview
### Consistent Queue/Lock Pattern
All models (Crawl, Snapshot, ArchiveResult) use the same pattern:
```python
class StatusMixin(models.Model):
status = models.CharField(max_length=15, db_index=True)
retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)
class Meta:
abstract = True
def tick(self) -> bool:
"""Override in subclass. Returns True if state changed."""
raise NotImplementedError
# Worker query (same for all models):
Model.objects.filter(
status__in=['queued', 'started'],
retry_at__lte=timezone.now()
).order_by('retry_at').first()
# Claim (atomic via optimistic locking):
updated = Model.objects.filter(
id=obj.id,
retry_at=obj.retry_at
).update(
retry_at=timezone.now() + timedelta(seconds=60)
)
if updated == 1: # Successfully claimed
obj.refresh_from_db()
obj.tick()
```
**Failure/cleanup guarantees**
- Objects stuck in `started` with a past `retry_at` must be reclaimed automatically using the existing retry/backoff rules.
- `tick()` implementations must continue to bump `retry_at` / transition to `backoff` the same way current statemachines do so that failures get retried without manual intervention.
### Process Tree (Separate Processes, Visible in htop)
```
archivebox server
├── orchestrator (pid=1000)
│ ├── crawl_worker_0 (pid=1001)
│ ├── crawl_worker_1 (pid=1002)
│ ├── snapshot_worker_0 (pid=1003)
│ ├── snapshot_worker_1 (pid=1004)
│ ├── snapshot_worker_2 (pid=1005)
│ ├── wget_worker_0 (pid=1006)
│ ├── wget_worker_1 (pid=1007)
│ ├── ytdlp_worker_0 (pid=1008) # Limited concurrency
│ ├── ytdlp_worker_1 (pid=1009)
│ ├── screenshot_worker_0 (pid=1010)
│ ├── screenshot_worker_1 (pid=1011)
│ ├── screenshot_worker_2 (pid=1012)
│ └── ...
```
**Configurable per-extractor concurrency:**
```python
# archivebox.conf or environment
WORKER_CONCURRENCY = {
'crawl': 2,
'snapshot': 3,
'wget': 2,
'ytdlp': 2, # Bandwidth-limited
'screenshot': 3,
'singlefile': 2,
'title': 5, # Fast, can run many
'favicon': 5,
}
```
---
## Hook System
### Discovery (Glob at Startup)
```python
# archivebox/hooks.py
from pathlib import Path
import subprocess
import os
import json
from django.conf import settings
BUILTIN_PLUGIN_DIR = Path(__file__).parent.parent / 'plugins'
USER_PLUGIN_DIR = settings.DATA_DIR / 'plugins'
def discover_hooks(event_name: str) -> list[Path]:
"""Find all scripts matching on_{EventName}__*.{sh,py,js} under archivebox/plugins/* and data/plugins/*"""
hooks = []
for base in (BUILTIN_PLUGIN_DIR, USER_PLUGIN_DIR):
if not base.exists():
continue
for ext in ('sh', 'py', 'js'):
hooks.extend(base.glob(f'*/on_{event_name}__*.{ext}'))
return sorted(hooks)
def run_hook(script: Path, output_dir: Path, **kwargs) -> dict:
"""Execute hook with --key=value args, cwd=output_dir."""
args = [str(script)]
for key, value in kwargs.items():
args.append(f'--{key.replace("_", "-")}={json.dumps(value, default=str)}')
env = os.environ.copy()
env['ARCHIVEBOX_DATA_DIR'] = str(settings.DATA_DIR)
result = subprocess.run(
args,
cwd=output_dir,
capture_output=True,
text=True,
timeout=300,
env=env,
)
return {
'returncode': result.returncode,
'stdout': result.stdout,
'stderr': result.stderr,
}
```
### Hook Interface
- **Input:** CLI args `--url=... --snapshot-id=...`
- **Location:** Built-in hooks in `archivebox/plugins/<plugin>/on_*__*.*`, user hooks in `data/plugins/<plugin>/on_*__*.*`
- **Internal API:** Should treat ArchiveBox as an external CLI—call `archivebox config --get ...`, `archivebox find ...`, import `abx-pkg` only when running in their own venvs.
- **Output:** Files written to `$PWD` (the output_dir), can call `archivebox create ...`
- **Logging:** stdout/stderr captured to ArchiveResult
- **Exit code:** 0 = success, non-zero = failure
---
## Unified Config Access
- Implement `archivebox.config.get_config(scope='global'|'crawl'|'snapshot'|...)` that merges defaults, config files, environment variables, DB overrides, and per-object config (seed/crawl/snapshot).
- Provide helpers (`get_config()`, `get_flat_config()`) for Python callers so `abx.pm.hook.get_CONFIG*` can be removed.
- Ensure the CLI command `archivebox config --get KEY` (and a machine-readable `--format=json`) uses the same API so hook scripts can query config via subprocess calls.
- Document that plugin hooks should prefer the CLI to fetch config rather than importing Django internals, guaranteeing they work from shell/bash/js without ArchiveBoxs runtime.
---
### Example Extractor Hooks
**Bash:**
```bash
#!/usr/bin/env bash
# plugins/on_Snapshot__wget.sh
set -e
# Parse args
for arg in "$@"; do
case $arg in
--url=*) URL="${arg#*=}" ;;
--snapshot-id=*) SNAPSHOT_ID="${arg#*=}" ;;
esac
done
# Find wget binary
WGET=$(archivebox find InstalledBinary --name=wget --format=abspath)
[ -z "$WGET" ] && echo "wget not found" >&2 && exit 1
# Run extraction (writes to $PWD)
$WGET --mirror --page-requisites --adjust-extension "$URL" 2>&1
echo "Completed wget mirror of $URL"
```
**Python:**
```python
#!/usr/bin/env python3
# plugins/on_Snapshot__singlefile.py
import argparse
import subprocess
import sys
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--url', required=True)
parser.add_argument('--snapshot-id', required=True)
args = parser.parse_args()
# Find binary via CLI
result = subprocess.run(
['archivebox', 'find', 'InstalledBinary', '--name=single-file', '--format=abspath'],
capture_output=True, text=True
)
bin_path = result.stdout.strip()
if not bin_path:
print("single-file not installed", file=sys.stderr)
sys.exit(1)
# Run extraction (writes to $PWD)
subprocess.run([bin_path, args.url, '--output', 'singlefile.html'], check=True)
print(f"Saved {args.url} to singlefile.html")
if __name__ == '__main__':
main()
```
---
## Binary Providers & Dependencies
- Move dependency tracking into a dedicated `dependencies` module (or extend `archivebox/machine/`) with two Django models:
```yaml
Dependency:
id: uuidv7
bin_name: extractor binary executable name (ytdlp|wget|screenshot|...)
bin_provider: apt | brew | pip | npm | gem | nix | '*' for any
custom_cmds: JSON of provider->install command overrides (optional)
config: JSON of env vars/settings to apply during install
created_at: utc datetime
InstalledBinary:
id: uuidv7
dependency: FK to Dependency
bin_name: executable name again
bin_abspath: filesystem path
bin_version: semver string
bin_hash: sha256 of the binary
bin_provider: apt | brew | pip | npm | gem | nix | custom | ...
created_at: utc datetime (last seen/installed)
is_valid: property returning True when both abspath+version are set
```
- Provide CLI commands for hook scripts: `archivebox find InstalledBinary --name=wget --format=abspath`, `archivebox dependency create ...`, etc.
- Hooks remain language agnostic and should not import ArchiveBox Django modules; they rely on CLI commands plus their own runtime (python/bash/js).
### Provider Hooks
- Built-in provider plugins live under `archivebox/plugins/<provider>/on_Dependency__*.py` (e.g., apt, brew, pip, custom).
- Each provider hook:
1. Checks if the Dependency allows that provider via `bin_provider` or wildcard `'*'`.
2. Builds the install command (`custom_cmds[provider]` override or sane default like `apt install -y <bin_name>`).
3. Executes the command (bash/python) and, on success, records/updates an `InstalledBinary`.
Example outline (bash or python, but still interacting via CLI):
```bash
# archivebox/plugins/apt/on_Dependency__install_using_apt_provider.sh
set -euo pipefail
DEP_JSON=$(archivebox dependency show --id="$DEPENDENCY_ID" --format=json)
BIN_NAME=$(echo "$DEP_JSON" | jq -r '.bin_name')
PROVIDER_ALLOWED=$(echo "$DEP_JSON" | jq -r '.bin_provider')
if [[ "$PROVIDER_ALLOWED" == "*" || "$PROVIDER_ALLOWED" == *"apt"* ]]; then
INSTALL_CMD=$(echo "$DEP_JSON" | jq -r '.custom_cmds.apt // empty')
INSTALL_CMD=${INSTALL_CMD:-"apt install -y --no-install-recommends $BIN_NAME"}
bash -lc "$INSTALL_CMD"
archivebox dependency register-installed \
--dependency-id="$DEPENDENCY_ID" \
--bin-provider=apt \
--bin-abspath="$(command -v "$BIN_NAME")" \
--bin-version="$("$(command -v "$BIN_NAME")" --version | head -n1)" \
--bin-hash="$(sha256sum "$(command -v "$BIN_NAME")" | cut -d' ' -f1)"
fi
```
- Extractor-level hooks (e.g., `archivebox/plugins/wget/on_Crawl__install_wget_extractor_if_needed.*`) ensure dependencies exist before starting work by creating/updating `Dependency` records (via CLI) and then invoking provider hooks.
- Remove all reliance on `abx.pm.hook.binary_load` / ABX plugin packages; `abx-pkg` can remain as a normal pip dependency that hooks import if useful.
---
## Search Backends (Hybrid)
### Indexing: Hook Scripts
Triggered when ArchiveResult completes successfully (from the Django side we simply fire the event; indexing logic lives in standalone hook scripts):
```python
#!/usr/bin/env python3
# plugins/on_ArchiveResult__index_sqlitefts.py
import argparse
import sqlite3
import os
from pathlib import Path
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--snapshot-id', required=True)
parser.add_argument('--extractor', required=True)
args = parser.parse_args()
# Read text content from output files
content = ""
for f in Path.cwd().rglob('*.txt'):
content += f.read_text(errors='ignore') + "\n"
for f in Path.cwd().rglob('*.html'):
content += strip_html(f.read_text(errors='ignore')) + "\n"
if not content.strip():
return
# Add to FTS index
db = sqlite3.connect(os.environ['ARCHIVEBOX_DATA_DIR'] + '/search.sqlite3')
db.execute('CREATE VIRTUAL TABLE IF NOT EXISTS fts USING fts5(snapshot_id, content)')
db.execute('INSERT OR REPLACE INTO fts VALUES (?, ?)', (args.snapshot_id, content))
db.commit()
if __name__ == '__main__':
main()
```
### Querying: CLI-backed Python Classes
```python
# archivebox/search/backends/sqlitefts.py
import subprocess
import json
class SQLiteFTSBackend:
name = 'sqlitefts'
def search(self, query: str, limit: int = 50) -> list[str]:
"""Call plugins/on_Search__query_sqlitefts.* and parse stdout."""
result = subprocess.run(
['archivebox', 'search-backend', '--backend', self.name, '--query', query, '--limit', str(limit)],
capture_output=True,
check=True,
text=True,
)
return json.loads(result.stdout or '[]')
# archivebox/search/__init__.py
from django.conf import settings
def get_backend():
name = getattr(settings, 'SEARCH_BACKEND', 'sqlitefts')
if name == 'sqlitefts':
from .backends.sqlitefts import SQLiteFTSBackend
return SQLiteFTSBackend()
elif name == 'sonic':
from .backends.sonic import SonicBackend
return SonicBackend()
raise ValueError(f'Unknown search backend: {name}')
def search(query: str) -> list[str]:
return get_backend().search(query)
```
- Each backend script lives under `archivebox/plugins/search/on_Search__query_<backend>.py` (with user overrides in `data/plugins/...`) and outputs JSON list of snapshot IDs. Python wrappers simply invoke the CLI to keep Django isolated from backend implementations.
---
## Simplified Models
> Goal: reduce line count without sacrificing the correctness guarantees we currently get from `ModelWithStateMachine` + python-statemachine. We keep the mixins/statemachines unless we can prove a smaller implementation enforces the same transitions/retry locking.
### Snapshot
```python
class Snapshot(models.Model):
id = models.UUIDField(primary_key=True, default=uuid7)
url = models.URLField(unique=True, db_index=True)
timestamp = models.CharField(max_length=32, unique=True, db_index=True)
title = models.CharField(max_length=512, null=True, blank=True)
created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
created_at = models.DateTimeField(default=timezone.now)
modified_at = models.DateTimeField(auto_now=True)
crawl = models.ForeignKey('crawls.Crawl', on_delete=models.CASCADE, null=True)
tags = models.ManyToManyField('Tag', through='SnapshotTag')
# Status (consistent with Crawl, ArchiveResult)
status = models.CharField(max_length=15, default='queued', db_index=True)
retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)
# Inline fields (no mixins)
config = models.JSONField(default=dict)
notes = models.TextField(blank=True, default='')
FINAL_STATES = ['sealed']
@property
def output_dir(self) -> Path:
return settings.ARCHIVE_DIR / self.timestamp
def tick(self) -> bool:
if self.status == 'queued' and self.can_start():
self.start()
return True
elif self.status == 'started' and self.is_finished():
self.seal()
return True
return False
def can_start(self) -> bool:
return bool(self.url)
def is_finished(self) -> bool:
results = self.archiveresult_set.all()
if not results.exists():
return False
return not results.filter(status__in=['queued', 'started', 'backoff']).exists()
def start(self):
self.status = 'started'
self.retry_at = timezone.now() + timedelta(seconds=10)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.save()
self.create_pending_archiveresults()
def seal(self):
self.status = 'sealed'
self.retry_at = None
self.save()
def create_pending_archiveresults(self):
for extractor in get_config(defaults=settings, crawl=self.crawl, snapshot=self).ENABLED_EXTRACTORS:
ArchiveResult.objects.get_or_create(
snapshot=self,
extractor=extractor,
defaults={
'status': 'queued',
'retry_at': timezone.now(),
'created_by': self.created_by,
}
)
```
### ArchiveResult
```python
class ArchiveResult(models.Model):
id = models.UUIDField(primary_key=True, default=uuid7)
snapshot = models.ForeignKey(Snapshot, on_delete=models.CASCADE)
extractor = models.CharField(max_length=32, db_index=True)
created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
created_at = models.DateTimeField(default=timezone.now)
modified_at = models.DateTimeField(auto_now=True)
# Status
status = models.CharField(max_length=15, default='queued', db_index=True)
retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)
# Execution
start_ts = models.DateTimeField(null=True)
end_ts = models.DateTimeField(null=True)
output = models.CharField(max_length=1024, null=True)
cmd = models.JSONField(null=True)
pwd = models.CharField(max_length=256, null=True)
# Audit trail
machine = models.ForeignKey('machine.Machine', on_delete=models.SET_NULL, null=True)
iface = models.ForeignKey('machine.NetworkInterface', on_delete=models.SET_NULL, null=True)
installed_binary = models.ForeignKey('machine.InstalledBinary', on_delete=models.SET_NULL, null=True)
FINAL_STATES = ['succeeded', 'failed']
class Meta:
unique_together = ('snapshot', 'extractor')
@property
def output_dir(self) -> Path:
return self.snapshot.output_dir / self.extractor
def tick(self) -> bool:
if self.status == 'queued' and self.can_start():
self.start()
return True
elif self.status == 'backoff' and self.can_retry():
self.status = 'queued'
self.retry_at = timezone.now()
self.save()
return True
return False
def can_start(self) -> bool:
return bool(self.snapshot.url)
def can_retry(self) -> bool:
return self.retry_at and self.retry_at <= timezone.now()
def start(self):
self.status = 'started'
self.start_ts = timezone.now()
self.retry_at = timezone.now() + timedelta(seconds=120)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.save()
# Run hook and complete
self.run_extractor_hook()
def run_extractor_hook(self):
from archivebox.hooks import discover_hooks, run_hook
hooks = discover_hooks(f'Snapshot__{self.extractor}')
if not hooks:
self.status = 'failed'
self.output = f'No hook for: {self.extractor}'
self.end_ts = timezone.now()
self.retry_at = None
self.save()
return
result = run_hook(
hooks[0],
output_dir=self.output_dir,
url=self.snapshot.url,
snapshot_id=str(self.snapshot.id),
)
self.status = 'succeeded' if result['returncode'] == 0 else 'failed'
self.output = result['stdout'][:1024] or result['stderr'][:1024]
self.end_ts = timezone.now()
self.retry_at = None
self.save()
# Trigger search indexing if succeeded
if self.status == 'succeeded':
self.trigger_search_indexing()
def trigger_search_indexing(self):
from archivebox.hooks import discover_hooks, run_hook
for hook in discover_hooks('ArchiveResult__index'):
run_hook(hook, output_dir=self.output_dir,
snapshot_id=str(self.snapshot.id),
extractor=self.extractor)
```
- `ArchiveResult` must continue storing execution metadata (`cmd`, `pwd`, `machine`, `iface`, `installed_binary`, timestamps) exactly as before, even though the extractor now runs via hook scripts. `run_extractor_hook()` is responsible for capturing those values (e.g., wrapping subprocess calls).
- Any refactor of `Snapshot`, `ArchiveResult`, or `Crawl` has to keep the same `FINAL_STATES`, `retry_at` semantics, and tag/output directory handling that `ModelWithStateMachine` currently provides.
---
## Simplified Worker System
```python
# archivebox/workers/orchestrator.py
import os
import time
import multiprocessing
from datetime import timedelta
from django.utils import timezone
from django.conf import settings
class Worker:
"""Base worker for processing queued objects."""
Model = None
name = 'worker'
def get_queue(self):
return self.Model.objects.filter(
retry_at__lte=timezone.now()
).exclude(
status__in=self.Model.FINAL_STATES
).order_by('retry_at')
def claim(self, obj) -> bool:
"""Atomic claim via optimistic lock."""
updated = self.Model.objects.filter(
id=obj.id,
retry_at=obj.retry_at
).update(retry_at=timezone.now() + timedelta(seconds=60))
return updated == 1
def run(self):
print(f'[{self.name}] Started pid={os.getpid()}')
while True:
obj = self.get_queue().first()
if obj and self.claim(obj):
try:
obj.refresh_from_db()
obj.tick()
except Exception as e:
print(f'[{self.name}] Error: {e}')
obj.retry_at = timezone.now() + timedelta(seconds=60)
obj.save(update_fields=['retry_at'])
else:
time.sleep(0.5)
class CrawlWorker(Worker):
from crawls.models import Crawl
Model = Crawl
name = 'crawl'
class SnapshotWorker(Worker):
from core.models import Snapshot
Model = Snapshot
name = 'snapshot'
class ExtractorWorker(Worker):
"""Worker for a specific extractor."""
from core.models import ArchiveResult
Model = ArchiveResult
def __init__(self, extractor: str):
self.extractor = extractor
self.name = extractor
def get_queue(self):
return super().get_queue().filter(extractor=self.extractor)
class Orchestrator:
def __init__(self):
self.processes = []
def spawn(self):
config = settings.WORKER_CONCURRENCY
for i in range(config.get('crawl', 2)):
self._spawn(CrawlWorker, f'crawl_{i}')
for i in range(config.get('snapshot', 3)):
self._spawn(SnapshotWorker, f'snapshot_{i}')
for extractor, count in config.items():
if extractor in ('crawl', 'snapshot'):
continue
for i in range(count):
self._spawn(ExtractorWorker, f'{extractor}_{i}', extractor)
def _spawn(self, cls, name, *args):
worker = cls(*args) if args else cls()
worker.name = name
p = multiprocessing.Process(target=worker.run, name=name)
p.start()
self.processes.append(p)
def run(self):
print(f'Orchestrator pid={os.getpid()}')
self.spawn()
try:
while True:
for p in self.processes:
if not p.is_alive():
print(f'{p.name} died, restarting...')
# Respawn logic
time.sleep(5)
except KeyboardInterrupt:
for p in self.processes:
p.terminate()
```
---
## Directory Structure
```
archivebox-nue/
├── archivebox/
│ ├── __init__.py
│ ├── config.py # Simple env-based config
│ ├── hooks.py # Hook discovery + execution
│ │
│ ├── core/
│ │ ├── models.py # Snapshot, ArchiveResult, Tag
│ │ ├── admin.py
│ │ └── views.py
│ │
│ ├── crawls/
│ │ ├── models.py # Crawl, Seed, CrawlSchedule, Outlink
│ │ └── admin.py
│ │
│ ├── machine/
│ │ ├── models.py # Machine, NetworkInterface, Dependency, InstalledBinary
│ │ └── admin.py
│ │
│ ├── workers/
│ │ └── orchestrator.py # ~150 lines
│ │
│ ├── api/
│ │ └── ...
│ │
│ ├── cli/
│ │ └── ...
│ │
│ ├── search/
│ │ ├── __init__.py
│ │ └── backends/
│ │ ├── sqlitefts.py
│ │ └── sonic.py
│ │
│ ├── index/
│ ├── parsers/
│ ├── misc/
│ └── templates/
-├── plugins/ # Built-in hooks (ArchiveBox never imports these directly)
│ ├── wget/
│ │ └── on_Snapshot__wget.sh
│ ├── dependencies/
│ │ ├── on_Dependency__install_using_apt_provider.sh
│ │ └── on_Dependency__install_using_custom_bash.py
│ ├── search/
│ │ ├── on_ArchiveResult__index_sqlitefts.py
│ │ └── on_Search__query_sqlitefts.py
│ └── ...
├── data/
│ └── plugins/ # User-provided hooks mirror builtin layout
└── pyproject.toml
```
---
## Implementation Phases
### Phase 1: Build Unified Config + Hook Scaffold
1. Implement `archivebox.config.get_config()` + CLI plumbing (`archivebox config --get ... --format=json`) without touching abx yet.
2. Add `archivebox/hooks.py` with dual plugin directories (`archivebox/plugins`, `data/plugins`), discovery, and execution helpers.
3. Keep the existing ABX/worker system running while new APIs land; surface warnings where `abx.pm.*` is still in use.
### Phase 2: Gradual ABX Removal
1. Rename `archivebox/pkgs/` to `archivebox/pkgs.unused/` and start deleting packages once equivalent hook scripts exist.
2. Remove `pluggy`, `python-statemachine`, and all `abx-*` dependencies/workspace entries from `pyproject.toml` only after consumers are migrated.
3. Replace every `abx.pm.hook.get_*` usage in CLI/config/search/extractors with the new config + hook APIs.
### Phase 3: Worker + State Machine Simplification
1. Introduce the process-per-model orchestrator while preserving `ModelWithStateMachine` semantics (Snapshot/Crawl/ArchiveResult).
2. Only drop mixins/statemachine dependency after verifying the new `tick()` implementations keep retries/backoff/final states identical.
3. Ensure Huey/task entry points either delegate to the new orchestrator or are retired cleanly so background work isnt double-run.
### Phase 4: Hook-Based Extractors & Dependencies
1. Create builtin extractor hooks in `archivebox/plugins/*/on_Snapshot__*.{sh,py,js}`; have `ArchiveResult.run_extractor_hook()` capture cmd/pwd/machine/install metadata.
2. Implement the new `Dependency`/`InstalledBinary` models + CLI commands, and port provider/install logic into hook scripts that only talk via CLI.
3. Add CLI helpers `archivebox find InstalledBinary`, `archivebox dependency ...` used by all hooks and document how user plugins extend them.
### Phase 5: Search Backends & Indexing Hooks
1. Migrate indexing triggers to hook scripts (`on_ArchiveResult__index_*`) that run standalone and write into `$ARCHIVEBOX_DATA_DIR/search.*`.
2. Implement CLI-driven query hooks (`on_Search__query_*`) plus lightweight Python wrappers in `archivebox/search/backends/`.
3. Remove any remaining ABX search integration.
---
## What Gets Deleted
```
archivebox/pkgs/ # ~5,000 lines
archivebox/workers/actor.py # If exists
```
## Dependencies Removed
```toml
"pluggy>=1.5.0"
"python-statemachine>=2.3.6"
# + all 30 abx-* packages
```
## Dependencies Kept
```toml
"django>=6.0"
"django-ninja>=1.3.0"
"abx-pkg>=0.6.0" # External, for binary management
"click>=8.1.7"
"rich>=13.8.0"
```
---
## Estimated Savings
| Component | Lines Removed |
|-----------|---------------|
| pkgs/ (ABX) | ~5,000 |
| statemachines | ~300 |
| workers/ | ~500 |
| base_models mixins | ~100 |
| **Total** | **~6,000 lines** |
Plus 30+ dependencies removed, massive reduction in conceptual complexity.
---
**Status: READY FOR IMPLEMENTATION**
Begin with Phase 1: Rename `archivebox/pkgs/` to add `.unused` suffix (delete after porting) and fix imports.

127
TEST_RESULTS.md Normal file
View File

@@ -0,0 +1,127 @@
# Chrome Extensions Test Results ✅
Date: 2025-12-24
Status: **ALL TESTS PASSED**
## Test Summary
Ran comprehensive tests of the Chrome extension system including:
- Extension downloads from Chrome Web Store
- Extension unpacking and installation
- Metadata caching and persistence
- Cache performance verification
## Results
### ✅ Extension Downloads (4/4 successful)
| Extension | Version | Size | Status |
|-----------|---------|------|--------|
| captcha2 (2captcha) | 3.7.2 | 396 KB | ✅ Downloaded |
| istilldontcareaboutcookies | 1.1.9 | 550 KB | ✅ Downloaded |
| ublock (uBlock Origin) | 1.68.0 | 4.0 MB | ✅ Downloaded |
| singlefile | 1.22.96 | 1.2 MB | ✅ Downloaded |
### ✅ Extension Installation (4/4 successful)
All extensions were successfully unpacked with valid `manifest.json` files:
- captcha2: Manifest V3 ✓
- istilldontcareaboutcookies: Valid manifest ✓
- ublock: Valid manifest ✓
- singlefile: Valid manifest ✓
### ✅ Metadata Caching (4/4 successful)
Extension metadata cached to `*.extension.json` files with complete information:
- Web Store IDs
- Download URLs
- File paths (absolute)
- Computed extension IDs
- Version numbers
Example metadata (captcha2):
```json
{
"webstore_id": "ifibfemgeogfhoebkmokieepdoobkbpo",
"name": "captcha2",
"crx_path": "[...]/ifibfemgeogfhoebkmokieepdoobkbpo__captcha2.crx",
"unpacked_path": "[...]/ifibfemgeogfhoebkmokieepdoobkbpo__captcha2",
"id": "gafcdbhijmmjlojcakmjlapdliecgila",
"version": "3.7.2"
}
```
### ✅ Cache Performance Verification
**Test**: Ran captcha2 installation twice in a row
**First run**: Downloaded and installed extension (5s)
**Second run**: Used cache, skipped installation (0.01s)
**Performance gain**: ~500x faster on subsequent runs
**Log output from second run**:
```
[*] 2captcha extension already installed (using cache)
[✓] 2captcha extension setup complete
```
## File Structure Created
```
data/personas/Test/chrome_extensions/
├── captcha2.extension.json (709 B)
├── istilldontcareaboutcookies.extension.json (763 B)
├── ublock.extension.json (704 B)
├── singlefile.extension.json (717 B)
├── ifibfemgeogfhoebkmokieepdoobkbpo__captcha2/ (unpacked)
├── ifibfemgeogfhoebkmokieepdoobkbpo__captcha2.crx (396 KB)
├── edibdbjcniadpccecjdfdjjppcpchdlm__istilldontcareaboutcookies/ (unpacked)
├── edibdbjcniadpccecjdfdjjppcpchdlm__istilldontcareaboutcookies.crx (550 KB)
├── cjpalhdlnbpafiamejdnhcphjbkeiagm__ublock/ (unpacked)
├── cjpalhdlnbpafiamejdnhcphjbkeiagm__ublock.crx (4.0 MB)
├── mpiodijhokgodhhofbcjdecpffjipkle__singlefile/ (unpacked)
└── mpiodijhokgodhhofbcjdecpffjipkle__singlefile.crx (1.2 MB)
```
Total size: ~6.2 MB for all 4 extensions
## Notes
### Expected Warnings
The following warnings are **expected and harmless**:
```
warning [*.crx]: 1062-1322 extra bytes at beginning or within zipfile
(attempting to process anyway)
```
This occurs because CRX files have a Chrome-specific header (containing signature data) before the ZIP content. The `unzip` command detects this and processes the ZIP data correctly anyway.
### Cache Invalidation
To force re-download of extensions:
```bash
rm -rf data/personas/Test/chrome_extensions/
```
## Next Steps
✅ Extensions are ready to use with Chrome
- Load via `--load-extension` and `--allowlisted-extension-id` flags
- Extensions can be configured at runtime via CDP
- 2captcha config plugin ready to inject API key
✅ Ready for integration testing with:
- chrome_session plugin (load extensions on browser start)
- captcha2_config plugin (configure 2captcha API key)
- singlefile extractor (trigger extension action)
## Conclusion
The Chrome extension system is **production-ready** with:
- ✅ Robust download and installation
- ✅ Efficient multi-level caching
- ✅ Proper error handling
- ✅ Performance optimized for thousands of snapshots

6109
archivebox.ts Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -14,7 +14,6 @@ __package__ = 'archivebox'
import os
import sys
from pathlib import Path
from typing import cast
ASCII_LOGO = """
█████╗ ██████╗ ██████╗██╗ ██╗██╗██╗ ██╗███████╗ ██████╗ ██████╗ ██╗ ██╗
@@ -41,69 +40,29 @@ from .misc.checks import check_not_root, check_io_encoding # noqa
check_not_root()
check_io_encoding()
# print('INSTALLING MONKEY PATCHES')
# Install monkey patches for third-party libraries
from .misc.monkey_patches import * # noqa
# print('DONE INSTALLING MONKEY PATCHES')
# Built-in plugin directories
BUILTIN_PLUGINS_DIR = PACKAGE_DIR / 'plugins'
USER_PLUGINS_DIR = Path(os.getcwd()) / 'plugins'
# print('LOADING VENDORED LIBRARIES')
from .pkgs import load_vendored_pkgs # noqa
load_vendored_pkgs()
# print('DONE LOADING VENDORED LIBRARIES')
# print('LOADING ABX PLUGIN SPECIFICATIONS')
# Load ABX Plugin Specifications + Default Implementations
import abx # noqa
import abx_spec_archivebox # noqa
import abx_spec_config # noqa
import abx_spec_abx_pkg # noqa
import abx_spec_django # noqa
import abx_spec_searchbackend # noqa
abx.pm.add_hookspecs(abx_spec_config.PLUGIN_SPEC)
abx.pm.register(abx_spec_config.PLUGIN_SPEC())
abx.pm.add_hookspecs(abx_spec_abx_pkg.PLUGIN_SPEC)
abx.pm.register(abx_spec_abx_pkg.PLUGIN_SPEC())
abx.pm.add_hookspecs(abx_spec_django.PLUGIN_SPEC)
abx.pm.register(abx_spec_django.PLUGIN_SPEC())
abx.pm.add_hookspecs(abx_spec_searchbackend.PLUGIN_SPEC)
abx.pm.register(abx_spec_searchbackend.PLUGIN_SPEC())
# Cast to ArchiveBoxPluginSpec to enable static type checking of pm.hook.call() methods
abx.pm = cast(abx.ABXPluginManager[abx_spec_archivebox.ArchiveBoxPluginSpec], abx.pm)
pm = abx.pm
# print('DONE LOADING ABX PLUGIN SPECIFICATIONS')
# Load all pip-installed ABX-compatible plugins
ABX_ECOSYSTEM_PLUGINS = abx.get_pip_installed_plugins(group='abx')
# Load all built-in ArchiveBox plugins
ARCHIVEBOX_BUILTIN_PLUGINS = {
'config': PACKAGE_DIR / 'config',
'workers': PACKAGE_DIR / 'workers',
'core': PACKAGE_DIR / 'core',
'crawls': PACKAGE_DIR / 'crawls',
# 'machine': PACKAGE_DIR / 'machine'
# 'search': PACKAGE_DIR / 'search',
# These are kept for backwards compatibility with existing code
# that checks for plugins. The new hook system uses discover_hooks()
ALL_PLUGINS = {
'builtin': BUILTIN_PLUGINS_DIR,
'user': USER_PLUGINS_DIR,
}
# Load all user-defined ArchiveBox plugins
USER_PLUGINS = abx.find_plugins_in_dir(Path(os.getcwd()) / 'user_plugins')
# Import all plugins and register them with ABX Plugin Manager
ALL_PLUGINS = {**ABX_ECOSYSTEM_PLUGINS, **ARCHIVEBOX_BUILTIN_PLUGINS, **USER_PLUGINS}
# print('LOADING ALL PLUGINS')
LOADED_PLUGINS = abx.load_plugins(ALL_PLUGINS)
# print('DONE LOADING ALL PLUGINS')
LOADED_PLUGINS = ALL_PLUGINS
# Setup basic config, constants, paths, and version
from .config.constants import CONSTANTS # noqa
from .config.paths import PACKAGE_DIR, DATA_DIR, ARCHIVE_DIR # noqa
from .config.version import VERSION # noqa
# Set MACHINE_ID env var so hook scripts can use it
os.environ.setdefault('MACHINE_ID', CONSTANTS.MACHINE_ID)
__version__ = VERSION
__author__ = 'ArchiveBox'
__license__ = 'MIT'

View File

@@ -2,14 +2,11 @@ __package__ = 'archivebox.api'
from django.apps import AppConfig
import abx
class APIConfig(AppConfig):
name = 'api'
@abx.hookimpl
def register_admin(admin_site):
from api.admin import register_admin
register_admin(admin_site)

View File

@@ -1,10 +1,11 @@
# Generated by Django 4.2.11 on 2024-04-25 04:19
# Generated by Django 5.0.6 on 2024-12-25 (squashed)
import api.models
from uuid import uuid4
from django.conf import settings
from django.db import migrations, models
import django.db.models.deletion
import uuid
import api.models
class Migration(migrations.Migration):
@@ -19,11 +20,41 @@ class Migration(migrations.Migration):
migrations.CreateModel(
name='APIToken',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('id', models.UUIDField(default=uuid4, editable=False, primary_key=True, serialize=False, unique=True)),
('created_by', models.ForeignKey(default=None, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL)),
('created_at', models.DateTimeField(auto_now_add=True, db_index=True)),
('modified_at', models.DateTimeField(auto_now=True)),
('token', models.CharField(default=api.models.generate_secret_token, max_length=32, unique=True)),
('created', models.DateTimeField(auto_now_add=True)),
('expires', models.DateTimeField(blank=True, null=True)),
('user', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL)),
],
options={
'verbose_name': 'API Key',
'verbose_name_plural': 'API Keys',
},
),
migrations.CreateModel(
name='OutboundWebhook',
fields=[
('id', models.UUIDField(default=uuid4, editable=False, primary_key=True, serialize=False, unique=True)),
('created_by', models.ForeignKey(default=None, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL)),
('created_at', models.DateTimeField(auto_now_add=True, db_index=True)),
('modified_at', models.DateTimeField(auto_now=True)),
('name', models.CharField(blank=True, default='', max_length=255)),
('signal', models.CharField(choices=[], db_index=True, max_length=255)),
('ref', models.CharField(db_index=True, max_length=255)),
('endpoint', models.URLField(max_length=2083)),
('headers', models.JSONField(blank=True, default=dict)),
('auth_token', models.CharField(blank=True, default='', max_length=4000)),
('enabled', models.BooleanField(db_index=True, default=True)),
('keep_last_response', models.BooleanField(default=False)),
('last_response', models.TextField(blank=True, default='')),
('last_success', models.DateTimeField(blank=True, null=True)),
('last_failure', models.DateTimeField(blank=True, null=True)),
],
options={
'verbose_name': 'API Outbound Webhook',
'ordering': ['name', 'ref'],
'abstract': False,
},
),
]

View File

@@ -1,17 +0,0 @@
# Generated by Django 5.0.4 on 2024-04-26 05:28
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('api', '0001_initial'),
]
operations = [
migrations.AlterModelOptions(
name='apitoken',
options={'verbose_name': 'API Key', 'verbose_name_plural': 'API Keys'},
),
]

View File

@@ -1,78 +0,0 @@
# Generated by Django 5.0.6 on 2024-06-03 01:52
import charidfield.fields
import django.db.models.deletion
import signal_webhooks.fields
import signal_webhooks.utils
import uuid
from django.conf import settings
from django.db import migrations, models
import archivebox.base_models.models
class Migration(migrations.Migration):
dependencies = [
('api', '0002_alter_apitoken_options'),
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
migrations.RenameField(
model_name='apitoken',
old_name='user',
new_name='created_by',
),
migrations.AddField(
model_name='apitoken',
name='abid',
field=charidfield.fields.CharIDField(blank=True, db_index=True, default=None, help_text='ABID-format identifier for this entity (e.g. snp_01BJQMF54D093DXEAWZ6JYRPAQ)', max_length=30, null=True, prefix='apt_', unique=True),
),
migrations.AddField(
model_name='apitoken',
name='modified',
field=models.DateTimeField(auto_now=True),
),
migrations.AddField(
model_name='apitoken',
name='uuid',
field=models.UUIDField(blank=True, null=True, unique=True),
),
migrations.AlterField(
model_name='apitoken',
name='id',
field=models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False),
),
migrations.CreateModel(
name='OutboundWebhook',
fields=[
('name', models.CharField(db_index=True, help_text='Give your webhook a descriptive name (e.g. Notify ACME Slack channel of any new ArchiveResults).', max_length=255, unique=True, verbose_name='name')),
('signal', models.CharField(choices=[('CREATE', 'Create'), ('UPDATE', 'Update'), ('DELETE', 'Delete'), ('M2M', 'M2M changed'), ('CREATE_OR_UPDATE', 'Create or Update'), ('CREATE_OR_DELETE', 'Create or Delete'), ('CREATE_OR_M2M', 'Create or M2M changed'), ('UPDATE_OR_DELETE', 'Update or Delete'), ('UPDATE_OR_M2M', 'Update or M2M changed'), ('DELETE_OR_M2M', 'Delete or M2M changed'), ('CREATE_UPDATE_OR_DELETE', 'Create, Update or Delete'), ('CREATE_UPDATE_OR_M2M', 'Create, Update or M2M changed'), ('CREATE_DELETE_OR_M2M', 'Create, Delete or M2M changed'), ('UPDATE_DELETE_OR_M2M', 'Update, Delete or M2M changed'), ('CREATE_UPDATE_DELETE_OR_M2M', 'Create, Update or Delete, or M2M changed')], help_text='The type of event the webhook should fire for (e.g. Create, Update, Delete).', max_length=255, verbose_name='signal')),
('ref', models.CharField(db_index=True, help_text='Dot import notation of the model the webhook should fire for (e.g. core.models.Snapshot or core.models.ArchiveResult).', max_length=1023, validators=[signal_webhooks.utils.model_from_reference], verbose_name='referenced model')),
('endpoint', models.URLField(help_text='External URL to POST the webhook notification to (e.g. https://someapp.example.com/webhook/some-webhook-receiver).', max_length=2047, verbose_name='endpoint')),
('headers', models.JSONField(blank=True, default=dict, help_text='Headers to send with the webhook request.', validators=[signal_webhooks.utils.is_dict], verbose_name='headers')),
('auth_token', signal_webhooks.fields.TokenField(blank=True, default='', help_text='Authentication token to use in an Authorization header.', max_length=8000, validators=[signal_webhooks.utils.decode_cipher_key], verbose_name='authentication token')),
('enabled', models.BooleanField(default=True, help_text='Is this webhook enabled?', verbose_name='enabled')),
('keep_last_response', models.BooleanField(default=False, help_text='Should the webhook keep a log of the latest response it got?', verbose_name='keep last response')),
('updated', models.DateTimeField(auto_now=True, help_text='When the webhook was last updated.', verbose_name='updated')),
('last_response', models.CharField(blank=True, default='', help_text='Latest response to this webhook.', max_length=8000, verbose_name='last response')),
('last_success', models.DateTimeField(default=None, help_text='When the webhook last succeeded.', null=True, verbose_name='last success')),
('last_failure', models.DateTimeField(default=None, help_text='When the webhook last failed.', null=True, verbose_name='last failure')),
('created', models.DateTimeField(auto_now_add=True)),
('modified', models.DateTimeField(auto_now=True)),
('id', models.UUIDField(blank=True, null=True, unique=True)),
('uuid', models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False)),
('abid', charidfield.fields.CharIDField(blank=True, db_index=True, default=None, help_text='ABID-format identifier for this entity (e.g. snp_01BJQMF54D093DXEAWZ6JYRPAQ)', max_length=30, null=True, prefix='whk_', unique=True)),
('created_by', models.ForeignKey(default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL)),
],
options={
'verbose_name': 'API Outbound Webhook',
'abstract': False,
},
),
migrations.AddConstraint(
model_name='outboundwebhook',
constraint=models.UniqueConstraint(fields=('ref', 'endpoint'), name='prevent_duplicate_hooks_api_outboundwebhook'),
),
]

View File

@@ -1,24 +0,0 @@
# Generated by Django 5.1 on 2024-08-20 10:44
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('api', '0003_rename_user_apitoken_created_by_apitoken_abid_and_more'),
]
operations = [
migrations.AlterField(
model_name='apitoken',
name='id',
field=models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False),
),
migrations.AlterField(
model_name='apitoken',
name='uuid',
field=models.UUIDField(blank=True, editable=False, null=True, unique=True),
),
]

View File

@@ -1,22 +0,0 @@
# Generated by Django 5.1 on 2024-08-20 22:40
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('api', '0004_alter_apitoken_id_alter_apitoken_uuid'),
]
operations = [
migrations.RemoveField(
model_name='apitoken',
name='uuid',
),
migrations.RemoveField(
model_name='outboundwebhook',
name='id',
),
]

View File

@@ -1,29 +0,0 @@
# Generated by Django 5.1 on 2024-08-20 22:43
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('api', '0005_remove_apitoken_uuid_remove_outboundwebhook_uuid_and_more'),
]
operations = [
migrations.RenameField(
model_name='outboundwebhook',
old_name='uuid',
new_name='id'
),
migrations.AlterField(
model_name='outboundwebhook',
name='id',
field=models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False),
),
migrations.AlterField(
model_name='apitoken',
name='id',
field=models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False),
),
]

View File

@@ -1,23 +0,0 @@
# Generated by Django 5.1 on 2024-08-20 22:52
import django.db.models.deletion
from django.conf import settings
from django.db import migrations, models
import archivebox.base_models.models
class Migration(migrations.Migration):
dependencies = [
('api', '0006_remove_outboundwebhook_uuid_apitoken_id_and_more'),
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
migrations.AlterField(
model_name='apitoken',
name='created_by',
field=models.ForeignKey(default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
]

View File

@@ -1,48 +0,0 @@
# Generated by Django 5.1 on 2024-09-04 23:32
import django.db.models.deletion
from django.conf import settings
from django.db import migrations, models
import archivebox.base_models.models
class Migration(migrations.Migration):
dependencies = [
('api', '0007_alter_apitoken_created_by'),
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
migrations.AlterField(
model_name='apitoken',
name='created',
field=archivebox.base_models.models.AutoDateTimeField(db_index=True, default=None),
),
migrations.AlterField(
model_name='apitoken',
name='created_by',
field=models.ForeignKey(default=None, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
migrations.AlterField(
model_name='apitoken',
name='id',
field=models.UUIDField(default=None, editable=False, primary_key=True, serialize=False, unique=True, verbose_name='ID'),
),
migrations.AlterField(
model_name='outboundwebhook',
name='created',
field=archivebox.base_models.models.AutoDateTimeField(db_index=True, default=None),
),
migrations.AlterField(
model_name='outboundwebhook',
name='created_by',
field=models.ForeignKey(default=None, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
migrations.AlterField(
model_name='outboundwebhook',
name='id',
field=models.UUIDField(default=None, editable=False, primary_key=True, serialize=False, unique=True, verbose_name='ID'),
),
]

View File

@@ -1,40 +0,0 @@
# Generated by Django 5.1 on 2024-09-05 00:26
from django.db import migrations, models
import archivebox.base_models.models
class Migration(migrations.Migration):
dependencies = [
('api', '0008_alter_apitoken_created_alter_apitoken_created_by_and_more'),
]
operations = [
migrations.RenameField(
model_name='apitoken',
old_name='created',
new_name='created_at',
),
migrations.RenameField(
model_name='apitoken',
old_name='modified',
new_name='modified_at',
),
migrations.RenameField(
model_name='outboundwebhook',
old_name='modified',
new_name='modified_at',
),
migrations.AddField(
model_name='outboundwebhook',
name='created_at',
field=archivebox.base_models.models.AutoDateTimeField(db_index=True, default=None),
),
migrations.AlterField(
model_name='outboundwebhook',
name='created',
field=models.DateTimeField(auto_now_add=True, help_text='When the webhook was created.', verbose_name='created'),
),
]

View File

@@ -38,7 +38,7 @@ class APIToken(models.Model):
return not self.expires or self.expires >= (for_date or timezone.now())
class OutboundWebhook(models.Model, WebhookBase):
class OutboundWebhook(WebhookBase):
id = models.UUIDField(primary_key=True, default=uuid7, editable=False, unique=True)
created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE, default=None, null=False)
created_at = models.DateTimeField(default=timezone.now, db_index=True)

View File

@@ -84,7 +84,6 @@ api = NinjaAPIWithIOCapture(
title='ArchiveBox API',
description=html_description,
version=VERSION,
csrf=False,
auth=API_AUTH_METHODS,
urls_namespace="api-1",
docs=Swagger(settings={"persistAuthorization": True}),

View File

@@ -3,9 +3,77 @@
__package__ = 'archivebox.base_models'
from django.contrib import admin
from django.utils.html import format_html, mark_safe
from django_object_actions import DjangoObjectActions
class ConfigEditorMixin:
"""
Mixin for admin classes with a config JSON field.
Provides a readonly field that shows available config options
from all discovered plugin schemas.
"""
@admin.display(description='Available Config Options')
def available_config_options(self, obj):
"""Show documentation for available config keys."""
try:
from archivebox.hooks import discover_plugin_configs
plugin_configs = discover_plugin_configs()
except ImportError:
return format_html('<i>Plugin config system not available</i>')
html_parts = [
'<details>',
'<summary style="cursor: pointer; font-weight: bold; padding: 4px;">',
'Click to see available config keys ({})</summary>'.format(
sum(len(s.get('properties', {})) for s in plugin_configs.values())
),
'<div style="max-height: 400px; overflow-y: auto; padding: 8px; background: #f8f8f8; border-radius: 4px; font-family: monospace; font-size: 11px;">',
]
for plugin_name, schema in sorted(plugin_configs.items()):
properties = schema.get('properties', {})
if not properties:
continue
html_parts.append(f'<div style="margin: 8px 0;"><strong style="color: #333;">{plugin_name}</strong></div>')
html_parts.append('<table style="width: 100%; border-collapse: collapse; margin-bottom: 12px;">')
html_parts.append('<tr style="background: #eee;"><th style="text-align: left; padding: 4px;">Key</th><th style="text-align: left; padding: 4px;">Type</th><th style="text-align: left; padding: 4px;">Default</th><th style="text-align: left; padding: 4px;">Description</th></tr>')
for key, prop in sorted(properties.items()):
prop_type = prop.get('type', 'string')
default = prop.get('default', '')
description = prop.get('description', '')
# Truncate long defaults
default_str = str(default)
if len(default_str) > 30:
default_str = default_str[:27] + '...'
html_parts.append(
f'<tr style="border-bottom: 1px solid #ddd;">'
f'<td style="padding: 4px; font-weight: bold;">{key}</td>'
f'<td style="padding: 4px; color: #666;">{prop_type}</td>'
f'<td style="padding: 4px; color: #666;">{default_str}</td>'
f'<td style="padding: 4px;">{description}</td>'
f'</tr>'
)
html_parts.append('</table>')
html_parts.append('</div></details>')
html_parts.append(
'<p style="margin-top: 8px; color: #666; font-size: 11px;">'
'<strong>Usage:</strong> Add key-value pairs in JSON format, e.g., '
'<code>{"SAVE_WGET": false, "WGET_TIMEOUT": 120}</code>'
'</p>'
)
return mark_safe(''.join(html_parts))
class BaseModelAdmin(DjangoObjectActions, admin.ModelAdmin):
list_display = ('id', 'created_at', 'created_by')
readonly_fields = ('id', 'created_at', 'modified_at')

View File

@@ -1,7 +1,7 @@
# from django.apps import AppConfig
# class AbidUtilsConfig(AppConfig):
# class BaseModelsConfig(AppConfig):
# default_auto_field = 'django.db.models.BigAutoField'
# name = 'base_models'

View File

@@ -19,7 +19,7 @@ from django.conf import settings
from django_stubs_ext.db.models import TypedModelMeta
from archivebox import DATA_DIR
from archivebox.index.json import to_json
from archivebox.misc.util import to_json
from archivebox.misc.hashing import get_dir_info
@@ -31,6 +31,16 @@ def get_or_create_system_user_pk(username='system'):
return user.pk
class AutoDateTimeField(models.DateTimeField):
"""DateTimeField that automatically updates on save (legacy compatibility)."""
def pre_save(self, model_instance, add):
if add or not getattr(model_instance, self.attname):
value = timezone.now()
setattr(model_instance, self.attname, value)
return value
return super().pre_save(model_instance, add)
class ModelWithUUID(models.Model):
id = models.UUIDField(primary_key=True, default=uuid7, editable=False, unique=True)
created_at = models.DateTimeField(default=timezone.now, db_index=True)
@@ -74,6 +84,7 @@ class ModelWithSerializers(ModelWithUUID):
class ModelWithNotes(models.Model):
"""Mixin for models with a notes field."""
notes = models.TextField(blank=True, null=False, default='')
class Meta:
@@ -81,6 +92,7 @@ class ModelWithNotes(models.Model):
class ModelWithHealthStats(models.Model):
"""Mixin for models with health tracking fields."""
num_uses_failed = models.PositiveIntegerField(default=0)
num_uses_succeeded = models.PositiveIntegerField(default=0)
@@ -94,6 +106,7 @@ class ModelWithHealthStats(models.Model):
class ModelWithConfig(models.Model):
"""Mixin for models with a JSON config field."""
config = models.JSONField(default=dict, null=False, blank=False, editable=True)
class Meta:
@@ -113,7 +126,7 @@ class ModelWithOutputDir(ModelWithSerializers):
@property
def output_dir_parent(self) -> str:
return getattr(self, 'output_dir_parent', f'{self._meta.model_name}s')
return f'{self._meta.model_name}s'
@property
def output_dir_name(self) -> str:

View File

@@ -37,7 +37,13 @@ class ArchiveBoxGroup(click.Group):
'server': 'archivebox.cli.archivebox_server.main',
'shell': 'archivebox.cli.archivebox_shell.main',
'manage': 'archivebox.cli.archivebox_manage.main',
# Worker/orchestrator commands
'orchestrator': 'archivebox.cli.archivebox_orchestrator.main',
'worker': 'archivebox.cli.archivebox_worker.main',
# Task commands (called by workers as subprocesses)
'crawl': 'archivebox.cli.archivebox_crawl.main',
'snapshot': 'archivebox.cli.archivebox_snapshot.main',
'extract': 'archivebox.cli.archivebox_extract.main',
}
all_subcommands = {
**meta_commands,
@@ -118,11 +124,14 @@ def cli(ctx, help=False):
raise
def main(args=None, prog_name=None):
def main(args=None, prog_name=None, stdin=None):
# show `docker run archivebox xyz` in help messages if running in docker
IN_DOCKER = os.environ.get('IN_DOCKER', False) in ('1', 'true', 'True', 'TRUE', 'yes')
IS_TTY = sys.stdin.isatty()
prog_name = prog_name or (f'docker compose run{"" if IS_TTY else " -T"} archivebox' if IN_DOCKER else 'archivebox')
# stdin param allows passing input data from caller (used by __main__.py)
# currently not used by click-based CLI, but kept for backwards compatibility
try:
cli(args=args, prog_name=prog_name)

View File

@@ -16,214 +16,135 @@ from archivebox.misc.util import enforce_types, docstring
from archivebox import CONSTANTS
from archivebox.config.common import ARCHIVING_CONFIG
from archivebox.config.permissions import USER, HOSTNAME
from archivebox.parsers import PARSERS
if TYPE_CHECKING:
from core.models import Snapshot
ORCHESTRATOR = None
@enforce_types
def add(urls: str | list[str],
depth: int | str=0,
tag: str='',
parser: str="auto",
extract: str="",
plugins: str="",
persona: str='Default',
overwrite: bool=False,
update: bool=not ARCHIVING_CONFIG.ONLY_NEW,
index_only: bool=False,
bg: bool=False,
created_by_id: int | None=None) -> QuerySet['Snapshot']:
"""Add a new URL or list of URLs to your archive"""
"""Add a new URL or list of URLs to your archive.
global ORCHESTRATOR
The new flow is:
1. Save URLs to sources file
2. Create Seed pointing to the file
3. Create Crawl with max_depth
4. Create root Snapshot pointing to file:// URL (depth=0)
5. Orchestrator runs parser extractors on root snapshot
6. Parser extractors output to urls.jsonl
7. URLs are added to Crawl.urls and child Snapshots are created
8. Repeat until max_depth is reached
"""
from rich import print
depth = int(depth)
assert depth in (0, 1), 'Depth must be 0 or 1 (depth >1 is not supported yet)'
# import models once django is set up
from crawls.models import Seed, Crawl
from workers.orchestrator import Orchestrator
from archivebox.base_models.models import get_or_create_system_user_pk
assert depth in (0, 1, 2, 3, 4), 'Depth must be 0-4'
# import models once django is set up
from core.models import Snapshot
from crawls.models import Seed, Crawl
from archivebox.base_models.models import get_or_create_system_user_pk
from workers.orchestrator import Orchestrator
created_by_id = created_by_id or get_or_create_system_user_pk()
# 1. save the provided urls to sources/2024-11-05__23-59-59__cli_add.txt
# 1. Save the provided URLs to sources/2024-11-05__23-59-59__cli_add.txt
sources_file = CONSTANTS.SOURCES_DIR / f'{timezone.now().strftime("%Y-%m-%d__%H-%M-%S")}__cli_add.txt'
sources_file.parent.mkdir(parents=True, exist_ok=True)
sources_file.write_text(urls if isinstance(urls, str) else '\n'.join(urls))
# 2. create a new Seed pointing to the sources/2024-11-05__23-59-59__cli_add.txt
# 2. Create a new Seed pointing to the sources file
cli_args = [*sys.argv]
if cli_args[0].lower().endswith('archivebox'):
cli_args[0] = 'archivebox' # full path to archivebox bin to just archivebox e.g. /Volumes/NVME/Users/squash/archivebox/.venv/bin/archivebox -> archivebox
cli_args[0] = 'archivebox'
cmd_str = ' '.join(cli_args)
seed = Seed.from_file(sources_file, label=f'{USER}@{HOSTNAME} $ {cmd_str}', parser=parser, tag=tag, created_by=created_by_id, config={
'ONLY_NEW': not update,
'INDEX_ONLY': index_only,
'OVERWRITE': overwrite,
'EXTRACTORS': extract,
'DEFAULT_PERSONA': persona or 'Default',
})
# 3. create a new Crawl pointing to the Seed
crawl = Crawl.from_seed(seed, max_depth=depth)
# 4. start the Orchestrator & wait until it completes
# ... orchestrator will create the root Snapshot, which creates pending ArchiveResults, which gets run by the ArchiveResultActors ...
# from crawls.actors import CrawlActor
# from core.actors import SnapshotActor, ArchiveResultActor
if not bg:
orchestrator = Orchestrator(exit_on_idle=True, max_concurrent_actors=4)
orchestrator.start()
# 5. return the list of new Snapshots created
seed = Seed.from_file(
sources_file,
label=f'{USER}@{HOSTNAME} $ {cmd_str}',
parser=parser,
tag=tag,
created_by=created_by_id,
config={
'ONLY_NEW': not update,
'INDEX_ONLY': index_only,
'OVERWRITE': overwrite,
'EXTRACTORS': plugins,
'DEFAULT_PERSONA': persona or 'Default',
}
)
# 3. Create a new Crawl pointing to the Seed (status=queued)
crawl = Crawl.from_seed(seed, max_depth=depth)
print(f'[green]\\[+] Created Crawl {crawl.id} with max_depth={depth}[/green]')
print(f' [dim]Seed: {seed.uri}[/dim]')
# 4. The CrawlMachine will create the root Snapshot when started
# Root snapshot URL = file:///path/to/sources/...txt
# Parser extractors will run on it and discover URLs
# Those URLs become child Snapshots (depth=1)
if index_only:
# Just create the crawl but don't start processing
print('[yellow]\\[*] Index-only mode - crawl created but not started[/yellow]')
# Create root snapshot manually
crawl.create_root_snapshot()
return crawl.snapshot_set.all()
# 5. Start the orchestrator to process the queue
# The orchestrator will:
# - Process Crawl -> create root Snapshot
# - Process root Snapshot -> run parser extractors -> discover URLs
# - Create child Snapshots from discovered URLs
# - Process child Snapshots -> run extractors
# - Repeat until max_depth reached
if bg:
# Background mode: start orchestrator and return immediately
print('[yellow]\\[*] Running in background mode - starting orchestrator...[/yellow]')
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.start() # Fork to background
else:
# Foreground mode: run orchestrator until all work is done
print(f'[green]\\[*] Starting orchestrator to process crawl...[/green]')
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.runloop() # Block until complete
# 6. Return the list of Snapshots in this crawl
return crawl.snapshot_set.all()
@click.command()
@click.option('--depth', '-d', type=click.Choice(('0', '1')), default='0', help='Recursively archive linked pages up to N hops away')
@click.option('--depth', '-d', type=click.Choice([str(i) for i in range(5)]), default='0', help='Recursively archive linked pages up to N hops away')
@click.option('--tag', '-t', default='', help='Comma-separated list of tags to add to each snapshot e.g. tag1,tag2,tag3')
@click.option('--parser', type=click.Choice(['auto', *PARSERS.keys()]), default='auto', help='Parser for reading input URLs')
@click.option('--extract', '-e', default='', help='Comma-separated list of extractors to use e.g. title,favicon,screenshot,singlefile,...')
@click.option('--parser', default='auto', help='Parser for reading input URLs (auto, txt, html, rss, json, jsonl, netscape, ...)')
@click.option('--plugins', '-p', default='', help='Comma-separated list of plugins to run e.g. title,favicon,screenshot,singlefile,...')
@click.option('--persona', default='Default', help='Authentication profile to use when archiving')
@click.option('--overwrite', '-F', is_flag=True, help='Overwrite existing data if URLs have been archived previously')
@click.option('--update', is_flag=True, default=ARCHIVING_CONFIG.ONLY_NEW, help='Retry any previously skipped/failed URLs when re-adding them')
@click.option('--index-only', is_flag=True, help='Just add the URLs to the index without archiving them now')
# @click.option('--update-all', is_flag=True, help='Update ALL links in index when finished adding new ones')
@click.option('--bg', is_flag=True, help='Run crawl in background worker instead of immediately')
@click.option('--bg', is_flag=True, help='Run archiving in background (start orchestrator and return immediately)')
@click.argument('urls', nargs=-1, type=click.Path())
@docstring(add.__doc__)
def main(**kwargs):
"""Add a new URL or list of URLs to your archive"""
add(**kwargs)
if __name__ == '__main__':
main()
# OLD VERSION:
# def add(urls: Union[str, List[str]],
# tag: str='',
# depth: int=0,
# update: bool=not ARCHIVING_CONFIG.ONLY_NEW,
# update_all: bool=False,
# index_only: bool=False,
# overwrite: bool=False,
# # duplicate: bool=False, # TODO: reuse the logic from admin.py resnapshot to allow adding multiple snapshots by appending timestamp automatically
# init: bool=False,
# extractors: str="",
# parser: str="auto",
# created_by_id: int | None=None,
# out_dir: Path=DATA_DIR) -> List[Link]:
# """Add a new URL or list of URLs to your archive"""
# from core.models import Snapshot, Tag
# # from workers.supervisord_util import start_cli_workers, tail_worker_logs
# # from workers.tasks import bg_archive_link
# assert depth in (0, 1), 'Depth must be 0 or 1 (depth >1 is not supported yet)'
# extractors = extractors.split(",") if extractors else []
# if init:
# run_subcommand('init', stdin=None, pwd=out_dir)
# # Load list of links from the existing index
# check_data_folder()
# # worker = start_cli_workers()
# new_links: List[Link] = []
# all_links = load_main_index(out_dir=out_dir)
# log_importing_started(urls=urls, depth=depth, index_only=index_only)
# if isinstance(urls, str):
# # save verbatim stdin to sources
# write_ahead_log = save_text_as_source(urls, filename='{ts}-import.txt', out_dir=out_dir)
# elif isinstance(urls, list):
# # save verbatim args to sources
# write_ahead_log = save_text_as_source('\n'.join(urls), filename='{ts}-import.txt', out_dir=out_dir)
# new_links += parse_links_from_source(write_ahead_log, root_url=None, parser=parser)
# # If we're going one level deeper, download each link and look for more links
# new_links_depth = []
# if new_links and depth == 1:
# log_crawl_started(new_links)
# for new_link in new_links:
# try:
# downloaded_file = save_file_as_source(new_link.url, filename=f'{new_link.timestamp}-crawl-{new_link.domain}.txt', out_dir=out_dir)
# new_links_depth += parse_links_from_source(downloaded_file, root_url=new_link.url)
# except Exception as err:
# stderr('[!] Failed to get contents of URL {new_link.url}', err, color='red')
# imported_links = list({link.url: link for link in (new_links + new_links_depth)}.values())
# new_links = dedupe_links(all_links, imported_links)
# write_main_index(links=new_links, out_dir=out_dir, created_by_id=created_by_id)
# all_links = load_main_index(out_dir=out_dir)
# tags = [
# Tag.objects.get_or_create(name=name.strip(), defaults={'created_by_id': created_by_id})[0]
# for name in tag.split(',')
# if name.strip()
# ]
# if tags:
# for link in imported_links:
# snapshot = Snapshot.objects.get(url=link.url)
# snapshot.tags.add(*tags)
# snapshot.tags_str(nocache=True)
# snapshot.save()
# # print(f' √ Tagged {len(imported_links)} Snapshots with {len(tags)} tags {tags_str}')
# if index_only:
# # mock archive all the links using the fake index_only extractor method in order to update their state
# if overwrite:
# archive_links(imported_links, overwrite=overwrite, methods=['index_only'], out_dir=out_dir, created_by_id=created_by_id)
# else:
# archive_links(new_links, overwrite=False, methods=['index_only'], out_dir=out_dir, created_by_id=created_by_id)
# else:
# # fully run the archive extractor methods for each link
# archive_kwargs = {
# "out_dir": out_dir,
# "created_by_id": created_by_id,
# }
# if extractors:
# archive_kwargs["methods"] = extractors
# stderr()
# ts = datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
# if update:
# stderr(f'[*] [{ts}] Archiving + updating {len(imported_links)}/{len(all_links)}', len(imported_links), 'URLs from added set...', color='green')
# archive_links(imported_links, overwrite=overwrite, **archive_kwargs)
# elif update_all:
# stderr(f'[*] [{ts}] Archiving + updating {len(all_links)}/{len(all_links)}', len(all_links), 'URLs from entire library...', color='green')
# archive_links(all_links, overwrite=overwrite, **archive_kwargs)
# elif overwrite:
# stderr(f'[*] [{ts}] Archiving + overwriting {len(imported_links)}/{len(all_links)}', len(imported_links), 'URLs from added set...', color='green')
# archive_links(imported_links, overwrite=True, **archive_kwargs)
# elif new_links:
# stderr(f'[*] [{ts}] Archiving {len(new_links)}/{len(all_links)} URLs from added set...', color='green')
# archive_links(new_links, overwrite=False, **archive_kwargs)
# # tail_worker_logs(worker['stdout_logfile'])
# # if CAN_UPGRADE:
# # hint(f"There's a new version of ArchiveBox available! Your current version is {VERSION}. You can upgrade to {VERSIONS_AVAILABLE['recommended_version']['tag_name']} ({VERSIONS_AVAILABLE['recommended_version']['html_url']}). For more on how to upgrade: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives\n")
# return new_links

View File

@@ -20,15 +20,15 @@ def config(*keys,
**kwargs) -> None:
"""Get and set your ArchiveBox project configuration values"""
import archivebox
from archivebox.misc.checks import check_data_folder
from archivebox.misc.logging_util import printable_config
from archivebox.config.collection import load_all_config, write_config_file, get_real_name
from archivebox.config.configset import get_flat_config, get_all_configs
check_data_folder()
FLAT_CONFIG = archivebox.pm.hook.get_FLAT_CONFIG()
CONFIGS = archivebox.pm.hook.get_CONFIGS()
FLAT_CONFIG = get_flat_config()
CONFIGS = get_all_configs()
config_options: list[str] = list(kwargs.pop('key=value', []) or keys or [f'{key}={val}' for key, val in kwargs.items()])
no_args = not (get or set or reset or config_options)
@@ -105,7 +105,7 @@ def config(*keys,
if new_config:
before = FLAT_CONFIG
matching_config = write_config_file(new_config)
after = {**load_all_config(), **archivebox.pm.hook.get_FLAT_CONFIG()}
after = {**load_all_config(), **get_flat_config()}
print(printable_config(matching_config))
side_effect_changes = {}

View File

@@ -0,0 +1,302 @@
#!/usr/bin/env python3
"""
archivebox crawl [urls_or_snapshot_ids...] [--depth=N] [--plugin=NAME]
Discover outgoing links from URLs or existing Snapshots.
If a URL is passed, creates a Snapshot for it first, then runs parser plugins.
If a snapshot_id is passed, runs parser plugins on the existing Snapshot.
Outputs discovered outlink URLs as JSONL.
Pipe the output to `archivebox snapshot` to archive the discovered URLs.
Input formats:
- Plain URLs (one per line)
- Snapshot UUIDs (one per line)
- JSONL: {"type": "Snapshot", "url": "...", ...}
- JSONL: {"type": "Snapshot", "id": "...", ...}
Output (JSONL):
{"type": "Snapshot", "url": "https://discovered-url.com", "via_extractor": "...", ...}
Examples:
# Discover links from a page (creates snapshot first)
archivebox crawl https://example.com
# Discover links from an existing snapshot
archivebox crawl 01234567-89ab-cdef-0123-456789abcdef
# Full recursive crawl pipeline
archivebox crawl https://example.com | archivebox snapshot | archivebox extract
# Use only specific parser plugin
archivebox crawl --plugin=parse_html_urls https://example.com
# Chain: create snapshot, then crawl its outlinks
archivebox snapshot https://example.com | archivebox crawl | archivebox snapshot | archivebox extract
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox crawl'
import sys
import json
from pathlib import Path
from typing import Optional
import rich_click as click
from archivebox.misc.util import docstring
def discover_outlinks(
args: tuple,
depth: int = 1,
plugin: str = '',
wait: bool = True,
) -> int:
"""
Discover outgoing links from URLs or existing Snapshots.
Accepts URLs or snapshot_ids. For URLs, creates Snapshots first.
Runs parser plugins, outputs discovered URLs as JSONL.
The output can be piped to `archivebox snapshot` to archive the discovered links.
Exit codes:
0: Success
1: Failure
"""
from rich import print as rprint
from django.utils import timezone
from archivebox.misc.jsonl import (
read_args_or_stdin, write_record,
TYPE_SNAPSHOT, get_or_create_snapshot
)
from archivebox.base_models.models import get_or_create_system_user_pk
from core.models import Snapshot, ArchiveResult
from crawls.models import Seed, Crawl
from archivebox.config import CONSTANTS
from workers.orchestrator import Orchestrator
created_by_id = get_or_create_system_user_pk()
is_tty = sys.stdout.isatty()
# Collect all input records
records = list(read_args_or_stdin(args))
if not records:
rprint('[yellow]No URLs or snapshot IDs provided. Pass as arguments or via stdin.[/yellow]', file=sys.stderr)
return 1
# Separate records into existing snapshots vs new URLs
existing_snapshot_ids = []
new_url_records = []
for record in records:
# Check if it's an existing snapshot (has id but no url, or looks like a UUID)
if record.get('id') and not record.get('url'):
existing_snapshot_ids.append(record['id'])
elif record.get('id'):
# Has both id and url - check if snapshot exists
try:
Snapshot.objects.get(id=record['id'])
existing_snapshot_ids.append(record['id'])
except Snapshot.DoesNotExist:
new_url_records.append(record)
elif record.get('url'):
new_url_records.append(record)
# For new URLs, create a Crawl and Snapshots
snapshot_ids = list(existing_snapshot_ids)
if new_url_records:
# Create a Crawl to manage this operation
sources_file = CONSTANTS.SOURCES_DIR / f'{timezone.now().strftime("%Y-%m-%d__%H-%M-%S")}__crawl.txt'
sources_file.parent.mkdir(parents=True, exist_ok=True)
sources_file.write_text('\n'.join(r.get('url', '') for r in new_url_records if r.get('url')))
seed = Seed.from_file(
sources_file,
label=f'crawl --depth={depth}',
created_by=created_by_id,
)
crawl = Crawl.from_seed(seed, max_depth=depth)
# Create snapshots for new URLs
for record in new_url_records:
try:
record['crawl_id'] = str(crawl.id)
record['depth'] = record.get('depth', 0)
snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
snapshot_ids.append(str(snapshot.id))
except Exception as e:
rprint(f'[red]Error creating snapshot: {e}[/red]', file=sys.stderr)
continue
if not snapshot_ids:
rprint('[red]No snapshots to process[/red]', file=sys.stderr)
return 1
if existing_snapshot_ids:
rprint(f'[blue]Using {len(existing_snapshot_ids)} existing snapshots[/blue]', file=sys.stderr)
if new_url_records:
rprint(f'[blue]Created {len(snapshot_ids) - len(existing_snapshot_ids)} new snapshots[/blue]', file=sys.stderr)
rprint(f'[blue]Running parser plugins on {len(snapshot_ids)} snapshots...[/blue]', file=sys.stderr)
# Create ArchiveResults for plugins
# If --plugin is specified, only run that one. Otherwise, run all available plugins.
# The orchestrator will handle dependency ordering (plugins declare deps in config.json)
for snapshot_id in snapshot_ids:
try:
snapshot = Snapshot.objects.get(id=snapshot_id)
if plugin:
# User specified a single plugin to run
ArchiveResult.objects.get_or_create(
snapshot=snapshot,
extractor=plugin,
defaults={
'status': ArchiveResult.StatusChoices.QUEUED,
'retry_at': timezone.now(),
'created_by_id': snapshot.created_by_id,
}
)
else:
# Create pending ArchiveResults for all enabled plugins
# This uses hook discovery to find available plugins dynamically
snapshot.create_pending_archiveresults()
# Mark snapshot as started
snapshot.status = Snapshot.StatusChoices.STARTED
snapshot.retry_at = timezone.now()
snapshot.save()
except Snapshot.DoesNotExist:
continue
# Run plugins
if wait:
rprint('[blue]Running outlink plugins...[/blue]', file=sys.stderr)
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.runloop()
# Collect discovered URLs from urls.jsonl files
# Uses dynamic discovery - any plugin that outputs urls.jsonl is considered a parser
from archivebox.hooks import collect_urls_from_extractors
discovered_urls = {}
for snapshot_id in snapshot_ids:
try:
snapshot = Snapshot.objects.get(id=snapshot_id)
snapshot_dir = Path(snapshot.output_dir)
# Dynamically collect urls.jsonl from ANY plugin subdirectory
for entry in collect_urls_from_extractors(snapshot_dir):
url = entry.get('url')
if url and url not in discovered_urls:
# Add metadata for crawl tracking
entry['type'] = TYPE_SNAPSHOT
entry['depth'] = snapshot.depth + 1
entry['via_snapshot'] = str(snapshot.id)
discovered_urls[url] = entry
except Snapshot.DoesNotExist:
continue
rprint(f'[green]Discovered {len(discovered_urls)} URLs[/green]', file=sys.stderr)
# Output discovered URLs as JSONL (when piped) or human-readable (when TTY)
for url, entry in discovered_urls.items():
if is_tty:
via = entry.get('via_extractor', 'unknown')
rprint(f' [dim]{via}[/dim] {url[:80]}', file=sys.stderr)
else:
write_record(entry)
return 0
def process_crawl_by_id(crawl_id: str) -> int:
"""
Process a single Crawl by ID (used by workers).
Triggers the Crawl's state machine tick() which will:
- Transition from queued -> started (creates root snapshot)
- Transition from started -> sealed (when all snapshots done)
"""
from rich import print as rprint
from crawls.models import Crawl
try:
crawl = Crawl.objects.get(id=crawl_id)
except Crawl.DoesNotExist:
rprint(f'[red]Crawl {crawl_id} not found[/red]', file=sys.stderr)
return 1
rprint(f'[blue]Processing Crawl {crawl.id} (status={crawl.status})[/blue]', file=sys.stderr)
try:
crawl.sm.tick()
crawl.refresh_from_db()
rprint(f'[green]Crawl complete (status={crawl.status})[/green]', file=sys.stderr)
return 0
except Exception as e:
rprint(f'[red]Crawl error: {type(e).__name__}: {e}[/red]', file=sys.stderr)
return 1
def is_crawl_id(value: str) -> bool:
"""Check if value looks like a Crawl UUID."""
import re
uuid_pattern = re.compile(r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)
if not uuid_pattern.match(value):
return False
# Verify it's actually a Crawl (not a Snapshot or other object)
from crawls.models import Crawl
return Crawl.objects.filter(id=value).exists()
@click.command()
@click.option('--depth', '-d', type=int, default=1, help='Max depth for recursive crawling (default: 1)')
@click.option('--plugin', '-p', default='', help='Use only this parser plugin (e.g., parse_html_urls, parse_dom_outlinks)')
@click.option('--wait/--no-wait', default=True, help='Wait for plugins to complete (default: wait)')
@click.argument('args', nargs=-1)
def main(depth: int, plugin: str, wait: bool, args: tuple):
"""Discover outgoing links from URLs or existing Snapshots, or process Crawl by ID"""
from archivebox.misc.jsonl import read_args_or_stdin
# Read all input
records = list(read_args_or_stdin(args))
if not records:
from rich import print as rprint
rprint('[yellow]No URLs, Snapshot IDs, or Crawl IDs provided. Pass as arguments or via stdin.[/yellow]', file=sys.stderr)
sys.exit(1)
# Check if input looks like existing Crawl IDs to process
# If ALL inputs are Crawl UUIDs, process them
all_are_crawl_ids = all(
is_crawl_id(r.get('id') or r.get('url', ''))
for r in records
)
if all_are_crawl_ids:
# Process existing Crawls by ID
exit_code = 0
for record in records:
crawl_id = record.get('id') or record.get('url')
result = process_crawl_by_id(crawl_id)
if result != 0:
exit_code = result
sys.exit(exit_code)
else:
# Default behavior: discover outlinks from input (URLs or Snapshot IDs)
sys.exit(discover_outlinks(args, depth=depth, plugin=plugin, wait=wait))
if __name__ == '__main__':
main()

View File

@@ -1,49 +1,262 @@
#!/usr/bin/env python3
"""
archivebox extract [snapshot_ids...] [--plugin=NAME]
Run plugins on Snapshots. Accepts snapshot IDs as arguments, from stdin, or via JSONL.
Input formats:
- Snapshot UUIDs (one per line)
- JSONL: {"type": "Snapshot", "id": "...", "url": "..."}
- JSONL: {"type": "ArchiveResult", "snapshot_id": "...", "plugin": "..."}
Output (JSONL):
{"type": "ArchiveResult", "id": "...", "snapshot_id": "...", "plugin": "...", "status": "..."}
Examples:
# Extract specific snapshot
archivebox extract 01234567-89ab-cdef-0123-456789abcdef
# Pipe from snapshot command
archivebox snapshot https://example.com | archivebox extract
# Run specific plugin only
archivebox extract --plugin=screenshot 01234567-89ab-cdef-0123-456789abcdef
# Chain commands
archivebox crawl https://example.com | archivebox snapshot | archivebox extract
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox extract'
import sys
from typing import TYPE_CHECKING, Generator
from typing import Optional, List
import rich_click as click
from django.db.models import Q
from archivebox.misc.util import enforce_types, docstring
def process_archiveresult_by_id(archiveresult_id: str) -> int:
"""
Run extraction for a single ArchiveResult by ID (used by workers).
if TYPE_CHECKING:
Triggers the ArchiveResult's state machine tick() to run the extractor.
"""
from rich import print as rprint
from core.models import ArchiveResult
try:
archiveresult = ArchiveResult.objects.get(id=archiveresult_id)
except ArchiveResult.DoesNotExist:
rprint(f'[red]ArchiveResult {archiveresult_id} not found[/red]', file=sys.stderr)
return 1
ORCHESTRATOR = None
rprint(f'[blue]Extracting {archiveresult.extractor} for {archiveresult.snapshot.url}[/blue]', file=sys.stderr)
@enforce_types
def extract(archiveresult_id: str) -> Generator['ArchiveResult', None, None]:
archiveresult = ArchiveResult.objects.get(id=archiveresult_id)
if not archiveresult:
raise Exception(f'ArchiveResult {archiveresult_id} not found')
return archiveresult.EXTRACTOR.extract()
try:
# Trigger state machine tick - this runs the actual extraction
archiveresult.sm.tick()
archiveresult.refresh_from_db()
if archiveresult.status == ArchiveResult.StatusChoices.SUCCEEDED:
print(f'[green]Extraction succeeded: {archiveresult.output}[/green]')
return 0
elif archiveresult.status == ArchiveResult.StatusChoices.FAILED:
print(f'[red]Extraction failed: {archiveresult.output}[/red]', file=sys.stderr)
return 1
else:
# Still in progress or backoff - not a failure
print(f'[yellow]Extraction status: {archiveresult.status}[/yellow]')
return 0
except Exception as e:
print(f'[red]Extraction error: {type(e).__name__}: {e}[/red]', file=sys.stderr)
return 1
def run_plugins(
args: tuple,
plugin: str = '',
wait: bool = True,
) -> int:
"""
Run plugins on Snapshots from input.
Reads Snapshot IDs or JSONL from args/stdin, runs plugins, outputs JSONL.
Exit codes:
0: Success
1: Failure
"""
from rich import print as rprint
from django.utils import timezone
from archivebox.misc.jsonl import (
read_args_or_stdin, write_record, archiveresult_to_jsonl,
TYPE_SNAPSHOT, TYPE_ARCHIVERESULT
)
from core.models import Snapshot, ArchiveResult
from workers.orchestrator import Orchestrator
is_tty = sys.stdout.isatty()
# Collect all input records
records = list(read_args_or_stdin(args))
if not records:
rprint('[yellow]No snapshots provided. Pass snapshot IDs as arguments or via stdin.[/yellow]', file=sys.stderr)
return 1
# Gather snapshot IDs to process
snapshot_ids = set()
for record in records:
record_type = record.get('type')
if record_type == TYPE_SNAPSHOT:
snapshot_id = record.get('id')
if snapshot_id:
snapshot_ids.add(snapshot_id)
elif record.get('url'):
# Look up by URL
try:
snap = Snapshot.objects.get(url=record['url'])
snapshot_ids.add(str(snap.id))
except Snapshot.DoesNotExist:
rprint(f'[yellow]Snapshot not found for URL: {record["url"]}[/yellow]', file=sys.stderr)
elif record_type == TYPE_ARCHIVERESULT:
snapshot_id = record.get('snapshot_id')
if snapshot_id:
snapshot_ids.add(snapshot_id)
elif 'id' in record:
# Assume it's a snapshot ID
snapshot_ids.add(record['id'])
if not snapshot_ids:
rprint('[red]No valid snapshot IDs found in input[/red]', file=sys.stderr)
return 1
# Get snapshots and ensure they have pending ArchiveResults
processed_count = 0
for snapshot_id in snapshot_ids:
try:
snapshot = Snapshot.objects.get(id=snapshot_id)
except Snapshot.DoesNotExist:
rprint(f'[yellow]Snapshot {snapshot_id} not found[/yellow]', file=sys.stderr)
continue
# Create pending ArchiveResults if needed
if plugin:
# Only create for specific plugin
result, created = ArchiveResult.objects.get_or_create(
snapshot=snapshot,
extractor=plugin,
defaults={
'status': ArchiveResult.StatusChoices.QUEUED,
'retry_at': timezone.now(),
'created_by_id': snapshot.created_by_id,
}
)
if not created and result.status in [ArchiveResult.StatusChoices.FAILED, ArchiveResult.StatusChoices.SKIPPED]:
# Reset for retry
result.status = ArchiveResult.StatusChoices.QUEUED
result.retry_at = timezone.now()
result.save()
else:
# Create all pending plugins
snapshot.create_pending_archiveresults()
# Reset snapshot status to allow processing
if snapshot.status == Snapshot.StatusChoices.SEALED:
snapshot.status = Snapshot.StatusChoices.STARTED
snapshot.retry_at = timezone.now()
snapshot.save()
processed_count += 1
if processed_count == 0:
rprint('[red]No snapshots to process[/red]', file=sys.stderr)
return 1
rprint(f'[blue]Queued {processed_count} snapshots for extraction[/blue]', file=sys.stderr)
# Run orchestrator if --wait (default)
if wait:
rprint('[blue]Running plugins...[/blue]', file=sys.stderr)
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.runloop()
# Output results as JSONL (when piped) or human-readable (when TTY)
for snapshot_id in snapshot_ids:
try:
snapshot = Snapshot.objects.get(id=snapshot_id)
results = snapshot.archiveresult_set.all()
if plugin:
results = results.filter(extractor=plugin)
for result in results:
if is_tty:
status_color = {
'succeeded': 'green',
'failed': 'red',
'skipped': 'yellow',
}.get(result.status, 'dim')
rprint(f' [{status_color}]{result.status}[/{status_color}] {result.extractor}{result.output or ""}', file=sys.stderr)
else:
write_record(archiveresult_to_jsonl(result))
except Snapshot.DoesNotExist:
continue
return 0
def is_archiveresult_id(value: str) -> bool:
"""Check if value looks like an ArchiveResult UUID."""
import re
uuid_pattern = re.compile(r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)
if not uuid_pattern.match(value):
return False
# Verify it's actually an ArchiveResult (not a Snapshot or other object)
from core.models import ArchiveResult
return ArchiveResult.objects.filter(id=value).exists()
# <user>@<machine_id>#<datetime>/absolute/path/to/binary
# 2014.24.01
@click.command()
@click.option('--plugin', '-p', default='', help='Run only this plugin (e.g., screenshot, singlefile)')
@click.option('--wait/--no-wait', default=True, help='Wait for plugins to complete (default: wait)')
@click.argument('args', nargs=-1)
def main(plugin: str, wait: bool, args: tuple):
"""Run plugins on Snapshots, or process existing ArchiveResults by ID"""
from archivebox.misc.jsonl import read_args_or_stdin
@click.argument('archiveresult_ids', nargs=-1, type=str)
@docstring(extract.__doc__)
def main(archiveresult_ids: list[str]):
"""Add a new URL or list of URLs to your archive"""
for archiveresult_id in (archiveresult_ids or sys.stdin):
print(f'Extracting {archiveresult_id}...')
archiveresult = extract(str(archiveresult_id))
print(archiveresult.as_json())
# Read all input
records = list(read_args_or_stdin(args))
if not records:
from rich import print as rprint
rprint('[yellow]No Snapshot IDs or ArchiveResult IDs provided. Pass as arguments or via stdin.[/yellow]', file=sys.stderr)
sys.exit(1)
# Check if input looks like existing ArchiveResult IDs to process
all_are_archiveresult_ids = all(
is_archiveresult_id(r.get('id') or r.get('url', ''))
for r in records
)
if all_are_archiveresult_ids:
# Process existing ArchiveResults by ID
exit_code = 0
for record in records:
archiveresult_id = record.get('id') or record.get('url')
result = process_archiveresult_by_id(archiveresult_id)
if result != 0:
exit_code = result
sys.exit(exit_code)
else:
# Default behavior: run plugins on Snapshots from input
sys.exit(run_plugins(args, plugin=plugin, wait=wait))
if __name__ == '__main__':
main()

View File

@@ -21,10 +21,9 @@ def init(force: bool=False, quick: bool=False, install: bool=False, setup: bool=
from archivebox.config import CONSTANTS, VERSION, DATA_DIR
from archivebox.config.common import SERVER_CONFIG
from archivebox.config.collection import write_config_file
from archivebox.index import load_main_index, write_main_index, fix_invalid_folder_locations, get_invalid_folders
from archivebox.index.schema import Link
from archivebox.index.json import parse_json_main_index, parse_json_links_details
from archivebox.index.sql import apply_migrations
from archivebox.misc.folders import fix_invalid_folder_locations, get_invalid_folders
from archivebox.misc.legacy import parse_json_main_index, parse_json_links_details, SnapshotDict
from archivebox.misc.db import apply_migrations
# if os.access(out_dir / CONSTANTS.JSON_INDEX_FILENAME, os.F_OK):
# print("[red]:warning: This folder contains a JSON index. It is deprecated, and will no longer be kept up to date automatically.[/red]", file=sys.stderr)
@@ -100,10 +99,10 @@ def init(force: bool=False, quick: bool=False, install: bool=False, setup: bool=
from core.models import Snapshot
all_links = Snapshot.objects.none()
pending_links: dict[str, Link] = {}
pending_links: dict[str, SnapshotDict] = {}
if existing_index:
all_links = load_main_index(DATA_DIR, warn=False)
all_links = Snapshot.objects.all()
print(f' √ Loaded {all_links.count()} links from existing main index.')
if quick:
@@ -119,9 +118,9 @@ def init(force: bool=False, quick: bool=False, install: bool=False, setup: bool=
# Links in JSON index but not in main index
orphaned_json_links = {
link.url: link
for link in parse_json_main_index(DATA_DIR)
if not all_links.filter(url=link.url).exists()
link_dict['url']: link_dict
for link_dict in parse_json_main_index(DATA_DIR)
if not all_links.filter(url=link_dict['url']).exists()
}
if orphaned_json_links:
pending_links.update(orphaned_json_links)
@@ -129,9 +128,9 @@ def init(force: bool=False, quick: bool=False, install: bool=False, setup: bool=
# Links in data dir indexes but not in main index
orphaned_data_dir_links = {
link.url: link
for link in parse_json_links_details(DATA_DIR)
if not all_links.filter(url=link.url).exists()
link_dict['url']: link_dict
for link_dict in parse_json_links_details(DATA_DIR)
if not all_links.filter(url=link_dict['url']).exists()
}
if orphaned_data_dir_links:
pending_links.update(orphaned_data_dir_links)
@@ -159,7 +158,8 @@ def init(force: bool=False, quick: bool=False, install: bool=False, setup: bool=
print(' archivebox init --quick', file=sys.stderr)
raise SystemExit(1)
write_main_index(list(pending_links.values()), DATA_DIR)
if pending_links:
Snapshot.objects.create_from_dicts(list(pending_links.values()))
print('\n[green]----------------------------------------------------------------------[/green]')

View File

@@ -4,7 +4,7 @@ __package__ = 'archivebox.cli'
import os
import sys
from typing import Optional, List
import shutil
import rich_click as click
from rich import print
@@ -13,149 +13,86 @@ from archivebox.misc.util import docstring, enforce_types
@enforce_types
def install(binproviders: Optional[List[str]]=None, binaries: Optional[List[str]]=None, dry_run: bool=False) -> None:
"""Automatically install all ArchiveBox dependencies and extras"""
# if running as root:
# - run init to create index + lib dir
# - chown -R 911 DATA_DIR
# - install all binaries as root
# - chown -R 911 LIB_DIR
# else:
# - run init to create index + lib dir as current user
# - install all binaries as current user
# - recommend user re-run with sudo if any deps need to be installed as root
def install(dry_run: bool=False) -> None:
"""Detect and install ArchiveBox dependencies by running a dependency-check crawl"""
import abx
import archivebox
from archivebox.config.permissions import IS_ROOT, ARCHIVEBOX_USER, ARCHIVEBOX_GROUP, SudoPermission
from archivebox.config.paths import DATA_DIR, ARCHIVE_DIR, get_or_create_working_lib_dir
from archivebox.config.permissions import IS_ROOT, ARCHIVEBOX_USER, ARCHIVEBOX_GROUP
from archivebox.config.paths import ARCHIVE_DIR
from archivebox.misc.logging import stderr
from archivebox.cli.archivebox_init import init
from archivebox.misc.system import run as run_shell
if not (os.access(ARCHIVE_DIR, os.R_OK) and ARCHIVE_DIR.is_dir()):
init() # must init full index because we need a db to store InstalledBinary entries in
print('\n[green][+] Installing ArchiveBox dependencies automatically...[/green]')
# we never want the data dir to be owned by root, detect owner of existing owner of DATA_DIR to try and guess desired non-root UID
print('\n[green][+] Detecting ArchiveBox dependencies...[/green]')
if IS_ROOT:
EUID = os.geteuid()
# if we have sudo/root permissions, take advantage of them just while installing dependencies
print()
print(f'[yellow]:warning: Running as UID=[blue]{EUID}[/blue] with [red]sudo[/red] only for dependencies that need it.[/yellow]')
print(f' DATA_DIR, LIB_DIR, and TMP_DIR will be owned by [blue]{ARCHIVEBOX_USER}:{ARCHIVEBOX_GROUP}[/blue].')
print(f'[yellow]:warning: Running as UID=[blue]{EUID}[/blue].[/yellow]')
print(f' DATA_DIR will be owned by [blue]{ARCHIVEBOX_USER}:{ARCHIVEBOX_GROUP}[/blue].')
print()
LIB_DIR = get_or_create_working_lib_dir()
package_manager_names = ', '.join(
f'[yellow]{binprovider.name}[/yellow]'
for binprovider in reversed(list(abx.as_dict(abx.pm.hook.get_BINPROVIDERS()).values()))
if not binproviders or (binproviders and binprovider.name in binproviders)
)
print(f'[+] Setting up package managers {package_manager_names}...')
for binprovider in reversed(list(abx.as_dict(abx.pm.hook.get_BINPROVIDERS()).values())):
if binproviders and binprovider.name not in binproviders:
continue
try:
binprovider.setup()
except Exception:
# it's ok, installing binaries below will automatically set up package managers as needed
# e.g. if user does not have npm available we cannot set it up here yet, but once npm Binary is installed
# the next package that depends on npm will automatically call binprovider.setup() during its own install
pass
print()
for binary in reversed(list(abx.as_dict(abx.pm.hook.get_BINARIES()).values())):
if binary.name in ('archivebox', 'django', 'sqlite', 'python'):
# obviously must already be installed if we are running
continue
if binaries and binary.name not in binaries:
continue
providers = ' [grey53]or[/grey53] '.join(
provider.name for provider in binary.binproviders_supported
if not binproviders or (binproviders and provider.name in binproviders)
)
if not providers:
continue
print(f'[+] Detecting / Installing [yellow]{binary.name.ljust(22)}[/yellow] using [red]{providers}[/red]...')
try:
with SudoPermission(uid=0, fallback=True):
# print(binary.load_or_install(fresh=True).model_dump(exclude={'overrides', 'bin_dir', 'hook_type'}))
if binproviders:
providers_supported_by_binary = [provider.name for provider in binary.binproviders_supported]
for binprovider_name in binproviders:
if binprovider_name not in providers_supported_by_binary:
continue
try:
if dry_run:
# always show install commands when doing a dry run
sys.stderr.write("\033[2;49;90m") # grey53
result = binary.install(binproviders=[binprovider_name], dry_run=dry_run).model_dump(exclude={'overrides', 'bin_dir', 'hook_type'})
sys.stderr.write("\033[00m\n") # reset
else:
loaded_binary = archivebox.pm.hook.binary_load_or_install(binary=binary, binproviders=[binprovider_name], fresh=True, dry_run=dry_run, quiet=False)
result = loaded_binary.model_dump(exclude={'overrides', 'bin_dir', 'hook_type'})
if result and result['loaded_version']:
break
except Exception as e:
print(f'[red]:cross_mark: Failed to install {binary.name} as using {binprovider_name} as user {ARCHIVEBOX_USER}: {e}[/red]')
else:
if dry_run:
sys.stderr.write("\033[2;49;90m") # grey53
binary.install(dry_run=dry_run).model_dump(exclude={'overrides', 'bin_dir', 'hook_type'})
sys.stderr.write("\033[00m\n") # reset
else:
loaded_binary = archivebox.pm.hook.binary_load_or_install(binary=binary, fresh=True, dry_run=dry_run)
result = loaded_binary.model_dump(exclude={'overrides', 'bin_dir', 'hook_type'})
if IS_ROOT and LIB_DIR:
with SudoPermission(uid=0):
if ARCHIVEBOX_USER == 0:
os.system(f'chmod -R 777 "{LIB_DIR.resolve()}"')
else:
os.system(f'chown -R {ARCHIVEBOX_USER} "{LIB_DIR.resolve()}"')
except Exception as e:
print(f'[red]:cross_mark: Failed to install {binary.name} as user {ARCHIVEBOX_USER}: {e}[/red]')
if binaries and len(binaries) == 1:
# if we are only installing a single binary, raise the exception so the user can see what went wrong
raise
if dry_run:
print('[dim]Dry run - would create a crawl to detect dependencies[/dim]')
return
# Set up Django
from archivebox.config.django import setup_django
setup_django()
from django.utils import timezone
from crawls.models import Seed, Crawl
from archivebox.base_models.models import get_or_create_system_user_pk
# Create a seed and crawl for dependency detection
# Using a minimal crawl that will trigger on_Crawl hooks
created_by_id = get_or_create_system_user_pk()
seed = Seed.objects.create(
uri='archivebox://install',
label='Dependency detection',
created_by_id=created_by_id,
)
crawl = Crawl.objects.create(
seed=seed,
max_depth=0,
created_by_id=created_by_id,
status='queued',
)
print(f'[+] Created dependency detection crawl: {crawl.id}')
print('[+] Running crawl to detect binaries via on_Crawl hooks...')
print()
# Run the crawl synchronously (this triggers on_Crawl hooks)
from workers.orchestrator import Orchestrator
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.runloop()
print()
# Check for superuser
from django.contrib.auth import get_user_model
User = get_user_model()
if not User.objects.filter(is_superuser=True).exclude(username='system').exists():
stderr('\n[+] Don\'t forget to create a new admin user for the Web UI...', color='green')
stderr(' archivebox manage createsuperuser')
# run_subcommand('manage', subcommand_args=['createsuperuser'], pwd=out_dir)
print('\n[green][√] Set up ArchiveBox and its dependencies successfully.[/green]\n', file=sys.stderr)
from abx_plugin_pip.binaries import ARCHIVEBOX_BINARY
extra_args = []
if binproviders:
extra_args.append(f'--binproviders={",".join(binproviders)}')
if binaries:
extra_args.append(f'--binaries={",".join(binaries)}')
proc = run_shell([ARCHIVEBOX_BINARY.load().abspath, 'version', *extra_args], capture_output=False, cwd=DATA_DIR)
raise SystemExit(proc.returncode)
print()
# Run version to show full status
archivebox_path = shutil.which('archivebox') or sys.executable
if 'python' in archivebox_path:
os.system(f'{sys.executable} -m archivebox version')
else:
os.system(f'{archivebox_path} version')
@click.command()
@click.option('--binproviders', '-p', type=str, help='Select binproviders to use DEFAULT=env,apt,brew,sys_pip,venv_pip,lib_pip,pipx,sys_npm,lib_npm,puppeteer,playwright (all)', default=None)
@click.option('--binaries', '-b', type=str, help='Select binaries to install DEFAULT=curl,wget,git,yt-dlp,chrome,single-file,readability-extractor,postlight-parser,... (all)', default=None)
@click.option('--dry-run', '-d', is_flag=True, help='Show what would be installed without actually installing anything', default=False)
@click.option('--dry-run', '-d', is_flag=True, help='Show what would happen without actually running', default=False)
@docstring(install.__doc__)
def main(**kwargs) -> None:
install(**kwargs)

View File

@@ -0,0 +1,67 @@
#!/usr/bin/env python3
"""
archivebox orchestrator [--daemon]
Start the orchestrator process that manages workers.
The orchestrator polls queues for each model type (Crawl, Snapshot, ArchiveResult)
and lazily spawns worker processes when there is work to be done.
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox orchestrator'
import sys
import rich_click as click
from archivebox.misc.util import docstring
def orchestrator(daemon: bool = False, watch: bool = False) -> int:
"""
Start the orchestrator process.
The orchestrator:
1. Polls each model queue (Crawl, Snapshot, ArchiveResult)
2. Spawns worker processes when there is work to do
3. Monitors worker health and restarts failed workers
4. Exits when all queues are empty (unless --daemon)
Args:
daemon: Run forever (don't exit when idle)
watch: Just watch the queues without spawning workers (for debugging)
Exit codes:
0: All work completed successfully
1: Error occurred
"""
from workers.orchestrator import Orchestrator
if Orchestrator.is_running():
print('[yellow]Orchestrator is already running[/yellow]')
return 0
try:
orchestrator_instance = Orchestrator(exit_on_idle=not daemon)
orchestrator_instance.runloop()
return 0
except KeyboardInterrupt:
return 0
except Exception as e:
print(f'[red]Orchestrator error: {type(e).__name__}: {e}[/red]', file=sys.stderr)
return 1
@click.command()
@click.option('--daemon', '-d', is_flag=True, help="Run forever (don't exit on idle)")
@click.option('--watch', '-w', is_flag=True, help="Watch queues without spawning workers")
@docstring(orchestrator.__doc__)
def main(daemon: bool, watch: bool):
"""Start the ArchiveBox orchestrator process"""
sys.exit(orchestrator(daemon=daemon, watch=watch))
if __name__ == '__main__':
main()

View File

@@ -12,10 +12,7 @@ import rich_click as click
from django.db.models import QuerySet
from archivebox.config import DATA_DIR
from archivebox.index.schema import Link
from archivebox.config.django import setup_django
from archivebox.index import load_main_index
from archivebox.index.sql import remove_from_sql_main_index
from archivebox.misc.util import enforce_types, docstring
from archivebox.misc.checks import check_data_folder
from archivebox.misc.logging_util import (
@@ -35,7 +32,7 @@ def remove(filter_patterns: Iterable[str]=(),
before: float | None=None,
yes: bool=False,
delete: bool=False,
out_dir: Path=DATA_DIR) -> Iterable[Link]:
out_dir: Path=DATA_DIR) -> QuerySet:
"""Remove the specified URLs from the archive"""
setup_django()
@@ -63,27 +60,27 @@ def remove(filter_patterns: Iterable[str]=(),
log_removal_finished(0, 0)
raise SystemExit(1)
log_links = [link.as_link() for link in snapshots]
log_list_finished(log_links)
log_removal_started(log_links, yes=yes, delete=delete)
log_list_finished(snapshots)
log_removal_started(snapshots, yes=yes, delete=delete)
timer = TimedProgress(360, prefix=' ')
try:
for snapshot in snapshots:
if delete:
shutil.rmtree(snapshot.as_link().link_dir, ignore_errors=True)
shutil.rmtree(snapshot.output_dir, ignore_errors=True)
finally:
timer.end()
to_remove = snapshots.count()
from archivebox.search import flush_search_index
from core.models import Snapshot
flush_search_index(snapshots=snapshots)
remove_from_sql_main_index(snapshots=snapshots, out_dir=out_dir)
all_snapshots = load_main_index(out_dir=out_dir)
snapshots.delete()
all_snapshots = Snapshot.objects.all()
log_removal_finished(all_snapshots.count(), to_remove)
return all_snapshots

View File

@@ -35,9 +35,12 @@ def schedule(add: bool=False,
depth = int(depth)
import shutil
from crontab import CronTab, CronSlices
from archivebox.misc.system import dedupe_cron_jobs
from abx_plugin_pip.binaries import ARCHIVEBOX_BINARY
# Find the archivebox binary path
ARCHIVEBOX_ABSPATH = shutil.which('archivebox') or sys.executable.replace('python', 'archivebox')
Path(CONSTANTS.LOGS_DIR).mkdir(exist_ok=True)
@@ -58,7 +61,7 @@ def schedule(add: bool=False,
'cd',
quoted(out_dir),
'&&',
quoted(ARCHIVEBOX_BINARY.load().abspath),
quoted(ARCHIVEBOX_ABSPATH),
*([
'add',
*(['--overwrite'] if overwrite else []),

View File

@@ -4,7 +4,7 @@ __package__ = 'archivebox.cli'
__command__ = 'archivebox search'
from pathlib import Path
from typing import Optional, List, Iterable
from typing import Optional, List, Any
import rich_click as click
from rich import print
@@ -12,11 +12,19 @@ from rich import print
from django.db.models import QuerySet
from archivebox.config import DATA_DIR
from archivebox.index import LINK_FILTERS
from archivebox.index.schema import Link
from archivebox.misc.logging import stderr
from archivebox.misc.util import enforce_types, docstring
# Filter types for URL matching
LINK_FILTERS = {
'exact': lambda pattern: {'url': pattern},
'substring': lambda pattern: {'url__icontains': pattern},
'regex': lambda pattern: {'url__iregex': pattern},
'domain': lambda pattern: {'url__istartswith': f'http://{pattern}'},
'tag': lambda pattern: {'tags__name': pattern},
'timestamp': lambda pattern: {'timestamp': pattern},
}
STATUS_CHOICES = [
'indexed', 'archived', 'unarchived', 'present', 'valid', 'invalid',
'duplicate', 'orphaned', 'corrupted', 'unrecognized'
@@ -24,38 +32,37 @@ STATUS_CHOICES = [
def list_links(snapshots: Optional[QuerySet]=None,
filter_patterns: Optional[List[str]]=None,
filter_type: str='substring',
after: Optional[float]=None,
before: Optional[float]=None,
out_dir: Path=DATA_DIR) -> Iterable[Link]:
from archivebox.index import load_main_index
from archivebox.index import snapshot_filter
def get_snapshots(snapshots: Optional[QuerySet]=None,
filter_patterns: Optional[List[str]]=None,
filter_type: str='substring',
after: Optional[float]=None,
before: Optional[float]=None,
out_dir: Path=DATA_DIR) -> QuerySet:
"""Filter and return Snapshots matching the given criteria."""
from core.models import Snapshot
if snapshots:
all_snapshots = snapshots
result = snapshots
else:
all_snapshots = load_main_index(out_dir=out_dir)
result = Snapshot.objects.all()
if after is not None:
all_snapshots = all_snapshots.filter(timestamp__gte=after)
result = result.filter(timestamp__gte=after)
if before is not None:
all_snapshots = all_snapshots.filter(timestamp__lt=before)
result = result.filter(timestamp__lt=before)
if filter_patterns:
all_snapshots = snapshot_filter(all_snapshots, filter_patterns, filter_type)
result = Snapshot.objects.filter_by_patterns(filter_patterns, filter_type)
if not all_snapshots:
if not result:
stderr('[!] No Snapshots matched your filters:', filter_patterns, f'({filter_type})', color='lightyellow')
return all_snapshots
return result
def list_folders(links: list[Link], status: str, out_dir: Path=DATA_DIR) -> dict[str, Link | None]:
def list_folders(snapshots: QuerySet, status: str, out_dir: Path=DATA_DIR) -> dict[str, Any]:
from archivebox.misc.checks import check_data_folder
from archivebox.index import (
from archivebox.misc.folders import (
get_indexed_folders,
get_archived_folders,
get_unarchived_folders,
@@ -67,7 +74,7 @@ def list_folders(links: list[Link], status: str, out_dir: Path=DATA_DIR) -> dict
get_corrupted_folders,
get_unrecognized_folders,
)
check_data_folder()
STATUS_FUNCTIONS = {
@@ -84,7 +91,7 @@ def list_folders(links: list[Link], status: str, out_dir: Path=DATA_DIR) -> dict
}
try:
return STATUS_FUNCTIONS[status](links, out_dir=out_dir)
return STATUS_FUNCTIONS[status](snapshots, out_dir=out_dir)
except KeyError:
raise ValueError('Status not recognized.')
@@ -109,7 +116,7 @@ def search(filter_patterns: list[str] | None=None,
stderr('[X] --with-headers requires --json, --html or --csv\n', color='red')
raise SystemExit(2)
snapshots = list_links(
snapshots = get_snapshots(
filter_patterns=list(filter_patterns) if filter_patterns else None,
filter_type=filter_type,
before=before,
@@ -120,20 +127,24 @@ def search(filter_patterns: list[str] | None=None,
snapshots = snapshots.order_by(sort)
folders = list_folders(
links=snapshots,
snapshots=snapshots,
status=status,
out_dir=DATA_DIR,
)
if json:
from archivebox.index.json import generate_json_index_from_links
output = generate_json_index_from_links(folders.values(), with_headers)
from core.models import Snapshot
# Filter for non-None snapshots
valid_snapshots = [s for s in folders.values() if s is not None]
output = Snapshot.objects.filter(pk__in=[s.pk for s in valid_snapshots]).to_json(with_headers=with_headers)
elif html:
from archivebox.index.html import generate_index_from_links
output = generate_index_from_links(folders.values(), with_headers)
from core.models import Snapshot
valid_snapshots = [s for s in folders.values() if s is not None]
output = Snapshot.objects.filter(pk__in=[s.pk for s in valid_snapshots]).to_html(with_headers=with_headers)
elif csv:
from archivebox.index.csv import links_to_csv
output = links_to_csv(folders.values(), csv.split(','), with_headers)
from core.models import Snapshot
valid_snapshots = [s for s in folders.values() if s is not None]
output = Snapshot.objects.filter(pk__in=[s.pk for s in valid_snapshots]).to_csv(cols=csv.split(','), header=with_headers)
else:
from archivebox.misc.logging_util import printable_folders
output = printable_folders(folders, with_headers)

View File

@@ -0,0 +1,218 @@
#!/usr/bin/env python3
"""
archivebox snapshot [urls...] [--depth=N] [--tag=TAG] [--plugins=...]
Create Snapshots from URLs. Accepts URLs as arguments, from stdin, or via JSONL.
Input formats:
- Plain URLs (one per line)
- JSONL: {"type": "Snapshot", "url": "...", "title": "...", "tags": "..."}
Output (JSONL):
{"type": "Snapshot", "id": "...", "url": "...", "status": "queued", ...}
Examples:
# Create snapshots from URLs
archivebox snapshot https://example.com https://foo.com
# Pipe from stdin
echo 'https://example.com' | archivebox snapshot
# Chain with extract
archivebox snapshot https://example.com | archivebox extract
# With crawl depth
archivebox snapshot --depth=1 https://example.com
"""
__package__ = 'archivebox.cli'
__command__ = 'archivebox snapshot'
import sys
from typing import Optional
import rich_click as click
from archivebox.misc.util import docstring
def process_snapshot_by_id(snapshot_id: str) -> int:
"""
Process a single Snapshot by ID (used by workers).
Triggers the Snapshot's state machine tick() which will:
- Transition from queued -> started (creates pending ArchiveResults)
- Transition from started -> sealed (when all ArchiveResults done)
"""
from rich import print as rprint
from core.models import Snapshot
try:
snapshot = Snapshot.objects.get(id=snapshot_id)
except Snapshot.DoesNotExist:
rprint(f'[red]Snapshot {snapshot_id} not found[/red]', file=sys.stderr)
return 1
rprint(f'[blue]Processing Snapshot {snapshot.id} {snapshot.url[:50]} (status={snapshot.status})[/blue]', file=sys.stderr)
try:
snapshot.sm.tick()
snapshot.refresh_from_db()
rprint(f'[green]Snapshot complete (status={snapshot.status})[/green]', file=sys.stderr)
return 0
except Exception as e:
rprint(f'[red]Snapshot error: {type(e).__name__}: {e}[/red]', file=sys.stderr)
return 1
def create_snapshots(
urls: tuple,
depth: int = 0,
tag: str = '',
plugins: str = '',
created_by_id: Optional[int] = None,
) -> int:
"""
Create Snapshots from URLs or JSONL records.
Reads from args or stdin, creates Snapshot objects, outputs JSONL.
If --plugins is passed, also runs specified plugins (blocking).
Exit codes:
0: Success
1: Failure
"""
from rich import print as rprint
from django.utils import timezone
from archivebox.misc.jsonl import (
read_args_or_stdin, write_record, snapshot_to_jsonl,
TYPE_SNAPSHOT, TYPE_TAG, get_or_create_snapshot
)
from archivebox.base_models.models import get_or_create_system_user_pk
from core.models import Snapshot
from crawls.models import Seed, Crawl
from archivebox.config import CONSTANTS
created_by_id = created_by_id or get_or_create_system_user_pk()
is_tty = sys.stdout.isatty()
# Collect all input records
records = list(read_args_or_stdin(urls))
if not records:
rprint('[yellow]No URLs provided. Pass URLs as arguments or via stdin.[/yellow]', file=sys.stderr)
return 1
# If depth > 0, we need a Crawl to manage recursive discovery
crawl = None
if depth > 0:
# Create a seed for this batch
sources_file = CONSTANTS.SOURCES_DIR / f'{timezone.now().strftime("%Y-%m-%d__%H-%M-%S")}__snapshot.txt'
sources_file.parent.mkdir(parents=True, exist_ok=True)
sources_file.write_text('\n'.join(r.get('url', '') for r in records if r.get('url')))
seed = Seed.from_file(
sources_file,
label=f'snapshot --depth={depth}',
created_by=created_by_id,
)
crawl = Crawl.from_seed(seed, max_depth=depth)
# Process each record
created_snapshots = []
for record in records:
if record.get('type') != TYPE_SNAPSHOT and 'url' not in record:
continue
try:
# Add crawl info if we have one
if crawl:
record['crawl_id'] = str(crawl.id)
record['depth'] = record.get('depth', 0)
# Add tags if provided via CLI
if tag and not record.get('tags'):
record['tags'] = tag
# Get or create the snapshot
snapshot = get_or_create_snapshot(record, created_by_id=created_by_id)
created_snapshots.append(snapshot)
# Output JSONL record (only when piped)
if not is_tty:
write_record(snapshot_to_jsonl(snapshot))
except Exception as e:
rprint(f'[red]Error creating snapshot: {e}[/red]', file=sys.stderr)
continue
if not created_snapshots:
rprint('[red]No snapshots created[/red]', file=sys.stderr)
return 1
rprint(f'[green]Created {len(created_snapshots)} snapshots[/green]', file=sys.stderr)
# If TTY, show human-readable output
if is_tty:
for snapshot in created_snapshots:
rprint(f' [dim]{snapshot.id}[/dim] {snapshot.url[:60]}', file=sys.stderr)
# If --plugins is passed, run the orchestrator for those plugins
if plugins:
from workers.orchestrator import Orchestrator
rprint(f'[blue]Running plugins: {plugins or "all"}...[/blue]', file=sys.stderr)
orchestrator = Orchestrator(exit_on_idle=True)
orchestrator.runloop()
return 0
def is_snapshot_id(value: str) -> bool:
"""Check if value looks like a Snapshot UUID."""
import re
uuid_pattern = re.compile(r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.I)
return bool(uuid_pattern.match(value))
@click.command()
@click.option('--depth', '-d', type=int, default=0, help='Recursively crawl linked pages up to N levels deep')
@click.option('--tag', '-t', default='', help='Comma-separated tags to add to each snapshot')
@click.option('--plugins', '-p', default='', help='Comma-separated list of plugins to run after creating snapshots (e.g. title,screenshot)')
@click.argument('args', nargs=-1)
def main(depth: int, tag: str, plugins: str, args: tuple):
"""Create Snapshots from URLs, or process existing Snapshots by ID"""
from archivebox.misc.jsonl import read_args_or_stdin
# Read all input
records = list(read_args_or_stdin(args))
if not records:
from rich import print as rprint
rprint('[yellow]No URLs or Snapshot IDs provided. Pass as arguments or via stdin.[/yellow]', file=sys.stderr)
sys.exit(1)
# Check if input looks like existing Snapshot IDs to process
# If ALL inputs are UUIDs with no URL, assume we're processing existing Snapshots
all_are_ids = all(
(r.get('id') and not r.get('url')) or is_snapshot_id(r.get('url', ''))
for r in records
)
if all_are_ids:
# Process existing Snapshots by ID
exit_code = 0
for record in records:
snapshot_id = record.get('id') or record.get('url')
result = process_snapshot_by_id(snapshot_id)
if result != 0:
exit_code = result
sys.exit(exit_code)
else:
# Create new Snapshots from URLs
sys.exit(create_snapshots(args, depth=depth, tag=tag, plugins=plugins))
if __name__ == '__main__':
main()

View File

@@ -10,9 +10,8 @@ from rich import print
from archivebox.misc.util import enforce_types, docstring
from archivebox.config import DATA_DIR, CONSTANTS, ARCHIVE_DIR
from archivebox.config.common import SHELL_CONFIG
from archivebox.index.json import parse_json_links_details
from archivebox.index import (
load_main_index,
from archivebox.misc.legacy import parse_json_links_details
from archivebox.misc.folders import (
get_indexed_folders,
get_archived_folders,
get_invalid_folders,
@@ -33,7 +32,7 @@ def status(out_dir: Path=DATA_DIR) -> None:
"""Print out some info and statistics about the archive collection"""
from django.contrib.auth import get_user_model
from archivebox.index.sql import get_admins
from archivebox.misc.db import get_admins
from core.models import Snapshot
User = get_user_model()
@@ -44,7 +43,7 @@ def status(out_dir: Path=DATA_DIR) -> None:
print(f' Index size: {size} across {num_files} files')
print()
links = load_main_index(out_dir=out_dir)
links = Snapshot.objects.all()
num_sql_links = links.count()
num_link_details = sum(1 for link in parse_json_links_details(out_dir=out_dir))
print(f' > SQL Main Index: {num_sql_links} links'.ljust(36), f'(found in {CONSTANTS.SQL_INDEX_FILENAME})')

View File

@@ -8,8 +8,7 @@ import rich_click as click
from typing import Iterable
from archivebox.misc.util import enforce_types, docstring
from archivebox.index import (
LINK_FILTERS,
from archivebox.misc.folders import (
get_indexed_folders,
get_archived_folders,
get_unarchived_folders,
@@ -22,6 +21,16 @@ from archivebox.index import (
get_unrecognized_folders,
)
# Filter types for URL matching
LINK_FILTERS = {
'exact': lambda pattern: {'url': pattern},
'substring': lambda pattern: {'url__icontains': pattern},
'regex': lambda pattern: {'url__iregex': pattern},
'domain': lambda pattern: {'url__istartswith': f'http://{pattern}'},
'tag': lambda pattern: {'tags__name': pattern},
'timestamp': lambda pattern: {'timestamp': pattern},
}
@enforce_types
def update(filter_patterns: Iterable[str]=(),
@@ -33,15 +42,66 @@ def update(filter_patterns: Iterable[str]=(),
after: float | None=None,
status: str='indexed',
filter_type: str='exact',
extract: str="") -> None:
plugins: str="",
max_workers: int=4) -> None:
"""Import any new links from subscriptions and retry any previously failed/skipped links"""
from rich import print
from archivebox.config.django import setup_django
setup_django()
from django.utils import timezone
from core.models import Snapshot
from workers.orchestrator import parallel_archive
from workers.orchestrator import Orchestrator
orchestrator = Orchestrator(exit_on_idle=False)
orchestrator.start()
# Get snapshots to update based on filters
snapshots = Snapshot.objects.all()
if filter_patterns:
snapshots = Snapshot.objects.filter_by_patterns(list(filter_patterns), filter_type)
if status == 'unarchived':
snapshots = snapshots.filter(downloaded_at__isnull=True)
elif status == 'archived':
snapshots = snapshots.filter(downloaded_at__isnull=False)
if before:
from datetime import datetime
snapshots = snapshots.filter(bookmarked_at__lt=datetime.fromtimestamp(before))
if after:
from datetime import datetime
snapshots = snapshots.filter(bookmarked_at__gt=datetime.fromtimestamp(after))
if resume:
snapshots = snapshots.filter(timestamp__gte=str(resume))
snapshot_ids = list(snapshots.values_list('pk', flat=True))
if not snapshot_ids:
print('[yellow]No snapshots found matching the given filters[/yellow]')
return
print(f'[green]\\[*] Found {len(snapshot_ids)} snapshots to update[/green]')
if index_only:
print('[yellow]Index-only mode - skipping archiving[/yellow]')
return
methods = plugins.split(',') if plugins else None
# Queue snapshots for archiving via the state machine system
# Workers will pick them up and run the plugins
if len(snapshot_ids) > 1 and max_workers > 1:
parallel_archive(snapshot_ids, max_workers=max_workers, overwrite=overwrite, methods=methods)
else:
# Queue snapshots by setting status to queued
for snapshot in snapshots:
Snapshot.objects.filter(id=snapshot.id).update(
status=Snapshot.StatusChoices.QUEUED,
retry_at=timezone.now(),
)
print(f'[green]Queued {len(snapshot_ids)} snapshots for archiving[/green]')
@click.command()
@@ -71,7 +131,8 @@ Update only links or data directories that have the given status:
unrecognized {get_unrecognized_folders.__doc__}
''')
@click.option('--filter-type', '-t', type=click.Choice([*LINK_FILTERS.keys(), 'search']), default='exact', help='Type of pattern matching to use when filtering URLs')
@click.option('--extract', '-e', default='', help='Comma-separated list of extractors to use e.g. title,favicon,screenshot,singlefile,...')
@click.option('--plugins', '-p', default='', help='Comma-separated list of plugins to use e.g. title,favicon,screenshot,singlefile,...')
@click.option('--max-workers', '-j', type=int, default=4, help='Number of parallel worker processes for archiving')
@click.argument('filter_patterns', nargs=-1)
@docstring(update.__doc__)
def main(**kwargs):

View File

@@ -3,7 +3,10 @@
__package__ = 'archivebox.cli'
import sys
from typing import Iterable
import os
import platform
from pathlib import Path
from typing import Iterable, Optional
import rich_click as click
@@ -12,7 +15,6 @@ from archivebox.misc.util import docstring, enforce_types
@enforce_types
def version(quiet: bool=False,
binproviders: Iterable[str]=(),
binaries: Iterable[str]=()) -> list[str]:
"""Print the ArchiveBox version, debug metadata, and installed dependency versions"""
@@ -22,37 +24,24 @@ def version(quiet: bool=False,
if quiet or '--version' in sys.argv:
return []
# Only do slower imports when getting full version info
import os
import platform
from pathlib import Path
from rich.panel import Panel
from rich.console import Console
from abx_pkg import Binary
import abx
import archivebox
from archivebox.config import CONSTANTS, DATA_DIR
from archivebox.config.version import get_COMMIT_HASH, get_BUILD_TIME
from archivebox.config.permissions import ARCHIVEBOX_USER, ARCHIVEBOX_GROUP, RUNNING_AS_UID, RUNNING_AS_GID, IN_DOCKER
from archivebox.config.paths import get_data_locations, get_code_locations
from archivebox.config.common import SHELL_CONFIG, STORAGE_CONFIG, SEARCH_BACKEND_CONFIG
from archivebox.misc.logging_util import printable_folder_status
from abx_plugin_default_binproviders import apt, brew, env
from archivebox.config.configset import get_config
console = Console()
prnt = console.print
LDAP_ENABLED = archivebox.pm.hook.get_SCOPE_CONFIG().LDAP_ENABLED
# Check if LDAP is enabled (simple config lookup)
config = get_config()
LDAP_ENABLED = config.get('LDAP_ENABLED', False)
# 0.7.1
# ArchiveBox v0.7.1+editable COMMIT_HASH=951bba5 BUILD_TIME=2023-12-17 16:46:05 1702860365
# IN_DOCKER=False IN_QEMU=False ARCH=arm64 OS=Darwin PLATFORM=macOS-14.2-arm64-arm-64bit PYTHON=Cpython
# FS_ATOMIC=True FS_REMOTE=False FS_USER=501:20 FS_PERMS=644
# DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False
p = platform.uname()
COMMIT_HASH = get_COMMIT_HASH()
prnt(
@@ -68,15 +57,26 @@ def version(quiet: bool=False,
f'PLATFORM={platform.platform()}',
f'PYTHON={sys.implementation.name.title()}' + (' (venv)' if CONSTANTS.IS_INSIDE_VENV else ''),
)
OUTPUT_IS_REMOTE_FS = get_data_locations().DATA_DIR.is_mount or get_data_locations().ARCHIVE_DIR.is_mount
DATA_DIR_STAT = CONSTANTS.DATA_DIR.stat()
prnt(
f'EUID={os.geteuid()}:{os.getegid()} UID={RUNNING_AS_UID}:{RUNNING_AS_GID} PUID={ARCHIVEBOX_USER}:{ARCHIVEBOX_GROUP}',
f'FS_UID={DATA_DIR_STAT.st_uid}:{DATA_DIR_STAT.st_gid}',
f'FS_PERMS={STORAGE_CONFIG.OUTPUT_PERMISSIONS}',
f'FS_ATOMIC={STORAGE_CONFIG.ENFORCE_ATOMIC_WRITES}',
f'FS_REMOTE={OUTPUT_IS_REMOTE_FS}',
)
try:
OUTPUT_IS_REMOTE_FS = get_data_locations().DATA_DIR.is_mount or get_data_locations().ARCHIVE_DIR.is_mount
except Exception:
OUTPUT_IS_REMOTE_FS = False
try:
DATA_DIR_STAT = CONSTANTS.DATA_DIR.stat()
prnt(
f'EUID={os.geteuid()}:{os.getegid()} UID={RUNNING_AS_UID}:{RUNNING_AS_GID} PUID={ARCHIVEBOX_USER}:{ARCHIVEBOX_GROUP}',
f'FS_UID={DATA_DIR_STAT.st_uid}:{DATA_DIR_STAT.st_gid}',
f'FS_PERMS={STORAGE_CONFIG.OUTPUT_PERMISSIONS}',
f'FS_ATOMIC={STORAGE_CONFIG.ENFORCE_ATOMIC_WRITES}',
f'FS_REMOTE={OUTPUT_IS_REMOTE_FS}',
)
except Exception:
prnt(
f'EUID={os.geteuid()}:{os.getegid()} UID={RUNNING_AS_UID}:{RUNNING_AS_GID} PUID={ARCHIVEBOX_USER}:{ARCHIVEBOX_GROUP}',
)
prnt(
f'DEBUG={SHELL_CONFIG.DEBUG}',
f'IS_TTY={SHELL_CONFIG.IS_TTY}',
@@ -84,14 +84,11 @@ def version(quiet: bool=False,
f'ID={CONSTANTS.MACHINE_ID}:{CONSTANTS.COLLECTION_ID}',
f'SEARCH_BACKEND={SEARCH_BACKEND_CONFIG.SEARCH_BACKEND_ENGINE}',
f'LDAP={LDAP_ENABLED}',
#f'DB=django.db.backends.sqlite3 (({CONFIG["SQLITE_JOURNAL_MODE"]})', # add this if we have more useful info to show eventually
)
prnt()
if not (os.access(CONSTANTS.ARCHIVE_DIR, os.R_OK) and os.access(CONSTANTS.CONFIG_FILE, os.R_OK)):
PANEL_TEXT = '\n'.join((
# '',
# f'[yellow]CURRENT DIR =[/yellow] [red]{os.getcwd()}[/red]',
'',
'[violet]Hint:[/violet] [green]cd[/green] into a collection [blue]DATA_DIR[/blue] and run [green]archivebox version[/green] again...',
' [grey53]OR[/grey53] run [green]archivebox init[/green] to create a new collection in the current dir.',
@@ -105,77 +102,94 @@ def version(quiet: bool=False,
prnt('[pale_green1][i] Binary Dependencies:[/pale_green1]')
failures = []
BINARIES = abx.as_dict(archivebox.pm.hook.get_BINARIES())
for name, binary in list(BINARIES.items()):
if binary.name == 'archivebox':
continue
# skip if the binary is not in the requested list of binaries
if binaries and binary.name not in binaries:
continue
# skip if the binary is not supported by any of the requested binproviders
if binproviders and binary.binproviders_supported and not any(provider.name in binproviders for provider in binary.binproviders_supported):
continue
err = None
try:
loaded_bin = binary.load()
except Exception as e:
err = e
loaded_bin = binary
provider_summary = f'[dark_sea_green3]{loaded_bin.binprovider.name.ljust(10)}[/dark_sea_green3]' if loaded_bin.binprovider else '[grey23]not found[/grey23] '
if loaded_bin.abspath:
abspath = str(loaded_bin.abspath).replace(str(DATA_DIR), '[light_slate_blue].[/light_slate_blue]').replace(str(Path('~').expanduser()), '~')
if ' ' in abspath:
abspath = abspath.replace(' ', r'\ ')
else:
abspath = f'[red]{err}[/red]'
prnt('', '[green]√[/green]' if loaded_bin.is_valid else '[red]X[/red]', '', loaded_bin.name.ljust(21), str(loaded_bin.version).ljust(12), provider_summary, abspath, overflow='ignore', crop=False)
if not loaded_bin.is_valid:
failures.append(loaded_bin.name)
prnt()
prnt('[gold3][i] Package Managers:[/gold3]')
BINPROVIDERS = abx.as_dict(archivebox.pm.hook.get_BINPROVIDERS())
for name, binprovider in list(BINPROVIDERS.items()):
err = None
if binproviders and binprovider.name not in binproviders:
continue
# TODO: implement a BinProvider.BINARY() method that gets the loaded binary for a binprovider's INSTALLER_BIN
loaded_bin = binprovider.INSTALLER_BINARY or Binary(name=binprovider.INSTALLER_BIN, binproviders=[env, apt, brew])
abspath = str(loaded_bin.abspath).replace(str(DATA_DIR), '[light_slate_blue].[/light_slate_blue]').replace(str(Path('~').expanduser()), '~')
abspath = None
if loaded_bin.abspath:
abspath = str(loaded_bin.abspath).replace(str(DATA_DIR), '.').replace(str(Path('~').expanduser()), '~')
if ' ' in abspath:
abspath = abspath.replace(' ', r'\ ')
PATH = str(binprovider.PATH).replace(str(DATA_DIR), '[light_slate_blue].[/light_slate_blue]').replace(str(Path('~').expanduser()), '~')
ownership_summary = f'UID=[blue]{str(binprovider.EUID).ljust(4)}[/blue]'
provider_summary = f'[dark_sea_green3]{str(abspath).ljust(52)}[/dark_sea_green3]' if abspath else f'[grey23]{"not available".ljust(52)}[/grey23]'
prnt('', '[green]√[/green]' if binprovider.is_valid else '[grey53]-[/grey53]', '', binprovider.name.ljust(11), provider_summary, ownership_summary, f'PATH={PATH}', overflow='ellipsis', soft_wrap=True)
if not (binaries or binproviders):
# dont show source code / data dir info if we just want to get version info for a binary or binprovider
# Setup Django before importing models
from archivebox.config.django import setup_django
setup_django()
from machine.models import Machine, InstalledBinary
machine = Machine.current()
# Get all *_BINARY config values
binary_config_keys = [key for key in config.keys() if key.endswith('_BINARY')]
if not binary_config_keys:
prnt('', '[grey53]No binary dependencies defined in config.[/grey53]')
else:
for key in sorted(set(binary_config_keys)):
# Get the actual binary name/path from config value
bin_value = config.get(key, '').strip()
if not bin_value:
continue
# Check if it's a path (has slashes) or just a name
is_path = '/' in bin_value
if is_path:
# It's a full path - match against abspath
bin_name = Path(bin_value).name
# Skip if user specified specific binaries and this isn't one
if binaries and bin_name not in binaries:
continue
# Find InstalledBinary where abspath ends with this path
installed = InstalledBinary.objects.filter(
machine=machine,
abspath__endswith=bin_value,
).exclude(abspath='').exclude(abspath__isnull=True).order_by('-modified_at').first()
else:
# It's just a binary name - match against name
bin_name = bin_value
# Skip if user specified specific binaries and this isn't one
if binaries and bin_name not in binaries:
continue
# Find InstalledBinary by name
installed = InstalledBinary.objects.filter(
machine=machine,
name__iexact=bin_name,
).exclude(abspath='').exclude(abspath__isnull=True).order_by('-modified_at').first()
if installed and installed.is_valid:
display_path = installed.abspath.replace(str(DATA_DIR), '.').replace(str(Path('~').expanduser()), '~')
version_str = (installed.version or 'unknown')[:15]
provider = (installed.binprovider or 'env')[:8]
prnt('', '[green]√[/green]', '', bin_name.ljust(18), version_str.ljust(16), provider.ljust(8), display_path, overflow='ignore', crop=False)
else:
prnt('', '[red]X[/red]', '', bin_name.ljust(18), '[grey53]not installed[/grey53]', overflow='ignore', crop=False)
failures.append(bin_name)
# Show hint if no binaries are installed yet
has_any_installed = InstalledBinary.objects.filter(machine=machine).exclude(abspath='').exists()
if not has_any_installed:
prnt()
prnt('', '[grey53]Run [green]archivebox install[/green] to detect and install dependencies.[/grey53]')
if not binaries:
# Show code and data locations
prnt()
prnt('[deep_sky_blue3][i] Code locations:[/deep_sky_blue3]')
for name, path in get_code_locations().items():
prnt(printable_folder_status(name, path), overflow='ignore', crop=False)
try:
for name, path in get_code_locations().items():
if isinstance(path, dict):
prnt(printable_folder_status(name, path), overflow='ignore', crop=False)
except Exception as e:
prnt(f' [red]Error getting code locations: {e}[/red]')
prnt()
if os.access(CONSTANTS.ARCHIVE_DIR, os.R_OK) or os.access(CONSTANTS.CONFIG_FILE, os.R_OK):
prnt('[bright_yellow][i] Data locations:[/bright_yellow]')
for name, path in get_data_locations().items():
prnt(printable_folder_status(name, path), overflow='ignore', crop=False)
from archivebox.misc.checks import check_data_dir_permissions
try:
for name, path in get_data_locations().items():
if isinstance(path, dict):
prnt(printable_folder_status(name, path), overflow='ignore', crop=False)
except Exception as e:
prnt(f' [red]Error getting data locations: {e}[/red]')
check_data_dir_permissions()
try:
from archivebox.misc.checks import check_data_dir_permissions
check_data_dir_permissions()
except Exception:
pass
else:
prnt()
prnt('[red][i] Data locations:[/red] (not in a data directory)')
@@ -194,7 +208,6 @@ def version(quiet: bool=False,
@click.command()
@click.option('--quiet', '-q', is_flag=True, help='Only print ArchiveBox version number and nothing else. (equivalent to archivebox --version)')
@click.option('--binproviders', '-p', help='Select binproviders to detect DEFAULT=env,apt,brew,sys_pip,venv_pip,lib_pip,pipx,sys_npm,lib_npm,puppeteer,playwright (all)')
@click.option('--binaries', '-b', help='Select binaries to detect DEFAULT=curl,wget,git,yt-dlp,chrome,single-file,readability-extractor,postlight-parser,... (all)')
@docstring(version.__doc__)
def main(**kwargs):

View File

@@ -4,29 +4,46 @@ __package__ = 'archivebox.cli'
__command__ = 'archivebox worker'
import sys
import json
import rich_click as click
from archivebox.misc.util import docstring
def worker(worker_type: str, daemon: bool = False, plugin: str | None = None):
"""
Start a worker process to process items from the queue.
Worker types:
- crawl: Process Crawl objects (parse seeds, create snapshots)
- snapshot: Process Snapshot objects (create archive results)
- archiveresult: Process ArchiveResult objects (run plugins)
Workers poll the database for queued items, claim them atomically,
and spawn subprocess tasks to handle each item.
"""
from workers.worker import get_worker_class
WorkerClass = get_worker_class(worker_type)
# Build kwargs
kwargs = {'daemon': daemon}
if plugin and worker_type == 'archiveresult':
kwargs['extractor'] = plugin # internal field still called extractor
# Create and run worker
worker_instance = WorkerClass(**kwargs)
worker_instance.runloop()
@click.command()
@click.argument('worker_type')
@click.option('--wait-for-first-event', is_flag=True)
@click.option('--exit-on-idle', is_flag=True)
def main(worker_type: str, wait_for_first_event: bool, exit_on_idle: bool):
"""Start an ArchiveBox worker process of the given type"""
from workers.worker import get_worker_type
# allow piping in events to process from stdin
# if not sys.stdin.isatty():
# for line in sys.stdin.readlines():
# Event.dispatch(event=json.loads(line), parent=None)
# run the actor
Worker = get_worker_type(worker_type)
for event in Worker.run(wait_for_first_event=wait_for_first_event, exit_on_idle=exit_on_idle):
print(event)
@click.argument('worker_type', type=click.Choice(['crawl', 'snapshot', 'archiveresult']))
@click.option('--daemon', '-d', is_flag=True, help="Run forever (don't exit on idle)")
@click.option('--plugin', '-p', default=None, help='Filter by plugin (archiveresult only)')
@docstring(worker.__doc__)
def main(worker_type: str, daemon: bool, plugin: str | None):
"""Start an ArchiveBox worker process"""
worker(worker_type, daemon=daemon, plugin=plugin)
if __name__ == '__main__':

View File

@@ -31,7 +31,6 @@ DATA_DIR = 'data.tests'
os.environ.update(TEST_CONFIG)
from ..main import init
from ..index import load_main_index
from archivebox.config.constants import (
SQL_INDEX_FILENAME,
JSON_INDEX_FILENAME,

View File

@@ -0,0 +1,966 @@
#!/usr/bin/env python3
"""
Tests for CLI piping workflow: crawl | snapshot | extract
This module tests the JSONL-based piping between CLI commands as described in:
https://github.com/ArchiveBox/ArchiveBox/issues/1363
Workflows tested:
archivebox snapshot URL | archivebox extract
archivebox crawl URL | archivebox snapshot | archivebox extract
archivebox crawl --plugin=PARSER URL | archivebox snapshot | archivebox extract
Each command should:
- Accept URLs, snapshot_ids, or JSONL as input (args or stdin)
- Output JSONL to stdout when piped (not TTY)
- Output human-readable to stderr when TTY
"""
__package__ = 'archivebox.cli'
import os
import sys
import json
import shutil
import tempfile
import unittest
from io import StringIO
from pathlib import Path
from unittest.mock import patch, MagicMock
# Test configuration - disable slow extractors
TEST_CONFIG = {
'USE_COLOR': 'False',
'SHOW_PROGRESS': 'False',
'SAVE_ARCHIVE_DOT_ORG': 'False',
'SAVE_TITLE': 'True', # Fast extractor
'SAVE_FAVICON': 'False',
'SAVE_WGET': 'False',
'SAVE_WARC': 'False',
'SAVE_PDF': 'False',
'SAVE_SCREENSHOT': 'False',
'SAVE_DOM': 'False',
'SAVE_SINGLEFILE': 'False',
'SAVE_READABILITY': 'False',
'SAVE_MERCURY': 'False',
'SAVE_GIT': 'False',
'SAVE_MEDIA': 'False',
'SAVE_HEADERS': 'False',
'USE_CURL': 'False',
'USE_WGET': 'False',
'USE_GIT': 'False',
'USE_CHROME': 'False',
'USE_YOUTUBEDL': 'False',
'USE_NODE': 'False',
}
os.environ.update(TEST_CONFIG)
# =============================================================================
# JSONL Utility Tests
# =============================================================================
class TestJSONLParsing(unittest.TestCase):
"""Test JSONL input parsing utilities."""
def test_parse_plain_url(self):
"""Plain URLs should be parsed as Snapshot records."""
from archivebox.misc.jsonl import parse_line, TYPE_SNAPSHOT
result = parse_line('https://example.com')
self.assertIsNotNone(result)
self.assertEqual(result['type'], TYPE_SNAPSHOT)
self.assertEqual(result['url'], 'https://example.com')
def test_parse_jsonl_snapshot(self):
"""JSONL Snapshot records should preserve all fields."""
from archivebox.misc.jsonl import parse_line, TYPE_SNAPSHOT
line = '{"type": "Snapshot", "url": "https://example.com", "tags": "test,demo"}'
result = parse_line(line)
self.assertIsNotNone(result)
self.assertEqual(result['type'], TYPE_SNAPSHOT)
self.assertEqual(result['url'], 'https://example.com')
self.assertEqual(result['tags'], 'test,demo')
def test_parse_jsonl_with_id(self):
"""JSONL with id field should be recognized."""
from archivebox.misc.jsonl import parse_line, TYPE_SNAPSHOT
line = '{"type": "Snapshot", "id": "abc123", "url": "https://example.com"}'
result = parse_line(line)
self.assertIsNotNone(result)
self.assertEqual(result['id'], 'abc123')
self.assertEqual(result['url'], 'https://example.com')
def test_parse_uuid_as_snapshot_id(self):
"""Bare UUIDs should be parsed as snapshot IDs."""
from archivebox.misc.jsonl import parse_line, TYPE_SNAPSHOT
uuid = '01234567-89ab-cdef-0123-456789abcdef'
result = parse_line(uuid)
self.assertIsNotNone(result)
self.assertEqual(result['type'], TYPE_SNAPSHOT)
self.assertEqual(result['id'], uuid)
def test_parse_empty_line(self):
"""Empty lines should return None."""
from archivebox.misc.jsonl import parse_line
self.assertIsNone(parse_line(''))
self.assertIsNone(parse_line(' '))
self.assertIsNone(parse_line('\n'))
def test_parse_comment_line(self):
"""Comment lines should return None."""
from archivebox.misc.jsonl import parse_line
self.assertIsNone(parse_line('# This is a comment'))
self.assertIsNone(parse_line(' # Indented comment'))
def test_parse_invalid_url(self):
"""Invalid URLs should return None."""
from archivebox.misc.jsonl import parse_line
self.assertIsNone(parse_line('not-a-url'))
self.assertIsNone(parse_line('ftp://example.com')) # Only http/https/file
def test_parse_file_url(self):
"""file:// URLs should be parsed."""
from archivebox.misc.jsonl import parse_line, TYPE_SNAPSHOT
result = parse_line('file:///path/to/file.txt')
self.assertIsNotNone(result)
self.assertEqual(result['type'], TYPE_SNAPSHOT)
self.assertEqual(result['url'], 'file:///path/to/file.txt')
class TestJSONLOutput(unittest.TestCase):
"""Test JSONL output formatting."""
def test_snapshot_to_jsonl(self):
"""Snapshot model should serialize to JSONL correctly."""
from archivebox.misc.jsonl import snapshot_to_jsonl, TYPE_SNAPSHOT
# Create a mock snapshot
mock_snapshot = MagicMock()
mock_snapshot.id = 'test-uuid-1234'
mock_snapshot.url = 'https://example.com'
mock_snapshot.title = 'Example Title'
mock_snapshot.tags_str.return_value = 'tag1,tag2'
mock_snapshot.bookmarked_at = None
mock_snapshot.created_at = None
mock_snapshot.timestamp = '1234567890'
mock_snapshot.depth = 0
mock_snapshot.status = 'queued'
result = snapshot_to_jsonl(mock_snapshot)
self.assertEqual(result['type'], TYPE_SNAPSHOT)
self.assertEqual(result['id'], 'test-uuid-1234')
self.assertEqual(result['url'], 'https://example.com')
self.assertEqual(result['title'], 'Example Title')
def test_archiveresult_to_jsonl(self):
"""ArchiveResult model should serialize to JSONL correctly."""
from archivebox.misc.jsonl import archiveresult_to_jsonl, TYPE_ARCHIVERESULT
mock_result = MagicMock()
mock_result.id = 'result-uuid-5678'
mock_result.snapshot_id = 'snapshot-uuid-1234'
mock_result.extractor = 'title'
mock_result.status = 'succeeded'
mock_result.output = 'Example Title'
mock_result.start_ts = None
mock_result.end_ts = None
result = archiveresult_to_jsonl(mock_result)
self.assertEqual(result['type'], TYPE_ARCHIVERESULT)
self.assertEqual(result['id'], 'result-uuid-5678')
self.assertEqual(result['snapshot_id'], 'snapshot-uuid-1234')
self.assertEqual(result['extractor'], 'title')
self.assertEqual(result['status'], 'succeeded')
class TestReadArgsOrStdin(unittest.TestCase):
"""Test reading from args or stdin."""
def test_read_from_args(self):
"""Should read URLs from command line args."""
from archivebox.misc.jsonl import read_args_or_stdin
args = ('https://example1.com', 'https://example2.com')
records = list(read_args_or_stdin(args))
self.assertEqual(len(records), 2)
self.assertEqual(records[0]['url'], 'https://example1.com')
self.assertEqual(records[1]['url'], 'https://example2.com')
def test_read_from_stdin(self):
"""Should read URLs from stdin when no args provided."""
from archivebox.misc.jsonl import read_args_or_stdin
stdin_content = 'https://example1.com\nhttps://example2.com\n'
stream = StringIO(stdin_content)
# Mock isatty to return False (simulating piped input)
stream.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stream))
self.assertEqual(len(records), 2)
self.assertEqual(records[0]['url'], 'https://example1.com')
self.assertEqual(records[1]['url'], 'https://example2.com')
def test_read_jsonl_from_stdin(self):
"""Should read JSONL from stdin."""
from archivebox.misc.jsonl import read_args_or_stdin
stdin_content = '{"type": "Snapshot", "url": "https://example.com", "tags": "test"}\n'
stream = StringIO(stdin_content)
stream.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stream))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['url'], 'https://example.com')
self.assertEqual(records[0]['tags'], 'test')
def test_skip_tty_stdin(self):
"""Should not read from TTY stdin (would block)."""
from archivebox.misc.jsonl import read_args_or_stdin
stream = StringIO('https://example.com')
stream.isatty = lambda: True # Simulate TTY
records = list(read_args_or_stdin((), stream=stream))
self.assertEqual(len(records), 0)
# =============================================================================
# Unit Tests for Individual Commands
# =============================================================================
class TestCrawlCommand(unittest.TestCase):
"""Unit tests for archivebox crawl command."""
def setUp(self):
"""Set up test environment."""
self.test_dir = tempfile.mkdtemp()
os.environ['DATA_DIR'] = self.test_dir
def tearDown(self):
"""Clean up test environment."""
shutil.rmtree(self.test_dir, ignore_errors=True)
def test_crawl_accepts_url(self):
"""crawl should accept URLs as input."""
from archivebox.misc.jsonl import read_args_or_stdin
args = ('https://example.com',)
records = list(read_args_or_stdin(args))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['url'], 'https://example.com')
def test_crawl_accepts_snapshot_id(self):
"""crawl should accept snapshot IDs as input."""
from archivebox.misc.jsonl import read_args_or_stdin
uuid = '01234567-89ab-cdef-0123-456789abcdef'
args = (uuid,)
records = list(read_args_or_stdin(args))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['id'], uuid)
def test_crawl_accepts_jsonl(self):
"""crawl should accept JSONL with snapshot info."""
from archivebox.misc.jsonl import read_args_or_stdin
stdin = StringIO('{"type": "Snapshot", "id": "abc123", "url": "https://example.com"}\n')
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['id'], 'abc123')
self.assertEqual(records[0]['url'], 'https://example.com')
def test_crawl_separates_existing_vs_new(self):
"""crawl should identify existing snapshots vs new URLs."""
# This tests the logic in discover_outlinks() that separates
# records with 'id' (existing) from records with just 'url' (new)
records = [
{'type': 'Snapshot', 'id': 'existing-id-1'}, # Existing (id only)
{'type': 'Snapshot', 'url': 'https://new-url.com'}, # New (url only)
{'type': 'Snapshot', 'id': 'existing-id-2', 'url': 'https://existing.com'}, # Existing (has id)
]
existing = []
new = []
for record in records:
if record.get('id') and not record.get('url'):
existing.append(record['id'])
elif record.get('id'):
existing.append(record['id']) # Has both id and url - treat as existing
elif record.get('url'):
new.append(record)
self.assertEqual(len(existing), 2)
self.assertEqual(len(new), 1)
self.assertEqual(new[0]['url'], 'https://new-url.com')
class TestSnapshotCommand(unittest.TestCase):
"""Unit tests for archivebox snapshot command."""
def setUp(self):
"""Set up test environment."""
self.test_dir = tempfile.mkdtemp()
os.environ['DATA_DIR'] = self.test_dir
def tearDown(self):
"""Clean up test environment."""
shutil.rmtree(self.test_dir, ignore_errors=True)
def test_snapshot_accepts_url(self):
"""snapshot should accept URLs as input."""
from archivebox.misc.jsonl import read_args_or_stdin
args = ('https://example.com',)
records = list(read_args_or_stdin(args))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['url'], 'https://example.com')
def test_snapshot_accepts_jsonl_with_metadata(self):
"""snapshot should accept JSONL with tags and other metadata."""
from archivebox.misc.jsonl import read_args_or_stdin
stdin = StringIO('{"type": "Snapshot", "url": "https://example.com", "tags": "tag1,tag2", "title": "Test"}\n')
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['url'], 'https://example.com')
self.assertEqual(records[0]['tags'], 'tag1,tag2')
self.assertEqual(records[0]['title'], 'Test')
def test_snapshot_output_format(self):
"""snapshot output should include id and url."""
from archivebox.misc.jsonl import snapshot_to_jsonl
mock_snapshot = MagicMock()
mock_snapshot.id = 'test-id'
mock_snapshot.url = 'https://example.com'
mock_snapshot.title = 'Test'
mock_snapshot.tags_str.return_value = ''
mock_snapshot.bookmarked_at = None
mock_snapshot.created_at = None
mock_snapshot.timestamp = '123'
mock_snapshot.depth = 0
mock_snapshot.status = 'queued'
output = snapshot_to_jsonl(mock_snapshot)
self.assertIn('id', output)
self.assertIn('url', output)
self.assertEqual(output['type'], 'Snapshot')
class TestExtractCommand(unittest.TestCase):
"""Unit tests for archivebox extract command."""
def setUp(self):
"""Set up test environment."""
self.test_dir = tempfile.mkdtemp()
os.environ['DATA_DIR'] = self.test_dir
def tearDown(self):
"""Clean up test environment."""
shutil.rmtree(self.test_dir, ignore_errors=True)
def test_extract_accepts_snapshot_id(self):
"""extract should accept snapshot IDs as input."""
from archivebox.misc.jsonl import read_args_or_stdin
uuid = '01234567-89ab-cdef-0123-456789abcdef'
args = (uuid,)
records = list(read_args_or_stdin(args))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['id'], uuid)
def test_extract_accepts_jsonl_snapshot(self):
"""extract should accept JSONL Snapshot records."""
from archivebox.misc.jsonl import read_args_or_stdin, TYPE_SNAPSHOT
stdin = StringIO('{"type": "Snapshot", "id": "abc123", "url": "https://example.com"}\n')
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['type'], TYPE_SNAPSHOT)
self.assertEqual(records[0]['id'], 'abc123')
def test_extract_gathers_snapshot_ids(self):
"""extract should gather snapshot IDs from various input formats."""
from archivebox.misc.jsonl import TYPE_SNAPSHOT, TYPE_ARCHIVERESULT
records = [
{'type': TYPE_SNAPSHOT, 'id': 'snap-1'},
{'type': TYPE_SNAPSHOT, 'id': 'snap-2', 'url': 'https://example.com'},
{'type': TYPE_ARCHIVERESULT, 'snapshot_id': 'snap-3'},
{'id': 'snap-4'}, # Bare id
]
snapshot_ids = set()
for record in records:
record_type = record.get('type')
if record_type == TYPE_SNAPSHOT:
snapshot_id = record.get('id')
if snapshot_id:
snapshot_ids.add(snapshot_id)
elif record_type == TYPE_ARCHIVERESULT:
snapshot_id = record.get('snapshot_id')
if snapshot_id:
snapshot_ids.add(snapshot_id)
elif 'id' in record:
snapshot_ids.add(record['id'])
self.assertEqual(len(snapshot_ids), 4)
self.assertIn('snap-1', snapshot_ids)
self.assertIn('snap-2', snapshot_ids)
self.assertIn('snap-3', snapshot_ids)
self.assertIn('snap-4', snapshot_ids)
# =============================================================================
# URL Collection Tests
# =============================================================================
class TestURLCollection(unittest.TestCase):
"""Test collecting urls.jsonl from extractor output."""
def setUp(self):
"""Create test directory structure."""
self.test_dir = Path(tempfile.mkdtemp())
# Create fake extractor output directories with urls.jsonl
(self.test_dir / 'wget').mkdir()
(self.test_dir / 'wget' / 'urls.jsonl').write_text(
'{"url": "https://wget-link-1.com"}\n'
'{"url": "https://wget-link-2.com"}\n'
)
(self.test_dir / 'parse_html_urls').mkdir()
(self.test_dir / 'parse_html_urls' / 'urls.jsonl').write_text(
'{"url": "https://html-link-1.com"}\n'
'{"url": "https://html-link-2.com", "title": "HTML Link 2"}\n'
)
(self.test_dir / 'screenshot').mkdir()
# No urls.jsonl in screenshot dir - not a parser
def tearDown(self):
"""Clean up test directory."""
shutil.rmtree(self.test_dir, ignore_errors=True)
def test_collect_urls_from_extractors(self):
"""Should collect urls.jsonl from all extractor subdirectories."""
from archivebox.hooks import collect_urls_from_extractors
urls = collect_urls_from_extractors(self.test_dir)
self.assertEqual(len(urls), 4)
# Check that via_extractor is set
extractors = {u['via_extractor'] for u in urls}
self.assertIn('wget', extractors)
self.assertIn('parse_html_urls', extractors)
self.assertNotIn('screenshot', extractors) # No urls.jsonl
def test_collect_urls_preserves_metadata(self):
"""Should preserve metadata from urls.jsonl entries."""
from archivebox.hooks import collect_urls_from_extractors
urls = collect_urls_from_extractors(self.test_dir)
# Find the entry with title
titled = [u for u in urls if u.get('title') == 'HTML Link 2']
self.assertEqual(len(titled), 1)
self.assertEqual(titled[0]['url'], 'https://html-link-2.com')
def test_collect_urls_empty_dir(self):
"""Should handle empty or non-existent directories."""
from archivebox.hooks import collect_urls_from_extractors
empty_dir = self.test_dir / 'nonexistent'
urls = collect_urls_from_extractors(empty_dir)
self.assertEqual(len(urls), 0)
# =============================================================================
# Integration Tests
# =============================================================================
class TestPipingWorkflowIntegration(unittest.TestCase):
"""
Integration tests for the complete piping workflow.
These tests require Django to be set up and use the actual database.
"""
@classmethod
def setUpClass(cls):
"""Set up Django and test database."""
cls.test_dir = tempfile.mkdtemp()
os.environ['DATA_DIR'] = cls.test_dir
# Initialize Django
from archivebox.config.django import setup_django
setup_django()
# Initialize the archive
from archivebox.cli.archivebox_init import init
init()
@classmethod
def tearDownClass(cls):
"""Clean up test database."""
shutil.rmtree(cls.test_dir, ignore_errors=True)
def test_snapshot_creates_and_outputs_jsonl(self):
"""
Test: archivebox snapshot URL
Should create a Snapshot and output JSONL when piped.
"""
from core.models import Snapshot
from archivebox.misc.jsonl import (
read_args_or_stdin, write_record, snapshot_to_jsonl,
TYPE_SNAPSHOT, get_or_create_snapshot
)
from archivebox.base_models.models import get_or_create_system_user_pk
created_by_id = get_or_create_system_user_pk()
# Simulate input
url = 'https://test-snapshot-1.example.com'
records = list(read_args_or_stdin((url,)))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['url'], url)
# Create snapshot
snapshot = get_or_create_snapshot(records[0], created_by_id=created_by_id)
self.assertIsNotNone(snapshot.id)
self.assertEqual(snapshot.url, url)
# Verify output format
output = snapshot_to_jsonl(snapshot)
self.assertEqual(output['type'], TYPE_SNAPSHOT)
self.assertIn('id', output)
self.assertEqual(output['url'], url)
def test_extract_accepts_snapshot_from_previous_command(self):
"""
Test: archivebox snapshot URL | archivebox extract
Extract should accept JSONL output from snapshot command.
"""
from core.models import Snapshot, ArchiveResult
from archivebox.misc.jsonl import (
snapshot_to_jsonl, read_args_or_stdin, get_or_create_snapshot,
TYPE_SNAPSHOT
)
from archivebox.base_models.models import get_or_create_system_user_pk
created_by_id = get_or_create_system_user_pk()
# Step 1: Create snapshot (simulating 'archivebox snapshot')
url = 'https://test-extract-1.example.com'
snapshot = get_or_create_snapshot({'url': url}, created_by_id=created_by_id)
snapshot_output = snapshot_to_jsonl(snapshot)
# Step 2: Parse snapshot output as extract input
stdin = StringIO(json.dumps(snapshot_output) + '\n')
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['type'], TYPE_SNAPSHOT)
self.assertEqual(records[0]['id'], str(snapshot.id))
# Step 3: Gather snapshot IDs (as extract does)
snapshot_ids = set()
for record in records:
if record.get('type') == TYPE_SNAPSHOT and record.get('id'):
snapshot_ids.add(record['id'])
self.assertIn(str(snapshot.id), snapshot_ids)
def test_crawl_outputs_discovered_urls(self):
"""
Test: archivebox crawl URL
Should create snapshot, run plugins, output discovered URLs.
"""
from archivebox.hooks import collect_urls_from_extractors
from archivebox.misc.jsonl import TYPE_SNAPSHOT
# Create a mock snapshot directory with urls.jsonl
test_snapshot_dir = Path(self.test_dir) / 'archive' / 'test-crawl-snapshot'
test_snapshot_dir.mkdir(parents=True, exist_ok=True)
# Create mock extractor output
(test_snapshot_dir / 'parse_html_urls').mkdir()
(test_snapshot_dir / 'parse_html_urls' / 'urls.jsonl').write_text(
'{"url": "https://discovered-1.com"}\n'
'{"url": "https://discovered-2.com", "title": "Discovered 2"}\n'
)
# Collect URLs (as crawl does)
discovered = collect_urls_from_extractors(test_snapshot_dir)
self.assertEqual(len(discovered), 2)
# Add crawl metadata (as crawl does)
for entry in discovered:
entry['type'] = TYPE_SNAPSHOT
entry['depth'] = 1
entry['via_snapshot'] = 'test-crawl-snapshot'
# Verify output format
self.assertEqual(discovered[0]['type'], TYPE_SNAPSHOT)
self.assertEqual(discovered[0]['depth'], 1)
self.assertEqual(discovered[0]['url'], 'https://discovered-1.com')
def test_full_pipeline_snapshot_extract(self):
"""
Test: archivebox snapshot URL | archivebox extract
This is equivalent to: archivebox add URL
"""
from core.models import Snapshot
from archivebox.misc.jsonl import (
get_or_create_snapshot, snapshot_to_jsonl, read_args_or_stdin,
TYPE_SNAPSHOT
)
from archivebox.base_models.models import get_or_create_system_user_pk
created_by_id = get_or_create_system_user_pk()
# === archivebox snapshot https://example.com ===
url = 'https://test-pipeline-1.example.com'
snapshot = get_or_create_snapshot({'url': url}, created_by_id=created_by_id)
snapshot_jsonl = json.dumps(snapshot_to_jsonl(snapshot))
# === | archivebox extract ===
stdin = StringIO(snapshot_jsonl + '\n')
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
# Extract should receive the snapshot ID
self.assertEqual(len(records), 1)
self.assertEqual(records[0]['id'], str(snapshot.id))
# Verify snapshot exists in DB
db_snapshot = Snapshot.objects.get(id=snapshot.id)
self.assertEqual(db_snapshot.url, url)
def test_full_pipeline_crawl_snapshot_extract(self):
"""
Test: archivebox crawl URL | archivebox snapshot | archivebox extract
This is equivalent to: archivebox add --depth=1 URL
"""
from core.models import Snapshot
from archivebox.misc.jsonl import (
get_or_create_snapshot, snapshot_to_jsonl, read_args_or_stdin,
TYPE_SNAPSHOT
)
from archivebox.base_models.models import get_or_create_system_user_pk
from archivebox.hooks import collect_urls_from_extractors
created_by_id = get_or_create_system_user_pk()
# === archivebox crawl https://example.com ===
# Step 1: Create snapshot for starting URL
start_url = 'https://test-crawl-pipeline.example.com'
start_snapshot = get_or_create_snapshot({'url': start_url}, created_by_id=created_by_id)
# Step 2: Simulate extractor output with discovered URLs
snapshot_dir = Path(self.test_dir) / 'archive' / str(start_snapshot.timestamp)
snapshot_dir.mkdir(parents=True, exist_ok=True)
(snapshot_dir / 'parse_html_urls').mkdir(exist_ok=True)
(snapshot_dir / 'parse_html_urls' / 'urls.jsonl').write_text(
'{"url": "https://outlink-1.example.com"}\n'
'{"url": "https://outlink-2.example.com"}\n'
)
# Step 3: Collect discovered URLs (crawl output)
discovered = collect_urls_from_extractors(snapshot_dir)
crawl_output = []
for entry in discovered:
entry['type'] = TYPE_SNAPSHOT
entry['depth'] = 1
crawl_output.append(json.dumps(entry))
# === | archivebox snapshot ===
stdin = StringIO('\n'.join(crawl_output) + '\n')
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 2)
# Create snapshots for discovered URLs
created_snapshots = []
for record in records:
snap = get_or_create_snapshot(record, created_by_id=created_by_id)
created_snapshots.append(snap)
self.assertEqual(len(created_snapshots), 2)
# === | archivebox extract ===
snapshot_jsonl_lines = [json.dumps(snapshot_to_jsonl(s)) for s in created_snapshots]
stdin = StringIO('\n'.join(snapshot_jsonl_lines) + '\n')
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 2)
# Verify all snapshots exist in DB
for record in records:
db_snapshot = Snapshot.objects.get(id=record['id'])
self.assertIn(db_snapshot.url, [
'https://outlink-1.example.com',
'https://outlink-2.example.com'
])
class TestDepthWorkflows(unittest.TestCase):
"""Test various depth crawl workflows."""
@classmethod
def setUpClass(cls):
"""Set up Django and test database."""
cls.test_dir = tempfile.mkdtemp()
os.environ['DATA_DIR'] = cls.test_dir
from archivebox.config.django import setup_django
setup_django()
from archivebox.cli.archivebox_init import init
init()
@classmethod
def tearDownClass(cls):
"""Clean up test database."""
shutil.rmtree(cls.test_dir, ignore_errors=True)
def test_depth_0_workflow(self):
"""
Test: archivebox snapshot URL | archivebox extract
Depth 0: Only archive the specified URL, no crawling.
"""
from core.models import Snapshot
from archivebox.misc.jsonl import get_or_create_snapshot
from archivebox.base_models.models import get_or_create_system_user_pk
created_by_id = get_or_create_system_user_pk()
# Create snapshot
url = 'https://depth0-test.example.com'
snapshot = get_or_create_snapshot({'url': url}, created_by_id=created_by_id)
# Verify only one snapshot created
self.assertEqual(Snapshot.objects.filter(url=url).count(), 1)
self.assertEqual(snapshot.url, url)
def test_depth_1_workflow(self):
"""
Test: archivebox crawl URL | archivebox snapshot | archivebox extract
Depth 1: Archive URL + all outlinks from that URL.
"""
# This is tested in test_full_pipeline_crawl_snapshot_extract
pass
def test_depth_metadata_propagation(self):
"""Test that depth metadata propagates through the pipeline."""
from archivebox.misc.jsonl import TYPE_SNAPSHOT
# Simulate crawl output with depth metadata
crawl_output = [
{'type': TYPE_SNAPSHOT, 'url': 'https://hop1.com', 'depth': 1, 'via_snapshot': 'root'},
{'type': TYPE_SNAPSHOT, 'url': 'https://hop2.com', 'depth': 2, 'via_snapshot': 'hop1'},
]
# Verify depth is preserved
for entry in crawl_output:
self.assertIn('depth', entry)
self.assertIn('via_snapshot', entry)
class TestParserPluginWorkflows(unittest.TestCase):
"""Test workflows with specific parser plugins."""
@classmethod
def setUpClass(cls):
"""Set up Django and test database."""
cls.test_dir = tempfile.mkdtemp()
os.environ['DATA_DIR'] = cls.test_dir
from archivebox.config.django import setup_django
setup_django()
from archivebox.cli.archivebox_init import init
init()
@classmethod
def tearDownClass(cls):
"""Clean up test database."""
shutil.rmtree(cls.test_dir, ignore_errors=True)
def test_html_parser_workflow(self):
"""
Test: archivebox crawl --plugin=parse_html_urls URL | archivebox snapshot | archivebox extract
"""
from archivebox.hooks import collect_urls_from_extractors
from archivebox.misc.jsonl import TYPE_SNAPSHOT
# Create mock output directory
snapshot_dir = Path(self.test_dir) / 'archive' / 'html-parser-test'
snapshot_dir.mkdir(parents=True, exist_ok=True)
(snapshot_dir / 'parse_html_urls').mkdir(exist_ok=True)
(snapshot_dir / 'parse_html_urls' / 'urls.jsonl').write_text(
'{"url": "https://html-discovered.com", "title": "HTML Link"}\n'
)
# Collect URLs
discovered = collect_urls_from_extractors(snapshot_dir)
self.assertEqual(len(discovered), 1)
self.assertEqual(discovered[0]['url'], 'https://html-discovered.com')
self.assertEqual(discovered[0]['via_extractor'], 'parse_html_urls')
def test_rss_parser_workflow(self):
"""
Test: archivebox crawl --plugin=parse_rss_urls URL | archivebox snapshot | archivebox extract
"""
from archivebox.hooks import collect_urls_from_extractors
# Create mock output directory
snapshot_dir = Path(self.test_dir) / 'archive' / 'rss-parser-test'
snapshot_dir.mkdir(parents=True, exist_ok=True)
(snapshot_dir / 'parse_rss_urls').mkdir(exist_ok=True)
(snapshot_dir / 'parse_rss_urls' / 'urls.jsonl').write_text(
'{"url": "https://rss-item-1.com", "title": "RSS Item 1"}\n'
'{"url": "https://rss-item-2.com", "title": "RSS Item 2"}\n'
)
# Collect URLs
discovered = collect_urls_from_extractors(snapshot_dir)
self.assertEqual(len(discovered), 2)
self.assertTrue(all(d['via_extractor'] == 'parse_rss_urls' for d in discovered))
def test_multiple_parsers_dedupe(self):
"""
Multiple parsers may discover the same URL - should be deduplicated.
"""
from archivebox.hooks import collect_urls_from_extractors
# Create mock output with duplicate URLs from different parsers
snapshot_dir = Path(self.test_dir) / 'archive' / 'dedupe-test'
snapshot_dir.mkdir(parents=True, exist_ok=True)
(snapshot_dir / 'parse_html_urls').mkdir(exist_ok=True)
(snapshot_dir / 'parse_html_urls' / 'urls.jsonl').write_text(
'{"url": "https://same-url.com"}\n'
)
(snapshot_dir / 'wget').mkdir(exist_ok=True)
(snapshot_dir / 'wget' / 'urls.jsonl').write_text(
'{"url": "https://same-url.com"}\n' # Same URL, different extractor
)
# Collect URLs
all_discovered = collect_urls_from_extractors(snapshot_dir)
# Both entries are returned (deduplication happens at the crawl command level)
self.assertEqual(len(all_discovered), 2)
# Verify both extractors found the same URL
urls = {d['url'] for d in all_discovered}
self.assertEqual(urls, {'https://same-url.com'})
class TestEdgeCases(unittest.TestCase):
"""Test edge cases and error handling."""
def test_empty_input(self):
"""Commands should handle empty input gracefully."""
from archivebox.misc.jsonl import read_args_or_stdin
# Empty args, TTY stdin (should not block)
stdin = StringIO('')
stdin.isatty = lambda: True
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 0)
def test_malformed_jsonl(self):
"""Should skip malformed JSONL lines."""
from archivebox.misc.jsonl import read_args_or_stdin
stdin = StringIO(
'{"url": "https://good.com"}\n'
'not valid json\n'
'{"url": "https://also-good.com"}\n'
)
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 2)
urls = {r['url'] for r in records}
self.assertEqual(urls, {'https://good.com', 'https://also-good.com'})
def test_mixed_input_formats(self):
"""Should handle mixed URLs and JSONL."""
from archivebox.misc.jsonl import read_args_or_stdin
stdin = StringIO(
'https://plain-url.com\n'
'{"type": "Snapshot", "url": "https://jsonl-url.com", "tags": "test"}\n'
'01234567-89ab-cdef-0123-456789abcdef\n' # UUID
)
stdin.isatty = lambda: False
records = list(read_args_or_stdin((), stream=stdin))
self.assertEqual(len(records), 3)
# Plain URL
self.assertEqual(records[0]['url'], 'https://plain-url.com')
# JSONL with metadata
self.assertEqual(records[1]['url'], 'https://jsonl-url.com')
self.assertEqual(records[1]['tags'], 'test')
# UUID
self.assertEqual(records[2]['id'], '01234567-89ab-cdef-0123-456789abcdef')
if __name__ == '__main__':
unittest.main()

View File

@@ -1,6 +1,17 @@
"""
ArchiveBox config exports.
This module provides backwards-compatible config exports for extractors
and other modules that expect to import config values directly.
"""
__package__ = 'archivebox.config'
__order__ = 200
import shutil
from pathlib import Path
from typing import Dict, List, Optional
from .paths import (
PACKAGE_DIR, # noqa
DATA_DIR, # noqa
@@ -9,28 +20,219 @@ from .paths import (
from .constants import CONSTANTS, CONSTANTS_CONFIG, PACKAGE_DIR, DATA_DIR, ARCHIVE_DIR # noqa
from .version import VERSION # noqa
# import abx
# @abx.hookimpl
# def get_CONFIG():
# from .common import (
# SHELL_CONFIG,
# STORAGE_CONFIG,
# GENERAL_CONFIG,
# SERVER_CONFIG,
# ARCHIVING_CONFIG,
# SEARCH_BACKEND_CONFIG,
# )
# return {
# 'SHELL_CONFIG': SHELL_CONFIG,
# 'STORAGE_CONFIG': STORAGE_CONFIG,
# 'GENERAL_CONFIG': GENERAL_CONFIG,
# 'SERVER_CONFIG': SERVER_CONFIG,
# 'ARCHIVING_CONFIG': ARCHIVING_CONFIG,
# 'SEARCHBACKEND_CONFIG': SEARCH_BACKEND_CONFIG,
# }
###############################################################################
# Config value exports for extractors
# These provide backwards compatibility with extractors that import from ..config
###############################################################################
# @abx.hookimpl
# def ready():
# for config in get_CONFIG().values():
# config.validate()
def _get_config():
"""Lazy import to avoid circular imports."""
from .common import ARCHIVING_CONFIG, STORAGE_CONFIG
return ARCHIVING_CONFIG, STORAGE_CONFIG
# Direct exports (evaluated at import time for backwards compat)
# These are recalculated each time the module attribute is accessed
def __getattr__(name: str):
"""Module-level __getattr__ for lazy config loading."""
# Timeout settings
if name == 'TIMEOUT':
cfg, _ = _get_config()
return cfg.TIMEOUT
if name == 'MEDIA_TIMEOUT':
cfg, _ = _get_config()
return cfg.MEDIA_TIMEOUT
# SSL/Security settings
if name == 'CHECK_SSL_VALIDITY':
cfg, _ = _get_config()
return cfg.CHECK_SSL_VALIDITY
# Storage settings
if name == 'RESTRICT_FILE_NAMES':
_, storage = _get_config()
return storage.RESTRICT_FILE_NAMES
# User agent / cookies
if name == 'COOKIES_FILE':
cfg, _ = _get_config()
return cfg.COOKIES_FILE
if name == 'USER_AGENT':
cfg, _ = _get_config()
return cfg.USER_AGENT
if name == 'CURL_USER_AGENT':
cfg, _ = _get_config()
return cfg.USER_AGENT
if name == 'WGET_USER_AGENT':
cfg, _ = _get_config()
return cfg.USER_AGENT
if name == 'CHROME_USER_AGENT':
cfg, _ = _get_config()
return cfg.USER_AGENT
# Archive method toggles (SAVE_*)
if name == 'SAVE_TITLE':
return True
if name == 'SAVE_FAVICON':
return True
if name == 'SAVE_WGET':
return True
if name == 'SAVE_WARC':
return True
if name == 'SAVE_WGET_REQUISITES':
return True
if name == 'SAVE_SINGLEFILE':
return True
if name == 'SAVE_READABILITY':
return True
if name == 'SAVE_MERCURY':
return True
if name == 'SAVE_HTMLTOTEXT':
return True
if name == 'SAVE_PDF':
return True
if name == 'SAVE_SCREENSHOT':
return True
if name == 'SAVE_DOM':
return True
if name == 'SAVE_HEADERS':
return True
if name == 'SAVE_GIT':
return True
if name == 'SAVE_MEDIA':
return True
if name == 'SAVE_ARCHIVE_DOT_ORG':
return True
# Extractor-specific settings
if name == 'RESOLUTION':
cfg, _ = _get_config()
return cfg.RESOLUTION
if name == 'GIT_DOMAINS':
return 'github.com,bitbucket.org,gitlab.com,gist.github.com,codeberg.org,gitea.com,git.sr.ht'
if name == 'MEDIA_MAX_SIZE':
cfg, _ = _get_config()
return cfg.MEDIA_MAX_SIZE
if name == 'FAVICON_PROVIDER':
return 'https://www.google.com/s2/favicons?domain={}'
# Binary paths (use shutil.which for detection)
if name == 'CURL_BINARY':
return shutil.which('curl') or 'curl'
if name == 'WGET_BINARY':
return shutil.which('wget') or 'wget'
if name == 'GIT_BINARY':
return shutil.which('git') or 'git'
if name == 'YOUTUBEDL_BINARY':
return shutil.which('yt-dlp') or shutil.which('youtube-dl') or 'yt-dlp'
if name == 'CHROME_BINARY':
for chrome in ['chromium', 'chromium-browser', 'google-chrome', 'google-chrome-stable', 'chrome']:
path = shutil.which(chrome)
if path:
return path
return 'chromium'
if name == 'NODE_BINARY':
return shutil.which('node') or 'node'
if name == 'SINGLEFILE_BINARY':
return shutil.which('single-file') or shutil.which('singlefile') or 'single-file'
if name == 'READABILITY_BINARY':
return shutil.which('readability-extractor') or 'readability-extractor'
if name == 'MERCURY_BINARY':
return shutil.which('mercury-parser') or shutil.which('postlight-parser') or 'mercury-parser'
# Binary versions (return placeholder, actual version detection happens elsewhere)
if name == 'CURL_VERSION':
return 'curl'
if name == 'WGET_VERSION':
return 'wget'
if name == 'GIT_VERSION':
return 'git'
if name == 'YOUTUBEDL_VERSION':
return 'yt-dlp'
if name == 'CHROME_VERSION':
return 'chromium'
if name == 'SINGLEFILE_VERSION':
return 'singlefile'
if name == 'READABILITY_VERSION':
return 'readability'
if name == 'MERCURY_VERSION':
return 'mercury'
# Binary arguments
if name == 'CURL_ARGS':
return ['--silent', '--location', '--compressed']
if name == 'WGET_ARGS':
return [
'--no-verbose',
'--adjust-extension',
'--convert-links',
'--force-directories',
'--backup-converted',
'--span-hosts',
'--no-parent',
'-e', 'robots=off',
]
if name == 'GIT_ARGS':
return ['--recursive']
if name == 'YOUTUBEDL_ARGS':
cfg, _ = _get_config()
return [
'--write-description',
'--write-info-json',
'--write-annotations',
'--write-thumbnail',
'--no-call-home',
'--write-sub',
'--write-auto-subs',
'--convert-subs=srt',
'--yes-playlist',
'--continue',
'--no-abort-on-error',
'--ignore-errors',
'--geo-bypass',
'--add-metadata',
f'--format=(bv*+ba/b)[filesize<={cfg.MEDIA_MAX_SIZE}][filesize_approx<=?{cfg.MEDIA_MAX_SIZE}]/(bv*+ba/b)',
]
if name == 'SINGLEFILE_ARGS':
return None # Uses defaults
if name == 'CHROME_ARGS':
return []
# Other settings
if name == 'WGET_AUTO_COMPRESSION':
return True
if name == 'DEPENDENCIES':
return {} # Legacy, not used anymore
# Allowlist/Denylist patterns (compiled regexes)
if name == 'SAVE_ALLOWLIST_PTN':
cfg, _ = _get_config()
return cfg.SAVE_ALLOWLIST_PTNS
if name == 'SAVE_DENYLIST_PTN':
cfg, _ = _get_config()
return cfg.SAVE_DENYLIST_PTNS
raise AttributeError(f"module 'archivebox.config' has no attribute '{name}'")
# Re-export common config classes for direct imports
def get_CONFIG():
"""Get all config sections as a dict."""
from .common import (
SHELL_CONFIG,
STORAGE_CONFIG,
GENERAL_CONFIG,
SERVER_CONFIG,
ARCHIVING_CONFIG,
SEARCH_BACKEND_CONFIG,
)
return {
'SHELL_CONFIG': SHELL_CONFIG,
'STORAGE_CONFIG': STORAGE_CONFIG,
'GENERAL_CONFIG': GENERAL_CONFIG,
'SERVER_CONFIG': SERVER_CONFIG,
'ARCHIVING_CONFIG': ARCHIVING_CONFIG,
'SEARCHBACKEND_CONFIG': SEARCH_BACKEND_CONFIG,
}

View File

@@ -18,13 +18,8 @@ from archivebox.misc.logging import stderr
def get_real_name(key: str) -> str:
"""get the up-to-date canonical name for a given old alias or current key"""
CONFIGS = archivebox.pm.hook.get_CONFIGS()
for section in CONFIGS.values():
try:
return section.aliases[key]
except (KeyError, AttributeError):
pass
# Config aliases are no longer used with the simplified config system
# Just return the key as-is since we no longer have a complex alias mapping
return key
@@ -117,9 +112,20 @@ def load_config_file() -> Optional[benedict]:
def section_for_key(key: str) -> Any:
for config_section in archivebox.pm.hook.get_CONFIGS().values():
if hasattr(config_section, key):
return config_section
"""Find the config section containing a given key."""
from archivebox.config.common import (
SHELL_CONFIG,
STORAGE_CONFIG,
GENERAL_CONFIG,
SERVER_CONFIG,
ARCHIVING_CONFIG,
SEARCH_BACKEND_CONFIG,
)
for section in [SHELL_CONFIG, STORAGE_CONFIG, GENERAL_CONFIG,
SERVER_CONFIG, ARCHIVING_CONFIG, SEARCH_BACKEND_CONFIG]:
if hasattr(section, key):
return section
raise ValueError(f'No config section found for key: {key}')
@@ -178,7 +184,8 @@ def write_config_file(config: Dict[str, str]) -> benedict:
updated_config = {}
try:
# validate the updated_config by attempting to re-parse it
updated_config = {**load_all_config(), **archivebox.pm.hook.get_FLAT_CONFIG()}
from archivebox.config.configset import get_flat_config
updated_config = {**load_all_config(), **get_flat_config()}
except BaseException: # lgtm [py/catch-base-exception]
# something went horribly wrong, revert to the previous version
with open(f'{config_path}.bak', 'r', encoding='utf-8') as old:
@@ -236,12 +243,20 @@ def load_config(defaults: Dict[str, Any],
return benedict(extended_config)
def load_all_config():
import abx
"""Load all config sections and return as a flat dict."""
from archivebox.config.common import (
SHELL_CONFIG,
STORAGE_CONFIG,
GENERAL_CONFIG,
SERVER_CONFIG,
ARCHIVING_CONFIG,
SEARCH_BACKEND_CONFIG,
)
flat_config = benedict()
for config_section in abx.pm.hook.get_CONFIGS().values():
config_section.__init__()
for config_section in [SHELL_CONFIG, STORAGE_CONFIG, GENERAL_CONFIG,
SERVER_CONFIG, ARCHIVING_CONFIG, SEARCH_BACKEND_CONFIG]:
flat_config.update(dict(config_section))
return flat_config

View File

@@ -1,4 +1,4 @@
__package__ = 'archivebox.config'
__package__ = "archivebox.config"
import re
import sys
@@ -10,7 +10,7 @@ from rich import print
from pydantic import Field, field_validator
from django.utils.crypto import get_random_string
from abx_spec_config.base_configset import BaseConfigSet
from archivebox.config.configset import BaseConfigSet
from .constants import CONSTANTS
from .version import get_COMMIT_HASH, get_BUILD_TIME, VERSION
@@ -20,109 +20,127 @@ from .permissions import IN_DOCKER
class ShellConfig(BaseConfigSet):
DEBUG: bool = Field(default=lambda: '--debug' in sys.argv)
IS_TTY: bool = Field(default=sys.stdout.isatty())
USE_COLOR: bool = Field(default=lambda c: c.IS_TTY)
SHOW_PROGRESS: bool = Field(default=lambda c: c.IS_TTY)
IN_DOCKER: bool = Field(default=IN_DOCKER)
IN_QEMU: bool = Field(default=False)
toml_section_header: str = "SHELL_CONFIG"
ANSI: Dict[str, str] = Field(default=lambda c: CONSTANTS.DEFAULT_CLI_COLORS if c.USE_COLOR else CONSTANTS.DISABLED_CLI_COLORS)
DEBUG: bool = Field(default="--debug" in sys.argv)
IS_TTY: bool = Field(default=sys.stdout.isatty())
USE_COLOR: bool = Field(default=sys.stdout.isatty())
SHOW_PROGRESS: bool = Field(default=sys.stdout.isatty())
IN_DOCKER: bool = Field(default=IN_DOCKER)
IN_QEMU: bool = Field(default=False)
ANSI: Dict[str, str] = Field(
default_factory=lambda: CONSTANTS.DEFAULT_CLI_COLORS if sys.stdout.isatty() else CONSTANTS.DISABLED_CLI_COLORS
)
@property
def TERM_WIDTH(self) -> int:
if not self.IS_TTY:
return 200
return shutil.get_terminal_size((140, 10)).columns
@property
def COMMIT_HASH(self) -> Optional[str]:
return get_COMMIT_HASH()
@property
def BUILD_TIME(self) -> str:
return get_BUILD_TIME()
SHELL_CONFIG = ShellConfig()
class StorageConfig(BaseConfigSet):
toml_section_header: str = "STORAGE_CONFIG"
# TMP_DIR must be a local, fast, readable/writable dir by archivebox user,
# must be a short path due to unix path length restrictions for socket files (<100 chars)
# must be a local SSD/tmpfs for speed and because bind mounts/network mounts/FUSE dont support unix sockets
TMP_DIR: Path = Field(default=CONSTANTS.DEFAULT_TMP_DIR)
TMP_DIR: Path = Field(default=CONSTANTS.DEFAULT_TMP_DIR)
# LIB_DIR must be a local, fast, readable/writable dir by archivebox user,
# must be able to contain executable binaries (up to 5GB size)
# should not be a remote/network/FUSE mount for speed reasons, otherwise extractors will be slow
LIB_DIR: Path = Field(default=CONSTANTS.DEFAULT_LIB_DIR)
OUTPUT_PERMISSIONS: str = Field(default='644')
RESTRICT_FILE_NAMES: str = Field(default='windows')
ENFORCE_ATOMIC_WRITES: bool = Field(default=True)
LIB_DIR: Path = Field(default=CONSTANTS.DEFAULT_LIB_DIR)
OUTPUT_PERMISSIONS: str = Field(default="644")
RESTRICT_FILE_NAMES: str = Field(default="windows")
ENFORCE_ATOMIC_WRITES: bool = Field(default=True)
# not supposed to be user settable:
DIR_OUTPUT_PERMISSIONS: str = Field(default=lambda c: c['OUTPUT_PERMISSIONS'].replace('6', '7').replace('4', '5'))
DIR_OUTPUT_PERMISSIONS: str = Field(default="755") # computed from OUTPUT_PERMISSIONS
STORAGE_CONFIG = StorageConfig()
class GeneralConfig(BaseConfigSet):
TAG_SEPARATOR_PATTERN: str = Field(default=r'[,]')
toml_section_header: str = "GENERAL_CONFIG"
TAG_SEPARATOR_PATTERN: str = Field(default=r"[,]")
GENERAL_CONFIG = GeneralConfig()
class ServerConfig(BaseConfigSet):
SECRET_KEY: str = Field(default=lambda: get_random_string(50, 'abcdefghijklmnopqrstuvwxyz0123456789_'))
BIND_ADDR: str = Field(default=lambda: ['127.0.0.1:8000', '0.0.0.0:8000'][SHELL_CONFIG.IN_DOCKER])
ALLOWED_HOSTS: str = Field(default='*')
CSRF_TRUSTED_ORIGINS: str = Field(default=lambda c: 'http://localhost:8000,http://127.0.0.1:8000,http://0.0.0.0:8000,http://{}'.format(c.BIND_ADDR))
SNAPSHOTS_PER_PAGE: int = Field(default=40)
PREVIEW_ORIGINALS: bool = Field(default=True)
FOOTER_INFO: str = Field(default='Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.')
toml_section_header: str = "SERVER_CONFIG"
SECRET_KEY: str = Field(default_factory=lambda: get_random_string(50, "abcdefghijklmnopqrstuvwxyz0123456789_"))
BIND_ADDR: str = Field(default="127.0.0.1:8000")
ALLOWED_HOSTS: str = Field(default="*")
CSRF_TRUSTED_ORIGINS: str = Field(default="http://localhost:8000,http://127.0.0.1:8000,http://0.0.0.0:8000")
SNAPSHOTS_PER_PAGE: int = Field(default=40)
PREVIEW_ORIGINALS: bool = Field(default=True)
FOOTER_INFO: str = Field(
default="Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests."
)
# CUSTOM_TEMPLATES_DIR: Path = Field(default=None) # this is now a constant
PUBLIC_INDEX: bool = Field(default=True)
PUBLIC_SNAPSHOTS: bool = Field(default=True)
PUBLIC_ADD_VIEW: bool = Field(default=False)
ADMIN_USERNAME: str = Field(default=None)
ADMIN_PASSWORD: str = Field(default=None)
REVERSE_PROXY_USER_HEADER: str = Field(default='Remote-User')
REVERSE_PROXY_WHITELIST: str = Field(default='')
LOGOUT_REDIRECT_URL: str = Field(default='/')
PUBLIC_INDEX: bool = Field(default=True)
PUBLIC_SNAPSHOTS: bool = Field(default=True)
PUBLIC_ADD_VIEW: bool = Field(default=False)
ADMIN_USERNAME: Optional[str] = Field(default=None)
ADMIN_PASSWORD: Optional[str] = Field(default=None)
REVERSE_PROXY_USER_HEADER: str = Field(default="Remote-User")
REVERSE_PROXY_WHITELIST: str = Field(default="")
LOGOUT_REDIRECT_URL: str = Field(default="/")
SERVER_CONFIG = ServerConfig()
class ArchivingConfig(BaseConfigSet):
ONLY_NEW: bool = Field(default=True)
OVERWRITE: bool = Field(default=False)
TIMEOUT: int = Field(default=60)
MEDIA_TIMEOUT: int = Field(default=3600)
toml_section_header: str = "ARCHIVING_CONFIG"
ONLY_NEW: bool = Field(default=True)
OVERWRITE: bool = Field(default=False)
TIMEOUT: int = Field(default=60)
MEDIA_TIMEOUT: int = Field(default=3600)
MEDIA_MAX_SIZE: str = Field(default="750m")
RESOLUTION: str = Field(default="1440,2000")
CHECK_SSL_VALIDITY: bool = Field(default=True)
USER_AGENT: str = Field(
default=f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
)
COOKIES_FILE: Path | None = Field(default=None)
URL_DENYLIST: str = Field(default=r"\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$", alias="URL_BLACKLIST")
URL_ALLOWLIST: str | None = Field(default=None, alias="URL_WHITELIST")
SAVE_ALLOWLIST: Dict[str, List[str]] = Field(default={}) # mapping of regex patterns to list of archive methods
SAVE_DENYLIST: Dict[str, List[str]] = Field(default={})
DEFAULT_PERSONA: str = Field(default="Default")
MEDIA_MAX_SIZE: str = Field(default='750m')
RESOLUTION: str = Field(default='1440,2000')
CHECK_SSL_VALIDITY: bool = Field(default=True)
USER_AGENT: str = Field(default=f'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)')
COOKIES_FILE: Path | None = Field(default=None)
URL_DENYLIST: str = Field(default=r'\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$', alias='URL_BLACKLIST')
URL_ALLOWLIST: str | None = Field(default=None, alias='URL_WHITELIST')
SAVE_ALLOWLIST: Dict[str, List[str]] = Field(default={}) # mapping of regex patterns to list of archive methods
SAVE_DENYLIST: Dict[str, List[str]] = Field(default={})
DEFAULT_PERSONA: str = Field(default='Default')
# GIT_DOMAINS: str = Field(default='github.com,bitbucket.org,gitlab.com,gist.github.com,codeberg.org,gitea.com,git.sr.ht')
# WGET_USER_AGENT: str = Field(default=lambda c: c['USER_AGENT'] + ' wget/{WGET_VERSION}')
# CURL_USER_AGENT: str = Field(default=lambda c: c['USER_AGENT'] + ' curl/{CURL_VERSION}')
@@ -134,58 +152,70 @@ class ArchivingConfig(BaseConfigSet):
def validate(self):
if int(self.TIMEOUT) < 5:
print(f'[red][!] Warning: TIMEOUT is set too low! (currently set to TIMEOUT={self.TIMEOUT} seconds)[/red]', file=sys.stderr)
print(' You must allow *at least* 5 seconds for indexing and archive methods to run succesfully.', file=sys.stderr)
print(' (Setting it to somewhere between 30 and 3000 seconds is recommended)', file=sys.stderr)
print(f"[red][!] Warning: TIMEOUT is set too low! (currently set to TIMEOUT={self.TIMEOUT} seconds)[/red]", file=sys.stderr)
print(" You must allow *at least* 5 seconds for indexing and archive methods to run succesfully.", file=sys.stderr)
print(" (Setting it to somewhere between 30 and 3000 seconds is recommended)", file=sys.stderr)
print(file=sys.stderr)
print(' If you want to make ArchiveBox run faster, disable specific archive methods instead:', file=sys.stderr)
print(' https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#archive-method-toggles', file=sys.stderr)
print(" If you want to make ArchiveBox run faster, disable specific archive methods instead:", file=sys.stderr)
print(" https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#archive-method-toggles", file=sys.stderr)
print(file=sys.stderr)
@field_validator('CHECK_SSL_VALIDITY', mode='after')
@field_validator("CHECK_SSL_VALIDITY", mode="after")
def validate_check_ssl_validity(cls, v):
"""SIDE EFFECT: disable "you really shouldnt disable ssl" warnings emitted by requests"""
if not v:
import requests
import urllib3
requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
return v
@property
def URL_ALLOWLIST_PTN(self) -> re.Pattern | None:
return re.compile(self.URL_ALLOWLIST, CONSTANTS.ALLOWDENYLIST_REGEX_FLAGS) if self.URL_ALLOWLIST else None
@property
def URL_DENYLIST_PTN(self) -> re.Pattern:
return re.compile(self.URL_DENYLIST, CONSTANTS.ALLOWDENYLIST_REGEX_FLAGS)
@property
def SAVE_ALLOWLIST_PTNS(self) -> Dict[re.Pattern, List[str]]:
return {
# regexp: methods list
re.compile(key, CONSTANTS.ALLOWDENYLIST_REGEX_FLAGS): val
for key, val in self.SAVE_ALLOWLIST.items()
} if self.SAVE_ALLOWLIST else {}
return (
{
# regexp: methods list
re.compile(key, CONSTANTS.ALLOWDENYLIST_REGEX_FLAGS): val
for key, val in self.SAVE_ALLOWLIST.items()
}
if self.SAVE_ALLOWLIST
else {}
)
@property
def SAVE_DENYLIST_PTNS(self) -> Dict[re.Pattern, List[str]]:
return {
# regexp: methods list
re.compile(key, CONSTANTS.ALLOWDENYLIST_REGEX_FLAGS): val
for key, val in self.SAVE_DENYLIST.items()
} if self.SAVE_DENYLIST else {}
return (
{
# regexp: methods list
re.compile(key, CONSTANTS.ALLOWDENYLIST_REGEX_FLAGS): val
for key, val in self.SAVE_DENYLIST.items()
}
if self.SAVE_DENYLIST
else {}
)
ARCHIVING_CONFIG = ArchivingConfig()
class SearchBackendConfig(BaseConfigSet):
USE_INDEXING_BACKEND: bool = Field(default=True)
USE_SEARCHING_BACKEND: bool = Field(default=True)
SEARCH_BACKEND_ENGINE: str = Field(default='ripgrep')
SEARCH_PROCESS_HTML: bool = Field(default=True)
SEARCH_BACKEND_TIMEOUT: int = Field(default=10)
toml_section_header: str = "SEARCH_BACKEND_CONFIG"
USE_INDEXING_BACKEND: bool = Field(default=True)
USE_SEARCHING_BACKEND: bool = Field(default=True)
SEARCH_BACKEND_ENGINE: str = Field(default="ripgrep")
SEARCH_PROCESS_HTML: bool = Field(default=True)
SEARCH_BACKEND_TIMEOUT: int = Field(default=10)
SEARCH_BACKEND_CONFIG = SearchBackendConfig()

View File

@@ -0,0 +1,266 @@
"""
Simplified config system for ArchiveBox.
This replaces the complex abx_spec_config/base_configset.py with a simpler
approach that still supports environment variables, config files, and
per-object overrides.
"""
__package__ = "archivebox.config"
import os
import json
from pathlib import Path
from typing import Any, Dict, Optional, List, Type, TYPE_CHECKING, cast
from configparser import ConfigParser
from pydantic import Field
from pydantic_settings import BaseSettings
class BaseConfigSet(BaseSettings):
"""
Base class for config sections.
Automatically loads values from:
1. Environment variables (highest priority)
2. ArchiveBox.conf file (if exists)
3. Default values (lowest priority)
Subclasses define fields with defaults and types:
class ShellConfig(BaseConfigSet):
DEBUG: bool = Field(default=False)
USE_COLOR: bool = Field(default=True)
"""
class Config:
# Use env vars with ARCHIVEBOX_ prefix or raw name
env_prefix = ""
extra = "ignore"
validate_default = True
@classmethod
def load_from_file(cls, config_path: Path) -> Dict[str, str]:
"""Load config values from INI file."""
if not config_path.exists():
return {}
parser = ConfigParser()
parser.optionxform = lambda x: x # type: ignore # preserve case
parser.read(config_path)
# Flatten all sections into single namespace
return {key.upper(): value for section in parser.sections() for key, value in parser.items(section)}
def update_in_place(self, warn: bool = True, persist: bool = False, **kwargs) -> None:
"""
Update config values in place.
This allows runtime updates to config without reloading.
"""
for key, value in kwargs.items():
if hasattr(self, key):
# Use object.__setattr__ to bypass pydantic's frozen model
object.__setattr__(self, key, value)
def get_config(
scope: str = "global",
defaults: Optional[Dict] = None,
user: Any = None,
crawl: Any = None,
snapshot: Any = None,
) -> Dict[str, Any]:
"""
Get merged config from all sources.
Priority (highest to lowest):
1. Per-snapshot config (snapshot.config JSON field)
2. Per-crawl config (crawl.config JSON field)
3. Per-user config (user.config JSON field)
4. Environment variables
5. Config file (ArchiveBox.conf)
6. Plugin schema defaults (config.json)
7. Core config defaults
Args:
scope: Config scope ('global', 'crawl', 'snapshot', etc.)
defaults: Default values to start with
user: User object with config JSON field
crawl: Crawl object with config JSON field
snapshot: Snapshot object with config JSON field
Returns:
Merged config dict
"""
from archivebox.config.constants import CONSTANTS
from archivebox.config.common import (
SHELL_CONFIG,
STORAGE_CONFIG,
GENERAL_CONFIG,
SERVER_CONFIG,
ARCHIVING_CONFIG,
SEARCH_BACKEND_CONFIG,
)
# Start with defaults
config = dict(defaults or {})
# Add plugin config defaults from JSONSchema config.json files
try:
from archivebox.hooks import get_config_defaults_from_plugins
plugin_defaults = get_config_defaults_from_plugins()
config.update(plugin_defaults)
except ImportError:
pass # hooks not available yet during early startup
# Add all core config sections
config.update(dict(SHELL_CONFIG))
config.update(dict(STORAGE_CONFIG))
config.update(dict(GENERAL_CONFIG))
config.update(dict(SERVER_CONFIG))
config.update(dict(ARCHIVING_CONFIG))
config.update(dict(SEARCH_BACKEND_CONFIG))
# Load from config file
config_file = CONSTANTS.CONFIG_FILE
if config_file.exists():
file_config = BaseConfigSet.load_from_file(config_file)
config.update(file_config)
# Override with environment variables
for key in config:
env_val = os.environ.get(key)
if env_val is not None:
config[key] = _parse_env_value(env_val, config.get(key))
# Also check plugin config aliases in environment
try:
from archivebox.hooks import discover_plugin_configs
plugin_configs = discover_plugin_configs()
for plugin_name, schema in plugin_configs.items():
for key, prop_schema in schema.get('properties', {}).items():
# Check x-aliases
for alias in prop_schema.get('x-aliases', []):
if alias in os.environ and key not in os.environ:
config[key] = _parse_env_value(os.environ[alias], config.get(key))
break
# Check x-fallback
fallback = prop_schema.get('x-fallback')
if fallback and fallback in config and key not in config:
config[key] = config[fallback]
except ImportError:
pass
# Apply user config overrides
if user and hasattr(user, "config") and user.config:
config.update(user.config)
# Apply crawl config overrides
if crawl and hasattr(crawl, "config") and crawl.config:
config.update(crawl.config)
# Apply snapshot config overrides (highest priority)
if snapshot and hasattr(snapshot, "config") and snapshot.config:
config.update(snapshot.config)
return config
def get_flat_config() -> Dict[str, Any]:
"""
Get a flat dictionary of all config values.
Replaces abx.pm.hook.get_FLAT_CONFIG()
"""
return get_config(scope="global")
def get_all_configs() -> Dict[str, BaseConfigSet]:
"""
Get all config section objects as a dictionary.
Replaces abx.pm.hook.get_CONFIGS()
"""
from archivebox.config.common import (
SHELL_CONFIG, SERVER_CONFIG, ARCHIVING_CONFIG, SEARCH_BACKEND_CONFIG
)
return {
'SHELL_CONFIG': SHELL_CONFIG,
'SERVER_CONFIG': SERVER_CONFIG,
'ARCHIVING_CONFIG': ARCHIVING_CONFIG,
'SEARCH_BACKEND_CONFIG': SEARCH_BACKEND_CONFIG,
}
def _parse_env_value(value: str, default: Any = None) -> Any:
"""Parse an environment variable value based on expected type."""
if default is None:
# Try to guess the type
if value.lower() in ("true", "false", "yes", "no", "1", "0"):
return value.lower() in ("true", "yes", "1")
try:
return int(value)
except ValueError:
pass
try:
return json.loads(value)
except (json.JSONDecodeError, ValueError):
pass
return value
# Parse based on default's type
if isinstance(default, bool):
return value.lower() in ("true", "yes", "1")
elif isinstance(default, int):
return int(value)
elif isinstance(default, float):
return float(value)
elif isinstance(default, (list, dict)):
return json.loads(value)
elif isinstance(default, Path):
return Path(value)
else:
return value
# Default worker concurrency settings
DEFAULT_WORKER_CONCURRENCY = {
"crawl": 2,
"snapshot": 3,
"wget": 2,
"ytdlp": 2,
"screenshot": 3,
"singlefile": 2,
"title": 5,
"favicon": 5,
"headers": 5,
"archive_org": 2,
"readability": 3,
"mercury": 3,
"git": 2,
"pdf": 2,
"dom": 3,
}
def get_worker_concurrency() -> Dict[str, int]:
"""
Get worker concurrency settings.
Can be configured via WORKER_CONCURRENCY env var as JSON dict.
"""
config = get_config()
# Start with defaults
concurrency = DEFAULT_WORKER_CONCURRENCY.copy()
# Override with config
if "WORKER_CONCURRENCY" in config:
custom = config["WORKER_CONCURRENCY"]
if isinstance(custom, str):
custom = json.loads(custom)
concurrency.update(custom)
return concurrency

View File

@@ -1,6 +1,7 @@
__package__ = 'abx.archivebox'
__package__ = 'archivebox.config'
import os
import shutil
import inspect
from pathlib import Path
from typing import Any, List, Dict, cast
@@ -13,14 +14,22 @@ from django.utils.html import format_html, mark_safe
from admin_data_views.typing import TableContext, ItemContext
from admin_data_views.utils import render_with_table_view, render_with_item_view, ItemLink
import abx
import archivebox
from archivebox.config import CONSTANTS
from archivebox.misc.util import parse_date
from machine.models import InstalledBinary
# Common binaries to check for
KNOWN_BINARIES = [
'wget', 'curl', 'chromium', 'chrome', 'google-chrome', 'google-chrome-stable',
'node', 'npm', 'npx', 'yt-dlp', 'ytdlp', 'youtube-dl',
'git', 'singlefile', 'readability-extractor', 'mercury-parser',
'python3', 'python', 'bash', 'zsh',
'ffmpeg', 'ripgrep', 'rg', 'sonic', 'archivebox',
]
def obj_to_yaml(obj: Any, indent: int=0) -> str:
indent_str = " " * indent
if indent == 0:
@@ -62,65 +71,92 @@ def obj_to_yaml(obj: Any, indent: int=0) -> str:
else:
return f" {str(obj)}"
def get_detected_binaries() -> Dict[str, Dict[str, Any]]:
"""Detect available binaries using shutil.which."""
binaries = {}
for name in KNOWN_BINARIES:
path = shutil.which(name)
if path:
binaries[name] = {
'name': name,
'abspath': path,
'version': None, # Could add version detection later
'is_available': True,
}
return binaries
def get_filesystem_plugins() -> Dict[str, Dict[str, Any]]:
"""Discover plugins from filesystem directories."""
from archivebox.hooks import BUILTIN_PLUGINS_DIR, USER_PLUGINS_DIR
plugins = {}
for base_dir, source in [(BUILTIN_PLUGINS_DIR, 'builtin'), (USER_PLUGINS_DIR, 'user')]:
if not base_dir.exists():
continue
for plugin_dir in base_dir.iterdir():
if plugin_dir.is_dir() and not plugin_dir.name.startswith('_'):
plugin_id = f'{source}.{plugin_dir.name}'
# Find hook scripts
hooks = []
for ext in ('sh', 'py', 'js'):
hooks.extend(plugin_dir.glob(f'on_*__*.{ext}'))
plugins[plugin_id] = {
'id': plugin_id,
'name': plugin_dir.name,
'path': str(plugin_dir),
'source': source,
'hooks': [str(h.name) for h in hooks],
}
return plugins
@render_with_table_view
def binaries_list_view(request: HttpRequest, **kwargs) -> TableContext:
FLAT_CONFIG = archivebox.pm.hook.get_FLAT_CONFIG()
assert request.user.is_superuser, 'Must be a superuser to view configuration settings.'
rows = {
"Binary Name": [],
"Found Version": [],
"From Plugin": [],
"Provided By": [],
"Found Abspath": [],
"Related Configuration": [],
# "Overrides": [],
# "Description": [],
}
relevant_configs = {
key: val
for key, val in FLAT_CONFIG.items()
if '_BINARY' in key or '_VERSION' in key
}
for plugin_id, plugin in abx.get_all_plugins().items():
plugin = benedict(plugin)
if not hasattr(plugin.plugin, 'get_BINARIES'):
continue
# Get binaries from database (previously detected/installed)
db_binaries = {b.name: b for b in InstalledBinary.objects.all()}
# Get currently detectable binaries
detected = get_detected_binaries()
# Merge and display
all_binary_names = sorted(set(list(db_binaries.keys()) + list(detected.keys())))
for name in all_binary_names:
db_binary = db_binaries.get(name)
detected_binary = detected.get(name)
for binary in plugin.plugin.get_BINARIES().values():
try:
installed_binary = InstalledBinary.objects.get_from_db_or_cache(binary)
binary = installed_binary.load_from_db()
except Exception as e:
print(e)
rows['Binary Name'].append(ItemLink(binary.name, key=binary.name))
rows['Found Version'].append(f'{binary.loaded_version}' if binary.loaded_version else '❌ missing')
rows['From Plugin'].append(plugin.package)
rows['Provided By'].append(
', '.join(
f'[{binprovider.name}]' if binprovider.name == getattr(binary.loaded_binprovider, 'name', None) else binprovider.name
for binprovider in binary.binproviders_supported
if binprovider
)
# binary.loaded_binprovider.name
# if binary.loaded_binprovider else
# ', '.join(getattr(provider, 'name', str(provider)) for provider in binary.binproviders_supported)
)
rows['Found Abspath'].append(str(binary.loaded_abspath or '❌ missing'))
rows['Related Configuration'].append(mark_safe(', '.join(
f'<a href="/admin/environment/config/{config_key}/">{config_key}</a>'
for config_key, config_value in relevant_configs.items()
if str(binary.name).lower().replace('-', '').replace('_', '').replace('ytdlp', 'youtubedl') in config_key.lower()
or config_value.lower().endswith(binary.name.lower())
# or binary.name.lower().replace('-', '').replace('_', '') in str(config_value).lower()
)))
# if not binary.overrides:
# import ipdb; ipdb.set_trace()
# rows['Overrides'].append(str(obj_to_yaml(binary.overrides) or str(binary.overrides))[:200])
# rows['Description'].append(binary.description)
rows['Binary Name'].append(ItemLink(name, key=name))
if db_binary:
rows['Found Version'].append(f'{db_binary.version}' if db_binary.version else '✅ found')
rows['Provided By'].append(db_binary.binprovider or 'PATH')
rows['Found Abspath'].append(str(db_binary.abspath or ''))
elif detected_binary:
rows['Found Version'].append('✅ found')
rows['Provided By'].append('PATH')
rows['Found Abspath'].append(detected_binary['abspath'])
else:
rows['Found Version'].append('❌ missing')
rows['Provided By'].append('-')
rows['Found Abspath'].append('-')
return TableContext(
title="Binaries",
@@ -132,43 +168,65 @@ def binary_detail_view(request: HttpRequest, key: str, **kwargs) -> ItemContext:
assert request.user and request.user.is_superuser, 'Must be a superuser to view configuration settings.'
binary = None
plugin = None
for plugin_id, plugin in abx.get_all_plugins().items():
try:
for loaded_binary in plugin['hooks'].get_BINARIES().values():
if loaded_binary.name == key:
binary = loaded_binary
plugin = plugin
# break # last write wins
except Exception as e:
print(e)
assert plugin and binary, f'Could not find a binary matching the specified name: {key}'
# Try database first
try:
binary = binary.load()
except Exception as e:
print(e)
binary = InstalledBinary.objects.get(name=key)
return ItemContext(
slug=key,
title=key,
data=[
{
"name": binary.name,
"description": str(binary.abspath or ''),
"fields": {
'name': binary.name,
'binprovider': binary.binprovider,
'abspath': str(binary.abspath),
'version': binary.version,
'sha256': binary.sha256,
},
"help_texts": {},
},
],
)
except InstalledBinary.DoesNotExist:
pass
# Try to detect from PATH
path = shutil.which(key)
if path:
return ItemContext(
slug=key,
title=key,
data=[
{
"name": key,
"description": path,
"fields": {
'name': key,
'binprovider': 'PATH',
'abspath': path,
'version': 'unknown',
},
"help_texts": {},
},
],
)
return ItemContext(
slug=key,
title=key,
data=[
{
"name": binary.name,
"description": binary.abspath,
"name": key,
"description": "Binary not found",
"fields": {
'plugin': plugin['package'],
'binprovider': binary.loaded_binprovider,
'abspath': binary.loaded_abspath,
'version': binary.loaded_version,
'overrides': obj_to_yaml(binary.overrides),
'providers': obj_to_yaml(binary.binproviders_supported),
},
"help_texts": {
# TODO
'name': key,
'binprovider': 'not installed',
'abspath': 'not found',
'version': 'N/A',
},
"help_texts": {},
},
],
)
@@ -180,66 +238,26 @@ def plugins_list_view(request: HttpRequest, **kwargs) -> TableContext:
assert request.user.is_superuser, 'Must be a superuser to view configuration settings.'
rows = {
"Label": [],
"Version": [],
"Author": [],
"Package": [],
"Source Code": [],
"Config": [],
"Binaries": [],
"Package Managers": [],
# "Search Backends": [],
"Name": [],
"Source": [],
"Path": [],
"Hooks": [],
}
config_colors = {
'_BINARY': '#339',
'USE_': 'green',
'SAVE_': 'green',
'_ARGS': '#33e',
'KEY': 'red',
'COOKIES': 'red',
'AUTH': 'red',
'SECRET': 'red',
'TOKEN': 'red',
'PASSWORD': 'red',
'TIMEOUT': '#533',
'RETRIES': '#533',
'MAX': '#533',
'MIN': '#533',
}
def get_color(key):
for pattern, color in config_colors.items():
if pattern in key:
return color
return 'black'
plugins = get_filesystem_plugins()
for plugin_id, plugin in abx.get_all_plugins().items():
plugin.hooks.get_BINPROVIDERS = getattr(plugin.plugin, 'get_BINPROVIDERS', lambda: {})
plugin.hooks.get_BINARIES = getattr(plugin.plugin, 'get_BINARIES', lambda: {})
plugin.hooks.get_CONFIG = getattr(plugin.plugin, 'get_CONFIG', lambda: {})
rows['Label'].append(ItemLink(plugin.label, key=plugin.package))
rows['Version'].append(str(plugin.version))
rows['Author'].append(mark_safe(f'<a href="{plugin.homepage}" target="_blank">{plugin.author}</a>'))
rows['Package'].append(ItemLink(plugin.package, key=plugin.package))
rows['Source Code'].append(format_html('<code>{}</code>', str(plugin.source_code).replace(str(Path('~').expanduser()), '~')))
rows['Config'].append(mark_safe(''.join(
f'<a href="/admin/environment/config/{key}/"><b><code style="color: {get_color(key)};">{key}</code></b>=<code>{value}</code></a><br/>'
for configdict in plugin.hooks.get_CONFIG().values()
for key, value in benedict(configdict).items()
)))
rows['Binaries'].append(mark_safe(', '.join(
f'<a href="/admin/environment/binaries/{binary.name}/"><code>{binary.name}</code></a>'
for binary in plugin.hooks.get_BINARIES().values()
)))
rows['Package Managers'].append(mark_safe(', '.join(
f'<a href="/admin/environment/binproviders/{binprovider.name}/"><code>{binprovider.name}</code></a>'
for binprovider in plugin.hooks.get_BINPROVIDERS().values()
)))
# rows['Search Backends'].append(mark_safe(', '.join(
# f'<a href="/admin/environment/searchbackends/{searchbackend.name}/"><code>{searchbackend.name}</code></a>'
# for searchbackend in plugin.SEARCHBACKENDS.values()
# )))
for plugin_id, plugin in plugins.items():
rows['Name'].append(ItemLink(plugin['name'], key=plugin_id))
rows['Source'].append(plugin['source'])
rows['Path'].append(format_html('<code>{}</code>', plugin['path']))
rows['Hooks'].append(', '.join(plugin['hooks']) or '(none)')
if not plugins:
# Show a helpful message when no plugins found
rows['Name'].append('(no plugins found)')
rows['Source'].append('-')
rows['Path'].append(format_html('<code>archivebox/plugins/</code> or <code>data/plugins/</code>'))
rows['Hooks'].append('-')
return TableContext(
title="Installed plugins",
@@ -251,39 +269,31 @@ def plugin_detail_view(request: HttpRequest, key: str, **kwargs) -> ItemContext:
assert request.user.is_superuser, 'Must be a superuser to view configuration settings.'
plugins = abx.get_all_plugins()
plugin_id = None
for check_plugin_id, loaded_plugin in plugins.items():
if check_plugin_id.split('.')[-1] == key.split('.')[-1]:
plugin_id = check_plugin_id
break
assert plugin_id, f'Could not find a plugin matching the specified name: {key}'
plugin = abx.get_plugin(plugin_id)
plugins = get_filesystem_plugins()
plugin = plugins.get(key)
if not plugin:
return ItemContext(
slug=key,
title=f'Plugin not found: {key}',
data=[],
)
return ItemContext(
slug=key,
title=key,
title=plugin['name'],
data=[
{
"name": plugin.package,
"description": plugin.label,
"name": plugin['name'],
"description": plugin['path'],
"fields": {
"id": plugin.id,
"package": plugin.package,
"label": plugin.label,
"version": plugin.version,
"author": plugin.author,
"homepage": plugin.homepage,
"dependencies": getattr(plugin, 'DEPENDENCIES', []),
"source_code": plugin.source_code,
"hooks": plugin.hooks,
},
"help_texts": {
# TODO
"id": plugin['id'],
"name": plugin['name'],
"source": plugin['source'],
"path": plugin['path'],
"hooks": plugin['hooks'],
},
"help_texts": {},
},
],
)
@@ -333,22 +343,6 @@ def worker_list_view(request: HttpRequest, **kwargs) -> TableContext:
# Add a row for each worker process managed by supervisord
for proc in cast(List[Dict[str, Any]], supervisor.getAllProcessInfo()):
proc = benedict(proc)
# {
# "name": "daphne",
# "group": "daphne",
# "start": 1725933056,
# "stop": 0,
# "now": 1725933438,
# "state": 20,
# "statename": "RUNNING",
# "spawnerr": "",
# "exitstatus": 0,
# "logfile": "logs/server.log",
# "stdout_logfile": "logs/server.log",
# "stderr_logfile": "",
# "pid": 33283,
# "description": "pid 33283, uptime 0:06:22",
# }
rows["Name"].append(ItemLink(proc.name, key=proc.name))
rows["State"].append(proc.statename)
rows['PID'].append(proc.description.replace('pid ', ''))

View File

@@ -1,16 +1,13 @@
__package__ = 'archivebox.core'
__order__ = 100
import abx
@abx.hookimpl
def register_admin(admin_site):
"""Register the core.models views (Snapshot, ArchiveResult, Tag, etc.) with the admin site"""
from core.admin import register_admin
register_admin(admin_site)
from core.admin import register_admin as do_register
do_register(admin_site)
@abx.hookimpl
def get_CONFIG():
from archivebox.config.common import (
SHELL_CONFIG,
@@ -28,4 +25,3 @@ def get_CONFIG():
'ARCHIVING_CONFIG': ARCHIVING_CONFIG,
'SEARCHBACKEND_CONFIG': SEARCH_BACKEND_CONFIG,
}

View File

@@ -9,10 +9,7 @@ from core.admin_snapshots import SnapshotAdmin
from core.admin_archiveresults import ArchiveResultAdmin
from core.admin_users import UserAdmin
import abx
@abx.hookimpl
def register_admin(admin_site):
admin_site.register(get_user_model(), UserAdmin)
admin_site.register(ArchiveResult, ArchiveResultAdmin)

View File

@@ -11,8 +11,6 @@ from django.utils import timezone
from huey_monitor.admin import TaskModel
import abx
from archivebox.config import DATA_DIR
from archivebox.config.common import SERVER_CONFIG
from archivebox.misc.paginators import AccelleratedPaginator
@@ -43,7 +41,6 @@ class ArchiveResultInline(admin.TabularInline):
ordering = ('end_ts',)
show_change_link = True
# # classes = ['collapse']
# # list_display_links = ['abid']
def get_parent_object_from_request(self, request):
resolved = resolve(request.path_info)
@@ -80,7 +77,7 @@ class ArchiveResultInline(admin.TabularInline):
formset.form.base_fields['start_ts'].initial = timezone.now()
formset.form.base_fields['end_ts'].initial = timezone.now()
formset.form.base_fields['cmd_version'].initial = '-'
formset.form.base_fields['pwd'].initial = str(snapshot.link_dir)
formset.form.base_fields['pwd'].initial = str(snapshot.output_dir)
formset.form.base_fields['created_by'].initial = request.user
formset.form.base_fields['cmd'].initial = '["-"]'
formset.form.base_fields['output'].initial = 'Manually recorded cmd output...'
@@ -193,6 +190,5 @@ class ArchiveResultAdmin(BaseModelAdmin):
@abx.hookimpl
def register_admin(admin_site):
admin_site.register(ArchiveResult, ArchiveResultAdmin)

View File

@@ -36,7 +36,7 @@ def register_admin_site():
admin.site = archivebox_admin
sites.site = archivebox_admin
# register all plugins admin classes
archivebox.pm.hook.register_admin(admin_site=archivebox_admin)
# Plugin admin registration is now handled by individual app admins
# No longer using archivebox.pm.hook.register_admin()
return archivebox_admin

View File

@@ -19,11 +19,9 @@ from archivebox.misc.util import htmldecode, urldecode
from archivebox.misc.paginators import AccelleratedPaginator
from archivebox.misc.logging_util import printable_filesize
from archivebox.search.admin import SearchResultsAdminMixin
from archivebox.index.html import snapshot_icons
from archivebox.extractors import archive_links
from archivebox.base_models.admin import BaseModelAdmin
from archivebox.workers.tasks import bg_archive_links, bg_add
from archivebox.base_models.admin import BaseModelAdmin, ConfigEditorMixin
from archivebox.workers.tasks import bg_archive_snapshots, bg_add
from core.models import Tag
from core.admin_tags import TagInline
@@ -53,13 +51,13 @@ class SnapshotActionForm(ActionForm):
# )
class SnapshotAdmin(SearchResultsAdminMixin, BaseModelAdmin):
class SnapshotAdmin(SearchResultsAdminMixin, ConfigEditorMixin, BaseModelAdmin):
list_display = ('created_at', 'title_str', 'status', 'files', 'size', 'url_str')
sort_fields = ('title_str', 'url_str', 'created_at', 'status', 'crawl')
readonly_fields = ('admin_actions', 'status_info', 'tags_str', 'imported_timestamp', 'created_at', 'modified_at', 'downloaded_at', 'link_dir')
readonly_fields = ('admin_actions', 'status_info', 'tags_str', 'imported_timestamp', 'created_at', 'modified_at', 'downloaded_at', 'link_dir', 'available_config_options')
search_fields = ('id', 'url', 'timestamp', 'title', 'tags__name')
list_filter = ('created_at', 'downloaded_at', 'archiveresult__status', 'created_by', 'tags__name')
fields = ('url', 'title', 'created_by', 'bookmarked_at', 'status', 'retry_at', 'crawl', *readonly_fields)
fields = ('url', 'title', 'created_by', 'bookmarked_at', 'status', 'retry_at', 'crawl', 'config', 'available_config_options', *readonly_fields[:-1])
ordering = ['-created_at']
actions = ['add_tags', 'remove_tags', 'update_titles', 'update_snapshots', 'resnapshot_snapshot', 'overwrite_snapshots', 'delete_snapshots']
inlines = [TagInline, ArchiveResultInline]
@@ -196,14 +194,14 @@ class SnapshotAdmin(SearchResultsAdminMixin, BaseModelAdmin):
)
def files(self, obj):
# return '-'
return snapshot_icons(obj)
return obj.icons()
@admin.display(
# ordering='archiveresult_count'
)
def size(self, obj):
archive_size = os.access(Path(obj.link_dir) / 'index.html', os.F_OK) and obj.archive_size
archive_size = os.access(Path(obj.output_dir) / 'index.html', os.F_OK) and obj.archive_size
if archive_size:
size_txt = printable_filesize(archive_size)
if archive_size > 52428800:
@@ -261,30 +259,27 @@ class SnapshotAdmin(SearchResultsAdminMixin, BaseModelAdmin):
description=" Get Title"
)
def update_titles(self, request, queryset):
links = [snapshot.as_link() for snapshot in queryset]
if len(links) < 3:
# run syncronously if there are only 1 or 2 links
archive_links(links, overwrite=True, methods=('title','favicon'), out_dir=DATA_DIR)
messages.success(request, f"Title and favicon have been fetched and saved for {len(links)} URLs.")
else:
# otherwise run in a background worker
result = bg_archive_links((links,), kwargs={"overwrite": True, "methods": ["title", "favicon"], "out_dir": DATA_DIR})
messages.success(
request,
mark_safe(f"Title and favicon are updating in the background for {len(links)} URLs. {result_url(result)}"),
)
from core.models import Snapshot
count = queryset.count()
# Queue snapshots for archiving via the state machine system
result = bg_archive_snapshots(queryset, kwargs={"overwrite": True, "methods": ["title", "favicon"], "out_dir": DATA_DIR})
messages.success(
request,
mark_safe(f"Title and favicon are updating in the background for {count} URLs. {result_url(result)}"),
)
@admin.action(
description="⬇️ Get Missing"
)
def update_snapshots(self, request, queryset):
links = [snapshot.as_link() for snapshot in queryset]
count = queryset.count()
result = bg_archive_links((links,), kwargs={"overwrite": False, "out_dir": DATA_DIR})
result = bg_archive_snapshots(queryset, kwargs={"overwrite": False, "out_dir": DATA_DIR})
messages.success(
request,
mark_safe(f"Re-trying any previously failed methods for {len(links)} URLs in the background. {result_url(result)}"),
mark_safe(f"Re-trying any previously failed methods for {count} URLs in the background. {result_url(result)}"),
)
@@ -307,13 +302,13 @@ class SnapshotAdmin(SearchResultsAdminMixin, BaseModelAdmin):
description="🔄 Redo"
)
def overwrite_snapshots(self, request, queryset):
links = [snapshot.as_link() for snapshot in queryset]
count = queryset.count()
result = bg_archive_links((links,), kwargs={"overwrite": True, "out_dir": DATA_DIR})
result = bg_archive_snapshots(queryset, kwargs={"overwrite": True, "out_dir": DATA_DIR})
messages.success(
request,
mark_safe(f"Clearing all previous results and re-downloading {len(links)} URLs in the background. {result_url(result)}"),
mark_safe(f"Clearing all previous results and re-downloading {count} URLs in the background. {result_url(result)}"),
)
@admin.action(

View File

@@ -3,8 +3,6 @@ __package__ = 'archivebox.core'
from django.contrib import admin
from django.utils.html import format_html, mark_safe
import abx
from archivebox.misc.paginators import AccelleratedPaginator
from archivebox.base_models.admin import BaseModelAdmin
@@ -150,7 +148,7 @@ class TagAdmin(BaseModelAdmin):
# @admin.register(SnapshotTag, site=archivebox_admin)
# class SnapshotTagAdmin(ABIDModelAdmin):
# class SnapshotTagAdmin(BaseModelAdmin):
# list_display = ('id', 'snapshot', 'tag')
# sort_fields = ('id', 'snapshot', 'tag')
# search_fields = ('id', 'snapshot_id', 'tag_id')
@@ -159,7 +157,6 @@ class TagAdmin(BaseModelAdmin):
# ordering = ['-id']
@abx.hookimpl
def register_admin(admin_site):
admin_site.register(Tag, TagAdmin)

View File

@@ -5,8 +5,6 @@ from django.contrib.auth.admin import UserAdmin
from django.utils.html import format_html, mark_safe
from django.contrib.auth import get_user_model
import abx
class CustomUserAdmin(UserAdmin):
sort_fields = ['id', 'email', 'username', 'is_superuser', 'last_login', 'date_joined']
@@ -86,6 +84,5 @@ class CustomUserAdmin(UserAdmin):
@abx.hookimpl
def register_admin(admin_site):
admin_site.register(get_user_model(), CustomUserAdmin)

View File

@@ -2,17 +2,12 @@ __package__ = 'archivebox.core'
from django.apps import AppConfig
import archivebox
class CoreConfig(AppConfig):
name = 'core'
def ready(self):
"""Register the archivebox.core.admin_site as the main django admin site"""
from django.conf import settings
archivebox.pm.hook.ready(settings=settings)
from core.admin_site import register_admin_site
register_admin_site()

View File

@@ -3,37 +3,34 @@ __package__ = 'archivebox.core'
from django import forms
from archivebox.misc.util import URL_REGEX
from ..parsers import PARSERS
from taggit.utils import edit_string_for_tags, parse_tags
PARSER_CHOICES = [
(parser_key, parser[0])
for parser_key, parser in PARSERS.items()
]
DEPTH_CHOICES = (
('0', 'depth = 0 (archive just these URLs)'),
('1', 'depth = 1 (archive these URLs and all URLs one hop away)'),
)
from ..extractors import get_default_archive_methods
from archivebox.hooks import get_extractors
ARCHIVE_METHODS = [
(name, name)
for name, _, _ in get_default_archive_methods()
]
def get_archive_methods():
"""Get available archive methods from discovered hooks."""
return [(name, name) for name in get_extractors()]
class AddLinkForm(forms.Form):
url = forms.RegexField(label="URLs (one per line)", regex=URL_REGEX, min_length='6', strip=True, widget=forms.Textarea, required=True)
parser = forms.ChoiceField(label="URLs format", choices=[('auto', 'Auto-detect parser'), *PARSER_CHOICES], initial='auto')
tag = forms.CharField(label="Tags (comma separated tag1,tag2,tag3)", strip=True, required=False)
depth = forms.ChoiceField(label="Archive depth", choices=DEPTH_CHOICES, initial='0', widget=forms.RadioSelect(attrs={"class": "depth-selection"}))
archive_methods = forms.MultipleChoiceField(
label="Archive methods (select at least 1, otherwise all will be used by default)",
required=False,
widget=forms.SelectMultiple,
choices=ARCHIVE_METHODS,
choices=[], # populated dynamically in __init__
)
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.fields['archive_methods'].choices = get_archive_methods()
# TODO: hook these up to the view and put them
# in a collapsible UI section labeled "Advanced"
#

View File

@@ -1,18 +1,14 @@
# Generated by Django 3.0.8 on 2020-11-04 12:25
import os
import json
from pathlib import Path
from django.db import migrations, models
import django.db.models.deletion
from config import CONFIG
from index.json import to_json
DATA_DIR = Path(os.getcwd()).resolve() # archivebox user data dir
ARCHIVE_DIR = DATA_DIR / 'archive' # archivebox snapshot data dir
try:
JSONField = models.JSONField
except AttributeError:
@@ -21,12 +17,14 @@ except AttributeError:
def forwards_func(apps, schema_editor):
from core.models import EXTRACTORS
Snapshot = apps.get_model("core", "Snapshot")
ArchiveResult = apps.get_model("core", "ArchiveResult")
snapshots = Snapshot.objects.all()
for snapshot in snapshots:
out_dir = ARCHIVE_DIR / snapshot.timestamp
out_dir = Path(CONFIG['ARCHIVE_DIR']) / snapshot.timestamp
try:
with open(out_dir / "index.json", "r") as f:
@@ -61,7 +59,7 @@ def forwards_func(apps, schema_editor):
def verify_json_index_integrity(snapshot):
results = snapshot.archiveresult_set.all()
out_dir = ARCHIVE_DIR / snapshot.timestamp
out_dir = Path(CONFIG['ARCHIVE_DIR']) / snapshot.timestamp
with open(out_dir / "index.json", "r") as f:
index = json.load(f)

View File

@@ -1,58 +0,0 @@
# Generated by Django 5.0.6 on 2024-05-13 10:56
import charidfield.fields
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0022_auto_20231023_2008'),
]
operations = [
migrations.AlterModelOptions(
name='archiveresult',
options={'verbose_name': 'Result'},
),
migrations.AddField(
model_name='archiveresult',
name='abid',
field=charidfield.fields.CharIDField(blank=True, db_index=True, default=None, help_text='ABID-format identifier for this entity (e.g. snp_01BJQMF54D093DXEAWZ6JYRPAQ)', max_length=30, null=True, prefix='res_', unique=True),
),
migrations.AddField(
model_name='snapshot',
name='abid',
field=charidfield.fields.CharIDField(blank=True, db_index=True, default=None, help_text='ABID-format identifier for this entity (e.g. snp_01BJQMF54D093DXEAWZ6JYRPAQ)', max_length=30, null=True, prefix='snp_', unique=True),
),
migrations.AddField(
model_name='snapshot',
name='uuid',
field=models.UUIDField(blank=True, null=True, unique=True),
),
migrations.AddField(
model_name='tag',
name='abid',
field=charidfield.fields.CharIDField(blank=True, db_index=True, default=None, help_text='ABID-format identifier for this entity (e.g. snp_01BJQMF54D093DXEAWZ6JYRPAQ)', max_length=30, null=True, prefix='tag_', unique=True),
),
migrations.AlterField(
model_name='archiveresult',
name='extractor',
field=models.CharField(choices=(
('htmltotext', 'htmltotext'),
('git', 'git'),
('singlefile', 'singlefile'),
('media', 'media'),
('archive_org', 'archive_org'),
('readability', 'readability'),
('mercury', 'mercury'),
('favicon', 'favicon'),
('pdf', 'pdf'),
('headers', 'headers'),
('screenshot', 'screenshot'),
('dom', 'dom'),
('title', 'title'),
('wget', 'wget'),
), max_length=32),
),
]

View File

@@ -0,0 +1,466 @@
# Generated by Django 5.0.6 on 2024-12-25
# Transforms schema from 0022 to new simplified schema (ABID system removed)
from uuid import uuid4
from django.conf import settings
from django.db import migrations, models
import django.db.models.deletion
import django.utils.timezone
def get_or_create_system_user_pk(apps, schema_editor):
"""Get or create system user for migrations."""
User = apps.get_model('auth', 'User')
user, _ = User.objects.get_or_create(
username='system',
defaults={'is_active': False, 'password': '!'}
)
return user.pk
def populate_created_by_snapshot(apps, schema_editor):
"""Populate created_by for existing snapshots."""
User = apps.get_model('auth', 'User')
Snapshot = apps.get_model('core', 'Snapshot')
system_user, _ = User.objects.get_or_create(
username='system',
defaults={'is_active': False, 'password': '!'}
)
Snapshot.objects.filter(created_by__isnull=True).update(created_by=system_user)
def populate_created_by_archiveresult(apps, schema_editor):
"""Populate created_by for existing archive results."""
User = apps.get_model('auth', 'User')
ArchiveResult = apps.get_model('core', 'ArchiveResult')
system_user, _ = User.objects.get_or_create(
username='system',
defaults={'is_active': False, 'password': '!'}
)
ArchiveResult.objects.filter(created_by__isnull=True).update(created_by=system_user)
def populate_created_by_tag(apps, schema_editor):
"""Populate created_by for existing tags."""
User = apps.get_model('auth', 'User')
Tag = apps.get_model('core', 'Tag')
system_user, _ = User.objects.get_or_create(
username='system',
defaults={'is_active': False, 'password': '!'}
)
Tag.objects.filter(created_by__isnull=True).update(created_by=system_user)
def generate_uuid_for_archiveresults(apps, schema_editor):
"""Generate UUIDs for archive results that don't have them."""
ArchiveResult = apps.get_model('core', 'ArchiveResult')
for ar in ArchiveResult.objects.filter(uuid__isnull=True).iterator(chunk_size=500):
ar.uuid = uuid4()
ar.save(update_fields=['uuid'])
def generate_uuid_for_tags(apps, schema_editor):
"""Generate UUIDs for tags that don't have them."""
Tag = apps.get_model('core', 'Tag')
for tag in Tag.objects.filter(uuid__isnull=True).iterator(chunk_size=500):
tag.uuid = uuid4()
tag.save(update_fields=['uuid'])
def copy_bookmarked_at_from_added(apps, schema_editor):
"""Copy added timestamp to bookmarked_at."""
Snapshot = apps.get_model('core', 'Snapshot')
Snapshot.objects.filter(bookmarked_at__isnull=True).update(
bookmarked_at=models.F('added')
)
def copy_created_at_from_added(apps, schema_editor):
"""Copy added timestamp to created_at for snapshots."""
Snapshot = apps.get_model('core', 'Snapshot')
Snapshot.objects.filter(created_at__isnull=True).update(
created_at=models.F('added')
)
def copy_created_at_from_start_ts(apps, schema_editor):
"""Copy start_ts to created_at for archive results."""
ArchiveResult = apps.get_model('core', 'ArchiveResult')
ArchiveResult.objects.filter(created_at__isnull=True).update(
created_at=models.F('start_ts')
)
class Migration(migrations.Migration):
"""
This migration transforms the schema from the main branch (0022) to the new
simplified schema without the ABID system.
For dev branch users who had ABID migrations (0023-0074), this replaces them
with a clean transformation.
"""
replaces = [
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
('core', '0025_alter_archiveresult_uuid'),
('core', '0026_archiveresult_created_archiveresult_created_by_and_more'),
('core', '0027_update_snapshot_ids'),
('core', '0028_alter_archiveresult_uuid'),
('core', '0029_alter_archiveresult_id'),
('core', '0030_alter_archiveresult_uuid'),
('core', '0031_alter_archiveresult_id_alter_archiveresult_uuid_and_more'),
('core', '0032_alter_archiveresult_id'),
('core', '0033_rename_id_archiveresult_old_id'),
('core', '0034_alter_archiveresult_old_id_alter_archiveresult_uuid'),
('core', '0035_remove_archiveresult_uuid_archiveresult_id'),
('core', '0036_alter_archiveresult_id_alter_archiveresult_old_id'),
('core', '0037_rename_id_snapshot_old_id'),
('core', '0038_rename_uuid_snapshot_id'),
('core', '0039_rename_snapshot_archiveresult_snapshot_old'),
('core', '0040_archiveresult_snapshot'),
('core', '0041_alter_archiveresult_snapshot_and_more'),
('core', '0042_remove_archiveresult_snapshot_old'),
('core', '0043_alter_archiveresult_snapshot_alter_snapshot_id_and_more'),
('core', '0044_alter_archiveresult_snapshot_alter_tag_uuid_and_more'),
('core', '0045_alter_snapshot_old_id'),
('core', '0046_alter_archiveresult_snapshot_alter_snapshot_id_and_more'),
('core', '0047_alter_snapshottag_unique_together_and_more'),
('core', '0048_alter_archiveresult_snapshot_and_more'),
('core', '0049_rename_snapshot_snapshottag_snapshot_old_and_more'),
('core', '0050_alter_snapshottag_snapshot_old'),
('core', '0051_snapshottag_snapshot_alter_snapshottag_snapshot_old'),
('core', '0052_alter_snapshottag_unique_together_and_more'),
('core', '0053_remove_snapshottag_snapshot_old'),
('core', '0054_alter_snapshot_timestamp'),
('core', '0055_alter_tag_slug'),
('core', '0056_remove_tag_uuid'),
('core', '0057_rename_id_tag_old_id'),
('core', '0058_alter_tag_old_id'),
('core', '0059_tag_id'),
('core', '0060_alter_tag_id'),
('core', '0061_rename_tag_snapshottag_old_tag_and_more'),
('core', '0062_alter_snapshottag_old_tag'),
('core', '0063_snapshottag_tag_alter_snapshottag_old_tag'),
('core', '0064_alter_snapshottag_unique_together_and_more'),
('core', '0065_remove_snapshottag_old_tag'),
('core', '0066_alter_snapshottag_tag_alter_tag_id_alter_tag_old_id'),
('core', '0067_alter_snapshottag_tag'),
('core', '0068_alter_archiveresult_options'),
('core', '0069_alter_archiveresult_created_alter_snapshot_added_and_more'),
('core', '0070_alter_archiveresult_created_by_alter_snapshot_added_and_more'),
('core', '0071_remove_archiveresult_old_id_remove_snapshot_old_id_and_more'),
('core', '0072_rename_added_snapshot_bookmarked_at_and_more'),
('core', '0073_rename_created_archiveresult_created_at_and_more'),
('core', '0074_alter_snapshot_downloaded_at'),
]
dependencies = [
('core', '0022_auto_20231023_2008'),
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
# === SNAPSHOT CHANGES ===
# Add new fields to Snapshot
migrations.AddField(
model_name='snapshot',
name='created_by',
field=models.ForeignKey(
default=None, null=True, blank=True,
on_delete=django.db.models.deletion.CASCADE,
related_name='snapshot_set',
to=settings.AUTH_USER_MODEL,
),
),
migrations.AddField(
model_name='snapshot',
name='created_at',
field=models.DateTimeField(default=django.utils.timezone.now, db_index=True, null=True),
),
migrations.AddField(
model_name='snapshot',
name='modified_at',
field=models.DateTimeField(auto_now=True),
),
migrations.AddField(
model_name='snapshot',
name='bookmarked_at',
field=models.DateTimeField(default=django.utils.timezone.now, db_index=True, null=True),
),
migrations.AddField(
model_name='snapshot',
name='downloaded_at',
field=models.DateTimeField(default=None, null=True, blank=True, db_index=True),
),
migrations.AddField(
model_name='snapshot',
name='depth',
field=models.PositiveSmallIntegerField(default=0, db_index=True),
),
migrations.AddField(
model_name='snapshot',
name='status',
field=models.CharField(choices=[('queued', 'Queued'), ('started', 'Started'), ('sealed', 'Sealed')], default='queued', max_length=15, db_index=True),
),
migrations.AddField(
model_name='snapshot',
name='retry_at',
field=models.DateTimeField(default=django.utils.timezone.now, null=True, blank=True, db_index=True),
),
migrations.AddField(
model_name='snapshot',
name='config',
field=models.JSONField(default=dict, blank=False),
),
migrations.AddField(
model_name='snapshot',
name='notes',
field=models.TextField(blank=True, default=''),
),
migrations.AddField(
model_name='snapshot',
name='output_dir',
field=models.CharField(max_length=256, default=None, null=True, blank=True),
),
# Copy data from old fields to new
migrations.RunPython(copy_bookmarked_at_from_added, migrations.RunPython.noop),
migrations.RunPython(copy_created_at_from_added, migrations.RunPython.noop),
migrations.RunPython(populate_created_by_snapshot, migrations.RunPython.noop),
# Make created_by non-nullable after population
migrations.AlterField(
model_name='snapshot',
name='created_by',
field=models.ForeignKey(
on_delete=django.db.models.deletion.CASCADE,
related_name='snapshot_set',
to=settings.AUTH_USER_MODEL,
db_index=True,
),
),
# Update timestamp field constraints
migrations.AlterField(
model_name='snapshot',
name='timestamp',
field=models.CharField(max_length=32, unique=True, db_index=True, editable=False),
),
# Update title field size
migrations.AlterField(
model_name='snapshot',
name='title',
field=models.CharField(max_length=512, null=True, blank=True, db_index=True),
),
# Remove old 'added' and 'updated' fields
migrations.RemoveField(model_name='snapshot', name='added'),
migrations.RemoveField(model_name='snapshot', name='updated'),
# Remove old 'tags' CharField (now M2M via Tag model)
migrations.RemoveField(model_name='snapshot', name='tags'),
# === TAG CHANGES ===
# Add uuid field to Tag temporarily for ID migration
migrations.AddField(
model_name='tag',
name='uuid',
field=models.UUIDField(default=uuid4, null=True, blank=True),
),
migrations.AddField(
model_name='tag',
name='created_by',
field=models.ForeignKey(
default=None, null=True, blank=True,
on_delete=django.db.models.deletion.CASCADE,
related_name='tag_set',
to=settings.AUTH_USER_MODEL,
),
),
migrations.AddField(
model_name='tag',
name='created_at',
field=models.DateTimeField(default=django.utils.timezone.now, db_index=True, null=True),
),
migrations.AddField(
model_name='tag',
name='modified_at',
field=models.DateTimeField(auto_now=True),
),
# Populate UUIDs for tags
migrations.RunPython(generate_uuid_for_tags, migrations.RunPython.noop),
migrations.RunPython(populate_created_by_tag, migrations.RunPython.noop),
# Make created_by non-nullable
migrations.AlterField(
model_name='tag',
name='created_by',
field=models.ForeignKey(
on_delete=django.db.models.deletion.CASCADE,
related_name='tag_set',
to=settings.AUTH_USER_MODEL,
),
),
# Update slug field
migrations.AlterField(
model_name='tag',
name='slug',
field=models.SlugField(unique=True, max_length=100, editable=False),
),
# === ARCHIVERESULT CHANGES ===
# Add uuid field for new ID
migrations.AddField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(default=uuid4, null=True, blank=True),
),
migrations.AddField(
model_name='archiveresult',
name='created_by',
field=models.ForeignKey(
default=None, null=True, blank=True,
on_delete=django.db.models.deletion.CASCADE,
related_name='archiveresult_set',
to=settings.AUTH_USER_MODEL,
),
),
migrations.AddField(
model_name='archiveresult',
name='created_at',
field=models.DateTimeField(default=django.utils.timezone.now, db_index=True, null=True),
),
migrations.AddField(
model_name='archiveresult',
name='modified_at',
field=models.DateTimeField(auto_now=True),
),
migrations.AddField(
model_name='archiveresult',
name='retry_at',
field=models.DateTimeField(default=django.utils.timezone.now, null=True, blank=True, db_index=True),
),
migrations.AddField(
model_name='archiveresult',
name='notes',
field=models.TextField(blank=True, default=''),
),
migrations.AddField(
model_name='archiveresult',
name='output_dir',
field=models.CharField(max_length=256, default=None, null=True, blank=True),
),
# Populate UUIDs and data for archive results
migrations.RunPython(generate_uuid_for_archiveresults, migrations.RunPython.noop),
migrations.RunPython(copy_created_at_from_start_ts, migrations.RunPython.noop),
migrations.RunPython(populate_created_by_archiveresult, migrations.RunPython.noop),
# Make created_by non-nullable
migrations.AlterField(
model_name='archiveresult',
name='created_by',
field=models.ForeignKey(
on_delete=django.db.models.deletion.CASCADE,
related_name='archiveresult_set',
to=settings.AUTH_USER_MODEL,
db_index=True,
),
),
# Update extractor choices
migrations.AlterField(
model_name='archiveresult',
name='extractor',
field=models.CharField(
choices=[
('htmltotext', 'htmltotext'), ('git', 'git'), ('singlefile', 'singlefile'),
('media', 'media'), ('archive_org', 'archive_org'), ('readability', 'readability'),
('mercury', 'mercury'), ('favicon', 'favicon'), ('pdf', 'pdf'),
('headers', 'headers'), ('screenshot', 'screenshot'), ('dom', 'dom'),
('title', 'title'), ('wget', 'wget'),
],
max_length=32, db_index=True,
),
),
# Update status field
migrations.AlterField(
model_name='archiveresult',
name='status',
field=models.CharField(
choices=[
('queued', 'Queued'), ('started', 'Started'), ('backoff', 'Waiting to retry'),
('succeeded', 'Succeeded'), ('failed', 'Failed'), ('skipped', 'Skipped'),
],
max_length=16, default='queued', db_index=True,
),
),
# Update output field size
migrations.AlterField(
model_name='archiveresult',
name='output',
field=models.CharField(max_length=1024, default=None, null=True, blank=True),
),
# Update cmd_version field size
migrations.AlterField(
model_name='archiveresult',
name='cmd_version',
field=models.CharField(max_length=128, default=None, null=True, blank=True),
),
# Make start_ts and end_ts nullable
migrations.AlterField(
model_name='archiveresult',
name='start_ts',
field=models.DateTimeField(default=None, null=True, blank=True),
),
migrations.AlterField(
model_name='archiveresult',
name='end_ts',
field=models.DateTimeField(default=None, null=True, blank=True),
),
# Make pwd nullable
migrations.AlterField(
model_name='archiveresult',
name='pwd',
field=models.CharField(max_length=256, default=None, null=True, blank=True),
),
# Make cmd nullable
migrations.AlterField(
model_name='archiveresult',
name='cmd',
field=models.JSONField(default=None, null=True, blank=True),
),
# Update model options
migrations.AlterModelOptions(
name='archiveresult',
options={'verbose_name': 'Archive Result', 'verbose_name_plural': 'Archive Results Log'},
),
migrations.AlterModelOptions(
name='snapshot',
options={'verbose_name': 'Snapshot', 'verbose_name_plural': 'Snapshots'},
),
migrations.AlterModelOptions(
name='tag',
options={'verbose_name': 'Tag', 'verbose_name_plural': 'Tags'},
),
]

View File

@@ -1,101 +0,0 @@
# Generated by Django 5.0.6 on 2024-05-13 11:43
from django.db import migrations
from datetime import datetime
from archivebox.base_models.abid import abid_from_values, DEFAULT_ABID_URI_SALT
def calculate_abid(self):
"""
Return a freshly derived ABID (assembled from attrs defined in ABIDModel.abid_*_src).
"""
prefix = self.abid_prefix
ts = eval(self.abid_ts_src)
uri = eval(self.abid_uri_src)
subtype = eval(self.abid_subtype_src)
rand = eval(self.abid_rand_src)
if (not prefix) or prefix == 'obj_':
suggested_abid = self.__class__.__name__[:3].lower()
raise Exception(f'{self.__class__.__name__}.abid_prefix must be defined to calculate ABIDs (suggested: {suggested_abid})')
if not ts:
ts = datetime.utcfromtimestamp(0)
print(f'[!] WARNING: Generating ABID with ts=0000000000 placeholder because {self.__class__.__name__}.abid_ts_src={self.abid_ts_src} is unset!', ts.isoformat())
if not uri:
uri = str(self)
print(f'[!] WARNING: Generating ABID with uri=str(self) placeholder because {self.__class__.__name__}.abid_uri_src={self.abid_uri_src} is unset!', uri)
if not subtype:
subtype = self.__class__.__name__
print(f'[!] WARNING: Generating ABID with subtype={subtype} placeholder because {self.__class__.__name__}.abid_subtype_src={self.abid_subtype_src} is unset!', subtype)
if not rand:
rand = getattr(self, 'uuid', None) or getattr(self, 'id', None) or getattr(self, 'pk')
print(f'[!] WARNING: Generating ABID with rand=self.id placeholder because {self.__class__.__name__}.abid_rand_src={self.abid_rand_src} is unset!', rand)
abid = abid_from_values(
prefix=prefix,
ts=ts,
uri=uri,
subtype=subtype,
rand=rand,
salt=DEFAULT_ABID_URI_SALT,
)
assert abid.ulid and abid.uuid and abid.typeid, f'Failed to calculate {prefix}_ABID for {self.__class__.__name__}'
return abid
def copy_snapshot_uuids(apps, schema_editor):
print(' Copying snapshot.id -> snapshot.uuid...')
Snapshot = apps.get_model("core", "Snapshot")
for snapshot in Snapshot.objects.all():
snapshot.uuid = snapshot.id
snapshot.save(update_fields=["uuid"])
def generate_snapshot_abids(apps, schema_editor):
print(' Generating snapshot.abid values...')
Snapshot = apps.get_model("core", "Snapshot")
for snapshot in Snapshot.objects.all():
snapshot.abid_prefix = 'snp_'
snapshot.abid_ts_src = 'self.added'
snapshot.abid_uri_src = 'self.url'
snapshot.abid_subtype_src = '"01"'
snapshot.abid_rand_src = 'self.uuid'
snapshot.abid = calculate_abid(snapshot)
snapshot.uuid = snapshot.abid.uuid
snapshot.save(update_fields=["abid", "uuid"])
def generate_archiveresult_abids(apps, schema_editor):
print(' Generating ArchiveResult.abid values... (may take an hour or longer for large collections...)')
ArchiveResult = apps.get_model("core", "ArchiveResult")
Snapshot = apps.get_model("core", "Snapshot")
for result in ArchiveResult.objects.all():
result.abid_prefix = 'res_'
result.snapshot = Snapshot.objects.get(pk=result.snapshot_id)
result.snapshot_added = result.snapshot.added
result.snapshot_url = result.snapshot.url
result.abid_ts_src = 'self.snapshot_added'
result.abid_uri_src = 'self.snapshot_url'
result.abid_subtype_src = 'self.extractor'
result.abid_rand_src = 'self.id'
result.abid = calculate_abid(result)
result.uuid = result.abid.uuid
result.save(update_fields=["abid", "uuid"])
class Migration(migrations.Migration):
dependencies = [
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
]
operations = [
migrations.RunPython(copy_snapshot_uuids, reverse_code=migrations.RunPython.noop),
migrations.RunPython(generate_snapshot_abids, reverse_code=migrations.RunPython.noop),
migrations.RunPython(generate_archiveresult_abids, reverse_code=migrations.RunPython.noop),
]

View File

@@ -1,19 +0,0 @@
# Generated by Django 5.0.6 on 2024-05-13 12:08
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0024_auto_20240513_1143'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(default=uuid.uuid4, editable=False, unique=True),
),
]

View File

@@ -1,117 +0,0 @@
# Generated by Django 5.0.6 on 2024-05-13 13:01
import django.db.models.deletion
import django.utils.timezone
from django.conf import settings
from django.db import migrations, models
import archivebox.base_models.models
def updated_created_by_ids(apps, schema_editor):
"""Get or create a system user with is_superuser=True to be the default owner for new DB rows"""
User = apps.get_model("auth", "User")
ArchiveResult = apps.get_model("core", "ArchiveResult")
Snapshot = apps.get_model("core", "Snapshot")
Tag = apps.get_model("core", "Tag")
# if only one user exists total, return that user
if User.objects.filter(is_superuser=True).count() == 1:
user_id = User.objects.filter(is_superuser=True).values_list('pk', flat=True)[0]
# otherwise, create a dedicated "system" user
user_id = User.objects.get_or_create(username='system', is_staff=True, is_superuser=True, defaults={'email': '', 'password': ''})[0].pk
ArchiveResult.objects.all().update(created_by_id=user_id)
Snapshot.objects.all().update(created_by_id=user_id)
Tag.objects.all().update(created_by_id=user_id)
class Migration(migrations.Migration):
dependencies = [
('core', '0025_alter_archiveresult_uuid'),
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
migrations.AddField(
model_name='archiveresult',
name='created',
field=models.DateTimeField(auto_now_add=True, default=django.utils.timezone.now),
preserve_default=False,
),
migrations.AddField(
model_name='archiveresult',
name='created_by',
field=models.ForeignKey(null=True, default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
migrations.AddField(
model_name='archiveresult',
name='modified',
field=models.DateTimeField(auto_now=True),
),
migrations.AddField(
model_name='snapshot',
name='created',
field=models.DateTimeField(auto_now_add=True, default=django.utils.timezone.now),
preserve_default=False,
),
migrations.AddField(
model_name='snapshot',
name='created_by',
field=models.ForeignKey(null=True, default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
migrations.AddField(
model_name='snapshot',
name='modified',
field=models.DateTimeField(auto_now=True),
),
migrations.AddField(
model_name='tag',
name='created',
field=models.DateTimeField(auto_now_add=True, default=django.utils.timezone.now),
preserve_default=False,
),
migrations.AddField(
model_name='tag',
name='created_by',
field=models.ForeignKey(null=True, default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
migrations.AddField(
model_name='tag',
name='modified',
field=models.DateTimeField(auto_now=True),
),
migrations.AddField(
model_name='tag',
name='uuid',
field=models.UUIDField(blank=True, null=True, unique=True),
),
migrations.AlterField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(blank=True, null=True, unique=True),
),
migrations.RunPython(updated_created_by_ids, reverse_code=migrations.RunPython.noop),
migrations.AddField(
model_name='snapshot',
name='created_by',
field=models.ForeignKey(default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
migrations.AlterField(
model_name='archiveresult',
name='created_by',
field=models.ForeignKey(default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
migrations.AddField(
model_name='tag',
name='created_by',
field=models.ForeignKey(default=archivebox.base_models.models.get_or_create_system_user_pk, on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL),
),
]

View File

@@ -1,105 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 02:48
from django.db import migrations
from datetime import datetime
from archivebox.base_models.abid import ABID, abid_from_values, DEFAULT_ABID_URI_SALT
def calculate_abid(self):
"""
Return a freshly derived ABID (assembled from attrs defined in ABIDModel.abid_*_src).
"""
prefix = self.abid_prefix
ts = eval(self.abid_ts_src)
uri = eval(self.abid_uri_src)
subtype = eval(self.abid_subtype_src)
rand = eval(self.abid_rand_src)
if (not prefix) or prefix == 'obj_':
suggested_abid = self.__class__.__name__[:3].lower()
raise Exception(f'{self.__class__.__name__}.abid_prefix must be defined to calculate ABIDs (suggested: {suggested_abid})')
if not ts:
ts = datetime.utcfromtimestamp(0)
print(f'[!] WARNING: Generating ABID with ts=0000000000 placeholder because {self.__class__.__name__}.abid_ts_src={self.abid_ts_src} is unset!', ts.isoformat())
if not uri:
uri = str(self)
print(f'[!] WARNING: Generating ABID with uri=str(self) placeholder because {self.__class__.__name__}.abid_uri_src={self.abid_uri_src} is unset!', uri)
if not subtype:
subtype = self.__class__.__name__
print(f'[!] WARNING: Generating ABID with subtype={subtype} placeholder because {self.__class__.__name__}.abid_subtype_src={self.abid_subtype_src} is unset!', subtype)
if not rand:
rand = getattr(self, 'uuid', None) or getattr(self, 'id', None) or getattr(self, 'pk')
print(f'[!] WARNING: Generating ABID with rand=self.id placeholder because {self.__class__.__name__}.abid_rand_src={self.abid_rand_src} is unset!', rand)
abid = abid_from_values(
prefix=prefix,
ts=ts,
uri=uri,
subtype=subtype,
rand=rand,
salt=DEFAULT_ABID_URI_SALT,
)
assert abid.ulid and abid.uuid and abid.typeid, f'Failed to calculate {prefix}_ABID for {self.__class__.__name__}'
return abid
def update_snapshot_ids(apps, schema_editor):
Snapshot = apps.get_model("core", "Snapshot")
num_total = Snapshot.objects.all().count()
print(f' Updating {num_total} Snapshot.id, Snapshot.uuid values in place...')
for idx, snapshot in enumerate(Snapshot.objects.all().only('abid').iterator(chunk_size=500)):
assert snapshot.abid
snapshot.abid_prefix = 'snp_'
snapshot.abid_ts_src = 'self.added'
snapshot.abid_uri_src = 'self.url'
snapshot.abid_subtype_src = '"01"'
snapshot.abid_rand_src = 'self.uuid'
snapshot.abid = calculate_abid(snapshot)
snapshot.uuid = snapshot.abid.uuid
snapshot.save(update_fields=["abid", "uuid"])
assert str(ABID.parse(snapshot.abid).uuid) == str(snapshot.uuid)
if idx % 1000 == 0:
print(f'Migrated {idx}/{num_total} Snapshot objects...')
def update_archiveresult_ids(apps, schema_editor):
Snapshot = apps.get_model("core", "Snapshot")
ArchiveResult = apps.get_model("core", "ArchiveResult")
num_total = ArchiveResult.objects.all().count()
print(f' Updating {num_total} ArchiveResult.id, ArchiveResult.uuid values in place... (may take an hour or longer for large collections...)')
for idx, result in enumerate(ArchiveResult.objects.all().only('abid', 'snapshot_id').iterator(chunk_size=500)):
assert result.abid
result.abid_prefix = 'res_'
result.snapshot = Snapshot.objects.get(pk=result.snapshot_id)
result.snapshot_added = result.snapshot.added
result.snapshot_url = result.snapshot.url
result.abid_ts_src = 'self.snapshot_added'
result.abid_uri_src = 'self.snapshot_url'
result.abid_subtype_src = 'self.extractor'
result.abid_rand_src = 'self.id'
result.abid = calculate_abid(result)
result.uuid = result.abid.uuid
result.uuid = ABID.parse(result.abid).uuid
result.save(update_fields=["abid", "uuid"])
assert str(ABID.parse(result.abid).uuid) == str(result.uuid)
if idx % 5000 == 0:
print(f'Migrated {idx}/{num_total} ArchiveResult objects...')
class Migration(migrations.Migration):
dependencies = [
('core', '0026_archiveresult_created_archiveresult_created_by_and_more'),
]
operations = [
migrations.RunPython(update_snapshot_ids, reverse_code=migrations.RunPython.noop),
migrations.RunPython(update_archiveresult_ids, reverse_code=migrations.RunPython.noop),
]

View File

@@ -1,19 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 04:28
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0027_update_snapshot_ids'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(default=uuid.uuid4),
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 04:28
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0028_alter_archiveresult_uuid'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='id',
field=models.BigIntegerField(primary_key=True, serialize=False, verbose_name='ID'),
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 05:00
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0029_alter_archiveresult_id'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(unique=True),
),
]

View File

@@ -1,34 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 05:09
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0030_alter_archiveresult_uuid'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='id',
field=models.IntegerField(default=uuid.uuid4, primary_key=True, serialize=False, verbose_name='ID'),
),
migrations.AlterField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(default=uuid.uuid4, unique=True),
),
migrations.AlterField(
model_name='snapshot',
name='uuid',
field=models.UUIDField(default=uuid.uuid4, unique=True),
),
migrations.AlterField(
model_name='tag',
name='uuid',
field=models.UUIDField(default=uuid.uuid4, null=True, unique=True),
),
]

View File

@@ -1,23 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 05:20
import core.models
import random
from django.db import migrations, models
def rand_int_id():
return random.getrandbits(32)
class Migration(migrations.Migration):
dependencies = [
('core', '0031_alter_archiveresult_id_alter_archiveresult_uuid_and_more'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='id',
field=models.BigIntegerField(default=rand_int_id, primary_key=True, serialize=False, verbose_name='ID'),
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 05:34
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0032_alter_archiveresult_id'),
]
operations = [
migrations.RenameField(
model_name='archiveresult',
old_name='id',
new_name='old_id',
),
]

View File

@@ -1,45 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 05:37
import uuid
import random
from django.db import migrations, models
from archivebox.base_models.abid import ABID
def rand_int_id():
return random.getrandbits(32)
def update_archiveresult_ids(apps, schema_editor):
ArchiveResult = apps.get_model("core", "ArchiveResult")
num_total = ArchiveResult.objects.all().count()
print(f' Updating {num_total} ArchiveResult.id, ArchiveResult.uuid values in place... (may take an hour or longer for large collections...)')
for idx, result in enumerate(ArchiveResult.objects.all().only('abid').iterator(chunk_size=500)):
assert result.abid
result.uuid = ABID.parse(result.abid).uuid
result.save(update_fields=["uuid"])
assert str(ABID.parse(result.abid).uuid) == str(result.uuid)
if idx % 2500 == 0:
print(f'Migrated {idx}/{num_total} ArchiveResult objects...')
class Migration(migrations.Migration):
dependencies = [
('core', '0033_rename_id_archiveresult_old_id'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='old_id',
field=models.BigIntegerField(default=rand_int_id, serialize=False, verbose_name='ID'),
),
migrations.RunPython(update_archiveresult_ids, reverse_code=migrations.RunPython.noop),
migrations.AlterField(
model_name='archiveresult',
name='uuid',
field=models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False, unique=True),
),
]

View File

@@ -1,19 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 05:49
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0034_alter_archiveresult_old_id_alter_archiveresult_uuid'),
]
operations = [
migrations.RenameField(
model_name='archiveresult',
old_name='uuid',
new_name='id',
),
]

View File

@@ -1,29 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 05:59
import core.models
import uuid
import random
from django.db import migrations, models
def rand_int_id():
return random.getrandbits(32)
class Migration(migrations.Migration):
dependencies = [
('core', '0035_remove_archiveresult_uuid_archiveresult_id'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='id',
field=models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False, unique=True, verbose_name='ID'),
),
migrations.AlterField(
model_name='archiveresult',
name='old_id',
field=models.BigIntegerField(default=rand_int_id, serialize=False, verbose_name='Old ID'),
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 06:08
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0036_alter_archiveresult_id_alter_archiveresult_old_id'),
]
operations = [
migrations.RenameField(
model_name='snapshot',
old_name='id',
new_name='old_id',
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 06:09
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0037_rename_id_snapshot_old_id'),
]
operations = [
migrations.RenameField(
model_name='snapshot',
old_name='uuid',
new_name='id',
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 06:25
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0038_rename_uuid_snapshot_id'),
]
operations = [
migrations.RenameField(
model_name='archiveresult',
old_name='snapshot',
new_name='snapshot_old',
),
]

View File

@@ -1,34 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 06:46
import django.db.models.deletion
from django.db import migrations, models
def update_archiveresult_snapshot_ids(apps, schema_editor):
ArchiveResult = apps.get_model("core", "ArchiveResult")
Snapshot = apps.get_model("core", "Snapshot")
num_total = ArchiveResult.objects.all().count()
print(f' Updating {num_total} ArchiveResult.snapshot_id values in place... (may take an hour or longer for large collections...)')
for idx, result in enumerate(ArchiveResult.objects.all().only('snapshot_old_id').iterator(chunk_size=5000)):
assert result.snapshot_old_id
snapshot = Snapshot.objects.only('id').get(old_id=result.snapshot_old_id)
result.snapshot_id = snapshot.id
result.save(update_fields=["snapshot_id"])
assert str(result.snapshot_id) == str(snapshot.id)
if idx % 5000 == 0:
print(f'Migrated {idx}/{num_total} ArchiveResult objects...')
class Migration(migrations.Migration):
dependencies = [
('core', '0039_rename_snapshot_archiveresult_snapshot_old'),
]
operations = [
migrations.AddField(
model_name='archiveresult',
name='snapshot',
field=models.ForeignKey(null=True, on_delete=django.db.models.deletion.CASCADE, related_name='archiveresults', to='core.snapshot', to_field='id'),
),
migrations.RunPython(update_archiveresult_snapshot_ids, reverse_code=migrations.RunPython.noop),
]

View File

@@ -1,24 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 06:50
import django.db.models.deletion
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0040_archiveresult_snapshot'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='snapshot',
field=models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='core.snapshot', to_field='id'),
),
migrations.AlterField(
model_name='archiveresult',
name='snapshot_old',
field=models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='archiveresults_old', to='core.snapshot'),
),
]

View File

@@ -1,17 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 06:51
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0041_alter_archiveresult_snapshot_and_more'),
]
operations = [
migrations.RemoveField(
model_name='archiveresult',
name='snapshot_old',
),
]

View File

@@ -1,20 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-18 06:52
import django.db.models.deletion
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0042_remove_archiveresult_snapshot_old'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='snapshot',
field=models.ForeignKey(db_column='snapshot_id', on_delete=django.db.models.deletion.CASCADE, to='core.snapshot', to_field='id'),
),
]

View File

@@ -1,40 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-19 23:01
import django.db.models.deletion
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0043_alter_archiveresult_snapshot_alter_snapshot_id_and_more'),
]
operations = [
migrations.SeparateDatabaseAndState(
database_operations=[
# No-op, SnapshotTag model already exists in DB
],
state_operations=[
migrations.CreateModel(
name='SnapshotTag',
fields=[
('id', models.AutoField(primary_key=True, serialize=False)),
('snapshot', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='core.snapshot')),
('tag', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='core.tag')),
],
options={
'db_table': 'core_snapshot_tags',
'unique_together': {('snapshot', 'tag')},
},
),
migrations.AlterField(
model_name='snapshot',
name='tags',
field=models.ManyToManyField(blank=True, related_name='snapshot_set', through='core.SnapshotTag', to='core.tag'),
),
],
),
]

View File

@@ -1,19 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 01:54
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0044_alter_archiveresult_snapshot_alter_tag_uuid_and_more'),
]
operations = [
migrations.AlterField(
model_name='snapshot',
name='old_id',
field=models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False, unique=True),
),
]

View File

@@ -1,30 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 01:55
import django.db.models.deletion
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0045_alter_snapshot_old_id'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='snapshot',
field=models.ForeignKey(db_column='snapshot_id', on_delete=django.db.models.deletion.CASCADE, to='core.snapshot', to_field='id'),
),
migrations.AlterField(
model_name='snapshot',
name='id',
field=models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False, unique=True),
),
migrations.AlterField(
model_name='snapshot',
name='old_id',
field=models.UUIDField(default=uuid.uuid4, editable=False, unique=True),
),
]

View File

@@ -1,24 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:16
import django.db.models.deletion
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0046_alter_archiveresult_snapshot_alter_snapshot_id_and_more'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='snapshot',
field=models.ForeignKey(db_column='snapshot_id', on_delete=django.db.models.deletion.CASCADE, to='core.snapshot', to_field='id'),
),
migrations.AlterField(
model_name='snapshottag',
name='tag',
field=models.ForeignKey(db_column='tag_id', on_delete=django.db.models.deletion.CASCADE, to='core.tag'),
),
]

View File

@@ -1,24 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:17
import django.db.models.deletion
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0047_alter_snapshottag_unique_together_and_more'),
]
operations = [
migrations.AlterField(
model_name='archiveresult',
name='snapshot',
field=models.ForeignKey(db_column='snapshot_id', on_delete=django.db.models.deletion.CASCADE, to='core.snapshot'),
),
migrations.AlterField(
model_name='snapshottag',
name='snapshot',
field=models.ForeignKey(db_column='snapshot_id', on_delete=django.db.models.deletion.CASCADE, to='core.snapshot', to_field='old_id'),
),
]

View File

@@ -1,22 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:26
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0048_alter_archiveresult_snapshot_and_more'),
]
operations = [
migrations.RenameField(
model_name='snapshottag',
old_name='snapshot',
new_name='snapshot_old',
),
migrations.AlterUniqueTogether(
name='snapshottag',
unique_together={('snapshot_old', 'tag')},
),
]

View File

@@ -1,19 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:30
import django.db.models.deletion
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0049_rename_snapshot_snapshottag_snapshot_old_and_more'),
]
operations = [
migrations.AlterField(
model_name='snapshottag',
name='snapshot_old',
field=models.ForeignKey(db_column='snapshot_old_id', on_delete=django.db.models.deletion.CASCADE, to='core.snapshot', to_field='old_id'),
),
]

View File

@@ -1,40 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:31
import django.db.models.deletion
from django.db import migrations, models
def update_snapshottag_ids(apps, schema_editor):
Snapshot = apps.get_model("core", "Snapshot")
SnapshotTag = apps.get_model("core", "SnapshotTag")
num_total = SnapshotTag.objects.all().count()
print(f' Updating {num_total} SnapshotTag.snapshot_id values in place... (may take an hour or longer for large collections...)')
for idx, snapshottag in enumerate(SnapshotTag.objects.all().only('snapshot_old_id').iterator(chunk_size=500)):
assert snapshottag.snapshot_old_id
snapshot = Snapshot.objects.get(old_id=snapshottag.snapshot_old_id)
snapshottag.snapshot_id = snapshot.id
snapshottag.save(update_fields=["snapshot_id"])
assert str(snapshottag.snapshot_id) == str(snapshot.id)
if idx % 100 == 0:
print(f'Migrated {idx}/{num_total} SnapshotTag objects...')
class Migration(migrations.Migration):
dependencies = [
('core', '0050_alter_snapshottag_snapshot_old'),
]
operations = [
migrations.AddField(
model_name='snapshottag',
name='snapshot',
field=models.ForeignKey(blank=True, db_column='snapshot_id', null=True, on_delete=django.db.models.deletion.CASCADE, to='core.snapshot'),
),
migrations.AlterField(
model_name='snapshottag',
name='snapshot_old',
field=models.ForeignKey(db_column='snapshot_old_id', on_delete=django.db.models.deletion.CASCADE, related_name='snapshottag_old_set', to='core.snapshot', to_field='old_id'),
),
migrations.RunPython(update_snapshottag_ids, reverse_code=migrations.RunPython.noop),
]

View File

@@ -1,27 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:37
import django.db.models.deletion
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0051_snapshottag_snapshot_alter_snapshottag_snapshot_old'),
]
operations = [
migrations.AlterUniqueTogether(
name='snapshottag',
unique_together=set(),
),
migrations.AlterField(
model_name='snapshottag',
name='snapshot',
field=models.ForeignKey(db_column='snapshot_id', on_delete=django.db.models.deletion.CASCADE, to='core.snapshot'),
),
migrations.AlterUniqueTogether(
name='snapshottag',
unique_together={('snapshot', 'tag')},
),
]

View File

@@ -1,17 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:38
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0052_alter_snapshottag_unique_together_and_more'),
]
operations = [
migrations.RemoveField(
model_name='snapshottag',
name='snapshot_old',
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 02:40
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0053_remove_snapshottag_snapshot_old'),
]
operations = [
migrations.AlterField(
model_name='snapshot',
name='timestamp',
field=models.CharField(db_index=True, editable=False, max_length=32, unique=True),
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:24
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0054_alter_snapshot_timestamp'),
]
operations = [
migrations.AlterField(
model_name='tag',
name='slug',
field=models.SlugField(editable=False, max_length=100, unique=True),
),
]

View File

@@ -1,17 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:25
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0055_alter_tag_slug'),
]
operations = [
migrations.RemoveField(
model_name='tag',
name='uuid',
),
]

View File

@@ -1,18 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:29
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0056_remove_tag_uuid'),
]
operations = [
migrations.RenameField(
model_name='tag',
old_name='id',
new_name='old_id',
),
]

View File

@@ -1,22 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:30
import random
from django.db import migrations, models
def rand_int_id():
return random.getrandbits(32)
class Migration(migrations.Migration):
dependencies = [
('core', '0057_rename_id_tag_old_id'),
]
operations = [
migrations.AlterField(
model_name='tag',
name='old_id',
field=models.BigIntegerField(default=rand_int_id, primary_key=True, serialize=False, verbose_name='Old ID'),
),
]

View File

@@ -1,90 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:33
from datetime import datetime
from django.db import migrations, models
from archivebox.base_models.abid import abid_from_values
from archivebox.base_models.models import ABID
def calculate_abid(self):
"""
Return a freshly derived ABID (assembled from attrs defined in ABIDModel.abid_*_src).
"""
prefix = self.abid_prefix
ts = eval(self.abid_ts_src)
uri = eval(self.abid_uri_src)
subtype = eval(self.abid_subtype_src)
rand = eval(self.abid_rand_src)
if (not prefix) or prefix == 'obj_':
suggested_abid = self.__class__.__name__[:3].lower()
raise Exception(f'{self.__class__.__name__}.abid_prefix must be defined to calculate ABIDs (suggested: {suggested_abid})')
if not ts:
ts = datetime.utcfromtimestamp(0)
print(f'[!] WARNING: Generating ABID with ts=0000000000 placeholder because {self.__class__.__name__}.abid_ts_src={self.abid_ts_src} is unset!', ts.isoformat())
if not uri:
uri = str(self)
print(f'[!] WARNING: Generating ABID with uri=str(self) placeholder because {self.__class__.__name__}.abid_uri_src={self.abid_uri_src} is unset!', uri)
if not subtype:
subtype = self.__class__.__name__
print(f'[!] WARNING: Generating ABID with subtype={subtype} placeholder because {self.__class__.__name__}.abid_subtype_src={self.abid_subtype_src} is unset!', subtype)
if not rand:
rand = getattr(self, 'uuid', None) or getattr(self, 'id', None) or getattr(self, 'pk')
print(f'[!] WARNING: Generating ABID with rand=self.id placeholder because {self.__class__.__name__}.abid_rand_src={self.abid_rand_src} is unset!', rand)
abid = abid_from_values(
prefix=prefix,
ts=ts,
uri=uri,
subtype=subtype,
rand=rand,
)
assert abid.ulid and abid.uuid and abid.typeid, f'Failed to calculate {prefix}_ABID for {self.__class__.__name__}'
return abid
def update_archiveresult_ids(apps, schema_editor):
Tag = apps.get_model("core", "Tag")
num_total = Tag.objects.all().count()
print(f' Updating {num_total} Tag.id, ArchiveResult.uuid values in place...')
for idx, tag in enumerate(Tag.objects.all().iterator(chunk_size=500)):
if not tag.slug:
tag.slug = tag.name.lower().replace(' ', '_')
if not tag.name:
tag.name = tag.slug
if not (tag.name or tag.slug):
tag.delete()
continue
assert tag.slug or tag.name, f'Tag.slug must be defined! You have a Tag(id={tag.pk}) missing a slug!'
tag.abid_prefix = 'tag_'
tag.abid_ts_src = 'self.created'
tag.abid_uri_src = 'self.slug'
tag.abid_subtype_src = '"03"'
tag.abid_rand_src = 'self.old_id'
tag.abid = calculate_abid(tag)
tag.id = tag.abid.uuid
tag.save(update_fields=["abid", "id", "name", "slug"])
assert str(ABID.parse(tag.abid).uuid) == str(tag.id)
if idx % 10 == 0:
print(f'Migrated {idx}/{num_total} Tag objects...')
class Migration(migrations.Migration):
dependencies = [
('core', '0058_alter_tag_old_id'),
]
operations = [
migrations.AddField(
model_name='tag',
name='id',
field=models.UUIDField(blank=True, null=True),
),
migrations.RunPython(update_archiveresult_ids, reverse_code=migrations.RunPython.noop),
]

View File

@@ -1,19 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:42
import uuid
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0059_tag_id'),
]
operations = [
migrations.AlterField(
model_name='tag',
name='id',
field=models.UUIDField(default=uuid.uuid4, editable=False, unique=True),
),
]

View File

@@ -1,22 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:43
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0060_alter_tag_id'),
]
operations = [
migrations.RenameField(
model_name='snapshottag',
old_name='tag',
new_name='old_tag',
),
migrations.AlterUniqueTogether(
name='snapshottag',
unique_together={('snapshot', 'old_tag')},
),
]

View File

@@ -1,19 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:44
import django.db.models.deletion
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0061_rename_tag_snapshottag_old_tag_and_more'),
]
operations = [
migrations.AlterField(
model_name='snapshottag',
name='old_tag',
field=models.ForeignKey(db_column='old_tag_id', on_delete=django.db.models.deletion.CASCADE, to='core.tag'),
),
]

View File

@@ -1,40 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:45
import django.db.models.deletion
from django.db import migrations, models
def update_snapshottag_ids(apps, schema_editor):
Tag = apps.get_model("core", "Tag")
SnapshotTag = apps.get_model("core", "SnapshotTag")
num_total = SnapshotTag.objects.all().count()
print(f' Updating {num_total} SnapshotTag.tag_id values in place... (may take an hour or longer for large collections...)')
for idx, snapshottag in enumerate(SnapshotTag.objects.all().only('old_tag_id').iterator(chunk_size=500)):
assert snapshottag.old_tag_id
tag = Tag.objects.get(old_id=snapshottag.old_tag_id)
snapshottag.tag_id = tag.id
snapshottag.save(update_fields=["tag_id"])
assert str(snapshottag.tag_id) == str(tag.id)
if idx % 100 == 0:
print(f'Migrated {idx}/{num_total} SnapshotTag objects...')
class Migration(migrations.Migration):
dependencies = [
('core', '0062_alter_snapshottag_old_tag'),
]
operations = [
migrations.AddField(
model_name='snapshottag',
name='tag',
field=models.ForeignKey(blank=True, db_column='tag_id', null=True, on_delete=django.db.models.deletion.CASCADE, to='core.tag', to_field='id'),
),
migrations.AlterField(
model_name='snapshottag',
name='old_tag',
field=models.ForeignKey(db_column='old_tag_id', on_delete=django.db.models.deletion.CASCADE, related_name='snapshottags_old', to='core.tag'),
),
migrations.RunPython(update_snapshottag_ids, reverse_code=migrations.RunPython.noop),
]

View File

@@ -1,27 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:50
import django.db.models.deletion
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0063_snapshottag_tag_alter_snapshottag_old_tag'),
]
operations = [
migrations.AlterUniqueTogether(
name='snapshottag',
unique_together=set(),
),
migrations.AlterField(
model_name='snapshottag',
name='tag',
field=models.ForeignKey(db_column='tag_id', on_delete=django.db.models.deletion.CASCADE, to='core.tag', to_field='id'),
),
migrations.AlterUniqueTogether(
name='snapshottag',
unique_together={('snapshot', 'tag')},
),
]

View File

@@ -1,17 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:51
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('core', '0064_alter_snapshottag_unique_together_and_more'),
]
operations = [
migrations.RemoveField(
model_name='snapshottag',
name='old_tag',
),
]

View File

@@ -1,34 +0,0 @@
# Generated by Django 5.0.6 on 2024-08-20 03:52
import core.models
import django.db.models.deletion
import uuid
import random
from django.db import migrations, models
def rand_int_id():
return random.getrandbits(32)
class Migration(migrations.Migration):
dependencies = [
('core', '0065_remove_snapshottag_old_tag'),
]
operations = [
migrations.AlterField(
model_name='snapshottag',
name='tag',
field=models.ForeignKey(db_column='tag_id', on_delete=django.db.models.deletion.CASCADE, to='core.tag', to_field='id'),
),
migrations.AlterField(
model_name='tag',
name='id',
field=models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False, unique=True),
),
migrations.AlterField(
model_name='tag',
name='old_id',
field=models.BigIntegerField(default=rand_int_id, serialize=False, unique=True, verbose_name='Old ID'),
),
]

Some files were not shown because too many files have changed in this diff Show More