ArchiveBox 2025 Simplification Plan
Status: FINAL - Ready for implementation
Last Updated: 2024-12-24
Final Decisions Summary
| Decision | Choice |
|---|---|
| Task Queue | Keep retry_at polling pattern (no Django Tasks) |
| State Machine | Preserve current semantics; only replace mixins/statemachines if identical retry/lock guarantees are kept |
| Event Model | Remove completely |
| ABX Plugin System | Remove entirely (archivebox/pkgs/) |
| abx-pkg | Keep as external pip dependency (separate repo: github.com/ArchiveBox/abx-pkg) |
| Binary Providers | File-based plugins using abx-pkg internally |
| Search Backends | Hybrid: hooks for indexing, Python classes for querying |
| Auth Methods | Keep simple (LDAP + normal), no pluginization needed |
| ABID | Already removed (ignore old references) |
| ArchiveResult | Keep pre-creation with status=queued + retry_at for consistency |
| Plugin Directory | archivebox/plugins/* for built-ins, data/plugins/* for user hooks (flat on_*__*.* files) |
| Locking | Use retry_at consistently across Crawl, Snapshot, ArchiveResult |
| Worker Model | Separate processes per model type + per extractor, visible in htop |
| Concurrency | Per-extractor configurable (e.g., ytdlp_max_parallel=5) |
| InstalledBinary | Keep model + add Dependency model for audit trail |
Architecture Overview
Consistent Queue/Lock Pattern
All models (Crawl, Snapshot, ArchiveResult) use the same pattern:
class StatusMixin(models.Model):
status = models.CharField(max_length=15, db_index=True)
retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)
class Meta:
abstract = True
def tick(self) -> bool:
"""Override in subclass. Returns True if state changed."""
raise NotImplementedError
# Worker query (same for all models):
Model.objects.filter(
status__in=['queued', 'started'],
retry_at__lte=timezone.now()
).order_by('retry_at').first()
# Claim (atomic via optimistic locking):
updated = Model.objects.filter(
id=obj.id,
retry_at=obj.retry_at
).update(
retry_at=timezone.now() + timedelta(seconds=60)
)
if updated == 1: # Successfully claimed
obj.refresh_from_db()
obj.tick()
Failure/cleanup guarantees
- Objects stuck in started with a past retry_at must be reclaimed automatically using the existing retry/backoff rules.
- tick() implementations must continue to bump retry_at / transition to backoff the same way the current statemachines do, so that failures get retried without manual intervention (see the sketch below).
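For illustration only, a minimal sketch of how a tick() failure path could honor these rules (the retries counter and the backoff formula are assumptions, not existing fields):
# Hypothetical failure path for a tick() implementation; `retries` and the
# exponential backoff formula are illustrative assumptions.
from datetime import timedelta
from django.utils import timezone

MAX_RETRIES = 5

def record_failure(obj) -> None:
    """Bump retry_at with exponential backoff, or finalize after MAX_RETRIES."""
    obj.retries = getattr(obj, 'retries', 0) + 1
    if obj.retries >= MAX_RETRIES:
        obj.status = 'failed'
        obj.retry_at = None          # final state: workers never pick it up again
    else:
        obj.status = 'backoff'
        obj.retry_at = timezone.now() + timedelta(seconds=60 * 2 ** obj.retries)
    obj.save()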
Process Tree (Separate Processes, Visible in htop)
archivebox server
├── orchestrator (pid=1000)
│ ├── crawl_worker_0 (pid=1001)
│ ├── crawl_worker_1 (pid=1002)
│ ├── snapshot_worker_0 (pid=1003)
│ ├── snapshot_worker_1 (pid=1004)
│ ├── snapshot_worker_2 (pid=1005)
│ ├── wget_worker_0 (pid=1006)
│ ├── wget_worker_1 (pid=1007)
│ ├── ytdlp_worker_0 (pid=1008) # Limited concurrency
│ ├── ytdlp_worker_1 (pid=1009)
│ ├── screenshot_worker_0 (pid=1010)
│ ├── screenshot_worker_1 (pid=1011)
│ ├── screenshot_worker_2 (pid=1012)
│ └── ...
Configurable per-extractor concurrency:
# archivebox.conf or environment
WORKER_CONCURRENCY = {
'crawl': 2,
'snapshot': 3,
'wget': 2,
'ytdlp': 2, # Bandwidth-limited
'screenshot': 3,
'singlefile': 2,
'title': 5, # Fast, can run many
'favicon': 5,
}
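The decision table above also mentions per-extractor knobs like ytdlp_max_parallel=5; as a sketch (the env var naming convention is an assumption), those could be merged over the defaults like so:
# Sketch only: merge per-extractor <NAME>_MAX_PARALLEL env vars over the defaults.
import os

DEFAULT_CONCURRENCY = {'crawl': 2, 'snapshot': 3, 'wget': 2, 'ytdlp': 2, 'screenshot': 3}

def worker_concurrency() -> dict[str, int]:
    merged = dict(DEFAULT_CONCURRENCY)
    for worker in merged:
        env_val = os.environ.get(f'{worker.upper()}_MAX_PARALLEL', '')
        if env_val.isdigit():
            merged[worker] = int(env_val)
    return merged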
Hook System
Discovery (Glob at Startup)
# archivebox/hooks.py
from pathlib import Path
import subprocess
import os
import json
from django.conf import settings
BUILTIN_PLUGIN_DIR = Path(__file__).parent / 'plugins'  # archivebox/plugins/, ships with the package
USER_PLUGIN_DIR = settings.DATA_DIR / 'plugins'
def discover_hooks(event_name: str) -> list[Path]:
"""Find all scripts matching on_{EventName}__*.{sh,py,js} under archivebox/plugins/* and data/plugins/*"""
hooks = []
for base in (BUILTIN_PLUGIN_DIR, USER_PLUGIN_DIR):
if not base.exists():
continue
for ext in ('sh', 'py', 'js'):
hooks.extend(base.glob(f'*/on_{event_name}__*.{ext}'))
return sorted(hooks)
def run_hook(script: Path, output_dir: Path, **kwargs) -> dict:
"""Execute hook with --key=value args, cwd=output_dir."""
args = [str(script)]
    for key, value in kwargs.items():
        # Pass strings through unquoted; JSON-encode other values so hooks can parse them
        if not isinstance(value, str):
            value = json.dumps(value, default=str)
        args.append(f'--{key.replace("_", "-")}={value}')
env = os.environ.copy()
env['ARCHIVEBOX_DATA_DIR'] = str(settings.DATA_DIR)
result = subprocess.run(
args,
cwd=output_dir,
capture_output=True,
text=True,
timeout=300,
env=env,
)
return {
'returncode': result.returncode,
'stdout': result.stdout,
'stderr': result.stderr,
}
Hook Interface
- Input: CLI args --url=... --snapshot-id=...
- Location: Built-in hooks in archivebox/plugins/<plugin>/on_*__*.*, user hooks in data/plugins/<plugin>/on_*__*.*
- Internal API: Should treat ArchiveBox as an external CLI: call archivebox config --get ..., archivebox find ..., and import abx-pkg only when running in their own venvs.
- Output: Files written to $PWD (the output_dir); hooks can also call archivebox create ...
- Logging: stdout/stderr captured to ArchiveResult
- Exit code: 0 = success, non-zero = failure
Unified Config Access
- Implement archivebox.config.get_config(scope='global'|'crawl'|'snapshot'|...) that merges defaults, config files, environment variables, DB overrides, and per-object config (seed/crawl/snapshot). A minimal sketch follows this list.
- Provide helpers (get_config(), get_flat_config()) for Python callers so abx.pm.hook.get_CONFIG* can be removed.
- Ensure the CLI command archivebox config --get KEY (and a machine-readable --format=json) uses the same API so hook scripts can query config via subprocess calls.
- Document that plugin hooks should prefer the CLI to fetch config rather than importing Django internals, guaranteeing they work from shell/bash/js without ArchiveBox's runtime.
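A minimal sketch of the merge order (layer names and settings attributes here are assumptions, not the final API):
# Sketch only: source names and precedence are illustrative.
import os
from django.conf import settings

def get_config(defaults=None, crawl=None, snapshot=None) -> dict:
    """Merge config layers; later updates override earlier ones."""
    merged: dict = {}
    merged.update(getattr(defaults or settings, 'CONFIG_DEFAULTS', {}))  # built-in defaults
    merged.update(getattr(settings, 'CONFIG_FILE', {}))                  # ArchiveBox.conf values
    for key in list(merged):
        if key in os.environ:                                            # environment variable overrides
            merged[key] = os.environ[key]
    merged.update(getattr(settings, 'CONFIG_DB_OVERRIDES', {}))          # admin/DB overrides
    if crawl is not None:
        merged.update(crawl.config or {})                                # per-crawl config
    if snapshot is not None:
        merged.update(snapshot.config or {})                             # per-snapshot config
    return merged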
Example Extractor Hooks
Bash:
#!/usr/bin/env bash
# plugins/on_Snapshot__wget.sh
set -e
# Parse args
for arg in "$@"; do
case $arg in
--url=*) URL="${arg#*=}" ;;
--snapshot-id=*) SNAPSHOT_ID="${arg#*=}" ;;
esac
done
# Find wget binary
WGET=$(archivebox find InstalledBinary --name=wget --format=abspath)
[ -z "$WGET" ] && echo "wget not found" >&2 && exit 1
# Run extraction (writes to $PWD)
$WGET --mirror --page-requisites --adjust-extension "$URL" 2>&1
echo "Completed wget mirror of $URL"
Python:
#!/usr/bin/env python3
# plugins/on_Snapshot__singlefile.py
import argparse
import subprocess
import sys
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--url', required=True)
parser.add_argument('--snapshot-id', required=True)
args = parser.parse_args()
# Find binary via CLI
result = subprocess.run(
['archivebox', 'find', 'InstalledBinary', '--name=single-file', '--format=abspath'],
capture_output=True, text=True
)
bin_path = result.stdout.strip()
if not bin_path:
print("single-file not installed", file=sys.stderr)
sys.exit(1)
# Run extraction (writes to $PWD)
subprocess.run([bin_path, args.url, '--output', 'singlefile.html'], check=True)
print(f"Saved {args.url} to singlefile.html")
if __name__ == '__main__':
main()
Binary Providers & Dependencies
- Move dependency tracking into a dedicated dependencies module (or extend archivebox/machine/) with two Django models (a rough Django sketch follows this list):
Dependency:
id: uuidv7
bin_name: extractor binary executable name (ytdlp|wget|screenshot|...)
bin_provider: apt | brew | pip | npm | gem | nix | '*' for any
custom_cmds: JSON of provider->install command overrides (optional)
config: JSON of env vars/settings to apply during install
created_at: utc datetime
InstalledBinary:
id: uuidv7
dependency: FK to Dependency
bin_name: executable name again
bin_abspath: filesystem path
bin_version: semver string
bin_hash: sha256 of the binary
bin_provider: apt | brew | pip | npm | gem | nix | custom | ...
created_at: utc datetime (last seen/installed)
is_valid: property returning True when both abspath+version are set
- Provide CLI commands for hook scripts: archivebox find InstalledBinary --name=wget --format=abspath, archivebox dependency create ..., etc.
- Hooks remain language agnostic and should not import ArchiveBox Django modules; they rely on CLI commands plus their own runtime (python/bash/js).
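A rough Django sketch of those two models (field lengths, choices, and the uuid7 helper are assumptions, not a final migration):
# Sketch only: field types/lengths and the uuid7 default are illustrative.
from django.db import models
from uuid_extensions import uuid7  # assumed uuidv7 helper; any uuid7 library would do

class Dependency(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid7)
    bin_name = models.CharField(max_length=63, db_index=True)     # e.g. wget, ytdlp, node
    bin_provider = models.CharField(max_length=31, default='*')   # apt|brew|pip|npm|gem|nix|*
    custom_cmds = models.JSONField(default=dict, blank=True)      # provider -> install cmd override
    config = models.JSONField(default=dict, blank=True)           # env vars applied during install
    created_at = models.DateTimeField(auto_now_add=True)

class InstalledBinary(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid7)
    dependency = models.ForeignKey(Dependency, on_delete=models.CASCADE, null=True)
    bin_name = models.CharField(max_length=63, db_index=True)
    bin_abspath = models.CharField(max_length=255, null=True, blank=True)
    bin_version = models.CharField(max_length=32, null=True, blank=True)
    bin_hash = models.CharField(max_length=64, null=True, blank=True)    # sha256 hex digest
    bin_provider = models.CharField(max_length=31, default='custom')
    created_at = models.DateTimeField(auto_now=True)                     # last seen/installed

    @property
    def is_valid(self) -> bool:
        return bool(self.bin_abspath and self.bin_version)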
Provider Hooks
- Built-in provider plugins live under archivebox/plugins/<provider>/on_Dependency__*.py (e.g., apt, brew, pip, custom).
- Each provider hook:
  - Checks if the Dependency allows that provider via bin_provider or the wildcard '*'.
  - Builds the install command (custom_cmds[provider] override or a sane default like apt install -y <bin_name>).
  - Executes the command (bash/python) and, on success, records/updates an InstalledBinary.
Example outline (bash or python, but still interacting via CLI):
#!/usr/bin/env bash
# archivebox/plugins/apt/on_Dependency__install_using_apt_provider.sh
set -euo pipefail
DEP_JSON=$(archivebox dependency show --id="$DEPENDENCY_ID" --format=json)
BIN_NAME=$(echo "$DEP_JSON" | jq -r '.bin_name')
PROVIDER_ALLOWED=$(echo "$DEP_JSON" | jq -r '.bin_provider')
if [[ "$PROVIDER_ALLOWED" == "*" || "$PROVIDER_ALLOWED" == *"apt"* ]]; then
INSTALL_CMD=$(echo "$DEP_JSON" | jq -r '.custom_cmds.apt // empty')
INSTALL_CMD=${INSTALL_CMD:-"apt install -y --no-install-recommends $BIN_NAME"}
bash -lc "$INSTALL_CMD"
archivebox dependency register-installed \
--dependency-id="$DEPENDENCY_ID" \
--bin-provider=apt \
--bin-abspath="$(command -v "$BIN_NAME")" \
--bin-version="$("$(command -v "$BIN_NAME")" --version | head -n1)" \
--bin-hash="$(sha256sum "$(command -v "$BIN_NAME")" | cut -d' ' -f1)"
fi
- Extractor-level hooks (e.g., archivebox/plugins/wget/on_Crawl__install_wget_extractor_if_needed.*) ensure dependencies exist before starting work by creating/updating Dependency records (via CLI) and then invoking provider hooks (sketched after this list).
- Remove all reliance on abx.pm.hook.binary_load / ABX plugin packages; abx-pkg can remain as a normal pip dependency that hooks import if useful.
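For illustration, an extractor-level install hook might look like this (the archivebox dependency flags are assumptions based on the CLI commands named above):
#!/usr/bin/env python3
# Sketch of plugins/wget/on_Crawl__install_wget_extractor_if_needed.py;
# the `archivebox dependency create` flags are assumed, not a finalized CLI.
import subprocess

def main():
    # Is wget already registered as a valid InstalledBinary?
    found = subprocess.run(
        ['archivebox', 'find', 'InstalledBinary', '--name=wget', '--format=abspath'],
        capture_output=True, text=True,
    ).stdout.strip()
    if found:
        return  # already installed, nothing to do

    # Otherwise register a Dependency record; provider hooks handle the actual install.
    subprocess.run(
        ['archivebox', 'dependency', 'create', '--bin-name=wget', '--bin-provider=*'],
        check=True,
    )

if __name__ == '__main__':
    main()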
Search Backends (Hybrid)
Indexing: Hook Scripts
Triggered when ArchiveResult completes successfully (from the Django side we simply fire the event; indexing logic lives in standalone hook scripts):
#!/usr/bin/env python3
# plugins/on_ArchiveResult__index_sqlitefts.py
import argparse
import re
import sqlite3
import os
from pathlib import Path

def strip_html(html: str) -> str:
    """Crude tag stripper so HTML output can be indexed as plain text."""
    return re.sub(r'<[^>]+>', ' ', html)
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--snapshot-id', required=True)
parser.add_argument('--extractor', required=True)
args = parser.parse_args()
# Read text content from output files
content = ""
for f in Path.cwd().rglob('*.txt'):
content += f.read_text(errors='ignore') + "\n"
for f in Path.cwd().rglob('*.html'):
content += strip_html(f.read_text(errors='ignore')) + "\n"
if not content.strip():
return
# Add to FTS index
db = sqlite3.connect(os.environ['ARCHIVEBOX_DATA_DIR'] + '/search.sqlite3')
db.execute('CREATE VIRTUAL TABLE IF NOT EXISTS fts USING fts5(snapshot_id, content)')
db.execute('INSERT OR REPLACE INTO fts VALUES (?, ?)', (args.snapshot_id, content))
db.commit()
if __name__ == '__main__':
main()
Querying: CLI-backed Python Classes
# archivebox/search/backends/sqlitefts.py
import subprocess
import json
class SQLiteFTSBackend:
name = 'sqlitefts'
def search(self, query: str, limit: int = 50) -> list[str]:
"""Call plugins/on_Search__query_sqlitefts.* and parse stdout."""
result = subprocess.run(
['archivebox', 'search-backend', '--backend', self.name, '--query', query, '--limit', str(limit)],
capture_output=True,
check=True,
text=True,
)
return json.loads(result.stdout or '[]')
# archivebox/search/__init__.py
from django.conf import settings
def get_backend():
name = getattr(settings, 'SEARCH_BACKEND', 'sqlitefts')
if name == 'sqlitefts':
from .backends.sqlitefts import SQLiteFTSBackend
return SQLiteFTSBackend()
elif name == 'sonic':
from .backends.sonic import SonicBackend
return SonicBackend()
raise ValueError(f'Unknown search backend: {name}')
def search(query: str) -> list[str]:
return get_backend().search(query)
- Each backend script lives under archivebox/plugins/search/on_Search__query_<backend>.py (with user overrides in data/plugins/...) and outputs a JSON list of snapshot IDs. Python wrappers simply invoke the CLI to keep Django isolated from backend implementations. A sketch of the query hook itself follows.
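For completeness, a sketch of the matching query hook (argument names mirror the CLI wrapper above; the FTS schema matches the indexing hook):
#!/usr/bin/env python3
# Sketch of plugins/search/on_Search__query_sqlitefts.py; flags and schema are assumptions.
import argparse
import json
import os
import sqlite3

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--query', required=True)
    parser.add_argument('--limit', type=int, default=50)
    args = parser.parse_args()

    db = sqlite3.connect(os.environ['ARCHIVEBOX_DATA_DIR'] + '/search.sqlite3')
    rows = db.execute(
        'SELECT snapshot_id FROM fts WHERE fts MATCH ? LIMIT ?',
        (args.query, args.limit),
    ).fetchall()
    print(json.dumps([snapshot_id for (snapshot_id,) in rows]))  # JSON list of snapshot IDs on stdout

if __name__ == '__main__':
    main()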
Simplified Models
Goal: reduce line count without sacrificing the correctness guarantees we currently get from ModelWithStateMachine + python-statemachine. We keep the mixins/statemachines unless we can prove a smaller implementation enforces the same transitions/retry locking.
Snapshot
class Snapshot(models.Model):
id = models.UUIDField(primary_key=True, default=uuid7)
url = models.URLField(unique=True, db_index=True)
timestamp = models.CharField(max_length=32, unique=True, db_index=True)
title = models.CharField(max_length=512, null=True, blank=True)
created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
created_at = models.DateTimeField(default=timezone.now)
modified_at = models.DateTimeField(auto_now=True)
crawl = models.ForeignKey('crawls.Crawl', on_delete=models.CASCADE, null=True)
tags = models.ManyToManyField('Tag', through='SnapshotTag')
# Status (consistent with Crawl, ArchiveResult)
status = models.CharField(max_length=15, default='queued', db_index=True)
retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)
# Inline fields (no mixins)
config = models.JSONField(default=dict)
notes = models.TextField(blank=True, default='')
FINAL_STATES = ['sealed']
@property
def output_dir(self) -> Path:
return settings.ARCHIVE_DIR / self.timestamp
def tick(self) -> bool:
if self.status == 'queued' and self.can_start():
self.start()
return True
elif self.status == 'started' and self.is_finished():
self.seal()
return True
return False
def can_start(self) -> bool:
return bool(self.url)
def is_finished(self) -> bool:
results = self.archiveresult_set.all()
if not results.exists():
return False
return not results.filter(status__in=['queued', 'started', 'backoff']).exists()
def start(self):
self.status = 'started'
self.retry_at = timezone.now() + timedelta(seconds=10)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.save()
self.create_pending_archiveresults()
def seal(self):
self.status = 'sealed'
self.retry_at = None
self.save()
def create_pending_archiveresults(self):
for extractor in get_config(defaults=settings, crawl=self.crawl, snapshot=self).ENABLED_EXTRACTORS:
ArchiveResult.objects.get_or_create(
snapshot=self,
extractor=extractor,
defaults={
'status': 'queued',
'retry_at': timezone.now(),
'created_by': self.created_by,
}
)
ArchiveResult
class ArchiveResult(models.Model):
id = models.UUIDField(primary_key=True, default=uuid7)
snapshot = models.ForeignKey(Snapshot, on_delete=models.CASCADE)
extractor = models.CharField(max_length=32, db_index=True)
created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
created_at = models.DateTimeField(default=timezone.now)
modified_at = models.DateTimeField(auto_now=True)
# Status
status = models.CharField(max_length=15, default='queued', db_index=True)
retry_at = models.DateTimeField(default=timezone.now, null=True, db_index=True)
# Execution
start_ts = models.DateTimeField(null=True)
end_ts = models.DateTimeField(null=True)
output = models.CharField(max_length=1024, null=True)
cmd = models.JSONField(null=True)
pwd = models.CharField(max_length=256, null=True)
# Audit trail
machine = models.ForeignKey('machine.Machine', on_delete=models.SET_NULL, null=True)
iface = models.ForeignKey('machine.NetworkInterface', on_delete=models.SET_NULL, null=True)
installed_binary = models.ForeignKey('machine.InstalledBinary', on_delete=models.SET_NULL, null=True)
FINAL_STATES = ['succeeded', 'failed']
class Meta:
unique_together = ('snapshot', 'extractor')
@property
def output_dir(self) -> Path:
return self.snapshot.output_dir / self.extractor
def tick(self) -> bool:
if self.status == 'queued' and self.can_start():
self.start()
return True
elif self.status == 'backoff' and self.can_retry():
self.status = 'queued'
self.retry_at = timezone.now()
self.save()
return True
return False
def can_start(self) -> bool:
return bool(self.snapshot.url)
def can_retry(self) -> bool:
return self.retry_at and self.retry_at <= timezone.now()
def start(self):
self.status = 'started'
self.start_ts = timezone.now()
self.retry_at = timezone.now() + timedelta(seconds=120)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.save()
# Run hook and complete
self.run_extractor_hook()
def run_extractor_hook(self):
from archivebox.hooks import discover_hooks, run_hook
hooks = discover_hooks(f'Snapshot__{self.extractor}')
if not hooks:
self.status = 'failed'
self.output = f'No hook for: {self.extractor}'
self.end_ts = timezone.now()
self.retry_at = None
self.save()
return
result = run_hook(
hooks[0],
output_dir=self.output_dir,
url=self.snapshot.url,
snapshot_id=str(self.snapshot.id),
)
self.status = 'succeeded' if result['returncode'] == 0 else 'failed'
self.output = result['stdout'][:1024] or result['stderr'][:1024]
self.end_ts = timezone.now()
self.retry_at = None
self.save()
# Trigger search indexing if succeeded
if self.status == 'succeeded':
self.trigger_search_indexing()
def trigger_search_indexing(self):
from archivebox.hooks import discover_hooks, run_hook
for hook in discover_hooks('ArchiveResult__index'):
run_hook(hook, output_dir=self.output_dir,
snapshot_id=str(self.snapshot.id),
extractor=self.extractor)
- ArchiveResult must continue storing execution metadata (cmd, pwd, machine, iface, installed_binary, timestamps) exactly as before, even though the extractor now runs via hook scripts. run_extractor_hook() is responsible for capturing those values (e.g., wrapping subprocess calls; see the sketch below).
- Any refactor of Snapshot, ArchiveResult, or Crawl has to keep the same FINAL_STATES, retry_at semantics, and tag/output directory handling that ModelWithStateMachine currently provides.
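A rough sketch of how that metadata capture could be wrapped around the hook call (the Machine lookup and binary matching here are assumed helpers, not existing APIs):
# Sketch only: the Machine/InstalledBinary lookups are best-effort assumptions.
from django.utils import timezone

def run_hook_with_audit(archiveresult, hook_path):
    """Run one extractor hook and record cmd/pwd/machine/binary audit fields."""
    from archivebox.hooks import run_hook
    from machine.models import Machine, InstalledBinary

    archiveresult.cmd = [str(hook_path), f'--url={archiveresult.snapshot.url}',
                         f'--snapshot-id={archiveresult.snapshot.id}']
    archiveresult.pwd = str(archiveresult.output_dir)
    archiveresult.machine = Machine.objects.order_by('-created_at').first()   # current host record (assumed lookup)
    archiveresult.installed_binary = InstalledBinary.objects.filter(
        bin_name=archiveresult.extractor,
    ).order_by('-created_at').first()                                          # best-effort audit link

    result = run_hook(hook_path, output_dir=archiveresult.output_dir,
                      url=archiveresult.snapshot.url,
                      snapshot_id=str(archiveresult.snapshot.id))

    archiveresult.status = 'succeeded' if result['returncode'] == 0 else 'failed'
    archiveresult.output = (result['stdout'] or result['stderr'])[:1024]
    archiveresult.end_ts = timezone.now()
    archiveresult.retry_at = None
    archiveresult.save()
    return result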
Simplified Worker System
# archivebox/workers/orchestrator.py
import os
import time
import multiprocessing
from datetime import timedelta
from django.utils import timezone
from django.conf import settings
class Worker:
"""Base worker for processing queued objects."""
Model = None
name = 'worker'
def get_queue(self):
return self.Model.objects.filter(
retry_at__lte=timezone.now()
).exclude(
status__in=self.Model.FINAL_STATES
).order_by('retry_at')
def claim(self, obj) -> bool:
"""Atomic claim via optimistic lock."""
updated = self.Model.objects.filter(
id=obj.id,
retry_at=obj.retry_at
).update(retry_at=timezone.now() + timedelta(seconds=60))
return updated == 1
def run(self):
print(f'[{self.name}] Started pid={os.getpid()}')
while True:
obj = self.get_queue().first()
if obj and self.claim(obj):
try:
obj.refresh_from_db()
obj.tick()
except Exception as e:
print(f'[{self.name}] Error: {e}')
obj.retry_at = timezone.now() + timedelta(seconds=60)
obj.save(update_fields=['retry_at'])
else:
time.sleep(0.5)
class CrawlWorker(Worker):
from crawls.models import Crawl
Model = Crawl
name = 'crawl'
class SnapshotWorker(Worker):
from core.models import Snapshot
Model = Snapshot
name = 'snapshot'
class ExtractorWorker(Worker):
"""Worker for a specific extractor."""
from core.models import ArchiveResult
Model = ArchiveResult
def __init__(self, extractor: str):
self.extractor = extractor
self.name = extractor
def get_queue(self):
return super().get_queue().filter(extractor=self.extractor)
class Orchestrator:
def __init__(self):
self.processes = []
def spawn(self):
config = settings.WORKER_CONCURRENCY
for i in range(config.get('crawl', 2)):
self._spawn(CrawlWorker, f'crawl_{i}')
for i in range(config.get('snapshot', 3)):
self._spawn(SnapshotWorker, f'snapshot_{i}')
for extractor, count in config.items():
if extractor in ('crawl', 'snapshot'):
continue
for i in range(count):
self._spawn(ExtractorWorker, f'{extractor}_{i}', extractor)
def _spawn(self, cls, name, *args):
worker = cls(*args) if args else cls()
worker.name = name
p = multiprocessing.Process(target=worker.run, name=name)
p.start()
self.processes.append(p)
def run(self):
print(f'Orchestrator pid={os.getpid()}')
self.spawn()
try:
while True:
for p in self.processes:
if not p.is_alive():
print(f'{p.name} died, restarting...')
# Respawn logic
time.sleep(5)
except KeyboardInterrupt:
for p in self.processes:
p.terminate()
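The entrypoint wiring is not pinned down above; one plausible sketch (the management command name and wiring are assumptions) for launching this process tree from archivebox server:
# Sketch only: how a management command could start the orchestrator alongside the web process.
from django.core.management.base import BaseCommand
from archivebox.workers.orchestrator import Orchestrator

class Command(BaseCommand):
    help = 'Run the background worker orchestrator (crawl/snapshot/extractor workers)'

    def handle(self, *args, **options):
        Orchestrator().run()   # blocks; spawns and supervises one process per worker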
Directory Structure
archivebox-nue/
├── archivebox/
│ ├── __init__.py
│ ├── config.py # Simple env-based config
│ ├── hooks.py # Hook discovery + execution
│ │
│ ├── core/
│ │ ├── models.py # Snapshot, ArchiveResult, Tag
│ │ ├── admin.py
│ │ └── views.py
│ │
│ ├── crawls/
│ │ ├── models.py # Crawl, Seed, CrawlSchedule, Outlink
│ │ └── admin.py
│ │
│ ├── machine/
│ │ ├── models.py # Machine, NetworkInterface, Dependency, InstalledBinary
│ │ └── admin.py
│ │
│ ├── workers/
│ │ └── orchestrator.py # ~150 lines
│ │
│ ├── api/
│ │ └── ...
│ │
│ ├── cli/
│ │ └── ...
│ │
│ ├── search/
│ │ ├── __init__.py
│ │ └── backends/
│ │ ├── sqlitefts.py
│ │ └── sonic.py
│ │
│ ├── index/
│ ├── parsers/
│ ├── misc/
│ ├── templates/
│ │
│ └── plugins/ # Built-in hooks (ArchiveBox never imports these directly)
│     ├── wget/
│     │   └── on_Snapshot__wget.sh
│     ├── dependencies/
│     │   ├── on_Dependency__install_using_apt_provider.sh
│     │   └── on_Dependency__install_using_custom_bash.py
│     ├── search/
│     │   ├── on_ArchiveResult__index_sqlitefts.py
│     │   └── on_Search__query_sqlitefts.py
│     └── ...
├── data/
│ └── plugins/ # User-provided hooks mirror builtin layout
└── pyproject.toml
Implementation Phases
Phase 1: Build Unified Config + Hook Scaffold
- Implement archivebox.config.get_config() + CLI plumbing (archivebox config --get ... --format=json) without touching abx yet (target usage sketched after this list).
- Add archivebox/hooks.py with dual plugin directories (archivebox/plugins, data/plugins), discovery, and execution helpers.
- Keep the existing ABX/worker system running while new APIs land; surface warnings where abx.pm.* is still in use.
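As a concrete target for that CLI plumbing, a hook should eventually be able to read merged config like this (a sketch; the flags follow the plan above but are not implemented yet):
# Sketch only: how a Python hook would query merged config via the Phase 1 CLI.
import json
import subprocess

def get_config_value(key: str):
    result = subprocess.run(
        ['archivebox', 'config', '--get', key, '--format=json'],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

# e.g. an extractor hook could bail out early if it is disabled:
# if not get_config_value('SAVE_WGET'): sys.exit(0)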
Phase 2: Gradual ABX Removal
- Rename archivebox/pkgs/ to archivebox/pkgs.unused/ and start deleting packages once equivalent hook scripts exist.
- Remove pluggy, python-statemachine, and all abx-* dependencies/workspace entries from pyproject.toml only after consumers are migrated.
- Replace every abx.pm.hook.get_* usage in CLI/config/search/extractors with the new config + hook APIs.
Phase 3: Worker + State Machine Simplification
- Introduce the process-per-model orchestrator while preserving ModelWithStateMachine semantics (Snapshot/Crawl/ArchiveResult).
- Only drop the mixins/statemachine dependency after verifying the new tick() implementations keep retries/backoff/final states identical.
- Ensure Huey/task entry points either delegate to the new orchestrator or are retired cleanly so background work isn't double-run.
Phase 4: Hook-Based Extractors & Dependencies
- Create built-in extractor hooks in archivebox/plugins/*/on_Snapshot__*.{sh,py,js}; have ArchiveResult.run_extractor_hook() capture cmd/pwd/machine/install metadata.
- Implement the new Dependency/InstalledBinary models + CLI commands, and port provider/install logic into hook scripts that only talk via CLI.
- Add CLI helpers archivebox find InstalledBinary, archivebox dependency ... used by all hooks, and document how user plugins extend them.
Phase 5: Search Backends & Indexing Hooks
- Migrate indexing triggers to hook scripts (on_ArchiveResult__index_*) that run standalone and write into $ARCHIVEBOX_DATA_DIR/search.*.
- Implement CLI-driven query hooks (on_Search__query_*) plus lightweight Python wrappers in archivebox/search/backends/.
- Remove any remaining ABX search integration.
What Gets Deleted
archivebox/pkgs/ # ~5,000 lines
archivebox/workers/actor.py # If exists
Dependencies Removed
"pluggy>=1.5.0"
"python-statemachine>=2.3.6"
# + all 30 abx-* packages
Dependencies Kept
"django>=6.0"
"django-ninja>=1.3.0"
"abx-pkg>=0.6.0" # External, for binary management
"click>=8.1.7"
"rich>=13.8.0"
Estimated Savings
| Component | Lines Removed |
|---|---|
| pkgs/ (ABX) | ~5,000 |
| statemachines | ~300 |
| workers/ | ~500 |
| base_models mixins | ~100 |
| Total | ~6,000 lines |
Plus 30+ dependencies removed and a massive reduction in conceptual complexity.
Status: READY FOR IMPLEMENTATION
Begin with Phase 1: build the unified config API and hook scaffold. The archivebox/pkgs/ rename to pkgs.unused/ (deleted after porting) and the import fixes follow in Phase 2.