Background Hooks Implementation Plan
Overview
This plan implements support for long-running background hooks that run concurrently with other extractors, while maintaining proper result collection, cleanup, and state management.
Key Changes:
- Background hooks use `.bg.js`/`.bg.py`/`.bg.sh` suffix
- Runner hashes files and creates ArchiveFile records for tracking
- Filesystem-level deduplication (fdupes, ZFS, Btrfs) handles space savings
- Hooks emit a single JSON output with optional structured data
- Binary FK is optional and only set when the hook reports `cmd`
- Split `output` field into `output_str` (human-readable) and `output_data` (structured)
- Use ArchiveFile model (FK to ArchiveResult) instead of JSON fields for file tracking
- Output stats (size, mimetypes) derived via properties from ArchiveFile queries
Phase 1: Database Migration
Add new fields to ArchiveResult
# archivebox/core/migrations/00XX_archiveresult_background_hooks.py
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', 'XXXX_previous_migration'),
('machine', 'XXXX_latest_machine_migration'),
]
operations = [
# Rename output → output_str for clarity
migrations.RenameField(
model_name='archiveresult',
old_name='output',
new_name='output_str',
),
# Add structured metadata field
migrations.AddField(
model_name='archiveresult',
name='output_data',
field=models.JSONField(
null=True,
blank=True,
help_text='Structured metadata from hook (headers, redirects, etc.)'
),
),
# Add binary FK (optional)
migrations.AddField(
model_name='archiveresult',
name='binary',
field=models.ForeignKey(
'machine.InstalledBinary',
on_delete=models.SET_NULL,
null=True,
blank=True,
help_text='Primary binary used by this hook (optional)'
),
),
]
ArchiveFile Model
Instead of storing file lists and stats as JSON fields on ArchiveResult, we use a normalized model that tracks files with hashes. Deduplication is handled at the filesystem level (fdupes, ZFS, Btrfs, etc.):
# archivebox/core/models.py
class ArchiveFile(models.Model):
"""
Track files produced by an ArchiveResult with hash for integrity checking.
Files remain in their natural filesystem hierarchy. Deduplication is handled
by the filesystem layer (hardlinks via fdupes, ZFS dedup, Btrfs dedup, etc.).
"""
archiveresult = models.ForeignKey(
'ArchiveResult',
on_delete=models.CASCADE,
related_name='files'
)
# Path relative to ArchiveResult output directory
relative_path = models.CharField(
max_length=512,
help_text='Path relative to extractor output dir (e.g., "index.html", "responses/all/file.js")'
)
# Hash for integrity checking and duplicate detection
hash_algorithm = models.CharField(max_length=16, default='sha256')
hash = models.CharField(
max_length=128,
db_index=True,
help_text='SHA-256 hash for integrity and finding duplicates'
)
# Cached filesystem stats
size = models.BigIntegerField(help_text='File size in bytes')
mime_type = models.CharField(max_length=128, blank=True)
created_at = models.DateTimeField(auto_now_add=True)
class Meta:
indexes = [
models.Index(fields=['archiveresult']),
models.Index(fields=['hash']), # Find duplicates across archive
]
unique_together = [['archiveresult', 'relative_path']]
def __str__(self):
return f"{self.archiveresult.extractor}/{self.relative_path}"
@property
def absolute_path(self) -> Path:
"""Get absolute filesystem path."""
return Path(self.archiveresult.pwd) / self.relative_path
Benefits:
- Simple: Single model, no CAS abstraction needed
- Natural hierarchy: Files stay in `snapshot_dir/extractor/file.html`
- Flexible deduplication: User chooses filesystem-level strategy
- Easy browsing: Directory structure matches logical organization
- Integrity checking: Hashes verify file integrity over time
- Duplicate detection: Query by hash to find duplicates for manual review
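The duplicate-detection benefit can be sketched in plain Python (standalone, no Django; names are illustrative): group file contents by their SHA-256 hash and report any hash with more than one path, which mirrors what a query against the indexed `hash` column on ArchiveFile would return.

```python
import hashlib
from collections import defaultdict

def sha256_hex(data: bytes) -> str:
    """Hash file contents the same way the runner would before creating ArchiveFile rows."""
    return hashlib.sha256(data).hexdigest()

def find_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group relative paths by content hash; keep only groups with more than one member."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for relative_path, content in files.items():
        by_hash[sha256_hex(content)].append(relative_path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

dupes = find_duplicates({
    'wget/index.html': b'<html>same</html>',
    'singlefile/index.html': b'<html>same</html>',
    'wget/style.css': b'body {}',
})
```

Once duplicates are identified, the actual space reclamation is left to the filesystem layer (e.g. `fdupes --link`), as described above.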
Phase 2: Hook Output Format
Hooks emit single JSON object to stdout
Contract:
- Hook emits ONE JSON object with `type: 'ArchiveResult'`
- Hook only provides: `status`, `output` (human-readable), optional `output_data`, optional `cmd`
- Runner calculates: `output_size`, `output_mimetypes`, `start_ts`, `end_ts`, `binary` FK
Example outputs:
// Simple string output
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output: 'Downloaded index.html (4.2 KB)'
}));
// With structured metadata
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output: 'Archived https://example.com',
output_data: {
files: ['index.html', 'style.css', 'script.js'],
headers: {'content-type': 'text/html', 'server': 'nginx'},
redirects: [{from: 'http://example.com', to: 'https://example.com'}]
}
}));
// With explicit cmd (for binary FK)
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output: 'Archived with wget',
cmd: ['wget', '-p', '-k', 'https://example.com']
}));
// Just structured data (no human-readable string)
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output_data: {
title: 'My Page Title',
charset: 'UTF-8'
}
}));
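A `.py` hook follows the same contract; a minimal sketch (the field names come from the contract above, the values are made up):

```python
import json

# Hooks emit exactly one JSON object with type='ArchiveResult' on its own
# stdout line; the runner ignores all other stdout lines when parsing.
result = {
    'type': 'ArchiveResult',
    'status': 'succeeded',
    'output': 'Saved 3 files',   # human-readable summary
    'output_data': {             # optional structured metadata
        'files': ['index.html', 'style.css', 'script.js'],
    },
}
print(json.dumps(result))
```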
Phase 3: Update HookResult TypedDict
# archivebox/hooks.py
class HookResult(TypedDict):
"""Result from executing a hook script."""
returncode: int # Process exit code
stdout: str # Full stdout from hook
stderr: str # Full stderr from hook
output_json: Optional[dict] # Parsed JSON output from hook
start_ts: str # ISO timestamp (calculated by runner)
end_ts: str # ISO timestamp (calculated by runner)
cmd: List[str] # Command that ran (from hook or fallback)
binary_id: Optional[str] # FK to InstalledBinary (optional)
hook: str # Path to hook script
Note: output_files, output_size, and output_mimetypes are no longer in HookResult. Instead, the runner hashes files and creates ArchiveFile records. Stats are derived via properties on ArchiveResult.
Phase 4: Update run_hook() Implementation
Location: archivebox/hooks.py
def find_binary_for_cmd(cmd: List[str], machine_id: str) -> Optional[str]:
"""
Find InstalledBinary for a command, trying abspath first then name.
Only matches binaries on the current machine.
Args:
cmd: Command list (e.g., ['/usr/bin/wget', '-p', 'url'])
machine_id: Current machine ID
Returns:
Binary ID if found, None otherwise
"""
if not cmd:
return None
from machine.models import InstalledBinary
bin_path_or_name = cmd[0]
# Try matching by absolute path first
binary = InstalledBinary.objects.filter(
abspath=bin_path_or_name,
machine_id=machine_id
).first()
if binary:
return str(binary.id)
# Fallback: match by binary name
bin_name = Path(bin_path_or_name).name
binary = InstalledBinary.objects.filter(
name=bin_name,
machine_id=machine_id
).first()
return str(binary.id) if binary else None
def parse_hook_output_json(stdout: str) -> Optional[dict]:
"""
Parse single JSON output from hook stdout.
Looks for first line with {type: 'ArchiveResult', ...}
"""
for line in stdout.splitlines():
line = line.strip()
if not line:
continue
try:
data = json.loads(line)
if data.get('type') == 'ArchiveResult':
return data # Return first match
except json.JSONDecodeError:
continue
return None
def run_hook(
script: Path,
output_dir: Path,
timeout: int = 300,
config_objects: Optional[List[Any]] = None,
**kwargs: Any
) -> Optional[HookResult]:
"""
Execute a hook script and capture results.
Runner responsibilities:
- Detect background hooks (.bg. in filename)
- Capture stdout/stderr to log files
- Return result (caller will hash files and create ArchiveFile records)
- Determine binary FK from cmd (optional)
- Clean up log files and PID files
Hook responsibilities:
- Emit {type: 'ArchiveResult', status, output (human-readable), output_data (optional), cmd (optional)}
- Write actual output files
Args:
script: Path to hook script
output_dir: Working directory (where output files go)
timeout: Max execution time in seconds
config_objects: Config override objects (Machine, Crawl, Snapshot)
**kwargs: CLI arguments passed to script
Returns:
HookResult for foreground hooks
None for background hooks (still running)
"""
import time
from datetime import datetime, timezone
from machine.models import Machine
start_time = time.time()
# 1. SETUP
is_background = '.bg.' in script.name # Detect .bg.js/.bg.py/.bg.sh
effective_timeout = timeout * 10 if is_background else timeout
# Infrastructure files (ALL hooks)
stdout_file = output_dir / 'stdout.log'
stderr_file = output_dir / 'stderr.log'
pid_file = output_dir / 'hook.pid'
start_ts = datetime.now(timezone.utc)
# 2. BUILD COMMAND
ext = script.suffix.lower()
if ext == '.sh':
interpreter_cmd = ['bash', str(script)]
elif ext == '.py':
interpreter_cmd = ['python3', str(script)]
elif ext == '.js':
interpreter_cmd = ['node', str(script)]
else:
interpreter_cmd = [str(script)]
# Build CLI arguments from kwargs
cli_args = []
for key, value in kwargs.items():
if key.startswith('_'):
continue
arg_key = f'--{key.replace("_", "-")}'
if isinstance(value, bool):
if value:
cli_args.append(arg_key)
elif value is not None and value != '':
if isinstance(value, (dict, list)):
cli_args.append(f'{arg_key}={json.dumps(value)}')
else:
str_value = str(value).strip()
if str_value:
cli_args.append(f'{arg_key}={str_value}')
full_cmd = interpreter_cmd + cli_args
# 3. SET UP ENVIRONMENT
env = os.environ.copy()
# ... (existing env setup from current run_hook implementation)
# 4. CREATE OUTPUT DIRECTORY
output_dir.mkdir(parents=True, exist_ok=True)
# 5. EXECUTE PROCESS
try:
with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err:
process = subprocess.Popen(
full_cmd,
cwd=str(output_dir),
stdout=out,
stderr=err,
env=env,
)
# Write PID for all hooks
pid_file.write_text(str(process.pid))
if is_background:
# Background hook - return immediately, don't wait
return None
# Foreground hook - wait for completion
try:
returncode = process.wait(timeout=effective_timeout)
except subprocess.TimeoutExpired:
process.kill()
process.wait()
returncode = -1
with open(stderr_file, 'a') as err:
err.write(f'\nHook timed out after {effective_timeout}s')
# 6. COLLECT RESULTS (foreground only)
end_ts = datetime.now(timezone.utc)
stdout = stdout_file.read_text() if stdout_file.exists() else ''
stderr = stderr_file.read_text() if stderr_file.exists() else ''
# Parse single JSON output
output_json = parse_hook_output_json(stdout)
# Get cmd - prefer hook's reported cmd, fallback to interpreter cmd
if output_json and output_json.get('cmd'):
result_cmd = output_json['cmd']
else:
result_cmd = full_cmd
# 7. DETERMINE BINARY FK (OPTIONAL)
# Only set if hook reports cmd AND we can find the binary
machine = Machine.current()
binary_id = None
if output_json and output_json.get('cmd'):
binary_id = find_binary_for_cmd(output_json['cmd'], machine.id)
# If not found or not reported, leave binary_id=None
# 8. OUTPUT FILE INGESTION (handled by caller)
# ArchiveResult.run() hashes output files and creates ArchiveFile records
# after run_hook() returns, so we just return the result here.
# 9. CLEANUP
# Delete empty logs (keep non-empty for debugging)
if stdout_file.exists() and stdout_file.stat().st_size == 0:
stdout_file.unlink()
if stderr_file.exists() and stderr_file.stat().st_size == 0:
stderr_file.unlink()
# Delete ALL .pid files on success
if returncode == 0:
for pf in output_dir.glob('*.pid'):
pf.unlink(missing_ok=True)
# 10. RETURN RESULT
return HookResult(
returncode=returncode,
stdout=stdout,
stderr=stderr,
output_json=output_json,
start_ts=start_ts.isoformat(),
end_ts=end_ts.isoformat(),
cmd=result_cmd,
binary_id=binary_id,
hook=str(script),
)
except Exception as e:
return HookResult(
returncode=-1,
stdout='',
stderr=f'Failed to run hook: {type(e).__name__}: {e}',
output_json=None,
start_ts=start_ts.isoformat(),
end_ts=datetime.now(timezone.utc).isoformat(),
cmd=full_cmd,
binary_id=None,
hook=str(script),
)
Phase 5: Update ArchiveResult.run()
Location: archivebox/core/models.py
def run(self):
"""
Execute this ArchiveResult's extractor and update status.
For foreground hooks: Waits for completion and updates immediately
For background hooks: Returns immediately, leaves status='started'
"""
from django.utils import timezone
from archivebox.hooks import BUILTIN_PLUGINS_DIR, USER_PLUGINS_DIR, run_hook
import dateutil.parser
config_objects = [self.snapshot.crawl, self.snapshot] if self.snapshot.crawl else [self.snapshot]
# Find hook for this extractor
hook = None
for base_dir in (BUILTIN_PLUGINS_DIR, USER_PLUGINS_DIR):
if not base_dir.exists():
continue
matches = list(base_dir.glob(f'*/on_Snapshot__{self.extractor}.*'))
if matches:
hook = matches[0]
break
if not hook:
self.status = self.StatusChoices.FAILED
self.output_str = f'No hook found for: {self.extractor}'
self.retry_at = None
self.save()
return
# Use plugin directory name instead of extractor name
plugin_name = hook.parent.name
extractor_dir = Path(self.snapshot.output_dir) / plugin_name
# Run the hook
result = run_hook(
hook,
output_dir=extractor_dir,
config_objects=config_objects,
url=self.snapshot.url,
snapshot_id=str(self.snapshot.id),
)
# BACKGROUND HOOK - still running
if result is None:
self.status = self.StatusChoices.STARTED
self.start_ts = timezone.now()
self.pwd = str(extractor_dir)
self.save()
return
# FOREGROUND HOOK - process result
if result['output_json']:
# Hook emitted JSON output
output_json = result['output_json']
# Determine status
status = output_json.get('status', 'failed')
status_map = {
'succeeded': self.StatusChoices.SUCCEEDED,
'failed': self.StatusChoices.FAILED,
'skipped': self.StatusChoices.SKIPPED,
}
self.status = status_map.get(status, self.StatusChoices.FAILED)
# Set output fields
self.output_str = output_json.get('output', '')
if 'output_data' in output_json:
self.output_data = output_json['output_data']
else:
# No JSON output - determine status from exit code
self.status = (self.StatusChoices.SUCCEEDED if result['returncode'] == 0
else self.StatusChoices.FAILED)
self.output_str = result['stdout'][:1024] or result['stderr'][:1024]
# Set timestamps (from runner)
self.start_ts = dateutil.parser.parse(result['start_ts'])
self.end_ts = dateutil.parser.parse(result['end_ts'])
# Set command and binary (from runner)
self.cmd = json.dumps(result['cmd'])
if result['binary_id']:
self.binary_id = result['binary_id']
# Metadata
self.pwd = str(extractor_dir)
self.retry_at = None
self.save()
# INGEST OUTPUT FILES
# Hash each output file and create ArchiveFile records
# (infrastructure files like logs and PID files are excluded)
if extractor_dir.exists():
    import hashlib
    import mimetypes
    for path in extractor_dir.rglob('*'):
        if not path.is_file() or path.name in ('stdout.log', 'stderr.log') or path.suffix == '.pid':
            continue
        ArchiveFile.objects.update_or_create(
            archiveresult=self,
            relative_path=str(path.relative_to(extractor_dir)),
            defaults={
                'hash': hashlib.sha256(path.read_bytes()).hexdigest(),
                'size': path.stat().st_size,
                'mime_type': mimetypes.guess_type(path.name)[0] or '',
            },
        )
# Clean up empty output directory (no real files after excluding logs/pids)
if extractor_dir.exists():
try:
# Check if only infrastructure files remain
remaining_files = [
f for f in extractor_dir.rglob('*')
if f.is_file() and f.name not in ('stdout.log', 'stderr.log', 'hook.pid', 'listener.pid')
]
if not remaining_files:
# Remove infrastructure files
for pf in extractor_dir.glob('*.log'):
pf.unlink(missing_ok=True)
for pf in extractor_dir.glob('*.pid'):
pf.unlink(missing_ok=True)
# Try to remove directory if empty
if not any(extractor_dir.iterdir()):
extractor_dir.rmdir()
except (OSError, RuntimeError):
pass
# Queue discovered URLs, trigger indexing, etc.
self._queue_urls_for_crawl(extractor_dir)
if self.status == self.StatusChoices.SUCCEEDED:
# Update snapshot title if this is title extractor
extractor_name = get_extractor_name(self.extractor)
if extractor_name == 'title':
self._update_snapshot_title(extractor_dir)
# Trigger search indexing
self.trigger_search_indexing()
Phase 6: Background Hook Finalization
Helper Functions
Location: archivebox/core/models.py or new archivebox/core/background_hooks.py
def find_background_hooks(snapshot) -> List['ArchiveResult']:
"""
Find all ArchiveResults that are background hooks still running.
Args:
snapshot: Snapshot instance
Returns:
List of ArchiveResults with status='started'
"""
return list(snapshot.archiveresult_set.filter(
status=ArchiveResult.StatusChoices.STARTED
))
def check_background_hook_completed(archiveresult: 'ArchiveResult') -> bool:
"""
Check if background hook process has exited.
Args:
archiveresult: ArchiveResult instance
Returns:
True if completed (process exited), False if still running
"""
extractor_dir = Path(archiveresult.pwd)
pid_file = extractor_dir / 'hook.pid'
if not pid_file.exists():
return True # No PID file = completed or failed to start
try:
pid = int(pid_file.read_text().strip())
os.kill(pid, 0) # Signal 0 = check if process exists
return False # Still running
except (OSError, ValueError):
return True # Process exited or invalid PID
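The signal-0 liveness check above can be exercised standalone. This sketch (function name hypothetical) starts a short-lived child, confirms it is seen as running, then confirms it reads as exited once killed and reaped:

```python
import os
import subprocess
import sys

def pid_is_running(pid: int) -> bool:
    """os.kill(pid, 0) sends no signal; it only checks whether the PID exists."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

proc = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(30)'])
running_before = pid_is_running(proc.pid)   # child is sleeping, so True
proc.kill()
proc.wait()                                 # reap the child so its PID is released
running_after = pid_is_running(proc.pid)
```

One caveat: a zombie (exited but not yet reaped) still passes the signal-0 check, so a background hook that double-forks or is reparented to init avoids false "still running" results.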
def finalize_background_hook(archiveresult: 'ArchiveResult') -> None:
"""
Collect final results from completed background hook.
Runner calculates all stats - hook just emits status/output/output_data.
Args:
archiveresult: ArchiveResult instance to finalize
"""
from django.utils import timezone
from machine.models import Machine
import dateutil.parser
extractor_dir = Path(archiveresult.pwd)
stdout_file = extractor_dir / 'stdout.log'
stderr_file = extractor_dir / 'stderr.log'
# Read logs
stdout = stdout_file.read_text() if stdout_file.exists() else ''
stderr = stderr_file.read_text() if stderr_file.exists() else ''
# Parse JSON output
output_json = parse_hook_output_json(stdout)
# Determine status
if output_json:
status_str = output_json.get('status', 'failed')
status_map = {
'succeeded': ArchiveResult.StatusChoices.SUCCEEDED,
'failed': ArchiveResult.StatusChoices.FAILED,
'skipped': ArchiveResult.StatusChoices.SKIPPED,
}
status = status_map.get(status_str, ArchiveResult.StatusChoices.FAILED)
output_str = output_json.get('output', '')
output_data = output_json.get('output_data')
# Get cmd from hook (for binary FK)
cmd = output_json.get('cmd')
else:
# No JSON output = failed
status = ArchiveResult.StatusChoices.FAILED
output_str = stderr[:1024] if stderr else 'No output'
output_data = None
cmd = None
# Get binary FK from hook's reported cmd (if any)
binary_id = None
if cmd:
machine = Machine.current()
binary_id = find_binary_for_cmd(cmd, machine.id)
# Update ArchiveResult
archiveresult.status = status
archiveresult.end_ts = timezone.now()
archiveresult.output_str = output_str
if output_data:
archiveresult.output_data = output_data
archiveresult.retry_at = None
if binary_id:
archiveresult.binary_id = binary_id
archiveresult.save()
# INGEST OUTPUT FILES
# Hash each output file and create ArchiveFile records
# (infrastructure files like logs and PID files are excluded)
if extractor_dir.exists():
    import hashlib
    import mimetypes
    for path in extractor_dir.rglob('*'):
        if not path.is_file() or path.name in ('stdout.log', 'stderr.log') or path.suffix == '.pid':
            continue
        ArchiveFile.objects.update_or_create(
            archiveresult=archiveresult,
            relative_path=str(path.relative_to(extractor_dir)),
            defaults={
                'hash': hashlib.sha256(path.read_bytes()).hexdigest(),
                'size': path.stat().st_size,
                'mime_type': mimetypes.guess_type(path.name)[0] or '',
            },
        )
# Cleanup
for pf in extractor_dir.glob('*.pid'):
pf.unlink(missing_ok=True)
if stdout_file.exists() and stdout_file.stat().st_size == 0:
stdout_file.unlink()
if stderr_file.exists() and stderr_file.stat().st_size == 0:
stderr_file.unlink()
Update SnapshotMachine
Location: archivebox/core/statemachines.py
class SnapshotMachine(StateMachine, strict_states=True):
# ... existing states ...
def is_finished(self) -> bool:
"""
Check if snapshot archiving is complete.
A snapshot is finished when:
1. No pending archiveresults remain (queued/started foreground hooks)
2. All background hooks have completed
"""
# Check if any pending archiveresults exist
if self.snapshot.pending_archiveresults().exists():
return False
# Check and finalize background hooks
background_hooks = find_background_hooks(self.snapshot)
for bg_hook in background_hooks:
if not check_background_hook_completed(bg_hook):
return False # Still running
# Completed - finalize it
finalize_background_hook(bg_hook)
# All done
return True
Phase 6b: ArchiveResult Properties for Output Stats
Since output stats are no longer stored as fields, we expose them via properties that query ArchiveFile records:
# archivebox/core/models.py
class ArchiveResult(models.Model):
# ... existing fields ...
@property
def output_files(self):
    """
    Get all ArchiveFile records created by this extractor.
    Returns:
        QuerySet of ArchiveFile objects
    """
    return self.files.all()

@property
def output_file_count(self) -> int:
    """Count of output files."""
    return self.output_files.count()

@property
def total_output_size(self) -> int:
    """
    Total size in bytes of all output files.
    Returns:
        Sum of file sizes for this extractor's files
    """
    from django.db.models import Sum
    result = self.output_files.aggregate(total=Sum('size'))
    return result['total'] or 0

@property
def output_mimetypes(self) -> str:
    """
    CSV of mimetypes ordered by total size descending.
    Returns:
        String like "text/html,image/png,application/json"
    """
    from django.db.models import Sum
    files = self.output_files.values('mime_type').annotate(
        total_size=Sum('size')
    ).order_by('-total_size')
    return ','.join(f['mime_type'] for f in files if f['mime_type'])

@property
def output_summary(self) -> dict:
    """
    Summary statistics for output files.
    Returns:
        Dict with file count, total size, and mimetype breakdown
    """
    from django.db.models import Sum, Count
    files = self.output_files.values('mime_type').annotate(
        count=Count('id'),
        total_size=Sum('size')
    ).order_by('-total_size')
    return {
        'file_count': self.output_file_count,
        'total_size': self.total_output_size,
        'by_mimetype': list(files),
    }

def _get_plugin_name(self) -> str:
    """
    Get the plugin directory name for this extractor.
    Returns:
        Plugin name (e.g., 'wget', 'singlefile')
    """
    # pwd is set to extractor_dir during run()
    if self.pwd:
        return Path(self.pwd).name
    # Fallback: assume the extractor name matches the plugin directory
    return self.extractor
Query Examples:
# Get all files for this extractor
files = archiveresult.output_files.all()
# Get total size
size = archiveresult.total_output_size
# Get mimetype breakdown
summary = archiveresult.output_summary
# {
#   'file_count': 42,
#   'total_size': 1048576,
#   'by_mimetype': [
#     {'mime_type': 'text/html', 'count': 5, 'total_size': 524288},
#     {'mime_type': 'image/png', 'count': 30, 'total_size': 409600},
#     ...
#   ]
# }
# Admin display
print(f"{archiveresult.output_mimetypes}")  # "text/html,image/png,text/css"
Performance Considerations:
- Properties execute queries on access - cache results if needed
- Indexes on `archiveresult` and `hash` make these queries fast
- For admin list views, use `select_related()` and `prefetch_related()`
- Consider adding `cached_property` for expensive calculations
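The `cached_property` suggestion can be illustrated with a plain class (standalone sketch; the Django version would decorate the ArchiveResult properties the same way, with the usual caveat that the cache lives per model instance):

```python
from functools import cached_property

class Stats:
    """Simulate an expensive aggregate that should run at most once per instance."""
    def __init__(self) -> None:
        self.query_count = 0

    @cached_property
    def total_output_size(self) -> int:
        self.query_count += 1   # stands in for a DB aggregate query
        return 1048576

stats = Stats()
sizes = (stats.total_output_size, stats.total_output_size)  # second access hits the cache
```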
Phase 7: Rename Background Hooks
Files to rename:
# Use .bg. suffix (not __background)
mv archivebox/plugins/consolelog/on_Snapshot__21_consolelog.js \
archivebox/plugins/consolelog/on_Snapshot__21_consolelog.bg.js
mv archivebox/plugins/ssl/on_Snapshot__23_ssl.js \
archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js
mv archivebox/plugins/responses/on_Snapshot__24_responses.js \
archivebox/plugins/responses/on_Snapshot__24_responses.bg.js
Update hook content to emit proper JSON:
Each hook should emit:
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded', // or 'failed' or 'skipped'
output: 'Captured 15 console messages', // human-readable summary
output_data: { // optional structured metadata
// ... specific to each hook
}
}));
Phase 8: Update Existing Hooks
Update all hooks to emit proper JSON format
Example: favicon hook
# Before
print(f'Favicon saved ({size} bytes)')
print(f'OUTPUT={OUTPUT_FILE}')
print(f'STATUS=succeeded')
# After
result = {
'type': 'ArchiveResult',
'status': 'succeeded',
'output': f'Favicon saved ({size} bytes)',
'output_data': {
'size': size,
'format': 'ico'
}
}
print(json.dumps(result))
Example: wget hook with explicit cmd
# After wget completes
cat <<EOF
{"type": "ArchiveResult", "status": "succeeded", "output": "Downloaded index.html", "cmd": ["wget", "-p", "-k", "$URL"]}
EOF
Testing Strategy
1. Unit Tests
# tests/test_background_hooks.py
def test_background_hook_detection():
"""Test .bg. suffix detection"""
assert is_background_hook(Path('on_Snapshot__21_test.bg.js'))
assert not is_background_hook(Path('on_Snapshot__21_test.js'))
def test_find_binary_by_abspath():
"""Test binary matching by absolute path"""
machine = Machine.current()
binary = InstalledBinary.objects.create(
name='wget',
abspath='/usr/bin/wget',
machine=machine
)
cmd = ['/usr/bin/wget', '-p', 'url']
assert find_binary_for_cmd(cmd, machine.id) == str(binary.id)
def test_find_binary_by_name():
"""Test binary matching by name fallback"""
machine = Machine.current()
binary = InstalledBinary.objects.create(
name='wget',
abspath='/usr/local/bin/wget',
machine=machine
)
cmd = ['wget', '-p', 'url']
assert find_binary_for_cmd(cmd, machine.id) == str(binary.id)
def test_parse_hook_json():
"""Test JSON parsing from stdout"""
stdout = '''
Some log output
{"type": "ArchiveResult", "status": "succeeded", "output": "test"}
More output
'''
result = parse_hook_output_json(stdout)
assert result['status'] == 'succeeded'
assert result['output'] == 'test'
2. Integration Tests
def test_foreground_hook_execution(snapshot):
"""Test foreground hook runs and returns results"""
ar = ArchiveResult.objects.create(
snapshot=snapshot,
extractor='11_favicon',
status=ArchiveResult.StatusChoices.QUEUED
)
ar.run()
ar.refresh_from_db()
assert ar.status in [
ArchiveResult.StatusChoices.SUCCEEDED,
ArchiveResult.StatusChoices.FAILED
]
assert ar.start_ts is not None
assert ar.end_ts is not None
assert ar.total_output_size >= 0
def test_background_hook_execution(snapshot):
"""Test background hook starts but doesn't block"""
ar = ArchiveResult.objects.create(
snapshot=snapshot,
extractor='21_consolelog',
status=ArchiveResult.StatusChoices.QUEUED
)
start = time.time()
ar.run()
duration = time.time() - start
ar.refresh_from_db()
# Should return quickly (< 5 seconds)
assert duration < 5
# Should be in 'started' state
assert ar.status == ArchiveResult.StatusChoices.STARTED
# PID file should exist
assert (Path(ar.pwd) / 'hook.pid').exists()
def test_background_hook_finalization(snapshot):
"""Test background hook finalization after completion"""
# Start background hook
ar = ArchiveResult.objects.create(
snapshot=snapshot,
extractor='21_consolelog',
status=ArchiveResult.StatusChoices.STARTED,
pwd='/path/to/output'
)
# Simulate completion (hook writes output and exits)
# ...
# Finalize
finalize_background_hook(ar)
ar.refresh_from_db()
assert ar.status == ArchiveResult.StatusChoices.SUCCEEDED
assert ar.end_ts is not None
assert ar.total_output_size > 0
Migration Path
Step 1: Create migration
cd archivebox
python manage.py makemigrations core --name archiveresult_background_hooks
Step 2: Update run_hook()
- Add background hook detection
- Add log file capture
- Add output stat calculation
- Add binary FK lookup
Step 3: Update ArchiveResult.run()
- Handle None result for background hooks
- Update field names (output → output_str, add output_data)
- Set binary FK
Step 4: Add finalization helpers
- find_background_hooks()
- check_background_hook_completed()
- finalize_background_hook()
Step 5: Update SnapshotMachine.is_finished()
- Check for background hooks
- Finalize completed ones
Step 6: Rename hooks
- Rename 3 background hooks with .bg. suffix
Step 7: Update hook outputs
- Update all hooks to emit JSON format
- Remove manual timestamp/status calculation
Step 8: Test
- Unit tests
- Integration tests
- Manual testing with real snapshots
Success Criteria
- ✅ Background hooks start immediately without blocking other extractors
- ✅ Background hooks are finalized after completion with full results
- ✅ All output stats calculated by runner, not hooks
- ✅ Binary FK optional and only set when determinable
- ✅ Clean separation between output_str (human) and output_data (machine)
- ✅ Log files cleaned up on success, kept on failure
- ✅ PID files cleaned up after completion
- ✅ No plugin-specific code in core (generic polling mechanism)
Future Enhancements
1. Timeout for orphaned background hooks
If a background hook runs longer than MAX_LIFETIME after all foreground hooks complete, force kill it.
2. Progress reporting
Background hooks could write progress to a file that gets polled:
fs.writeFileSync('progress.txt', '50%');
3. Multiple results per hook
If needed in future, extend to support multiple JSON outputs by collecting all {type: 'ArchiveResult'} lines.
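That extension would turn parse_hook_output_json() into a collector; a hedged sketch (function name hypothetical):

```python
import json
from typing import List

def parse_all_hook_outputs(stdout: str) -> List[dict]:
    """Collect every {'type': 'ArchiveResult'} JSON line from hook stdout."""
    results = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON log noise is ignored, same as today
        if isinstance(data, dict) and data.get('type') == 'ArchiveResult':
            results.append(data)
    return results

outputs = parse_all_hook_outputs(
    'log noise\n'
    '{"type": "ArchiveResult", "status": "succeeded", "output": "part 1"}\n'
    '{"type": "ArchiveResult", "status": "succeeded", "output": "part 2"}\n'
)
```

The open design question would be how multiple results map onto a single ArchiveResult row (merge vs. one row per emitted object).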
4. Dependency tracking
Store all binaries used by a hook (not just primary), useful for hooks that chain multiple tools.