mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-03 06:17:53 +10:00

Nick Sweeting 4fd7fcdbcf new gallerydl plugin and more

2025-12-26 11:55:03 -08:00

Background Hooks Implementation Plan

Overview

This plan implements support for long-running background hooks that run concurrently with other extractors, while maintaining proper result collection, cleanup, and state management.

Key Changes:

Background hooks use .bg.js/.bg.py/.bg.sh suffix
Runner hashes files and creates ArchiveFile records for tracking
Filesystem-level deduplication (fdupes, ZFS, Btrfs) handles space savings
Hooks emit a single JSON object with optional structured data
Binary FK is optional and only set when hook reports cmd
Split output field into output_str (human-readable) and output_data (structured)
Use ArchiveFile model (FK to ArchiveResult) instead of JSON fields for file tracking
Output stats (size, mimetypes) derived via properties from ArchiveFile queries

Phase 1: Database Migration

Add new fields to ArchiveResult

# archivebox/core/migrations/00XX_archiveresult_background_hooks.py

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [
        ('core', 'XXXX_previous_migration'),
        ('machine', 'XXXX_latest_machine_migration'),
    ]

    operations = [
        # Rename output → output_str for clarity
        migrations.RenameField(
            model_name='archiveresult',
            old_name='output',
            new_name='output_str',
        ),

        # Add structured metadata field
        migrations.AddField(
            model_name='archiveresult',
            name='output_data',
            field=models.JSONField(
                null=True,
                blank=True,
                help_text='Structured metadata from hook (headers, redirects, etc.)'
            ),
        ),

        # Add binary FK (optional)
        migrations.AddField(
            model_name='archiveresult',
            name='binary',
            field=models.ForeignKey(
                'machine.InstalledBinary',
                on_delete=models.SET_NULL,
                null=True,
                blank=True,
                help_text='Primary binary used by this hook (optional)'
            ),
        ),
    ]

ArchiveFile Model

Instead of storing file lists and stats as JSON fields on ArchiveResult, we use a normalized model that tracks files with hashes. Deduplication is handled at the filesystem level (fdupes, ZFS, Btrfs, etc.):

# archivebox/core/models.py

from pathlib import Path

from django.db import models

class ArchiveFile(models.Model):
    """
    Track files produced by an ArchiveResult with hash for integrity checking.

    Files remain in their natural filesystem hierarchy. Deduplication is handled
    by the filesystem layer (hardlinks via fdupes, ZFS dedup, Btrfs dedup, etc.).
    """
    archiveresult = models.ForeignKey(
        'ArchiveResult',
        on_delete=models.CASCADE,
        related_name='files'
    )

    # Path relative to ArchiveResult output directory
    relative_path = models.CharField(
        max_length=512,
        help_text='Path relative to extractor output dir (e.g., "index.html", "responses/all/file.js")'
    )

    # Hash for integrity checking and duplicate detection
    hash_algorithm = models.CharField(max_length=16, default='sha256')
    hash = models.CharField(
        max_length=128,
        db_index=True,
        help_text='SHA-256 hash for integrity and finding duplicates'
    )

    # Cached filesystem stats
    size = models.BigIntegerField(help_text='File size in bytes')
    mime_type = models.CharField(max_length=128, blank=True)

    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        # No explicit indexes are needed: the ForeignKey is indexed automatically,
        # and `hash` already sets db_index=True (used to find duplicates across the archive).
        unique_together = [['archiveresult', 'relative_path']]

    def __str__(self):
        return f"{self.archiveresult.extractor}/{self.relative_path}"

    @property
    def absolute_path(self) -> Path:
        """Get absolute filesystem path."""
        return Path(self.archiveresult.pwd) / self.relative_path

Benefits:

Simple: Single model, no CAS abstraction needed
Natural hierarchy: Files stay in snapshot_dir/extractor/file.html
Flexible deduplication: User chooses filesystem-level strategy
Easy browsing: Directory structure matches logical organization
Integrity checking: Hashes verify file integrity over time
Duplicate detection: Query by hash to find duplicates for manual review
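
As a rough illustration of the hashing step that would feed these records, the sketch below walks an output directory and computes the values an ArchiveFile row would hold. The names `sha256_file`, `describe_output_files`, and `INFRA_NAMES` are hypothetical and not part of the plan; the sketch uses only the standard library and returns plain dicts instead of saving models.

```python
import hashlib
import mimetypes
from pathlib import Path

# Assumed infrastructure file names, based on the runner's log conventions
INFRA_NAMES = {'stdout.log', 'stderr.log'}

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large outputs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open('rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def describe_output_files(output_dir: Path) -> list:
    """Return one dict per output file with the fields an ArchiveFile row would store."""
    rows = []
    for path in sorted(output_dir.rglob('*')):
        # Skip directories, log files, and PID files
        if not path.is_file() or path.name in INFRA_NAMES or path.suffix == '.pid':
            continue
        rows.append({
            'relative_path': path.relative_to(output_dir).as_posix(),
            'hash_algorithm': 'sha256',
            'hash': sha256_file(path),
            'size': path.stat().st_size,
            'mime_type': mimetypes.guess_type(path.name)[0] or '',
        })
    return rows
```

Grouping the returned rows by `hash` would surface duplicate candidates in the same way a query on the indexed `hash` column would.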

Phase 2: Hook Output Format

Hooks emit a single JSON object to stdout

Contract:

Hook emits ONE JSON object with type: 'ArchiveResult'
Hook only provides: status, output (human-readable), optional output_data, optional cmd
Runner records: start_ts, end_ts, and the binary FK; it also hashes output files into ArchiveFile records, from which output size and mimetypes are derived

Example outputs:

// Simple string output
console.log(JSON.stringify({
    type: 'ArchiveResult',
    status: 'succeeded',
    output: 'Downloaded index.html (4.2 KB)'
}));

// With structured metadata
console.log(JSON.stringify({
    type: 'ArchiveResult',
    status: 'succeeded',
    output: 'Archived https://example.com',
    output_data: {
        files: ['index.html', 'style.css', 'script.js'],
        headers: {'content-type': 'text/html', 'server': 'nginx'},
        redirects: [{from: 'http://example.com', to: 'https://example.com'}]
    }
}));

// With explicit cmd (for binary FK)
console.log(JSON.stringify({
    type: 'ArchiveResult',
    status: 'succeeded',
    output: 'Archived with wget',
    cmd: ['wget', '-p', '-k', 'https://example.com']
}));

// Just structured data (no human-readable string)
console.log(JSON.stringify({
    type: 'ArchiveResult',
    status: 'succeeded',
    output_data: {
        title: 'My Page Title',
        charset: 'UTF-8'
    }
}));
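
Python hooks can emit the same object. The helper below is a minimal sketch of the contract above; `build_result_line` and `VALID_STATUSES` are hypothetical names, and the accepted statuses are assumed from the status values the runner maps (succeeded, failed, skipped).

```python
import json
from typing import Optional

VALID_STATUSES = {'succeeded', 'failed', 'skipped'}

def build_result_line(
    status: str,
    output: Optional[str] = None,
    output_data: Optional[dict] = None,
    cmd: Optional[list] = None,
) -> str:
    """Build the single-line JSON object a hook prints to stdout."""
    if status not in VALID_STATUSES:
        raise ValueError(f'Unknown status: {status!r}')
    result = {'type': 'ArchiveResult', 'status': status}
    # Optional keys are omitted entirely when not provided
    if output is not None:
        result['output'] = output
    if output_data is not None:
        result['output_data'] = output_data
    if cmd:
        result['cmd'] = list(cmd)
    return json.dumps(result)

# A hook would print exactly one such line
print(build_result_line('succeeded', output='Downloaded index.html (4.2 KB)'))
```

Because `json.dumps` escapes newlines inside strings, the object always stays on one line, which is what a line-by-line parser on the runner side needs.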

Phase 3: Update HookResult TypedDict

# archivebox/hooks.py

class HookResult(TypedDict):
    """Result from executing a hook script."""
    returncode: int                   # Process exit code
    stdout: str                       # Full stdout from hook
    stderr: str                       # Full stderr from hook
    output_json: Optional[dict]       # Parsed JSON output from hook
    start_ts: str                     # ISO timestamp (calculated by runner)
    end_ts: str                       # ISO timestamp (calculated by runner)
    cmd: List[str]                    # Command that ran (from hook or fallback)
    binary_id: Optional[str]          # FK to InstalledBinary (optional)
    hook: str                         # Path to hook script

Note: output_files, output_size, and output_mimetypes are no longer in HookResult. Instead, the runner hashes files and creates ArchiveFile records. Stats are derived via properties on ArchiveResult.

Phase 4: Update run_hook() Implementation

Location: `archivebox/hooks.py`

def find_binary_for_cmd(cmd: List[str], machine_id: str) -> Optional[str]:
    """
    Find InstalledBinary for a command, trying abspath first then name.
    Only matches binaries on the current machine.

    Args:
        cmd: Command list (e.g., ['/usr/bin/wget', '-p', 'url'])
        machine_id: Current machine ID

    Returns:
        Binary ID if found, None otherwise
    """
    if not cmd:
        return None

    from machine.models import InstalledBinary

    bin_path_or_name = cmd[0]

    # Try matching by absolute path first
    binary = InstalledBinary.objects.filter(
        abspath=bin_path_or_name,
        machine_id=machine_id
    ).first()

    if binary:
        return str(binary.id)

    # Fallback: match by binary name
    bin_name = Path(bin_path_or_name).name
    binary = InstalledBinary.objects.filter(
        name=bin_name,
        machine_id=machine_id
    ).first()

    return str(binary.id) if binary else None


def parse_hook_output_json(stdout: str) -> Optional[dict]:
    """
    Parse single JSON output from hook stdout.

    Looks for first line with {type: 'ArchiveResult', ...}
    """
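    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # Not JSON (e.g. plain log output)
        # Skip valid JSON that is not an object (numbers, strings, lists)
        if isinstance(data, dict) and data.get('type') == 'ArchiveResult':
            return data  # Return first match
    return None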


def run_hook(
    script: Path,
    output_dir: Path,
    timeout: int = 300,
    config_objects: Optional[List[Any]] = None,
    **kwargs: Any
) -> Optional[HookResult]:
    """
    Execute a hook script and capture results.

    Runner responsibilities:
    - Detect background hooks (.bg. in filename)
    - Capture stdout/stderr to log files
    - Return result (caller will hash files and create ArchiveFile records)
    - Determine binary FK from cmd (optional)
    - Clean up log files and PID files

    Hook responsibilities:
    - Emit {type: 'ArchiveResult', status, output, output_data (optional), cmd (optional)}
    - Write actual output files

    Args:
        script: Path to hook script
        output_dir: Working directory (where output files go)
        timeout: Max execution time in seconds
        config_objects: Config override objects (Machine, Crawl, Snapshot)
        **kwargs: CLI arguments passed to script

    Returns:
        HookResult for foreground hooks
        None for background hooks (still running)
    """
    import time
    from datetime import datetime, timezone
    from machine.models import Machine

    start_time = time.time()

    # 1. SETUP
    is_background = '.bg.' in script.name  # Detect .bg.js/.bg.py/.bg.sh
    effective_timeout = timeout * 10 if is_background else timeout

    # Infrastructure files (ALL hooks)
    stdout_file = output_dir / 'stdout.log'
    stderr_file = output_dir / 'stderr.log'
    pid_file = output_dir / 'hook.pid'

    # Capture files before execution
    files_before = set(output_dir.rglob('*')) if output_dir.exists() else set()
    start_ts = datetime.now(timezone.utc)

    # 2. BUILD COMMAND
    ext = script.suffix.lower()
    if ext == '.sh':
        interpreter_cmd = ['bash', str(script)]
    elif ext == '.py':
        interpreter_cmd = ['python3', str(script)]
    elif ext == '.js':
        interpreter_cmd = ['node', str(script)]
    else:
        interpreter_cmd = [str(script)]

    # Build CLI arguments from kwargs
    cli_args = []
    for key, value in kwargs.items():
        if key.startswith('_'):
            continue

        arg_key = f'--{key.replace("_", "-")}'
        if isinstance(value, bool):
            if value:
                cli_args.append(arg_key)
        elif value is not None and value != '':
            if isinstance(value, (dict, list)):
                cli_args.append(f'{arg_key}={json.dumps(value)}')
            else:
                str_value = str(value).strip()
                if str_value:
                    cli_args.append(f'{arg_key}={str_value}')

    full_cmd = interpreter_cmd + cli_args

    # 3. SET UP ENVIRONMENT
    env = os.environ.copy()
    # ... (existing env setup from current run_hook implementation)

    # 4. CREATE OUTPUT DIRECTORY
    output_dir.mkdir(parents=True, exist_ok=True)

    # 5. EXECUTE PROCESS
    try:
        with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err:
            process = subprocess.Popen(
                full_cmd,
                cwd=str(output_dir),
                stdout=out,
                stderr=err,
                env=env,
            )

            # Write PID for all hooks
            pid_file.write_text(str(process.pid))

            if is_background:
                # Background hook - return immediately, don't wait
                return None

            # Foreground hook - wait for completion
            try:
                returncode = process.wait(timeout=effective_timeout)
            except subprocess.TimeoutExpired:
                process.kill()
                process.wait()
                returncode = -1
                with open(stderr_file, 'a') as err:
                    err.write(f'\nHook timed out after {effective_timeout}s')

        # 6. COLLECT RESULTS (foreground only)
        end_ts = datetime.now(timezone.utc)

        stdout = stdout_file.read_text() if stdout_file.exists() else ''
        stderr = stderr_file.read_text() if stderr_file.exists() else ''

        # Parse single JSON output
        output_json = parse_hook_output_json(stdout)

        # Get cmd - prefer hook's reported cmd, fallback to interpreter cmd
        if output_json and output_json.get('cmd'):
            result_cmd = output_json['cmd']
        else:
            result_cmd = full_cmd

        # 7. DETERMINE BINARY FK (OPTIONAL)
        # Only set if hook reports cmd AND we can find the binary
        machine = Machine.current()
        binary_id = None
        if output_json and output_json.get('cmd'):
            binary_id = find_binary_for_cmd(output_json['cmd'], machine.id)
        # If not found or not reported, leave binary_id=None

        # 8. OUTPUT FILE TRACKING (HANDLED BY CALLER)
        # run_hook() does not hash or record output files itself.
        # ArchiveResult.run() hashes the files in output_dir and creates
        # ArchiveFile records after run_hook() returns.

        # 9. CLEANUP
        # Delete empty logs (keep non-empty for debugging)
        if stdout_file.exists() and stdout_file.stat().st_size == 0:
            stdout_file.unlink()
        if stderr_file.exists() and stderr_file.stat().st_size == 0:
            stderr_file.unlink()

        # Delete ALL .pid files on success
        if returncode == 0:
            for pf in output_dir.glob('*.pid'):
                pf.unlink(missing_ok=True)

        # 10. RETURN RESULT
        return HookResult(
            returncode=returncode,
            stdout=stdout,
            stderr=stderr,
            output_json=output_json,
            start_ts=start_ts.isoformat(),
            end_ts=end_ts.isoformat(),
            cmd=result_cmd,
            binary_id=binary_id,
            hook=str(script),
        )

    except Exception as e:
        duration_ms = int((time.time() - start_time) * 1000)
        return HookResult(
            returncode=-1,
            stdout='',
            stderr=f'Failed to run hook: {type(e).__name__}: {e}',
            output_json=None,
            start_ts=start_ts.isoformat(),
            end_ts=datetime.now(timezone.utc).isoformat(),
            cmd=full_cmd,
            binary_id=None,
            hook=str(script),
        )
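
The kwargs-to-CLI conversion in step 2 is easiest to check in isolation. The sketch below lifts that loop into a standalone function; the name `build_cli_args` is hypothetical, and the logic is copied from the plan's run_hook() body.

```python
import json
from typing import Any, List

def build_cli_args(**kwargs: Any) -> List[str]:
    """Convert keyword arguments into --key=value style CLI arguments."""
    cli_args: List[str] = []
    for key, value in kwargs.items():
        if key.startswith('_'):
            continue  # underscore-prefixed kwargs are never passed to the hook
        arg_key = f'--{key.replace("_", "-")}'
        if isinstance(value, bool):
            if value:
                cli_args.append(arg_key)  # True becomes a bare flag, False is dropped
        elif value is not None and value != '':
            if isinstance(value, (dict, list)):
                cli_args.append(f'{arg_key}={json.dumps(value)}')
            else:
                str_value = str(value).strip()
                if str_value:
                    cli_args.append(f'{arg_key}={str_value}')
    return cli_args

print(build_cli_args(url='https://example.com', snapshot_id='abc123', verbose=True, _internal=1))
# ['--url=https://example.com', '--snapshot-id=abc123', '--verbose']
```

Note that `False`, `None`, and empty strings all produce no argument at all, so a hook cannot tell "explicitly disabled" apart from "not provided".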

Phase 5: Update ArchiveResult.run()

Location: `archivebox/core/models.py`

def run(self):
    """
    Execute this ArchiveResult's extractor and update status.

    For foreground hooks: Waits for completion and updates immediately
    For background hooks: Returns immediately, leaves status='started'
    """
    from django.utils import timezone
    from archivebox.hooks import BUILTIN_PLUGINS_DIR, USER_PLUGINS_DIR, run_hook
    import dateutil.parser

    config_objects = [self.snapshot.crawl, self.snapshot] if self.snapshot.crawl else [self.snapshot]

    # Find hook for this extractor
    hook = None
    for base_dir in (BUILTIN_PLUGINS_DIR, USER_PLUGINS_DIR):
        if not base_dir.exists():
            continue
        matches = list(base_dir.glob(f'*/on_Snapshot__{self.extractor}.*'))
        if matches:
            hook = matches[0]
            break

    if not hook:
        self.status = self.StatusChoices.FAILED
        self.output_str = f'No hook found for: {self.extractor}'
        self.retry_at = None
        self.save()
        return

    # Use plugin directory name instead of extractor name
    plugin_name = hook.parent.name
    extractor_dir = Path(self.snapshot.output_dir) / plugin_name

    # Run the hook
    result = run_hook(
        hook,
        output_dir=extractor_dir,
        config_objects=config_objects,
        url=self.snapshot.url,
        snapshot_id=str(self.snapshot.id),
    )

    # BACKGROUND HOOK - still running
    if result is None:
        self.status = self.StatusChoices.STARTED
        self.start_ts = timezone.now()
        self.pwd = str(extractor_dir)
        self.save()
        return

    # FOREGROUND HOOK - process result
    if result['output_json']:
        # Hook emitted JSON output
        output_json = result['output_json']

        # Determine status
        status = output_json.get('status', 'failed')
        status_map = {
            'succeeded': self.StatusChoices.SUCCEEDED,
            'failed': self.StatusChoices.FAILED,
            'skipped': self.StatusChoices.SKIPPED,
        }
        self.status = status_map.get(status, self.StatusChoices.FAILED)

        # Set output fields
        self.output_str = output_json.get('output', '')
        if 'output_data' in output_json:
            self.output_data = output_json['output_data']
    else:
        # No JSON output - determine status from exit code
        self.status = (self.StatusChoices.SUCCEEDED if result['returncode'] == 0
                      else self.StatusChoices.FAILED)
        self.output_str = result['stdout'][:1024] or result['stderr'][:1024]

    # Set timestamps (from runner)
    self.start_ts = dateutil.parser.parse(result['start_ts'])
    self.end_ts = dateutil.parser.parse(result['end_ts'])

    # Set command and binary (from runner)
    self.cmd = json.dumps(result['cmd'])
    if result['binary_id']:
        self.binary_id = result['binary_id']

    # Metadata
    self.pwd = str(extractor_dir)
    self.retry_at = None

    self.save()

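    # HASH OUTPUT FILES AND CREATE ArchiveFile RECORDS
    # Deduplication is left to the filesystem; only hashes and stats are recorded here
    if extractor_dir.exists():
        import hashlib
        import mimetypes

        for path in sorted(extractor_dir.rglob('*')):
            # Exclude infrastructure files
            if not path.is_file() or path.name in ('stdout.log', 'stderr.log') or path.suffix == '.pid':
                continue
            ArchiveFile.objects.update_or_create(
                archiveresult=self,
                relative_path=str(path.relative_to(extractor_dir)),
                defaults={
                    # Reads the whole file into memory; stream in chunks for large outputs
                    'hash': hashlib.sha256(path.read_bytes()).hexdigest(),
                    'size': path.stat().st_size,
                    'mime_type': mimetypes.guess_type(path.name)[0] or '',
                },
            )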

    # Clean up empty output directory (no real files after excluding logs/pids)
    if extractor_dir.exists():
        try:
            # Check if only infrastructure files remain
            remaining_files = [
                f for f in extractor_dir.rglob('*')
                if f.is_file() and f.name not in ('stdout.log', 'stderr.log', 'hook.pid', 'listener.pid')
            ]
            if not remaining_files:
                # Remove infrastructure files
                for pf in extractor_dir.glob('*.log'):
                    pf.unlink(missing_ok=True)
                for pf in extractor_dir.glob('*.pid'):
                    pf.unlink(missing_ok=True)
                # Try to remove directory if empty
                if not any(extractor_dir.iterdir()):
                    extractor_dir.rmdir()
        except (OSError, RuntimeError):
            pass

    # Queue discovered URLs, trigger indexing, etc.
    self._queue_urls_for_crawl(extractor_dir)

    if self.status == self.StatusChoices.SUCCEEDED:
        # Update snapshot title if this is title extractor
        extractor_name = get_extractor_name(self.extractor)
        if extractor_name == 'title':
            self._update_snapshot_title(extractor_dir)

        # Trigger search indexing
        self.trigger_search_indexing()
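
The status precedence used in run() (hook JSON first, exit code as fallback) can be summarized as a pure function. This is a sketch only: `resolve_status` is a hypothetical name, and it returns plain strings rather than StatusChoices members.

```python
from typing import Optional, Tuple

STATUS_VALUES = ('succeeded', 'failed', 'skipped')

def resolve_status(output_json: Optional[dict], returncode: int,
                   stdout: str = '', stderr: str = '') -> Tuple[str, str]:
    """Return (status, output_str) using the same precedence as run()."""
    if output_json:
        status = output_json.get('status', 'failed')
        if status not in STATUS_VALUES:
            status = 'failed'  # unknown statuses are treated as failures
        return status, output_json.get('output', '')
    # No JSON output: fall back to the exit code and truncated logs
    status = 'succeeded' if returncode == 0 else 'failed'
    return status, (stdout[:1024] or stderr[:1024])
```

One consequence worth noting: a hook that exits non-zero but prints a `succeeded` JSON line is still recorded as succeeded, because the JSON takes priority over the exit code.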

Phase 6: Background Hook Finalization

Helper Functions

Location: archivebox/core/models.py or new archivebox/core/background_hooks.py

def find_background_hooks(snapshot) -> List['ArchiveResult']:
    """
    Find all ArchiveResults that are background hooks still running.

    Args:
        snapshot: Snapshot instance

    Returns:
        List of ArchiveResults with status='started'
    """
    return list(snapshot.archiveresult_set.filter(
        status=ArchiveResult.StatusChoices.STARTED
    ))


def check_background_hook_completed(archiveresult: 'ArchiveResult') -> bool:
    """
    Check if background hook process has exited.

    Args:
        archiveresult: ArchiveResult instance

    Returns:
        True if completed (process exited), False if still running
    """
    extractor_dir = Path(archiveresult.pwd)
    pid_file = extractor_dir / 'hook.pid'

    if not pid_file.exists():
        return True  # No PID file = completed or failed to start

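    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return True  # Unreadable or invalid PID file
    try:
        os.kill(pid, 0)  # Signal 0 = check if process exists without signalling it
    except ProcessLookupError:
        return True  # Process has exited
    except PermissionError:
        return False  # Process exists but is owned by another user
    return False  # Still running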


def finalize_background_hook(archiveresult: 'ArchiveResult') -> None:
    """
    Collect final results from completed background hook.

    Runner calculates all stats - hook just emits status/output/output_data.

    Args:
        archiveresult: ArchiveResult instance to finalize
    """
    from django.utils import timezone
    from machine.models import Machine
    import dateutil.parser

    extractor_dir = Path(archiveresult.pwd)
    stdout_file = extractor_dir / 'stdout.log'
    stderr_file = extractor_dir / 'stderr.log'

    # Read logs
    stdout = stdout_file.read_text() if stdout_file.exists() else ''
    stderr = stderr_file.read_text() if stderr_file.exists() else ''

    # Parse JSON output
    output_json = parse_hook_output_json(stdout)

    # Determine status
    if output_json:
        status_str = output_json.get('status', 'failed')
        status_map = {
            'succeeded': ArchiveResult.StatusChoices.SUCCEEDED,
            'failed': ArchiveResult.StatusChoices.FAILED,
            'skipped': ArchiveResult.StatusChoices.SKIPPED,
        }
        status = status_map.get(status_str, ArchiveResult.StatusChoices.FAILED)
        output_str = output_json.get('output', '')
        output_data = output_json.get('output_data')

        # Get cmd from hook (for binary FK)
        cmd = output_json.get('cmd')
    else:
        # No JSON output = failed
        status = ArchiveResult.StatusChoices.FAILED
        output_str = stderr[:1024] if stderr else 'No output'
        output_data = None
        cmd = None

    # Get binary FK from hook's reported cmd (if any)
    binary_id = None
    if cmd:
        machine = Machine.current()
        binary_id = find_binary_for_cmd(cmd, machine.id)

    # Update ArchiveResult
    archiveresult.status = status
    archiveresult.end_ts = timezone.now()
    archiveresult.output_str = output_str
    if output_data:
        archiveresult.output_data = output_data
    archiveresult.retry_at = None

    if binary_id:
        archiveresult.binary_id = binary_id

    archiveresult.save()

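    # HASH OUTPUT FILES AND CREATE ArchiveFile RECORDS
    # Deduplication is left to the filesystem; only hashes and stats are recorded here
    if extractor_dir.exists():
        import hashlib
        import mimetypes

        for path in sorted(extractor_dir.rglob('*')):
            # Exclude infrastructure files
            if not path.is_file() or path.name in ('stdout.log', 'stderr.log') or path.suffix == '.pid':
                continue
            ArchiveFile.objects.update_or_create(
                archiveresult=archiveresult,
                relative_path=str(path.relative_to(extractor_dir)),
                defaults={
                    # Reads the whole file into memory; stream in chunks for large outputs
                    'hash': hashlib.sha256(path.read_bytes()).hexdigest(),
                    'size': path.stat().st_size,
                    'mime_type': mimetypes.guess_type(path.name)[0] or '',
                },
            )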

    # Cleanup
    for pf in extractor_dir.glob('*.pid'):
        pf.unlink(missing_ok=True)
    if stdout_file.exists() and stdout_file.stat().st_size == 0:
        stdout_file.unlink()
    if stderr_file.exists() and stderr_file.stat().st_size == 0:
        stderr_file.unlink()
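
The PID-based completion check relies on signal 0 semantics, which have two edge cases worth seeing in isolation. The sketch below is POSIX-only, and `pid_is_running` is a hypothetical name; it assumes the caller has already read the PID from hook.pid.

```python
import os
import subprocess
import sys

def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs error checking without sending a signal
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but is owned by another user
    return True

# Spawn a short-lived child and reap it so its PID is released
child = subprocess.Popen([sys.executable, '-c', 'pass'])
child.wait()
```

A child that has exited but has not been reaped by its parent remains a zombie and still counts as existing for signal 0. If the process that launched a background hook stays alive and never calls `wait()` on it, a check like this may keep reporting the hook as running, so the poller may need to reap children or inspect process state as well.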

Update SnapshotMachine

Location: archivebox/core/statemachines.py

class SnapshotMachine(StateMachine, strict_states=True):
    # ... existing states ...

    def is_finished(self) -> bool:
        """
        Check if snapshot archiving is complete.

        A snapshot is finished when:
        1. No pending archiveresults remain (queued/started foreground hooks)
        2. All background hooks have completed
        """
        # Check if any pending archiveresults exist
        if self.snapshot.pending_archiveresults().exists():
            return False

        # Check and finalize background hooks
        background_hooks = find_background_hooks(self.snapshot)
        for bg_hook in background_hooks:
            if not check_background_hook_completed(bg_hook):
                return False  # Still running

            # Completed - finalize it
            finalize_background_hook(bg_hook)

        # All done
        return True

Phase 6b: ArchiveResult Properties for Output Stats

Since output stats are no longer stored as fields, we expose them via properties that query ArchiveFile records:

# archivebox/core/models.py

class ArchiveResult(models.Model):
    # ... existing fields ...

    @property
    def output_files(self):
        """
        Get all ArchiveFile records created by this extractor.

        Returns:
            QuerySet of ArchiveFile objects
        """
        # Reverse FK from ArchiveFile.archiveresult (related_name='files')
        return self.files.all()

    @property
    def output_file_count(self) -> int:
        """Count of output files."""
        return self.output_files.count()

    @property
    def total_output_size(self) -> int:
        """
        Total size in bytes of all output files.

        Returns:
            Sum of ArchiveFile.size for this extractor's files
        """
        from django.db.models import Sum

        result = self.output_files.aggregate(total=Sum('size'))
        return result['total'] or 0

    @property
    def output_mimetypes(self) -> str:
        """
        CSV of mimetypes ordered by size descending.

        Returns:
            String like "text/html,image/png,application/json"
        """
        from django.db.models import Sum

        # Group by mimetype and sum sizes
        files = self.output_files.values('mime_type').annotate(
            total_size=Sum('size')
        ).order_by('-total_size')

        # Build CSV, skipping files whose mimetype is unknown (blank)
        mimes = [f['mime_type'] for f in files if f['mime_type']]
        return ','.join(mimes)

    @property
    def output_summary(self) -> dict:
        """
        Summary statistics for output files.

        Returns:
            Dict with file count, total size, and mimetype breakdown
        """
        from django.db.models import Sum, Count

        files = self.output_files.values('mime_type').annotate(
            count=Count('id'),
            total_size=Sum('size')
        ).order_by('-total_size')

        return {
            'file_count': self.output_file_count,
            'total_size': self.total_output_size,
            'by_mimetype': list(files),
        }

    def _get_plugin_name(self) -> str:
        """
        Get plugin directory name from extractor.

        Returns:
            Plugin name (e.g., 'wget', 'singlefile')
        """
        # This assumes pwd is set to extractor_dir during run()
        if self.pwd:
            return Path(self.pwd).name
        # Fallback: use extractor number to find plugin
        # (implementation depends on how extractor names map to plugins)
        return self.extractor

Query Examples:

# Get all files for this extractor
files = archiveresult.output_files.all()

# Get total size
size = archiveresult.total_output_size

# Get mimetype breakdown
summary = archiveresult.output_summary
# {
#   'file_count': 42,
#   'total_size': 1048576,
#   'by_mimetype': [
#     {'mime_type': 'text/html', 'count': 5, 'total_size': 524288},
#     {'mime_type': 'image/png', 'count': 30, 'total_size': 409600},
#     ...
#   ]
# }

# Admin display
print(f"{archiveresult.output_mimetypes}")  # "text/html,image/png,text/css"

Performance Considerations:

Properties execute queries on access - cache results if needed
The ForeignKey index on ArchiveFile.archiveresult keeps these per-result queries fast
For admin list views, use select_related() and prefetch_related()
Consider adding cached_property for expensive calculations
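
The caching suggestion can be illustrated without a database. The sketch below uses `functools.cached_property` from the standard library (Django ships an equivalent in `django.utils.functional`); `OutputStats` is a hypothetical stand-in for ArchiveResult that counts how often the expensive calculation runs.

```python
from functools import cached_property

class OutputStats:
    """Stand-in for ArchiveResult that counts how often the expensive query runs."""

    def __init__(self, sizes):
        self._sizes = list(sizes)
        self.query_count = 0

    def _run_query(self) -> int:
        self.query_count += 1  # in the real model this would be a database aggregate
        return sum(self._sizes)

    @cached_property
    def total_output_size(self) -> int:
        # Computed on first access, then stored on the instance
        return self._run_query()

stats = OutputStats([524288, 409600])
first = stats.total_output_size
second = stats.total_output_size  # served from the cache, no second query
```

The cached value lives as long as the instance, so it would go stale if files were recorded after the first access (for example, when a background hook is finalized later). Deleting the attribute (`del stats.total_output_size`) resets the cache.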

Phase 7: Rename Background Hooks

Files to rename:

# Use .bg. suffix (not __background)
mv archivebox/plugins/consolelog/on_Snapshot__21_consolelog.js \
   archivebox/plugins/consolelog/on_Snapshot__21_consolelog.bg.js

mv archivebox/plugins/ssl/on_Snapshot__23_ssl.js \
   archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js

mv archivebox/plugins/responses/on_Snapshot__24_responses.js \
   archivebox/plugins/responses/on_Snapshot__24_responses.bg.js

Update hook content to emit proper JSON:

Each hook should emit:

console.log(JSON.stringify({
    type: 'ArchiveResult',
    status: 'succeeded',  // or 'failed' or 'skipped'
    output: 'Captured 15 console messages',  // human-readable summary
    output_data: {  // optional structured metadata
        // ... specific to each hook
    }
}));

Phase 8: Update Existing Hooks

Update all hooks to emit proper JSON format

Example: favicon hook

# Before
print(f'Favicon saved ({size} bytes)')
print(f'OUTPUT={OUTPUT_FILE}')
print(f'STATUS=succeeded')

# After
result = {
    'type': 'ArchiveResult',
    'status': 'succeeded',
    'output': f'Favicon saved ({size} bytes)',
    'output_data': {
        'size': size,
        'format': 'ico'
    }
}
print(json.dumps(result))

Example: wget hook with explicit cmd

# After wget completes
cat <<EOF
{"type": "ArchiveResult", "status": "succeeded", "output": "Downloaded index.html", "cmd": ["wget", "-p", "-k", "$URL"]}
EOF

Testing Strategy

1. Unit Tests

# tests/test_background_hooks.py

def test_background_hook_detection():
    """Test .bg. suffix detection"""
    assert is_background_hook(Path('on_Snapshot__21_test.bg.js'))
    assert not is_background_hook(Path('on_Snapshot__21_test.js'))

def test_find_binary_by_abspath():
    """Test binary matching by absolute path"""
    machine = Machine.current()
    binary = InstalledBinary.objects.create(
        name='wget',
        abspath='/usr/bin/wget',
        machine=machine
    )

    cmd = ['/usr/bin/wget', '-p', 'url']
    assert find_binary_for_cmd(cmd, machine.id) == str(binary.id)

def test_find_binary_by_name():
    """Test binary matching by name fallback"""
    machine = Machine.current()
    binary = InstalledBinary.objects.create(
        name='wget',
        abspath='/usr/local/bin/wget',
        machine=machine
    )

    cmd = ['wget', '-p', 'url']
    assert find_binary_for_cmd(cmd, machine.id) == str(binary.id)

def test_parse_hook_json():
    """Test JSON parsing from stdout"""
    stdout = '''
    Some log output
    {"type": "ArchiveResult", "status": "succeeded", "output": "test"}
    More output
    '''
    result = parse_hook_output_json(stdout)
    assert result['status'] == 'succeeded'
    assert result['output'] == 'test'
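
The detection test above calls `is_background_hook`, which the plan does not define elsewhere. A minimal sketch, assuming it simply mirrors the `'.bg.' in script.name` check used inside run_hook():

```python
from pathlib import Path

def is_background_hook(script: Path) -> bool:
    """Return True if the hook filename contains the .bg. marker (e.g. name.bg.js)."""
    return '.bg.' in script.name

print(is_background_hook(Path('on_Snapshot__21_consolelog.bg.js')))  # True
print(is_background_hook(Path('on_Snapshot__21_consolelog.js')))     # False
```

Only the filename is inspected, so a parent directory whose name happens to contain `.bg.` does not affect the result.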

2. Integration Tests

def test_foreground_hook_execution(snapshot):
    """Test foreground hook runs and returns results"""
    ar = ArchiveResult.objects.create(
        snapshot=snapshot,
        extractor='11_favicon',
        status=ArchiveResult.StatusChoices.QUEUED
    )

    ar.run()
    ar.refresh_from_db()

    assert ar.status in [
        ArchiveResult.StatusChoices.SUCCEEDED,
        ArchiveResult.StatusChoices.FAILED
    ]
    assert ar.start_ts is not None
    assert ar.end_ts is not None
    assert ar.total_output_size >= 0

def test_background_hook_execution(snapshot):
    """Test background hook starts but doesn't block"""
    ar = ArchiveResult.objects.create(
        snapshot=snapshot,
        extractor='21_consolelog',
        status=ArchiveResult.StatusChoices.QUEUED
    )

    start = time.time()
    ar.run()
    duration = time.time() - start

    ar.refresh_from_db()

    # Should return quickly (< 5 seconds)
    assert duration < 5
    # Should be in 'started' state
    assert ar.status == ArchiveResult.StatusChoices.STARTED
    # PID file should exist
    assert (Path(ar.pwd) / 'hook.pid').exists()

def test_background_hook_finalization(snapshot):
    """Test background hook finalization after completion"""
    # Start background hook
    ar = ArchiveResult.objects.create(
        snapshot=snapshot,
        extractor='21_consolelog',
        status=ArchiveResult.StatusChoices.STARTED,
        pwd='/path/to/output'
    )

    # Simulate completion (hook writes output and exits)
    # ...

    # Finalize
    finalize_background_hook(ar)
    ar.refresh_from_db()

    assert ar.status == ArchiveResult.StatusChoices.SUCCEEDED
    assert ar.end_ts is not None
    assert ar.total_output_size > 0

Migration Path

Step 1: Create migration

cd archivebox
python manage.py makemigrations core --name archiveresult_background_hooks

Step 2: Update run_hook()

Add background hook detection
Add log file capture
Return results so the caller can hash output files into ArchiveFile records
Add binary FK lookup

Step 3: Update ArchiveResult.run()

Handle None result for background hooks
Update field names (output → output_str, add output_data)
Set binary FK

Step 4: Add finalization helpers

find_background_hooks()
check_background_hook_completed()
finalize_background_hook()

Step 5: Update SnapshotMachine.is_finished()

Check for background hooks
Finalize completed ones

Step 6: Rename hooks

Rename 3 background hooks with .bg. suffix

Step 7: Update hook outputs

Update all hooks to emit JSON format
Remove manual timestamp/status calculation

Step 8: Test

Unit tests
Integration tests
Manual testing with real snapshots

Success Criteria

✅ Background hooks start immediately without blocking other extractors
✅ Background hooks are finalized after completion with full results
✅ All output stats calculated by runner, not hooks
✅ Binary FK optional and only set when determinable
✅ Clean separation between output_str (human) and output_data (machine)
✅ Empty log files deleted, non-empty logs kept for debugging
✅ PID files cleaned up after completion
✅ No plugin-specific code in core (generic polling mechanism)

Future Enhancements

1. Timeout for orphaned background hooks

If a background hook runs longer than MAX_LIFETIME after all foreground hooks complete, force kill it.
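
One way the decision could be expressed, as a sketch only: `should_force_kill` is a hypothetical name, and the 10-minute `MAX_LIFETIME` is an assumed value, since the plan does not specify one.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_LIFETIME = timedelta(minutes=10)  # assumed value; the plan does not specify one

def should_force_kill(foreground_done_at: Optional[datetime], now: datetime,
                      max_lifetime: timedelta = MAX_LIFETIME) -> bool:
    """Return True once a background hook has outlived the grace period."""
    if foreground_done_at is None:
        return False  # foreground hooks are still running, so keep waiting
    return now - foreground_done_at > max_lifetime

done = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(should_force_kill(done, done + timedelta(minutes=11)))  # True
```

Keeping the decision separate from the actual kill makes it testable with fixed timestamps; the caller would then signal the PID stored in hook.pid.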

2. Progress reporting

Background hooks could write progress to a file that gets polled:

fs.writeFileSync('progress.txt', '50%');

3. Multiple results per hook

If needed in future, extend to support multiple JSON outputs by collecting all {type: 'ArchiveResult'} lines.
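
A sketch of what that extension might look like, assuming the same line-per-object convention as the single-result parser; `parse_all_hook_results` is a hypothetical name.

```python
import json
from typing import List

def parse_all_hook_results(stdout: str) -> List[dict]:
    """Collect every {type: 'ArchiveResult'} JSON object found in hook stdout, in order."""
    results = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain log output
        # Ignore non-object JSON and objects of other types
        if isinstance(data, dict) and data.get('type') == 'ArchiveResult':
            results.append(data)
    return results
```

Returning the results in emission order would let the caller decide whether the first, the last, or all of them should be recorded.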

4. Dependency tracking

Store all binaries used by a hook (not just primary), useful for hooks that chain multiple tools.
