ArchiveBox/CLAUDE.md

Claude Code Development Guide for ArchiveBox

Quick Start

# Set up dev environment (always use uv, never pip directly)
uv sync --dev --all-extras

# Run tests as non-root user (required - ArchiveBox always refuses to run as root)
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/ -v'

Development Environment Setup

Prerequisites

  • Python 3.11+ (3.13 recommended)
  • uv package manager
  • A non-root user for running tests (e.g., testuser)

Install Dependencies

uv sync --dev --all-extras  # Always use uv, never pip directly

Activate Virtual Environment

source .venv/bin/activate

Common Gotchas

File Permissions

New files created by root need their permissions fixed so testuser can read them:

chmod 644 archivebox/tests/test_*.py

DATA_DIR Environment Variable

ArchiveBox commands must run inside a data directory. Tests use temp directories - the run_archivebox() helper sets DATA_DIR automatically.
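A minimal sketch of what such a helper could look like (hypothetical - the real run_archivebox() may differ in signature, env handling, and defaults):

```python
import os
import subprocess
import sys

def run_archivebox(data_dir, args, timeout=60):
    """Run `python -m archivebox <args>` inside data_dir with DATA_DIR set.

    Illustrative sketch only - the real helper used by the tests may differ.
    """
    env = dict(os.environ)
    env['DATA_DIR'] = data_dir          # ArchiveBox resolves its index relative to DATA_DIR
    return subprocess.run(
        [sys.executable, '-m', 'archivebox', *args],
        cwd=data_dir,                   # commands must run inside the data directory
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```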

Code Style Guidelines

Naming Conventions for Grep-ability

Use consistent naming for everything to enable easy grep-ability and logical grouping:

Principle: use the fewest unique names possible. If you must create a new unique name, make sure it greps and groups well.

Examples:

# Filesystem migration methods - all start with fs_
def fs_migration_needed() -> bool: ...
def fs_migrate() -> None: ...
def _fs_migrate_from_0_7_0_to_0_8_0() -> None: ...
def _fs_migrate_from_0_8_0_to_0_9_0() -> None: ...
def _fs_next_version(current: str) -> str: ...

# Logging methods - ALL must start with log_ or _log
def log_migration_start(snapshot_id: str) -> None: ...
def _log_error(message: str) -> None: ...
def log_validation_result(ok: bool, msg: str) -> None: ...

Rules:

  • Group related functions with common prefixes
  • Use _ prefix for internal/private helpers within the same family
  • ALL logging-related methods MUST start with log_ or _log
  • Search for all migration functions: grep -r "def.*fs_.*(" archivebox/
  • Search for all logging: grep -r "def.*log_.*(" archivebox/

Minimize Unique Names and Data Structures

Do not invent new data structures, variable names, or keys when an existing one fits. Reuse existing field names and data structures exactly, keeping the total number of unique names and structures in the codebase to an absolute minimum.

Example - GOOD:

# Binary has overrides field
binary = Binary(overrides={'TIMEOUT': '60s'})

# Binary reuses the same field name and structure
class Binary(models.Model):
    overrides = models.JSONField(default=dict)  # Same name, same structure

Example - BAD:

# Don't invent new names like custom_bin_cmds, binary_overrides, etc.
class Binary(models.Model):
    custom_bin_cmds = models.JSONField(default=dict)  # ❌ New unique name

Principle: If you're storing the same conceptual data (e.g., overrides), use the same field name across all models and keep the internal structure identical. This makes the codebase predictable and reduces cognitive load.

Testing

CRITICAL: Never Run as Root

ArchiveBox has a root check that prevents running as root user. All ArchiveBox commands (including tests) must run as non-root user inside a data directory:

# Run all migration tests
sudo -u testuser bash -c 'source /path/to/.venv/bin/activate && python -m pytest archivebox/tests/test_migrations_*.py -v'

# Run specific test file
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -v'

# Run single test
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_fresh.py::TestFreshInstall::test_init_creates_database -xvs'

Test File Structure

archivebox/tests/
├── test_migrations_helpers.py    # Schemas, seeding functions, verification helpers
├── test_migrations_fresh.py      # Fresh install tests
├── test_migrations_04_to_09.py   # 0.4.x → 0.9.x migration tests
├── test_migrations_07_to_09.py   # 0.7.x → 0.9.x migration tests
└── test_migrations_08_to_09.py   # 0.8.x → 0.9.x migration tests

Test Writing Standards

NO MOCKS - Real Tests Only

Tests must exercise real code paths:

  • Create real SQLite databases with version-specific schemas
  • Seed with realistic test data
  • Run actual python -m archivebox commands via subprocess
  • Query SQLite directly to verify results

If something is hard to test: Modify the implementation to make it easier to test, or fix the underlying issue. Never mock, skip, simulate, or exit early from a test because you can't get something working inside the test.

NO SKIPS

Never use @skip, skipTest, or pytest.mark.skip. Every test must run. If a test is difficult, fix the code or test environment - don't disable the test.

Strict Assertions

  • init command must return exit code 0 exactly (don't accept either of [0, 1])
  • Verify ALL data is preserved, not just "at least one"
  • Use exact counts (==) not loose bounds (>=)

Example Test Pattern

def test_migration_preserves_snapshots(self):
    """Migration should preserve all snapshots."""
    result = run_archivebox(self.work_dir, ['init'], timeout=45)
    self.assertEqual(result.returncode, 0, f"Init failed: {result.stderr}")

    ok, msg = verify_snapshot_count(self.db_path, expected_count)
    self.assertTrue(ok, msg)
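The verify_* helpers follow the "query SQLite directly" rule above. A self-contained sketch of the pattern (the real verify_snapshot_count in test_migrations_helpers.py is more thorough):

```python
import sqlite3

def verify_snapshot_count(db_path, expected_count):
    """Query SQLite directly and compare against an exact count (== not >=).

    Illustrative sketch of the verify_* helper style, not the real implementation.
    """
    with sqlite3.connect(db_path) as conn:
        (actual,) = conn.execute('SELECT COUNT(*) FROM core_snapshot').fetchone()
    ok = actual == expected_count
    return ok, f'expected {expected_count} snapshots, found {actual}'
```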

Testing Gotchas

Extractors Disabled for Speed

Tests disable all extractors via environment variables for faster execution:

env['SAVE_TITLE'] = 'False'
env['SAVE_FAVICON'] = 'False'
# ... etc
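The full pattern might look like this (hedged sketch: the exact SAVE_* flag list varies by ArchiveBox version; these names are illustrative of the pattern):

```python
# Illustrative flag list - check your version's config for the complete set
EXTRACTOR_FLAGS = [
    'SAVE_TITLE', 'SAVE_FAVICON', 'SAVE_WGET', 'SAVE_PDF',
    'SAVE_SCREENSHOT', 'SAVE_DOM', 'SAVE_SINGLEFILE', 'SAVE_MEDIA',
]

def disable_extractors(env):
    """Set every extractor flag to 'False' so tests skip slow extraction work."""
    env.update({flag: 'False' for flag in EXTRACTOR_FLAGS})
    return env
```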

Timeout Settings

Use appropriate timeouts for migration tests (45s for init, 60s default).

Database Migrations

Generate and Apply Migrations

# Generate migrations (run from archivebox subdirectory)
cd archivebox
./manage.py makemigrations

# Apply migrations to test database
cd data/
archivebox init

Schema Versions

  • 0.4.x: First Django version. Tags as comma-separated string, no ArchiveResult model
  • 0.7.x: Tag model with M2M, ArchiveResult model, AutoField PKs
  • 0.8.x: Crawl/Seed models, UUID PKs, status fields, depth/retry_at
  • 0.9.x: Seed model removed, seed_id FK removed from Crawl

Testing a Migration Path

  1. Create SQLite DB with source version schema (from test_migrations_helpers.py)
  2. Seed with realistic test data using seed_0_X_data()
  3. Run archivebox init to trigger migrations
  4. Verify data preservation with verify_* functions
  5. Test CLI commands work post-migration (status, list, add, etc.)
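Steps 1-2 can be sketched as follows (illustrative only - the real DDL and seed_0_X_data() helpers in test_migrations_helpers.py are far more complete):

```python
import sqlite3

# Simplified stand-in schema, NOT the real 0.7.x DDL
SCHEMA_0_7 = '''
CREATE TABLE IF NOT EXISTS core_snapshot (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT UNIQUE NOT NULL,
    timestamp TEXT,
    title TEXT
);
'''

def seed_0_7_data(db_path, urls):
    """Create a source-version schema and seed it with realistic rows."""
    with sqlite3.connect(db_path) as conn:
        conn.executescript(SCHEMA_0_7)
        conn.executemany(
            'INSERT INTO core_snapshot (url, timestamp) VALUES (?, ?)',
            [(url, str(1700000000 + i)) for i, url in enumerate(urls)],
        )
```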

Squashed Migrations

When testing 0.8.x (dev branch), you must record ALL replaced migrations:

# The squashed migration replaces these - all must be recorded
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
# ... all 52 migrations from 0023-0074 ...
('core', '0023_new_schema'),  # Also record the squashed migration itself
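One illustrative way to record replaced migrations is to insert rows directly into django_migrations (hypothetical helper - the real tests may record them differently):

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative subset - the real list covers all replaced migrations
REPLACED = [
    ('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
    ('core', '0024_auto_20240513_1143'),
]

def record_migrations(db_path, migrations):
    """Mark migrations as applied by inserting rows into django_migrations."""
    applied_at = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(db_path) as conn:
        conn.executemany(
            'INSERT INTO django_migrations (app, name, applied) VALUES (?, ?, ?)',
            [(app, name, applied_at) for app, name in migrations],
        )
```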

Migration Strategy

  • Squashed migrations for clean installs
  • Individual migrations recorded for upgrades from dev branch
  • replaces attribute in squashed migrations lists what they replace

Migration Gotchas

Circular FK References in Schemas

SQLite does not validate foreign key targets at CREATE TABLE time, so tables with circular FK references can be created in any order; using IF NOT EXISTS keeps schema creation idempotent. Order matters less than in other DBs.

Plugin System Architecture

Plugin Dependency Rules

Like other plugins, chrome plugins ARE NOT ALLOWED TO DEPEND ON ARCHIVEBOX OR DJANGO. However, they are allowed to depend on two shared files ONLY:

  • archivebox/plugins/chrome/chrome_utils.js ← source of truth API for all basic chrome ops
  • archivebox/plugins/chrome/tests/chrome_test_utils.py ← use this for your tests; do not re-implement launching/killing/PID files/CDP/etc. in Python - extend this file as needed.

Chrome-Dependent Plugins

Many plugins depend on Chrome/Chromium via CDP (Chrome DevTools Protocol). When checking for script name references or debugging Chrome-related issues, check these plugins:

Main puppeteer-based chrome installer + launcher plugin:

  • chrome - Core Chrome integration (CDP, launch, navigation)

Metadata extraction using chrome/chrome_utils.js / CDP:

  • dns - DNS resolution info
  • ssl - SSL certificate info
  • headers - HTTP response headers
  • redirects - Capture redirect chains
  • staticfile - Direct file downloads (e.g. if the url itself is a .png, .exe, .zip, etc.)
  • responses - Capture network responses
  • consolelog - Capture console.log output
  • title - Extract page title
  • accessibility - Extract accessibility tree
  • seo - Extract SEO metadata

Extensions installed using chrome/chrome_utils.js / controlled using CDP:

  • ublock - uBlock Origin ad blocking
  • istilldontcareaboutcookies - Cookie banner dismissal
  • twocaptcha - 2captcha CAPTCHA solver integration

Page-alteration plugins to prepare the content for archiving:

  • modalcloser - Modal dialog dismissal
  • infiniscroll - Infinite scroll handler

Main Extractor Outputs:

  • dom - DOM snapshot extraction
  • pdf - Generate PDF snapshots
  • screenshot - Generate screenshots
  • singlefile - SingleFile archival, can be single-file-cli that launches chrome, or singlefile extension running inside chrome

Crawl URL parsers (post-process dom.html, singlefile.html, staticfile, responses, headers, etc. for URLs to re-emit as new queued Snapshots during recursive crawling):

  • parse_dom_outlinks - Extract outlinks from DOM (special, uses CDP to directly query browser)
  • parse_html_urls - Parse URLs from HTML (doesn't use chrome directly, just reads dom.html)
  • parse_jsonl_urls - Parse URLs from JSONL (doesn't use chrome directly, just reads JSONL output files)
  • parse_netscape_urls - Parse Netscape bookmark format (doesn't use chrome directly, just reads bookmark export files)

Finding Chrome-Dependent Plugins

# Find all files containing "chrom" (case-insensitive)
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d: -f1 | sort -u

# Or get just the plugin names
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d/ -f3 | sort -u

Note: This list may not be complete. Always run the grep command above when checking for Chrome-related script references or debugging Chrome integration issues.

Architecture Notes

Crawl Model (0.9.x)

  • Crawl groups multiple Snapshots from a single add command
  • Each add creates one Crawl with one or more Snapshots
  • Seed model was removed - crawls now store URLs directly
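The relationship can be pictured with plain dataclasses (conceptual sketch only, not the real Django models):

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    url: str

@dataclass
class Crawl:
    # URLs stored directly on the Crawl now that the Seed model is gone
    urls: str
    # one Crawl per `add` command, owning one or more Snapshots
    snapshots: list = field(default_factory=list)
```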

Code Coverage

Overview

Coverage tracking is enabled for passive collection across all contexts:

  • Unit tests (pytest)
  • Integration tests
  • Dev server (manual testing)
  • CLI usage

Coverage data accumulates in .coverage file and can be viewed/analyzed to find dead code.

Install Coverage Tools

uv sync --dev  # Installs pytest-cov and coverage

Running with Coverage

Unit Tests

# Run tests with coverage
pytest --cov=archivebox --cov-report=term archivebox/tests/

# Or run specific test file
pytest --cov=archivebox --cov-report=term archivebox/tests/test_migrations_08_to_09.py

Dev Server with Coverage

# Start dev server with coverage tracking
coverage run --parallel-mode -m archivebox server

# Or CLI commands
coverage run --parallel-mode -m archivebox init
coverage run --parallel-mode -m archivebox add https://example.com

Manual Testing (Always-On)

To enable coverage during ALL Python executions (passive tracking):

# Option 1: Use coverage run wrapper
coverage run --parallel-mode -m archivebox [command]

# Option 2: Set environment variable (tracks everything)
# Note: coverage.py also requires coverage.process_startup() to be called at
# interpreter start (e.g. via a sitecustomize.py or .pth file) for this to work
export COVERAGE_PROCESS_START=pyproject.toml
# Now all Python processes will track coverage
archivebox server
archivebox add https://example.com

Viewing Coverage

Text Report (Quick View)

# Combine all parallel coverage data
coverage combine

# View summary
coverage report

# View detailed report with missing lines
coverage report --show-missing

# View specific file
coverage report --include="archivebox/core/models.py" --show-missing

JSON Report (LLM-Friendly)

# Generate JSON report
coverage json

# View the JSON
cat coverage.json | jq '.files | keys'  # List all files

# Find files with low coverage
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 50) | "\(.key): \(.value.summary.percent_covered)%"'

# Find completely uncovered files (dead code candidates)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# Get missing lines for a specific file
cat coverage.json | jq '.files["archivebox/core/models.py"].missing_lines'

HTML Report (Visual)

# Generate interactive HTML report
coverage html

# Open in browser
open htmlcov/index.html

Isolated Runs

To measure coverage for specific scenarios:

# 1. Reset coverage data
coverage erase

# 2. Run your isolated test/scenario
pytest --cov=archivebox archivebox/tests/test_migrations_fresh.py
# OR
coverage run --parallel-mode -m archivebox add https://example.com

# 3. View results
coverage combine
coverage report --show-missing

# 4. Optionally export for analysis
coverage json

Finding Dead Code

# 1. Run comprehensive tests + manual testing to build coverage
pytest --cov=archivebox archivebox/tests/
coverage run --parallel-mode -m archivebox server  # Use the app manually
coverage combine

# 2. Find files with 0% coverage (strong dead code candidates)
coverage json
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# 3. Find files with <10% coverage (likely dead code)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 10) | "\(.key): \(.value.summary.percent_covered)%"' | sort -t: -k2 -n

# 4. Generate detailed report for analysis
coverage report --show-missing > coverage_report.txt

Tips

  • Parallel mode (--parallel-mode): Allows multiple processes to track coverage simultaneously without conflicts
  • Combine: Always run coverage combine before viewing reports to merge parallel data
  • Reset: Use coverage erase to start fresh for isolated measurements
  • Branch coverage: Enabled in this project's coverage config - tracks whether both branches of each if/else are executed
  • Exclude patterns: Config in pyproject.toml excludes tests, migrations, type stubs

Debugging Tips

Check Migration State

sqlite3 /path/to/index.sqlite3 "SELECT app, name FROM django_migrations WHERE app='core' ORDER BY id;"

Check Table Schema

sqlite3 /path/to/index.sqlite3 "PRAGMA table_info(core_snapshot);"
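The same check from Python, handy inside tests (table_columns is an illustrative helper, not part of ArchiveBox):

```python
import sqlite3

def table_columns(db_path, table):
    """Return the column names of a table via PRAGMA table_info."""
    with sqlite3.connect(db_path) as conn:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        return [row[1] for row in conn.execute(f'PRAGMA table_info({table})')]
```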

Verbose Test Output

sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -xvs 2>&1 | head -200'

Kill Zombie Chrome Processes

./bin/kill_chrome.sh