# Claude Code Development Guide for ArchiveBox
## Quick Start

```bash
# Set up dev environment (always use uv, never pip directly)
uv sync --dev --all-extras

# Run tests as a non-root user (required - ArchiveBox refuses to run as root)
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/ -v'
```
## Development Environment Setup

### Prerequisites

- Python 3.11+ (3.13 recommended)
- `uv` package manager
- A non-root user for running tests (e.g., `testuser`)

### Install Dependencies

```bash
uv sync --dev --all-extras  # Always use uv, never pip directly
```

### Activate Virtual Environment

```bash
source .venv/bin/activate
```
## Common Gotchas

### File Permissions

New files created by root need their permissions fixed so `testuser` can read them:

```bash
chmod 644 archivebox/tests/test_*.py
```

### DATA_DIR Environment Variable

ArchiveBox commands must run inside a data directory. Tests use temp directories - the `run_archivebox()` helper sets `DATA_DIR` automatically.
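The helper described above can be sketched roughly as follows. This is an illustrative approximation, not the actual implementation - the function name `run_archivebox()` comes from the tests, but its body and the extractor env vars shown here are assumptions:

```python
# Hypothetical sketch of the run_archivebox() test helper: run
# `python -m archivebox <args>` via subprocess with DATA_DIR pointed
# at a temp directory (the real helper lives in the test suite).
import os
import subprocess
import sys
import tempfile

def run_archivebox(data_dir: str, args: list[str], timeout: int = 60) -> subprocess.CompletedProcess:
    """Run an archivebox command inside data_dir with DATA_DIR set."""
    env = dict(os.environ)
    env['DATA_DIR'] = data_dir      # ArchiveBox reads its data dir from here
    env['SAVE_TITLE'] = 'False'     # extractors disabled for speed (assumed)
    return subprocess.run(
        [sys.executable, '-m', 'archivebox', *args],
        cwd=data_dir,               # commands must run inside the data dir
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )

with tempfile.TemporaryDirectory() as work_dir:
    result = run_archivebox(work_dir, ['version'])
```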
## Code Style Guidelines

### Naming Conventions for Grep-ability

Use consistent naming for everything to enable easy grep-ability and logical grouping.

Principle: use the fewest unique names possible. If you must create a new unique name, make it grep and group well.

Examples:

```python
# Filesystem migration methods - all start with fs_
def fs_migration_needed() -> bool: ...
def fs_migrate() -> None: ...
def _fs_migrate_from_0_7_0_to_0_8_0() -> None: ...
def _fs_migrate_from_0_8_0_to_0_9_0() -> None: ...
def _fs_next_version(current: str) -> str: ...

# Logging methods - ALL must start with log_ or _log
def log_migration_start(snapshot_id: str) -> None: ...
def _log_error(message: str) -> None: ...
def log_validation_result(ok: bool, msg: str) -> None: ...
```
Rules:

- Group related functions with common prefixes
- Use a `_` prefix for internal/private helpers within the same family
- ALL logging-related methods MUST start with `log_` or `_log`
- Search for all migration functions: `grep -r "def.*fs_.*(" archivebox/`
- Search for all logging functions: `grep -r "def.*log_.*(" archivebox/`
### Minimize Unique Names and Data Structures

Do not invent new data structures, variable names, or keys if you can avoid it. Reuse existing field names and data structures exactly, to keep the total number of unique names and structures in the codebase to an absolute minimum.

Example - GOOD:

```python
# Binary has an overrides field
binary = Binary(overrides={'TIMEOUT': '60s'})

# Binary reuses the same field name and structure
class Binary(models.Model):
    overrides = models.JSONField(default=dict)  # Same name, same structure
```

Example - BAD:

```python
# Don't invent new names like custom_bin_cmds, binary_overrides, etc.
class Binary(models.Model):
    custom_bin_cmds = models.JSONField(default=dict)  # ❌ New unique name
```

Principle: if you're storing the same conceptual data (e.g., overrides), use the same field name across all models and keep the internal structure identical. This makes the codebase predictable and reduces cognitive load.
## Testing

### CRITICAL: Never Run as Root

ArchiveBox has a root check that prevents it from running as the root user. All ArchiveBox commands (including tests) must run as a non-root user inside a data directory:

```bash
# Run all migration tests
sudo -u testuser bash -c 'source /path/to/.venv/bin/activate && python -m pytest archivebox/tests/test_migrations_*.py -v'

# Run a specific test file
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -v'

# Run a single test
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_fresh.py::TestFreshInstall::test_init_creates_database -xvs'
```
### Test File Structure

```
archivebox/tests/
├── test_migrations_helpers.py    # Schemas, seeding functions, verification helpers
├── test_migrations_fresh.py      # Fresh install tests
├── test_migrations_04_to_09.py   # 0.4.x → 0.9.x migration tests
├── test_migrations_07_to_09.py   # 0.7.x → 0.9.x migration tests
└── test_migrations_08_to_09.py   # 0.8.x → 0.9.x migration tests
```
### Test Writing Standards

#### NO MOCKS - Real Tests Only

Tests must exercise real code paths:

- Create real SQLite databases with version-specific schemas
- Seed them with realistic test data
- Run actual `python -m archivebox` commands via subprocess
- Query SQLite directly to verify results

If something is hard to test: modify the implementation to make it easier to test, or fix the underlying issue. Never mock, skip, simulate, or exit early from a test because you can't get something working inside the test.
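The "no mocks" workflow above can be sketched end to end. The table and column names below are illustrative stand-ins, not the exact ArchiveBox schema:

```python
# Sketch of the real-database approach: build an actual SQLite file,
# seed it, and verify by querying SQLite directly - no mocks anywhere.
import os
import sqlite3
import tempfile

work_dir = tempfile.mkdtemp()
db_path = os.path.join(work_dir, 'index.sqlite3')

# Create a real database with a version-specific schema and seed it
conn = sqlite3.connect(db_path)
conn.execute('CREATE TABLE core_snapshot (id INTEGER PRIMARY KEY, url TEXT)')
conn.execute("INSERT INTO core_snapshot (url) VALUES ('https://example.com')")
conn.commit()
conn.close()

# ... a real test would now run `python -m archivebox init` via subprocess ...

# Verify results by querying SQLite directly
conn = sqlite3.connect(db_path)
count = conn.execute('SELECT COUNT(*) FROM core_snapshot').fetchone()[0]
conn.close()
print(count)  # → 1
```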
#### NO SKIPS

Never use `@skip`, `skipTest`, or `pytest.mark.skip`. Every test must run. If a test is difficult, fix the code or the test environment - don't disable the test.
#### Strict Assertions

- The `init` command must return exit code 0 exactly (not "in `[0, 1]`")
- Verify ALL data is preserved, not just "at least one" record
- Use exact counts (`==`), not loose bounds (`>=`)
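The difference between loose and strict assertions can be shown with a tiny stand-in for the `verify_snapshot_count()` helper (the real one lives in `test_migrations_helpers.py`; this body is assumed):

```python
# Illustrative strict-assertion helper: exact count, not a lower bound.
def verify_snapshot_count(actual: int, expected: int) -> tuple[bool, str]:
    # BAD:  actual >= 1         - passes even when most data was lost
    # GOOD: actual == expected  - exact count, catches any loss
    ok = actual == expected
    return ok, f"expected {expected} snapshots, found {actual}"

ok, msg = verify_snapshot_count(3, 3)
assert ok, msg
```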
#### Example Test Pattern

```python
def test_migration_preserves_snapshots(self):
    """Migration should preserve all snapshots."""
    result = run_archivebox(self.work_dir, ['init'], timeout=45)
    self.assertEqual(result.returncode, 0, f"Init failed: {result.stderr}")
    ok, msg = verify_snapshot_count(self.db_path, expected_count)
    self.assertTrue(ok, msg)
```
### Testing Gotchas

#### Extractors Disabled for Speed

Tests disable all extractors via environment variables for faster execution:

```python
env['SAVE_TITLE'] = 'False'
env['SAVE_FAVICON'] = 'False'
# ... etc
```

#### Timeout Settings

Use appropriate timeouts for migration tests (45s for `init`, 60s default).
## Database Migrations

### Generate and Apply Migrations

```bash
# Generate migrations (run from the archivebox subdirectory)
cd archivebox
./manage.py makemigrations

# Apply migrations to a test database
cd data/
archivebox init
```
### Schema Versions

- 0.4.x: First Django version. Tags stored as a comma-separated string, no ArchiveResult model
- 0.7.x: Tag model with M2M, ArchiveResult model, AutoField PKs
- 0.8.x: Crawl/Seed models, UUID PKs, status fields, depth/retry_at
- 0.9.x: Seed model removed, seed_id FK removed from Crawl
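Those schema differences make versions detectable by inspecting which tables exist. A rough sketch - the detection heuristics here are assumptions based on the list above, not ArchiveBox's actual logic:

```python
# Illustrative version detection: infer the schema era from sqlite_master.
import os
import sqlite3
import tempfile

def detect_schema_version(db_path: str) -> str:
    conn = sqlite3.connect(db_path)
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    conn.close()
    if 'core_archiveresult' not in tables:
        return '0.4.x'   # no ArchiveResult model yet
    if 'core_crawl' not in tables:
        return '0.7.x'   # ArchiveResult exists, Crawl does not
    if 'core_seed' in tables:
        return '0.8.x'   # Crawl and Seed both present
    return '0.9.x'       # Seed removed

# Demo against a minimal 0.7.x-style database
db = os.path.join(tempfile.mkdtemp(), 'index.sqlite3')
conn = sqlite3.connect(db)
conn.execute('CREATE TABLE core_snapshot (id INTEGER PRIMARY KEY)')
conn.execute('CREATE TABLE core_archiveresult (id INTEGER PRIMARY KEY)')
conn.commit()
conn.close()
print(detect_schema_version(db))  # → 0.7.x
```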
### Testing a Migration Path

1. Create a SQLite DB with the source version's schema (from `test_migrations_helpers.py`)
2. Seed it with realistic test data using `seed_0_X_data()`
3. Run `archivebox init` to trigger migrations
4. Verify data preservation with the `verify_*` functions
5. Test that CLI commands work post-migration (`status`, `list`, `add`, etc.)
### Squashed Migrations

When testing 0.8.x (dev branch), you must record ALL replaced migrations:

```python
# The squashed migration replaces these - all must be recorded
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
# ... all 52 migrations from 0023-0074 ...
('core', '0023_new_schema'),  # Also record the squashed migration itself
```
### Migration Strategy

- Squashed migrations for clean installs
- Individual migrations recorded for upgrades from the dev branch
- The `replaces` attribute in squashed migrations lists what they replace
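"Recording" a migration means inserting a row into the `django_migrations` table. A simplified sketch of what ends up in the table - in real life Django's migration executor does this for you based on the `replaces` attribute:

```python
# Simplified illustration: when a squashed migration applies, every
# migration it replaces is marked applied in django_migrations too.
import sqlite3
from datetime import datetime, timezone

replaced = [
    ('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
    ('core', '0024_auto_20240513_1143'),
    # ... all replaced migrations ...
]
squashed = ('core', '0023_new_schema')

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE django_migrations ('
             'id INTEGER PRIMARY KEY, app TEXT, name TEXT, applied TIMESTAMP)')
now = datetime.now(timezone.utc).isoformat()
for app, name in [*replaced, squashed]:
    conn.execute('INSERT INTO django_migrations (app, name, applied) VALUES (?, ?, ?)',
                 (app, name, now))
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM django_migrations WHERE app='core'").fetchone()[0]
print(count)  # → 3
```

Migration-path tests can query this table directly (see Debugging Tips below) to confirm every replaced migration was recorded.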
### Migration Gotchas

#### Circular FK References in Schemas

SQLite handles circular references with `IF NOT EXISTS`. Creation order matters less than in other databases.
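A minimal demonstration of that tolerance (table and column names are illustrative): each table references the other, and SQLite accepts the schema regardless of creation order because foreign-key targets are not resolved at `CREATE TABLE` time.

```python
# Two tables with circular FK references - SQLite creates both happily.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE IF NOT EXISTS core_crawl (
        id INTEGER PRIMARY KEY,
        root_snapshot_id INTEGER REFERENCES core_snapshot(id)
    );
    CREATE TABLE IF NOT EXISTS core_snapshot (
        id INTEGER PRIMARY KEY,
        crawl_id INTEGER REFERENCES core_crawl(id)
    );
''')
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))  # → ['core_crawl', 'core_snapshot']
```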
## Plugin System Architecture

### Plugin Dependency Rules

Like other plugins, Chrome plugins ARE NOT ALLOWED TO DEPEND ON ARCHIVEBOX OR DJANGO. However, they are allowed to depend on two shared files ONLY:

- `archivebox/plugins/chrome/chrome_utils.js` ← source-of-truth API for all basic Chrome ops
- `archivebox/plugins/chrome/tests/chrome_test_utils.py` ← use for your tests; do not implement launching/killing/pid files/CDP/etc. in Python, just extend this file as needed
### Chrome-Dependent Plugins

Many plugins depend on Chrome/Chromium via CDP (Chrome DevTools Protocol). When checking for script name references or debugging Chrome-related issues, check these plugins.

Main puppeteer-based Chrome installer + launcher plugin:

- `chrome` - Core Chrome integration (CDP, launch, navigation)

Metadata extraction using `chrome/chrome_utils.js` / CDP:

- `dns` - DNS resolution info
- `ssl` - SSL certificate info
- `headers` - HTTP response headers
- `redirects` - Capture redirect chains
- `staticfile` - Direct file downloads (e.g. if the URL itself is a .png, .exe, .zip, etc.)
- `responses` - Capture network responses
- `consolelog` - Capture console.log output
- `title` - Extract page title
- `accessibility` - Extract accessibility tree
- `seo` - Extract SEO metadata

Extensions installed using `chrome/chrome_utils.js` / controlled using CDP:

- `ublock` - uBlock Origin ad blocking
- `istilldontcareaboutcookies` - Cookie banner dismissal
- `twocaptcha` - 2captcha CAPTCHA solver integration

Page-alteration plugins that prepare content for archiving:

- `modalcloser` - Modal dialog dismissal
- `infiniscroll` - Infinite scroll handler

Main extractor outputs:

- `dom` - DOM snapshot extraction
- `pdf` - Generate PDF snapshots
- `screenshot` - Generate screenshots
- `singlefile` - SingleFile archival; can be single-file-cli that launches Chrome, or the SingleFile extension running inside Chrome

Crawl URL parsers (post-process dom.html, singlefile.html, staticfile, responses, headers, etc. for URLs to re-emit as new queued Snapshots during recursive crawling):

- `parse_dom_outlinks` - Extract outlinks from the DOM (special: uses CDP to directly query the browser)
- `parse_html_urls` - Parse URLs from HTML (doesn't use Chrome directly, just reads dom.html)
- `parse_jsonl_urls` - Parse URLs from JSONL (doesn't use Chrome directly, just reads dom.html)
- `parse_netscape_urls` - Parse Netscape bookmark format (doesn't use Chrome directly, just reads dom.html)
### Finding Chrome-Dependent Plugins

```bash
# Find all files containing "chrom" (case-insensitive)
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d: -f1 | sort -u

# Or get just the plugin names
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d/ -f3 | sort -u
```

Note: the list above may not be complete. Always run the grep commands when checking for Chrome-related script references or debugging Chrome integration issues.
## Architecture Notes

### Crawl Model (0.9.x)

- A Crawl groups multiple Snapshots from a single `add` command
- Each `add` creates one Crawl with one or more Snapshots
- The Seed model was removed - crawls now store URLs directly
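The relationship can be sketched with plain dataclasses. This is a hypothetical shape, not the real Django models - the field names and the `add()` body are assumptions purely to illustrate the one-Crawl-per-`add` rule:

```python
# Hypothetical sketch of the 0.9.x relationship: one Crawl per `add`
# call, storing URLs directly now that the Seed model is gone.
from dataclasses import dataclass, field

@dataclass
class Crawl:
    urls: str                                       # URLs stored on the Crawl itself (no Seed FK)
    snapshot_ids: list[str] = field(default_factory=list)

def add(urls: list[str]) -> Crawl:
    """Each `add` creates one Crawl with one Snapshot per URL."""
    crawl = Crawl(urls='\n'.join(urls))
    crawl.snapshot_ids = [f'snapshot-{i}' for i, _ in enumerate(urls)]
    return crawl

crawl = add(['https://example.com', 'https://example.org'])
print(len(crawl.snapshot_ids))  # → 2
```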
## Code Coverage

### Overview

Coverage tracking is enabled for passive collection across all contexts:

- Unit tests (pytest)
- Integration tests
- Dev server (manual testing)
- CLI usage

Coverage data accumulates in the `.coverage` file and can be viewed/analyzed to find dead code.

### Install Coverage Tools

```bash
uv sync --dev  # Installs pytest-cov and coverage
```
### Running with Coverage

#### Unit Tests

```bash
# Run tests with coverage
pytest --cov=archivebox --cov-report=term archivebox/tests/

# Or run a specific test file
pytest --cov=archivebox --cov-report=term archivebox/tests/test_migrations_08_to_09.py
```

#### Dev Server with Coverage

```bash
# Start the dev server with coverage tracking
coverage run --parallel-mode -m archivebox server

# Or CLI commands
coverage run --parallel-mode -m archivebox init
coverage run --parallel-mode -m archivebox add https://example.com
```
#### Manual Testing (Always-On)

To enable coverage during ALL Python executions (passive tracking):

```bash
# Option 1: Use the coverage run wrapper
coverage run --parallel-mode -m archivebox [command]

# Option 2: Set an environment variable (tracks everything)
export COVERAGE_PROCESS_START=pyproject.toml
# Now all Python processes will track coverage
archivebox server
archivebox add https://example.com
```
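One caveat, per coverage.py's subprocess-measurement documentation: setting `COVERAGE_PROCESS_START` alone is not enough - each new interpreter must also call `coverage.process_startup()` early, typically from a `sitecustomize.py` or a `.pth` file on `sys.path`:

```python
# sitecustomize.py (sketch) - runs automatically at interpreter startup.
import coverage

# No-op unless COVERAGE_PROCESS_START is set in the environment,
# so it is safe to leave in place for normal (untracked) runs.
coverage.process_startup()
```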
### Viewing Coverage

#### Text Report (Quick View)

```bash
# Combine all parallel coverage data
coverage combine

# View summary
coverage report

# View a detailed report with missing lines
coverage report --show-missing

# View a specific file
coverage report --include="archivebox/core/models.py" --show-missing
```

#### JSON Report (LLM-Friendly)

```bash
# Generate the JSON report
coverage json

# View the JSON
cat coverage.json | jq '.files | keys'  # List all files

# Find files with low coverage
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 50) | "\(.key): \(.value.summary.percent_covered)%"'

# Find completely uncovered files (dead code candidates)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# Get missing lines for a specific file
cat coverage.json | jq '.files["archivebox/core/models.py"].missing_lines'
```
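If `jq` isn't available, the same queries are easy in Python. The dict below is a stand-in for a real `coverage.json` (the file paths and percentages are invented for the demo); in real use you would `json.load(open('coverage.json'))` instead:

```python
# Python equivalent of the jq queries above, against a stand-in payload.
import json

coverage_data = {
    'files': {
        'archivebox/core/models.py':      {'summary': {'percent_covered': 82.5}},
        'archivebox/legacy/old_parser.py': {'summary': {'percent_covered': 0.0}},
    }
}
# Real use: coverage_data = json.load(open('coverage.json'))

# Files under 50% coverage
low = {path: info['summary']['percent_covered']
       for path, info in coverage_data['files'].items()
       if info['summary']['percent_covered'] < 50}

# Completely uncovered files (dead code candidates)
dead = [path for path, pct in low.items() if pct == 0]
print(dead)  # → ['archivebox/legacy/old_parser.py']
```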
#### HTML Report (Visual)

```bash
# Generate an interactive HTML report
coverage html

# Open it in a browser
open htmlcov/index.html
```
### Isolated Runs

To measure coverage for a specific scenario:

```bash
# 1. Reset coverage data
coverage erase

# 2. Run your isolated test/scenario
pytest --cov=archivebox archivebox/tests/test_migrations_fresh.py
# OR
coverage run --parallel-mode -m archivebox add https://example.com

# 3. View the results
coverage combine
coverage report --show-missing

# 4. Optionally export for analysis
coverage json
```
### Finding Dead Code

```bash
# 1. Run comprehensive tests + manual testing to build up coverage data
pytest --cov=archivebox archivebox/tests/
coverage run --parallel-mode -m archivebox server  # Use the app manually
coverage combine

# 2. Find files with 0% coverage (strong dead code candidates)
coverage json
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# 3. Find files with <10% coverage (likely dead code)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 10) | "\(.key): \(.value.summary.percent_covered)%"' | sort -t: -k2 -n

# 4. Generate a detailed report for analysis
coverage report --show-missing > coverage_report.txt
```
### Tips

- Parallel mode (`--parallel-mode`): allows multiple processes to track coverage simultaneously without conflicts
- Combine: always run `coverage combine` before viewing reports to merge parallel data
- Reset: use `coverage erase` to start fresh for isolated measurements
- Branch coverage: enabled by default - tracks whether both branches of an if/else are executed
- Exclude patterns: the config in `pyproject.toml` excludes tests, migrations, and type stubs
## Debugging Tips

### Check Migration State

```bash
sqlite3 /path/to/index.sqlite3 "SELECT app, name FROM django_migrations WHERE app='core' ORDER BY id;"
```

### Check Table Schema

```bash
sqlite3 /path/to/index.sqlite3 "PRAGMA table_info(core_snapshot);"
```

### Verbose Test Output

```bash
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -xvs 2>&1 | head -200'
```

### Kill Zombie Chrome Processes

```bash
./bin/kill_chrome.sh
```