# Claude Code Development Guide for ArchiveBox

## Quick Start

```bash
# Set up dev environment (always use uv, never pip directly)
uv sync --dev --all-extras

# Run tests as a non-root user (required - ArchiveBox always refuses to run as root)
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/ -v'
```

## Development Environment Setup

### Prerequisites

- Python 3.11+ (3.13 recommended)
- uv package manager
- A non-root user for running tests (e.g., `testuser`)

### Install Dependencies

```bash
uv sync --dev --all-extras  # Always use uv, never pip directly
```

### Activate Virtual Environment

```bash
source .venv/bin/activate
```

### Common Gotchas

#### File Permissions

New files created by root need their permissions fixed so testuser can read them:

```bash
chmod 644 archivebox/tests/test_*.py
```

#### DATA_DIR Environment Variable

ArchiveBox commands must run inside a data directory. Tests use temp directories - the `run_archivebox()` helper sets `DATA_DIR` automatically.

## Code Style Guidelines

### Naming Conventions for Grep-ability

Use consistent naming for everything to enable easy grep-ability and logical grouping.

**Principle**: Use the fewest unique names possible. If you must create a new unique name, make sure it greps and groups well.

**Examples**:

```python
# Filesystem migration methods - all start with fs_
def fs_migration_needed() -> bool: ...
def fs_migrate() -> None: ...
def _fs_migrate_from_0_7_0_to_0_8_0() -> None: ...
def _fs_migrate_from_0_8_0_to_0_9_0() -> None: ...
def _fs_next_version(current: str) -> str: ...

# Logging methods - ALL must start with log_ or _log
def log_migration_start(snapshot_id: str) -> None: ...
def _log_error(message: str) -> None: ...
def log_validation_result(ok: bool, msg: str) -> None: ...
```

**Rules**:

- Group related functions with common prefixes
- Use a `_` prefix for internal/private helpers within the same family
- ALL logging-related methods MUST start with `log_` or `_log`
- Search for all migration functions: `grep -r "def.*fs_.*(" archivebox/`
- Search for all logging: `grep -r "def.*log_.*(" archivebox/`

### Minimize Unique Names and Data Structures

**Do not invent new data structures, variable names, or keys if you can avoid it.** Reuse existing field names and data structures exactly, to keep the total number of unique data structures and names in the codebase to an absolute minimum.

**Example - GOOD**:

```python
# Binary has an overrides field
binary = Binary(overrides={'TIMEOUT': '60s'})

# Binary reuses the same field name and structure
class Binary(models.Model):
    overrides = models.JSONField(default=dict)  # Same name, same structure
```

**Example - BAD**:

```python
# Don't invent new names like custom_bin_cmds, binary_overrides, etc.
class Binary(models.Model):
    custom_bin_cmds = models.JSONField(default=dict)  # ❌ New unique name
```

**Principle**: If you're storing the same conceptual data (e.g., `overrides`), use the same field name across all models and keep the internal structure identical. This makes the codebase predictable and reduces cognitive load.

## Testing

### CRITICAL: Never Run as Root

ArchiveBox has a root check that prevents running as the root user.
All ArchiveBox commands (including tests) must run as a non-root user inside a data directory:

```bash
# Run all migration tests
sudo -u testuser bash -c 'source /path/to/.venv/bin/activate && python -m pytest archivebox/tests/test_migrations_*.py -v'

# Run a specific test file
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -v'

# Run a single test
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_fresh.py::TestFreshInstall::test_init_creates_database -xvs'
```

### Test File Structure

```
archivebox/tests/
├── test_migrations_helpers.py   # Schemas, seeding functions, verification helpers
├── test_migrations_fresh.py     # Fresh install tests
├── test_migrations_04_to_09.py  # 0.4.x → 0.9.x migration tests
├── test_migrations_07_to_09.py  # 0.7.x → 0.9.x migration tests
└── test_migrations_08_to_09.py  # 0.8.x → 0.9.x migration tests
```

### Test Writing Standards

#### NO MOCKS - Real Tests Only

Tests must exercise real code paths:

- Create real SQLite databases with version-specific schemas
- Seed them with realistic test data
- Run actual `python -m archivebox` commands via subprocess
- Query SQLite directly to verify results

**If something is hard to test**: Modify the implementation to make it easier to test, or fix the underlying issue. Never mock, skip, simulate, or exit early from a test because you can't get something working inside the test.

#### NO SKIPS

Never use `@skip`, `skipTest`, or `pytest.mark.skip`. Every test must run. If a test is difficult, fix the code or the test environment - don't disable the test.
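The no-mocks rule works because helpers like `run_archivebox()` shell out to the real CLI via subprocess. A minimal sketch of what such a helper might look like (the actual signature in `test_migrations_helpers.py` may differ):

```python
import os
import subprocess
import sys


def run_archivebox(data_dir: str, args: list[str], timeout: int = 60) -> subprocess.CompletedProcess:
    """Run a real `python -m archivebox` command inside a data directory.

    Sketch only - the real helper in test_migrations_helpers.py may differ.
    Sets DATA_DIR and cwd so the command operates on the test collection,
    and captures output so tests can assert on returncode/stdout/stderr.
    """
    env = {**os.environ, 'DATA_DIR': data_dir}
    return subprocess.run(
        [sys.executable, '-m', 'archivebox', *args],
        cwd=data_dir,
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Because it returns a `subprocess.CompletedProcess`, tests can assert directly on `result.returncode` and `result.stderr`, exercising exactly the code path a real user hits.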
#### Strict Assertions

- The `init` command must return exit code 0 (not `[0, 1]`)
- Verify ALL data is preserved, not just "at least one" record
- Use exact counts (`==`), not loose bounds (`>=`)

### Example Test Pattern

```python
def test_migration_preserves_snapshots(self):
    """Migration should preserve all snapshots."""
    result = run_archivebox(self.work_dir, ['init'], timeout=45)
    self.assertEqual(result.returncode, 0, f"Init failed: {result.stderr}")

    ok, msg = verify_snapshot_count(self.db_path, expected_count)
    self.assertTrue(ok, msg)
```

### Testing Gotchas

#### Extractors Disabled for Speed

Tests disable all extractors via environment variables for faster execution:

```python
env['SAVE_TITLE'] = 'False'
env['SAVE_FAVICON'] = 'False'
# ... etc
```

#### Timeout Settings

Use appropriate timeouts for migration tests (45s for init, 60s default).

## Database Migrations

### Generate and Apply Migrations

```bash
# Generate migrations (run from the archivebox subdirectory)
cd archivebox
./manage.py makemigrations

# Apply migrations to a test database
cd data/
archivebox init
```

### Schema Versions

- **0.4.x**: First Django version. Tags stored as a comma-separated string, no ArchiveResult model
- **0.7.x**: Tag model with M2M, ArchiveResult model, AutoField PKs
- **0.8.x**: Crawl/Seed models, UUID PKs, status fields, depth/retry_at
- **0.9.x**: Seed model removed, seed_id FK removed from Crawl

### Testing a Migration Path

1. Create a SQLite DB with the source version's schema (from `test_migrations_helpers.py`)
2. Seed it with realistic test data using `seed_0_X_data()`
3. Run `archivebox init` to trigger migrations
4. Verify data preservation with the `verify_*` functions
5. Test that CLI commands work post-migration (`status`, `list`, `add`, etc.)
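Step 4 relies on `verify_*` helpers that query SQLite directly and assert exact counts, per the strict-assertion rules above. A hedged sketch of the shape such a helper might take (the real ones live in `test_migrations_helpers.py` and may differ):

```python
import sqlite3


def verify_snapshot_count(db_path: str, expected: int) -> tuple[bool, str]:
    """Check that EXACTLY `expected` snapshots survived a migration.

    Sketch of a verify_* helper - queries SQLite directly and uses ==
    (never >=) per the strict assertion rules. The core_snapshot table
    name matches the 0.7.x+ schemas described above.
    """
    conn = sqlite3.connect(db_path)
    try:
        (actual,) = conn.execute('SELECT COUNT(*) FROM core_snapshot').fetchone()
    finally:
        conn.close()
    if actual != expected:
        return False, f'expected exactly {expected} snapshots, found {actual}'
    return True, f'all {expected} snapshots present'
```

Returning an `(ok, msg)` tuple lets the test pattern above feed the message straight into `self.assertTrue(ok, msg)` for a readable failure.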
### Squashed Migrations

When testing 0.8.x (dev branch), you must record ALL replaced migrations:

```python
# The squashed migration replaces these - all must be recorded
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
# ... all 52 migrations from 0023-0074 ...
('core', '0023_new_schema'),  # Also record the squashed migration itself
```

### Migration Strategy

- Squashed migrations for clean installs
- Individual migrations recorded for upgrades from the dev branch
- The `replaces` attribute in squashed migrations lists what they replace

### Migration Gotchas

#### Circular FK References in Schemas

SQLite handles circular references with `IF NOT EXISTS`. Order matters less than in other DBs.

## Plugin System Architecture

### Plugin Dependency Rules

Like other plugins, chrome plugins **ARE NOT ALLOWED TO DEPEND ON ARCHIVEBOX OR DJANGO**. However, they are allowed to depend on two shared files ONLY:

- `archivebox/plugins/chrome/chrome_utils.js` ← source-of-truth API for all basic chrome ops
- `archivebox/plugins/chrome/tests/chrome_test_utils.py` ← use for your tests; do not re-implement launching/killing/pid files/CDP/etc. in Python, just extend this file as needed

### Chrome-Dependent Plugins

Many plugins depend on Chrome/Chromium via CDP (Chrome DevTools Protocol). When checking for script name references or debugging Chrome-related issues, check these plugins:

**Main puppeteer-based chrome installer + launcher plugin**:

- `chrome` - Core Chrome integration (CDP, launch, navigation)

**Metadata extraction using chrome/chrome_utils.js / CDP**:

- `dns` - DNS resolution info
- `ssl` - SSL certificate info
- `headers` - HTTP response headers
- `redirects` - Capture redirect chains
- `staticfile` - Direct file downloads (e.g. if the URL itself is a .png, .exe, .zip, etc.)
- `responses` - Capture network responses
- `consolelog` - Capture console.log output
- `title` - Extract page title
- `accessibility` - Extract accessibility tree
- `seo` - Extract SEO metadata

**Extensions installed using chrome/chrome_utils.js / controlled using CDP**:

- `ublock` - uBlock Origin ad blocking
- `istilldontcareaboutcookies` - Cookie banner dismissal
- `twocaptcha` - 2captcha CAPTCHA solver integration

**Page-alteration plugins to prepare the content for archiving**:

- `modalcloser` - Modal dialog dismissal
- `infiniscroll` - Infinite scroll handler

**Main Extractor Outputs**:

- `dom` - DOM snapshot extraction
- `pdf` - Generate PDF snapshots
- `screenshot` - Generate screenshots
- `singlefile` - SingleFile archival, either via single-file-cli launching chrome, or via the singlefile extension running inside chrome

**Crawl URL parsers** (post-process dom.html, singlefile.html, staticfile, responses, headers, etc. for URLs to re-emit as new queued Snapshots during recursive crawling):

- `parse_dom_outlinks` - Extract outlinks from the DOM (special: uses CDP to directly query the browser)
- `parse_html_urls` - Parse URLs from HTML (doesn't use chrome directly, just reads dom.html)
- `parse_jsonl_urls` - Parse URLs from JSONL (doesn't use chrome directly)
- `parse_netscape_urls` - Parse Netscape bookmark format (doesn't use chrome directly)

### Finding Chrome-Dependent Plugins

```bash
# Find all plugin hook files containing "chrom" (case-insensitive)
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d: -f1 | sort -u

# Or get just the plugin names
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d/ -f3 | sort -u
```

**Note**: This list may not be complete. Always run the grep command above when checking for Chrome-related script references or debugging Chrome integration issues.
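When grep isn't convenient (e.g. from inside a Python test), the same scan can be done in Python. A sketch assuming the `plugins/<name>/on_*.*` hook layout shown above; `find_chrome_plugins` is a hypothetical helper, not part of the codebase:

```python
import re
from pathlib import Path


def find_chrome_plugins(plugins_dir: str) -> set[str]:
    """Return names of plugins whose on_* hook files mention chrome/chromium.

    Hypothetical helper - mirrors the grep pipeline above by matching the
    substring "chrom" case-insensitively in plugins/<name>/on_*.* files.
    """
    pattern = re.compile('chrom', re.IGNORECASE)
    found = set()
    for hook in Path(plugins_dir).glob('*/on_*.*'):
        try:
            text = hook.read_text(errors='ignore')
        except OSError:
            continue  # unreadable hook file - skip it
        if pattern.search(text):
            found.add(hook.parent.name)
    return found
```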
## Architecture Notes

### Crawl Model (0.9.x)

- A Crawl groups multiple Snapshots from a single `add` command
- Each `add` creates one Crawl with one or more Snapshots
- The Seed model was removed - crawls now store URLs directly

## Code Coverage

### Overview

Coverage tracking is enabled for passive collection across all contexts:

- Unit tests (pytest)
- Integration tests
- Dev server (manual testing)
- CLI usage

Coverage data accumulates in the `.coverage` file and can be viewed/analyzed to find dead code.

### Install Coverage Tools

```bash
uv sync --dev  # Installs pytest-cov and coverage
```

### Running with Coverage

#### Unit Tests

```bash
# Run tests with coverage
pytest --cov=archivebox --cov-report=term archivebox/tests/

# Or run a specific test file
pytest --cov=archivebox --cov-report=term archivebox/tests/test_migrations_08_to_09.py
```

#### Dev Server with Coverage

```bash
# Start the dev server with coverage tracking
coverage run --parallel-mode -m archivebox server

# Or CLI commands
coverage run --parallel-mode -m archivebox init
coverage run --parallel-mode -m archivebox add https://example.com
```

#### Manual Testing (Always-On)

To enable coverage during ALL Python executions (passive tracking):

```bash
# Option 1: Use the coverage run wrapper
coverage run --parallel-mode -m archivebox [command]

# Option 2: Set an environment variable (tracks everything)
export COVERAGE_PROCESS_START=pyproject.toml

# Now all Python processes will track coverage
archivebox server
archivebox add https://example.com
```

### Viewing Coverage

#### Text Report (Quick View)

```bash
# Combine all parallel coverage data
coverage combine

# View a summary
coverage report

# View a detailed report with missing lines
coverage report --show-missing

# View a specific file
coverage report --include="archivebox/core/models.py" --show-missing
```

#### JSON Report (LLM-Friendly)

```bash
# Generate a JSON report
coverage json

# List all files in the report
cat coverage.json | jq '.files | keys'

# Find files with low coverage
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 50) | "\(.key): \(.value.summary.percent_covered)%"'

# Find completely uncovered files (dead code candidates)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# Get missing lines for a specific file
cat coverage.json | jq '.files["archivebox/core/models.py"].missing_lines'
```

#### HTML Report (Visual)

```bash
# Generate an interactive HTML report
coverage html

# Open it in a browser
open htmlcov/index.html
```

### Isolated Runs

To measure coverage for specific scenarios:

```bash
# 1. Reset coverage data
coverage erase

# 2. Run your isolated test/scenario
pytest --cov=archivebox archivebox/tests/test_migrations_fresh.py
# OR
coverage run --parallel-mode -m archivebox add https://example.com

# 3. View the results
coverage combine
coverage report --show-missing

# 4. Optionally export for analysis
coverage json
```

### Finding Dead Code

```bash
# 1. Run comprehensive tests + manual testing to build up coverage
pytest --cov=archivebox archivebox/tests/
coverage run --parallel-mode -m archivebox server  # Use the app manually
coverage combine

# 2. Find files with 0% coverage (strong dead code candidates)
coverage json
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# 3. Find files with <10% coverage (likely dead code)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 10) | "\(.key): \(.value.summary.percent_covered)%"' | sort -t: -k2 -n

# 4. Generate a detailed report for analysis
coverage report --show-missing > coverage_report.txt
```

### Tips

- **Parallel mode** (`--parallel-mode`): Allows multiple processes to track coverage simultaneously without conflicts
- **Combine**: Always run `coverage combine` before viewing reports, to merge parallel data
- **Reset**: Use `coverage erase` to start fresh for isolated measurements
- **Branch coverage**: Enabled by default - tracks whether both branches of an if/else are executed
- **Exclude patterns**: Config in `pyproject.toml` excludes tests, migrations, and type stubs

## Debugging Tips

### Check Migration State

```bash
sqlite3 /path/to/index.sqlite3 "SELECT app, name FROM django_migrations WHERE app='core' ORDER BY id;"
```

### Check Table Schema

```bash
sqlite3 /path/to/index.sqlite3 "PRAGMA table_info(core_snapshot);"
```

### Verbose Test Output

```bash
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -xvs 2>&1 | head -200'
```

### Kill Zombie Chrome Processes

```bash
./bin/kill_chrome.sh
```
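The jq one-liners in the Code Coverage section can also be replicated in Python for scripted analysis where jq isn't available. A sketch assuming the standard layout of `coverage json` output; `low_coverage_files` is a hypothetical helper, not part of the codebase:

```python
import json


def low_coverage_files(coverage_json_path: str, threshold: float = 50.0) -> list[tuple[str, float]]:
    """List (file, percent_covered) pairs below `threshold`, worst first.

    Hypothetical helper - reads the coverage.json produced by `coverage json`
    and mirrors the jq queries in the Code Coverage section above.
    """
    with open(coverage_json_path) as f:
        report = json.load(f)
    rows = [
        (name, info['summary']['percent_covered'])
        for name, info in report['files'].items()
        if info['summary']['percent_covered'] < threshold
    ]
    return sorted(rows, key=lambda row: row[1])
```

Calling it with `threshold=0.000001` approximates the "completely uncovered files" query, since 0%-covered files are the strongest dead-code candidates.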