actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage

This commit is contained in:
Nick Sweeting
2026-01-01 15:49:56 -08:00
parent 6fadcf5168
commit 876feac522
33 changed files with 825 additions and 333 deletions

CLAUDE.md

@@ -27,135 +27,17 @@ uv sync --dev --all-extras # Always use uv, never pip directly
source .venv/bin/activate
```
### Generate and Apply Migrations
```bash
# Generate migrations (run from archivebox subdirectory)
cd archivebox
./manage.py makemigrations
# Apply migrations to test database
cd data/
archivebox init
```
## Running Tests
### CRITICAL: Never Run as Root
ArchiveBox has a root check that prevents running as the root user. All ArchiveBox commands (including tests) must run as a non-root user inside a data directory:
```bash
# Run all migration tests
sudo -u testuser bash -c 'source /path/to/.venv/bin/activate && python -m pytest archivebox/tests/test_migrations_*.py -v'
# Run specific test file
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -v'
# Run single test
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_fresh.py::TestFreshInstall::test_init_creates_database -xvs'
```
### Test File Structure
```
archivebox/tests/
├── test_migrations_helpers.py # Schemas, seeding functions, verification helpers
├── test_migrations_fresh.py # Fresh install tests
├── test_migrations_04_to_09.py # 0.4.x → 0.9.x migration tests
├── test_migrations_07_to_09.py # 0.7.x → 0.9.x migration tests
└── test_migrations_08_to_09.py # 0.8.x → 0.9.x migration tests
```
## Test Writing Standards
### NO MOCKS - Real Tests Only
Tests must exercise real code paths:
- Create real SQLite databases with version-specific schemas
- Seed with realistic test data
- Run actual `python -m archivebox` commands via subprocess
- Query SQLite directly to verify results
**If something is hard to test**: Modify the implementation to make it easier to test, or fix the underlying issue. Never mock, skip, simulate, or exit early from a test because you can't get something working inside the test.
### NO SKIPS
Never use `@skip`, `skipTest`, or `pytest.mark.skip`. Every test must run. If a test is difficult, fix the code or test environment - don't disable the test.
### Strict Assertions
- `init` command must return exit code 0 (not `[0, 1]`)
- Verify ALL data is preserved, not just "at least one"
- Use exact counts (`==`) not loose bounds (`>=`)
### Example Test Pattern
```python
def test_migration_preserves_snapshots(self):
"""Migration should preserve all snapshots."""
result = run_archivebox(self.work_dir, ['init'], timeout=45)
self.assertEqual(result.returncode, 0, f"Init failed: {result.stderr}")
ok, msg = verify_snapshot_count(self.db_path, expected_count)
self.assertTrue(ok, msg)
```
## Migration Testing
### Schema Versions
- **0.4.x**: First Django version. Tags as comma-separated string, no ArchiveResult model
- **0.7.x**: Tag model with M2M, ArchiveResult model, AutoField PKs
- **0.8.x**: Crawl/Seed models, UUID PKs, status fields, depth/retry_at
- **0.9.x**: Seed model removed, seed_id FK removed from Crawl
### Testing a Migration Path
1. Create SQLite DB with source version schema (from `test_migrations_helpers.py`)
2. Seed with realistic test data using `seed_0_X_data()`
3. Run `archivebox init` to trigger migrations
4. Verify data preservation with `verify_*` functions
5. Test CLI commands work post-migration (`status`, `list`, `add`, etc.)
### Squashed Migrations
When testing 0.8.x (dev branch), you must record ALL replaced migrations:
```python
# The squashed migration replaces these - all must be recorded
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
# ... all 52 migrations from 0023-0074 ...
('core', '0023_new_schema'), # Also record the squashed migration itself
```
## Common Gotchas
### 1. File Permissions
New files created by root need permissions fixed for testuser:
```bash
chmod 644 archivebox/tests/test_*.py
```
### 2. DATA_DIR Environment Variable
ArchiveBox commands must run inside a data directory. Tests use temp directories - the `run_archivebox()` helper sets `DATA_DIR` automatically.
### 3. Extractors Disabled for Speed
Tests disable all extractors via environment variables for faster execution:
```python
env['SAVE_TITLE'] = 'False'
env['SAVE_FAVICON'] = 'False'
# ... etc
```
### 4. Timeout Settings
Use appropriate timeouts for migration tests (45s for init, 60s default).
### 5. Circular FK References in Schemas
SQLite does not validate foreign key targets at `CREATE TABLE` time, so schemas with circular FK references can be created in any order - creation order matters less than in other DBs.
## Architecture Notes
### Crawl Model (0.9.x)
- Crawl groups multiple Snapshots from a single `add` command
- Each `add` creates one Crawl with one or more Snapshots
- Seed model was removed - crawls now store URLs directly
### Migration Strategy
- Squashed migrations for clean installs
- Individual migrations recorded for upgrades from dev branch
- `replaces` attribute in squashed migrations lists what they replace
## Code Style Guidelines
### Naming Conventions for Grep-ability
@@ -207,6 +89,334 @@ class Binary(models.Model):
**Principle**: If you're storing the same conceptual data (e.g., `overrides`), use the same field name across all models and keep the internal structure identical. This makes the codebase predictable and reduces cognitive load.
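As a toy illustration of the principle (these model and field contents are hypothetical, not ArchiveBox's real models):

```python
from dataclasses import dataclass, field

# Hypothetical models: the same conceptual data ("overrides", a flat
# str -> str mapping) gets the same field name and shape on every model,
# so `grep -r overrides` finds every usage.
@dataclass
class Binary:
    name: str
    overrides: dict = field(default_factory=dict)

@dataclass
class Crawl:
    url: str
    overrides: dict = field(default_factory=dict)
```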
## Testing
### CRITICAL: Never Run as Root
ArchiveBox has a root check that prevents running as the root user. All ArchiveBox commands (including tests) must run as a non-root user inside a data directory:
```bash
# Run all migration tests
sudo -u testuser bash -c 'source /path/to/.venv/bin/activate && python -m pytest archivebox/tests/test_migrations_*.py -v'
# Run specific test file
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -v'
# Run single test
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_fresh.py::TestFreshInstall::test_init_creates_database -xvs'
```
### Test File Structure
```
archivebox/tests/
├── test_migrations_helpers.py # Schemas, seeding functions, verification helpers
├── test_migrations_fresh.py # Fresh install tests
├── test_migrations_04_to_09.py # 0.4.x → 0.9.x migration tests
├── test_migrations_07_to_09.py # 0.7.x → 0.9.x migration tests
└── test_migrations_08_to_09.py # 0.8.x → 0.9.x migration tests
```
### Test Writing Standards
#### NO MOCKS - Real Tests Only
Tests must exercise real code paths:
- Create real SQLite databases with version-specific schemas
- Seed with realistic test data
- Run actual `python -m archivebox` commands via subprocess
- Query SQLite directly to verify results
**If something is hard to test**: Modify the implementation to make it easier to test, or fix the underlying issue. Never mock, skip, simulate, or exit early from a test because you can't get something working inside the test.
#### NO SKIPS
Never use `@skip`, `skipTest`, or `pytest.mark.skip`. Every test must run. If a test is difficult, fix the code or test environment - don't disable the test.
#### Strict Assertions
- `init` command must return exit code 0 (not `[0, 1]`)
- Verify ALL data is preserved, not just "at least one"
- Use exact counts (`==`) not loose bounds (`>=`)
### Example Test Pattern
```python
def test_migration_preserves_snapshots(self):
"""Migration should preserve all snapshots."""
result = run_archivebox(self.work_dir, ['init'], timeout=45)
self.assertEqual(result.returncode, 0, f"Init failed: {result.stderr}")
ok, msg = verify_snapshot_count(self.db_path, expected_count)
self.assertTrue(ok, msg)
```
### Testing Gotchas
#### Extractors Disabled for Speed
Tests disable all extractors via environment variables for faster execution:
```python
env['SAVE_TITLE'] = 'False'
env['SAVE_FAVICON'] = 'False'
# ... etc
```
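A sketch of how a test helper might build such an environment (the exact `SAVE_*` flag list is an assumption drawn from ArchiveBox's config flags; check the real config for the full set):

```python
import os

# Hypothetical helper: build a subprocess environment with all extractors
# turned off so migration tests don't spend time archiving real content.
EXTRACTOR_FLAGS = [
    'SAVE_TITLE', 'SAVE_FAVICON', 'SAVE_WGET', 'SAVE_PDF',
    'SAVE_SCREENSHOT', 'SAVE_DOM', 'SAVE_SINGLEFILE',
]

def build_test_env(data_dir):
    env = os.environ.copy()
    env['DATA_DIR'] = data_dir
    for flag in EXTRACTOR_FLAGS:
        env[flag] = 'False'
    return env
```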
#### Timeout Settings
Use appropriate timeouts for migration tests (45s for init, 60s default).
## Database Migrations
### Generate and Apply Migrations
```bash
# Generate migrations (run from archivebox subdirectory)
cd archivebox
./manage.py makemigrations
# Apply migrations to test database
cd data/
archivebox init
```
### Schema Versions
- **0.4.x**: First Django version. Tags as comma-separated string, no ArchiveResult model
- **0.7.x**: Tag model with M2M, ArchiveResult model, AutoField PKs
- **0.8.x**: Crawl/Seed models, UUID PKs, status fields, depth/retry_at
- **0.9.x**: Seed model removed, seed_id FK removed from Crawl
### Testing a Migration Path
1. Create SQLite DB with source version schema (from `test_migrations_helpers.py`)
2. Seed with realistic test data using `seed_0_X_data()`
3. Run `archivebox init` to trigger migrations
4. Verify data preservation with `verify_*` functions
5. Test CLI commands work post-migration (`status`, `list`, `add`, etc.)
### Squashed Migrations
When testing 0.8.x (dev branch), you must record ALL replaced migrations:
```python
# The squashed migration replaces these - all must be recorded
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
# ... all 52 migrations from 0023-0074 ...
('core', '0023_new_schema'), # Also record the squashed migration itself
```
### Migration Strategy
- Squashed migrations for clean installs
- Individual migrations recorded for upgrades from dev branch
- `replaces` attribute in squashed migrations lists what they replace
### Migration Gotchas
#### Circular FK References in Schemas
SQLite does not validate foreign key targets at `CREATE TABLE` time, so schemas with circular FK references can be created in any order - creation order matters less than in other DBs.
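A quick demonstration with Python's built-in `sqlite3` (table and column names are illustrative, not ArchiveBox's real schema):

```python
import sqlite3

# SQLite does not check that a foreign key's target table exists when the
# referencing table is created, so circular schemas load in either order.
con = sqlite3.connect(':memory:')
con.executescript('''
    CREATE TABLE IF NOT EXISTS crawl (
        id INTEGER PRIMARY KEY,
        root_snapshot_id INTEGER REFERENCES snapshot(id)  -- snapshot not created yet
    );
    CREATE TABLE IF NOT EXISTS snapshot (
        id INTEGER PRIMARY KEY,
        crawl_id INTEGER REFERENCES crawl(id)
    );
''')
```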
## Plugin System Architecture
### Plugin Dependency Rules
Like other plugins, chrome plugins **ARE NOT ALLOWED TO DEPEND ON ARCHIVEBOX OR DJANGO**.
However, they are allowed to depend on two shared files ONLY:
- `archivebox/plugins/chrome/chrome_utils.js` ← source of truth API for all basic chrome ops
- `archivebox/plugins/chrome/tests/chrome_test_utils.py` ← use this for your tests; do not reimplement launching/killing/PID files/CDP/etc. in Python - just extend this file as needed.
### Chrome-Dependent Plugins
Many plugins depend on Chrome/Chromium via CDP (Chrome DevTools Protocol). When checking for script name references or debugging Chrome-related issues, check these plugins:
**Main puppeteer-based chrome installer + launcher plugin**:
- `chrome` - Core Chrome integration (CDP, launch, navigation)
**Metadata extraction using chrome/chrome_utils.js / CDP**:
- `dns` - DNS resolution info
- `ssl` - SSL certificate info
- `headers` - HTTP response headers
- `redirects` - Capture redirect chains
- `staticfile` - Direct file downloads (e.g. if the url itself is a .png, .exe, .zip, etc.)
- `responses` - Capture network responses
- `consolelog` - Capture console.log output
- `title` - Extract page title
- `accessibility` - Extract accessibility tree
- `seo` - Extract SEO metadata
**Extensions installed using chrome/chrome_utils.js / controlled using CDP**:
- `ublock` - uBlock Origin ad blocking
- `istilldontcareaboutcookies` - Cookie banner dismissal
- `twocaptcha` - 2captcha CAPTCHA solver integration
**Page-alteration plugins to prepare the content for archiving**:
- `modalcloser` - Modal dialog dismissal
- `infiniscroll` - Infinite scroll handler
**Main Extractor Outputs**:
- `dom` - DOM snapshot extraction
- `pdf` - Generate PDF snapshots
- `screenshot` - Generate screenshots
- `singlefile` - SingleFile archival; runs either via single-file-cli (which launches its own Chrome) or via the SingleFile extension inside an existing Chrome
**Crawl URL parsers** (post-process dom.html, singlefile.html, staticfile, responses, headers, etc. for URLs to re-emit as new queued Snapshots during recursive crawling):
- `parse_dom_outlinks` - Extract outlinks from DOM (special, uses CDP to directly query browser)
- `parse_html_urls` - Parse URLs from HTML (doesn't use chrome directly, just reads dom.html)
- `parse_jsonl_urls` - Parse URLs from JSONL (doesn't use chrome directly, just reads dom.html)
- `parse_netscape_urls` - Parse Netscape bookmark format (doesn't use chrome directly, just reads dom.html)
### Finding Chrome-Dependent Plugins
```bash
# Find all files containing "chrom" (case-insensitive)
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d: -f1 | sort -u
# Or get just the plugin names
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d/ -f3 | sort -u
```
**Note**: This list may not be complete. Always run the grep command above when checking for Chrome-related script references or debugging Chrome integration issues.
## Architecture Notes
### Crawl Model (0.9.x)
- Crawl groups multiple Snapshots from a single `add` command
- Each `add` creates one Crawl with one or more Snapshots
- Seed model was removed - crawls now store URLs directly
## Code Coverage
### Overview
Coverage tracking is enabled for passive collection across all contexts:
- Unit tests (pytest)
- Integration tests
- Dev server (manual testing)
- CLI usage
Coverage data accumulates in `.coverage` file and can be viewed/analyzed to find dead code.
### Install Coverage Tools
```bash
uv sync --dev # Installs pytest-cov and coverage
```
### Running with Coverage
#### Unit Tests
```bash
# Run tests with coverage
pytest --cov=archivebox --cov-report=term archivebox/tests/
# Or run specific test file
pytest --cov=archivebox --cov-report=term archivebox/tests/test_migrations_08_to_09.py
```
#### Dev Server with Coverage
```bash
# Start dev server with coverage tracking
coverage run --parallel-mode -m archivebox server
# Or CLI commands
coverage run --parallel-mode -m archivebox init
coverage run --parallel-mode -m archivebox add https://example.com
```
#### Manual Testing (Always-On)
To enable coverage during ALL Python executions (passive tracking):
```bash
# Option 1: Use coverage run wrapper
coverage run --parallel-mode -m archivebox [command]
# Option 2: Set environment variable (tracks everything)
export COVERAGE_PROCESS_START=pyproject.toml
# Now all Python processes will track coverage
archivebox server
archivebox add https://example.com
```
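Note that `COVERAGE_PROCESS_START` only takes effect if `coverage.process_startup()` runs at interpreter startup; coverage.py's documented way to arrange this is a small bootstrap fragment in a `sitecustomize.py` (or `.pth` file) on the Python path:

```python
# sitecustomize.py - imported automatically at interpreter startup when on sys.path.
# Starts coverage in any Python process where COVERAGE_PROCESS_START is set.
try:
    import coverage
    coverage.process_startup()  # no-op unless COVERAGE_PROCESS_START is set
except ImportError:
    pass  # coverage not installed; do nothing
```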
### Viewing Coverage
#### Text Report (Quick View)
```bash
# Combine all parallel coverage data
coverage combine
# View summary
coverage report
# View detailed report with missing lines
coverage report --show-missing
# View specific file
coverage report --include="archivebox/core/models.py" --show-missing
```
#### JSON Report (LLM-Friendly)
```bash
# Generate JSON report
coverage json
# View the JSON
cat coverage.json | jq '.files | keys' # List all files
# Find files with low coverage
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 50) | "\(.key): \(.value.summary.percent_covered)%"'
# Find completely uncovered files (dead code candidates)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'
# Get missing lines for a specific file
cat coverage.json | jq '.files["archivebox/core/models.py"].missing_lines'
```
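The same queries can be done in Python instead of jq; a small sketch that reads coverage.py's standard JSON report structure:

```python
import json

def low_coverage_files(report, threshold=50.0):
    """Return (path, percent) pairs under the threshold, sorted by path.
    `report` is the parsed coverage.json dict (coverage's JSON report format)."""
    return sorted(
        (path, info['summary']['percent_covered'])
        for path, info in report['files'].items()
        if info['summary']['percent_covered'] < threshold
    )

# Usage: report = json.load(open('coverage.json'))
#        for path, pct in low_coverage_files(report): print(f'{path}: {pct:.1f}%')
```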
#### HTML Report (Visual)
```bash
# Generate interactive HTML report
coverage html
# Open in browser
open htmlcov/index.html
```
### Isolated Runs
To measure coverage for specific scenarios:
```bash
# 1. Reset coverage data
coverage erase
# 2. Run your isolated test/scenario
pytest --cov=archivebox archivebox/tests/test_migrations_fresh.py
# OR
coverage run --parallel-mode -m archivebox add https://example.com
# 3. View results
coverage combine
coverage report --show-missing
# 4. Optionally export for analysis
coverage json
```
### Finding Dead Code
```bash
# 1. Run comprehensive tests + manual testing to build coverage
pytest --cov=archivebox archivebox/tests/
coverage run --parallel-mode -m archivebox server # Use the app manually
coverage combine
# 2. Find files with 0% coverage (strong dead code candidates)
coverage json
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'
# 3. Find files with <10% coverage (likely dead code)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 10) | "\(.key): \(.value.summary.percent_covered)%"' | sort -t: -k2 -n
# 4. Generate detailed report for analysis
coverage report --show-missing > coverage_report.txt
```
### Tips
- **Parallel mode** (`--parallel-mode`): Allows multiple processes to track coverage simultaneously without conflicts
- **Combine**: Always run `coverage combine` before viewing reports to merge parallel data
- **Reset**: Use `coverage erase` to start fresh for isolated measurements
- **Branch coverage**: Enabled by default - tracks if both branches of if/else are executed
- **Exclude patterns**: Config in `pyproject.toml` excludes tests, migrations, type stubs
## Debugging Tips
### Check Migration State