mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-01-03 01:15:57 +10:00
actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage
452
CLAUDE.md
@@ -27,135 +27,17 @@ uv sync --dev --all-extras # Always use uv, never pip directly
source .venv/bin/activate
```

### Generate and Apply Migrations
```bash
# Generate migrations (run from archivebox subdirectory)
cd archivebox
./manage.py makemigrations
### Common Gotchas

# Apply migrations to test database
cd data/
archivebox init
```

## Running Tests

### CRITICAL: Never Run as Root
ArchiveBox has a root check that prevents it from running as the root user. All ArchiveBox commands (including tests) must run as a non-root user inside a data directory:

```bash
# Run all migration tests
sudo -u testuser bash -c 'source /path/to/.venv/bin/activate && python -m pytest archivebox/tests/test_migrations_*.py -v'

# Run specific test file
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -v'

# Run single test
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_fresh.py::TestFreshInstall::test_init_creates_database -xvs'
```

### Test File Structure
```
archivebox/tests/
├── test_migrations_helpers.py    # Schemas, seeding functions, verification helpers
├── test_migrations_fresh.py      # Fresh install tests
├── test_migrations_04_to_09.py   # 0.4.x → 0.9.x migration tests
├── test_migrations_07_to_09.py   # 0.7.x → 0.9.x migration tests
└── test_migrations_08_to_09.py   # 0.8.x → 0.9.x migration tests
```

## Test Writing Standards

### NO MOCKS - Real Tests Only
Tests must exercise real code paths:
- Create real SQLite databases with version-specific schemas
- Seed with realistic test data
- Run actual `python -m archivebox` commands via subprocess
- Query SQLite directly to verify results

**If something is hard to test**: Modify the implementation to make it easier to test, or fix the underlying issue. Never mock, skip, simulate, or exit early from a test because you can't get something working inside the test.

### NO SKIPS
Never use `@skip`, `skipTest`, or `pytest.mark.skip`. Every test must run. If a test is difficult, fix the code or test environment - don't disable the test.

### Strict Assertions
- `init` command must return exit code 0 (not `[0, 1]`)
- Verify ALL data is preserved, not just "at least one"
- Use exact counts (`==`) not loose bounds (`>=`)

### Example Test Pattern
```python
def test_migration_preserves_snapshots(self):
    """Migration should preserve all snapshots."""
    result = run_archivebox(self.work_dir, ['init'], timeout=45)
    self.assertEqual(result.returncode, 0, f"Init failed: {result.stderr}")

    ok, msg = verify_snapshot_count(self.db_path, expected_count)
    self.assertTrue(ok, msg)
```

## Migration Testing

### Schema Versions
- **0.4.x**: First Django version. Tags as comma-separated string, no ArchiveResult model
- **0.7.x**: Tag model with M2M, ArchiveResult model, AutoField PKs
- **0.8.x**: Crawl/Seed models, UUID PKs, status fields, depth/retry_at
- **0.9.x**: Seed model removed, seed_id FK removed from Crawl

### Testing a Migration Path
1. Create SQLite DB with source version schema (from `test_migrations_helpers.py`)
2. Seed with realistic test data using `seed_0_X_data()`
3. Run `archivebox init` to trigger migrations
4. Verify data preservation with `verify_*` functions
5. Test CLI commands work post-migration (`status`, `list`, `add`, etc.)

### Squashed Migrations
When testing 0.8.x (dev branch), you must record ALL replaced migrations:
```python
# The squashed migration replaces these - all must be recorded
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
# ... all 52 migrations from 0023-0074 ...
('core', '0023_new_schema'), # Also record the squashed migration itself
```

## Common Gotchas

### 1. File Permissions
#### File Permissions
New files created by root need permissions fixed for testuser:
```bash
chmod 644 archivebox/tests/test_*.py
```

### 2. DATA_DIR Environment Variable
#### DATA_DIR Environment Variable
ArchiveBox commands must run inside a data directory. Tests use temp directories - the `run_archivebox()` helper sets `DATA_DIR` automatically.

### 3. Extractors Disabled for Speed
Tests disable all extractors via environment variables for faster execution:
```python
env['SAVE_TITLE'] = 'False'
env['SAVE_FAVICON'] = 'False'
# ... etc
```

### 4. Timeout Settings
Use appropriate timeouts for migration tests (45s for init, 60s default).

### 5. Circular FK References in Schemas
SQLite does not require a foreign key's target table to exist when `CREATE TABLE` runs, so tables with circular FK references can be created in either order. `CREATE TABLE IF NOT EXISTS` keeps the schema scripts safe to re-run. Order matters less than in other DBs.

## Architecture Notes

### Crawl Model (0.9.x)
- Crawl groups multiple Snapshots from a single `add` command
- Each `add` creates one Crawl with one or more Snapshots
- Seed model was removed - crawls now store URLs directly

### Migration Strategy
- Squashed migrations for clean installs
- Individual migrations recorded for upgrades from dev branch
- `replaces` attribute in squashed migrations lists what they replace

## Code Style Guidelines

### Naming Conventions for Grep-ability
@@ -207,6 +89,334 @@ class Binary(models.Model):

**Principle**: If you're storing the same conceptual data (e.g., `overrides`), use the same field name across all models and keep the internal structure identical. This makes the codebase predictable and reduces cognitive load.

## Testing

### CRITICAL: Never Run as Root
ArchiveBox has a root check that prevents it from running as the root user. All ArchiveBox commands (including tests) must run as a non-root user inside a data directory:

```bash
# Run all migration tests
sudo -u testuser bash -c 'source /path/to/.venv/bin/activate && python -m pytest archivebox/tests/test_migrations_*.py -v'

# Run specific test file
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_08_to_09.py -v'

# Run single test
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/tests/test_migrations_fresh.py::TestFreshInstall::test_init_creates_database -xvs'
```

### Test File Structure
```
archivebox/tests/
├── test_migrations_helpers.py    # Schemas, seeding functions, verification helpers
├── test_migrations_fresh.py      # Fresh install tests
├── test_migrations_04_to_09.py   # 0.4.x → 0.9.x migration tests
├── test_migrations_07_to_09.py   # 0.7.x → 0.9.x migration tests
└── test_migrations_08_to_09.py   # 0.8.x → 0.9.x migration tests
```

### Test Writing Standards

#### NO MOCKS - Real Tests Only
Tests must exercise real code paths:
- Create real SQLite databases with version-specific schemas
- Seed with realistic test data
- Run actual `python -m archivebox` commands via subprocess
- Query SQLite directly to verify results

**If something is hard to test**: Modify the implementation to make it easier to test, or fix the underlying issue. Never mock, skip, simulate, or exit early from a test because you can't get something working inside the test.

#### NO SKIPS
Never use `@skip`, `skipTest`, or `pytest.mark.skip`. Every test must run. If a test is difficult, fix the code or test environment - don't disable the test.

#### Strict Assertions
- `init` command must return exit code 0 (not `[0, 1]`)
- Verify ALL data is preserved, not just "at least one"
- Use exact counts (`==`) not loose bounds (`>=`)

### Example Test Pattern
```python
def test_migration_preserves_snapshots(self):
    """Migration should preserve all snapshots."""
    result = run_archivebox(self.work_dir, ['init'], timeout=45)
    self.assertEqual(result.returncode, 0, f"Init failed: {result.stderr}")

    ok, msg = verify_snapshot_count(self.db_path, expected_count)
    self.assertTrue(ok, msg)
```
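The pattern above relies on a `verify_*` helper that returns an `(ok, msg)` pair. A minimal sketch of what such a helper might look like follows; the body and the `core_snapshot` table name are assumptions for illustration, not the actual code in `test_migrations_helpers.py`. It queries SQLite directly and compares with `==`, in line with the strict-assertion rules.

```python
import sqlite3


def verify_snapshot_count(db_path, expected_count, table="core_snapshot"):
    """Return (ok, msg); ok is True only when the row count matches exactly.

    Hypothetical sketch: the table name is an assumption, and it is
    interpolated only because it comes from trusted test code.
    """
    conn = sqlite3.connect(db_path)
    try:
        (actual,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    finally:
        conn.close()

    if actual == expected_count:
        return True, f"found {actual} snapshots as expected"
    return False, f"expected exactly {expected_count} snapshots, found {actual}"
```

Returning a message alongside the boolean lets `self.assertTrue(ok, msg)` report the actual and expected counts when a migration drops rows.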

### Testing Gotchas

#### Extractors Disabled for Speed
Tests disable all extractors via environment variables for faster execution:
```python
env['SAVE_TITLE'] = 'False'
env['SAVE_FAVICON'] = 'False'
# ... etc
```

#### Timeout Settings
Use appropriate timeouts for migration tests (45s for init, 60s default).
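To show how the environment flags and timeouts could fit together, here is a hedged sketch of a `run_archivebox()`-style helper. Only the call shape `run_archivebox(work_dir, args, timeout=...)`, the two `SAVE_*` flags, and `DATA_DIR` come from this document; `build_test_env` and the function bodies are hypothetical and may differ from the real helper.

```python
import os
import subprocess
import sys

# Only these two flags appear in the document; the real list is longer ("... etc").
DISABLED_EXTRACTOR_FLAGS = ("SAVE_TITLE", "SAVE_FAVICON")


def build_test_env(data_dir, base_env=None):
    """Copy the environment, point DATA_DIR at the temp dir, disable extractors."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATA_DIR"] = str(data_dir)
    for flag in DISABLED_EXTRACTOR_FLAGS:
        env[flag] = "False"
    return env


def run_archivebox(data_dir, args, timeout=60):
    """Run a real `python -m archivebox ...` subprocess inside data_dir (no mocks)."""
    return subprocess.run(
        [sys.executable, "-m", "archivebox", *args],
        cwd=str(data_dir),
        env=build_test_env(data_dir),
        capture_output=True,  # exposes result.stdout / result.stderr to assertions
        text=True,
        timeout=timeout,      # raises subprocess.TimeoutExpired if exceeded
    )
```

A test would then call `run_archivebox(self.work_dir, ['init'], timeout=45)` and assert on `result.returncode`.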

## Database Migrations

### Generate and Apply Migrations
```bash
# Generate migrations (run from archivebox subdirectory)
cd archivebox
./manage.py makemigrations

# Apply migrations to test database
cd data/
archivebox init
```

### Schema Versions
- **0.4.x**: First Django version. Tags as comma-separated string, no ArchiveResult model
- **0.7.x**: Tag model with M2M, ArchiveResult model, AutoField PKs
- **0.8.x**: Crawl/Seed models, UUID PKs, status fields, depth/retry_at
- **0.9.x**: Seed model removed, seed_id FK removed from Crawl

### Testing a Migration Path
1. Create SQLite DB with source version schema (from `test_migrations_helpers.py`)
2. Seed with realistic test data using `seed_0_X_data()`
3. Run `archivebox init` to trigger migrations
4. Verify data preservation with `verify_*` functions
5. Test CLI commands work post-migration (`status`, `list`, `add`, etc.)
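Steps 1 and 2 above can be sketched as follows. The schema here is a deliberately simplified, hypothetical stand-in (the real version-specific schemas live in `test_migrations_helpers.py`), and `create_legacy_db` is an illustrative name rather than an existing helper.

```python
import sqlite3

# Hypothetical, heavily simplified legacy schema used only for illustration.
LEGACY_SCHEMA = """
CREATE TABLE IF NOT EXISTS core_snapshot (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL UNIQUE,
    timestamp TEXT NOT NULL UNIQUE,
    title TEXT,
    tags TEXT
);
"""


def create_legacy_db(db_path, urls):
    """Create the legacy tables, seed one row per URL, return the seeded count."""
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(LEGACY_SCHEMA)
        conn.executemany(
            "INSERT INTO core_snapshot (url, timestamp, title, tags) VALUES (?, ?, ?, ?)",
            [
                (url, f"1600000{i:03d}.0", f"Title {i}", "tag1,tag2")
                for i, url in enumerate(urls)
            ],
        )
        conn.commit()
    finally:
        conn.close()
    return len(urls)
```

Returning the seeded count gives the test an exact `expected_count` to compare against after `archivebox init` has run.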

### Squashed Migrations
When testing 0.8.x (dev branch), you must record ALL replaced migrations:
```python
# The squashed migration replaces these - all must be recorded
('core', '0023_alter_archiveresult_options_archiveresult_abid_and_more'),
('core', '0024_auto_20240513_1143'),
# ... all 52 migrations from 0023-0074 ...
('core', '0023_new_schema'), # Also record the squashed migration itself
```
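One way a test fixture might "record" those migrations is to insert rows into Django's `django_migrations` bookkeeping table (columns `app`, `name`, `applied`). The sketch below assumes that approach; `record_applied_migrations` is a hypothetical helper name, and the list passed in should be the full set from the squashed migration's `replaces` attribute.

```python
import sqlite3
from datetime import datetime, timezone


def record_applied_migrations(db_path, migrations):
    """Mark each (app, name) pair as applied in Django's bookkeeping table."""
    conn = sqlite3.connect(db_path)
    try:
        # Same columns Django creates for this table; created here only if missing.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS django_migrations ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "app VARCHAR(255) NOT NULL, "
            "name VARCHAR(255) NOT NULL, "
            "applied DATETIME NOT NULL)"
        )
        applied = datetime.now(timezone.utc).isoformat(sep=" ")
        conn.executemany(
            "INSERT INTO django_migrations (app, name, applied) VALUES (?, ?, ?)",
            [(app, name, applied) for app, name in migrations],
        )
        conn.commit()
    finally:
        conn.close()
```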

### Migration Strategy
- Squashed migrations for clean installs
- Individual migrations recorded for upgrades from dev branch
- `replaces` attribute in squashed migrations lists what they replace

### Migration Gotchas

#### Circular FK References in Schemas
SQLite does not require a foreign key's target table to exist when `CREATE TABLE` runs, so tables with circular FK references can be created in either order. `CREATE TABLE IF NOT EXISTS` keeps the schema scripts safe to re-run. Order matters less than in other DBs.
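A small demonstration of this gotcha, using hypothetical table names (`parent_item`, `child_item`) rather than ArchiveBox's real schema: two tables reference each other, the first is created before its FK target exists, and re-running the script is harmless.

```python
import sqlite3

# Hypothetical tables with foreign keys pointing at each other.
CIRCULAR_SCHEMA = """
CREATE TABLE IF NOT EXISTS parent_item (
    id INTEGER PRIMARY KEY,
    latest_child_id INTEGER REFERENCES child_item(id)
);
CREATE TABLE IF NOT EXISTS child_item (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES parent_item(id)
);
"""


def create_circular_schema(conn):
    """Create both tables (twice, to show re-runs are safe) and list them."""
    conn.executescript(CIRCULAR_SCHEMA)
    conn.executescript(CIRCULAR_SCHEMA)  # no-op the second time: IF NOT EXISTS
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

`parent_item` is created while `child_item` does not yet exist, which SQLite accepts because the FK target is not checked at `CREATE TABLE` time.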

## Plugin System Architecture

### Plugin Dependency Rules

Like other plugins, chrome plugins **ARE NOT ALLOWED TO DEPEND ON ARCHIVEBOX OR DJANGO**.
However, they are allowed to depend on two shared files ONLY:
- `archivebox/plugins/chrome/chrome_utils.js` ← source of truth API for all basic chrome ops
- `archivebox/plugins/chrome/tests/chrome_test_utils.py` ← use for your tests, do not implement launching/killing/pid files/cdp/etc. in python, just extend this file as needed.

### Chrome-Dependent Plugins

Many plugins depend on Chrome/Chromium via CDP (Chrome DevTools Protocol). When checking for script name references or debugging Chrome-related issues, check these plugins:

**Main puppeteer-based chrome installer + launcher plugin**:
- `chrome` - Core Chrome integration (CDP, launch, navigation)

**Metadata extraction using chrome/chrome_utils.js / CDP**:
- `dns` - DNS resolution info
- `ssl` - SSL certificate info
- `headers` - HTTP response headers
- `redirects` - Capture redirect chains
- `staticfile` - Direct file downloads (e.g. if the url itself is a .png, .exe, .zip, etc.)
- `responses` - Capture network responses
- `consolelog` - Capture console.log output
- `title` - Extract page title
- `accessibility` - Extract accessibility tree
- `seo` - Extract SEO metadata

**Extensions installed using chrome/chrome_utils.js / controlled using CDP**:
- `ublock` - uBlock Origin ad blocking
- `istilldontcareaboutcookies` - Cookie banner dismissal
- `twocaptcha` - 2captcha CAPTCHA solver integration

**Page-alteration plugins to prepare the content for archiving**:
- `modalcloser` - Modal dialog dismissal
- `infiniscroll` - Infinite scroll handler

**Main Extractor Outputs**:
- `dom` - DOM snapshot extraction
- `pdf` - Generate PDF snapshots
- `screenshot` - Generate screenshots
- `singlefile` - SingleFile archival, can be single-file-cli that launches chrome, or singlefile extension running inside chrome

**Crawl URL parsers** (post-process dom.html, singlefile.html, staticfile, responses, headers, etc. for URLs to re-emit as new queued Snapshots during recursive crawling):
- `parse_dom_outlinks` - Extract outlinks from DOM (special, uses CDP to directly query browser)
- `parse_html_urls` - Parse URLs from HTML (doesn't use chrome directly, just reads dom.html)
- `parse_jsonl_urls` - Parse URLs from JSONL (doesn't use chrome directly, just reads dom.html)
- `parse_netscape_urls` - Parse Netscape bookmark format (doesn't use chrome directly, just reads dom.html)

### Finding Chrome-Dependent Plugins

```bash
# Find all files containing "chrom" (case-insensitive)
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d: -f1 | sort -u

# Or get just the plugin names
grep -ri "chrom" archivebox/plugins/*/on_*.* --include="*.*" 2>/dev/null | cut -d/ -f3 | sort -u
```

**Note**: This list may not be complete. Always run the grep command above when checking for Chrome-related script references or debugging Chrome integration issues.
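For use inside a test or script, the same search can be approximated in Python. This is a rough equivalent of the grep pipeline, assuming the same `archivebox/plugins/<name>/on_*.*` layout; `find_chrome_plugins` is an illustrative name, not an existing helper.

```python
from pathlib import Path


def find_chrome_plugins(plugins_dir):
    """Return sorted plugin names whose on_* hook files mention "chrom" (any case)."""
    names = set()
    for hook in Path(plugins_dir).glob("*/on_*.*"):
        if not hook.is_file():
            continue
        text = hook.read_text(encoding="utf-8", errors="ignore")
        if "chrom" in text.lower():
            names.add(hook.parent.name)  # directory name is the plugin name
    return sorted(names)
```

Calling `find_chrome_plugins("archivebox/plugins")` from the repository root should list the same plugin names as the second grep command.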

## Architecture Notes

### Crawl Model (0.9.x)
- Crawl groups multiple Snapshots from a single `add` command
- Each `add` creates one Crawl with one or more Snapshots
- Seed model was removed - crawls now store URLs directly

## Code Coverage

### Overview

Coverage tracking is enabled for passive collection across all contexts:
- Unit tests (pytest)
- Integration tests
- Dev server (manual testing)
- CLI usage

Coverage data accumulates in the `.coverage` file and can be viewed and analyzed to find dead code.

### Install Coverage Tools

```bash
uv sync --dev # Installs pytest-cov and coverage
```

### Running with Coverage

#### Unit Tests
```bash
# Run tests with coverage
pytest --cov=archivebox --cov-report=term archivebox/tests/

# Or run specific test file
pytest --cov=archivebox --cov-report=term archivebox/tests/test_migrations_08_to_09.py
```

#### Dev Server with Coverage
```bash
# Start dev server with coverage tracking
coverage run --parallel-mode -m archivebox server

# Or CLI commands
coverage run --parallel-mode -m archivebox init
coverage run --parallel-mode -m archivebox add https://example.com
```

#### Manual Testing (Always-On)
To enable coverage during ALL Python executions (passive tracking):

```bash
# Option 1: Use coverage run wrapper
coverage run --parallel-mode -m archivebox [command]

# Option 2: Set environment variable (tracks everything)
export COVERAGE_PROCESS_START=pyproject.toml
# Python processes now track coverage, provided coverage.process_startup()
# is called at interpreter startup (e.g. from a .pth file or sitecustomize.py)
archivebox server
archivebox add https://example.com
```

### Viewing Coverage

#### Text Report (Quick View)
```bash
# Combine all parallel coverage data
coverage combine

# View summary
coverage report

# View detailed report with missing lines
coverage report --show-missing

# View specific file
coverage report --include="archivebox/core/models.py" --show-missing
```

#### JSON Report (LLM-Friendly)
```bash
# Generate JSON report
coverage json

# View the JSON
cat coverage.json | jq '.files | keys' # List all files

# Find files with low coverage
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 50) | "\(.key): \(.value.summary.percent_covered)%"'

# Find completely uncovered files (dead code candidates)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# Get missing lines for a specific file
cat coverage.json | jq '.files["archivebox/core/models.py"].missing_lines'
```
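When `jq` is not available, the same queries can be run in Python against the report's `files` → `summary` → `percent_covered` structure used in the commands above. The function names below are illustrative, not part of the repository.

```python
import json


def load_report(path="coverage.json"):
    """Load the report written by `coverage json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def low_coverage_files(report, threshold=50.0):
    """Return (path, percent) pairs below the threshold, lowest coverage first."""
    rows = [
        (path, info["summary"]["percent_covered"])
        for path, info in report["files"].items()
        if info["summary"]["percent_covered"] < threshold
    ]
    return sorted(rows, key=lambda row: row[1])


def uncovered_files(report):
    """Return paths with 0% coverage (dead code candidates)."""
    return sorted(
        path
        for path, info in report["files"].items()
        if info["summary"]["percent_covered"] == 0
    )
```

For example, `low_coverage_files(load_report(), threshold=10)` mirrors the "<10% coverage" query.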

#### HTML Report (Visual)
```bash
# Generate interactive HTML report
coverage html

# Open in browser (macOS; on Linux use xdg-open instead)
open htmlcov/index.html
```

### Isolated Runs

To measure coverage for specific scenarios:

```bash
# 1. Reset coverage data
coverage erase

# 2. Run your isolated test/scenario
pytest --cov=archivebox archivebox/tests/test_migrations_fresh.py
# OR
coverage run --parallel-mode -m archivebox add https://example.com

# 3. View results
coverage combine
coverage report --show-missing

# 4. Optionally export for analysis
coverage json
```

### Finding Dead Code

```bash
# 1. Run comprehensive tests + manual testing to build coverage
pytest --cov=archivebox archivebox/tests/
coverage run --parallel-mode -m archivebox server # Use the app manually
coverage combine --append # --append keeps the existing .coverage data from the pytest run

# 2. Find files with 0% coverage (strong dead code candidates)
coverage json
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered == 0) | .key'

# 3. Find files with <10% coverage (likely dead code)
cat coverage.json | jq -r '.files | to_entries[] | select(.value.summary.percent_covered < 10) | "\(.key): \(.value.summary.percent_covered)%"' | sort -t: -k2 -n

# 4. Generate detailed report for analysis
coverage report --show-missing > coverage_report.txt
```

### Tips

- **Parallel mode** (`--parallel-mode`): Allows multiple processes to track coverage simultaneously without conflicts
- **Combine**: Always run `coverage combine` before viewing reports to merge parallel data
- **Reset**: Use `coverage erase` to start fresh for isolated measurements
- **Branch coverage**: Enabled in this project's configuration (coverage.py leaves it off unless `branch = true` is set) - tracks if both branches of if/else are executed
- **Exclude patterns**: Config in `pyproject.toml` excludes tests, migrations, type stubs

## Debugging Tips

### Check Migration State