alex/ArchiveBox

Fork 0

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-02 17:05:38 +10:00

Files

Nick Sweeting f7457b13ad more migrations fixes attempts

2025-12-31 17:46:10 -08:00

15 KiB

Raw Permalink Blame History

TODO: Fix Migration Path for v0.7.2/v0.8.6rc0 → v0.9.0

Critical Issue

The migrations currently LOSE DATA during the v0.7.2 → v0.9.0 upgrade:

extractor field data is not being copied to plugin field
output field data is not being copied to output_str field
Timestamp fields (added, updated) may not be properly transformed
Tag UUID → INTEGER conversion may lose FK relationships

Test Database Locations

Sample databases for testing are available at:

/Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.7.2/data/index.sqlite3
/Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.8.6rc0/data/index.sqlite3

Schema comparison reports:

/tmp/schema_comparison_report.md
/tmp/table_presence_matrix.md

How to Test Migrations

1. Fresh Install Test

rm -rf /tmp/test_fresh && mkdir -p /tmp/test_fresh
DATA_DIR=/tmp/test_fresh python -m archivebox init
DATA_DIR=/tmp/test_fresh python -m archivebox status

2. v0.7.2 Migration Test

rm -rf /tmp/test_v072 && mkdir -p /tmp/test_v072
cp /Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.7.2/data/index.sqlite3 /tmp/test_v072/
DATA_DIR=/tmp/test_v072 python -m archivebox init
DATA_DIR=/tmp/test_v072 python -m archivebox status

3. v0.8.6rc0 Migration Test

rm -rf /tmp/test_v086 && mkdir -p /tmp/test_v086
cp /Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.8.6rc0/data/index.sqlite3 /tmp/test_v086/
DATA_DIR=/tmp/test_v086 python -m archivebox init
DATA_DIR=/tmp/test_v086 python -m archivebox status

4. Verify Data Integrity

After each test, compare original vs migrated data:

# Check ArchiveResult data preservation
echo "=== ORIGINAL ==="
sqlite3 /path/to/original.db "SELECT id, extractor, output, status FROM core_archiveresult LIMIT 5;"

echo "=== MIGRATED ==="
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT id, plugin, output_str, status FROM core_archiveresult LIMIT 5;"

# Check Snapshot data preservation
echo "=== ORIGINAL SNAPSHOTS ==="
sqlite3 /path/to/original.db "SELECT id, url, title, added, updated FROM core_snapshot LIMIT 5;"

echo "=== MIGRATED SNAPSHOTS ==="
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT id, url, title, bookmarked_at, created_at, modified_at FROM core_snapshot LIMIT 5;"

# Check Tag data preservation
echo "=== ORIGINAL TAGS ==="
sqlite3 /path/to/original.db "SELECT * FROM core_tag;"

echo "=== MIGRATED TAGS ==="
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT * FROM core_tag;"

# Check snapshot-tag relationships
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT COUNT(*) FROM core_snapshot_tags;"

CRITICAL: Verify:

Row counts match
All URLs, titles, timestamps are preserved
All extractor values are copied to plugin field
All output values are copied to output_str field
All tag relationships are maintained (tag IDs should be converted from UUID to INTEGER for v0.8.6)

Migration Philosophy

Principle: Minimal Manual SQL

Use this approach for complex migrations:

Python: Detect existing schema version

def get_table_columns(table_name):
    cursor = connection.cursor()
    cursor.execute(f"PRAGMA table_info({table_name})")
    return {row[1] for row in cursor.fetchall()}

cols = get_table_columns('core_archiveresult')
has_extractor = 'extractor' in cols
has_plugin = 'plugin' in cols

SQL: Modify database structure during migration

CREATE TABLE core_archiveresult_new (...);
INSERT INTO core_archiveresult_new SELECT ... FROM core_archiveresult;
DROP TABLE core_archiveresult;
ALTER TABLE core_archiveresult_new RENAME TO core_archiveresult;

Python: Copy data between old and new field names

if 'extractor' in cols and 'plugin' in cols:
    cursor.execute("UPDATE core_archiveresult SET plugin = COALESCE(extractor, '')")

SQL: Drop old columns/tables

-- Django's RemoveField will handle this

Django: Register the end state so Django knows what the schema should be

migrations.SeparateDatabaseAndState(
    database_operations=[...],  # Your SQL/Python migrations
    state_operations=[...]       # Tell Django what the final schema looks like
)

Key Files

core/migrations/0023_upgrade_to_0_9_0.py: Raw SQL migration that upgrades tables from v0.7.2/v0.8.6 schema
- Should create NEW tables with OLD field names (extractor, output, added, updated)
- Should preserve ALL data during table rebuild
- Should NOT add new fields yet (let Django migrations handle that)
core/migrations/0025_alter_archiveresult_options_...py: Django-generated migration
- Adds new fields (plugin, output_str, bookmarked_at, created_at, etc.)
- Should include RunPython to copy data from old fields to new fields AFTER AddField operations
- RemoveField operations to remove old columns
crawls/migrations/0002_upgrade_from_0_8_6.py: Handles crawls_crawl table upgrade
- v0.8.6 has seed_id + persona (VARCHAR)
- v0.9.0 has urls + persona_id (UUID FK)

How to Make vs Apply Migrations

Making Migrations (Creating New Migrations)

Always run from the archivebox/ subdirectory (NOT from a data dir):

cd archivebox/
./manage.py makemigrations
./manage.py makemigrations --check  # Verify no unreflected changes

This works because archivebox/manage.py has:

os.environ.setdefault('ARCHIVEBOX_DATA_DIR', '.')

Applying Migrations (Testing Migrations)

Always run from inside a data directory using archivebox init:

# WRONG - Don't do this:
cd /some/data/dir
../path/to/archivebox/manage.py migrate

# RIGHT - Do this:
DATA_DIR=/some/data/dir python -m archivebox init

Why? Because archivebox init:

Sets up the data directory structure
Runs migrations with proper DATA_DIR context
Creates necessary files and folders
Validates the installation

Schema Version Differences

v0.7.2 Schema (Migration 0022)

ArchiveResult: id (INTEGER), uuid, extractor, output, cmd, pwd, cmd_version, start_ts, end_ts, status, snapshot_id
Snapshot: id, url, timestamp, title, added, updated, crawl_id
Tag: id (INTEGER), name, slug
Crawl: Doesn't exist in v0.7.2

v0.8.6rc0 Schema

ArchiveResult: id, abid (not uuid!), extractor, output, created_at, modified_at, retry_at, status, ...
Snapshot: id, url, bookmarked_at, created_at, modified_at, crawl_id, status, retry_at, ...
Tag: id (UUID/CHAR!), name, slug, abid, created_at, modified_at, created_by_id
Crawl: id, seed_id, persona (VARCHAR), max_depth, tags_str, status, retry_at, ...

v0.9.0 Target Schema

ArchiveResult: id (INTEGER), uuid, plugin (not extractor!), output_str (not output!), hook_name, created_at, modified_at, output_files, output_json, output_size, output_mimetypes, retry_at, ...
Snapshot: id, url, bookmarked_at (not added!), created_at, modified_at (not updated!), crawl_id, parent_snapshot_id, status, retry_at, current_step, depth, fs_version, ...
Tag: id (INTEGER!), name, slug, created_at, modified_at, created_by_id
Crawl: id, urls (not seed_id!), persona_id (not persona!), label, notes, output_dir, ...

Critical Gotchas and Mistakes to Avoid

1. ❌ DON'T Create New Fields in SQL Migration (0023)

WRONG:

# In core/migrations/0023_upgrade_to_0_9_0.py
cursor.execute("""
    CREATE TABLE core_archiveresult_new (
        id INTEGER PRIMARY KEY,
        plugin VARCHAR(32),  # ❌ New field!
        output_str TEXT,     # ❌ New field!
        ...
    )
""")

RIGHT:

# In core/migrations/0023_upgrade_to_0_9_0.py - Keep OLD field names!
cursor.execute("""
    CREATE TABLE core_archiveresult_new (
        id INTEGER PRIMARY KEY,
        extractor VARCHAR(32),  # ✓ OLD field name
        output VARCHAR(1024),   # ✓ OLD field name
        ...
    )
""")

Why: If you create new fields in SQL, Django's AddField operation in migration 0025 will overwrite them with default values, losing your data!

2. ❌ DON'T Copy Data in SQL Migration

WRONG:

# In core/migrations/0023
cursor.execute("""
    INSERT INTO core_archiveresult_new (plugin, output_str, ...)
    SELECT COALESCE(extractor, ''), COALESCE(output, ''), ...
    FROM core_archiveresult
""")

RIGHT: Keep old field names in SQL, let Django AddField create new columns, then copy:

# In core/migrations/0025 (AFTER AddField operations)
def copy_old_to_new(apps, schema_editor):
    cursor = connection.cursor()
    cursor.execute("UPDATE core_archiveresult SET plugin = COALESCE(extractor, '')")
    cursor.execute("UPDATE core_archiveresult SET output_str = COALESCE(output, '')")

3. ❌ DON'T Assume Empty Tables Mean Fresh Install

WRONG:

cursor.execute("SELECT COUNT(*) FROM core_archiveresult")
if cursor.fetchone()[0] == 0:
    return  # Skip migration

Why: Fresh installs run migrations 0001-0022 which CREATE empty tables with old schema. Migration 0023 must still upgrade the schema even if tables are empty!

RIGHT: Detect schema version by checking column names:

cols = get_table_columns('core_archiveresult')
has_extractor = 'extractor' in cols
if has_extractor:
    # Old schema - needs upgrade

4. ❌ DON'T Run Migrations from Data Directories

WRONG:

cd /path/to/data/dir
python manage.py makemigrations

RIGHT:

cd archivebox/  # The archivebox package directory
./manage.py makemigrations

5. ❌ DON'T Use WHERE Clauses to Skip SQL Selects

WRONG:

INSERT INTO new_table SELECT uuid FROM old_table
WHERE EXISTS (SELECT 1 FROM pragma_table_info('old_table') WHERE name='uuid');

Why: SQLite still evaluates the uuid column reference even if WHERE clause is false, causing "no such column" errors.

RIGHT: Use Python to detect schema, then run appropriate SQL:

if 'uuid' in get_table_columns('old_table'):
    cursor.execute("INSERT INTO new_table SELECT uuid FROM old_table")
else:
    cursor.execute("INSERT INTO new_table SELECT abid as uuid FROM old_table")

6. ❌ DON'T Mix UUID and INTEGER for Tag IDs

v0.8.6rc0 has Tag.id as UUID, but v0.9.0 needs INTEGER. The conversion must:

Create mapping of old UUID → new INTEGER
Update core_tag with new IDs
Update core_snapshot_tags with new tag_id values

See core/migrations/0023_upgrade_to_0_9_0.py PART 3 for the correct approach.

7. ❌ DON'T Forget SeparateDatabaseAndState

When you manually change the database with SQL, you MUST tell Django what the final state is:

migrations.SeparateDatabaseAndState(
    database_operations=[
        migrations.RunPython(my_sql_function),
    ],
    state_operations=[
        migrations.RemoveField('archiveresult', 'extractor'),
        migrations.RemoveField('archiveresult', 'output'),
    ],
)

Without state_operations, Django won't know the old fields are gone and makemigrations --check will show unreflected changes.

8. ✅ DO Print Debug Messages

print(f'Migrating ArchiveResult from v0.7.2 schema...')
print(f'DEBUG: has_uuid={has_uuid}, has_abid={has_abid}, row_count={row_count}')

This helps diagnose which migration path is being taken.

9. ✅ DO Test All Three Scenarios

Always test:

Fresh install (empty database)
v0.7.2 upgrade (12 snapshots, 44 archiveresults, 2 tags)
v0.8.6rc0 upgrade (14 snapshots, 0 archiveresults, multiple tags with UUIDs)

10. ✅ DO Verify No Unreflected Migrations

After all changes:

cd archivebox/
./manage.py makemigrations --check
# Should output: No changes detected

Current Status

As of 2025-01-01, migrations have these issues:

✅ Fresh install works
✅ v0.7.2 → v0.9.0 migration runs without errors
✅ v0.8.6rc0 → v0.9.0 migration runs without errors
❌ DATA IS LOST: extractor → plugin field data not copied
❌ DATA IS LOST: output → output_str field data not copied
❌ Timestamps (added/updated → bookmarked_at/created_at/modified_at) may have wrong values
❌ Tag relationships may be broken after UUID → INTEGER conversion

Files That Need Fixing

core/migrations/0023_upgrade_to_0_9_0.py
- Line 42-58: CREATE TABLE should use OLD field names (extractor, output, added, updated)
- Lines 64-88: INSERT SELECT should just copy data as-is, no field renaming yet
- Remove all references to plugin, output_str, bookmarked_at, created_at - these are added by 0025
core/migrations/0025_...py
- Add RunPython operation AFTER all AddField operations
- This RunPython should copy: extractor→plugin, output→output_str, added→bookmarked_at/created_at, updated→modified_at
- Fix syntax error on line 28: {extractor" in cols} → {"extractor" in cols}
crawls/migrations/0002_upgrade_from_0_8_6.py
- Already correctly handles conditional upgrade based on schema detection
- No changes needed if crawls table data isn't critical

Next Steps

Fix core/migrations/0023 to preserve OLD field names
Fix core/migrations/0025 to copy data from old → new fields after AddField
Remove debug print statements (lines with print(f'DEBUG:...)
Test all three scenarios
Verify data integrity with SQL queries above
Run ./manage.py makemigrations --check to ensure no unreflected changes

Reference: Field Mappings

Old Field (v0.7.2/v0.8.6)	New Field (v0.9.0)	Notes
`extractor`	`plugin`	Rename
`output`	`output_str`	Rename
`added`	`bookmarked_at`	Rename + also use for `created_at`
`updated`	`modified_at`	Rename
`abid`	`uuid`	v0.8.6 only, field rename
Tag.id (UUID)	Tag.id (INTEGER)	v0.8.6 only, type conversion
`seed_id`	`urls`	Crawl table, v0.8.6 only
`persona` (VARCHAR)	`persona_id` (UUID FK)	Crawl table, v0.8.6 only

Testing Checklist

Fresh install creates correct schema
Fresh install has 0 snapshots, 0 archiveresults
v0.7.2 migration preserves all 12 snapshots
v0.7.2 migration preserves all 44 archiveresults
v0.7.2 migration preserves all 2 tags
v0.7.2 migration copies extractor → plugin (check first 5 rows)
v0.7.2 migration copies output → output_str (check first 5 rows)
v0.7.2 migration copies added → bookmarked_at (compare timestamps)
v0.7.2 migration copies updated → modified_at (compare timestamps)
v0.8.6 migration preserves all 14 snapshots
v0.8.6 migration converts Tag IDs from UUID → INTEGER
v0.8.6 migration preserves tag relationships in core_snapshot_tags
v0.8.6 migration converts abid → uuid field
./manage.py makemigrations --check shows no changes
All migrations run without errors
archivebox status shows correct snapshot/link counts

15 KiB Raw Permalink Blame History

TODO: Fix Migration Path for v0.7.2/v0.8.6rc0 → v0.9.0

Critical Issue

Test Database Locations

How to Test Migrations

1. Fresh Install Test

2. v0.7.2 Migration Test

3. v0.8.6rc0 Migration Test

4. Verify Data Integrity

Migration Philosophy

Principle: Minimal Manual SQL

Key Files

How to Make vs Apply Migrations

Making Migrations (Creating New Migrations)

Applying Migrations (Testing Migrations)

Schema Version Differences

v0.7.2 Schema (Migration 0022)

v0.8.6rc0 Schema

v0.9.0 Target Schema

Critical Gotchas and Mistakes to Avoid

1. ❌ DON'T Create New Fields in SQL Migration (0023)

2. ❌ DON'T Copy Data in SQL Migration

3. ❌ DON'T Assume Empty Tables Mean Fresh Install

4. ❌ DON'T Run Migrations from Data Directories

5. ❌ DON'T Use WHERE Clauses to Skip SQL Selects

6. ❌ DON'T Mix UUID and INTEGER for Tag IDs

7. ❌ DON'T Forget SeparateDatabaseAndState

8. ✅ DO Print Debug Messages

9. ✅ DO Test All Three Scenarios

10. ✅ DO Verify No Unreflected Migrations

Current Status

Files That Need Fixing

Next Steps

Reference: Field Mappings

Testing Checklist

15 KiB

Raw Permalink Blame History