Content-Addressable Storage (CAS) with Symlink Farm Architecture
Table of Contents
- Overview
- Architecture Design
- Database Models
- Storage Backends
- Symlink Farm Views
- Automatic Synchronization
- Migration Strategy
- Verification and Repair
- Configuration
- Workflow Examples
- Benefits
Overview
Problem Statement
ArchiveBox currently stores files in a timestamp-based structure:
/data/archive/{timestamp}/{extractor}/filename.ext
This leads to:
- Massive duplication: jquery.min.js stored 1000x across different snapshots
- No S3 support: Direct filesystem coupling
- Inflexible organization: Hard to browse by domain, date, or user
Solution: Content-Addressable Storage + Symlink Farm
Core Concept:
- Store files once in content-addressable storage (CAS) by hash
- Create symlink farms in multiple human-readable views
- Database as source of truth with automatic sync
- Support S3 and local storage via django-storages
Storage Layout:
/data/
├── cas/ # Content-addressable storage (deduplicated)
│ └── sha256/
│ └── ab/
│ └── cd/
│ └── abcdef123... # Actual file (stored once)
│
├── archive/ # Human-browseable views (all symlinks)
│ ├── by_domain/
│ │ └── example.com/
│ │ └── 20241225/
│ │ └── 019b54ee-28d9-72dc/
│ │ ├── wget/
│ │ │ └── index.html -> ../../../../../cas/sha256/ab/cd/abcdef...
│ │ └── singlefile/
│ │ └── page.html -> ../../../../../cas/sha256/ef/12/ef1234...
│ │
│ ├── by_date/
│ │ └── 20241225/
│ │ └── example.com/
│ │ └── 019b54ee-28d9-72dc/
│ │ └── wget/
│ │ └── index.html -> ../../../../../../cas/sha256/ab/cd/abcdef...
│ │
│ ├── by_user/
│ │ └── squash/
│ │ └── 20241225/
│ │ └── example.com/
│ │ └── 019b54ee-28d9-72dc/
│ │
│ └── by_timestamp/ # Legacy compatibility
│ └── 1735142400.123/
│ └── wget/
│ └── index.html -> ../../../../cas/sha256/ab/cd/abcdef...
Architecture Design
Core Principles
- Database = Source of Truth: The SnapshotFile model is authoritative
- Symlinks = Materialized Views: Auto-generated from DB, disposable
- Atomic Updates: Symlinks created/deleted with DB transactions
- Idempotent: Operations can be safely retried
- Self-Healing: Automatic detection and repair of drift
- Content-Addressable: Files deduplicated by SHA-256 hash
- Storage Agnostic: Works with local filesystem, S3, Azure, etc.
Space Overhead Analysis
Symlinks are incredibly cheap:
Typical symlink size:
- ext4/XFS: ~60-100 bytes
- ZFS: ~120 bytes
- btrfs: ~80 bytes
Example calculation:
100,000 files × 4 views = 400,000 symlinks
400,000 symlinks × 100 bytes = 40 MB
Space saved by deduplication:
- Average 30% duplicate content across archives
- 100GB archive → saves ~30GB
- Symlink overhead: 0.04GB (0.13% of savings!)
Verdict: Symlinks are FREE compared to deduplication savings
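To make the arithmetic concrete, here is a quick back-of-the-envelope sketch using the illustrative numbers above:
# Back-of-the-envelope: symlink overhead vs. deduplication savings
# (all numbers are the illustrative assumptions from the analysis above)
num_files = 100_000
num_views = 4
symlink_bytes = 100                  # rough per-symlink cost on ext4/XFS
duplicate_ratio = 0.30               # assumed share of duplicate content
archive_bytes = 100 * 10**9          # 100 GB archive

symlink_overhead = num_files * num_views * symlink_bytes    # ~40 MB
dedup_savings = archive_bytes * duplicate_ratio             # ~30 GB

print(f"Symlink overhead: {symlink_overhead / 10**6:.0f} MB")
print(f"Dedup savings:    {dedup_savings / 10**9:.0f} GB")
print(f"Overhead as share of savings: {symlink_overhead / dedup_savings:.2%}")   # ~0.13%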
Database Models
Blob Model
# archivebox/core/models.py
from pathlib import Path

from django.db import models

class Blob(models.Model):
"""
Immutable content-addressed blob.
Stored as: /cas/{hash_algorithm}/{ab}/{cd}/{full_hash}
"""
# Content identification
hash_algorithm = models.CharField(max_length=16, default='sha256', db_index=True)
hash = models.CharField(max_length=128, db_index=True)
size = models.BigIntegerField()
# Storage location
storage_backend = models.CharField(
max_length=32,
default='local',
choices=[
('local', 'Local Filesystem'),
('s3', 'S3'),
('azure', 'Azure Blob Storage'),
('gcs', 'Google Cloud Storage'),
],
db_index=True,
)
# Metadata
mime_type = models.CharField(max_length=255, blank=True)
created_at = models.DateTimeField(auto_now_add=True, db_index=True)
# Reference counting (for garbage collection)
ref_count = models.IntegerField(default=0, db_index=True)
class Meta:
unique_together = [('hash_algorithm', 'hash', 'storage_backend')]
indexes = [
models.Index(fields=['hash_algorithm', 'hash']),
models.Index(fields=['ref_count']),
models.Index(fields=['storage_backend', 'created_at']),
]
constraints = [
# Ensure ref_count is never negative
models.CheckConstraint(
check=models.Q(ref_count__gte=0),
name='blob_ref_count_positive'
),
]
def __str__(self):
return f"Blob({self.hash[:16]}..., refs={self.ref_count})"
@property
def storage_path(self) -> str:
"""Content-addressed path: sha256/ab/cd/abcdef123..."""
h = self.hash
return f"{self.hash_algorithm}/{h[:2]}/{h[2:4]}/{h}"
def get_file_url(self):
"""Get URL to access this blob"""
from django.core.files.storage import default_storage
return default_storage.url(self.storage_path)
class SnapshotFile(models.Model):
"""
Links a Snapshot to its files (many-to-many through Blob).
Preserves original path information for backwards compatibility.
"""
snapshot = models.ForeignKey(
Snapshot,
on_delete=models.CASCADE,
related_name='files'
)
blob = models.ForeignKey(
Blob,
on_delete=models.PROTECT # PROTECT: can't delete blob while referenced
)
# Original path information
extractor = models.CharField(max_length=32) # 'wget', 'singlefile', etc.
relative_path = models.CharField(max_length=512) # 'output.html', 'warc/example.warc.gz'
# Metadata
created_at = models.DateTimeField(auto_now_add=True, db_index=True)
class Meta:
unique_together = [('snapshot', 'extractor', 'relative_path')]
indexes = [
models.Index(fields=['snapshot', 'extractor']),
models.Index(fields=['blob']),
models.Index(fields=['created_at']),
]
def __str__(self):
return f"{self.snapshot.id}/{self.extractor}/{self.relative_path}"
@property
def logical_path(self) -> Path:
"""Virtual path as it would appear in old structure"""
return Path(self.snapshot.output_dir) / self.extractor / self.relative_path
def save(self, *args, **kwargs):
"""Override save to ensure paths are normalized"""
# Normalize path (no leading slash, use forward slashes)
self.relative_path = self.relative_path.lstrip('/').replace('\\', '/')
super().save(*args, **kwargs)
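A minimal usage sketch of these two models (the content bytes, the snapshot instance, and the paths are placeholders, not part of the design):
import hashlib

# Register one piece of extractor output by hand (illustrative values only)
content = b'<html>...</html>'
digest = hashlib.sha256(content).hexdigest()

blob, _ = Blob.objects.get_or_create(
    hash_algorithm='sha256',
    hash=digest,
    storage_backend='local',
    defaults={'size': len(content), 'mime_type': 'text/html'},
)
print(blob.storage_path)        # -> 'sha256/ab/cd/abcdef123...'

SnapshotFile.objects.create(
    snapshot=snapshot,          # assumed: an existing Snapshot instance
    blob=blob,
    extractor='wget',
    relative_path='index.html',
)

# All wget outputs for that snapshot, resolved to their blobs
for sf in snapshot.files.filter(extractor='wget').select_related('blob'):
    print(sf.relative_path, sf.blob.hash[:16], sf.blob.size)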
Updated Snapshot Model
class Snapshot(ModelWithOutputDir, ...):
# ... existing fields ...
@property
def output_dir(self) -> Path:
"""
Returns the primary view directory for browsing.
Falls back to legacy if needed.
"""
# Try by_timestamp view first (best compatibility)
by_timestamp = CONSTANTS.ARCHIVE_DIR / 'by_timestamp' / self.timestamp
if by_timestamp.exists():
return by_timestamp
# Fall back to legacy location (pre-CAS archives)
legacy = CONSTANTS.ARCHIVE_DIR / self.timestamp
if legacy.exists():
return legacy
# Default to by_timestamp for new snapshots
return by_timestamp
def get_output_dir(self, view: str = 'by_timestamp') -> Path:
"""Get output directory for a specific view"""
from storage.views import ViewManager
from urllib.parse import urlparse
if view not in ViewManager.VIEWS:
raise ValueError(f"Unknown view: {view}")
if view == 'by_domain':
domain = urlparse(self.url).netloc or 'unknown'
date = self.created_at.strftime('%Y%m%d')
return CONSTANTS.ARCHIVE_DIR / 'by_domain' / domain / date / str(self.id)
elif view == 'by_date':
domain = urlparse(self.url).netloc or 'unknown'
date = self.created_at.strftime('%Y%m%d')
return CONSTANTS.ARCHIVE_DIR / 'by_date' / date / domain / str(self.id)
elif view == 'by_user':
domain = urlparse(self.url).netloc or 'unknown'
date = self.created_at.strftime('%Y%m%d')
user = self.created_by.username
return CONSTANTS.ARCHIVE_DIR / 'by_user' / user / date / domain / str(self.id)
elif view == 'by_timestamp':
return CONSTANTS.ARCHIVE_DIR / 'by_timestamp' / self.timestamp
return self.output_dir
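For illustration, the same snapshot resolves to a different directory per view (values taken from the layout example above; actual IDs, dates, and usernames will differ, and snapshot is assumed to be an existing instance):
for view in ('by_domain', 'by_date', 'by_user', 'by_timestamp'):
    print(view, snapshot.get_output_dir(view))

# by_domain    /data/archive/by_domain/example.com/20241225/019b54ee-28d9-72dc
# by_date      /data/archive/by_date/20241225/example.com/019b54ee-28d9-72dc
# by_user      /data/archive/by_user/squash/20241225/example.com/019b54ee-28d9-72dc
# by_timestamp /data/archive/by_timestamp/1735142400.123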
Updated ArchiveResult Model
class ArchiveResult(models.Model):
# ... existing fields ...
# Note: output_dir field is removed (was deprecated)
# Keep: output (relative path to primary output file)
@property
def output_files(self):
"""Get all files for this extractor"""
return self.snapshot.files.filter(extractor=self.extractor)
@property
def primary_output_file(self):
"""Get the primary output file (e.g., 'output.html')"""
if self.output:
return self.snapshot.files.filter(
extractor=self.extractor,
relative_path=self.output
).first()
return None
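Usage sketch, assuming result is an existing ArchiveResult row for the wget extractor:
for sf in result.output_files:
    print(sf.extractor, sf.relative_path, sf.blob.size)

primary = result.primary_output_file      # SnapshotFile or None
if primary is not None:
    print('primary output URL:', primary.blob.get_file_url())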
Storage Backends
Django Storage Configuration
# settings.py or archivebox/config/settings.py
# For local development/testing
STORAGES = {
"default": {
"BACKEND": "django.core.files.storage.FileSystemStorage",
"OPTIONS": {
"location": "/data/cas",
"base_url": "/cas/",
},
},
"staticfiles": {
"BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
},
}
# For production with S3
STORAGES = {
"default": {
"BACKEND": "storages.backends.s3.S3Storage",
"OPTIONS": {
"bucket_name": "archivebox-blobs",
"region_name": "us-east-1",
"default_acl": "private",
"object_parameters": {
"StorageClass": "INTELLIGENT_TIERING", # Auto-optimize storage costs
},
},
},
}
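The two STORAGES definitions above are alternatives; here is a sketch of how they could be switched on the CAS_BACKEND setting (the environment variable names are illustrative assumptions, and django-storages must be installed for the S3 branch):
# settings.py (sketch): choose the blob backend from CAS_BACKEND
import os

CAS_BACKEND = os.environ.get('CAS_BACKEND', 'local')   # mirrors STORAGE_CONFIG.CAS_BACKEND

if CAS_BACKEND == 's3':
    default_storage_config = {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {
            "bucket_name": os.environ.get("CAS_S3_BUCKET", "archivebox-blobs"),   # assumed var name
            "region_name": os.environ.get("CAS_S3_REGION", "us-east-1"),          # assumed var name
            "default_acl": "private",
        },
    }
else:
    default_storage_config = {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
        "OPTIONS": {"location": "/data/cas", "base_url": "/cas/"},
    }

STORAGES = {
    "default": default_storage_config,
    "staticfiles": {"BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage"},
}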
Blob Manager
# archivebox/storage/ingest.py
import hashlib
import os
from pathlib import Path

from django.core.files.storage import default_storage
from django.core.files.base import ContentFile
from django.db import transaction
from django.db.models import F

from core.models import Blob, SnapshotFile

class BlobManager:
"""Manages content-addressed blob storage with deduplication"""
@staticmethod
def hash_file(file_path: Path, algorithm='sha256') -> str:
"""Calculate content hash of a file"""
hasher = hashlib.new(algorithm)
with open(file_path, 'rb') as f:
for chunk in iter(lambda: f.read(65536), b''):
hasher.update(chunk)
return hasher.hexdigest()
@staticmethod
def ingest_file(
file_path: Path,
snapshot,
extractor: str,
relative_path: str,
mime_type: str = '',
create_views: bool = True,
) -> SnapshotFile:
"""
Ingest a file into blob storage with deduplication.
Args:
file_path: Path to the file to ingest
snapshot: Snapshot this file belongs to
extractor: Extractor name (wget, singlefile, etc.)
relative_path: Relative path within extractor dir
mime_type: MIME type of the file
create_views: Whether to create symlink views
Returns:
SnapshotFile reference
"""
from storage.views import ViewManager
# Calculate hash
file_hash = BlobManager.hash_file(file_path)
file_size = file_path.stat().st_size
with transaction.atomic():
# Check if blob already exists (deduplication!)
blob, created = Blob.objects.get_or_create(
hash_algorithm='sha256',
hash=file_hash,
storage_backend='local',
defaults={
'size': file_size,
'mime_type': mime_type,
}
)
if created:
# New blob - store in CAS
cas_path = ViewManager.get_cas_path(blob)
cas_path.parent.mkdir(parents=True, exist_ok=True)
# Use hardlink if possible (instant), copy if not
try:
os.link(file_path, cas_path)
except OSError:
import shutil
shutil.copy2(file_path, cas_path)
print(f"✓ Stored new blob: {file_hash[:16]}... ({file_size:,} bytes)")
else:
print(f"✓ Deduplicated: {file_hash[:16]}... (saved {file_size:,} bytes)")
            # Create snapshot file reference first, so the blob's reference
            # count only increments when a new reference is actually recorded
            snapshot_file, sf_created = SnapshotFile.objects.get_or_create(
                snapshot=snapshot,
                extractor=extractor,
                relative_path=relative_path,
                defaults={'blob': blob}
            )
            if sf_created:
                # Atomic increment of the reference count
                Blob.objects.filter(pk=blob.pk).update(ref_count=F('ref_count') + 1)
                blob.refresh_from_db(fields=['ref_count'])
# Create symlink views (signal will also do this, but we can force it here)
if create_views:
views = ViewManager.create_symlinks(snapshot_file)
print(f" Created {len(views)} view symlinks")
return snapshot_file
@staticmethod
def ingest_directory(
dir_path: Path,
snapshot,
extractor: str
) -> list[SnapshotFile]:
"""Ingest all files from a directory"""
import mimetypes
snapshot_files = []
for file_path in dir_path.rglob('*'):
if file_path.is_file():
relative_path = str(file_path.relative_to(dir_path))
mime_type, _ = mimetypes.guess_type(str(file_path))
snapshot_file = BlobManager.ingest_file(
file_path,
snapshot,
extractor,
relative_path,
mime_type or ''
)
snapshot_files.append(snapshot_file)
return snapshot_files
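A quick sketch of the deduplication effect this enables (snapshot_a, snapshot_b, and the file paths are placeholders): ingesting identical content for two snapshots yields one Blob and two SnapshotFile rows.
# Same jquery.min.js ingested for two different snapshots
sf_a = BlobManager.ingest_file(Path('/tmp/a/jquery.min.js'), snapshot_a, 'wget', 'static/jquery.min.js')
sf_b = BlobManager.ingest_file(Path('/tmp/b/jquery.min.js'), snapshot_b, 'wget', 'static/jquery.min.js')

assert sf_a.blob == sf_b.blob          # identical hash -> one blob in CAS
sf_a.blob.refresh_from_db()
assert sf_a.blob.ref_count == 2        # one reference per SnapshotFile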
Symlink Farm Views
View Classes
# archivebox/storage/views.py
from pathlib import Path
from typing import Protocol
from urllib.parse import urlparse
import os
import logging

from core.models import Blob, SnapshotFile

logger = logging.getLogger(__name__)
class SnapshotView(Protocol):
"""Protocol for generating browseable views of snapshots"""
def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
"""Get the human-readable path for this file in this view"""
...
class ByDomainView:
"""View: /archive/by_domain/{domain}/{YYYYMMDD}/{snapshot_id}/{extractor}/{filename}"""
def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
snapshot = snapshot_file.snapshot
domain = urlparse(snapshot.url).netloc or 'unknown'
date = snapshot.created_at.strftime('%Y%m%d')
return (
CONSTANTS.ARCHIVE_DIR / 'by_domain' / domain / date /
str(snapshot.id) / snapshot_file.extractor / snapshot_file.relative_path
)
class ByDateView:
"""View: /archive/by_date/{YYYYMMDD}/{domain}/{snapshot_id}/{extractor}/{filename}"""
def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
snapshot = snapshot_file.snapshot
domain = urlparse(snapshot.url).netloc or 'unknown'
date = snapshot.created_at.strftime('%Y%m%d')
return (
CONSTANTS.ARCHIVE_DIR / 'by_date' / date / domain /
str(snapshot.id) / snapshot_file.extractor / snapshot_file.relative_path
)
class ByUserView:
"""View: /archive/by_user/{username}/{YYYYMMDD}/{domain}/{snapshot_id}/{extractor}/{filename}"""
def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
snapshot = snapshot_file.snapshot
user = snapshot.created_by.username
domain = urlparse(snapshot.url).netloc or 'unknown'
date = snapshot.created_at.strftime('%Y%m%d')
return (
CONSTANTS.ARCHIVE_DIR / 'by_user' / user / date / domain /
str(snapshot.id) / snapshot_file.extractor / snapshot_file.relative_path
)
class LegacyTimestampView:
"""View: /archive/by_timestamp/{timestamp}/{extractor}/{filename}"""
def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
snapshot = snapshot_file.snapshot
return (
CONSTANTS.ARCHIVE_DIR / 'by_timestamp' / snapshot.timestamp /
snapshot_file.extractor / snapshot_file.relative_path
)
class ViewManager:
"""Manages symlink farm views"""
VIEWS = {
'by_domain': ByDomainView(),
'by_date': ByDateView(),
'by_user': ByUserView(),
'by_timestamp': LegacyTimestampView(),
}
@staticmethod
def get_cas_path(blob: Blob) -> Path:
"""Get the CAS storage path for a blob"""
h = blob.hash
return (
CONSTANTS.DATA_DIR / 'cas' / blob.hash_algorithm /
h[:2] / h[2:4] / h
)
@staticmethod
def create_symlinks(snapshot_file: SnapshotFile, views: list[str] = None) -> dict[str, Path]:
"""
Create symlinks for all views of a file.
If any operation fails, all are rolled back.
"""
from config.common import STORAGE_CONFIG
if views is None:
views = STORAGE_CONFIG.ENABLED_VIEWS
cas_path = ViewManager.get_cas_path(snapshot_file.blob)
# Verify CAS file exists before creating symlinks
if not cas_path.exists():
raise FileNotFoundError(f"CAS file missing: {cas_path}")
created = {}
cleanup_on_error = []
try:
for view_name in views:
if view_name not in ViewManager.VIEWS:
continue
view = ViewManager.VIEWS[view_name]
view_path = view.get_view_path(snapshot_file)
# Create parent directory
view_path.parent.mkdir(parents=True, exist_ok=True)
# Create relative symlink (more portable)
rel_target = os.path.relpath(cas_path, view_path.parent)
# Remove existing symlink/file if present
if view_path.exists() or view_path.is_symlink():
view_path.unlink()
# Create symlink
view_path.symlink_to(rel_target)
created[view_name] = view_path
cleanup_on_error.append(view_path)
return created
except Exception as e:
# Rollback: Remove partially created symlinks
for path in cleanup_on_error:
try:
if path.exists() or path.is_symlink():
path.unlink()
except Exception as cleanup_error:
logger.error(f"Failed to cleanup {path}: {cleanup_error}")
raise Exception(f"Failed to create symlinks: {e}")
@staticmethod
def create_symlinks_idempotent(snapshot_file: SnapshotFile, views: list[str] = None):
"""
Idempotent version - safe to call multiple times.
Returns dict of created symlinks, or empty dict if already correct.
"""
from config.common import STORAGE_CONFIG
if views is None:
views = STORAGE_CONFIG.ENABLED_VIEWS
cas_path = ViewManager.get_cas_path(snapshot_file.blob)
needs_update = False
# Check if all symlinks exist and point to correct target
for view_name in views:
if view_name not in ViewManager.VIEWS:
continue
view = ViewManager.VIEWS[view_name]
view_path = view.get_view_path(snapshot_file)
if not view_path.is_symlink():
needs_update = True
break
# Check if symlink points to correct target
try:
current_target = view_path.resolve()
if current_target != cas_path:
needs_update = True
break
except Exception:
needs_update = True
break
if needs_update:
return ViewManager.create_symlinks(snapshot_file, views)
return {} # Already correct
@staticmethod
def cleanup_symlinks(snapshot_file: SnapshotFile):
"""Remove all symlinks for a file"""
from config.common import STORAGE_CONFIG
for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
if view_name not in ViewManager.VIEWS:
continue
view = ViewManager.VIEWS[view_name]
view_path = view.get_view_path(snapshot_file)
if view_path.exists() or view_path.is_symlink():
view_path.unlink()
logger.info(f"Removed symlink: {view_path}")
Automatic Synchronization
Django Signals for Sync
# archivebox/storage/signals.py
from django.db.models.signals import post_save, post_delete, pre_delete
from django.dispatch import receiver
from django.db import transaction
from core.models import SnapshotFile, Blob
import logging
logger = logging.getLogger(__name__)
@receiver(post_save, sender=SnapshotFile)
def sync_symlinks_on_save(sender, instance, created, **kwargs):
"""
Automatically create/update symlinks when SnapshotFile is saved.
Runs AFTER transaction commit to ensure DB consistency.
"""
from config.common import STORAGE_CONFIG
if not STORAGE_CONFIG.AUTO_SYNC_SYMLINKS:
return
    if created:
        # New file - create all symlinks once the enclosing transaction commits
        def create_views():
            try:
                from storage.views import ViewManager
                views = ViewManager.create_symlinks(instance)
                logger.info(f"Created {len(views)} symlinks for {instance.relative_path}")
            except Exception as e:
                logger.error(f"Failed to create symlinks for {instance.id}: {e}")
                # Non-fatal: verify_storage --fix can repair drift later
        transaction.on_commit(create_views)
@receiver(pre_delete, sender=SnapshotFile)
def sync_symlinks_on_delete(sender, instance, **kwargs):
"""
Remove symlinks when SnapshotFile is deleted.
Runs BEFORE deletion so we still have the data.
"""
try:
from storage.views import ViewManager
ViewManager.cleanup_symlinks(instance)
logger.info(f"Removed symlinks for {instance.relative_path}")
except Exception as e:
logger.error(f"Failed to remove symlinks for {instance.id}: {e}")
@receiver(post_delete, sender=SnapshotFile)
def cleanup_unreferenced_blob(sender, instance, **kwargs):
"""
Decrement blob reference count and cleanup if no longer referenced.
"""
try:
blob = instance.blob
# Atomic decrement
from django.db.models import F
Blob.objects.filter(pk=blob.pk).update(ref_count=F('ref_count') - 1)
# Reload to get updated count
blob.refresh_from_db()
# Garbage collect if no more references
if blob.ref_count <= 0:
from storage.views import ViewManager
cas_path = ViewManager.get_cas_path(blob)
if cas_path.exists():
cas_path.unlink()
logger.info(f"Garbage collected blob {blob.hash[:16]}...")
blob.delete()
except Exception as e:
logger.error(f"Failed to cleanup blob: {e}")
App Configuration
# archivebox/storage/apps.py
from django.apps import AppConfig
class StorageConfig(AppConfig):
default_auto_field = 'django.db.models.BigAutoField'
name = 'storage'
def ready(self):
import storage.signals # Register signal handlers
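For the ready() hook to run, the app has to be registered; a minimal sketch of the settings change this assumes:
# settings.py (sketch)
INSTALLED_APPS = [
    # ... existing ArchiveBox apps ...
    'storage',   # registers the symlink-sync signal handlers via StorageConfig.ready()
]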
Migration Strategy
Migration Command
# archivebox/core/management/commands/migrate_to_cas.py
from django.core.management.base import BaseCommand
from core.models import Snapshot, Blob
from storage.ingest import BlobManager
from storage.views import ViewManager
from pathlib import Path
import shutil
class Command(BaseCommand):
help = 'Migrate existing archives to content-addressable storage'
def add_arguments(self, parser):
parser.add_argument('--dry-run', action='store_true', help='Show what would be done')
parser.add_argument('--views', nargs='+', default=['by_timestamp', 'by_domain', 'by_date'])
parser.add_argument('--cleanup-legacy', action='store_true', help='Delete old files after migration')
parser.add_argument('--batch-size', type=int, default=100)
def handle(self, *args, **options):
dry_run = options['dry_run']
views = options['views']
cleanup = options['cleanup_legacy']
batch_size = options['batch_size']
snapshots = Snapshot.objects.all().order_by('created_at')
total = snapshots.count()
if dry_run:
self.stdout.write(self.style.WARNING('DRY RUN - No changes will be made'))
self.stdout.write(f"Found {total} snapshots to migrate")
total_files = 0
total_saved = 0
total_bytes = 0
error_count = 0
for i, snapshot in enumerate(snapshots, 1):
self.stdout.write(f"\n[{i}/{total}] Processing {snapshot.url[:60]}...")
legacy_dir = CONSTANTS.ARCHIVE_DIR / snapshot.timestamp
if not legacy_dir.exists():
self.stdout.write(f" Skipping (no legacy dir)")
continue
# Process each extractor directory
for extractor_dir in legacy_dir.iterdir():
if not extractor_dir.is_dir():
continue
extractor = extractor_dir.name
self.stdout.write(f" Processing extractor: {extractor}")
if dry_run:
file_count = sum(1 for _ in extractor_dir.rglob('*') if _.is_file())
self.stdout.write(f" Would ingest {file_count} files")
continue
# Track blobs before ingestion
blobs_before = Blob.objects.count()
try:
# Ingest all files from this extractor
ingested = BlobManager.ingest_directory(
extractor_dir,
snapshot,
extractor
)
total_files += len(ingested)
# Calculate deduplication savings
blobs_after = Blob.objects.count()
new_blobs = blobs_after - blobs_before
dedup_count = len(ingested) - new_blobs
if dedup_count > 0:
dedup_bytes = sum(f.blob.size for f in ingested[-dedup_count:])
total_saved += dedup_bytes
self.stdout.write(
f" ✓ Ingested {len(ingested)} files "
f"({new_blobs} new, {dedup_count} deduplicated, "
f"saved {dedup_bytes / 1024 / 1024:.1f} MB)"
)
else:
total_bytes_added = sum(f.blob.size for f in ingested)
total_bytes += total_bytes_added
self.stdout.write(
f" ✓ Ingested {len(ingested)} files "
f"({total_bytes_added / 1024 / 1024:.1f} MB)"
)
except Exception as e:
error_count += 1
self.stdout.write(self.style.ERROR(f" ✗ Error: {e}"))
continue
# Cleanup legacy files
if cleanup and not dry_run:
try:
shutil.rmtree(legacy_dir)
self.stdout.write(f" Cleaned up legacy dir: {legacy_dir}")
except Exception as e:
self.stdout.write(self.style.WARNING(f" Failed to cleanup: {e}"))
# Progress update
if i % 10 == 0:
self.stdout.write(
f"\nProgress: {i}/{total} | "
f"Files: {total_files:,} | "
f"Saved: {total_saved / 1024 / 1024:.1f} MB | "
f"Errors: {error_count}"
)
# Final summary
self.stdout.write("\n" + "="*80)
self.stdout.write(self.style.SUCCESS("Migration Complete!"))
self.stdout.write(f" Snapshots processed: {total}")
self.stdout.write(f" Files ingested: {total_files:,}")
self.stdout.write(f" Space saved by deduplication: {total_saved / 1024 / 1024:.1f} MB")
self.stdout.write(f" Errors: {error_count}")
self.stdout.write(f" Symlink views created: {', '.join(views)}")
Rebuild Views Command
# archivebox/core/management/commands/rebuild_views.py
from django.core.management.base import BaseCommand
from core.models import SnapshotFile
from storage.views import ViewManager
import shutil
class Command(BaseCommand):
help = 'Rebuild symlink farm views from database'
def add_arguments(self, parser):
parser.add_argument(
'--views',
nargs='+',
default=['by_timestamp', 'by_domain', 'by_date'],
help='Which views to rebuild'
)
parser.add_argument(
'--clean',
action='store_true',
help='Remove old symlinks before rebuilding'
)
def handle(self, *args, **options):
views = options['views']
clean = options['clean']
# Clean old views
if clean:
self.stdout.write("Cleaning old views...")
for view_name in views:
view_dir = CONSTANTS.ARCHIVE_DIR / view_name
if view_dir.exists():
shutil.rmtree(view_dir)
self.stdout.write(f" Removed {view_dir}")
# Rebuild all symlinks
total_symlinks = 0
total_files = SnapshotFile.objects.count()
self.stdout.write(f"Rebuilding symlinks for {total_files:,} files...")
for i, snapshot_file in enumerate(
SnapshotFile.objects.select_related('snapshot', 'blob'),
1
):
try:
created = ViewManager.create_symlinks(snapshot_file, views=views)
total_symlinks += len(created)
except Exception as e:
self.stdout.write(self.style.ERROR(
f"Failed to create symlinks for {snapshot_file}: {e}"
))
if i % 1000 == 0:
self.stdout.write(f" Created {total_symlinks:,} symlinks...")
self.stdout.write(
self.style.SUCCESS(
f"\n✓ Rebuilt {total_symlinks:,} symlinks across {len(views)} views"
)
)
Verification and Repair
Storage Verification Command
# archivebox/core/management/commands/verify_storage.py
from django.core.management.base import BaseCommand
from core.models import SnapshotFile, Blob
from storage.views import ViewManager
from config.common import STORAGE_CONFIG
from pathlib import Path
class Command(BaseCommand):
help = 'Verify storage consistency between DB and filesystem'
def add_arguments(self, parser):
parser.add_argument('--fix', action='store_true', help='Fix issues found')
parser.add_argument('--vacuum', action='store_true', help='Remove orphaned symlinks')
def handle(self, *args, **options):
fix = options['fix']
vacuum = options['vacuum']
issues = {
'missing_cas_files': [],
'missing_symlinks': [],
'incorrect_symlinks': [],
'orphaned_symlinks': [],
'orphaned_blobs': [],
}
self.stdout.write("Checking database → filesystem consistency...")
# Check 1: Verify all blobs exist in CAS
self.stdout.write("\n1. Verifying CAS files...")
for blob in Blob.objects.all():
cas_path = ViewManager.get_cas_path(blob)
if not cas_path.exists():
issues['missing_cas_files'].append(blob)
self.stdout.write(self.style.ERROR(
f"✗ Missing CAS file: {cas_path} (blob {blob.hash[:16]}...)"
))
# Check 2: Verify all SnapshotFiles have correct symlinks
self.stdout.write("\n2. Verifying symlinks...")
total_files = SnapshotFile.objects.count()
for i, sf in enumerate(SnapshotFile.objects.select_related('blob'), 1):
if i % 100 == 0:
self.stdout.write(f" Checked {i}/{total_files} files...")
cas_path = ViewManager.get_cas_path(sf.blob)
for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
view = ViewManager.VIEWS[view_name]
view_path = view.get_view_path(sf)
if not view_path.exists() and not view_path.is_symlink():
issues['missing_symlinks'].append((sf, view_name, view_path))
if fix:
try:
ViewManager.create_symlinks_idempotent(sf, [view_name])
self.stdout.write(self.style.SUCCESS(
f"✓ Created missing symlink: {view_path}"
))
except Exception as e:
self.stdout.write(self.style.ERROR(
f"✗ Failed to create symlink: {e}"
))
elif view_path.is_symlink():
# Verify symlink points to correct CAS file
try:
current_target = view_path.resolve()
if current_target != cas_path:
issues['incorrect_symlinks'].append((sf, view_name, view_path))
if fix:
ViewManager.create_symlinks_idempotent(sf, [view_name])
self.stdout.write(self.style.SUCCESS(
f"✓ Fixed incorrect symlink: {view_path}"
))
except Exception as e:
self.stdout.write(self.style.ERROR(
f"✗ Broken symlink: {view_path} - {e}"
))
# Check 3: Find orphaned symlinks
if vacuum:
self.stdout.write("\n3. Checking for orphaned symlinks...")
# Get all valid view paths from DB
valid_paths = set()
for sf in SnapshotFile.objects.all():
for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
view = ViewManager.VIEWS[view_name]
valid_paths.add(view.get_view_path(sf))
# Scan filesystem for symlinks
for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
view_base = CONSTANTS.ARCHIVE_DIR / view_name
if not view_base.exists():
continue
for path in view_base.rglob('*'):
if path.is_symlink() and path not in valid_paths:
issues['orphaned_symlinks'].append(path)
if fix:
path.unlink()
self.stdout.write(self.style.SUCCESS(
f"✓ Removed orphaned symlink: {path}"
))
# Check 4: Find orphaned blobs
self.stdout.write("\n4. Checking for orphaned blobs...")
orphaned_blobs = Blob.objects.filter(ref_count=0)
for blob in orphaned_blobs:
issues['orphaned_blobs'].append(blob)
if fix:
cas_path = ViewManager.get_cas_path(blob)
if cas_path.exists():
cas_path.unlink()
blob.delete()
self.stdout.write(self.style.SUCCESS(
f"✓ Removed orphaned blob: {blob.hash[:16]}..."
))
# Summary
self.stdout.write("\n" + "="*80)
self.stdout.write(self.style.WARNING("Storage Verification Summary:"))
self.stdout.write(f" Missing CAS files: {len(issues['missing_cas_files'])}")
self.stdout.write(f" Missing symlinks: {len(issues['missing_symlinks'])}")
self.stdout.write(f" Incorrect symlinks: {len(issues['incorrect_symlinks'])}")
self.stdout.write(f" Orphaned symlinks: {len(issues['orphaned_symlinks'])}")
self.stdout.write(f" Orphaned blobs: {len(issues['orphaned_blobs'])}")
total_issues = sum(len(v) for v in issues.values())
if total_issues == 0:
self.stdout.write(self.style.SUCCESS("\n✓ Storage is consistent!"))
elif fix:
self.stdout.write(self.style.SUCCESS(f"\n✓ Fixed {total_issues} issues"))
else:
self.stdout.write(self.style.WARNING(
f"\n⚠ Found {total_issues} issues. Run with --fix to repair."
))
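VERIFY_INTERVAL_HOURS (see Configuration below) implies a periodic run of this command; a minimal sketch using Django's call_command, with the actual scheduler (cron, a worker loop, etc.) left as an assumption:
from django.core.management import call_command
from config.common import STORAGE_CONFIG

def run_periodic_verification():
    """Invoke verify_storage --fix --vacuum if periodic verification is enabled."""
    if STORAGE_CONFIG.VERIFY_INTERVAL_HOURS > 0:
        call_command('verify_storage', fix=True, vacuum=True)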
Configuration
# archivebox/config/common.py
class StorageConfig(BaseConfigSet):
toml_section_header: str = "STORAGE_CONFIG"
# Existing fields
TMP_DIR: Path = Field(default=CONSTANTS.DEFAULT_TMP_DIR)
LIB_DIR: Path = Field(default=CONSTANTS.DEFAULT_LIB_DIR)
OUTPUT_PERMISSIONS: str = Field(default="644")
RESTRICT_FILE_NAMES: str = Field(default="windows")
ENFORCE_ATOMIC_WRITES: bool = Field(default=True)
DIR_OUTPUT_PERMISSIONS: str = Field(default="755")
# New CAS fields
USE_CAS: bool = Field(
default=True,
description="Use content-addressable storage with deduplication"
)
ENABLED_VIEWS: list[str] = Field(
default=['by_timestamp', 'by_domain', 'by_date'],
description="Which symlink farm views to maintain"
)
AUTO_SYNC_SYMLINKS: bool = Field(
default=True,
description="Automatically create/update symlinks via signals"
)
VERIFY_ON_STARTUP: bool = Field(
default=False,
description="Verify storage consistency on startup"
)
VERIFY_INTERVAL_HOURS: int = Field(
default=24,
description="Run periodic storage verification (0 to disable)"
)
CLEANUP_TEMP_FILES: bool = Field(
default=True,
description="Remove temporary extractor files after ingestion"
)
CAS_BACKEND: str = Field(
default='local',
choices=['local', 's3', 'azure', 'gcs'],
description="Storage backend for CAS blobs"
)
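As a sketch, these fields would surface in the user-facing TOML config under the declared section header (the exact file location and loader behavior are assumptions):
[STORAGE_CONFIG]
USE_CAS = true
ENABLED_VIEWS = ["by_timestamp", "by_domain", "by_date"]
AUTO_SYNC_SYMLINKS = true
VERIFY_ON_STARTUP = false
VERIFY_INTERVAL_HOURS = 24
CAS_BACKEND = "local"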
Workflow Examples
Example 1: Normal Operation
# Extractor writes files to temporary directory
extractor_dir = Path('/tmp/wget-output')
# After extraction completes, ingest into CAS
from storage.ingest import BlobManager
ingested_files = BlobManager.ingest_directory(
extractor_dir,
snapshot,
'wget'
)
# Behind the scenes:
# 1. Each file hashed (SHA-256)
# 2. Blob created/found in DB (deduplication)
# 3. File stored in CAS (if new)
# 4. SnapshotFile created in DB
# 5. post_save signal fires
# 6. Symlinks automatically created in all enabled views
# ✓ DB and filesystem in perfect sync
Example 2: Browse Archives
# User can browse in multiple ways:
# By domain (great for site collections)
$ ls /data/archive/by_domain/example.com/20241225/
019b54ee-28d9-72dc/
# By date (great for time-based browsing)
$ ls /data/archive/by_date/20241225/
example.com/
github.com/
wikipedia.org/
# By user (great for multi-user setups)
$ ls /data/archive/by_user/squash/20241225/
example.com/
github.com/
# Legacy timestamp (backwards compatibility)
$ ls /data/archive/by_timestamp/1735142400.123/
wget/
singlefile/
screenshot/
Example 3: Crash Recovery
# System crashes after DB save but before symlinks created
# - DB has SnapshotFile record ✓
# - Symlinks missing ✗
# Next verification run:
$ python -m archivebox verify_storage --fix
# Output:
# Checking database → filesystem consistency...
# ✗ Missing symlink: /data/archive/by_domain/example.com/.../index.html
# ✓ Created missing symlink
# ✓ Fixed 1 issues
# Storage is now consistent!
Example 4: Migration from Legacy
# Migrate all existing archives to CAS
$ python -m archivebox migrate_to_cas --dry-run
# Output:
# DRY RUN - No changes will be made
# Found 1000 snapshots to migrate
# [1/1000] Processing https://example.com...
# Would ingest wget: 15 files
# Would ingest singlefile: 1 file
# ...
# Run actual migration
$ python -m archivebox migrate_to_cas
# Output:
# [1/1000] Processing https://example.com...
# ✓ Ingested 15 files (3 new, 12 deduplicated, saved 2.4 MB)
# ...
# Migration Complete!
# Snapshots processed: 1000
# Files ingested: 45,231
# Space saved by deduplication: 12.3 GB
Benefits
Space Savings
- Massive deduplication: Common files (jquery, fonts, images) stored once
- 30-70% typical savings across archives
- Symlink overhead: ~0.1% of saved space (negligible)
Flexibility
- Multiple views: Browse by domain, date, user, timestamp
- Add views anytime: Run rebuild_views to add a new organization scheme
- No data migration needed: Just rebuild symlinks
S3 Support
- Use django-storages: Drop-in S3, Azure, GCS support
- Hybrid mode: Hot data local, cold data in S3
- Cost optimization: S3 Intelligent Tiering for automatic cost reduction
Data Integrity
- Database as truth: Symlinks are disposable, can be rebuilt
- Automatic sync: Signals keep symlinks current
- Self-healing: Verification detects and fixes drift
- Atomic operations: Transaction-safe
Backwards Compatibility
- Legacy view: by_timestamp maintains the old structure
- Gradual migration: Old and new archives coexist
- Zero downtime: Archives keep working during migration
Developer Experience
- Human-browseable: Easy to inspect and debug
- Standard tools work: cp, rsync, tar, zip all work normally
- Multiple organization schemes: Find archives multiple ways
- Easy backups: Symlinks handled correctly by modern tools
Implementation Checklist
- Create database models (Blob, SnapshotFile)
- Create migrations for new models
- Implement BlobManager (ingest.py)
- Implement ViewManager (views.py)
- Implement Django signals (signals.py)
- Create migrate_to_cas command
- Create rebuild_views command
- Create verify_storage command
- Update Snapshot.output_dir property
- Update ArchiveResult to use SnapshotFile
- Add StorageConfig settings
- Configure django-storages
- Test with local filesystem
- Test with S3
- Document for users
- Update backup procedures
Future Enhancements
- Web UI for browsing CAS blobs
- API endpoints for file access
- Content-aware compression (compress similar files together)
- IPFS backend support
- Automatic tiering (hot → warm → cold → glacier)
- Deduplication statistics dashboard
- Export to WARC with CAS metadata