diff --git a/.claude/settings.local.json b/.claude/settings.local.json
index ac196f40..70293cbd 100644
--- a/.claude/settings.local.json
+++ b/.claude/settings.local.json
@@ -6,7 +6,10 @@
       "Bash(xargs:*)",
       "Bash(python -c:*)",
       "Bash(printf:*)",
-      "Bash(pkill:*)"
+      "Bash(pkill:*)",
+      "Bash(python3:*)",
+      "Bash(sqlite3:*)",
+      "WebFetch(domain:github.com)"
     ]
   }
 }
diff --git a/PLUGIN_ENHANCEMENTS.md b/PLUGIN_ENHANCEMENTS.md
deleted file mode 100644
index ade53064..00000000
--- a/PLUGIN_ENHANCEMENTS.md
+++ /dev/null
@@ -1,300 +0,0 @@
-# JS Implementation Features to Port to Python ArchiveBox
-
-## Priority: High Impact Features
-
-### 1. **Screen Recording** ⭐⭐⭐
-**JS Implementation:** Captures MP4 video + animated GIF of the archiving session
-```javascript
-// Records browser activity including scrolling, interactions
-PuppeteerScreenRecorder → screenrecording.mp4
-ffmpeg conversion → screenrecording.gif (first 10s, optimized)
-```
-
-**Enhancement for Python:**
-- Add `on_Snapshot__24_screenrecording.py`
-- Use puppeteer or playwright screen recording APIs
-- Generate both full MP4 and thumbnail GIF
-- **Value:** Visual proof of what was captured, useful for QA and debugging
-
-### 2. **AI Quality Assurance** ⭐⭐⭐
-**JS Implementation:** Uses GPT-4o to analyze screenshots and validate archive quality
-```javascript
-// ai_qa.py analyzes screenshot.png and returns:
-{
-  "pct_visible": 85,
-  "warnings": ["Some content may be cut off"],
-  "main_content_title": "Article Title",
-  "main_content_author": "Author Name",
-  "main_content_date": "2024-01-15",
-  "website_brand_name": "Example.com"
-}
-```
-
-**Enhancement for Python:**
-- Add `on_Snapshot__95_aiqa.py` (runs after screenshot)
-- Integrate with OpenAI API or local vision models
-- Validates: content visibility, broken layouts, CAPTCHA blocks, error pages
-- **Value:** Automatic detection of failed archives, quality scoring
-
-### 3. **Network Response Archiving** ⭐⭐⭐
-**JS Implementation:** Saves ALL network responses in organized structure
-```
-responses/
-├── all/                           # Timestamped unique files
-│   ├── 20240101120000__GET__https%3A%2F%2Fexample.com%2Fapi.json
-│   └── ...
-├── script/                        # Organized by resource type
-│   └── example.com/path/to/script.js → ../all/...
-├── stylesheet/
-├── image/
-├── media/
-└── index.jsonl                    # Searchable index
-```
-
-**Enhancement for Python:**
-- Add `on_Snapshot__23_responses.py`
-- Save all HTTP responses (XHR, images, scripts, etc.)
-- Create both timestamped and URL-organized views via symlinks
-- Generate `index.jsonl` with metadata (URL, method, status, mimeType, sha256)
-- **Value:** Complete HTTP-level archive, better debugging, API response preservation
-
-### 4. **Detailed Metadata Extractors** ⭐⭐
-
-#### 4a. SSL/TLS Details (`on_Snapshot__16_ssl.py`)
-```python
-{
-  "protocol": "TLS 1.3",
-  "cipher": "AES_128_GCM",
-  "securityState": "secure",
-  "securityDetails": {
-    "issuer": "Let's Encrypt",
-    "validFrom": ...,
-    "validTo": ...
-  }
-}
-```
-
-#### 4b. SEO Metadata (`on_Snapshot__17_seo.py`)
-Extracts all `<meta>` tags:
-```python
-{
-  "og:title": "Page Title",
-  "og:image": "https://example.com/image.jpg",
-  "twitter:card": "summary_large_image",
-  "description": "Page description",
-  ...
-}
-```
-
-#### 4c. Accessibility Tree (`on_Snapshot__18_accessibility.py`)
-```python
-{
-  "headings": ["# Main Title", "## Section 1", ...],
-  "iframes": ["https://embed.example.com/..."],
-  "tree": { ... }  # Full accessibility snapshot
-}
-```
-
-#### 4d. Outlinks Categorization (`on_Snapshot__19_outlinks.py`)
-Better than current implementation - categorizes by type:
-```python
-{
-  "hrefs": [...],            # All links
-  "images": [...],           #
-  "css_stylesheets": [...],  #
-  "js_scripts": [...],       #