alex/ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-02 17:05:38 +10:00

Nick Sweeting d2e65cfd38 move todos

2025-12-28 04:44:38 -08:00

TODO: Rename Extractor to Plugin - Implementation Progress

Status: 🟡 In Progress (2/13 phases complete)
Started: 2025-12-28
Estimated Files to Update: ~150+ files

Progress Overview

✅ Completed Phases (2/13)

Phase 1: Database Migration - Created migration 0033
Phase 2: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models

🟡 In Progress (1/13)

Phase 3: Hook Execution System (hooks.py - all function renames)

⏳ Pending Phases (10/13)

Phase 4: JSONL Import/Export (misc/jsonl.py)
Phase 5: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
Phase 6: API Endpoints (v1_core.py, v1_cli.py)
Phase 7: Admin Interface (admin_archiveresults.py, forms.py)
Phase 8: Views and Templates (views.py, templatetags, progress_monitor.html)
Phase 9: Worker System (workers/worker.py)
Phase 10: State Machine (statemachines.py)
Phase 11: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
Phase 12: Terminology Standardization (via_extractor→plugin, comments, docstrings)
Phase 13: Run migrations and verify all tests pass

What's Been Completed So Far

Phase 1: Database Migration ✅

File Created: archivebox/core/migrations/0033_rename_extractor_add_hook_name.py

Changes:

Used migrations.RenameField() to rename extractor → plugin
Added hook_name field (CharField, max_length=255, indexed, default='')
Preserves all existing data, indexes, and constraints
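
To make the effect of these two schema operations concrete, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Django's migration framework. The table name core_archiveresult, the index names, and the sample rows are assumptions for illustration only; the real change is the RenameField plus AddField described above. RENAME COLUMN needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-migration schema: only the column under discussion is modeled.
conn.execute(
    "CREATE TABLE core_archiveresult ("
    "id INTEGER PRIMARY KEY, extractor VARCHAR(32) NOT NULL)"
)
conn.execute("CREATE INDEX ar_extractor_idx ON core_archiveresult (extractor)")
conn.executemany(
    "INSERT INTO core_archiveresult (extractor) VALUES (?)",
    [("wget",), ("screenshot",)],
)

# Step 1: rename extractor -> plugin. Existing rows are untouched and the
# index definition is rewritten to point at the new column name.
conn.execute("ALTER TABLE core_archiveresult RENAME COLUMN extractor TO plugin")

# Step 2: add hook_name with an empty-string default, then index it.
conn.execute(
    "ALTER TABLE core_archiveresult "
    "ADD COLUMN hook_name VARCHAR(255) NOT NULL DEFAULT ''"
)
conn.execute("CREATE INDEX ar_hook_name_idx ON core_archiveresult (hook_name)")

columns = [row[1] for row in conn.execute("PRAGMA table_info(core_archiveresult)")]
rows = conn.execute(
    "SELECT plugin, hook_name FROM core_archiveresult ORDER BY id"
).fetchall()
print(columns)
print(rows)
```

Existing rows keep their values under the new column name and pick up an empty hook_name, which is the behaviour the migration relies on.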

Phase 2: Core Models ✅

File Updated: archivebox/core/models.py

ArchiveResultManager

Updated indexable() method to use plugin__in and plugin=method
Changed reference from ARCHIVE_METHODS_INDEXING_PRECEDENCE to EXTRACTOR_INDEXING_PRECEDENCE

ArchiveResult Model

Field Changes:

Renamed field: extractor → plugin
Added field: hook_name (stores full filename like on_Snapshot__50_wget.py)
Updated comments to reference "plugin" instead of "extractor"

Method Updates:

get_extractor_choices() → get_plugin_choices()
__str__(): Now uses self.plugin
save(): Logs plugin instead of extractor
get_absolute_url(): Uses self.plugin
extractor_module property → plugin_module property
output_exists(): Checks self.plugin directory
embed_path(): Uses self.plugin for paths
create_output_dir(): Creates self.plugin directory
output_dir_name: Returns self.plugin
run(): All references to extractor → plugin (including extractor_dir → plugin_dir)
update_from_output(): All references updated to plugin/plugin_dir
_update_snapshot_title(): Parameter renamed to plugin_dir
trigger_search_indexing(): Passes plugin=self.plugin
output_dir property: Returns plugin directory
is_background_hook(): Uses plugin_dir

Snapshot Model

Method Updates:

create_pending_archiveresults(): Uses get_enabled_plugins(), filters by plugin=plugin
result_icons (calc_icons): Maps by r.plugin, calls get_plugin_name() and get_plugin_icon()
_merge_archive_results_from_index(): Maps by (ar.plugin, ar.start_ts), supports both 'extractor' and 'plugin' keys for backwards compat
_create_archive_result_if_missing(): Supports both 'extractor' and 'plugin' keys, creates with plugin=plugin
write_index_json(): Writes 'plugin': ar.plugin in archive_results
canonical_outputs(): Updates find_best_output_in_dir() to use plugin_name, accesses result.plugin, creates keys like {result.plugin}_path
latest_outputs(): Uses get_plugins(), filters by plugin=plugin
retry_failed_archiveresults(): Updated docstring to reference "plugins" instead of "extractors"

Total locations changed in models.py: 50+

Full Implementation Plan

ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field

Overview

Refactor the ArchiveResult model and standardize terminology across the codebase:

Rename the extractor field to plugin in ArchiveResult model
Add a new hook_name field to store the specific hook filename that executed
Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
Standardize CLI flags from --extract/--extractors to --plugins
Standardize terminology throughout codebase:
- "parsers" → "parser plugins"
- "extractors" → "extractor plugins"
- "parser extractors" → "parser plugins"
- "archive methods" → "extractor plugins"
- Document apt/brew/npm/pip as "package manager plugins" in comments

Current State Analysis

ArchiveResult Model (archivebox/core/models.py:1679-1750)

class ArchiveResult(ModelWithOutputDir, ...):
    extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
    # New fields from migration 0029 (definitions elided):
    #   output_str, output_json, output_files, output_size, output_mimetypes
    binary = models.ForeignKey('machine.Binary', ...)
    # No hook_name field yet

Hook Execution Flow

ArchiveResult.run() discovers hooks for the plugin (e.g., wget/on_Snapshot__50_wget.py)
run_hook() executes each hook script, captures output as HookResult
update_from_output() parses JSONL and updates ArchiveResult fields
Currently NO tracking of which specific hook file executed

Field Usage Across Codebase

extractor field is used in ~100 locations:

Model: ArchiveResult.extractor field definition, __str__, manager queries
CLI: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
API: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
Admin: admin_archiveresults.py (list filter, display)
Views: core/views.py (archiveresult_objects dict by extractor)
Template Tags: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
Hooks: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
JSONL: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
Worker: workers/worker.py (ArchiveResultWorker filters by extractor)
Statemachine: statemachines.py (logs extractor in state transitions)

Implementation Plan

Phase 1: Database Migration (archivebox/core/migrations/) ✅ COMPLETE

Create migration 0033_rename_extractor_add_hook_name.py:

Rename field: extractor → plugin (preserve index, constraints)
Add field: hook_name = CharField(max_length=255, blank=True, default='', db_index=True)
- Stores full hook filename: on_Snapshot__50_wget.py, on_Crawl__10_chrome_session.js, etc.
- Empty string for existing records (data migration sets all to '')
Update any indexes or constraints that reference extractor

Decision: Full filename chosen for explicitness and easy grep-ability
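
Storing the full filename keeps every part of the hook identity recoverable later. The sketch below splits a stored hook_name into its parts. The naming pattern (on_<Event>__<priority>_<name>.<ext>, with the priority optional) is inferred only from the example filenames in this document, and parse_hook_filename and HookFilename are hypothetical names, not functions from hooks.py.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from names such as on_Snapshot__50_wget.py
HOOK_FILENAME_RE = re.compile(
    r"^on_(?P<event>[A-Za-z]+)__(?:(?P<priority>\d+)_)?(?P<name>.+?)\.(?P<ext>\w+)$"
)


class HookFilename(NamedTuple):
    event: str               # e.g. "Snapshot" or "Crawl"
    priority: Optional[int]  # numeric prefix, None when the filename has none
    name: str                # e.g. "wget"
    ext: str                 # e.g. "py" or "js"


def parse_hook_filename(filename: str) -> Optional[HookFilename]:
    """Split a hook filename into its parts, or return None if it does not match."""
    match = HOOK_FILENAME_RE.match(filename)
    if match is None:
        return None
    priority = match.group("priority")
    return HookFilename(
        event=match.group("event"),
        priority=int(priority) if priority is not None else None,
        name=match.group("name"),
        ext=match.group("ext"),
    )


print(parse_hook_filename("on_Snapshot__50_wget.py"))
```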

Critical Files to Update:

✅ ArchiveResult model field definitions
✅ Migration dependencies (latest: 0032)

Phase 2: Core Model Updates (archivebox/core/models.py) ✅ COMPLETE

ArchiveResult Model (lines 1679-1820):

✅ Rename field: extractor → plugin
✅ Add field: hook_name = models.CharField(...)
✅ Update __str__: f'...-> {self.plugin}'
✅ Update get_absolute_url: Use plugin instead of extractor
✅ Update embed_path: Use plugin directory name

ArchiveResultManager (lines 1669-1677):

✅ Update indexable(): filter(plugin__in=INDEXABLE_METHODS, ...)
✅ Update precedence: When(plugin=method, ...)

Snapshot Model (lines 1000-1600):

✅ Update canonical_outputs: Access by plugin name
✅ Update create_pending_archiveresults: Use plugin parameter
✅ All queryset filters: archiveresult_set.filter(plugin=...)

Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS

Function Renames:

get_extractors() → get_plugins() (lines 479-504)
get_parser_extractors() → get_parser_plugins() (lines 507-514)
get_extractor_name() → get_plugin_name() (lines 517-530)
is_parser_extractor() → is_parser_plugin() (lines 533-536)
get_enabled_extractors() → get_enabled_plugins() (lines 553-566)
get_extractor_template() → get_plugin_template() (line 1048)
get_extractor_icon() → get_plugin_icon() (line 1068)
get_all_extractor_icons() → get_all_plugin_icons() (line 1092)

Update HookResult TypedDict (lines 63-73):

Add field: hook_name: str to store hook filename
Add field: plugin: str (if not already present)
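
One possible shape for the extended HookResult is sketched below. Only hook_name and plugin come from this plan; returncode and stdout are hypothetical placeholders for whatever the real TypedDict already holds, and make_hook_result is an illustrative helper. It assumes the plugin name is the hook script's parent directory, as in wget/on_Snapshot__50_wget.py.

```python
from pathlib import Path
from typing import TypedDict


class HookResult(TypedDict):
    # Fields named in this plan
    hook_name: str   # full hook filename, e.g. "on_Snapshot__50_wget.py"
    plugin: str      # plugin directory name, e.g. "wget"
    # Hypothetical placeholders; the real TypedDict defines its own fields
    returncode: int
    stdout: str


def make_hook_result(hook_path: str, returncode: int, stdout: str) -> HookResult:
    """Build a result dict, deriving hook_name and plugin from the script path."""
    path = Path(hook_path)
    return HookResult(
        hook_name=path.name,
        plugin=path.parent.name,  # assumption: parent directory == plugin name
        returncode=returncode,
        stdout=stdout,
    )


result = make_hook_result("plugins/wget/on_Snapshot__50_wget.py", 0, "")
print(result["plugin"], result["hook_name"])
```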

Update run_hook() (lines 141-389):

Add hook_name parameter: Pass hook filename to be stored in result
Update HookResult to include hook_name field
Update JSONL record output: Add hook_name key

Update ArchiveResult.run() (lines 1838-1914):

When calling run_hook, pass the hook filename
Store hook_name in ArchiveResult before/after execution

Update ArchiveResult.update_from_output() (lines 1916-2073):

Parse hook_name from JSONL output
Store in self.hook_name field
If not present in JSONL, infer from directory/filename
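
The fallback order described above could look roughly like this. resolve_hook_name is a hypothetical helper, and the sketch assumes the caller still knows the path of the hook script that produced the record.

```python
from pathlib import Path
from typing import Any, Mapping, Optional


def resolve_hook_name(record: Mapping[str, Any], hook_path: Optional[str] = None) -> str:
    """Prefer hook_name from the JSONL record, else infer it from the script path."""
    hook_name = record.get("hook_name") or ""
    if hook_name:
        return str(hook_name)
    if hook_path:
        # Fall back to the filename of the hook script that was executed
        return Path(hook_path).name
    # Nothing to go on: keep the empty-string default used for legacy rows
    return ""


print(resolve_hook_name({}, "wget/on_Snapshot__50_wget.py"))
```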

Constants to Rename:

ARCHIVE_METHODS_INDEXING_PRECEDENCE → EXTRACTOR_INDEXING_PRECEDENCE

Comments/Docstrings: Update all function docstrings to use "plugin" terminology

Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)

Update archiveresult_to_jsonl() (lines 173-200):

Change key: 'extractor': result.extractor → 'plugin': result.plugin
Add key: 'hook_name': result.hook_name

Update JSONL parsing:

Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing
Always write 'plugin' key in new exports (never 'extractor')
Parse and store hook_name if present (backwards compat: empty if missing)

Decision: Support both keys on import for smooth migration, always export new format
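
A minimal sketch of that import/export asymmetry, using plain dicts rather than the real ArchiveResult model. export_record and import_record are hypothetical names, not the functions in misc/jsonl.py, and real records carry more keys than the two shown here.

```python
import json
from typing import Any, Dict


def export_record(plugin: str, hook_name: str = "") -> str:
    """Serialize using the new key names only; 'extractor' is never written."""
    return json.dumps({"plugin": plugin, "hook_name": hook_name})


def import_record(line: str) -> Dict[str, Any]:
    """Parse one JSONL line, accepting the legacy 'extractor' key."""
    record = dict(json.loads(line))
    legacy = record.pop("extractor", None)
    plugin = record.get("plugin") or legacy  # the new key wins if both are present
    if not plugin:
        raise ValueError("record has neither a 'plugin' nor an 'extractor' key")
    record["plugin"] = plugin
    record.setdefault("hook_name", "")  # backwards compat: empty if missing
    return record


print(import_record('{"extractor": "wget"}'))
```

Normalizing at the import boundary means the rest of the code only ever sees the plugin key.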

Phase 5: CLI Commands (archivebox/cli/)

archivebox_extract.py (lines 1-230):

Flag: --plugin stays as-is (already correct)
Update internal references: extractor → plugin
Update filter: results.filter(plugin=plugin)
Update display: result.plugin

archivebox_add.py:

Rename config key: 'EXTRACTORS': plugins → 'PLUGINS': plugins (if not already renamed)

archivebox_update.py:

Standardize to --plugins flag (currently may be --extractors or --extract)

tests/test_oneshot.py:

Update flag: --extract=... → --plugins=...
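
The hard cutover to --plugins can be illustrated with a standalone argparse sketch. This is not ArchiveBox's actual CLI code: the parser, the program name, and the comma-separated value format are assumptions made for the example.

```python
import argparse
from typing import List


def parse_plugins(value: str) -> List[str]:
    """Split a comma-separated plugin list, dropping empty entries."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    # allow_abbrev=False so that no shortened or legacy spelling is accepted
    parser = argparse.ArgumentParser(prog="archivebox-sketch", allow_abbrev=False)
    parser.add_argument(
        "--plugins",
        type=parse_plugins,
        default=[],
        help="comma-separated list of plugins to run",
    )
    return parser


args = build_parser().parse_args(["--plugins=wget,screenshot"])
print(args.plugins)
```

Because no alias is registered, passing the old --extract or --extractors flag makes argparse exit with an "unrecognized arguments" error, which matches the no-aliases decision.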

Phase 6: API Endpoints (archivebox/api/)

v1_core.py (ArchiveResult API):

Update schema field: extractor: str → plugin: str
Update schema field: Add hook_name: str = ''
Update FilterSchema: q=[..., 'plugin', ...]
Update extractor filter: plugin: Optional[str] = Field(None, q='plugin__icontains')

v1_cli.py (CLI API):

Rename AddCommandSchema field: extract: str → plugins: str
Rename UpdateCommandSchema field: extractors: str → plugins: str
Update endpoint mapping: args.plugins → plugins parameter

Phase 7: Admin Interface (archivebox/core/)

admin_archiveresults.py:

Update all references: extractor → plugin
Update list_filter: 'plugin' instead of 'extractor'
Update ordering: order_by('plugin')
Rename get_extractor_icon → get_plugin_icon (if it exists)

admin_snapshots.py:

Update any commented TODOs referencing extractor

forms.py:

Rename function: get_archive_methods() → get_plugin_choices()
Update form field: archive_methods → plugins

Phase 8: Views and Templates (archivebox/core/)

views.py:

Update dict building: archiveresult_objects[result.plugin] = result
Update all extractor references to plugin

templatetags/core_tags.py:

Rename template tags (BREAKING CHANGE):
- extractor_icon() → plugin_icon()
- extractor_thumbnail() → plugin_thumbnail()
- extractor_embed() → plugin_embed()
Update internal: result.extractor → result.plugin

Update HTML templates (if any directly reference extractor):

Search for {{ result.extractor }} and similar
Update to {{ result.plugin }}
Update template tag calls
CRITICAL: Update JavaScript in templates/admin/progress_monitor.html:
- Lines 491, 505: Change extractor.extractor and a.extractor to use plugin field

Phase 9: Worker System (archivebox/workers/worker.py)

ArchiveResultWorker:

Rename parameter: extractor → plugin (lines 348, 350)
Update filter: qs.filter(plugin=self.plugin)
Update subprocess passing: Use plugin parameter

Phase 10: State Machine (archivebox/core/statemachines.py)

ArchiveResultMachine:

Update logging: Use self.archiveresult.plugin instead of extractor
Update any state metadata that includes extractor field

Phase 11: Tests and Fixtures

Update test files:

tests/test_migrations_*.py: Update expected field names in schema definitions
tests/test_hooks.py: Update assertions for plugin/hook_name fields
archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
tests/test_recursive_crawl.py: Update SQL query WHERE extractor = '60_parse_html_urls' (line 163)
archivebox/cli/tests_piping.py: Update test function names and assertions
Any fixtures that create ArchiveResults: Use plugin parameter
Any mock objects that set .extractor attribute: Change to .plugin

Phase 12: Terminology Standardization (NEW)

This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.

via_extractor → plugin Rename (14 files):

Rename metadata field via_extractor to just plugin
Files affected:
- archivebox/hooks.py - Set plugin in run_hook() output
- archivebox/crawls/models.py - If via_extractor field exists
- archivebox/cli/archivebox_crawl.py - References to via_extractor
- All parser plugins that set via_extractor in output
- Test files with via_extractor assertions
Update all JSONL output from parser plugins to use "plugin" key

Logging Functions (archivebox/misc/logging_util.py):

log_archive_method_started() → log_extractor_started() (line 326)
log_archive_method_finished() → log_extractor_finished() (line 330)

Form Functions (archivebox/core/forms.py):

get_archive_methods() → get_plugin_choices() (line 15)
Form field archive_methods → plugins (lines 24, 29)
Update form validation and view usage

Comments and Docstrings (81 files with "extractor" references):

Update comments to say "extractor plugin" instead of just "extractor"
Update comments to say "parser plugin" instead of "parser extractor"
All plugin files: Update docstrings to use "extractor plugin" terminology

Package Manager Plugin Documentation:

Update comments in package manager hook files to say "package manager plugin":
- archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
- archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
- archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
- archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
- archivebox/plugins/env/on_Binary__install_using_env_provider.py
- archivebox/plugins/custom/on_Binary__install_using_custom_bash.py

String Literals in Error Messages:

Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
Search for error messages containing "parser" and update to "parser plugin" where appropriate
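
To track how much legacy terminology remains as Phase 12 proceeds, a small scan script can help. This is a sketch only: the term list, the file suffixes, and the count_legacy_terms name are assumptions, and it was not used to produce the file counts quoted in this document.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, Tuple

# Assumed list of legacy terms to look for (matched case-insensitively)
LEGACY_TERMS: Tuple[str, ...] = ("extractor", "archive_method")


def count_legacy_terms(
    root: Path,
    terms: Iterable[str] = LEGACY_TERMS,
    suffixes: Tuple[str, ...] = (".py", ".html", ".js"),
) -> Dict[str, int]:
    """Return {relative file path: number of lines mentioning a legacy term}."""
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    hits: Dict[str, int] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        matching_lines = sum(1 for line in text.splitlines() if pattern.search(line))
        if matching_lines:
            hits[str(path.relative_to(root))] = matching_lines
    return hits


if __name__ == "__main__":
    for filename, count in count_legacy_terms(Path(".")).items():
        print(f"{count:5d}  {filename}")
```

Lines that use the new "extractor plugin" wording will still match, so the output is a review list rather than a strict pass/fail check.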

Critical Files Summary

Must Update (Core):

✅ archivebox/core/models.py - ArchiveResult, ArchiveResultManager, Snapshot
✅ archivebox/core/migrations/0033_*.py - New migration
⏳ archivebox/hooks.py - All hook execution and discovery functions
⏳ archivebox/misc/jsonl.py - Serialization/deserialization

Must Update (CLI):

⏳ archivebox/cli/archivebox_extract.py
⏳ archivebox/cli/archivebox_add.py
⏳ archivebox/cli/archivebox_update.py

Must Update (API):

⏳ archivebox/api/v1_core.py
⏳ archivebox/api/v1_cli.py

Must Update (Admin/Views):

⏳ archivebox/core/admin_archiveresults.py
⏳ archivebox/core/views.py
⏳ archivebox/core/templatetags/core_tags.py

Must Update (Workers/State):

⏳ archivebox/workers/worker.py
⏳ archivebox/core/statemachines.py

Must Update (Tests):

⏳ tests/test_oneshot.py
⏳ archivebox/tests/test_hooks.py
⏳ archivebox/tests/test_migrations_helpers.py - Schema SQL definitions
⏳ tests/test_recursive_crawl.py - SQL queries with field names
⏳ archivebox/cli/tests_piping.py - Test function docstrings

Must Update (Terminology - Phase 12):

⏳ archivebox/misc/logging_util.py - Rename logging functions
⏳ archivebox/core/forms.py - Rename form helper and field
⏳ archivebox/templates/admin/progress_monitor.html - JavaScript field refs
⏳ All 81 files with "extractor" references - Update docstrings and comments
⏳ 28 files with parser terminology - Update comments consistently

Migration Strategy

Data Migration for Existing Records:

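def forwards(apps, schema_editor):
    # Use the historical model so the migration does not depend on current models.py
    ArchiveResult = apps.get_model('core', 'ArchiveResult')
    # All existing records get an empty hook_name (same value as the field default)
    ArchiveResult.objects.all().update(hook_name='')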

Backwards Compatibility:

BREAKING CHANGES (per user requirements - no backwards compat):

CLI flags: Hard cutover to --plugins (no aliases)
API fields: extractor removed, plugin required
Template tags: All renamed to plugin_*

PARTIAL COMPAT (for migration):

JSONL: Write 'plugin', but accept both 'extractor' and 'plugin' on import

Testing Checklist

Migration 0033 runs successfully on test database
All migration tests pass (test_migrations_*.py)
All hook tests pass (test_hooks.py)
CLI commands work with --plugins flag
API endpoints return plugin/hook_name fields correctly
Admin interface displays plugin correctly
Admin progress monitor JavaScript works (no console errors)
JSONL export includes both plugin and hook_name
JSONL import accepts both 'extractor' and 'plugin' keys
Hook execution populates hook_name field
Worker filtering by plugin works
Template tags render with new names (plugin_icon, etc.)
All renamed functions work correctly
SQL queries in tests use correct field names
Terminology is consistent across codebase

Critical Issues to Address

1. via_extractor Field (DECISION: RENAME)

Currently used in 14 files for tracking which parser plugin discovered a URL
Decision: Rename via_extractor → plugin (not via_plugin, just "plugin")
Impact: Crawler and parser plugin code - 14 files to update
Files affected:
- archivebox/hooks.py
- archivebox/crawls/models.py
- archivebox/cli/archivebox_crawl.py
- All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
- Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
This creates consistent naming where "plugin" is used for both:
- ArchiveResult.plugin (which extractor plugin ran)
- URL discovery metadata "plugin" (which parser plugin discovered this URL)

2. Field Size Constraint

Current: extractor = CharField(max_length=32)
Decision: Keep max_length=32 when renaming to plugin
No size increase needed

3. Migration Implementation

Use migrations.RenameField('ArchiveResult', 'extractor', 'plugin') for clean migration
Preserves data, indexes, and constraints automatically
Add hook_name field in same migration

Rollout Notes

Breaking Changes:

CLI: --extract, --extractors → --plugins (no aliases)
API: extractor field → plugin field (no backwards compat)
Template tags: extractor_* → plugin_* (users must update custom templates)
Python API: All function names with "extractor" → "plugin" (import changes needed)
Form fields: archive_methods → plugins
via_extractor → plugin (URL discovery metadata field)

Migration Required: Yes - all instances must run migrations as part of the upgrade

Estimated Impact: ~150+ files will need updates across the entire codebase

81 files: extractor terminology
28 files: parser terminology
10 files: archive_method legacy terminology
Plus templates, JavaScript, tests, etc.

Next Steps

Continue with Phase 3: Update hooks.py with all function renames and hook_name tracking
Then Phase 4: Update JSONL import/export with backwards compatibility
Then Phases 5-12: Systematically update all remaining files
Finally Phase 13: Run full test suite and verify everything works

Note: Migration can be tested immediately - the migration file is ready to run!