Files
ArchiveBox/old/TODO_rename_extractor_to_plugin.md
Nick Sweeting d2e65cfd38 move todos
2025-12-28 04:44:38 -08:00

21 KiB

TODO: Rename Extractor to Plugin - Implementation Progress

Status: 🟡 In Progress (2/13 phases complete) Started: 2025-12-28 Estimated Files to Update: ~150+ files


Progress Overview

Completed Phases (2/13)

  • Phase 1: Database Migration - Created migration 0033
  • Phase 2: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models

🟡 In Progress (1/13)

  • Phase 3: Hook Execution System (hooks.py - all function renames)

Pending Phases (10/13)

  • Phase 4: JSONL Import/Export (misc/jsonl.py)
  • Phase 5: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
  • Phase 6: API Endpoints (v1_core.py, v1_cli.py)
  • Phase 7: Admin Interface (admin_archiveresults.py, forms.py)
  • Phase 8: Views and Templates (views.py, templatetags, progress_monitor.html)
  • Phase 9: Worker System (workers/worker.py)
  • Phase 10: State Machine (statemachines.py)
  • Phase 11: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
  • Phase 12: Terminology Standardization (via_extractor→plugin, comments, docstrings)
  • Phase 13: Run migrations and verify all tests pass

What's Been Completed So Far

Phase 1: Database Migration

File Created: archivebox/core/migrations/0033_rename_extractor_add_hook_name.py

Changes:

  • Used migrations.RenameField() to rename extractorplugin
  • Added hook_name field (CharField, max_length=255, indexed, default='')
  • Preserves all existing data, indexes, and constraints

Phase 2: Core Models

File Updated: archivebox/core/models.py

ArchiveResultManager

  • Updated indexable() method to use plugin__in and plugin=method
  • Changed reference from ARCHIVE_METHODS_INDEXING_PRECEDENCE to EXTRACTOR_INDEXING_PRECEDENCE

ArchiveResult Model

Field Changes:

  • Renamed field: extractorplugin
  • Added field: hook_name (stores full filename like on_Snapshot__50_wget.py)
  • Updated comments to reference "plugin" instead of "extractor"

Method Updates:

  • get_extractor_choices()get_plugin_choices()
  • __str__(): Now uses self.plugin
  • save(): Logs plugin instead of extractor
  • get_absolute_url(): Uses self.plugin
  • extractor_module property → plugin_module property
  • output_exists(): Checks self.plugin directory
  • embed_path(): Uses self.plugin for paths
  • create_output_dir(): Creates self.plugin directory
  • output_dir_name: Returns self.plugin
  • run(): All references to extractor → plugin (including extractor_dir → plugin_dir)
  • update_from_output(): All references updated to plugin/plugin_dir
  • _update_snapshot_title(): Parameter renamed to plugin_dir
  • trigger_search_indexing(): Passes plugin=self.plugin
  • output_dir property: Returns plugin directory
  • is_background_hook(): Uses plugin_dir

Snapshot Model

Method Updates:

  • create_pending_archiveresults(): Uses get_enabled_plugins(), filters by plugin=plugin
  • result_icons (calc_icons): Maps by r.plugin, calls get_plugin_name() and get_plugin_icon()
  • _merge_archive_results_from_index(): Maps by (ar.plugin, ar.start_ts), supports both 'extractor' and 'plugin' keys for backwards compat
  • _create_archive_result_if_missing(): Supports both 'extractor' and 'plugin' keys, creates with plugin=plugin
  • write_index_json(): Writes 'plugin': ar.plugin in archive_results
  • canonical_outputs(): Updates find_best_output_in_dir() to use plugin_name, accesses result.plugin, creates keys like {result.plugin}_path
  • latest_outputs(): Uses get_plugins(), filters by plugin=plugin
  • retry_failed_archiveresults(): Updated docstring to reference "plugins" instead of "extractors"

Total Lines Changed in models.py: ~50+ locations


Full Implementation Plan

ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field

Overview

Refactor the ArchiveResult model and standardize terminology across the codebase:

  1. Rename the extractor field to plugin in ArchiveResult model
  2. Add a new hook_name field to store the specific hook filename that executed
  3. Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
  4. Standardize CLI flags from --extract/--extractors to --plugins
  5. Standardize terminology throughout codebase:
    • "parsers" → "parser plugins"
    • "extractors" → "extractor plugins"
    • "parser extractors" → "parser plugins"
    • "archive methods" → "extractor plugins"
    • Document apt/brew/npm/pip as "package manager plugins" in comments

Current State Analysis

ArchiveResult Model (archivebox/core/models.py:1679-1750)

class ArchiveResult(ModelWithOutputDir, ...):
    extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
    # New fields from migration 0029:
    output_str, output_json, output_files, output_size, output_mimetypes
    binary = ForeignKey('machine.Binary', ...)
    # No hook_name field yet

Hook Execution Flow

  1. ArchiveResult.run() discovers hooks for the plugin (e.g., wget/on_Snapshot__50_wget.py)
  2. run_hook() executes each hook script, captures output as HookResult
  3. update_from_output() parses JSONL and updates ArchiveResult fields
  4. Currently NO tracking of which specific hook file executed

Field Usage Across Codebase

extractor field is used in ~100 locations:

  • Model: ArchiveResult.extractor field definition, str, manager queries
  • CLI: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
  • API: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
  • Admin: admin_archiveresults.py (list filter, display)
  • Views: core/views.py (archiveresult_objects dict by extractor)
  • Template Tags: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
  • Hooks: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
  • JSONL: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
  • Worker: workers/worker.py (ArchiveResultWorker filters by extractor)
  • Statemachine: statemachines.py (logs extractor in state transitions)

Implementation Plan

Phase 1: Database Migration (archivebox/core/migrations/) COMPLETE

Create migration 0033_rename_extractor_add_hook_name.py:

  1. Rename field: extractorplugin (preserve index, constraints)
  2. Add field: hook_name = CharField(max_length=255, blank=True, default='', db_index=True)
    • Stores full hook filename: on_Snapshot__50_wget.py, on_Crawl__10_chrome_session.js, etc.
    • Empty string for existing records (data migration sets all to '')
  3. Update any indexes or constraints that reference extractor

Decision: Full filename chosen for explicitness and easy grep-ability

Critical Files to Update:

  • ArchiveResult model field definitions
  • Migration dependencies (latest: 0032)

Phase 2: Core Model Updates (archivebox/core/models.py) COMPLETE

ArchiveResult Model (lines 1679-1820):

  • Rename field: extractorplugin
  • Add field: hook_name = models.CharField(...)
  • Update str: f'...-> {self.plugin}'
  • Update absolute_url: Use plugin instead of extractor
  • Update embed_path: Use plugin directory name

ArchiveResultManager (lines 1669-1677):

  • Update indexable(): filter(plugin__in=INDEXABLE_METHODS, ...)
  • Update precedence: When(plugin=method, ...)

Snapshot Model (lines 1000-1600):

  • Update canonical_outputs: Access by plugin name
  • Update create_pending_archiveresults: Use plugin parameter
  • All queryset filters: archiveresult_set.filter(plugin=...)

Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS

Function Renames:

  • get_extractors()get_plugins() (lines 479-504)
  • get_parser_extractors()get_parser_plugins() (lines 507-514)
  • get_extractor_name()get_plugin_name() (lines 517-530)
  • is_parser_extractor()is_parser_plugin() (lines 533-536)
  • get_enabled_extractors()get_enabled_plugins() (lines 553-566)
  • get_extractor_template()get_plugin_template() (line 1048)
  • get_extractor_icon()get_plugin_icon() (line 1068)
  • get_all_extractor_icons()get_all_plugin_icons() (line 1092)

Update HookResult TypedDict (lines 63-73):

  • Add field: hook_name: str to store hook filename
  • Add field: plugin: str (if not already present)

Update run_hook() (lines 141-389):

  • Add hook_name parameter: Pass hook filename to be stored in result
  • Update HookResult to include hook_name field
  • Update JSONL record output: Add hook_name key

Update ArchiveResult.run() (lines 1838-1914):

  • When calling run_hook, pass the hook filename
  • Store hook_name in ArchiveResult before/after execution

Update ArchiveResult.update_from_output() (lines 1916-2073):

  • Parse hook_name from JSONL output
  • Store in self.hook_name field
  • If not present in JSONL, infer from directory/filename

Constants to Rename:

  • ARCHIVE_METHODS_INDEXING_PRECEDENCEEXTRACTOR_INDEXING_PRECEDENCE

Comments/Docstrings: Update all function docstrings to use "plugin" terminology


Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)

Update archiveresult_to_jsonl() (lines 173-200):

  • Change key: 'extractor': result.extractor'plugin': result.plugin
  • Add key: 'hook_name': result.hook_name

Update JSONL parsing:

  • Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing
  • Always write 'plugin' key in new exports (never 'extractor')
  • Parse and store hook_name if present (backwards compat: empty if missing)

Decision: Support both keys on import for smooth migration, always export new format


Phase 5: CLI Commands (archivebox/cli/)

archivebox_extract.py (lines 1-230):

  • Rename flag: --plugin stays (already correct!)
  • Update internal references: extractor → plugin
  • Update filter: results.filter(plugin=plugin)
  • Update display: result.plugin

archivebox_add.py:

  • Rename config key: 'EXTRACTORS': plugins'PLUGINS': plugins (if not already)

archivebox_update.py:

  • Standardize to --plugins flag (currently may be --extractors or --extract)

tests/test_oneshot.py:

  • Update flag: --extract=...--plugins=...

Phase 6: API Endpoints (archivebox/api/)

v1_core.py (ArchiveResult API):

  • Update schema field: extractor: strplugin: str
  • Update schema field: Add hook_name: str = ''
  • Update FilterSchema: q=[..., 'plugin', ...]
  • Update extractor filter: plugin: Optional[str] = Field(None, q='plugin__icontains')

v1_cli.py (CLI API):

  • Rename AddCommandSchema field: extract: strplugins: str
  • Rename UpdateCommandSchema field: extractors: strplugins: str
  • Update endpoint mapping: args.pluginsplugins parameter

Phase 7: Admin Interface (archivebox/core/)

admin_archiveresults.py:

  • Update all references: extractor → plugin
  • Update list_filter: 'plugin' instead of 'extractor'
  • Update ordering: order_by('plugin')
  • Update get_plugin_icon: (rename from get_extractor_icon if exists)

admin_snapshots.py:

  • Update any commented TODOs referencing extractor

forms.py:

  • Rename function: get_archive_methods()get_plugin_choices()
  • Update form field: archive_methodsplugins

Phase 8: Views and Templates (archivebox/core/)

views.py:

  • Update dict building: archiveresult_objects[result.plugin] = result
  • Update all extractor references to plugin

templatetags/core_tags.py:

  • Rename template tags (BREAKING CHANGE):
    • extractor_icon()plugin_icon()
    • extractor_thumbnail()plugin_thumbnail()
    • extractor_embed()plugin_embed()
  • Update internal: result.extractorresult.plugin

Update HTML templates (if any directly reference extractor):

  • Search for {{ result.extractor }} and similar
  • Update to {{ result.plugin }}
  • Update template tag calls
  • CRITICAL: Update JavaScript in templates/admin/progress_monitor.html:
    • Lines 491, 505: Change extractor.extractor and a.extractor to use plugin field

Phase 9: Worker System (archivebox/workers/worker.py)

ArchiveResultWorker:

  • Rename parameter: extractorplugin (lines 348, 350)
  • Update filter: qs.filter(plugin=self.plugin)
  • Update subprocess passing: Use plugin parameter

Phase 10: State Machine (archivebox/core/statemachines.py)

ArchiveResultMachine:

  • Update logging: Use self.archiveresult.plugin instead of extractor
  • Update any state metadata that includes extractor field

Phase 11: Tests and Fixtures

Update test files:

  • tests/test_migrations_*.py: Update expected field names in schema definitions
  • tests/test_hooks.py: Update assertions for plugin/hook_name fields
  • archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
  • tests/test_recursive_crawl.py: Update SQL query WHERE extractor = '60_parse_html_urls' (line 163)
  • archivebox/cli/tests_piping.py: Update test function names and assertions
  • Any fixtures that create ArchiveResults: Use plugin parameter
  • Any mock objects that set .extractor attribute: Change to .plugin

Phase 12: Terminology Standardization (NEW)

This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.

via_extractor → plugin Rename (14 files):

  • Rename metadata field via_extractor to just plugin
  • Files affected:
    • archivebox/hooks.py - Set plugin in run_hook() output
    • archivebox/crawls/models.py - If via_extractor field exists
    • archivebox/cli/archivebox_crawl.py - References to via_extractor
    • All parser plugins that set via_extractor in output
    • Test files with via_extractor assertions
  • Update all JSONL output from parser plugins to use "plugin" key

Logging Functions (archivebox/misc/logging_util.py):

  • log_archive_method_started()log_extractor_started() (line 326)
  • log_archive_method_finished()log_extractor_finished() (line 330)

Form Functions (archivebox/core/forms.py):

  • get_archive_methods()get_plugin_choices() (line 15)
  • Form field archive_methodsplugins (line 24, 29)
  • Update form validation and view usage

Comments and Docstrings (81 files with "extractor" references):

  • Update comments to say "extractor plugin" instead of just "extractor"
  • Update comments to say "parser plugin" instead of "parser extractor"
  • All plugin files: Update docstrings to use "extractor plugin" terminology

Package Manager Plugin Documentation:

  • Update comments in package manager hook files to say "package manager plugin":
    • archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
    • archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
    • archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
    • archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
    • archivebox/plugins/env/on_Binary__install_using_env_provider.py
    • archivebox/plugins/custom/on_Binary__install_using_custom_bash.py

String Literals in Error Messages:

  • Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
  • Search for error messages containing "parser" and update to "parser plugin" where appropriate

Critical Files Summary

Must Update (Core):

  1. archivebox/core/models.py - ArchiveResult, ArchiveResultManager, Snapshot
  2. archivebox/core/migrations/0033_*.py - New migration
  3. archivebox/hooks.py - All hook execution and discovery functions
  4. archivebox/misc/jsonl.py - Serialization/deserialization

Must Update (CLI):

  1. archivebox/cli/archivebox_extract.py
  2. archivebox/cli/archivebox_add.py
  3. archivebox/cli/archivebox_update.py

Must Update (API):

  1. archivebox/api/v1_core.py
  2. archivebox/api/v1_cli.py

Must Update (Admin/Views):

  1. archivebox/core/admin_archiveresults.py
  2. archivebox/core/views.py
  3. archivebox/core/templatetags/core_tags.py

Must Update (Workers/State):

  1. archivebox/workers/worker.py
  2. archivebox/core/statemachines.py

Must Update (Tests):

  1. tests/test_oneshot.py
  2. archivebox/tests/test_hooks.py
  3. archivebox/tests/test_migrations_helpers.py - Schema SQL definitions
  4. tests/test_recursive_crawl.py - SQL queries with field names
  5. archivebox/cli/tests_piping.py - Test function docstrings

Must Update (Terminology - Phase 12):

  1. archivebox/misc/logging_util.py - Rename logging functions
  2. archivebox/core/forms.py - Rename form helper and field
  3. archivebox/templates/admin/progress_monitor.html - JavaScript field refs
  4. All 81 plugin files - Update docstrings and comments
  5. 28 files with parser terminology - Update comments consistently

Migration Strategy

Data Migration for Existing Records:

def forwards(apps, schema_editor):
    ArchiveResult = apps.get_model('core', 'ArchiveResult')
    # All existing records get empty hook_name
    ArchiveResult.objects.all().update(hook_name='')

Backwards Compatibility:

BREAKING CHANGES (per user requirements - no backwards compat):

  • CLI flags: Hard cutover to --plugins (no aliases)
  • API fields: extractor removed, plugin required
  • Template tags: All renamed to plugin_*

PARTIAL COMPAT (for migration):

  • JSONL: Write 'plugin', but accept both 'extractor' and 'plugin' on import

Testing Checklist

  • Migration 0033 runs successfully on test database
  • All migrations tests pass (test_migrations_*.py)
  • All hook tests pass (test_hooks.py)
  • CLI commands work with --plugins flag
  • API endpoints return plugin/hook_name fields correctly
  • Admin interface displays plugin correctly
  • Admin progress monitor JavaScript works (no console errors)
  • JSONL export includes both plugin and hook_name
  • JSONL import accepts both 'extractor' and 'plugin' keys
  • Hook execution populates hook_name field
  • Worker filtering by plugin works
  • Template tags render with new names (plugin_icon, etc.)
  • All renamed functions work correctly
  • SQL queries in tests use correct field names
  • Terminology is consistent across codebase

Critical Issues to Address

1. via_extractor Field (DECISION: RENAME)

  • Currently used in 14 files for tracking which parser plugin discovered a URL
  • Decision: Rename via_extractorplugin (not via_plugin, just "plugin")
  • Impact: Crawler and parser plugin code - 14 files to update
  • Files affected:
    • archivebox/hooks.py
    • archivebox/crawls/models.py
    • archivebox/cli/archivebox_crawl.py
    • All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
    • Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
  • This creates consistent naming where "plugin" is used for both:
    • ArchiveResult.plugin (which extractor plugin ran)
    • URL discovery metadata "plugin" (which parser plugin discovered this URL)

2. Field Size Constraint

  • Current: extractor = CharField(max_length=32)
  • Decision: Keep max_length=32 when renaming to plugin
  • No size increase needed

3. Migration Implementation

  • Use migrations.RenameField('ArchiveResult', 'extractor', 'plugin') for clean migration
  • Preserves data, indexes, and constraints automatically
  • Add hook_name field in same migration

Rollout Notes

Breaking Changes:

  1. CLI: --extract, --extractors--plugins (no aliases)
  2. API: extractor field → plugin field (no backwards compat)
  3. Template tags: extractor_*plugin_* (users must update custom templates)
  4. Python API: All function names with "extractor" → "plugin" (import changes needed)
  5. Form fields: archive_methodsplugins
  6. via_extractor → plugin (URL discovery metadata field)

Migration Required: Yes - all instances must run migrations before upgrading

Estimated Impact: ~150+ files will need updates across the entire codebase

  • 81 files: extractor terminology
  • 28 files: parser terminology
  • 10 files: archive_method legacy terminology
  • Plus templates, JavaScript, tests, etc.

Next Steps

  1. Continue with Phase 3: Update hooks.py with all function renames and hook_name tracking
  2. Then Phase 4: Update JSONL import/export with backwards compatibility
  3. Then Phases 5-12: Systematically update all remaining files
  4. Finally Phase 13: Run full test suite and verify everything works

Note: Migration can be tested immediately - the migration file is ready to run!