# TODO: Rename Extractor to Plugin - Implementation Progress

**Status**: 🟡 In Progress (2/13 phases complete)
**Started**: 2025-12-28
**Estimated Files to Update**: ~150+ files

---

## Progress Overview

### ✅ Completed Phases (2/13)

- [x] **Phase 1**: Database Migration - Created migration 0033
- [x] **Phase 2**: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models

### 🟡 In Progress (1/13)

- [ ] **Phase 3**: Hook Execution System (hooks.py - all function renames)

### ⏳ Pending Phases (10/13)

- [ ] **Phase 4**: JSONL Import/Export (misc/jsonl.py)
- [ ] **Phase 5**: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
- [ ] **Phase 6**: API Endpoints (v1_core.py, v1_cli.py)
- [ ] **Phase 7**: Admin Interface (admin_archiveresults.py, forms.py)
- [ ] **Phase 8**: Views and Templates (views.py, templatetags, progress_monitor.html)
- [ ] **Phase 9**: Worker System (workers/worker.py)
- [ ] **Phase 10**: State Machine (statemachines.py)
- [ ] **Phase 11**: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
- [ ] **Phase 12**: Terminology Standardization (via_extractor → plugin, comments, docstrings)
- [ ] **Phase 13**: Run migrations and verify all tests pass

---

## What's Been Completed So Far

### Phase 1: Database Migration ✅

**File Created**: `archivebox/core/migrations/0033_rename_extractor_add_hook_name.py`

Changes:

- Used `migrations.RenameField()` to rename `extractor` → `plugin`
- Added `hook_name` field (CharField, max_length=255, indexed, default='')
- Preserves all existing data, indexes, and constraints

### Phase 2: Core Models ✅

**File Updated**: `archivebox/core/models.py`

#### ArchiveResultManager

- Updated `indexable()` method to use `plugin__in` and `plugin=method`
- Changed reference from `ARCHIVE_METHODS_INDEXING_PRECEDENCE` to `EXTRACTOR_INDEXING_PRECEDENCE`

#### ArchiveResult Model

**Field Changes**:

- Renamed field: `extractor` → `plugin`
- Added field: `hook_name` (stores full filename like `on_Snapshot__50_wget.py`)
- Updated comments to reference "plugin" instead of "extractor"

**Method Updates**:

- `get_extractor_choices()` → `get_plugin_choices()`
- `__str__()`: Now uses `self.plugin`
- `save()`: Logs `plugin` instead of `extractor`
- `get_absolute_url()`: Uses `self.plugin`
- `extractor_module` property → `plugin_module` property
- `output_exists()`: Checks `self.plugin` directory
- `embed_path()`: Uses `self.plugin` for paths
- `create_output_dir()`: Creates `self.plugin` directory
- `output_dir_name`: Returns `self.plugin`
- `run()`: All references to extractor → plugin (including extractor_dir → plugin_dir)
- `update_from_output()`: All references updated to plugin/plugin_dir
- `_update_snapshot_title()`: Parameter renamed to `plugin_dir`
- `trigger_search_indexing()`: Passes `plugin=self.plugin`
- `output_dir` property: Returns plugin directory
- `is_background_hook()`: Uses `plugin_dir`

#### Snapshot Model

**Method Updates**:

- `create_pending_archiveresults()`: Uses `get_enabled_plugins()`, filters by `plugin=plugin`
- `result_icons` (calc_icons): Maps by `r.plugin`, calls `get_plugin_name()` and `get_plugin_icon()`
- `_merge_archive_results_from_index()`: Maps by `(ar.plugin, ar.start_ts)`, supports both 'extractor' and 'plugin' keys for backwards compat
- `_create_archive_result_if_missing()`: Supports both 'extractor' and 'plugin' keys, creates with `plugin=plugin`
- `write_index_json()`: Writes `'plugin': ar.plugin` in archive_results
- `canonical_outputs()`: Updates `find_best_output_in_dir()` to use `plugin_name`, accesses `result.plugin`, creates keys like `{result.plugin}_path`
- `latest_outputs()`: Uses `get_plugins()`, filters by `plugin=plugin`
- `retry_failed_archiveresults()`: Updated docstring to reference "plugins" instead of "extractors"

**Total Lines Changed in models.py**: ~50+ locations

---

## Full Implementation Plan

# ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field

## Overview

Refactor the ArchiveResult model and standardize terminology across the codebase:

1. Rename the `extractor` field to `plugin` in ArchiveResult model
2. Add a new `hook_name` field to store the specific hook filename that executed
3. Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
4. Standardize CLI flags from `--extract/--extractors` to `--plugins`
5. **Standardize terminology throughout codebase**:
   - "parsers" → "parser plugins"
   - "extractors" → "extractor plugins"
   - "parser extractors" → "parser plugins"
   - "archive methods" → "extractor plugins"
   - Document apt/brew/npm/pip as "package manager plugins" in comments

## Current State Analysis

### ArchiveResult Model (archivebox/core/models.py:1679-1750)

```python
class ArchiveResult(ModelWithOutputDir, ...):
    extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
    # New fields from migration 0029: output_str, output_json, output_files, output_size, output_mimetypes
    binary = ForeignKey('machine.Binary', ...)
    # No hook_name field yet
```

### Hook Execution Flow

1. `ArchiveResult.run()` discovers hooks for the plugin (e.g., `wget/on_Snapshot__50_wget.py`)
2. `run_hook()` executes each hook script, captures output as HookResult
3. `update_from_output()` parses JSONL and updates ArchiveResult fields
4. Currently NO tracking of which specific hook file executed

### Field Usage Across Codebase

**extractor field** is used in ~100 locations:

- **Model**: ArchiveResult.extractor field definition, `__str__`, manager queries
- **CLI**: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
- **API**: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
- **Admin**: admin_archiveresults.py (list filter, display)
- **Views**: core/views.py (archiveresult_objects dict by extractor)
- **Template Tags**: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
- **Hooks**: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
- **JSONL**: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
- **Worker**: workers/worker.py (ArchiveResultWorker filters by extractor)
- **Statemachine**: statemachines.py (logs extractor in state transitions)

---

## Implementation Plan

### Phase 1: Database Migration (archivebox/core/migrations/) ✅ COMPLETE

**Create migration 0033_rename_extractor_add_hook_name.py**:

1. Rename field: `extractor` → `plugin` (preserve index, constraints)
2. Add field: `hook_name` = CharField(max_length=255, blank=True, default='', db_index=True)
   - **Stores full hook filename**: `on_Snapshot__50_wget.py`, `on_Crawl__10_chrome_session.js`, etc.
   - Empty string for existing records (data migration sets all to '')
3. Update any indexes or constraints that reference extractor

**Decision**: Full filename chosen for explicitness and easy grep-ability

**Critical Files to Update**:

- ✅ ArchiveResult model field definitions
- ✅ Migration dependencies (latest: 0032)

---

### Phase 2: Core Model Updates (archivebox/core/models.py) ✅ COMPLETE

**ArchiveResult Model** (lines 1679-1820):

- ✅ Rename field: `extractor` → `plugin`
- ✅ Add field: `hook_name = models.CharField(...)`
- ✅ Update `__str__`: `f'...-> {self.plugin}'`
- ✅ Update absolute_url: Use plugin instead of extractor
- ✅ Update embed_path: Use plugin directory name

**ArchiveResultManager** (lines 1669-1677):

- ✅ Update indexable(): `filter(plugin__in=INDEXABLE_METHODS, ...)`
- ✅ Update precedence: `When(plugin=method, ...)`

**Snapshot Model** (lines 1000-1600):

- ✅ Update canonical_outputs: Access by plugin name
- ✅ Update create_pending_archiveresults: Use plugin parameter
- ✅ All queryset filters: `archiveresult_set.filter(plugin=...)`

---

### Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS

**Function Renames**:

- [ ] `get_extractors()` → `get_plugins()` (lines 479-504)
- [ ] `get_parser_extractors()` → `get_parser_plugins()` (lines 507-514)
- [ ] `get_extractor_name()` → `get_plugin_name()` (lines 517-530)
- [ ] `is_parser_extractor()` → `is_parser_plugin()` (lines 533-536)
- [ ] `get_enabled_extractors()` → `get_enabled_plugins()` (lines 553-566)
- [ ] `get_extractor_template()` → `get_plugin_template()` (line 1048)
- [ ] `get_extractor_icon()` → `get_plugin_icon()` (line 1068)
- [ ] `get_all_extractor_icons()` → `get_all_plugin_icons()` (line 1092)

**Update HookResult TypedDict** (lines 63-73):

- [ ] Add field: `hook_name: str` to store hook filename
- [ ] Add field: `plugin: str` (if not already present)

**Update run_hook()** (lines 141-389):

- [ ] **Add hook_name parameter**: Pass hook filename to be stored in result
- [ ] Update HookResult to include hook_name field
- [ ] Update JSONL record output: Add `hook_name` key

**Update ArchiveResult.run()** (lines 1838-1914):

- [ ] When calling run_hook, pass the hook filename
- [ ] Store hook_name in ArchiveResult before/after execution

**Update ArchiveResult.update_from_output()** (lines 1916-2073):

- [ ] Parse hook_name from JSONL output
- [ ] Store in self.hook_name field
- [ ] If not present in JSONL, infer from directory/filename

**Constants to Rename**:

- [ ] `ARCHIVE_METHODS_INDEXING_PRECEDENCE` → `EXTRACTOR_INDEXING_PRECEDENCE`

**Comments/Docstrings**: Update all function docstrings to use "plugin" terminology

---

### Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)

**Update archiveresult_to_jsonl()** (lines 173-200):

- [ ] Change key: `'extractor': result.extractor` → `'plugin': result.plugin`
- [ ] Add key: `'hook_name': result.hook_name`

**Update JSONL parsing**:

- [ ] **Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing**
- [ ] Always write 'plugin' key in new exports (never 'extractor')
- [ ] Parse and store hook_name if present (backwards compat: empty if missing)

**Decision**: Support both keys on import for smooth migration, always export new format

---

### Phase 5: CLI Commands (archivebox/cli/)

**archivebox_extract.py** (lines 1-230):

- [ ] Rename flag: `--plugin` stays (already correct!)
- [ ] Update internal references: extractor → plugin
- [ ] Update filter: `results.filter(plugin=plugin)`
- [ ] Update display: `result.plugin`

**archivebox_add.py**:

- [ ] Rename config key: `'EXTRACTORS': plugins` → `'PLUGINS': plugins` (if not already)

**archivebox_update.py**:

- [ ] Standardize to `--plugins` flag (currently may be --extractors or --extract)

**tests/test_oneshot.py**:

- [ ] Update flag: `--extract=...` → `--plugins=...`

---

### Phase 6: API Endpoints (archivebox/api/)

**v1_core.py** (ArchiveResult API):

- [ ] Update schema field: `extractor: str` → `plugin: str`
- [ ] Update schema field: Add `hook_name: str = ''`
- [ ] Update FilterSchema: `q=[..., 'plugin', ...]`
- [ ] Update extractor filter: `plugin: Optional[str] = Field(None, q='plugin__icontains')`

**v1_cli.py** (CLI API):

- [ ] Rename AddCommandSchema field: `extract: str` → `plugins: str`
- [ ] Rename UpdateCommandSchema field: `extractors: str` → `plugins: str`
- [ ] Update endpoint mapping: `args.plugins` → `plugins` parameter

---

### Phase 7: Admin Interface (archivebox/core/)

**admin_archiveresults.py**:

- [ ] Update all references: extractor → plugin
- [ ] Update list_filter: `'plugin'` instead of `'extractor'`
- [ ] Update ordering: `order_by('plugin')`
- [ ] Update get_plugin_icon: (rename from get_extractor_icon if exists)

**admin_snapshots.py**:

- [ ] Update any commented TODOs referencing extractor

**forms.py**:

- [ ] Rename function: `get_archive_methods()` → `get_plugin_choices()`
- [ ] Update form field: `archive_methods` → `plugins`

---

### Phase 8: Views and Templates (archivebox/core/)

**views.py**:

- [ ] Update dict building: `archiveresult_objects[result.plugin] = result`
- [ ] Update all extractor references to plugin

**templatetags/core_tags.py**:

- [ ] **Rename template tags (BREAKING CHANGE)**:
  - `extractor_icon()` → `plugin_icon()`
  - `extractor_thumbnail()` → `plugin_thumbnail()`
  - `extractor_embed()` → `plugin_embed()`
- [ ] Update internal: `result.extractor` → `result.plugin`

**Update HTML templates** (if any directly reference extractor):

- [ ] Search for `{{ result.extractor }}` and similar
- [ ] Update to `{{ result.plugin }}`
- [ ] Update template tag calls
- [ ] **CRITICAL**: Update JavaScript in `templates/admin/progress_monitor.html`:
  - Lines 491, 505: Change `extractor.extractor` and `a.extractor` to use `plugin` field

---

### Phase 9: Worker System (archivebox/workers/worker.py)

**ArchiveResultWorker**:

- [ ] Rename parameter: `extractor` → `plugin` (lines 348, 350)
- [ ] Update filter: `qs.filter(plugin=self.plugin)`
- [ ] Update subprocess passing: Use plugin parameter

---

### Phase 10: State Machine (archivebox/core/statemachines.py)

**ArchiveResultMachine**:

- [ ] Update logging: Use `self.archiveresult.plugin` instead of extractor
- [ ] Update any state metadata that includes extractor field

---

### Phase 11: Tests and Fixtures

**Update test files**:

- [ ] tests/test_migrations_*.py: Update expected field names in schema definitions
- [ ] tests/test_hooks.py: Update assertions for plugin/hook_name fields
- [ ] archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
- [ ] tests/test_recursive_crawl.py: Update SQL query `WHERE extractor = '60_parse_html_urls'` (line 163)
- [ ] archivebox/cli/tests_piping.py: Update test function names and assertions
- [ ] Any fixtures that create ArchiveResults: Use plugin parameter
- [ ] Any mock objects that set `.extractor` attribute: Change to `.plugin`

---

### Phase 12: Terminology Standardization (NEW)

This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.

**via_extractor → plugin Rename (14 files)**:

- [ ] Rename metadata field `via_extractor` to just `plugin`
- [ ] Files affected:
  - archivebox/hooks.py - Set plugin in run_hook() output
  - archivebox/crawls/models.py - If via_extractor field exists
  - archivebox/cli/archivebox_crawl.py - References to via_extractor
  - All parser plugins that set via_extractor in output
  - Test files with via_extractor assertions
- [ ] Update all JSONL output from parser plugins to use "plugin" key

**Logging Functions (archivebox/misc/logging_util.py)**:

- [ ] `log_archive_method_started()` → `log_extractor_started()` (line 326)
- [ ] `log_archive_method_finished()` → `log_extractor_finished()` (line 330)

**Form Functions (archivebox/core/forms.py)**:

- [ ] `get_archive_methods()` → `get_plugin_choices()` (line 15)
- [ ] Form field `archive_methods` → `plugins` (lines 24, 29)
- [ ] Update form validation and view usage

**Comments and Docstrings (81 files with "extractor" references)**:

- [ ] Update comments to say "extractor plugin" instead of just "extractor"
- [ ] Update comments to say "parser plugin" instead of "parser extractor"
- [ ] All plugin files: Update docstrings to use "extractor plugin" terminology

**Package Manager Plugin Documentation**:

- [ ] Update comments in package manager hook files to say "package manager plugin":
  - archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
  - archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
  - archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
  - archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
  - archivebox/plugins/env/on_Binary__install_using_env_provider.py
  - archivebox/plugins/custom/on_Binary__install_using_custom_bash.py

**String Literals in Error Messages**:

- [ ] Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
- [ ] Search for error messages containing "parser" and update to "parser plugin" where appropriate

---

## Critical Files Summary

### Must Update (Core):

1. ✅ `archivebox/core/models.py` - ArchiveResult, ArchiveResultManager, Snapshot
2. ✅ `archivebox/core/migrations/0033_*.py` - New migration
3. ⏳ `archivebox/hooks.py` - All hook execution and discovery functions
4. ⏳ `archivebox/misc/jsonl.py` - Serialization/deserialization

### Must Update (CLI):

5. ⏳ `archivebox/cli/archivebox_extract.py`
6. ⏳ `archivebox/cli/archivebox_add.py`
7. ⏳ `archivebox/cli/archivebox_update.py`

### Must Update (API):

8. ⏳ `archivebox/api/v1_core.py`
9. ⏳ `archivebox/api/v1_cli.py`

### Must Update (Admin/Views):

10. ⏳ `archivebox/core/admin_archiveresults.py`
11. ⏳ `archivebox/core/views.py`
12. ⏳ `archivebox/core/templatetags/core_tags.py`

### Must Update (Workers/State):

13. ⏳ `archivebox/workers/worker.py`
14. ⏳ `archivebox/core/statemachines.py`

### Must Update (Tests):

15. ⏳ `tests/test_oneshot.py`
16. ⏳ `archivebox/tests/test_hooks.py`
17. ⏳ `archivebox/tests/test_migrations_helpers.py` - Schema SQL definitions
18. ⏳ `tests/test_recursive_crawl.py` - SQL queries with field names
19. ⏳ `archivebox/cli/tests_piping.py` - Test function docstrings

### Must Update (Terminology - Phase 12):

20. ⏳ `archivebox/misc/logging_util.py` - Rename logging functions
21. ⏳ `archivebox/core/forms.py` - Rename form helper and field
22. ⏳ `archivebox/templates/admin/progress_monitor.html` - JavaScript field refs
23. ⏳ All 81 plugin files - Update docstrings and comments
24. ⏳ 28 files with parser terminology - Update comments consistently

---

## Migration Strategy

### Data Migration for Existing Records:

```python
def forwards(apps, schema_editor):
    ArchiveResult = apps.get_model('core', 'ArchiveResult')
    # All existing records get empty hook_name
    ArchiveResult.objects.all().update(hook_name='')
```

### Backwards Compatibility:

**BREAKING CHANGES** (per user requirements - no backwards compat):

- CLI flags: Hard cutover to `--plugins` (no aliases)
- API fields: `extractor` removed, `plugin` required
- Template tags: All renamed to `plugin_*`

**PARTIAL COMPAT** (for migration):

- JSONL: Write 'plugin', but **accept both 'extractor' and 'plugin' on import**

---

## Testing Checklist

- [ ] Migration 0033 runs successfully on test database
- [ ] All migration tests pass (test_migrations_*.py)
- [ ] All hook tests pass (test_hooks.py)
- [ ] CLI commands work with --plugins flag
- [ ] API endpoints return plugin/hook_name fields correctly
- [ ] Admin interface displays plugin correctly
- [ ] Admin progress monitor JavaScript works (no console errors)
- [ ] JSONL export includes both plugin and hook_name
- [ ] JSONL import accepts both 'extractor' and 'plugin' keys
- [ ] Hook execution populates hook_name field
- [ ] Worker filtering by plugin works
- [ ] Template tags render with new names (plugin_icon, etc.)
- [ ] All renamed functions work correctly
- [ ] SQL queries in tests use correct field names
- [ ] Terminology is consistent across codebase

---

## Critical Issues to Address

### 1. via_extractor Field (DECISION: RENAME)

- Currently used in 14 files for tracking which parser plugin discovered a URL
- **Decision**: Rename `via_extractor` → `plugin` (not via_plugin, just "plugin")
- **Impact**: Crawler and parser plugin code - 14 files to update
- Files affected:
  - archivebox/hooks.py
  - archivebox/crawls/models.py
  - archivebox/cli/archivebox_crawl.py
  - All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
  - Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
- This creates consistent naming where "plugin" is used for both:
  - ArchiveResult.plugin (which extractor plugin ran)
  - URL discovery metadata "plugin" (which parser plugin discovered this URL)

### 2. Field Size Constraint

- Current: `extractor = CharField(max_length=32)`
- **Decision**: Keep max_length=32 when renaming to plugin
- No size increase needed

### 3. Migration Implementation

- Use `migrations.RenameField('ArchiveResult', 'extractor', 'plugin')` for clean migration
- Preserves data, indexes, and constraints automatically
- Add hook_name field in same migration

---

## Rollout Notes

**Breaking Changes**:

1. CLI: `--extract`, `--extractors` → `--plugins` (no aliases)
2. API: `extractor` field → `plugin` field (no backwards compat)
3. Template tags: `extractor_*` → `plugin_*` (users must update custom templates)
4. Python API: All function names with "extractor" → "plugin" (import changes needed)
5. Form fields: `archive_methods` → `plugins`
6. **via_extractor → plugin** (URL discovery metadata field)

**Migration Required**: Yes - all instances must run migrations before upgrading

**Estimated Impact**: ~150+ files will need updates across the entire codebase

- 81 files: extractor terminology
- 28 files: parser terminology
- 10 files: archive_method legacy terminology
- Plus templates, JavaScript, tests, etc.

---

## Next Steps

1. **Continue with Phase 3**: Update hooks.py with all function renames and hook_name tracking
2. **Then Phase 4**: Update JSONL import/export with backwards compatibility
3. **Then Phases 5-12**: Systematically update all remaining files
4. **Finally Phase 13**: Run full test suite and verify everything works

**Note**: Migration can be tested immediately - the migration file is ready to run!
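
---

## Example: Legacy JSONL Key Handling (Sketch)

The Phase 4 decision (accept both 'extractor' and 'plugin' on import, always write 'plugin' on export, empty `hook_name` if missing) could look roughly like the sketch below. This is illustrative only: `normalize_archiveresult_record()` and `parse_jsonl_line()` are hypothetical helper names, not existing functions in `archivebox/misc/jsonl.py`, and the choice to prefer 'plugin' when a record carries both keys is an assumption that the plan does not specify.

```python
import json
from typing import Any


def normalize_archiveresult_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an ArchiveResult JSONL record that uses the new key names.

    Hypothetical helper for illustration; not an existing ArchiveBox function.
    """
    normalized = dict(record)
    # Remove the legacy key so it is never written back out on export.
    legacy_value = normalized.pop('extractor', None)
    # Assumption: when both keys are present, the new 'plugin' key wins.
    normalized['plugin'] = normalized.get('plugin') or legacy_value or ''
    # Records exported before this change carry no hook_name; default to ''.
    normalized.setdefault('hook_name', '')
    return normalized


def parse_jsonl_line(line: str) -> dict[str, Any]:
    """Parse one JSONL line and normalize its ArchiveResult keys."""
    return normalize_archiveresult_record(json.loads(line))


legacy = parse_jsonl_line('{"extractor": "wget"}')
current = parse_jsonl_line('{"plugin": "wget", "hook_name": "on_Snapshot__50_wget.py"}')
print(legacy)   # legacy record, rewritten to the new keys
print(current)  # new-format record, passed through unchanged
```

Normalizing at the import boundary would keep the rest of the code free of 'extractor' lookups, which matches the hard-cutover approach taken for the CLI, API, and template tags.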