21 KiB
TODO: Rename Extractor to Plugin - Implementation Progress
Status: 🟡 In Progress (2/13 phases complete) Started: 2025-12-28 Estimated Files to Update: ~150+ files
Progress Overview
✅ Completed Phases (2/13)
- Phase 1: Database Migration - Created migration 0033
- Phase 2: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models
🟡 In Progress (1/13)
- Phase 3: Hook Execution System (hooks.py - all function renames)
⏳ Pending Phases (10/13)
- Phase 4: JSONL Import/Export (misc/jsonl.py)
- Phase 5: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
- Phase 6: API Endpoints (v1_core.py, v1_cli.py)
- Phase 7: Admin Interface (admin_archiveresults.py, forms.py)
- Phase 8: Views and Templates (views.py, templatetags, progress_monitor.html)
- Phase 9: Worker System (workers/worker.py)
- Phase 10: State Machine (statemachines.py)
- Phase 11: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
- Phase 12: Terminology Standardization (via_extractor→plugin, comments, docstrings)
- Phase 13: Run migrations and verify all tests pass
What's Been Completed So Far
Phase 1: Database Migration ✅
File Created: archivebox/core/migrations/0033_rename_extractor_add_hook_name.py
Changes:
- Used
migrations.RenameField()to renameextractor→plugin - Added
hook_namefield (CharField, max_length=255, indexed, default='') - Preserves all existing data, indexes, and constraints
Phase 2: Core Models ✅
File Updated: archivebox/core/models.py
ArchiveResultManager
- Updated
indexable()method to useplugin__inandplugin=method - Changed reference from
ARCHIVE_METHODS_INDEXING_PRECEDENCEtoEXTRACTOR_INDEXING_PRECEDENCE
ArchiveResult Model
Field Changes:
- Renamed field:
extractor→plugin - Added field:
hook_name(stores full filename likeon_Snapshot__50_wget.py) - Updated comments to reference "plugin" instead of "extractor"
Method Updates:
get_extractor_choices()→get_plugin_choices()__str__(): Now usesself.pluginsave(): Logsplugininstead ofextractorget_absolute_url(): Usesself.pluginextractor_moduleproperty →plugin_modulepropertyoutput_exists(): Checksself.plugindirectoryembed_path(): Usesself.pluginfor pathscreate_output_dir(): Createsself.plugindirectoryoutput_dir_name: Returnsself.pluginrun(): All references to extractor → plugin (including extractor_dir → plugin_dir)update_from_output(): All references updated to plugin/plugin_dir_update_snapshot_title(): Parameter renamed toplugin_dirtrigger_search_indexing(): Passesplugin=self.pluginoutput_dirproperty: Returns plugin directoryis_background_hook(): Usesplugin_dir
Snapshot Model
Method Updates:
create_pending_archiveresults(): Usesget_enabled_plugins(), filters byplugin=pluginresult_icons(calc_icons): Maps byr.plugin, callsget_plugin_name()andget_plugin_icon()_merge_archive_results_from_index(): Maps by(ar.plugin, ar.start_ts), supports both 'extractor' and 'plugin' keys for backwards compat_create_archive_result_if_missing(): Supports both 'extractor' and 'plugin' keys, creates withplugin=pluginwrite_index_json(): Writes'plugin': ar.pluginin archive_resultscanonical_outputs(): Updatesfind_best_output_in_dir()to useplugin_name, accessesresult.plugin, creates keys like{result.plugin}_pathlatest_outputs(): Usesget_plugins(), filters byplugin=pluginretry_failed_archiveresults(): Updated docstring to reference "plugins" instead of "extractors"
Total Lines Changed in models.py: ~50+ locations
Full Implementation Plan
ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field
Overview
Refactor the ArchiveResult model and standardize terminology across the codebase:
- Rename the
extractorfield topluginin ArchiveResult model - Add a new
hook_namefield to store the specific hook filename that executed - Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
- Standardize CLI flags from
--extract/--extractorsto--plugins - Standardize terminology throughout codebase:
- "parsers" → "parser plugins"
- "extractors" → "extractor plugins"
- "parser extractors" → "parser plugins"
- "archive methods" → "extractor plugins"
- Document apt/brew/npm/pip as "package manager plugins" in comments
Current State Analysis
ArchiveResult Model (archivebox/core/models.py:1679-1750)
class ArchiveResult(ModelWithOutputDir, ...):
extractor = models.CharField(max_length=32, db_index=True) # e.g., "screenshot", "wget"
# New fields from migration 0029:
output_str, output_json, output_files, output_size, output_mimetypes
binary = ForeignKey('machine.Binary', ...)
# No hook_name field yet
Hook Execution Flow
ArchiveResult.run()discovers hooks for the plugin (e.g.,wget/on_Snapshot__50_wget.py)run_hook()executes each hook script, captures output as HookResultupdate_from_output()parses JSONL and updates ArchiveResult fields- Currently NO tracking of which specific hook file executed
Field Usage Across Codebase
extractor field is used in ~100 locations:
- Model: ArchiveResult.extractor field definition, str, manager queries
- CLI: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
- API: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
- Admin: admin_archiveresults.py (list filter, display)
- Views: core/views.py (archiveresult_objects dict by extractor)
- Template Tags: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
- Hooks: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
- JSONL: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
- Worker: workers/worker.py (ArchiveResultWorker filters by extractor)
- Statemachine: statemachines.py (logs extractor in state transitions)
Implementation Plan
Phase 1: Database Migration (archivebox/core/migrations/) ✅ COMPLETE
Create migration 0033_rename_extractor_add_hook_name.py:
- Rename field:
extractor→plugin(preserve index, constraints) - Add field:
hook_name= CharField(max_length=255, blank=True, default='', db_index=True)- Stores full hook filename:
on_Snapshot__50_wget.py,on_Crawl__10_chrome_session.js, etc. - Empty string for existing records (data migration sets all to '')
- Stores full hook filename:
- Update any indexes or constraints that reference extractor
Decision: Full filename chosen for explicitness and easy grep-ability
Critical Files to Update:
- ✅ ArchiveResult model field definitions
- ✅ Migration dependencies (latest: 0032)
Phase 2: Core Model Updates (archivebox/core/models.py) ✅ COMPLETE
ArchiveResult Model (lines 1679-1820):
- ✅ Rename field:
extractor→plugin - ✅ Add field:
hook_name = models.CharField(...) - ✅ Update str:
f'...-> {self.plugin}' - ✅ Update absolute_url: Use plugin instead of extractor
- ✅ Update embed_path: Use plugin directory name
ArchiveResultManager (lines 1669-1677):
- ✅ Update indexable():
filter(plugin__in=INDEXABLE_METHODS, ...) - ✅ Update precedence:
When(plugin=method, ...)
Snapshot Model (lines 1000-1600):
- ✅ Update canonical_outputs: Access by plugin name
- ✅ Update create_pending_archiveresults: Use plugin parameter
- ✅ All queryset filters:
archiveresult_set.filter(plugin=...)
Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS
Function Renames:
get_extractors()→get_plugins()(lines 479-504)get_parser_extractors()→get_parser_plugins()(lines 507-514)get_extractor_name()→get_plugin_name()(lines 517-530)is_parser_extractor()→is_parser_plugin()(lines 533-536)get_enabled_extractors()→get_enabled_plugins()(lines 553-566)get_extractor_template()→get_plugin_template()(line 1048)get_extractor_icon()→get_plugin_icon()(line 1068)get_all_extractor_icons()→get_all_plugin_icons()(line 1092)
Update HookResult TypedDict (lines 63-73):
- Add field:
hook_name: strto store hook filename - Add field:
plugin: str(if not already present)
Update run_hook() (lines 141-389):
- Add hook_name parameter: Pass hook filename to be stored in result
- Update HookResult to include hook_name field
- Update JSONL record output: Add
hook_namekey
Update ArchiveResult.run() (lines 1838-1914):
- When calling run_hook, pass the hook filename
- Store hook_name in ArchiveResult before/after execution
Update ArchiveResult.update_from_output() (lines 1916-2073):
- Parse hook_name from JSONL output
- Store in self.hook_name field
- If not present in JSONL, infer from directory/filename
Constants to Rename:
ARCHIVE_METHODS_INDEXING_PRECEDENCE→EXTRACTOR_INDEXING_PRECEDENCE
Comments/Docstrings: Update all function docstrings to use "plugin" terminology
Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)
Update archiveresult_to_jsonl() (lines 173-200):
- Change key:
'extractor': result.extractor→'plugin': result.plugin - Add key:
'hook_name': result.hook_name
Update JSONL parsing:
- Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing
- Always write 'plugin' key in new exports (never 'extractor')
- Parse and store hook_name if present (backwards compat: empty if missing)
Decision: Support both keys on import for smooth migration, always export new format
Phase 5: CLI Commands (archivebox/cli/)
archivebox_extract.py (lines 1-230):
- Rename flag:
--pluginstays (already correct!) - Update internal references: extractor → plugin
- Update filter:
results.filter(plugin=plugin) - Update display:
result.plugin
archivebox_add.py:
- Rename config key:
'EXTRACTORS': plugins→'PLUGINS': plugins(if not already)
archivebox_update.py:
- Standardize to
--pluginsflag (currently may be --extractors or --extract)
tests/test_oneshot.py:
- Update flag:
--extract=...→--plugins=...
Phase 6: API Endpoints (archivebox/api/)
v1_core.py (ArchiveResult API):
- Update schema field:
extractor: str→plugin: str - Update schema field: Add
hook_name: str = '' - Update FilterSchema:
q=[..., 'plugin', ...] - Update extractor filter:
plugin: Optional[str] = Field(None, q='plugin__icontains')
v1_cli.py (CLI API):
- Rename AddCommandSchema field:
extract: str→plugins: str - Rename UpdateCommandSchema field:
extractors: str→plugins: str - Update endpoint mapping:
args.plugins→pluginsparameter
Phase 7: Admin Interface (archivebox/core/)
admin_archiveresults.py:
- Update all references: extractor → plugin
- Update list_filter:
'plugin'instead of'extractor' - Update ordering:
order_by('plugin') - Update get_plugin_icon: (rename from get_extractor_icon if exists)
admin_snapshots.py:
- Update any commented TODOs referencing extractor
forms.py:
- Rename function:
get_archive_methods()→get_plugin_choices() - Update form field:
archive_methods→plugins
Phase 8: Views and Templates (archivebox/core/)
views.py:
- Update dict building:
archiveresult_objects[result.plugin] = result - Update all extractor references to plugin
templatetags/core_tags.py:
- Rename template tags (BREAKING CHANGE):
extractor_icon()→plugin_icon()extractor_thumbnail()→plugin_thumbnail()extractor_embed()→plugin_embed()
- Update internal:
result.extractor→result.plugin
Update HTML templates (if any directly reference extractor):
- Search for
{{ result.extractor }}and similar - Update to
{{ result.plugin }} - Update template tag calls
- CRITICAL: Update JavaScript in
templates/admin/progress_monitor.html:- Lines 491, 505: Change
extractor.extractoranda.extractorto usepluginfield
- Lines 491, 505: Change
Phase 9: Worker System (archivebox/workers/worker.py)
ArchiveResultWorker:
- Rename parameter:
extractor→plugin(lines 348, 350) - Update filter:
qs.filter(plugin=self.plugin) - Update subprocess passing: Use plugin parameter
Phase 10: State Machine (archivebox/core/statemachines.py)
ArchiveResultMachine:
- Update logging: Use
self.archiveresult.plugininstead of extractor - Update any state metadata that includes extractor field
Phase 11: Tests and Fixtures
Update test files:
- tests/test_migrations_*.py: Update expected field names in schema definitions
- tests/test_hooks.py: Update assertions for plugin/hook_name fields
- archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
- tests/test_recursive_crawl.py: Update SQL query
WHERE extractor = '60_parse_html_urls'(line 163) - archivebox/cli/tests_piping.py: Update test function names and assertions
- Any fixtures that create ArchiveResults: Use plugin parameter
- Any mock objects that set
.extractorattribute: Change to.plugin
Phase 12: Terminology Standardization (NEW)
This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.
via_extractor → plugin Rename (14 files):
- Rename metadata field
via_extractorto justplugin - Files affected:
- archivebox/hooks.py - Set plugin in run_hook() output
- archivebox/crawls/models.py - If via_extractor field exists
- archivebox/cli/archivebox_crawl.py - References to via_extractor
- All parser plugins that set via_extractor in output
- Test files with via_extractor assertions
- Update all JSONL output from parser plugins to use "plugin" key
Logging Functions (archivebox/misc/logging_util.py):
log_archive_method_started()→log_extractor_started()(line 326)log_archive_method_finished()→log_extractor_finished()(line 330)
Form Functions (archivebox/core/forms.py):
get_archive_methods()→get_plugin_choices()(line 15)- Form field
archive_methods→plugins(line 24, 29) - Update form validation and view usage
Comments and Docstrings (81 files with "extractor" references):
- Update comments to say "extractor plugin" instead of just "extractor"
- Update comments to say "parser plugin" instead of "parser extractor"
- All plugin files: Update docstrings to use "extractor plugin" terminology
Package Manager Plugin Documentation:
- Update comments in package manager hook files to say "package manager plugin":
- archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
- archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
- archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
- archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
- archivebox/plugins/env/on_Binary__install_using_env_provider.py
- archivebox/plugins/custom/on_Binary__install_using_custom_bash.py
String Literals in Error Messages:
- Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
- Search for error messages containing "parser" and update to "parser plugin" where appropriate
Critical Files Summary
Must Update (Core):
- ✅
archivebox/core/models.py- ArchiveResult, ArchiveResultManager, Snapshot - ✅
archivebox/core/migrations/0033_*.py- New migration - ⏳
archivebox/hooks.py- All hook execution and discovery functions - ⏳
archivebox/misc/jsonl.py- Serialization/deserialization
Must Update (CLI):
- ⏳
archivebox/cli/archivebox_extract.py - ⏳
archivebox/cli/archivebox_add.py - ⏳
archivebox/cli/archivebox_update.py
Must Update (API):
- ⏳
archivebox/api/v1_core.py - ⏳
archivebox/api/v1_cli.py
Must Update (Admin/Views):
- ⏳
archivebox/core/admin_archiveresults.py - ⏳
archivebox/core/views.py - ⏳
archivebox/core/templatetags/core_tags.py
Must Update (Workers/State):
- ⏳
archivebox/workers/worker.py - ⏳
archivebox/core/statemachines.py
Must Update (Tests):
- ⏳
tests/test_oneshot.py - ⏳
archivebox/tests/test_hooks.py - ⏳
archivebox/tests/test_migrations_helpers.py- Schema SQL definitions - ⏳
tests/test_recursive_crawl.py- SQL queries with field names - ⏳
archivebox/cli/tests_piping.py- Test function docstrings
Must Update (Terminology - Phase 12):
- ⏳
archivebox/misc/logging_util.py- Rename logging functions - ⏳
archivebox/core/forms.py- Rename form helper and field - ⏳
archivebox/templates/admin/progress_monitor.html- JavaScript field refs - ⏳ All 81 plugin files - Update docstrings and comments
- ⏳ 28 files with parser terminology - Update comments consistently
Migration Strategy
Data Migration for Existing Records:
def forwards(apps, schema_editor):
ArchiveResult = apps.get_model('core', 'ArchiveResult')
# All existing records get empty hook_name
ArchiveResult.objects.all().update(hook_name='')
Backwards Compatibility:
BREAKING CHANGES (per user requirements - no backwards compat):
- CLI flags: Hard cutover to
--plugins(no aliases) - API fields:
extractorremoved,pluginrequired - Template tags: All renamed to
plugin_*
PARTIAL COMPAT (for migration):
- JSONL: Write 'plugin', but accept both 'extractor' and 'plugin' on import
Testing Checklist
- Migration 0033 runs successfully on test database
- All migrations tests pass (test_migrations_*.py)
- All hook tests pass (test_hooks.py)
- CLI commands work with --plugins flag
- API endpoints return plugin/hook_name fields correctly
- Admin interface displays plugin correctly
- Admin progress monitor JavaScript works (no console errors)
- JSONL export includes both plugin and hook_name
- JSONL import accepts both 'extractor' and 'plugin' keys
- Hook execution populates hook_name field
- Worker filtering by plugin works
- Template tags render with new names (plugin_icon, etc.)
- All renamed functions work correctly
- SQL queries in tests use correct field names
- Terminology is consistent across codebase
Critical Issues to Address
1. via_extractor Field (DECISION: RENAME)
- Currently used in 14 files for tracking which parser plugin discovered a URL
- Decision: Rename
via_extractor→plugin(not via_plugin, just "plugin") - Impact: Crawler and parser plugin code - 14 files to update
- Files affected:
- archivebox/hooks.py
- archivebox/crawls/models.py
- archivebox/cli/archivebox_crawl.py
- All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
- Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
- This creates consistent naming where "plugin" is used for both:
- ArchiveResult.plugin (which extractor plugin ran)
- URL discovery metadata "plugin" (which parser plugin discovered this URL)
2. Field Size Constraint
- Current:
extractor = CharField(max_length=32) - Decision: Keep max_length=32 when renaming to plugin
- No size increase needed
3. Migration Implementation
- Use
migrations.RenameField('ArchiveResult', 'extractor', 'plugin')for clean migration - Preserves data, indexes, and constraints automatically
- Add hook_name field in same migration
Rollout Notes
Breaking Changes:
- CLI:
--extract,--extractors→--plugins(no aliases) - API:
extractorfield →pluginfield (no backwards compat) - Template tags:
extractor_*→plugin_*(users must update custom templates) - Python API: All function names with "extractor" → "plugin" (import changes needed)
- Form fields:
archive_methods→plugins - via_extractor → plugin (URL discovery metadata field)
Migration Required: Yes - all instances must run migrations before upgrading
Estimated Impact: ~150+ files will need updates across the entire codebase
- 81 files: extractor terminology
- 28 files: parser terminology
- 10 files: archive_method legacy terminology
- Plus templates, JavaScript, tests, etc.
Next Steps
- Continue with Phase 3: Update hooks.py with all function renames and hook_name tracking
- Then Phase 4: Update JSONL import/export with backwards compatibility
- Then Phases 5-12: Systematically update all remaining files
- Finally Phase 13: Run full test suite and verify everything works
Note: Migration can be tested immediately - the migration file is ready to run!