ArchiveBox/old/TODO_rename_extractor_to_plugin.md

# TODO: Rename Extractor to Plugin - Implementation Progress

**Status**: 🟡 In Progress (2/13 phases complete)
**Started**: 2025-12-28
**Estimated Files to Update**: ~150+ files

---

## Progress Overview

### ✅ Completed Phases (2/13)

- [x] **Phase 1**: Database Migration - Created migration 0033
- [x] **Phase 2**: Core Model Updates - Updated ArchiveResult, ArchiveResultManager, Snapshot models

### 🟡 In Progress (1/13)

- [ ] **Phase 3**: Hook Execution System (hooks.py - all function renames)

### ⏳ Pending Phases (10/13)

- [ ] **Phase 4**: JSONL Import/Export (misc/jsonl.py)
- [ ] **Phase 5**: CLI Commands (archivebox_extract, archivebox_add, archivebox_update)
- [ ] **Phase 6**: API Endpoints (v1_core.py, v1_cli.py)
- [ ] **Phase 7**: Admin Interface (admin_archiveresults.py, forms.py)
- [ ] **Phase 8**: Views and Templates (views.py, templatetags, progress_monitor.html)
- [ ] **Phase 9**: Worker System (workers/worker.py)
- [ ] **Phase 10**: State Machine (statemachines.py)
- [ ] **Phase 11**: Tests (test_migrations_helpers.py, test_recursive_crawl.py, etc.)
- [ ] **Phase 12**: Terminology Standardization (via_extractor→plugin, comments, docstrings)
- [ ] **Phase 13**: Run migrations and verify all tests pass

---

## What's Been Completed So Far

### Phase 1: Database Migration ✅

**File Created**: `archivebox/core/migrations/0033_rename_extractor_add_hook_name.py`

Changes:
- Used `migrations.RenameField()` to rename `extractor` → `plugin`
- Added `hook_name` field (CharField, max_length=255, indexed, default='')
- Preserves all existing data, indexes, and constraints

### Phase 2: Core Models ✅

**File Updated**: `archivebox/core/models.py`

#### ArchiveResultManager
- Updated `indexable()` method to use `plugin__in` and `plugin=method`
- Changed reference from `ARCHIVE_METHODS_INDEXING_PRECEDENCE` to `EXTRACTOR_INDEXING_PRECEDENCE`

#### ArchiveResult Model
**Field Changes**:
- Renamed field: `extractor` → `plugin`
- Added field: `hook_name` (stores full filename like `on_Snapshot__50_wget.py`)
- Updated comments to reference "plugin" instead of "extractor"

**Method Updates**:
- `get_extractor_choices()` → `get_plugin_choices()`
- `__str__()`: Now uses `self.plugin`
- `save()`: Logs `plugin` instead of `extractor`
- `get_absolute_url()`: Uses `self.plugin`
- `extractor_module` property → `plugin_module` property
- `output_exists()`: Checks `self.plugin` directory
- `embed_path()`: Uses `self.plugin` for paths
- `create_output_dir()`: Creates `self.plugin` directory
- `output_dir_name`: Returns `self.plugin`
- `run()`: All references to extractor → plugin (including extractor_dir → plugin_dir)
- `update_from_output()`: All references updated to plugin/plugin_dir
- `_update_snapshot_title()`: Parameter renamed to `plugin_dir`
- `trigger_search_indexing()`: Passes `plugin=self.plugin`
- `output_dir` property: Returns plugin directory
- `is_background_hook()`: Uses `plugin_dir`

#### Snapshot Model
**Method Updates**:
- `create_pending_archiveresults()`: Uses `get_enabled_plugins()`, filters by `plugin=plugin`
- `result_icons` (calc_icons): Maps by `r.plugin`, calls `get_plugin_name()` and `get_plugin_icon()`
- `_merge_archive_results_from_index()`: Maps by `(ar.plugin, ar.start_ts)`, supports both 'extractor' and 'plugin' keys for backwards compat
- `_create_archive_result_if_missing()`: Supports both 'extractor' and 'plugin' keys, creates with `plugin=plugin`
- `write_index_json()`: Writes `'plugin': ar.plugin` in archive_results
- `canonical_outputs()`: Updates `find_best_output_in_dir()` to use `plugin_name`, accesses `result.plugin`, creates keys like `{result.plugin}_path`
- `latest_outputs()`: Uses `get_plugins()`, filters by `plugin=plugin`
- `retry_failed_archiveresults()`: Updated docstring to reference "plugins" instead of "extractors"

**Total Lines Changed in models.py**: ~50+ locations

---

## Full Implementation Plan

# ArchiveResult Model Refactoring Plan: Rename Extractor to Plugin + Add Hook Name Field

## Overview
Refactor the ArchiveResult model and standardize terminology across the codebase:
1. Rename the `extractor` field to `plugin` in ArchiveResult model
2. Add a new `hook_name` field to store the specific hook filename that executed
3. Update all related code paths (CLI, API, admin, views, hooks, JSONL, etc.)
4. Standardize CLI flags from `--extract/--extractors` to `--plugins`
5. **Standardize terminology throughout codebase**:
   - "parsers" → "parser plugins"
   - "extractors" → "extractor plugins"
   - "parser extractors" → "parser plugins"
   - "archive methods" → "extractor plugins"
   - Document apt/brew/npm/pip as "package manager plugins" in comments

## Current State Analysis

### ArchiveResult Model (archivebox/core/models.py:1679-1750)
```python
class ArchiveResult(ModelWithOutputDir, ...):
    extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
    # New fields from migration 0029:
    output_str, output_json, output_files, output_size, output_mimetypes
    binary = ForeignKey('machine.Binary', ...)
    # No hook_name field yet
```

### Hook Execution Flow
1. `ArchiveResult.run()` discovers hooks for the plugin (e.g., `wget/on_Snapshot__50_wget.py`)
2. `run_hook()` executes each hook script, captures output as HookResult
3. `update_from_output()` parses JSONL and updates ArchiveResult fields
4. Currently NO tracking of which specific hook file executed

### Field Usage Across Codebase
**extractor field** is used in ~100 locations:
- **Model**: ArchiveResult.extractor field definition, __str__, manager queries
- **CLI**: archivebox_extract.py (--plugin flag), archivebox_add.py, tests
- **API**: v1_core.py (extractor filter), v1_cli.py (extract/extractors args)
- **Admin**: admin_archiveresults.py (list filter, display)
- **Views**: core/views.py (archiveresult_objects dict by extractor)
- **Template Tags**: core_tags.py (extractor_icon, extractor_thumbnail, extractor_embed)
- **Hooks**: hooks.py (get_extractors, get_extractor_name, run_hook output parsing)
- **JSONL**: misc/jsonl.py (archiveresult_to_jsonl serializes extractor)
- **Worker**: workers/worker.py (ArchiveResultWorker filters by extractor)
- **Statemachine**: statemachines.py (logs extractor in state transitions)

---

## Implementation Plan

### Phase 1: Database Migration (archivebox/core/migrations/) ✅ COMPLETE

**Create migration 0033_rename_extractor_add_hook_name.py**:
1. Rename field: `extractor` → `plugin` (preserve index, constraints)
2. Add field: `hook_name` = CharField(max_length=255, blank=True, default='', db_index=True)
   - **Stores full hook filename**: `on_Snapshot__50_wget.py`, `on_Crawl__10_chrome_session.js`, etc.
   - Empty string for existing records (data migration sets all to '')
3. Update any indexes or constraints that reference extractor

**Decision**: Full filename chosen for explicitness and easy grep-ability

**Critical Files to Update**:
- ✅ ArchiveResult model field definitions
- ✅ Migration dependencies (latest: 0032)

---

### Phase 2: Core Model Updates (archivebox/core/models.py) ✅ COMPLETE

**ArchiveResult Model** (lines 1679-1820):
- ✅ Rename field: `extractor` → `plugin`
- ✅ Add field: `hook_name = models.CharField(...)`
- ✅ Update __str__: `f'...-> {self.plugin}'`
- ✅ Update absolute_url: Use plugin instead of extractor
- ✅ Update embed_path: Use plugin directory name

**ArchiveResultManager** (lines 1669-1677):
- ✅ Update indexable(): `filter(plugin__in=INDEXABLE_METHODS, ...)`
- ✅ Update precedence: `When(plugin=method, ...)`

**Snapshot Model** (lines 1000-1600):
- ✅ Update canonical_outputs: Access by plugin name
- ✅ Update create_pending_archiveresults: Use plugin parameter
- ✅ All queryset filters: `archiveresult_set.filter(plugin=...)`

---

### Phase 3: Hook Execution System (archivebox/hooks.py) 🟡 IN PROGRESS

**Function Renames**:
- [ ] `get_extractors()` → `get_plugins()` (lines 479-504)
- [ ] `get_parser_extractors()` → `get_parser_plugins()` (lines 507-514)
- [ ] `get_extractor_name()` → `get_plugin_name()` (lines 517-530)
- [ ] `is_parser_extractor()` → `is_parser_plugin()` (lines 533-536)
- [ ] `get_enabled_extractors()` → `get_enabled_plugins()` (lines 553-566)
- [ ] `get_extractor_template()` → `get_plugin_template()` (line 1048)
- [ ] `get_extractor_icon()` → `get_plugin_icon()` (line 1068)
- [ ] `get_all_extractor_icons()` → `get_all_plugin_icons()` (line 1092)

**Update HookResult TypedDict** (lines 63-73):
- [ ] Add field: `hook_name: str` to store hook filename
- [ ] Add field: `plugin: str` (if not already present)

**Update run_hook()** (lines 141-389):
- [ ] **Add hook_name parameter**: Pass hook filename to be stored in result
- [ ] Update HookResult to include hook_name field
- [ ] Update JSONL record output: Add `hook_name` key

**Update ArchiveResult.run()** (lines 1838-1914):
- [ ] When calling run_hook, pass the hook filename
- [ ] Store hook_name in ArchiveResult before/after execution

**Update ArchiveResult.update_from_output()** (lines 1916-2073):
- [ ] Parse hook_name from JSONL output
- [ ] Store in self.hook_name field
- [ ] If not present in JSONL, infer from directory/filename

**Constants to Rename**:
- [ ] `ARCHIVE_METHODS_INDEXING_PRECEDENCE` → `EXTRACTOR_INDEXING_PRECEDENCE`

**Comments/Docstrings**: Update all function docstrings to use "plugin" terminology

---

### Phase 4: JSONL Import/Export (archivebox/misc/jsonl.py)

**Update archiveresult_to_jsonl()** (lines 173-200):
- [ ] Change key: `'extractor': result.extractor` → `'plugin': result.plugin`
- [ ] Add key: `'hook_name': result.hook_name`

**Update JSONL parsing**:
- [ ] **Accept both 'extractor' (legacy) and 'plugin' (new) keys when importing**
- [ ] Always write 'plugin' key in new exports (never 'extractor')
- [ ] Parse and store hook_name if present (backwards compat: empty if missing)

**Decision**: Support both keys on import for smooth migration, always export new format

---

### Phase 5: CLI Commands (archivebox/cli/)

**archivebox_extract.py** (lines 1-230):
- [ ] Rename flag: `--plugin` stays (already correct!)
- [ ] Update internal references: extractor → plugin
- [ ] Update filter: `results.filter(plugin=plugin)`
- [ ] Update display: `result.plugin`

**archivebox_add.py**:
- [ ] Rename config key: `'EXTRACTORS': plugins` → `'PLUGINS': plugins` (if not already)

**archivebox_update.py**:
- [ ] Standardize to `--plugins` flag (currently may be --extractors or --extract)

**tests/test_oneshot.py**:
- [ ] Update flag: `--extract=...` → `--plugins=...`

---

### Phase 6: API Endpoints (archivebox/api/)

**v1_core.py** (ArchiveResult API):
- [ ] Update schema field: `extractor: str` → `plugin: str`
- [ ] Update schema field: Add `hook_name: str = ''`
- [ ] Update FilterSchema: `q=[..., 'plugin', ...]`
- [ ] Update extractor filter: `plugin: Optional[str] = Field(None, q='plugin__icontains')`

**v1_cli.py** (CLI API):
- [ ] Rename AddCommandSchema field: `extract: str` → `plugins: str`
- [ ] Rename UpdateCommandSchema field: `extractors: str` → `plugins: str`
- [ ] Update endpoint mapping: `args.plugins` → `plugins` parameter

---

### Phase 7: Admin Interface (archivebox/core/)

**admin_archiveresults.py**:
- [ ] Update all references: extractor → plugin
- [ ] Update list_filter: `'plugin'` instead of `'extractor'`
- [ ] Update ordering: `order_by('plugin')`
- [ ] Update get_plugin_icon: (rename from get_extractor_icon if exists)

**admin_snapshots.py**:
- [ ] Update any commented TODOs referencing extractor

**forms.py**:
- [ ] Rename function: `get_archive_methods()` → `get_plugin_choices()`
- [ ] Update form field: `archive_methods` → `plugins`

---

### Phase 8: Views and Templates (archivebox/core/)

**views.py**:
- [ ] Update dict building: `archiveresult_objects[result.plugin] = result`
- [ ] Update all extractor references to plugin

**templatetags/core_tags.py**:
- [ ] **Rename template tags (BREAKING CHANGE)**:
  - `extractor_icon()` → `plugin_icon()`
  - `extractor_thumbnail()` → `plugin_thumbnail()`
  - `extractor_embed()` → `plugin_embed()`
- [ ] Update internal: `result.extractor` → `result.plugin`

**Update HTML templates** (if any directly reference extractor):
- [ ] Search for `{{ result.extractor }}` and similar
- [ ] Update to `{{ result.plugin }}`
- [ ] Update template tag calls
- [ ] **CRITICAL**: Update JavaScript in `templates/admin/progress_monitor.html`:
  - Lines 491, 505: Change `extractor.extractor` and `a.extractor` to use `plugin` field

---

### Phase 9: Worker System (archivebox/workers/worker.py)

**ArchiveResultWorker**:
- [ ] Rename parameter: `extractor` → `plugin` (lines 348, 350)
- [ ] Update filter: `qs.filter(plugin=self.plugin)`
- [ ] Update subprocess passing: Use plugin parameter

---

### Phase 10: State Machine (archivebox/core/statemachines.py)

**ArchiveResultMachine**:
- [ ] Update logging: Use `self.archiveresult.plugin` instead of extractor
- [ ] Update any state metadata that includes extractor field

---

### Phase 11: Tests and Fixtures

**Update test files**:
- [ ] tests/test_migrations_*.py: Update expected field names in schema definitions
- [ ] tests/test_hooks.py: Update assertions for plugin/hook_name fields
- [ ] archivebox/tests/test_migrations_helpers.py: Update schema SQL (lines 161, 382, 468)
- [ ] tests/test_recursive_crawl.py: Update SQL query `WHERE extractor = '60_parse_html_urls'` (line 163)
- [ ] archivebox/cli/tests_piping.py: Update test function names and assertions
- [ ] Any fixtures that create ArchiveResults: Use plugin parameter
- [ ] Any mock objects that set `.extractor` attribute: Change to `.plugin`

---

### Phase 12: Terminology Standardization (NEW)

This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.

**via_extractor → plugin Rename (14 files)**:
- [ ] Rename metadata field `via_extractor` to just `plugin`
- [ ] Files affected:
  - archivebox/hooks.py - Set plugin in run_hook() output
  - archivebox/crawls/models.py - If via_extractor field exists
  - archivebox/cli/archivebox_crawl.py - References to via_extractor
  - All parser plugins that set via_extractor in output
  - Test files with via_extractor assertions
- [ ] Update all JSONL output from parser plugins to use "plugin" key

**Logging Functions (archivebox/misc/logging_util.py)**:
- [ ] `log_archive_method_started()` → `log_extractor_started()` (line 326)
- [ ] `log_archive_method_finished()` → `log_extractor_finished()` (line 330)

**Form Functions (archivebox/core/forms.py)**:
- [ ] `get_archive_methods()` → `get_plugin_choices()` (line 15)
- [ ] Form field `archive_methods` → `plugins` (line 24, 29)
- [ ] Update form validation and view usage

**Comments and Docstrings (81 files with "extractor" references)**:
- [ ] Update comments to say "extractor plugin" instead of just "extractor"
- [ ] Update comments to say "parser plugin" instead of "parser extractor"
- [ ] All plugin files: Update docstrings to use "extractor plugin" terminology

**Package Manager Plugin Documentation**:
- [ ] Update comments in package manager hook files to say "package manager plugin":
  - archivebox/plugins/apt/on_Binary__install_using_apt_provider.py
  - archivebox/plugins/brew/on_Binary__install_using_brew_provider.py
  - archivebox/plugins/npm/on_Binary__install_using_npm_provider.py
  - archivebox/plugins/pip/on_Binary__install_using_pip_provider.py
  - archivebox/plugins/env/on_Binary__install_using_env_provider.py
  - archivebox/plugins/custom/on_Binary__install_using_custom_bash.py

**String Literals in Error Messages**:
- [ ] Search for error messages containing "extractor" and update to "plugin" or "extractor plugin"
- [ ] Search for error messages containing "parser" and update to "parser plugin" where appropriate

---

## Critical Files Summary

### Must Update (Core):
1. ✅ `archivebox/core/models.py` - ArchiveResult, ArchiveResultManager, Snapshot
2. ✅ `archivebox/core/migrations/0033_*.py` - New migration
3. ⏳ `archivebox/hooks.py` - All hook execution and discovery functions
4. ⏳ `archivebox/misc/jsonl.py` - Serialization/deserialization

### Must Update (CLI):
5. ⏳ `archivebox/cli/archivebox_extract.py`
6. ⏳ `archivebox/cli/archivebox_add.py`
7. ⏳ `archivebox/cli/archivebox_update.py`

### Must Update (API):
8. ⏳ `archivebox/api/v1_core.py`
9. ⏳ `archivebox/api/v1_cli.py`

### Must Update (Admin/Views):
10. ⏳ `archivebox/core/admin_archiveresults.py`
11. ⏳ `archivebox/core/views.py`
12. ⏳ `archivebox/core/templatetags/core_tags.py`

### Must Update (Workers/State):
13. ⏳ `archivebox/workers/worker.py`
14. ⏳ `archivebox/core/statemachines.py`

### Must Update (Tests):
15. ⏳ `tests/test_oneshot.py`
16. ⏳ `archivebox/tests/test_hooks.py`
17. ⏳ `archivebox/tests/test_migrations_helpers.py` - Schema SQL definitions
18. ⏳ `tests/test_recursive_crawl.py` - SQL queries with field names
19. ⏳ `archivebox/cli/tests_piping.py` - Test function docstrings

### Must Update (Terminology - Phase 12):
20. ⏳ `archivebox/misc/logging_util.py` - Rename logging functions
21. ⏳ `archivebox/core/forms.py` - Rename form helper and field
22. ⏳ `archivebox/templates/admin/progress_monitor.html` - JavaScript field refs
23. ⏳ All 81 plugin files - Update docstrings and comments
24. ⏳ 28 files with parser terminology - Update comments consistently

---

## Migration Strategy

### Data Migration for Existing Records:
```python
def forwards(apps, schema_editor):
    ArchiveResult = apps.get_model('core', 'ArchiveResult')
    # All existing records get empty hook_name
    ArchiveResult.objects.all().update(hook_name='')
```

### Backwards Compatibility:
**BREAKING CHANGES** (per user requirements - no backwards compat):
- CLI flags: Hard cutover to `--plugins` (no aliases)
- API fields: `extractor` removed, `plugin` required
- Template tags: All renamed to `plugin_*`

**PARTIAL COMPAT** (for migration):
- JSONL: Write 'plugin', but **accept both 'extractor' and 'plugin' on import**

---

## Testing Checklist

- [ ] Migration 0033 runs successfully on test database
- [ ] All migrations tests pass (test_migrations_*.py)
- [ ] All hook tests pass (test_hooks.py)
- [ ] CLI commands work with --plugins flag
- [ ] API endpoints return plugin/hook_name fields correctly
- [ ] Admin interface displays plugin correctly
- [ ] Admin progress monitor JavaScript works (no console errors)
- [ ] JSONL export includes both plugin and hook_name
- [ ] JSONL import accepts both 'extractor' and 'plugin' keys
- [ ] Hook execution populates hook_name field
- [ ] Worker filtering by plugin works
- [ ] Template tags render with new names (plugin_icon, etc.)
- [ ] All renamed functions work correctly
- [ ] SQL queries in tests use correct field names
- [ ] Terminology is consistent across codebase

---

## Critical Issues to Address

### 1. via_extractor Field (DECISION: RENAME)
- Currently used in 14 files for tracking which parser plugin discovered a URL
- **Decision**: Rename `via_extractor` → `plugin` (not via_plugin, just "plugin")
- **Impact**: Crawler and parser plugin code - 14 files to update
- Files affected:
  - archivebox/hooks.py
  - archivebox/crawls/models.py
  - archivebox/cli/archivebox_crawl.py
  - All parser plugins (parse_html_urls, parse_rss_urls, parse_jsonl_urls, etc.)
  - Tests: tests_piping.py, test_parse_rss_urls_comprehensive.py
- This creates consistent naming where "plugin" is used for both:
  - ArchiveResult.plugin (which extractor plugin ran)
  - URL discovery metadata "plugin" (which parser plugin discovered this URL)

### 2. Field Size Constraint
- Current: `extractor = CharField(max_length=32)`
- **Decision**: Keep max_length=32 when renaming to plugin
- No size increase needed

### 3. Migration Implementation
- Use `migrations.RenameField('ArchiveResult', 'extractor', 'plugin')` for clean migration
- Preserves data, indexes, and constraints automatically
- Add hook_name field in same migration

---

## Rollout Notes

**Breaking Changes**:
1. CLI: `--extract`, `--extractors` → `--plugins` (no aliases)
2. API: `extractor` field → `plugin` field (no backwards compat)
3. Template tags: `extractor_*` → `plugin_*` (users must update custom templates)
4. Python API: All function names with "extractor" → "plugin" (import changes needed)
5. Form fields: `archive_methods` → `plugins`
6. **via_extractor → plugin** (URL discovery metadata field)

**Migration Required**: Yes - all instances must run migrations before upgrading

**Estimated Impact**: ~150+ files will need updates across the entire codebase
- 81 files: extractor terminology
- 28 files: parser terminology
- 10 files: archive_method legacy terminology
- Plus templates, JavaScript, tests, etc.

---

## Next Steps

1. **Continue with Phase 3**: Update hooks.py with all function renames and hook_name tracking
2. **Then Phase 4**: Update JSONL import/export with backwards compatibility
3. **Then Phases 5-12**: Systematically update all remaining files
4. **Finally Phase 13**: Run full test suite and verify everything works

**Note**: Migration can be tested immediately - the migration file is ready to run!