Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each
hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh
Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files
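A minimal sketch of the naming scheme, assuming the prefix is the hook script's filename minus its final extension. The helper name `hook_output_paths`, the `chrome` directory, and the `.js` extension are illustrative and not taken from hooks.py:

```python
from pathlib import Path


def hook_output_paths(plugin_dir: Path, hook_script: Path) -> dict[str, Path]:
    """Build per-hook output paths so hooks sharing a directory never collide."""
    # Path.stem drops only the final suffix:
    # 'on_Snapshot__20_chrome_tab.bg.js' -> 'on_Snapshot__20_chrome_tab.bg'
    hook_name = hook_script.stem
    return {
        "stdout": plugin_dir / f"{hook_name}.stdout.log",
        "stderr": plugin_dir / f"{hook_name}.stderr.log",
        "pid": plugin_dir / f"{hook_name}.pid",
        "cmd": plugin_dir / f"{hook_name}.sh",
    }


paths = hook_output_paths(Path("chrome"), Path("on_Snapshot__20_chrome_tab.bg.js"))
print(paths["pid"].name)  # on_Snapshot__20_chrome_tab.bg.pid
```

Because every path carries the hook name, two hooks running in the same plugin directory write to disjoint files.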
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->
# Summary
<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->
# Related issues
<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->
# Changes these areas
- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout,
stderr, pid, and cmd filenames. This fixes mixed logs and ensures
correct cleanup and status checks when multiple hooks run in the same
plugin directory.
- **Bug Fixes**
- hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude
them from new_files; derive cmd.sh from pid for safe kill.
- core/models.py: read hook-specific logs; exclude hook output files
when computing outputs; cleanup and background detection use *.pid.
- Plugins: stop writing redundant hook.pid files; minor chrome utils
cleanup.
<sup>Written for commit 754b096193.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
New section 1.5 adds @property proc that returns psutil.Process ONLY if:
- PID exists in OS
- OS start time matches our started_at (within tolerance)
- We're on the same machine
Safety features:
- Validates start time via psutil.Process.create_time()
- Optional command validation (binary name matches)
- Returns None instead of wrong process on PID reuse
Also adds convenience methods:
- is_running: Check via validated psutil
- get_memory_info(): RSS/VMS if running
- get_cpu_percent(): CPU usage if running
- get_children_pids(): Child PIDs from OS
Updated kill() to use self.proc for safe killing - never kills
a recycled PID since we validate start time first.
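The start-time check can be sketched roughly as follows. This is an illustration under assumptions, not the plan's actual code: `validated_pid` is a hypothetical helper, and the OS lookup is injected as a callable so the sketch runs without psutil. With psutil installed, the callable would be something like `lambda pid: psutil.Process(pid).create_time()`, and the `except` clause would catch `psutil.NoSuchProcess`.

```python
from datetime import datetime, timezone
from typing import Callable, Optional

START_TIME_TOLERANCE = 5.0  # seconds; the tolerance named in this plan


def validated_pid(
    pid: int,
    started_at: datetime,
    get_create_time: Callable[[int], float],
) -> Optional[int]:
    """Return pid only if the OS process with that PID started when we recorded it."""
    try:
        os_start = get_create_time(pid)  # epoch seconds reported by the OS
    except LookupError:
        # Stand-in for psutil.NoSuchProcess: the PID no longer exists.
        return None
    if abs(os_start - started_at.timestamp()) > START_TIME_TOLERANCE:
        # Same PID but a different start time: the OS recycled the PID.
        return None
    return pid


recorded = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
same_process = lambda pid: recorded.timestamp() + 1.0     # within tolerance
recycled_pid = lambda pid: recorded.timestamp() + 3600.0  # started an hour later

print(validated_pid(4242, recorded, same_process))  # 4242
print(validated_pid(4242, recorded, recycled_pid))  # None
```

A kill() built on this check sends a signal only when a PID comes back, so a recycled PID is never targeted.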
Phase 3.3 now includes:
- Module-level _supervisord_db_process variable
- start_new_supervisord_process(): Create Process record after Popen
- stop_existing_supervisord_process(): Update Process status on shutdown
- Process hierarchy diagram showing CLI → supervisord → workers chain
Key insight: PPID-based linking works because workers call Process.current()
in on_startup(), which finds supervisord's Process via PPID lookup.
Phase 2 now includes line-by-line mapping of:
- run_hook(): Create Process record, use Process.launch(), parse
JSONL for child binary Process records
- process_is_alive(): Accept Path or Process, use Process.is_alive()
- kill_process(): Accept Path or Process, use Process.kill()
- ArchiveResult.run(): Pass self.process as parent_process to run_hook()
- ArchiveResult.update_from_output(): Read from Process.stdout/stderr
- Snapshot.cleanup(): Kill via Process model, fallback to PID files
- Snapshot.has_running_background_hooks(): Check via Process model
Hook JSONL contract updated to support {"type": "Process"} records
for tracking binary executions within hooks.
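As a rough illustration of consuming such records, the sketch below filters a hook's stdout for JSONL lines of a given type. The helper name `iter_hook_records` and the `cmd`, `exit_code`, and `status` fields are assumptions made for the example; only the `"type"` key comes from the contract described above.

```python
import json
from typing import Iterator


def iter_hook_records(stdout_text: str, record_type: str) -> Iterator[dict]:
    """Yield JSON objects of the given 'type' from mixed hook output."""
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain log output, not a JSONL record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than failing the hook
        if isinstance(record, dict) and record.get("type") == record_type:
            yield record


hook_stdout = """\
starting wget...
{"type": "Process", "cmd": ["wget", "https://example.com"], "exit_code": 0}
{"type": "ArchiveResult", "status": "succeeded"}
"""
processes = list(iter_hook_records(hook_stdout, "Process"))
print(len(processes))  # 1
```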
PIDs are recycled by the OS, so all Process queries now:
- Filter by machine=Machine.current() (PIDs unique per machine)
- Filter by started_at within PID_REUSE_WINDOW (24h)
- Validate start time matches OS via psutil.Process.create_time()
Added:
- ProcessManager.get_by_pid() for safe PID lookups
- Process.cleanup_stale_running() to mark orphaned RUNNING as EXITED
- START_TIME_TOLERANCE (5s) for start time comparison
- psutil.Process.create_time() as the source for an accurate started_at
Key addition: Process.current() class method (like Machine.current())
that auto-creates/retrieves the Process record for the current OS process.
Benefits:
- Uses PPID lookup to find parent Process automatically
- Detects process_type from sys.argv
- Cached with validation (like Machine.current())
- Eliminates need for thread-local context management
Simplified Phase 3 (workers) and Phase 4 (CLI) to just call
Process.current() instead of manual Process creation.
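The get-or-create and PPID-linking idea can be sketched with an in-memory registry standing in for the database table. Everything here (`ProcessRecord`, `_REGISTRY`, the argv-based type guess) is hypothetical and only illustrates the pattern, not the actual machine.Process model:

```python
import os
import sys
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory stand-in for the Process table, keyed by PID.
_REGISTRY: dict[int, "ProcessRecord"] = {}


@dataclass
class ProcessRecord:
    pid: int
    process_type: str
    parent: Optional["ProcessRecord"] = None

    @classmethod
    def current(cls) -> "ProcessRecord":
        """Get or create the record for this OS process, linking its parent by PPID."""
        pid = os.getpid()
        if pid in _REGISTRY:
            return _REGISTRY[pid]  # cached: repeated calls return the same record
        parent = _REGISTRY.get(os.getppid())  # None if the parent never registered
        # Crude guess from argv; the real detection rules are not shown here.
        process_type = "worker" if "worker" in sys.argv else "cli"
        record = cls(pid=pid, process_type=process_type, parent=parent)
        _REGISTRY[pid] = record
        return record


# Simulate a parent (e.g. supervisord) that registered itself before spawning us.
_REGISTRY[os.getppid()] = ProcessRecord(pid=os.getppid(), process_type="supervisord")
me = ProcessRecord.current()
print(me.parent.process_type)  # supervisord
```

The caller never passes a parent explicitly; the link falls out of the PPID lookup, which is why no thread-local context is needed.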
Documents 7-phase refactoring to use machine.Process as the core data
model for all subprocess management:
- Phase 1: Add parent FK and process_type to Process model
- Phase 2: Add lifecycle methods (launch, kill, poll, wait)
- Phase 3: Update hook system to create Process records
- Phase 4-5: Track workers/orchestrator/supervisord as Process
- Phase 6: Create root Process on CLI invocation
- Phase 7: Admin UI with tree visualization
Enables full process hierarchy tracking from CLI → binary execution.
# Changes these areas
- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [x] Snapshot data layout on disk
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Switch snapshot index storage from index.json to a flat index.jsonl
format for easier parsing and extensibility. Includes automatic
migration and backward-compatible reading, plus updated CLI pipeline to
emit/consume JSONL records.
- **New Features**
- Write and read index.jsonl with per-line records (Snapshot,
ArchiveResult, Binary, Process); reconcile prefers JSONL.
- Auto-convert legacy index.json to JSONL during migration/update;
load_from_directory/create_from_directory support both formats.
- Serialization moved to model to_jsonl methods; added schema_version to
all records, including Tag, Crawl, Binary, and Process.
- CLI pipeline updated: crawl creates a single Crawl job from all input
URLs and outputs Crawl JSONL (no immediate crawling); snapshot accepts
Crawl JSONL/IDs and outputs Snapshot JSONL; extract outputs
ArchiveResult JSONL via model methods.
- **Migration**
- Conversion runs during filesystem migration and reconcile; no manual
steps.
- Legacy index.json is deleted after conversion; external tools should
switch to index.jsonl.
<sup>Written for commit 251fe33e49.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
Changed from singular --plugin to plural --plugins in both snapshot and extract
commands to match the pattern in archivebox add command. Updated to accept
comma-separated plugin names (e.g., --plugins=screenshot,singlefile,title).
- Updated CLI option from --plugin to --plugins
- Added parsing for comma-separated plugin names
- Updated function signatures and logic to handle multiple plugins
- Updated help text, docstrings, and examples
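A minimal sketch of the comma-separated parsing, assuming whitespace is trimmed and empty entries are dropped (the helper name `parse_plugins` is illustrative, not the function used in the CLI code):

```python
def parse_plugins(value: str) -> list[str]:
    """Split a --plugins value like 'screenshot,singlefile,title' into names."""
    # Strip whitespace and drop empty entries so 'a, b,' still parses cleanly.
    return [name.strip() for name in value.split(",") if name.strip()]


print(parse_plugins("screenshot,singlefile,title"))  # ['screenshot', 'singlefile', 'title']
print(parse_plugins(""))                             # []
```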
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
The --plugins parameter was incorrectly renamed to --extract (boolean).
This restores --plugin (singular, matching extract command) with correct
semantics: specify which plugin to run after creating snapshots.
- Changed --extract/--no-extract back to --plugin (string parameter)
- Updated function signature and logic to use plugin parameter
- Added ArchiveResult creation for specific plugin when --plugin is passed
- Updated docstring and examples
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- archivebox crawl now creates one Crawl with all URLs as newline-separated string
- Updated tests to reflect new pipeline: crawl -> snapshot -> extract
- Added tests for Crawl JSONL parsing and output
- Tests verify Crawl.from_jsonl() handles multiple URLs correctly
Implement a sleek inline tag editor with autocomplete and AJAX support:
- Create TagEditorWidget and InlineTagEditorWidget in core/widgets.py
- Pills display with X remove button, sorted alphabetically
- Text input with HTML5 datalist autocomplete
- Enter/Space/Comma to add tags; auto-creates the tag if it doesn't exist
- Backspace removes last tag when input is empty
- Add API endpoints in api/v1_core.py
- GET /tags/autocomplete/ - search tags by name
- POST /tags/create/ - get_or_create tag
- POST /tags/add-to-snapshot/ - add tag to snapshot via AJAX
- POST /tags/remove-from-snapshot/ - remove tag from snapshot
- Update admin_snapshots.py
- Replace FilteredSelectMultiple with TagEditorWidget in bulk actions
- Create SnapshotAdminForm with tags_editor field
- Update title_str() to render inline tag editor in list view
- Remove TagInline, use widget instead
- Add CSS styles in templates/admin/base.html
- Blue gradient pill styling matching admin theme
- Focus ring and hover states
- Compact inline variant for list view
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Implemented a new interactive tags editor for Django admin with
autocomplete and AJAX add/remove, replacing the old multi-select and
inline. This makes tagging snapshots faster and safer in detail, list,
and bulk actions.
- **New Features**
- TagEditorWidget and InlineTagEditorWidget with pill UI and remove
buttons, XSS-safe rendering, and delegated events.
- Keyboard support: Enter/Space/Comma to add, Backspace to remove last
when input is empty.
- Datalist autocomplete and debounced search via GET
/tags/autocomplete/.
- AJAX endpoints: POST /tags/create/, /tags/add-to-snapshot/,
/tags/remove-from-snapshot/.
- **Refactors**
- Replaced FilteredSelectMultiple with TagEditorWidget in bulk actions;
parse comma-separated tags and use bulk_create/delete for efficient
add/remove.
- Added SnapshotAdminForm with tags_editor field; saves tags
case-insensitively and fixes remove_tags matching.
- Rendered inline tag editor in list view via title_str; removed
TagInline.
- Added CSS in admin/base.html for pill styling, focus ring, and compact
inline variant.
<sup>Written for commit 0dee662f41.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
- Add Tag.to_jsonl() method with schema_version
- Add Crawl.to_jsonl() method with schema_version
- Fix Tag.from_jsonl() to not depend on jsonl.py helper
- Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot
Remove model-specific functions from misc/jsonl.py:
- tag_to_jsonl() - use Tag.to_jsonl() instead
- crawl_to_jsonl() - use Crawl.to_jsonl() instead
- get_or_create_tag() - use Tag.from_jsonl() instead
- process_jsonl_records() - use model from_jsonl() methods directly
jsonl.py now only contains generic I/O utilities:
- Type constants (TYPE_SNAPSHOT, etc.)
- parse_line(), read_stdin(), read_file(), read_args_or_stdin()
- write_record(), write_records()
- filter_by_type(), process_records()
- add_tags: uses SnapshotTag.objects.bulk_create() with ignore_conflicts,
  making 1 query per tag instead of N calls to obj.tags.add()
- remove_tags: uses a single SnapshotTag.objects.filter().delete(),
  making 1 query total instead of N calls to obj.tags.remove()
Works correctly with "select all across pages" via queryset.values_list()
- Fix case-sensitivity mismatch in remove_tags (use name__iexact)
- Fix XSS vulnerability by removing onclick attributes
- Use data attributes and event delegation instead
- Apply DOM APIs to prevent injection attacks
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
Move JSONL serialization from standalone functions to model methods
to mirror the from_jsonl() pattern:
- Add Binary.to_jsonl() method
- Add Process.to_jsonl() method
- Add ArchiveResult.to_jsonl() method
- Add Snapshot.to_jsonl() method
- Update write_index_jsonl() to use model methods
- Update jsonl.py functions to be thin wrappers
Switch from hierarchical index.json to flat index.jsonl format for
snapshot metadata storage. Each line is a self-contained JSON record
with a 'type' field (Snapshot, ArchiveResult, Binary, Process).
Changes:
- Add JSONL_INDEX_FILENAME constant to constants.py
- Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants
- Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters
- Add Snapshot.write_index_jsonl() to write new format
- Add Snapshot.read_index_jsonl() to read new format
- Add Snapshot.convert_index_json_to_jsonl() for migration
- Update Snapshot.reconcile_with_index() to handle both formats
- Update fs_migrate to convert during filesystem migration
- Update load_from_directory/create_from_directory for both formats
- Update legacy.py parse_json_links_details for JSONL support
The new format is easier to parse, extend, and mix record types.
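To illustrate why the flat format is easy to work with, here is a small round-trip sketch. The helper names and the `url`, `plugin`, and `status` fields are assumptions for the example; only the per-line layout and the `type` field come from the description above.

```python
import json
from collections import defaultdict


def dump_index_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def load_index_jsonl(text: str) -> dict[str, list[dict]]:
    """Group the records of an index.jsonl file by their 'type' field."""
    by_type: dict[str, list[dict]] = defaultdict(list)
    for line in text.splitlines():
        if line.strip():  # tolerate blank lines
            record = json.loads(line)
            by_type[record["type"]].append(record)
    return dict(by_type)


records = [
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "ArchiveResult", "plugin": "title", "status": "succeeded"},
    {"type": "ArchiveResult", "plugin": "screenshot", "status": "failed"},
]
index = load_index_jsonl(dump_index_jsonl(records))
print(sorted(index))                # ['ArchiveResult', 'Snapshot']
print(len(index["ArchiveResult"]))  # 2
```

Each line parses independently, so a reader can skip record types it does not know about, and new types can be added without changing the file's structure.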
## Summary
Resolves #1484, where the CUSTOM_TEMPLATES_DIR configuration was being
ignored.
The setting was previously removed from ServerConfig and hardcoded as a
constant, preventing users from customizing the templates directory
location.
## Changes
- Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py
- Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR
- Updated paths.py to use configurable value in version output
## Usage
Users can now configure the custom templates directory via:
- ArchiveBox.conf: `CUSTOM_TEMPLATES_DIR = ./custom_templates`
- Environment variable: `export CUSTOM_TEMPLATES_DIR=/path/to/templates`
- Defaults to DATA_DIR/user_templates if not configured
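The lookup order can be sketched as below. This is not the StorageConfig implementation: the function name is hypothetical, and two behaviours are assumptions made for the example, namely that the environment variable takes precedence over the ArchiveBox.conf value and that relative paths are resolved against DATA_DIR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_custom_templates_dir(
    data_dir: Path,
    conf_value: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick the templates dir: env var, then config file value, then the default."""
    # Assumption: the environment variable overrides the config file value.
    raw = env.get("CUSTOM_TEMPLATES_DIR") or conf_value
    if not raw:
        return data_dir / "user_templates"
    path = Path(raw)
    # Assumption: relative values are resolved against DATA_DIR.
    return path if path.is_absolute() else data_dir / path


data_dir = Path("/data")
default_dir = resolve_custom_templates_dir(data_dir, env={})
from_conf = resolve_custom_templates_dir(data_dir, conf_value="./custom_templates", env={})
from_env = resolve_custom_templates_dir(
    data_dir, env={"CUSTOM_TEMPLATES_DIR": "/path/to/templates"}
)
print(default_dir, from_conf, from_env)
```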
Generated with [Claude Code](https://claude.ai/code)
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Restores CUSTOM_TEMPLATES_DIR configurability so users can override the
templates directory. Fixes issue #1484 and updates the app to
consistently use the configured path.
- **Bug Fixes**
- Added CUSTOM_TEMPLATES_DIR to StorageConfig.
- Updated settings.py and paths.py to read
STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR.
- **Migration**
- Configure via ArchiveBox.conf or the CUSTOM_TEMPLATES_DIR env var.
- Defaults to DATA_DIR/user_templates if not set.
<sup>Written for commit 329d185d95.
Summary will update automatically on new commits.</sup>
<!-- End of auto-generated description by cubic. -->