Commit Graph

50 Commits

Author SHA1 Message Date
Nick Sweeting
e26a0f6fc0 Fix hook file overwrites in plugin directory (#1732)
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each
hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout,
stderr, pid, and cmd filenames. This fixes mixed logs and ensures
correct cleanup and status checks when multiple hooks run in the same
plugin directory.

- **Bug Fixes**
- hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude
them from new_files; derive cmd.sh from pid for safe kill.
- core/models.py: read hook-specific logs; exclude hook output files
when computing outputs; cleanup and background detection use *.pid.
- Plugins: stop writing redundant hook.pid files; minor chrome utils
cleanup.

<sup>Written for commit 754b096193.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-30 23:36:09 -08:00
Nick Sweeting
dac6c63bba working extension tests 2025-12-30 18:30:16 -08:00
Nick Sweeting
42d3fb7025 extension test fixes 2025-12-30 18:28:14 -08:00
Claude
754b096193 use hook-specific filenames to avoid overwrites
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now
each hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files
2025-12-31 02:00:15 +00:00
Nick Sweeting
dd2302ad92 new jsonl cli interface 2025-12-30 16:12:53 -08:00
Nick Sweeting
08366cfa46 document chrome configs 2025-12-30 12:42:50 -08:00
Nick Sweeting
80f75126c6 more fixes 2025-12-29 21:03:05 -08:00
Nick Sweeting
64dccb7a19 passing 2025-12-29 18:55:57 -08:00
Nick Sweeting
5549a79869 more speed fixes 2025-12-29 18:55:37 -08:00
Nick Sweeting
abf5f44134 more debug logging 2025-12-29 18:53:52 -08:00
Nick Sweeting
bcf0513d05 more debug logging 2025-12-29 18:50:04 -08:00
Nick Sweeting
7e6e3be9e7 messing with chrome install process to reuse cached chromium with pinned version 2025-12-29 18:49:36 -08:00
Nick Sweeting
b670612685 centralize chrome pid and zombie logic in chrome_utils 2025-12-29 17:57:23 -08:00
Nick Sweeting
4ba3e8d120 fix extension loading and consolidate chromium logic 2025-12-29 17:47:37 -08:00
Nick Sweeting
638b3ba774 add modalcloser plugin 2025-12-29 14:36:15 -08:00
Nick Sweeting
8c69124935 make infiniscroll plugin also expand details and comments sections 2025-12-29 13:55:27 -08:00
Nick Sweeting
b649db5294 fix infiniscroll plugin 2025-12-29 13:55:26 -08:00
Nick Sweeting
690f0669cd remove uneeded test 2025-12-29 13:30:25 -08:00
Nick Sweeting
73e977ea97 ytdlp fixes 2025-12-29 13:26:50 -08:00
Nick Sweeting
967c5d53e0 make plugin config more consistent 2025-12-29 13:21:46 -08:00
Nick Sweeting
8d76b2b0c6 add infiniscroll plugin 2025-12-29 13:14:40 -08:00
Claude
ac64c77341 move default yt-dlp args to config.json YTDLP_ARGS for user override
- Move hardcoded default args from Python to config.json YTDLP_ARGS
- Add get_ytdlp_args() function to read from YTDLP_ARGS env var
- Keep format arg with max_size in code (depends on YTDLP_MAX_SIZE)
- YTDLP_ARGS can be overridden as JSON array in environment
2025-12-29 19:38:37 +00:00
Claude
a5654e877f rename media plugin to ytdlp with backwards-compatible aliases
- Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/
- Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py
- Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases
- Update templates CSS classes: media-* → ytdlp-*
- Fix gallerydl bug: remove incorrect dependency on media plugin output
- Update all codebase references to use YTDLP_* and SAVE_YTDLP
- Add backwards compatibility test for MEDIA_ENABLED alias
2025-12-29 19:09:05 +00:00
Nick Sweeting
30c60eef76 much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533 use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
1e4d3ffd11 improve plugin tests and config 2025-12-29 00:45:23 -08:00
Nick Sweeting
f0aa19fa7d wip 2025-12-28 17:51:54 -08:00
Claude
1b5a816022 Implement hook step-based concurrency system
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
2025-12-28 13:47:25 +00:00
Nick Sweeting
4ccb0863bb continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script 2025-12-28 05:29:24 -08:00
Nick Sweeting
bd265c0083 rename extractor to plugin everywhere 2025-12-28 04:43:15 -08:00
Nick Sweeting
50e527ec65 way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
b632894bc9 Update views, API, and exports for new ArchiveResult output fields
Replace old `output` field with new fields across the codebase:
- output_str: Human-readable output summary
- output_json: Structured metadata (optional)
- output_files: Dict of output files with metadata
- output_size: Total size in bytes
- output_mimetypes: CSV of file mimetypes

Files updated:
- api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields
- api/v1_core.py: Update ArchiveResultFilterSchema to search output_str
- cli/archivebox_extract.py: Use output_str in CLI output
- core/admin_archiveresults.py: Update admin fields, search, and fieldsets
- core/admin_archiveresults.py: Fix output_html variable name bug in output_summary
- misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields
- plugins/extractor_utils.py: Update ExtractorResult helper class

The embed_path() method already uses output_files and output_str,
so snapshot detail page and template tags work correctly.
2025-12-27 20:28:22 +00:00
Claude
e3ba599812 Update install hooks to respect XYZ_BINARY env vars
- All install hooks now respect their respective XYZ_BINARY env vars
  (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.)
- Support both absolute paths (/usr/bin/wget2) and binary names (wget2)
- Dynamic bin_name used in Dependency JSONL output
- Updated 11 install hooks to follow the new pattern
- Mark checklist items as complete in TODO_hook_architecture.md
2025-12-27 10:12:45 +00:00
Claude
8c846b7d1c Rename validate hooks to install hooks
- Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_*
- This better reflects what these hooks actually do (check/install binaries)
- Update TODO_hook_architecture.md to reflect renamed hooks
2025-12-27 10:06:34 +00:00
Claude
2623c6cc11 Complete JS hooks to clean JSONL format + rename background hooks
- Update 12 remaining JS snapshot hooks to output clean JSONL
- Remove RESULT_JSON= prefix, START_TS=, END_TS=, STATUS= output
- Rename 3 background hooks with .bg. suffix:
  - consolelog -> on_Snapshot__21_consolelog.bg.js
  - ssl -> on_Snapshot__23_ssl.bg.js
  - responses -> on_Snapshot__24_responses.bg.js
- Update TODO_hook_architecture.md with completion status
2025-12-27 09:46:59 +00:00
Claude
c52eef1459 Update Python/JS hooks to clean JSONL format + add audit report
Phase 4 Plugin Audit Progress:
- Audited all 6 Dependency hooks (all already compliant)
- Audited all 11 Crawl Validate hooks (all already compliant)
- Updated 8 Python Snapshot hooks to clean JSONL format
- Updated 1 JS Snapshot hook (title.js) to clean JSONL format

Snapshot hooks updated to remove:
- RESULT_JSON= prefix
- Extra output lines (START_TS=, END_TS=, DURATION=, VERSION=, OUTPUT=, STATUS=)

Now output clean JSONL:
{"type": "ArchiveResult", "status": "...", "output_str": "..."}

Added implementation report to TODO_hook_architecture.md documenting:
- All completed phases (1, 3, 6, 7)
- Plugin audit results with status tables
- Remaining 13 JS hooks that need updating
- Files modified list
2025-12-27 09:31:03 +00:00
Claude
741c098a2b Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh 2025-12-27 05:53:06 +00:00
Nick Sweeting
2f81c0cc76 add overrides options to binproviders 2025-12-26 20:39:56 -08:00
Nick Sweeting
9bc5d99488 add overrides options to binproviders 2025-12-26 20:16:58 -08:00
Claude
13be196fd7 Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh
# Conflicts:
#	pyproject.toml
2025-12-27 02:27:51 +00:00
Nick Sweeting
6fdc52cc57 add papersdl plugin 2025-12-26 18:25:52 -08:00
Nick Sweeting
e2cbcd17f6 more tests and migrations fixes 2025-12-26 18:22:48 -08:00
Claude
0941aca4a3 Improve test suite: remove mocks and add 0.8.x migration tests
- Remove mock-based tests from plugin tests (headers, singlefile, ublock, captcha2)
- Replace fake cache tests with real double-install tests that verify cache behavior
- Add SCHEMA_0_8 and seed_0_8_data() for testing 0.8.x data directory migrations
- Add TestMigrationFrom08x class with comprehensive migration tests:
  - Snapshot count preservation
  - Crawl record preservation
  - Snapshot-to-crawl relationship preservation
  - Tag preservation
  - ArchiveResult status preservation
  - CLI command verification after migration
- Add more CLI tests for add command (tags, multiple URLs, file input)
- All tests now use real functionality without mocking
2025-12-26 23:01:49 +00:00
Nick Sweeting
0fbcbd2616 gallerydl template 2025-12-26 11:55:19 -08:00
Nick Sweeting
4fd7fcdbcf new gallerydl plugin and more 2025-12-26 11:55:03 -08:00
Nick Sweeting
9838d7ba02 tons of ui fixes and plugin fixes 2025-12-25 03:59:51 -08:00
Nick Sweeting
866f993f26 logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186 remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81 wip major changes 2025-12-24 20:10:38 -08:00