Fixes #1445
This PR resolves the issue where SingleFile was not respecting Chrome
user data directory and other Chrome launch options that work for other
Chrome-based extractors (PDF, Screenshot, etc.).
## Changes
- Added `SINGLEFILE_CHROME_ARGS` config option with fallback to
`CHROME_ARGS`
- Updated SingleFile extractor to pass Chrome arguments via
`--browser-args`
- Updated documentation
This ensures SingleFile respects the same Chrome configuration as other
Chrome-based extractors.
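The fallback described above could look roughly like the following sketch. It assumes the config is available as a plain dict and that `--browser-args` accepts a JSON-encoded list; the helper names `resolve_chrome_args` and `build_singlefile_cmd` are hypothetical and are not taken from the ArchiveBox codebase.

```python
import json


def resolve_chrome_args(config: dict) -> list:
    """Prefer SINGLEFILE_CHROME_ARGS; fall back to CHROME_ARGS when it is unset."""
    args = config.get("SINGLEFILE_CHROME_ARGS")
    if args is None:
        args = config.get("CHROME_ARGS", [])
    return list(args)


def build_singlefile_cmd(binary: str, url: str, output: str, config: dict) -> list:
    """Build an argv list for SingleFile (hypothetical helper, for illustration)."""
    cmd = [binary]
    chrome_args = resolve_chrome_args(config)
    if chrome_args:
        # Assumption: the CLI expects the browser arguments as a JSON array string.
        cmd.append("--browser-args=" + json.dumps(chrome_args))
    cmd += [url, output]
    return cmd
```

With only `CHROME_ARGS` set, SingleFile would receive the same arguments as the other Chrome-based extractors; setting `SINGLEFILE_CHROME_ARGS` overrides them for SingleFile alone.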
Generated with [Claude Code](https://claude.ai/code)
Show small thumbnails of recently completed ArchiveResult content in the
progress header. The thumbnail strip appears below the stats bar and
shows the last 20 successfully archived items with embeddable content
(screenshots, favicons, DOM snapshots, etc.).
Features:
- API returns recent_thumbnails with embed paths for succeeded results
- Thumbnails display with plugin-specific icons as fallback
- New thumbnails animate in with a pop effect
- Clicking a thumbnail navigates to the snapshot admin page
- Horizontal scrollable strip with custom scrollbar styling
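As a rough illustration of how the `recent_thumbnails` payload might be assembled, here is a minimal sketch. The `Result` dataclass is a hypothetical stand-in for ArchiveResult, and the field names (`embed_path`, `end_ts`) and the returned dict shape are assumptions rather than the real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Hypothetical stand-in for an ArchiveResult row."""
    snapshot_id: str
    plugin: str
    status: str
    embed_path: Optional[str]
    end_ts: float


def recent_thumbnails(results, limit=20):
    """Return the newest succeeded results that have embeddable output."""
    ok = [r for r in results if r.status == "succeeded" and r.embed_path]
    ok.sort(key=lambda r: r.end_ts, reverse=True)  # newest first
    return [
        {"snapshot_id": r.snapshot_id, "plugin": r.plugin, "embed_path": r.embed_path}
        for r in ok[:limit]
    ]
```

The `plugin` field is kept in each entry so the frontend can pick a plugin-specific icon when no preview image is available.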
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Adds a thumbnail strip to the live progress header. It shows previews of
the last 20 successful archived items for quick visual feedback and
one-click navigation.
- **New Features**
  - API returns recent_thumbnails with embed paths for succeeded results.
  - Horizontal, scrollable thumbnail strip under the header.
  - Uses preview images when available; plugin icons as fallback.
  - New thumbnails animate in with a pop effect.
  - Clicking a thumbnail opens the snapshot admin page.
<sup>Written for commit 17029ba8b8.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
…nstall
- Delete chrome/on_Crawl__10_chrome_validate.py (duplicates
chrome_install)
- Rename wget/on_Crawl__11_wget_validate.py →
on_Crawl__06_wget_install.py
All hooks now follow consistent naming: install, launch, or config
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Removed the redundant Chrome validate hook, renamed the Wget validate
hook to wget_install, and standardized hook names and priorities to
match the install/launch/config lifecycle. This removes duplicate logic
and fixes priority conflicts across Crawl, Binary, and Snapshot hooks.
- **Refactors**
  - Deleted chrome/on_Crawl__10_chrome_validate.py (dup of chrome_install)
  - Renamed wget validate to on_Crawl__06_wget_install.py
  - Standardized on_Binary hook priorities: npm 10, pip 11, brew 12, apt
    13, custom 14, env 15
  - Fixed on_Snapshot order: staticfile 32, readability 56, mercury 57,
    htmltotext 58
<sup>Written for commit 09a1ca3134.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
Deleted dead/duplicate hooks:
- wget/on_Crawl__10_install_wget.py (duplicate of
__10_wget_validate_config.py)
- chrome/on_Crawl__00_chrome_install.py (simpler version, kept full one)
- chrome/on_Crawl__20_chrome_launch.bg.js (legacy, kept __30 version)
- singlefile/on_Crawl__20_install_singlefile_extension.js
(disabled/dead)
- istilldontcareaboutcookies/on_Crawl__20_install_*.js (legacy)
- ublock/on_Crawl__03_ublock.js (legacy, kept __20 version)
- Entire captcha2/ plugin (legacy version of twocaptcha/)
Renamed hooks to follow a consistent pattern:
`on_Crawl__XX_<plugin>_<action>.<ext>`
Priority bands:
- 00-09: Binary/extension installation
- 10-19: Config validation
- 20-29: Browser launch and post-launch config
Final hooks:
- 00 ripgrep_install.py
- 01 chrome_install.py
- 02 istilldontcareaboutcookies_install.js
- 03 ublock_install.js
- 04 singlefile_install.js
- 05 twocaptcha_install.js
- 10 chrome_validate.py
- 11 wget_validate.py
- 20 chrome_launch.bg.js
- 25 twocaptcha_config.js
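The naming pattern and priority bands above can be checked mechanically. The sketch below is an illustration only: `parse_hook` and `band_for` are hypothetical helpers, and the regex assumes plugin and action names are separated by the last underscore before the extension.

```python
import re

# on_Crawl__XX_<plugin>_<action>.<ext>, where <ext> may carry a "bg." prefix
HOOK_RE = re.compile(r"on_Crawl__(\d{2})_(.+)_([a-z]+)\.((?:bg\.)?[a-z]+)")


def band_for(priority: int) -> str:
    """Map a two-digit priority to the band listed above."""
    if 0 <= priority <= 9:
        return "install"
    if 10 <= priority <= 19:
        return "validate"
    if 20 <= priority <= 29:
        return "launch/config"
    return "unknown"


def parse_hook(filename: str):
    """Split a Crawl hook filename into its parts, or return None if it does not match."""
    match = HOOK_RE.fullmatch(filename)
    if match is None:
        return None
    priority = int(match.group(1))
    return {
        "priority": priority,
        "plugin": match.group(2),
        "action": match.group(3),
        "ext": match.group(4),
        "band": band_for(priority),
    }
```

A check like this could run in CI to flag hooks whose action name and priority band disagree, though nothing in this PR says such a check exists.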
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Cleaned up Crawl-level hooks by removing legacy/duplicate code and
standardizing hook names and priorities. Chrome launch is now a single,
updated hook with better extension detection and cleaner outputs.
- **Refactors**
  - Removed dead hooks (legacy chrome install/launch, singlefile
    extension, old ublock/cookies scripts, duplicate wget validate) and the
    legacy captcha2 plugin in favor of twocaptcha.
  - Renamed hooks to `on_Crawl__XX_<plugin>_<action>` with priority bands:
    00-09 install, 10-19 validate, 20-29 launch/config.
  - Consolidated Chrome launch into on_Crawl__20_chrome_launch.bg.js;
    writes outputs to the current dir, resolves real extension IDs via
    chrome://extensions, and records extensions.json after verification.
- **Migration**
  - If you used captcha2, switch to the twocaptcha hooks
    (on_Crawl__05_twocaptcha_install.js and
    on_Crawl__25_twocaptcha_config.js).
  - Update any docs/scripts that reference old hook filenames.
<sup>Written for commit 4c77949197.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
- Add assertIsNotNone for accessibility_data so the test fails if no data is generated
- Capture and report JSON decode errors in parse_dom_outlinks test
- Add assertIsNotNone for outlinks_data with error details
- Remove conditional checks that allowed tests to pass without verifying functionality
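The pattern behind these changes can be sketched as follows. This is not the actual test code: `first_json_record`, the `OutlinksCheck` class, and the `hrefs` key are hypothetical, chosen only to show an unconditional `assertIsNotNone` that reports decode errors.

```python
import json
import unittest


def first_json_record(output: str):
    """Return (record, errors): the first parseable JSON object line and any decode errors."""
    errors = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            return json.loads(line), errors
        except json.JSONDecodeError as exc:
            errors.append(f"{exc}: {line[:80]}")
    return None, errors


class OutlinksCheck(unittest.TestCase):
    def check(self, output: str):
        data, errors = first_json_record(output)
        # Fail loudly instead of skipping assertions when nothing was parsed.
        self.assertIsNotNone(data, f"no outlinks data parsed; decode errors: {errors}")
        self.assertIn("hrefs", data)  # hypothetical key, for illustration
```

Compared with wrapping the assertions in `if data:`, this form cannot pass when the plugin produces no output at all.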
Addresses review comments from cubic-dev-ai
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- test_seo.py: Add assertIsNotNone before conditional to catch SEO extraction failures
- test_ssl.py: Add assertIsNotNone to ensure SSL data is captured from HTTPS URLs
- test_pip_provider.py: Assert jsonl_found variable to verify binary discovery
- dns plugin: Deduplicate NXDOMAIN records using seenResolutions map
Tests now fail when functionality doesn't work (no cheating).
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
Add input validation and path safety checks to prevent path traversal
attacks in persona name handling:
- Add validate_persona_name() to block dangerous characters (/, \, .., etc)
- Add ensure_path_within_personas_dir() to verify resolved paths stay within PERSONAS_DIR
- Apply validation at persona creation, renaming, and deletion operations
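A minimal sketch of what these two checks might look like is shown below. The function names come from the description above, but the signatures, the exact set of rejected characters, and the error type are assumptions. `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path


def validate_persona_name(name: str) -> str:
    """Reject names that are empty or could escape the personas directory."""
    if not name or name != name.strip():
        raise ValueError("persona name must be non-empty without surrounding whitespace")
    if name in (".", "..") or any(bad in name for bad in ("/", "\\", "..", "\x00")):
        raise ValueError(f"persona name contains forbidden characters: {name!r}")
    return name


def ensure_path_within_personas_dir(personas_dir: Path, name: str) -> Path:
    """Resolve personas_dir/name and verify the result stays strictly inside personas_dir."""
    base = personas_dir.resolve()
    candidate = (base / name).resolve()
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError(f"path escapes personas dir: {candidate}")
    return candidate
```

Running both checks is deliberate in this sketch: the name filter gives a clear error early, and the resolved-path check still catches cases such as symlinks that the filter cannot see.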
Fixes security issues identified by cubic-dev-ai in PR review.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- merkletree: Tests merkle tree generation with real files,
empty directory handling, and disabled mode
- custom: Tests custom bash command execution and binary discovery
Real integration tests using Chrome sessions with example.com:
- accessibility: Tests page outline and accessibility tree extraction
- parse_dom_outlinks: Tests link extraction and categorization
- consolelog: Tests console output capture
The assertion checked `has_seo_data or seo_data` inside an `if seo_data:` block,
so it was always truthy. It now checks only `has_seo_data`, which verifies
that the expected SEO keys were actually extracted.
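The following small example shows why the old expression could never fail. The key names and the `has_expected_keys` helper are hypothetical; only the shape of the two checks mirrors the description above.

```python
def has_expected_keys(seo_data: dict, expected=("title", "description")) -> bool:
    """True if at least one expected SEO key is present (key names are illustrative)."""
    return any(key in seo_data for key in expected)


# Extraction returned something, but none of the expected keys.
seo_data = {"unrelated": "value"}
has_seo_data = has_expected_keys(seo_data)

# Old check: inside `if seo_data:` the right operand is already truthy.
old_check = bool(has_seo_data or seo_data)

# New check: depends only on whether expected keys were found.
new_check = has_seo_data
```

Here `old_check` evaluates truthy even though no expected key was extracted, while `new_check` is falsy and would fail the test.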
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Add `archivebox persona create/list/update/delete` commands
- Support `--import=chrome|firefox|brave` to copy browser profile
- Extract cookies via CDP to generate cookies.txt for non-browser tools
- Fix JSDoc comment parsing issue in chrome_utils.js
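For the cookie export step in the list above, a conversion from CDP cookie objects to a Netscape-format cookies.txt might look like this sketch. The function name is hypothetical; the input fields (`name`, `value`, `domain`, `path`, `expires`, `httpOnly`, `secure`) follow the CDP `Network.Cookie` type, and the `#HttpOnly_` prefix follows the convention curl uses.

```python
def cdp_cookies_to_netscape(cookies) -> str:
    """Render CDP cookie dicts as Netscape cookies.txt text (hypothetical helper)."""
    lines = ["# Netscape HTTP Cookie File"]
    for cookie in cookies:
        domain = cookie["domain"]
        include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
        secure = "TRUE" if cookie.get("secure") else "FALSE"
        expires = cookie.get("expires", -1)
        # Session cookies have no positive expiry; write 0 for them.
        expiry = str(int(expires)) if expires and expires > 0 else "0"
        prefix = "#HttpOnly_" if cookie.get("httpOnly") else ""
        lines.append("\t".join([
            prefix + domain,
            include_subdomains,
            cookie.get("path", "/"),
            secure,
            expiry,
            cookie["name"],
            cookie["value"],
        ]))
    return "\n".join(lines) + "\n"
```

A file in this format can be passed to non-browser tools that accept a cookies.txt, which is the use case the list item describes.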
Import shared utilities (getEnv, getEnvBool, getEnvInt) from
chrome_utils.js instead of duplicating them locally.
Also use DNS_TIMEOUT config for dynamic timeout calculations.
Records hostname → IP resolutions during page load using Chrome CDP.
Uses Network.responseReceived events to capture DNS resolution data
and writes one JSON line per record to dns.jsonl.
Features:
- Captures hostname to IP address mappings (A/AAAA records)
- Records failed DNS lookups (NXDOMAIN)
- Deduplicates resolution records per page load
- Integrates with existing Chrome plugin infrastructure
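The deduplication and JSONL output described above could be sketched like this. The plugin itself is a Chrome hook; this Python version only illustrates the logic. The record fields (`hostname`, `ip`, `type`) are assumptions, the input is a list of `Network.responseReceived` params as dicts, and failed lookups (NXDOMAIN) are not covered here.

```python
import json
from urllib.parse import urlsplit


def dns_records(events):
    """Yield one record per unique (hostname, ip) pair seen in responseReceived events."""
    seen = set()
    for event in events:
        response = event.get("response", {})
        hostname = urlsplit(response.get("url", "")).hostname
        ip = (response.get("remoteIPAddress") or "").strip("[]")  # IPv6 may be bracketed
        if not hostname or not ip:
            continue
        key = (hostname, ip)
        if key in seen:
            continue  # already recorded during this page load
        seen.add(key)
        yield {"hostname": hostname, "ip": ip, "type": "AAAA" if ":" in ip else "A"}


def to_jsonl(records) -> str:
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)
```

Keying the `seen` set on the (hostname, ip) pair keeps multiple addresses for one host while dropping repeats caused by many requests to the same origin.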
- Add real integration tests for SSL, redirects, and SEO plugins
using Chrome session helpers for live URL testing
- Remove fake "format" tests that just created dicts and asserted on them
(apt, pip, npm provider output format tests)
- Remove npm integration test that created dirs then checked they existed
- Fix SQLite search test to use SQLITEFTS_DB constant instead of hardcoded value
- Remove pid_utils tests (module deleted in dev)
- Update orchestrator tests to use Process model for tracking
- Add tests for Process.current(), cleanup_stale_running(), terminate()
- Add tests for Process hierarchy (parent/child, root, depth)
- Add tests for Process.get_running(), get_running_count()
- Add tests for ProcessMachine state machine
- Update machine model tests to match current API (from_jsonl vs from_json)
- Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation
- Fix worker process_type detection: explicitly set to WORKER after registration
- Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently
- Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN)
- Fix get_running_workers() to return process_id instead of incorrectly named worker_id
- Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed
- Fix migration to include all indexes in state_operations (parent_id, process_type)
- Fix documentation to use Machine.current() scoping and StatusChoices constants
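The terminate-then-escalate behaviour mentioned for `safe_kill_process()` in the list above can be illustrated with the standard library. This is a sketch under assumptions: the real function's signature is not shown in this PR, so `terminate_with_escalation` is a hypothetical stand-in that works on a `subprocess.Popen` handle.

```python
import subprocess


def terminate_with_escalation(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the process to stop, wait for it, and force-kill it if it does not exit in time."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()
```

Waiting after each signal matters: returning right after sending SIGTERM would report success while the process may still be running.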
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Add prominent view mode switcher with List/Grid toggle buttons
- Improve filter sidebar CSS with modern styling, rounded corners
- Add live progress bar for in-progress snapshots showing hooks status
- Show plugin icons only when output directory has content
- Display archive result output_size sum from new field
- Show hooks succeeded/total count in size column
- Add get_progress_stats() method to Snapshot model
- Add CSS for progress spinner and status badges
- Update grid view template with progress indicator for archiving cards
- Add tests for admin views, search, and progress stats
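To make the progress numbers in the list above concrete, here is one way a `get_progress_stats()` helper could aggregate hook results. The real method lives on the Snapshot model; this standalone version, its input shape (dicts with `status` and `output_size`), and its return keys are assumptions for illustration.

```python
def get_progress_stats(results) -> dict:
    """Summarize hook results: counts, percent finished, and total output size."""
    total = len(results)
    succeeded = sum(1 for r in results if r["status"] == "succeeded")
    failed = sum(1 for r in results if r["status"] == "failed")
    finished = succeeded + failed
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": failed,
        "percent": round(100 * finished / total) if total else 0,
        "output_size": sum(r.get("output_size") or 0 for r in results),
    }
```

The `succeeded` and `total` values correspond to the "hooks succeeded/total" count shown in the size column, and `output_size` to the summed ArchiveResult sizes.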
- Add pwd validation in Process.launch() to prevent crashes
- Fix psutil returncode handling (use wait() return value, not returncode attr)
- Add None check for proc.pid in cleanup_stale_running()
- Add stale process cleanup in Orchestrator.is_running()
- Ensure orchestrator process_type is correctly set to ORCHESTRATOR
- Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown)
- Throttle cleanup_stale_running() to once per 30 seconds for performance
- Fix worker process_type to use TypeChoices.WORKER consistently
- Fix get_running_workers() API to return list of dicts (not Process objects)
- Only delete PID files after successful kill or confirmed stale
- Fix migration index names to match between SQL and Django state
- Remove db_index=True from process_type (index created manually)
- Update documentation to reflect actual implementation
- Add explanatory comments to empty except blocks
- Fix exit codes to use Unix convention (128 + signal number)
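The Unix exit-code convention referenced in the last item can be sketched as below. Both helper names are hypothetical; the second one relies on the documented `subprocess` behaviour that a negative return code `-N` means the child was killed by signal `N` on POSIX.

```python
import signal


def exit_code_for_signal(signum: int) -> int:
    """Shell convention: a process ended by signal N reports 128 + N."""
    return 128 + int(signum)


def shell_style_exit_code(returncode: int) -> int:
    """Convert a subprocess return code (negative = killed by signal) to the shell convention."""
    if returncode < 0:
        return exit_code_for_signal(-returncode)
    return returncode
```

For example, `exit_code_for_signal(signal.SIGINT)` gives 130 and a child that reported `-15` maps to 143.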
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Fix conftest.py: use subprocess for init, remove unused cli_env fixture
- Update all test files to use data_dir parameter instead of env
- Remove mock-based TestJSONLOutput class from tests_piping.py
- Remove unused imports (MagicMock, patch)
- Fix file permissions for cli_utils.py
All tests now use real subprocess calls per CLAUDE.md guidelines:
- NO MOCKS - tests exercise real code paths
- NO SKIPS - every test runs