Fixes #1445
This PR resolves an issue where SingleFile ignored the Chrome user data
directory and other Chrome launch options that the other Chrome-based
extractors (PDF, Screenshot, etc.) already honor.
## Changes
- Added `SINGLEFILE_CHROME_ARGS` config option with fallback to
`CHROME_ARGS`
- Updated SingleFile extractor to pass Chrome arguments via
`--browser-args`
- Updated documentation
This ensures SingleFile respects the same Chrome configuration as other
Chrome-based extractors.
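
As a rough illustration of the fallback described above, the sketch below resolves `SINGLEFILE_CHROME_ARGS` first, falls back to `CHROME_ARGS`, and forwards the result through `--browser-args`. The helper names, the dict-based config, and the JSON-array encoding of the flag value are assumptions made for illustration, not the extractor's actual code.

```python
import json
import shlex


def resolve_chrome_args(config: dict) -> list[str]:
    """Prefer SINGLEFILE_CHROME_ARGS, fall back to CHROME_ARGS (hypothetical helper)."""
    raw = config.get("SINGLEFILE_CHROME_ARGS") or config.get("CHROME_ARGS") or []
    # Accept either a list of flags or a single shell-style string.
    if isinstance(raw, str):
        raw = shlex.split(raw)
    return list(raw)


def build_singlefile_cmd(binary: str, url: str, output: str, config: dict) -> list[str]:
    """Build the argv for a SingleFile run (hypothetical helper)."""
    cmd = [binary]
    chrome_args = resolve_chrome_args(config)
    if chrome_args:
        # Assumption: --browser-args takes the Chrome flags as a JSON array.
        cmd.append("--browser-args=" + json.dumps(chrome_args))
    cmd += [url, output]
    return cmd
```

With only `CHROME_ARGS` set, SingleFile would receive the same flags as the other Chrome-based extractors; setting `SINGLEFILE_CHROME_ARGS` overrides them for SingleFile alone.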
Generated with [Claude Code](https://claude.ai/code)
Show small thumbnails of recently completed ArchiveResult content in the
progress header. The thumbnail strip appears below the stats bar and
shows the last 20 successfully archived items with embeddable content
(screenshots, favicons, DOM snapshots, etc.).
Features:
- API returns recent_thumbnails with embed paths for succeeded results
- Thumbnails display with plugin-specific icons as fallback
- New thumbnails animate in with a pop effect
- Clicking a thumbnail navigates to the snapshot admin page
- Horizontal scrollable strip with custom scrollbar styling
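
A minimal sketch of how the `recent_thumbnails` payload might be assembled, assuming each result is a dict. The field names (`status`, `embed_path`, `plugin`, `snapshot_id`, `ended_at`) are hypothetical stand-ins, not the actual ArchiveResult schema.

```python
MAX_THUMBNAILS = 20  # the strip shows the last 20 items


def recent_thumbnails(results: list[dict], limit: int = MAX_THUMBNAILS) -> list[dict]:
    """Return the newest succeeded results that have embeddable content."""
    succeeded = [
        r for r in results
        if r.get("status") == "succeeded" and r.get("embed_path")
    ]
    # Newest first, assuming `ended_at` sorts chronologically (e.g. ISO 8601 strings).
    succeeded.sort(key=lambda r: r["ended_at"], reverse=True)
    return [
        {
            "snapshot_id": r["snapshot_id"],  # lets the UI link to the snapshot
            "plugin": r["plugin"],            # lets the UI pick a fallback icon
            "embed_path": r["embed_path"],
        }
        for r in succeeded[:limit]
    ]
```

Returning the plugin name alongside the embed path is what would let the frontend fall back to a plugin-specific icon when the preview fails to load.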
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->
# Summary
<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->
# Related issues
<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->
# Changes these areas
- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Adds a thumbnail strip to the live progress header. It shows previews of
the last 20 successful archived items for quick visual feedback and
one-click navigation.
- **New Features**
- API returns recent_thumbnails with embed paths for succeeded results.
- Horizontal, scrollable thumbnail strip under the header.
- Uses preview images when available; plugin icons as fallback.
- New thumbnails animate in with a pop effect.
- Clicking a thumbnail opens the snapshot admin page.
<sup>Written for commit 17029ba8b8.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
…nstall
- Delete chrome/on_Crawl__10_chrome_validate.py (duplicates
chrome_install)
- Rename wget/on_Crawl__11_wget_validate.py →
on_Crawl__06_wget_install.py
All hooks now follow consistent naming: install, launch, or config
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Removed the redundant Chrome validate hook, renamed the Wget validate
hook to wget_install, and standardized hook names and priorities to
match the install/launch/config lifecycle. This removes duplicate logic
and fixes priority conflicts across Crawl, Binary, and Snapshot hooks.
- **Refactors**
- Deleted chrome/on_Crawl__10_chrome_validate.py (dup of chrome_install)
- Renamed wget validate to on_Crawl__06_wget_install.py
- Standardized on_Binary hook priorities: npm 10, pip 11, brew 12, apt
13, custom 14, env 15
- Fixed on_Snapshot order: staticfile 32, readability 56, mercury 57,
htmltotext 58
<sup>Written for commit 09a1ca3134.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
Deleted dead/duplicate hooks:
- wget/on_Crawl__10_install_wget.py (duplicate of
__10_wget_validate_config.py)
- chrome/on_Crawl__00_chrome_install.py (simpler version, kept full one)
- chrome/on_Crawl__20_chrome_launch.bg.js (legacy, kept __30 version)
- singlefile/on_Crawl__20_install_singlefile_extension.js
(disabled/dead)
- istilldontcareaboutcookies/on_Crawl__20_install_*.js (legacy)
- ublock/on_Crawl__03_ublock.js (legacy, kept __20 version)
- Entire captcha2/ plugin (legacy version of twocaptcha/)
Renamed hooks to follow consistent pattern:
on_Crawl__XX_<plugin>_<action>.<ext>
Priority bands:
- 00-09: Binary/extension installation
- 10-19: Config validation
- 20-29: Browser launch and post-launch config
Final hooks:
- 00 ripgrep_install.py
- 01 chrome_install.py
- 02 istilldontcareaboutcookies_install.js
- 03 ublock_install.js
- 04 singlefile_install.js
- 05 twocaptcha_install.js
- 10 chrome_validate.py
- 11 wget_validate.py
- 20 chrome_launch.bg.js
- 25 twocaptcha_config.js
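
The naming pattern and priority bands above can be checked mechanically. The sketch below parses a hook filename of the form `on_Crawl__XX_<plugin>_<action>.<ext>` and reports its band. The regex, the `Hook` tuple, and `parse_hook` are illustrative assumptions, not code from the repository.

```python
import re
from typing import NamedTuple

# on_<Event>__<XX>_<plugin>_<action>.<ext>, e.g. on_Crawl__20_chrome_launch.bg.js
HOOK_RE = re.compile(
    r"^on_(?P<event>[A-Za-z]+)__(?P<priority>\d{2})_"
    r"(?P<plugin>.+)_(?P<action>[a-z]+)\.(?P<ext>.+)$"
)

# Priority bands as listed in the commit message.
PRIORITY_BANDS = [
    (range(0, 10), "binary/extension installation"),
    (range(10, 20), "config validation"),
    (range(20, 30), "browser launch and post-launch config"),
]


class Hook(NamedTuple):
    event: str
    priority: int
    plugin: str
    action: str
    ext: str
    band: str


def parse_hook(filename: str) -> Hook:
    """Split a hook filename into its parts and look up its priority band."""
    m = HOOK_RE.match(filename)
    if m is None:
        raise ValueError(f"not a valid hook filename: {filename!r}")
    priority = int(m["priority"])
    band = next(
        (label for rng, label in PRIORITY_BANDS if priority in rng),
        "unassigned",  # priorities outside the documented bands
    )
    return Hook(m["event"], priority, m["plugin"], m["action"], m["ext"], band)
```

A check like this could run in CI to flag hooks whose action name and priority band disagree, for example an `install` hook numbered in the 20s.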
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Cleaned up Crawl-level hooks by removing legacy/duplicate code and
standardizing hook names and priorities. Chrome launch is now a single,
updated hook with better extension detection and cleaner outputs.
- **Refactors**
- Removed dead hooks (legacy chrome install/launch, singlefile
extension, old ublock/cookies scripts, duplicate wget validate) and the
legacy captcha2 plugin in favor of twocaptcha.
- Renamed hooks to on_Crawl__XX_<plugin>_<action> with priority bands:
00-09 install, 10-19 validate, 20-29 launch/config.
- Consolidated Chrome launch into on_Crawl__20_chrome_launch.bg.js;
writes outputs to the current dir, resolves real extension IDs via
chrome://extensions, and records extensions.json after verification.
- **Migration**
- If you used captcha2, switch to the twocaptcha hooks
(on_Crawl__05_twocaptcha_install.js and
on_Crawl__25_twocaptcha_config.js).
- Update any docs/scripts that reference old hook filenames.
<sup>Written for commit 4c77949197.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
- Add assertIsNotNone for accessibility_data to ensure test fails if no data generated
- Capture and report JSON decode errors in parse_dom_outlinks test
- Add assertIsNotNone for outlinks_data with error details
- Remove conditional checks that allowed tests to pass without verifying functionality
Addresses review comments from cubic-dev-ai
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- test_seo.py: Add assertIsNotNone before conditional to catch SEO extraction failures
- test_ssl.py: Add assertIsNotNone to ensure SSL data is captured from HTTPS URLs
- test_pip_provider.py: Assert jsonl_found variable to verify binary discovery
- dns plugin: Deduplicate NXDOMAIN records using seenResolutions map
Tests now fail when functionality doesn't work (no cheating).
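
To illustrate why the added `assertIsNotNone` calls matter, here is a small sketch contrasting the two patterns. The helper names and the `title` key are hypothetical; only `assertIsNotNone` and `assertIn` come from `unittest`.

```python
import unittest

_tc = unittest.TestCase()  # used only for its assertion helpers


def weak_check(data):
    """Old pattern: passes vacuously when nothing was extracted."""
    if data:
        _tc.assertIn("title", data)


def strict_check(data):
    """New pattern: fails loudly when nothing was extracted."""
    _tc.assertIsNotNone(data, "no data was extracted")
    _tc.assertIn("title", data)
```

Calling `weak_check(None)` returns silently, while `strict_check(None)` raises `AssertionError`, which is the behavior these commits aim for.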
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
Add input validation and path safety checks to prevent path traversal
attacks in persona name handling:
- Add validate_persona_name() to block dangerous characters (/, \, .., etc)
- Add ensure_path_within_personas_dir() to verify resolved paths stay within PERSONAS_DIR
- Apply validation at persona creation, renaming, and deletion operations
Fixes security issues identified by cubic-dev-ai in PR review.
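
A minimal sketch of what the two checks could look like. The function names come from this commit, but the bodies, the `PERSONAS_DIR` location, and the exact set of rejected characters are assumptions rather than the merged implementation.

```python
from pathlib import Path

PERSONAS_DIR = Path("/data/personas")  # hypothetical location

_FORBIDDEN_CHARS = "/\\\x00"  # path separators and NUL


def validate_persona_name(name: str) -> str:
    """Reject names that could escape the personas directory."""
    if not name or not name.strip():
        raise ValueError("persona name must not be empty")
    if ".." in name or name == ".":
        raise ValueError(f"persona name contains a path component: {name!r}")
    if any(ch in name for ch in _FORBIDDEN_CHARS):
        raise ValueError(f"persona name contains a forbidden character: {name!r}")
    return name


def ensure_path_within_personas_dir(path: Path, personas_dir: Path = PERSONAS_DIR) -> Path:
    """Resolve `path` and verify it is strictly inside `personas_dir`."""
    root = personas_dir.resolve()
    resolved = path.resolve()  # collapses ".." and follows symlinks
    # Refuse the root itself so a delete can never remove every persona.
    if resolved == root or not resolved.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"{resolved} is outside {root}")
    return resolved
```

Running both checks is deliberate in this sketch: the name check gives a clear error early, and the resolved-path check still catches cases the name check cannot see, such as a symlink inside the personas directory that points elsewhere.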
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- merkletree: Tests merkle tree generation with real files,
empty directory handling, and disabled mode
- custom: Tests custom bash command execution and binary discovery
Real integration tests using Chrome sessions with example.com:
- accessibility: Tests page outline and accessibility tree extraction
- parse_dom_outlinks: Tests link extraction and categorization
- consolelog: Tests console output capture
The assertion checked 'has_seo_data or seo_data' inside an 'if seo_data:' block,
so it was always truthy. It now checks only 'has_seo_data', which properly verifies
that the expected SEO keys were extracted.
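
The flaw is easy to reproduce in isolation. In the sketch below, `EXPECTED_KEYS` and both function names are hypothetical; only the shape of the two conditions follows the description above.

```python
EXPECTED_KEYS = {"title", "description"}  # hypothetical set of SEO keys


def old_check(seo_data: dict) -> bool:
    has_seo_data = any(key in seo_data for key in EXPECTED_KEYS)
    if seo_data:
        # seo_data is truthy in this branch, so the `or` can never be False.
        return bool(has_seo_data or seo_data)
    return True


def new_check(seo_data: dict) -> bool:
    has_seo_data = any(key in seo_data for key in EXPECTED_KEYS)
    if seo_data:
        # Only passes when at least one expected key was extracted.
        return has_seo_data
    return True
```

For input such as `{"unrelated": 1}`, `old_check` passes while `new_check` fails, which is the gap the fix closes.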
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Add `archivebox persona create/list/update/delete` commands
- Support `--import=chrome|firefox|brave` to copy browser profile
- Extract cookies via CDP to generate cookies.txt for non-browser tools
- Fix JSDoc comment parsing issue in chrome_utils.js
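
For the cookie export step, a rough sketch of turning CDP-style cookie objects into Netscape `cookies.txt` lines follows. The input shape (`name`, `value`, `domain`, `path`, `expires`, `secure`, `httpOnly`) is modeled on CDP cookie objects as an assumption, and the function name is hypothetical.

```python
def cdp_cookies_to_netscape(cookies: list[dict]) -> str:
    """Render cookie dicts as Netscape cookies.txt content."""
    lines = ["# Netscape HTTP Cookie File"]
    for cookie in cookies:
        domain = cookie["domain"]
        expires = cookie.get("expires", -1)
        fields = [
            # The "#HttpOnly_" prefix marks HttpOnly cookies (curl convention).
            ("#HttpOnly_" if cookie.get("httpOnly") else "") + domain,
            "TRUE" if domain.startswith(".") else "FALSE",  # include subdomains
            cookie.get("path", "/"),
            "TRUE" if cookie.get("secure") else "FALSE",
            # Session cookies (non-positive expiry) are written as 0.
            str(int(expires)) if expires and expires > 0 else "0",
            cookie["name"],
            cookie["value"],
        ]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"
```

A file in this format is accepted by tools such as wget (`--load-cookies`) and curl (`--cookie`), which is what makes it useful for non-browser tools.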
Import shared utilities (getEnv, getEnvBool, getEnvInt) from
chrome_utils.js instead of duplicating them locally.
Also use DNS_TIMEOUT config for dynamic timeout calculations.
Records hostname → IP resolutions during page load using Chrome CDP.
Uses Network.responseReceived events to capture DNS resolution data
and writes one JSON line per record to dns.jsonl.
Features:
- Captures hostname to IP address mappings (A/AAAA records)
- Records failed DNS lookups (NXDOMAIN)
- Deduplicates resolution records per page load
- Integrates with existing Chrome plugin infrastructure
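
The per-page deduplication and the one-record-per-line output can be sketched as follows. This is written in Python for illustration and is not a port of the plugin; the record fields (`hostname`, `ip`, `status`) are assumptions, and the plugin's real schema may differ.

```python
import json
from typing import Iterable, Iterator


def dedupe_resolutions(events: Iterable[dict]) -> Iterator[dict]:
    """Yield each (hostname, ip) resolution once, in first-seen order."""
    seen = set()
    for event in events:
        # Failed lookups carry no IP, so they dedupe on (hostname, None).
        key = (event["hostname"], event.get("ip"))
        if key in seen:
            continue
        seen.add(key)
        yield event


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)
```

Keying on the hostname and IP pair keeps distinct A and AAAA answers for the same host while collapsing repeated responses from the same connection, and repeated NXDOMAIN failures for one hostname collapse to a single record.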
- Add real integration tests for SSL, redirects, and SEO plugins
using Chrome session helpers for live URL testing
- Remove fake "format" tests that just created dicts and asserted on them
(apt, pip, npm provider output format tests)
- Remove npm integration test that created dirs then checked they existed
- Fix SQLite search test to use SQLITEFTS_DB constant instead of hardcoded value
- Remove pid_utils tests (module deleted in dev)
- Update orchestrator tests to use Process model for tracking
- Add tests for Process.current(), cleanup_stale_running(), terminate()
- Add tests for Process hierarchy (parent/child, root, depth)
- Add tests for Process.get_running(), get_running_count()
- Add tests for ProcessMachine state machine
- Update machine model tests to match current API (from_jsonl vs from_json)