- Add assertIsNotNone for accessibility_data to ensure test fails if no data generated
- Capture and report JSON decode errors in parse_dom_outlinks test
- Add assertIsNotNone for outlinks_data with error details
- Removes conditional checks that allowed tests to pass without verifying functionality
Addresses review comments from cubic-dev-ai
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- test_seo.py: Add assertIsNotNone before conditional to catch SEO extraction failures
- test_ssl.py: Add assertIsNotNone to ensure SSL data is captured from HTTPS URLs
- test_pip_provider.py: Assert jsonl_found variable to verify binary discovery
- dns plugin: Deduplicate NXDOMAIN records using seenResolutions map
Tests now fail when functionality doesn't work (no cheating).
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
Add input validation and path safety checks to prevent path traversal
attacks in persona name handling:
- Add validate_persona_name() to block dangerous characters (/, \, .., etc)
- Add ensure_path_within_personas_dir() to verify resolved paths stay within PERSONAS_DIR
- Apply validation at persona creation, renaming, and deletion operations
Fixes security issues identified by cubic-dev-ai in PR review.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- merkletree: Tests merkle tree generation with real files,
empty directory handling, and disabled mode
- custom: Tests custom bash command execution and binary discovery
Real integration tests using Chrome sessions with example.com:
- accessibility: Tests page outline and accessibility tree extraction
- parse_dom_outlinks: Tests link extraction and categorization
- consolelog: Tests console output capture
The assertion was checking 'has_seo_data or seo_data' inside an 'if seo_data:' block,
making it always truthy. Changed to just check 'has_seo_data' to properly verify
that expected SEO keys were extracted.
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Add `archivebox persona create/list/update/delete` commands
- Support `--import=chrome|firefox|brave` to copy browser profile
- Extract cookies via CDP to generate cookies.txt for non-browser tools
- Fix JSDoc comment parsing issue in chrome_utils.js
Import shared utilities (getEnv, getEnvBool, getEnvInt) from
chrome_utils.js instead of duplicating them locally.
Also use DNS_TIMEOUT config for dynamic timeout calculations.
Records hostname → IP resolutions during page load using Chrome CDP.
Uses Network.responseReceived events to capture DNS resolution data
and writes one JSON line per record to dns.jsonl.
Features:
- Captures hostname to IP address mappings (A/AAAA records)
- Records failed DNS lookups (NXDOMAIN)
- Deduplicates resolution records per page load
- Integrates with existing Chrome plugin infrastructure
- Add real integration tests for SSL, redirects, and SEO plugins
using Chrome session helpers for live URL testing
- Remove fake "format" tests that just created dicts and asserted on them
(apt, pip, npm provider output format tests)
- Remove npm integration test that created dirs then checked they existed
- Fix SQLite search test to use SQLITEFTS_DB constant instead of hardcoded value
- Remove pid_utils tests (module deleted in dev)
- Update orchestrator tests to use Process model for tracking
- Add tests for Process.current(), cleanup_stale_running(), terminate()
- Add tests for Process hierarchy (parent/child, root, depth)
- Add tests for Process.get_running(), get_running_count()
- Add tests for ProcessMachine state machine
- Update machine model tests to match current API (from_jsonl vs from_json)
- Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation
- Fix worker process_type detection: explicitly set to WORKER after registration
- Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently
- Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN)
- Fix get_running_workers() to return process_id instead of incorrectly named worker_id
- Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed
- Fix migration to include all indexes in state_operations (parent_id, process_type)
- Fix documentation to use Machine.current() scoping and StatusChoices constants
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Add prominent view mode switcher with List/Grid toggle buttons
- Improve filter sidebar CSS with modern styling, rounded corners
- Add live progress bar for in-progress snapshots showing hooks status
- Show plugin icons only when output directory has content
- Display archive result output_size sum from new field
- Show hooks succeeded/total count in size column
- Add get_progress_stats() method to Snapshot model
- Add CSS for progress spinner and status badges
- Update grid view template with progress indicator for archiving cards
- Add tests for admin views, search, and progress stats
- Add pwd validation in Process.launch() to prevent crashes
- Fix psutil returncode handling (use wait() return value, not returncode attr)
- Add None check for proc.pid in cleanup_stale_running()
- Add stale process cleanup in Orchestrator.is_running()
- Ensure orchestrator process_type is correctly set to ORCHESTRATOR
- Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown)
- Throttle cleanup_stale_running() to once per 30 seconds for performance
- Fix worker process_type to use TypeChoices.WORKER consistently
- Fix get_running_workers() API to return list of dicts (not Process objects)
- Only delete PID files after successful kill or confirmed stale
- Fix migration index names to match between SQL and Django state
- Remove db_index=True from process_type (index created manually)
- Update documentation to reflect actual implementation
- Add explanatory comments to empty except blocks
- Fix exit codes to use Unix convention (128 + signal number)
Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
- Fix conftest.py: use subprocess for init, remove unused cli_env fixture
- Update all test files to use data_dir parameter instead of env
- Remove mock-based TestJSONLOutput class from tests_piping.py
- Remove unused imports (MagicMock, patch)
- Fix file permissions for cli_utils.py
All tests now use real subprocess calls per CLAUDE.md guidelines:
- NO MOCKS - tests exercise real code paths
- NO SKIPS - every test runs
Add comprehensive unit tests for the CLI piping architecture:
- test_cli_crawl.py: crawl create/list/update/delete tests
- test_cli_snapshot.py: snapshot create/list/update/delete tests
- test_cli_archiveresult.py: archiveresult create/list/update/delete tests
- test_cli_run.py: run command create-or-update and pass-through tests
Extend tests_piping.py with:
- TestPassThroughBehavior: tests for pass-through behavior in all commands
- TestPipelineAccumulation: tests for accumulating records through pipeline
All tests use pytest fixtures from conftest.py with isolated DATA_DIR.
Adds a new CLI command `archivebox pluginmap` that displays:
- ASCII art diagrams of all core state machines (Crawl, Snapshot,
ArchiveResult, Binary)
- Lists all auto-detected on_Modelname_xyz hooks grouped by model/event
- Shows hook execution order (step 0-9), plugin name, and background status
Usage:
archivebox pluginmap # Show all diagrams and hooks
archivebox pluginmap -m Snapshot # Filter to specific model
archivebox pluginmap -a # Include disabled plugins
archivebox pluginmap -q # Output JSON only
DELETED:
- workers/pid_utils.py (-192 lines) - replaced by Process model methods
SIMPLIFIED:
- crawls/models.py Crawl.cleanup() (80 lines -> 10 lines)
- hooks.py: deleted process_is_alive() and kill_process() (-45 lines)
UPDATED to use Process model:
- core/models.py: Snapshot.cleanup() and has_running_background_hooks()
- machine/models.py: Binary.cleanup()
- workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start
- workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running
All subprocess management now uses:
- Process.current() for registering current process
- Process.get_running() / get_running_count() for querying
- Process.cleanup_stale_running() for cleanup
- safe_kill_process() for validated PID killing
Total line reduction: ~250 lines
This consolidates scattered subprocess management logic into the Process model:
- terminate(): Graceful SIGTERM → wait → SIGKILL (replaces stop_worker, etc.)
- kill_tree(): Kill process and all OS children (replaces os.killpg logic)
- kill_children_db(): Kill DB-tracked child processes
- get_running(): Query running processes by type (replaces get_all_worker_pids)
- get_running_count(): Count running processes (replaces get_running_worker_count)
- stop_all(): Stop all processes of a type
- get_next_worker_id(): Get next worker ID for spawning
Added Phase 8 to TODO documenting ~390 lines that can be deleted after
consolidation, including workers/pid_utils.py which becomes obsolete.
Also includes migration 0002 for parent FK and process_type fields.
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)
Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating
Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
- Records WITHOUT id: Create via Model.from_json()
- Records WITH id: Lookup existing, re-queue
- Outputs JSONL of all processed records for chaining
Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions
Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command
This enables Unix-style piping:
archivebox crawl create URL | archivebox run
archivebox archiveresult list --status=failed | archivebox run
curl API | jq transform | archivebox crawl create | archivebox run
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->
# Summary
<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->
# Related issues
<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->
# Changes these areas
- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->
# Summary
<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->
# Related issues
<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->
# Changes these areas
- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Switch background hook cleanup to a graceful termination flow using
plugin-specific timeouts, only SIGKILLing if needed. This improves
reliability and records accurate exit codes and stderr for better result
reporting.
- **Refactors**
- Added graceful_terminate_background_hooks(): send SIGTERM to all
hooks, wait per plugin timeout, SIGKILL remaining, reap with waitpid,
write .returncode files.
- Snapshot.cleanup() now uses merged config (get_config) to apply
plugin-specific timeouts and terminate hooks gracefully.
- update_from_output() reads .returncode and .stderr.log, infers status
when no JSONL (handles signals like -9/-15), includes stderr on
failures, and cleans up .returncode files.
<sup>Written for commit 524e8e98c3.
Summary will update on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->
# Summary
<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->
# Related issues
<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->
# Changes these areas
- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
Extended graceful_terminate_background_hooks() to:
- Reap processes with os.waitpid() to get exit codes
- Write returncode to .returncode file for update_from_output()
- Return detailed result dict with status, returncode, and pid
Updated update_from_output() to:
- Read .returncode and .stderr.log files
- Determine status from returncode if no ArchiveResult JSONL record
- Include stderr in output_str for failed hooks
- Handle signal termination (negative returncodes like -9 for SIGKILL)
- Clean up .returncode files along with other hook output files
- Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js
These are now the single source of truth for path calculations
- Update chrome_test_helpers.py with call_chrome_utils() dispatcher
- Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers
- Update cleanup_chrome and kill_chromium_session to use JS killChrome
- Remove unused Chrome binary search lists from singlefile hook (~25 lines)
- Update readability, mercury, favicon, title tests to use shared helpers