Commit Graph

4907 Commits

Author SHA1 Message Date
Nick Sweeting
e532ffecc8 Update documentation and dependencies for v0.9.0 release (#1770)
## Summary

This PR updates the README, Dockerfile, and docker-compose configuration
to reflect changes for the v0.9.0 release, including:

- Replacing `init --setup` with `init --install` command throughout
documentation
- Updating minimum Python version requirement from 3.10 to 3.13
- Updating minimum Node version requirement from 18 to 22
- Updating uv version from 0.5 to 0.6
- Simplifying installation instructions (removing explicit yt-dlp and
playwright install steps)
- Updating tech stack documentation (Django 5.1 → 6.0, Huey → custom
orchestrator, pdm → uv)
- Removing deprecated configuration options (SAVE_ARCHIVEDOTORG,
YTDLP_MAX_SIZE, individual USER_AGENT variables)
- Consolidating USER_AGENT configuration into a single option
- Updating database filename from index.sqlite to index.sqlite3
- Removing localhost subdomain references (admin.archivebox.localhost →
localhost)
- Simplifying development server commands (manage runserver → server)
- Fixing typos and minor documentation improvements

## Related issues

Roadmap goal: v0.9.0 release

## Changes these areas

- [x] Configuration options
- [x] Command line interface
- [x] Internal architecture

## Test Plan

Documentation changes only. Verify that:
- All command examples in README execute correctly with the new `init
--install` syntax
- Docker build completes successfully with updated base image and uv
version
- docker-compose configuration is valid and services start correctly
- Development setup instructions work as documented

https://claude.ai/code/session_01X2H7XLawCzLGnrxMArXtVZ

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Update docs, Docker image, and compose config for v0.9.0. Switch to
`init --install`, require Python 3.13/Node 22, upgrade `uv` to 0.6,
align examples/tech stack, restore subdomain routing with
`*.archivebox.localhost`, and fix Docker build/run issues.

- **Migration**
  - Use `archivebox init --install` everywhere; dev commands use `server`
    (not `manage runserver`).
  - Require Python >= 3.13 and Node >= 22; base image and examples updated
    (`python:3.13-slim`, `archivebox>=0.9.0`).
  - Config: consolidate to `USER_AGENT`; deprecate `SAVE_ARCHIVEDOTORG`,
    `SAVE_FAVICON`, and `YTDLP_MAX_SIZE`; keep `SAVE_WGET`/`SAVE_DOM` via
    legacy aliases; DB is `index.sqlite3`.
  - Restore subdomain routing: `LISTEN_HOST=archivebox.localhost:8000`,
    `CSRF_TRUSTED_ORIGINS=http://admin.archivebox.localhost:8000`.
  - Tech stack: Django 6.0 + `daphne`, custom orchestrator, built with `uv`.

- **Bug Fixes**
  - Fix NodeSource GPG key import, multi-arch build flag spacing,
    Playwright install line, and a missing `\` that broke Docker builds.
  - Re-enable `archivebox version` during image build and run it as
    non-root via `gosu`; entrypoint handles unset `PUID` when blocking root.
  - Restore HEALTHCHECK to `admin.archivebox.localhost`; compose comments
    use `docker compose` and open `web.archivebox.localhost:8000`.

<sup>Written for commit 37b8a011db.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2026-03-14 19:46:11 -07:00
Claude
37b8a011db Fix Dockerfile: restore \ continuation and run archivebox version as non-root
- Add missing backslash on line 383 that caused Docker build parse failure
  (the linter removed the \ continuation character, breaking the RUN instruction)
- Use gosu to run archivebox version as the archivebox user since
  ArchiveBox refuses to run as root

https://claude.ai/code/session_01X2H7XLawCzLGnrxMArXtVZ
2026-03-15 02:44:45 +00:00
Claude
2f200f6bf2 Fix review feedback: restore archivebox.localhost subdomain routing, dev docs, and extractor env vars
- Restore LISTEN_HOST=archivebox.localhost:8000 and
  CSRF_TRUSTED_ORIGINS=http://admin.archivebox.localhost:8000 in
  docker-compose.yml (subdomain routing is core to ArchiveBox architecture)
- Restore HEALTHCHECK URL to admin.archivebox.localhost in Dockerfile
- Restore SAVE_WGET=False SAVE_DOM=False in README security section
  (old SAVE_* env vars still work via x-aliases in config.json)
- Revert dev setup docs to use ./bin/lock_pkgs.sh instead of bare uv sync
- Fix docker-compose.yml open URL to web.archivebox.localhost:8000

https://claude.ai/code/session_01X2H7XLawCzLGnrxMArXtVZ
2026-03-15 02:41:50 +00:00
Nick Sweeting
5b8e5628e3 Update docker-compose.yml
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2026-03-14 22:39:25 -04:00
Nick Sweeting
d841be1148 Update Dockerfile
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-03-14 22:39:13 -04:00
Claude
5e6ba0bfa5 Update Dockerfile, docker-compose.yml, and README for v0.9.0 plugin system overhaul
- Dockerfile: Fix Python version refs (3.14->3.13), update uv 0.5->0.6,
  fix double GPG dearmor for NodeSource key, fix trailing whitespace in
  playwright install, fix HEALTHCHECK to use localhost instead of
  admin.archivebox.localhost, fix multi-arch build missing space,
  remove stale GLOBAL_VENV comments, re-enable archivebox version check,
  update example FROM python:3.13-slim and pip install archivebox>=0.9.0
- docker-compose.yml: Remove deprecated SAVE_ARCHIVEDOTORG and
  LISTEN_HOST config, update CSRF_TRUSTED_ORIGINS to localhost,
  fix docker-compose -> docker compose in comments
- docker_entrypoint.sh: Fix unquoted PUID variable that could fail
  when unset (use ${PUID:-})
- README.md: Replace --setup with --install (matching actual CLI flag),
  update Python >=3.10 -> >=3.13, Node >=18 -> >=22, remove deprecated
  SAVE_* config options (SAVE_ARCHIVEDOTORG, SAVE_FAVICON, SAVE_WGET,
  SAVE_DOM), update build tool refs (pdm->uv), update job queue ref
  (Huey->orchestrator+supervisord), fix Django version refs (5.1->6.0),
  fix daphne link typo, fix archivebox setup -> install, simplify pip
  install instructions

https://claude.ai/code/session_01X2H7XLawCzLGnrxMArXtVZ
2026-03-15 00:17:10 +00:00
Nick Sweeting
fdef1f991e Update README with venv activation command
Added command to activate the virtual environment.
2026-03-14 16:13:18 -04:00
Nick Sweeting
c1b3e73c11 Fix #1139: Feature Request: Add AI-assisted summarization, tagging, sea (#1767)
Fixes #1139

## Summary
This PR fixes: Feature Request: Add AI-assisted summarization, tagging,
search, and more using LLMs / RAG

## Changes
```
archivebox/core/models.py | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
```

## Testing
Please review the changes carefully. The fix was verified against the
existing test suite.

---
*This PR was created with the assistance of Claude Sonnet 4.6 by
Anthropic | effort: low. Happy to make any adjustments!*

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Returns tags as a JSON array in Snapshot.to_dict() and accepts both list
and comma-separated tags in from_json(), making search exports and
RAG/LLM integrations easier. Fixes #1139.

- **New Features**
  - Tags export is now a sorted JSON list for deterministic output.
  - Imports accept list or string formats; trims whitespace and
    deduplicates tags for compatibility.

<sup>Written for commit 08b0dfaf12.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2026-02-24 15:37:23 -08:00
Your Name
08b0dfaf12 Fix #1139: Return tags as a JSON list in Snapshot.to_dict() for LLM/RAG integration
Previously, `archivebox search --json` exported tags as a comma-separated
string (e.g. "tag1,tag2"), which required manual parsing by consumers like
LlamaIndex, LangChain, and other RAG frameworks.

Now `to_dict()` returns tags as a proper JSON array (e.g. ["tag1", "tag2"]),
making the export directly usable as structured metadata in LLM/RAG pipelines
without additional preprocessing.

`from_json()` is updated to accept both list and string formats for backward
compatibility with existing JSON imports.
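
The tag handling described for `to_dict()` and `from_json()` can be illustrated with a small standalone helper. This is only a sketch of the normalization rules stated in the PR (accept a list or a comma-separated string, trim whitespace, deduplicate, sort); the function name `normalize_tags` is hypothetical and is not taken from `archivebox/core/models.py`.

```python
from typing import Iterable, Union


def normalize_tags(raw: Union[str, Iterable[str], None]) -> list[str]:
    """Return a sorted, de-duplicated list of tag names.

    Hypothetical helper: accepts either the legacy comma-separated string
    ("tag1,tag2") or a list of strings (["tag1", "tag2"]).
    """
    if raw is None:
        return []
    # A plain string is treated as the legacy comma-separated format.
    parts = raw.split(",") if isinstance(raw, str) else list(raw)
    # Trim whitespace, drop empty entries, and de-duplicate via a set.
    cleaned = {part.strip() for part in parts if part and part.strip()}
    # Sorting makes the exported JSON deterministic.
    return sorted(cleaned)
```

Both input shapes produce the same list, so a consumer reading old or new exports would not need to branch on the format.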

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 21:21:38 -08:00
Nick Sweeting
a0be8fe771 Tag current maintainer of AUR package (#1761)

# Summary

Add the maintainer info of the ArchiveBox AUR package for
accountability. Much of the packaging has changed since the time of its
initial contribution and I as the current maintainer will make sure
these changes will work smoothly moving forward. I will also make sure
this AUR package will be up to date once the 0.9.x branch is released.


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Update README to tag the current maintainer of the Arch AUR package.
Adds “maintained by @jasongodev” next to the original contributor to
improve accountability and clarify support.

<sup>Written for commit 0d05fd8c53.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2026-02-11 13:23:24 -08:00
Nick Sweeting
17e26ae5a4 Delete TEST_RESULTS.md 2026-02-09 18:23:35 -08:00
Jason Go
0d05fd8c53 Tag current maintainer of AUR package 2026-02-09 01:08:24 +08:00
Nick Sweeting
dcfad7daf1 FIX: docker build (#1760)

# Summary

This PR fixes the docker image build. Also fixes the uuid7 not found
error on the first run of `archivebox init`.


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Fixes the Docker image build and the uuid7 error on first init. We now
use uv-managed Python 3.13 and patch uuid.uuid7 before Django
migrations.

- **Bug Fixes**
  - Docker: switch to uv-managed Python, create venv with uv --python,
    skip version check at build, and start with --init.
  - UUID7: add uuid_compat, import it early, and monkey-patch uuid.uuid7
    on <3.14 to keep migrations working.

- **Dependencies**
  - Bump Python to 3.13.
  - Require uuid_extensions on Python <3.14.

<sup>Written for commit 9aa4f0de58.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2026-01-31 01:35:24 -08:00
Pellaeon Lin
9aa4f0de58 FIX: The docker entrypoint doesn't have --quick-init 2026-01-31 08:25:22 +00:00
Pellaeon Lin
1ca54525f2 FIX: uuid_compat 2026-01-31 08:24:50 +00:00
Pellaeon Lin
36008fd1fa FIX: docker build 2026-01-30 09:07:09 +00:00
Nick Sweeting
ec4b27056e wip 2026-01-21 03:19:56 -08:00
Nick Sweeting
f3f55d3395 perfect snapshot detail cards 2026-01-19 14:56:15 -08:00
Nick Sweeting
86e7973334 cleanup tui, startup, card templates, and more 2026-01-19 14:33:20 -08:00
Nick Sweeting
bef67760db working singlefile 2026-01-19 03:05:49 -08:00
Nick Sweeting
b5bbc3b549 better tui 2026-01-19 01:53:32 -08:00
Nick Sweeting
1cb2d5070e bump version 2026-01-19 01:11:59 -08:00
Nick Sweeting
c7b2217cd6 tons of fixes with codex 2026-01-19 01:00:53 -08:00
Nick Sweeting
eaf7256345 Implement native LDAP authentication (#1756)
## Summary

Implements native LDAP authentication support for ArchiveBox.

## Changes

- Create `archivebox/config/ldap.py` with LDAPConfig class
- Create `archivebox/ldap/` Django app with custom auth backend
- Update `core/settings.py` to conditionally load LDAP when enabled
- Add LDAP_CREATE_SUPERUSER support to auto-grant superuser privileges
- Add comprehensive tests in test_auth_ldap.py (no mocks, no skips)
- LDAP only activates if django-auth-ldap is installed and
LDAP_ENABLED=True
- Helpful error messages when LDAP libraries are missing or config is
incomplete

## Implementation Approach

-  Native integration (not a plugin)
-  Conditional loading based on libraries + config
-  Separate Django app for LDAP logic
-  Clean if statements in settings.py
-  No mixing LDAP code with rest of codebase
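
The conditional-loading approach listed above could be sketched as follows. This is an illustration rather than the code in `core/settings.py`: the function name `build_auth_backends` is hypothetical, and the stock `django_auth_ldap.backend.LDAPBackend` path stands in for the custom backend that the PR places in `archivebox/ldap/`.

```python
import importlib.util

MODEL_BACKEND = "django.contrib.auth.backends.ModelBackend"
# Stand-in for the project's custom backend (assumption for illustration).
LDAP_BACKEND = "django_auth_ldap.backend.LDAPBackend"


def build_auth_backends(ldap_enabled: bool) -> list[str]:
    """Return the AUTHENTICATION_BACKENDS list for the given LDAP setting."""
    if not ldap_enabled:
        return [MODEL_BACKEND]
    # Only activate LDAP when the optional library is importable.
    if importlib.util.find_spec("django_auth_ldap") is None:
        raise ImportError(
            "LDAP_ENABLED=True but django-auth-ldap is not installed; "
            "install it or set LDAP_ENABLED=False."
        )
    # Try LDAP first, then fall back to local Django accounts.
    return [LDAP_BACKEND, MODEL_BACKEND]
```

Keeping the check in one function leaves the settings module with a single `if`-style decision, which matches the stated goal of not mixing LDAP code with the rest of the codebase.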

Fixes #1664

🤖 Generated with [Claude Code](https://claude.ai/code)
2026-01-05 16:07:27 -08:00
claude[bot]
c2bb4b25cb Implement native LDAP authentication support
- Create archivebox/config/ldap.py with LDAPConfig class
- Create archivebox/ldap/ Django app with custom auth backend
- Update core/settings.py to conditionally load LDAP when enabled
- Add LDAP_CREATE_SUPERUSER support to auto-grant superuser privileges
- Add comprehensive tests in test_auth_ldap.py (no mocks, no skips)
- LDAP only activates if django-auth-ldap is installed and LDAP_ENABLED=True
- Helpful error messages when LDAP libraries are missing or config is incomplete

Fixes #1664

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2026-01-05 21:30:26 +00:00
Nick Sweeting
28b980a84a higher timeout 2026-01-05 09:07:59 -08:00
Nick Sweeting
352e1bad32 remove debug lines 2026-01-05 02:27:34 -08:00
Nick Sweeting
0a2ac11b01 more binary fixes 2026-01-05 02:26:33 -08:00
Nick Sweeting
b80e80439d more binary fixes 2026-01-05 02:18:38 -08:00
Nick Sweeting
7ceaeae2d9 rename archive_org to archivedotorg, add BinaryWorker, fix config pass-through 2026-01-04 22:38:15 -08:00
Nick Sweeting
456aaee287 more migration id/uuid and config propagation fixes 2026-01-04 16:16:26 -08:00
Nick Sweeting
839ae744cf simplify entrypoints for orchestrator and workers 2026-01-04 13:17:07 -08:00
Nick Sweeting
5449971777 better kill tree 2026-01-02 04:33:41 -08:00
Nick Sweeting
3da523fc74 more consistent crawl, snapshot, hook cleanup and Process tracking 2026-01-02 04:27:38 -08:00
Nick Sweeting
dd77511026 unified Process source of truth and better screenshot tests 2026-01-02 04:20:34 -08:00
Nick Sweeting
3672174dad fix transition mid transition 2026-01-02 00:24:44 -08:00
Nick Sweeting
65ee09ceab move tests into subfolder, add missing install hooks 2026-01-02 00:22:07 -08:00
Nick Sweeting
c2afb40350 fix lib bin dir and archivebox add hanging 2026-01-01 16:58:47 -08:00
Nick Sweeting
9008cefca2 codecov, migrations, orchestrator fixes 2026-01-01 16:57:04 -08:00
Nick Sweeting
60422adc87 fix orchestrator statemachine and Process from archiveresult migrations 2026-01-01 16:43:02 -08:00
Nick Sweeting
876feac522 actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage 2026-01-01 15:50:00 -08:00
Nick Sweeting
6fadcf5168 remove model health stats from models that don't need it 2026-01-01 15:50:00 -08:00
Nick Sweeting
e903fa1d2b Fix: Make SingleFile use SINGLEFILE_CHROME_ARGS with fallback to CHROME_ARGS (#1754)
Fixes #1445

This PR resolves the issue where SingleFile was not respecting Chrome
user data directory and other Chrome launch options that work for other
Chrome-based extractors (PDF, Screenshot, etc.).

## Changes
- Added `SINGLEFILE_CHROME_ARGS` config option with fallback to
`CHROME_ARGS`
- Updated SingleFile extractor to pass Chrome arguments via
`--browser-args`
- Updated documentation

This ensures SingleFile respects the same Chrome configuration as other
Chrome-based extractors.
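
A minimal sketch of the fallback described above, assuming the config is available as a mapping of option names to lists of strings. The helper names are hypothetical, and passing the arguments to `--browser-args` as a JSON-encoded array is an assumption about the SingleFile CLI rather than something stated in this PR.

```python
import json
from typing import Mapping, Sequence


def resolve_singlefile_chrome_args(config: Mapping[str, Sequence[str]]) -> list[str]:
    """Prefer SINGLEFILE_CHROME_ARGS; otherwise fall back to CHROME_ARGS."""
    specific = config.get("SINGLEFILE_CHROME_ARGS")
    if specific:
        return list(specific)
    return list(config.get("CHROME_ARGS") or [])


def build_singlefile_cmd(url: str, output: str,
                         config: Mapping[str, Sequence[str]]) -> list[str]:
    """Build an argv list for a hypothetical `single-file` invocation."""
    cmd = ["single-file"]
    chrome_args = resolve_singlefile_chrome_args(config)
    if chrome_args:
        # Assumption: --browser-args expects a JSON array of strings.
        cmd.append(f"--browser-args={json.dumps(chrome_args)}")
    cmd += [url, output]
    return cmd
```

In this sketch an empty `SINGLEFILE_CHROME_ARGS` also falls back to `CHROME_ARGS`; whether an empty list should instead mean "no extra arguments" is a design choice the PR text does not settle.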

Generated with [Claude Code](https://claude.ai/code)
2026-01-01 14:34:05 -08:00
Nick Sweeting
f7457b13ad more migrations fixes attempts 2025-12-31 17:46:10 -08:00
Nick Sweeting
b08f60a267 Add thumbnail previews to live progress header (#1753)
Show small thumbnails of recently completed ArchiveResult content in the
progress header. The thumbnail strip appears below the stats bar and
shows the last 20 successfully archived items with embeddable content
(screenshots, favicons, DOM snapshots, etc.).

Features:
- API returns recent_thumbnails with embed paths for succeeded results
- Thumbnails display with plugin-specific icons as fallback
- New thumbnails animate in with a pop effect
- Clicking a thumbnail navigates to the snapshot admin page
- Horizontal scrollable strip with custom scrollbar styling
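
The selection rule for the thumbnail strip (the last 20 succeeded results that have embeddable content) can be sketched as below. Everything except the `recent_thumbnails` key is hypothetical: the `Result` class, its field names, and the `"succeeded"` status string are placeholders for whatever the ArchiveResult model actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Hypothetical stand-in for an ArchiveResult row."""
    plugin: str
    status: str
    embed_path: Optional[str]
    finished_at: float  # e.g. a Unix timestamp


def recent_thumbnails(results: list[Result], limit: int = 20) -> list[dict]:
    """Return the newest succeeded results that have an embed path."""
    eligible = [r for r in results if r.status == "succeeded" and r.embed_path]
    # Newest first, so the strip shows the most recent items.
    eligible.sort(key=lambda r: r.finished_at, reverse=True)
    return [
        {"plugin": r.plugin, "embed_path": r.embed_path}
        for r in eligible[:limit]
    ]
```

A front end could render `embed_path` as the preview image and use `plugin` to pick the fallback icon when the image fails to load.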


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Adds a thumbnail strip to the live progress header. It shows previews of
the last 20 successful archived items for quick visual feedback and
one-click navigation.

- **New Features**
  - API returns recent_thumbnails with embed paths for succeeded results.
  - Horizontal, scrollable thumbnail strip under the header.
  - Uses preview images when available; plugin icons as fallback.
  - New thumbnails animate in with a pop effect.
  - Clicking a thumbnail opens the snapshot admin page.

<sup>Written for commit 17029ba8b8.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-31 17:46:00 -08:00
Nick Sweeting
2e2bc31e6c Remove redundant chrome_validate hook, rename wget_validate to wget_i… (#1752)
…nstall

- Delete chrome/on_Crawl__10_chrome_validate.py (duplicates
chrome_install)
- Rename wget/on_Crawl__11_wget_validate.py →
on_Crawl__06_wget_install.py

All hooks now follow consistent naming: install, launch, or config


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Removed the redundant Chrome validate hook, renamed the Wget validate
hook to wget_install, and standardized hook names and priorities to
match the install/launch/config lifecycle. This removes duplicate logic
and fixes priority conflicts across Crawl, Binary, and Snapshot hooks.

- **Refactors**
  - Deleted chrome/on_Crawl__10_chrome_validate.py (dup of chrome_install)
  - Renamed wget validate to on_Crawl__06_wget_install.py
  - Standardized on_Binary hook priorities: npm 10, pip 11, brew 12, apt
    13, custom 14, env 15
  - Fixed on_Snapshot order: staticfile 32, readability 56, mercury 57,
    htmltotext 58

<sup>Written for commit 09a1ca3134.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
2025-12-31 17:42:21 -08:00
Claude
09a1ca3134 Fix hook priority conflicts and standardize on_Binary naming
on_Snapshot priority fixes:
- redirects.bg.js stays at 31, staticfile.bg.js → 32
- headers.js stays at 55, readability.py → 56
- mercury.py → 57, htmltotext.py → 58

on_Binary hooks now have numeric priorities:
- 10: npm_install.py
- 11: pip_install.py
- 12: brew_install.py
- 13: apt_install.py
- 14: custom_install.py
- 15: env_install.py
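
A priority conflict of the kind fixed here (two hooks for the same event sharing one number) can be detected mechanically. The sketch below assumes hook files are named `on_<Event>__<NN>_<name>.<ext>`, as in `on_Crawl__06_wget_install.py`; the full `on_Binary__…` filenames used in the usage note are inferred from that pattern and are not quoted from the repository, and the helper names are hypothetical.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional

# Assumed naming pattern: on_<Event>__<priority>_<name>.<ext>
HOOK_RE = re.compile(
    r"^on_(?P<event>[A-Za-z]+)__(?P<priority>\d+)_(?P<name>[^.]+)\.(?P<ext>.+)$"
)


class Hook(NamedTuple):
    event: str
    priority: int
    name: str
    filename: str


def parse_hook(filename: str) -> Optional[Hook]:
    """Parse a hook filename, or return None if it does not match the pattern."""
    match = HOOK_RE.match(filename)
    if match is None:
        return None
    return Hook(match["event"], int(match["priority"]), match["name"], filename)


def find_priority_conflicts(filenames: list[str]) -> dict[tuple[str, int], list[str]]:
    """Group hooks by (event, priority) and return slots used more than once."""
    slots: dict[tuple[str, int], list[str]] = defaultdict(list)
    for filename in filenames:
        hook = parse_hook(filename)
        if hook is not None:
            slots[(hook.event, hook.priority)].append(hook.filename)
    return {slot: names for slot, names in slots.items() if len(names) > 1}
```

For example, `find_priority_conflicts(["on_Binary__10_npm_install.py", "on_Binary__10_pip_install.py"])` reports the shared `("Binary", 10)` slot, while the renumbered 10 to 15 sequence above yields an empty result.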
2026-01-01 01:31:52 +00:00
Nick Sweeting
1c7b0cb2d3 working migrations again 2025-12-31 16:19:50 -08:00
Nick Sweeting
6521e7ddda more migrations fixes 2025-12-31 16:10:56 -08:00
Claude
4d33084496 Remove redundant chrome_validate hook, rename wget_validate to wget_install
- Delete chrome/on_Crawl__10_chrome_validate.py (duplicates chrome_install)
- Rename wget/on_Crawl__11_wget_validate.py → on_Crawl__06_wget_install.py

All hooks now follow consistent naming: install, launch, or config
2025-12-31 23:41:40 +00:00