ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-03 14:27:55 +10:00

Files

Ross Williams b6a20c962a Extract text from singlefile.html when indexing

singlefile.html contains many large strings in the form of `data:`
URLs, which end up stored needlessly in full-text indices. Large
chunks of JavaScript shouldn't be indexed either, as they pollute
results for searches about JS functions, etc.

This commit takes a blanket approach of parsing singlefile.html as it is
read and only outputting text and selected textual attributes (like
`alt`) for indexing.

2023-10-12 13:06:35 -04:00
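
The approach described in the commit message above can be illustrated with a minimal sketch. It is not the code from this commit: it assumes Python's standard `html.parser` module, and the names `TextExtractor`, `extract_text`, `SKIPPED_ELEMENTS`, and `TEXT_ATTRIBUTES` are hypothetical, as is the choice of which attributes count as textual beyond `alt`.

```python
from html.parser import HTMLParser
from typing import List

# Hypothetical choices for this sketch, not taken from ArchiveBox itself.
SKIPPED_ELEMENTS = {"script", "style"}   # contents are never indexed
TEXT_ATTRIBUTES = {"alt", "title"}       # attribute values worth indexing


class TextExtractor(HTMLParser):
    """Collects visible text and a few textual attributes from HTML."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_ELEMENTS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        # Keep only selected attributes; `src`/`href` (and so `data:` URLs)
        # are dropped because they are not in TEXT_ATTRIBUTES.
        for name, value in attrs:
            if name in TEXT_ATTRIBUTES and value and value.strip():
                self.chunks.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_ELEMENTS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text nodes are kept unless they sit inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the indexable text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Because `HTMLParser.feed` accepts input incrementally, the same class could in principle be fed chunks of the file as they are read instead of one full string.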

| Name | Last commit | Date |
| --- | --- | --- |
| backends | bail out on sonic indexing after 5 errors | 2021-04-10 05:18:03 -04:00 |
| `__init__.py` | refactor: Remove setup_django from search | 2020-12-11 16:43:48 -05:00 |
| utils.py | Extract text from singlefile.html when indexing | 2023-10-12 13:06:35 -04:00 |