mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-03 14:27:55 +10:00)
singlefile.html contains many large strings in the form of `data:` URLs, which would otherwise be stored in full in the full-text indices for no benefit. Large chunks of JavaScript shouldn't be indexed either, as they pollute results for searches about JS functions and similar terms. This commit takes a blanket approach: it parses singlefile.html as it is read and outputs only text and selected textual attributes (like `alt`) for indexing.
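The approach can be illustrated with a minimal sketch built on the standard library's `html.parser`. This is not the code from this commit: the names `IndexableTextParser` and `extract_indexable_text`, and the choice of `alt` and `title` as the indexed attributes, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class IndexableTextParser(HTMLParser):
    """Collects visible text and a few textual attributes (hypothetical helper)."""

    SKIPPED_TAGS = {"script", "style"}   # contents are never indexed
    INDEXED_ATTRS = {"alt", "title"}     # assumed set of "textual" attributes

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        # Only whitelisted attributes are kept, so `src="data:..."` is dropped.
        for name, value in attrs:
            if name in self.INDEXED_ATTRS and value and value.strip():
                self.pieces.append(value.strip())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> arrives here too, so filter it out.
        if not self._skip_depth:
            text = data.strip()
            if text:
                self.pieces.append(text)


def extract_indexable_text(chunks):
    """Feed HTML chunks to the parser as they are read and yield text pieces."""
    parser = IndexableTextParser()
    for chunk in chunks:
        parser.feed(chunk)
        yield from parser.pieces
        parser.pieces.clear()
    parser.close()  # flush any text still buffered by the parser
    yield from parser.pieces
```

Because the parser is fed incrementally, the file never has to be held in memory as a whole. One limitation of this sketch is that a run of text spanning a chunk boundary may be yielded as two pieces.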