ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-06 10:55:44 +10:00

Files

Ben Muthalaly 77917e9b55 Fix HTML title parsing bugs.

This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.

Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
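
The two failure modes described above can be reproduced and avoided with a small sketch. Both patterns below are hypothetical illustrations, not the project's actual HTML_TITLE_REGEX: `OLD_STYLE_RE` is simply a pattern shaped so that it shows the same two symptoms, and `TITLE_RE` and `parse_title` are assumed names. The `fallback` argument stands in for link.base_url.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical "before" pattern: the leading "." forces at least two
# characters of title text and happily consumes the "<" of "</title>".
OLD_STYLE_RE = re.compile(r'<title.*?>(.[^<>]+)', FLAGS)

# Hypothetical "after" pattern: one or more characters that are not
# angle brackets, so empty tags do not match and "A" does.
TITLE_RE = re.compile(r'<title\b[^>]*>([^<>]+)', FLAGS)


def parse_title(html: str, fallback: str) -> str:
    """Return the page title, or `fallback` when the tag is empty or missing."""
    match = TITLE_RE.search(html)
    if match:
        title = match.group(1).strip()
        if title:
            return title
    return fallback


if __name__ == "__main__":
    print(parse_title("<title></title>", "example.com"))   # falls back
    print(parse_title("<title>A</title>", "example.com"))  # single character
```

A regex is still a fragile way to read HTML; an HTML parser would be the more robust route that the message hints at.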

2023-10-09 02:00:01 -05:00

__init__.py

just use out_dir

2023-05-29 10:03:49 +02:00

archive_org.py

enforce UTF-8 on literally all file operations because Windows sucks

2021-03-27 01:16:29 -04:00

dom.py

After a timeout, Chrome will leave behind a SingletonLock, which prevents future instances of Chrome from starting. When an extractor fails due to a timeout, remove this file.
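
A minimal sketch of that cleanup, assuming the lock lives directly inside the Chrome user data directory and that the extractor launches Chrome through `subprocess.run`. The function name `run_with_lock_cleanup` and its parameters are hypothetical and are not taken from dom.py.

```python
import subprocess
from pathlib import Path


def run_with_lock_cleanup(cmd, user_data_dir, timeout):
    """Run `cmd`; if it times out, delete the stale SingletonLock and re-raise."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumed location of the lock: <user_data_dir>/SingletonLock.
        # unlink() removes the entry itself, even if it is a dangling symlink.
        Path(user_data_dir, "SingletonLock").unlink(missing_ok=True)
        raise
```

Re-raising keeps the timeout visible to the caller, so the extractor is still reported as failed while the next Chrome launch is no longer blocked.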

2023-08-28 17:27:03 +02:00

favicon.py

Add FAVICON_PROVIDER option for custom favicon service
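
As a rough illustration of how such an option could be used, the sketch below treats FAVICON_PROVIDER as a URL template with a single `{}` placeholder for the site's domain. That template format, the example provider URL, and the helper name `favicon_url` are all assumptions made for this sketch, not details confirmed by the commit.

```python
from urllib.parse import urlparse

# Hypothetical provider template; "{}" is replaced with the page's domain.
FAVICON_PROVIDER = "https://favicons.example.com/?domain={}"


def favicon_url(page_url: str, provider: str = FAVICON_PROVIDER) -> str:
    """Build the favicon request URL for `page_url` from a provider template."""
    domain = urlparse(page_url).netloc
    return provider.format(domain)


if __name__ == "__main__":
    print(favicon_url("https://example.org/some/page"))
```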