ArchiveBox/archivebox/extractors
Ben Muthalaly 77917e9b55 Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.

Now, when the title tag is empty, the title falls back to link.base_url,
and single-character titles are parsed correctly.
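
The two failure modes described above can be reproduced with a small sketch. The patterns below are illustrative assumptions about the general shape of such a regex, not copies of the repository's HTML_TITLE_REGEX, and extract_title and its fallback parameter (standing in for link.base_url) are hypothetical names.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE

# Hypothetical "before" pattern (an assumption, not the project's code):
# the leading "." may consume the "<" of "</title>", and the capture
# group requires at least two characters.
OLD_TITLE_REGEX = re.compile(r'<title.*?>(.[^<>]+)[<>]', FLAGS)

# Hypothetical "after" pattern: one or more characters that are not
# angle brackets, so a single-character title matches and an empty
# title does not.
NEW_TITLE_REGEX = re.compile(r'<title.*?>([^<>]+)[<>]', FLAGS)


def extract_title(html, fallback, pattern=NEW_TITLE_REGEX):
    """Return the captured title, or `fallback` when nothing usable matches."""
    match = pattern.search(html)
    if match:
        title = match.group(1).strip()
        if title:
            return title
    return fallback


if __name__ == "__main__":
    for html in ("<title></title>", "<title>A</title>", "<title>Hello</title>"):
        print(repr(html),
              "old:", repr(extract_title(html, "example.com", OLD_TITLE_REGEX)),
              "new:", repr(extract_title(html, "example.com", NEW_TITLE_REGEX)))
```

Under these assumed patterns, the old form captures "</title" for an empty tag and misses single-character titles, while the new form falls back for the empty tag and captures "A".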

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00