Something is missing#
The extractor uses several fallbacks to make sure enough text is returned. Content extraction is a tradeoff between precision and recall, that is, between keeping desired content and filtering out undesirable content. Accepting more unwanted text makes it easier to gather more of the relevant text in the output. Here are ways to tackle the issue:
Changing the minimum acceptable length in the settings
(see also reported issues with The New Yorker)
Beyond raw HTML#
For various reasons, it is also possible that the standard download utility does not come through. Using another one is then an option (see pycurl with Python and curl on the command-line).
Installing the additional download utility pycurl manually or using pip3 install trafilatura[all] can alleviate the problem: another download library is then used, leading to different results.
Several alternatives are available on the command-line, e.g. wget -O - "my_url" | trafilatura instead of trafilatura -u "my_url".
Emulating a browser is also possible, see the information on headless browsing above.
Downloads may fail because your IP address or user agent is blocked. Trafilatura’s crawling and download capabilities do not bypass such restrictions.
Web page no longer available on the Internet#
Download issues can be addressed by retrieving the files somewhere else, e.g. from existing web archives like the Internet Archive or the Common Crawl.
On the command-line you can use
--archived to use the Internet Archive to retrieve pages for which download failed. A corresponding function in Python could look as follows:
# url is the target
# downloaded is the result of a previous download attempt
from trafilatura import fetch_url

if downloaded is None:
    # the /web/20/ prefix resolves to the latest available snapshot
    new_url = "https://web.archive.org/web/20/" + url
    downloaded = fetch_url(new_url)
This approach is generic as it fetches the last available snapshot from the archive.
Download first and extract later#
Since download and extraction have distinct characteristics, it can be useful to separate the infrastructure needed for each. Using a custom IP or network infrastructure can also prevent your usual IP from getting banned.
For an approach using files from the Common Crawl and Trafilatura, see the external tool datatrove/process_common_crawl_dump.py.