Download web pages#
This documentation page shows how to run simple downloads and how to configure and execute parallel downloads with threads. Both single and concurrent downloads should respect basic “politeness” rules which are described below.
A main objective of data collection over the Internet, such as web crawling, is to efficiently gather as many useful web pages as possible. In order to retrieve multiple web pages at once, it makes sense to query as many domains as possible in parallel. However, particular rules then apply.
New in version 0.9: Functions exposed and made usable for convenience.
Running simple downloads is straightforward with the fetch_url() function. Such downloads are also called single-threaded because the URLs are processed sequentially.
    from trafilatura.downloads import fetch_url

    # single download
    downloaded = fetch_url('https://www.example.org')

    # sequential downloads using a list
    mylist = ["https://www.example.org", "https://httpbin.org"]
    for url in mylist:
        downloaded = fetch_url(url)
        # do something with it
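A sequential loop usually has to cope with pages that cannot be retrieved. The following is a minimal sketch, assuming that the fetching function returns None when a download fails (which is how fetch_url() is expected to behave when it decodes the content). The helper download_all and its fetch parameter are hypothetical names introduced for illustration. They are not part of Trafilatura.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch URLs one after another and separate successes from failures."""
    results: Dict[str, str] = {}
    failed: List[str] = []
    for url in urls:
        content = fetch(url)
        if content is None:
            # assumption: None signals a failed download
            failed.append(url)
        else:
            results[url] = content
    return results, failed


# Usage with Trafilatura (requires network access):
# from trafilatura.downloads import fetch_url
# pages, errors = download_all(mylist, fetch_url)
```

Passing the fetching function as an argument keeps the loop easy to test without a network connection.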
For efficiency reasons the function makes use of a connection pool where connections are kept open (unless too many websites are retrieved at once). You may see warnings in logs about it which you can safely ignore.
The content (stored here in the variable downloaded) is seamlessly decoded to a Unicode string.
This default setting can be overridden using the decode=False argument: fetch_url() then returns a urllib3-like response object providing additional information.
This RawResponse object comprises attributes such as status, data and url, which can be accessed as follows:
    # RawResponse object instead of Unicode string
    >>> response = fetch_url('https://www.example.org', decode=False)
    >>> response.status
    200
    >>> response.url
    'https://www.example.org'
    >>> response.data
    # raw HTML in binary format
Trafilatura-backed parallel threads#
Threads are a way to run several parts of a program at once (see for instance An Intro to Threading in Python). Multi-threaded downloads are a good option to make more efficient use of the Internet connection. The threads download pages as they go.
This only makes sense if you are fetching pages from different websites and want the downloads to run in parallel.
The following variant of multi-threaded downloads with throttling is implemented. It also uses a compressed dictionary to store URLs and possibly save space. Both happen seamlessly. Here is how to run it:
    from trafilatura.downloads import add_to_compressed_dict, buffered_downloads, load_download_buffer

    # list of URLs
    mylist = ['https://www.example.org', 'https://www.httpbin.org/html']
    # number of threads to use
    threads = 4

    # convert the input list to an internal format
    url_store = add_to_compressed_dict(mylist)
    # processing loop
    while url_store.done is False:
        bufferlist, url_store = load_download_buffer(url_store, sleep_time=5)
        # process downloads
        for url, result in buffered_downloads(bufferlist, threads):
            # do something here
            print(url)
            print(result)
This safe but efficient option consists in throttling requests based on domains/websites from which content is downloaded. It is highly recommended!
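To illustrate the idea of domain-based throttling, here is a minimal sketch that splits a URL list into batches containing at most one URL per host, so that each batch can be downloaded in parallel without hitting the same server twice. It is not Trafilatura's implementation: the function domain_batches is a hypothetical helper, and it groups by host name (netloc) rather than by registered domain, which is a simplification.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List
from urllib.parse import urlsplit


def domain_batches(urls: Iterable[str]) -> List[List[str]]:
    """Split URLs into batches that contain at most one URL per host."""
    queues: Dict[str, Deque[str]] = {}
    for url in urls:
        # group the URLs by host, keeping their original order
        queues.setdefault(urlsplit(url).netloc, deque()).append(url)

    batches: List[List[str]] = []
    while queues:
        # take the next pending URL of every host still in the register
        batches.append([queue.popleft() for queue in queues.values()])
        # drop the hosts whose queue is now empty
        queues = {host: queue for host, queue in queues.items() if queue}
    return batches
```

A pause between two batches then acts as a per-host delay, since a given host appears only once per batch.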
Asynchronous processing is probably even more efficient in the context of file downloads from a variety of websites. See for instance the AIOHTTP library.
On the command-line#
Downloads on the command-line are automatically run with threads and domain-aware throttling as described above. The following will read URLs from a file, process the results and save them accordingly:
    # basic output as raw text with backup directory
    $ trafilatura -i list.txt -o txtfiles/ --backup-dir htmlbackup/
To check for download errors you can use the exit code (0 if all pages could be downloaded, 1 otherwise) and sift through the logs if necessary.
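The exit code can also be checked from a script. The sketch below wraps a command with the standard subprocess module and reports whether it exited with code 0. The helper run_and_report is a hypothetical name, and the commented usage assumes that the trafilatura command is installed and that list.txt exists.

```python
import subprocess
import sys
from typing import Sequence


def run_and_report(command: Sequence[str]) -> bool:
    """Run a command and return True if its exit code is 0."""
    completed = subprocess.run(command, check=False)
    if completed.returncode != 0:
        # a non-zero code means something went wrong, check the logs
        print(f"command exited with code {completed.returncode}", file=sys.stderr)
    return completed.returncode == 0


# Usage (assumes trafilatura is installed and list.txt exists):
# ok = run_and_report(["trafilatura", "-i", "list.txt", "-o", "txtfiles/",
#                      "--backup-dir", "htmlbackup/"])
```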
For more information, see the page on command-line use.
Enforcing politeness rules#
Machines consume resources on the visited systems and they often visit sites unprompted. That is why issues of schedule, load, and politeness come into play. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent.
We want to space out requests to any given server and not request the same content multiple times in a row
We also should avoid parts of a server that are restricted
We save time for us and the others if we do not request unnecessary information (see content-aware URL selection)
Beware that there is a tacit scraping etiquette and that a server may block you after the download of a certain number of pages from the same website/domain in a short period of time.
In addition, some websites may block Trafilatura’s user agent. Thus, the software waits a few seconds between requests by default.
This additional constraint means we have to not only care for download speed but also manage a register of known websites and apply the rules so as to keep maximizing speed while not being too intrusive. Here is how to keep an eye on it.
Robots exclusion standard#
The robots.txt file is usually available at the root of a website (e.g. www.example.com/robots.txt). It describes what a crawler should or should not crawl according to the Robots exclusion standard. Certain websites indeed restrict access for machines, for example by limiting the number of web pages or the site sections which are open to them.
The file lists a series of rules which define how bots can interact with the websites. It should be fetched from a website in order to test whether the URL under consideration passes the robot restrictions, and these politeness policies should be respected.
Python features a module addressing the issue in its core packages. The gist of its operation is described below. For more information see urllib.robotparser in the official Python documentation.
    import urllib.robotparser
    from trafilatura import get_crawl_delay

    # define a website to look for rules
    base_url = 'https://www.example.org'

    # load the necessary components, fetch and parse the file
    rules = urllib.robotparser.RobotFileParser()
    rules.set_url(base_url + '/robots.txt')
    rules.read()

    # determine if a page can be fetched by all crawlers
    rules.can_fetch("*", "https://www.example.org/page1234.html")
    # returns True or False
In addition, some websites may block certain user agents. By replacing the star with one’s user agent (e.g. bot name) we can check if we have been explicitly banned from certain sections or from the whole website, which can happen when rules are ignored.
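The difference between the generic rules and the rules for a named user agent can be tried out offline. The sketch below parses a hypothetical robots.txt from a string instead of fetching it. The bot name mybot and the file content are assumptions made for illustration only.

```python
import urllib.robotparser

# hypothetical robots.txt content, parsed from a string to avoid a network request
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: mybot
Disallow: /
"""

rules = urllib.robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# generic crawlers may fetch public pages but not the restricted section
public_ok = rules.can_fetch("*", "https://www.example.org/page1234.html")
private_ok = rules.can_fetch("*", "https://www.example.org/private/data.html")

# the crawler named "mybot" is excluded from the whole website
mybot_ok = rules.can_fetch("mybot", "https://www.example.org/page1234.html")

print(public_ok, private_ok, mybot_ok)
```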
There should be an interval between successive requests to avoid burdening the web servers of interest. That way, you will not slow them down and/or risk getting banned. In addition, Trafilatura includes URL deduplication.
To prevent the execution of too many requests within too little time, the optional argument sleep_time can be passed to the load_download_buffer() function. It is the time in seconds between two requests for the same domain/website.
    from trafilatura.downloads import load_download_buffer

    # 30 seconds is a safe choice
    bufferlist, url_store = load_download_buffer(url_store, sleep_time=30)
    # then proceed as instructed above...
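To make the notion of a per-website waiting time more concrete, here is a minimal sketch of a register that remembers when each host was last requested and computes how long to wait before the next request. It is an illustration under stated assumptions and not how Trafilatura implements throttling internally. The class DomainThrottle and its methods are hypothetical names, and hosts are identified by their netloc.

```python
import time
from typing import Dict, Optional
from urllib.parse import urlsplit


class DomainThrottle:
    """Keep a register of hosts and the time of their last request."""

    def __init__(self, sleep_time: float = 30.0) -> None:
        self.sleep_time = sleep_time
        self.last_request: Dict[str, float] = {}

    def wait_time(self, url: str, now: Optional[float] = None) -> float:
        """Return how many seconds to wait before requesting this URL."""
        now = time.monotonic() if now is None else now
        last = self.last_request.get(urlsplit(url).netloc)
        if last is None:
            # unknown host: no need to wait
            return 0.0
        return max(0.0, self.sleep_time - (now - last))

    def record(self, url: str, now: Optional[float] = None) -> None:
        """Register that a request to this URL's host has just been made."""
        now = time.monotonic() if now is None else now
        self.last_request[urlsplit(url).netloc] = now
```

The optional now argument only exists to make the behavior reproducible. In practice the monotonic clock is used.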
One of the rules that can be defined by a robots.txt file is the crawl delay (Crawl-Delay), i.e. the time between two download requests for a given website. This delay (in seconds) can be retrieved as follows:
    # get the desired information
    seconds = get_crawl_delay(rules)
    # provide a backup value in case no rule exists (happens quite often)
    seconds = get_crawl_delay(rules, default=30)
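The same fallback logic can be sketched with the standard library alone, since RobotFileParser provides a crawl_delay() method that returns None when no delay is defined. The helper delay_for below is a hypothetical name and the rules are parsed from hypothetical lines rather than fetched.

```python
import urllib.robotparser


def delay_for(parser: urllib.robotparser.RobotFileParser,
              useragent: str = "*",
              default: float = 30.0) -> float:
    """Return the crawl delay in seconds, or a default if none is defined."""
    delay = parser.crawl_delay(useragent)
    return float(delay) if delay is not None else default


# hypothetical rules containing a crawl delay
rules = urllib.robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Crawl-delay: 10", "Disallow: /private/"])

# hypothetical empty robots.txt without any rule
empty = urllib.robotparser.RobotFileParser()
empty.parse([])

print(delay_for(rules), delay_for(empty))
```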
Trafilatura’s focused crawler implements the delay where applicable. For further info and rules see the documentation page on crawling.
You can also decide to store the rules for convenience and later use, for example in a domain-based dictionary:
    # this module comes with trafilatura
    from courlan import extract_domain

    rules_dict = dict()
    # storing information
    domain = extract_domain(base_url)
    rules_dict[domain] = rules
    # retrieving rules info
    seconds = get_crawl_delay(rules_dict[domain])
You can then use such rules with the crawling module.
Here is the simplest way to stay polite while taking all potential constraints into account:
Read robots.txt files, filter your URL list accordingly and care for crawl delay
Use the framework described above and set the throttling variable to a safe value (your main bottleneck is your connection speed anyway)
Optional: for longer crawls, keep track of the throttling info and revisit the robots.txt rules
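The first step in the list above, filtering a URL list according to stored rules, can be sketched as follows. The function filter_allowed is a hypothetical helper. It assumes a dictionary mapping host names to already parsed RobotFileParser objects and it keeps URLs from hosts without known rules, which is a design choice you may want to reverse for stricter crawls.

```python
import urllib.robotparser
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def filter_allowed(
    urls: Iterable[str],
    rules_by_host: Dict[str, urllib.robotparser.RobotFileParser],
    useragent: str = "*",
) -> List[str]:
    """Keep only the URLs that the stored robots.txt rules allow."""
    allowed: List[str] = []
    for url in urls:
        rules = rules_by_host.get(urlsplit(url).netloc)
        # assumption: hosts without known rules are kept in this sketch
        if rules is None or rules.can_fetch(useragent, url):
            allowed.append(url)
    return allowed
```

The filtered list can then be passed to the download framework described above with a safe throttling value.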
See also the page on troubleshooting.