Web crawling

This tutorial shows how to perform web crawling tasks with Python and on the command line. Trafilatura allows for easy focused crawling.

New in version 0.9. Still experimental.

Concept

Intra vs. inter

A distinction has to be made between intra- and inter-domain crawling:

  1. Focused crawling at the website level: Finding sources within a website is relatively easy if the site is not too big or too convoluted. For this, Trafilatura offers functions to search for links in sitemaps and feeds.

  2. Web crawling: Hopping between websites can be cumbersome. Discovering more domains without gathering too much junk or running into bugs is difficult without experience with the subject.

For practical reasons, the first approach (“intra”) is usually the better choice, combined with “good” seeds or sources, i.e. ones customized as needed. As an alternative, prefix searches on the Common Crawl index can be used.

See information on finding sources for more details.
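The prefix search on the Common Crawl index mentioned above can be sketched as a simple query URL. This is only an illustration, not part of Trafilatura: the helper name prefix_query_url is made up, the collection identifier is a placeholder to be replaced with a current crawl, and the query parameters (url, matchType, output) are assumed to follow the CDX server API used by the public index.

```python
from urllib.parse import urlencode


def prefix_query_url(collection, url_prefix):
    """Build a query URL listing captures whose URL starts with url_prefix.

    Hypothetical helper: the endpoint layout and parameter names are
    assumptions based on the public Common Crawl index server.
    """
    params = {"url": url_prefix, "matchType": "prefix", "output": "json"}
    return f"https://index.commoncrawl.org/{collection}-index?{urlencode(params)}"


# "CC-MAIN-2023-50" is only a placeholder for a crawl identifier.
query = prefix_query_url("CC-MAIN-2023-50", "example.org/blog/")
print(query)
```

The resulting URL can be fetched with any HTTP client; each line of the response is expected to describe one capture, whose URL can then serve as a source.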

Operation

The focused crawler aims at the discovery of texts within a website by exploration and retrieval of links.

This kind of tool is commonly known as a (web) crawler or spider. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the pages and adds them to the list of URLs to visit, called the crawl frontier.
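The interplay of seeds, visited pages and crawl frontier can be illustrated with a minimal sketch. It does not reproduce Trafilatura's implementation: the FAKE_WEB dictionary and the extract_links function are stand-ins for downloading and parsing real pages, so that the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://www.example.org/": ["https://www.example.org/a", "https://www.example.org/b"],
    "https://www.example.org/a": ["https://www.example.org/b", "https://www.example.org/c"],
    "https://www.example.org/b": ["https://www.example.org/"],
    "https://www.example.org/c": [],
}


def extract_links(url):
    """Stand-in for downloading a page and parsing its hyperlinks."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, max_visits=10):
    """Visit pages breadth-first; return the visit order and all known URLs."""
    frontier = deque(seeds)   # URLs still to visit (the crawl frontier)
    known = set(seeds)        # every URL seen so far
    visited = []
    while frontier and len(visited) < max_visits:
        url = frontier.popleft()
        visited.append(url)
        for link in extract_links(url):
            if link not in known:   # queue each URL only once
                known.add(link)
                frontier.append(link)
    return visited, known


visited, known = crawl(["https://www.example.org/"])
print(visited)
```

Keeping a set of known URLs next to the frontier prevents the crawler from queueing the same page twice, which matters as soon as pages link back to each other.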

The spider module implements politeness rules as defined by the Robots exclusion standard where applicable.
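To illustrate what such politeness rules look like in practice, here is a small sketch based on the urllib.robotparser module from the Python standard library. It is not necessarily how the spider module handles the rules internally, and the robots.txt content below is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# https://www.example.org/robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Check individual URLs against the rules before visiting them.
allowed = rules.can_fetch("*", "https://www.example.org/blog/post.html")
blocked = rules.can_fetch("*", "https://www.example.org/private/data.html")
# Number of seconds the site asks crawlers to wait between requests.
delay = rules.crawl_delay("*")

print(allowed, blocked, delay)
```

A polite crawler skips URLs for which can_fetch returns False and waits at least the requested delay between two requests to the same host.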

It prioritizes navigation pages (archives, categories, etc.) over the rest in order to gather as many links as possible in few iterations.
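One possible way to implement such a prioritization is to place likely navigation pages at the front of the frontier. The following sketch only illustrates the idea: the NAVIGATION pattern and the enqueue function are assumptions made for this example and do not reflect the actual heuristics used by the spider module.

```python
import re
from collections import deque

# Assumed heuristic: URL path segments that typically mark navigation pages.
NAVIGATION = re.compile(r"/(archives?|category|categories|tags?|page)/", re.IGNORECASE)


def enqueue(frontier, links, known):
    """Add unseen links to the frontier, navigation pages first."""
    for link in links:
        if link in known:
            continue
        known.add(link)
        if NAVIGATION.search(link):
            frontier.appendleft(link)   # visit soon: likely to list many links
        else:
            frontier.append(link)       # ordinary page: visit later


frontier, known = deque(), set()
enqueue(frontier, [
    "https://www.example.org/2021/some-article.html",
    "https://www.example.org/category/news/",
    "https://www.example.org/about.html",
    "https://www.example.org/archive/2020/",
], known)
print(list(frontier))
```

Both navigation pages end up ahead of the article and the about page, so they would be visited first and could contribute their links early in the crawl.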

With Python

Focused crawler

The focused_crawler function integrates all necessary components. It can be adjusted by a series of arguments:

>>> from trafilatura.spider import focused_crawler

# starting a crawl
>>> homepage = 'https://www.example.org'
>>> to_visit, known_urls = focused_crawler(homepage, max_seen_urls=10, max_known_urls=100000)
# resuming a crawl
>>> to_visit, known_urls = focused_crawler(homepage, max_seen_urls=10, max_known_urls=100000, todo=to_visit, known_links=known_urls)

The collected links can then be downloaded and processed. The links to visit (crawl frontier) are stored as a deque (a double-ended queue) which mostly works like a list. The known URLs are stored as a set. Both can also be converted to a list if necessary:

>>> to_visit, known_urls = list(to_visit), sorted(known_urls)
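Converting both structures to lists is useful, for instance, to store the state of a crawl in a file and resume it later. The following sketch shows one way to do this with JSON. The functions save_state and load_state are hypothetical helpers, not part of Trafilatura, and the URLs are made-up values standing in for the output of focused_crawler.

```python
import json
import os
import tempfile
from collections import deque


def save_state(path, to_visit, known_urls):
    """Write the frontier and the known URLs to a JSON file."""
    state = {"to_visit": list(to_visit), "known_urls": sorted(known_urls)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)


def load_state(path):
    """Read the state back and restore the original container types."""
    with open(path, encoding="utf-8") as handle:
        state = json.load(handle)
    return deque(state["to_visit"]), set(state["known_urls"])


# Round trip with made-up values standing in for focused_crawler() output.
to_visit = deque(["https://www.example.org/page2"])
known_urls = {"https://www.example.org/", "https://www.example.org/page2"}

with tempfile.TemporaryDirectory() as tmpdir:
    state_file = os.path.join(tmpdir, "crawl_state.json")
    save_state(state_file, to_visit, known_urls)
    restored_to_visit, restored_known = load_state(state_file)

print(restored_to_visit, restored_known)
```

The restored deque and set can then be passed on through the todo and known_links arguments shown above in order to resume the crawl.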

On the command-line

Three different options are available on the command-line:

  • --sitemap : try to find and use sitemaps

  • --crawl : crawl a fixed number of pages within the website

  • --explore : combination of sitemap and crawl

On the CLI the crawler automatically works its way through a website, stopping at a maximum of 30 page visits or exhaustion of the total number of pages on the website, whichever comes first.

$ trafilatura --crawl "https://www.example.org" > links.txt

It can also crawl websites in parallel by reading a list of target sites from a file (-i/--inputfile option).

Useful references

Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7), 161-172.

Olston, C., & Najork, M. (2010). Web crawling. Now Publishers Inc.