Tutorial: Gathering a custom web corpus#

Get your system up and running#

  1. Installation: see dedicated page

  2. Ensure that you have installed the latest version: pip install -U trafilatura (or pip3)

Note

The following consists of command-line instructions.

For an introduction and further details, see the documentation page on command-line usage.

Content discovery#

Web sources#

Sources used by Trafilatura can consist of previously known or listed web pages. Currently, functions to discover content within a website are available. Other methods include sifting through Wikipedia, social networks, or using lists of links gathered by other projects.

Hint

Please refer to the tutorial page on sources for detailed information.

Finding subpages within a website#

To gather web documents, it can be useful to download portions of a website programmatically, mostly to save time and resources. The retrieval and download of documents within a website is often called web crawling or web spidering. Web crawlers usually discover pages from links within the site and from other sites. Trafilatura supports three different ways to gather further links:

  1. Sitemaps

  2. Web feeds (Atom and RSS)

  3. Web crawling (see the corresponding documentation page)

A comprehensive overview of the available documents can be obtained faster and more efficiently using the first two methods than by systematically extracting and following links within a website.

The supported formats are all machine-readable rather than human-readable. They can also be used to automatically transfer information from one website to another without any human intervention. However, inspecting and filtering links before a systematic download is recommended to avoid undesired content and to avoid overstretching computing resources.
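The link inspection step mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Trafilatura's own filtering logic: the function name `filter_links`, the blocked path fragments, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def filter_links(urls, allowed_host, blocked_fragments=("/tag/", "/login", "/search")):
    """Keep unique http(s) URLs on the expected host that avoid unwanted paths.

    The blocked path fragments are illustrative defaults, not a recommendation.
    """
    seen = set()
    kept = []
    for url in urls:
        parts = urlsplit(url.strip())
        # Skip non-web schemes such as mailto: or javascript:
        if parts.scheme not in ("http", "https"):
            continue
        # Skip links pointing to other hosts
        if parts.netloc.lower() != allowed_host:
            continue
        # Skip paths that usually lead to undesired content
        if any(fragment in parts.path for fragment in blocked_fragments):
            continue
        # Drop the fragment so that page#a and page#b count as one document
        normalized = parts._replace(fragment="").geturl()
        if normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept


candidates = [
    "https://www.example.org/blog/post-1",
    "https://www.example.org/blog/post-1#comments",
    "https://www.example.org/tag/news",
    "https://ads.example.net/banner",
    "mailto:info@example.org",
]
filtered = filter_links(candidates, "www.example.org")
print(filtered)
```

Running a filter of this kind on a list of discovered links before downloading gives a chance to check how many documents would be fetched and whether they are the expected ones.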

In addition, trafilatura includes support for multilingual and multinational sitemaps. For example, a site can target English language users through links like http://www.example.com/en/… and German language users through http://www.example.com/de/….
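To make the idea of language-specific links concrete, here is a minimal sketch that keeps only the URLs whose first path segment matches a language code. It is a simple heuristic written for this tutorial, not the way Trafilatura detects the target language, and the helper name `filter_by_language` is hypothetical. Real websites may signal language differently, for example through subdomains or metadata.

```python
from urllib.parse import urlsplit


def filter_by_language(urls, lang):
    """Keep URLs whose first path segment equals the given language code."""
    kept = []
    for url in urls:
        # Split the path into non-empty segments, e.g. "/de/kontakt" -> ["de", "kontakt"]
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if segments and segments[0].lower() == lang.lower():
            kept.append(url)
    return kept


urls = [
    "http://www.example.com/en/about",
    "http://www.example.com/de/ueber-uns",
    "http://www.example.com/de/kontakt",
    "http://www.example.com/contact",
]
german = filter_by_language(urls, "de")
print(german)
```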

Sitemaps#

A sitemap is a file that lists the visible or whitelisted URLs for a given site, the main goal being to reveal where machines can look for content. Web crawlers usually discover pages from links within the site and from other sites, following a series of rules and protocols. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata.

The sitemaps protocol primarily allows webmasters to inform search engines about pages on their sites that are available for crawling. Sitemaps follow the XML format, so each sitemap is or should be a valid XML file.
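Since a sitemap is an XML file, its URLs and metadata can be read with a standard XML parser. The following sketch uses Python's standard library on a small hand-written sample. The sample content and the function name `parse_sitemap` are assumptions made for illustration, and the code does not handle sitemap index files, compressed sitemaps, or malformed XML.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps protocol (version 0.9)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample for illustration only
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.org/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.org/blog/post-1</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return a list of (url, lastmod) tuples; lastmod is None if absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries


entries = parse_sitemap(SAMPLE_SITEMAP)
for loc, lastmod in entries:
    print(loc, lastmod)
```

The optional `lastmod` value is an example of the associated metadata: it can help decide which pages are worth downloading again.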

Sitemaps are particularly useful for large or complex websites since they are designed so that machines can crawl the site more intelligently. This is particularly true if there is a chance of overlooking some of the new or recently updated content, for example because some areas of the website are not available through the browsable interface, or because the website has a huge number of pages that are isolated or not well linked together.

Feeds#

A web feed (or news feed) is a data format used for providing users with frequently updated content. This process is also called web syndication: content from one website is made available to other sites.

Most commonly, feeds are made available to provide either summaries or full renditions of a website’s recently added content. The term may also describe other kinds of content licensing for reuse. The kinds of content delivered by a web feed are typically HTML (webpage content) or links to webpages and other kinds of digital media. Many news websites, weblogs, schools, and podcasters operate web feeds. The feed icon is commonly used to indicate that a web feed is available.

Trafilatura supports XML-based feeds in the two common formats, Atom and RSS.
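The two formats store links differently: RSS 2.0 puts the URL in the text of a `link` element inside each `item`, while Atom uses the `href` attribute of a `link` element inside each `entry`. The sketch below shows one way to read both with Python's standard library. It is not Trafilatura's implementation; the function name `extract_feed_links` and the sample feeds are made up for illustration, and older RSS variants or malformed feeds are not covered.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hand-written samples for illustration only
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example</title>
    <link>https://www.example.org/</link>
    <item><title>Post 1</title><link>https://www.example.org/post-1</link></item>
    <item><title>Post 2</title><link>https://www.example.org/post-2</link></item>
  </channel>
</rss>
"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry>
    <title>Post 3</title>
    <link href="https://www.example.org/post-3"/>
  </entry>
</feed>
"""


def extract_feed_links(xml_text):
    """Return the entry links found in an RSS 2.0 or Atom feed."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    links = []
    if root.tag == "rss":
        # RSS 2.0: the URL is the text content of <item><link>
        for item in root.iter("item"):
            link = item.findtext("link")
            if link and link.strip():
                links.append(link.strip())
    elif root.tag == ATOM_NS + "feed":
        # Atom: the URL is the href attribute of <entry><link>
        for entry in root.iter(ATOM_NS + "entry"):
            for link in entry.findall(ATOM_NS + "link"):
                # A missing rel attribute is treated as "alternate"
                if link.get("rel", "alternate") == "alternate" and link.get("href"):
                    links.append(link.get("href"))
                    break
    return links


rss_links = extract_feed_links(RSS_SAMPLE)
atom_links = extract_feed_links(ATOM_SAMPLE)
print(rss_links)
print(atom_links)
```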