Tutorial: Gathering a custom web corpus

Get your system up and running

  • Installation: see the dedicated page
  • Ensure that you have installed the latest version: pip install -U trafilatura

The following instructions use the command-line interface (CLI):

Find and filter sources

Finding subpages within a website

Sources used by Trafilatura may consist of previously known web pages as well as pages listed elsewhere. It can also be useful to operate at website level by downloading portions of a website programmatically. To this end, sitemaps come in handy: a sitemap is a file that lists the visible URLs of a given site. For more information, refer to this blog post explaining how to use sitemaps to retrieve URLs within a website.
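If you prefer Python over the command line, the sitemaps module offers the same functionality. A minimal sketch, assuming the homepage address is known (the URL below is a placeholder):

# gather the page URLs listed in the sitemaps of a website
from trafilatura.sitemaps import sitemap_search

links = sitemap_search("https://www.example.org/")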

URLs can also be extracted from feeds, and a future version will allow for automatic extraction of sitemap URLs. For more information, see link discovery in the core functions documentation.
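The feeds module works the same way in Python. A minimal sketch, again with a placeholder URL:

# gather the page URLs advertised in the feeds of a website
from trafilatura.feeds import find_feed_urls

links = find_feed_urls("https://www.example.org/")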

Filtering with coURLan

It is best to examine a list of URLs for content adequacy before downloading, most notably to make download and extraction more efficient by removing unwanted and redundant content. The courlan software package is installed along with trafilatura. It separates the wheat from the chaff by focusing on non-spam, text-rich HTML pages, and it can be used on the command line:

courlan --inputfile raw-list.txt --outputfile filtered-list.txt
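The same filtering can also be scripted with courlan's Python functions. A minimal sketch, assuming one URL per line in the input file; check_url() returns a cleaned (URL, domain) pair, or None for links deemed unsuitable:

# keep only URLs pointing to non-spam, text-rich pages
from courlan import check_url

with open("raw-list.txt", "r", encoding="utf-8") as infile, \
     open("filtered-list.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        result = check_url(line.strip())
        if result is not None:
            outfile.write(result[0] + "\n")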

Custom filtering

URL lists can be filtered manually or with grep, a command-line utility that searches text data, operates at line level, and returns either matching or non-matching lines.

  • Matching relevant links: grep "/article/" mylist.txt > filtered-list.txt
  • Exclusion criteria: grep -v "/video/" mylist.txt > filtered-list.txt

For further filtering options with grep, see this grep tutorial.

Other relevant utilities include sort and shuf:

# sort the links and make sure they are unique
sort -u myfile.txt > myfile-sorted.txt
# alternatives to shuffle the URLs
sort -R myfile.txt > myfile-random.txt
shuf myfile.txt > myfile-random.txt

To draw a random sample from a list of URLs, head or tail come in handy after a random sorting: shuf myfile.txt | head -n 100 > myfile-random-sample.txt
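The same operation can be performed with Python's standard library, with the added benefit that a fixed seed makes the sample reproducible. A minimal sketch with placeholder file names:

# draw a reproducible random sample of 100 URLs
import random

with open("myfile.txt", "r", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]
random.seed(42)  # fix the seed so the sample can be reproduced
sample = random.sample(urls, min(100, len(urls)))
with open("myfile-random-sample.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sample) + "\n")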

Trafilatura automatically sorts the input list to optimize the download order and ensures that the input URLs are unique; it is not mandatory to perform these steps yourself.
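Once the list is ready, downloads and extraction can also be run from Python. A minimal sketch using the fetch_url() and extract() functions; the file name and the XML output format are choices made for this example:

# download the pages in the filtered list and extract their text as XML
from trafilatura import fetch_url, extract

with open("filtered-list.txt", "r", encoding="utf-8") as f:
    for url in (line.strip() for line in f):
        downloaded = fetch_url(url)
        if downloaded is not None:
            result = extract(downloaded, output_format="xml")
            # result is a string (or None), ready to be written to a file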

Work with the data

See A Gentle Introduction to XML, or the xmltodict module, which provides a way to read the files directly and work with the data as if it were JSON.
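A minimal sketch of reading an extracted XML file with xmltodict; the file name is a placeholder:

# parse an XML document into nested dictionaries
import xmltodict

with open("document.xml", "rb") as f:
    data = xmltodict.parse(f)
# data can now be traversed like a JSON-style structure of dicts and lists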

The textometry platform TXM can read both XML and TEI-XML files and perform annotation and exploration of corpus data.

Different solutions in Python:

  • For natural language processing, see this list of open-source/off-the-shelf NLP tools for German and further lists for other languages.