Tutorial: Gathering a custom web corpus

Get your system up and running

  • Installation: see dedicated page
  • Making sure you have the latest version: pip install -U trafilatura
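
A quick sanity check after installing or updating is to print the installed version; recent releases of the command-line tool expose a corresponding flag (run trafilatura --help if it is not available in your version):

# display the installed version
trafilatura --version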

The following instructions use the command-line interface (CLI):

Find and filter sources

The sources can consist of previously known and listed web pages. It can also be useful to operate at website level by downloading portions of a website programmatically. A sitemap is a file that lists the visible URLs of a given site; for more information, see this blog post explaining how to use sitemaps to retrieve URLs within a website.
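
Trafilatura's command-line interface can gather such URLs itself: the --sitemap option reads a site's sitemaps and, combined with --list, prints the discovered links instead of downloading the pages. A short sketch (the URL is a placeholder to replace with the target website):

# collect the URLs listed in the sitemaps of a website
trafilatura --sitemap "https://www.example.org/" --list > mylist.txt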

URL lists can be filtered manually or with grep, a command-line utility that searches text data line by line and returns either matching or non-matching lines:

  • Matching relevant links: grep "/article/" mylist.txt > filtered-list.txt
  • Exclusion criteria: grep -v "/video/" mylist.txt > filtered-list.txt

For further filters see this grep tutorial.
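
Both criteria can also be chained in a single pipeline, for example to keep article pages while discarding video pages (the patterns are placeholders to adapt to the target website):

# combine inclusion and exclusion filters in one pass
grep "/article/" mylist.txt | grep -v "/video/" > filtered-list.txt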

Other relevant utilities are sort and shuf:

# sort the links and make sure they are unique
sort -u myfile.txt > myfile-sorted.txt
# alternatives to shuffle the URLs
sort -R myfile.txt > myfile-random.txt
shuf myfile.txt > myfile-random.txt

To draw a random sample from a list of URLs, head or tail come in handy after shuffling: shuf myfile.txt | head -n 100 > myfile-random-sample.txt

Trafilatura automatically sorts the input list to optimize the download order and ensures that the input URLs are unique; it is not mandatory to perform these steps yourself.
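
Once the list is ready, it can be passed to trafilatura for downloading and extraction. A minimal sketch using the short options for input file and output directory (option names can vary slightly between versions; check trafilatura --help):

# download the pages from the list and store the extracted texts as XML files
trafilatura --xml -i filtered-list.txt -o extracted/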

Work with the data

See A Gentle Introduction to XML or the xmltodict module, which provides a way to read the files directly and work with the data as if it were in JSON format.

The textometry platform TXM can read both XML and TEI-XML files and perform annotation and exploration of corpus data.

Different solutions in Python:

For natural language processing, see this list of open-source/off-the-shelf NLP tools for German and further lists for other languages.