Finding sources for web corpora#

Existing resources#

Corpora#

URL lists from corpus linguistics projects can be a good starting point, either to recreate existing corpora or to re-crawl the websites and find new content. If the websites no longer exist, the links can still be useful, as the corresponding web pages can often be retrieved from web archives.

URL directories#

DMOZ (now available as an archive) and Wikipedia work quite well as primary sources.

Searching for URLs#

The Common Crawl is a good place to start looking for already known URLs, and possibly for the corresponding pages stored by the project. So is the Internet Archive (with a different focus):

  • getallurls (gau) to fetch known URLs from the Wayback Machine and the Common Crawl (among others)

  • cdx_toolkit (toolkit for CDX indices such as Common Crawl and the Internet Archive’s Wayback Machine) & Python example, see also the sketch after this list

  • Python script to extract all URLs known by the Internet Archive for a given domain
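
For instance, here is a minimal sketch of how URLs known for a given domain could be listed with cdx_toolkit (assuming the package is installed; the domain and date range below are mere placeholders):

import cdx_toolkit

# query the Common Crawl index ('ia' would target the Internet Archive instead)
cdx = cdx_toolkit.CDXFetcher(source='cc')

# iterate over captures recorded for an example domain within a given period
for capture in cdx.iter('example.org/*', from_ts='202301', to='202312', limit=50):
    print(capture['timestamp'], capture['status'], capture['url'])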

Related info: storing web documents in Internet archives for later retrieval can be fruitful, see for instance the tool archivenow.

With particular filters, one may look for specific kinds of sources as well. Here is for instance a regular expression targeting feeds, as used in a study on web syndication feeds:

"(<link[^>]*(?:\s(?:type=[\"']?(application\/rss\+xml|application\/atom\+xml|application\/rss|application\/atom|application\/rdf\+xml|application\/rdf|text\/rss\+xml|text\/atom\+xml|text\/rss|text\/atom|text\/rdf\+xml|text\/rdf|text\/xml|application\/xml)[\"']?|rel=[\"']?(?:alternate)[\"']?))[^>]*>)"

Feeds discovered on social networks can also be used for corpus construction (Minocha et al. 2013).

Search engines#

The BootCaT approach (Baroni & Bernardini 2004) uses randomly generated search engine queries and gathers the links in the results (seed URLs). The queries consist of several randomly combined word seeds.

Here is how to make this method work in a modular way:

  1. First, you need a list of words in the target language(s). For German see for instance the DWDS list.

  2. Then, draw random word tuples, e.g. with Python:

>>> import random
>>> # use the list gathered in step 1
>>> wordlist = ['word1', 'word2', 'word3', 'word4']  # and so on
>>> # draw 3 random words from the list
>>> selection = random.sample(wordlist, k=3)
  3. Get URL results from search engines for the random tuples. Here are examples of Python modules to query search engines: search-engine-parser and GoogleScraper.

One of the main drawbacks of the BootCaT method is that it is not stable over time: both search engines and scraper modules may stop working as intended. In that case it becomes necessary to look for alternatives, for example by searching for concepts like “SERP” and “search engine scraping”.

  4. Download and process the link list with Trafilatura, see usage.
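
A minimal sketch of this last step using Trafilatura’s Python functions (the URL below is a placeholder for the link list gathered in step 3):

from trafilatura import fetch_url, extract

seed_urls = ['https://example.org/article']  # placeholder for the links from step 3

for url in seed_urls:
    downloaded = fetch_url(url)     # None if the download fails
    if downloaded is not None:
        text = extract(downloaded)  # main content extraction
        print(text)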

Hint

For more information, see the corresponding blog post: Replicating the BootCaT method to build web corpora from search engines.

Selecting random documents from the Web#

A model for web texts is described, along with some experiments, in the PhD thesis preceding the work on this library. Here are criteria you could use (a small sketch follows the list below):

  • General text form, line and sentence lengths, etc.

  • Proportion of discourse and temporal markers
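
As a rough, purely illustrative sketch (the marker list and the way of counting are assumptions, not values taken from the thesis), such indicators can be computed as follows:

# toy discourse/temporal marker list, for illustration only
MARKERS = {'however', 'therefore', 'because', 'then', 'now', 'yesterday', 'today'}

def text_indicators(text):
    '''Return average line length and proportion of marker words in a document.'''
    lines = [line for line in text.splitlines() if line.strip()]
    words = [w.strip('.,;:!?').lower() for w in text.split()]
    avg_line_length = sum(len(line) for line in lines) / len(lines) if lines else 0
    marker_ratio = sum(w in MARKERS for w in words) / len(words) if words else 0
    return avg_line_length, marker_ratio

print(text_indicators('A first line. However, it is short.\nThen a second line follows now.'))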

For more see Indicators for intrinsic quality assessment (section of PhD thesis).

See also the blog post What is good enough to become part of a web corpus?

Social networks#

A series of surface scrapers crawl social networks without even logging in, thus circumventing API restrictions. Development of such software solutions is fast-paced, so no links are listed here at the moment.

Previously collected tweet IDs can be “hydrated”, i.e. the corresponding tweets can be retrieved from Twitter in bulk with dedicated tools.

Links can be extracted from tweets with a regular expression such as re.findall(r'https?://[^ ]+'). The extracted URLs usually need to be resolved in order to get the actual link targets and not just shortened URLs (like t.co/…).
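
A possible way to combine extraction and resolution in Python, using the requests library to follow redirects (a HEAD request is assumed to be enough here; the tweet text and short link are made up):

import re
import requests

tweet = 'Interesting read https://t.co/abc123 via @someone'  # made-up example

# extract raw URLs from the tweet text
raw_urls = re.findall(r'https?://[^ ]+', tweet)

# resolve shorteners such as t.co by following redirects
for url in raw_urls:
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        print(url, '->', response.url)
    except requests.RequestException:
        pass  # unreachable or invalid link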

For further ideas from previous projects see references below.

Remarks#

For relatively small and focused corpora, human supervision is key. It is advisable to keep an eye on all steps of corpus construction.

A crawling method using diverse seeds for corpus building can yield better results and notably ensure better randomness in a population of web documents (see Henzinger et al. 2000).

Screening and refining the lists of URLs you use for your projects can also enhance corpus quality, see for example the implementation details in the papers mentioned below as well as the filtering tool courlan included with Trafilatura.
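
As a brief sketch, assuming courlan’s check_url function, which validates and cleans a single URL and returns None for links that do not pass its filters:

from courlan import check_url

candidates = [
    'https://example.org/article/2023/01/a-page.html',
    'not-a-valid-url',
]

for url in candidates:
    result = check_url(url)  # a (cleaned URL, domain) tuple, or None if rejected
    if result is not None:
        cleaned_url, domain = result
        print(cleaned_url, domain)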

The blog posts mentioned throughout this page give more insights on aspects of web corpus construction.

References#

  • Barbaresi, A. (2014). Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources. In 9th Web as Corpus Workshop (WaC-9), 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 1-8).

  • Barbaresi, A. (2015). Ad hoc and general-purpose corpus construction from web sources (Doctoral dissertation, ENS Lyon).

  • Barbaresi, A. (2016). Collection and indexing of tweets with a geographical focus. In Proceedings of the CMLC Workshop, 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 24-27).

  • Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping Corpora and Terms from the Web. In Proceedings of LREC 2004 (pp. 1313-1316).

  • Berners-Lee, T., Hall, W., & Hendler, J. A. (2006). A framework for web science. Foundations and Trends in Web Science, 1(1), 1–130.

  • Blombach, A., Dykes, N., Heinrich, P., Kabashi, B., & Proisl, T. (2020). A corpus of German Reddit exchanges (GeRedE). In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 6310-6316).

  • Henzinger, M. R., Heydon, A., Mitzenmacher, M., & Najork, M. (2000). On near-uniform URL sampling. Computer Networks, 33(1-6), 295-308.

  • Jauhiainen, H., Jauhiainen, T., & Lindén, K. (2020). Building web corpora for minority languages. In Proceedings of the 12th Web as Corpus Workshop (pp. 23-32).

  • Minocha, A., Reddy, S., & Kilgarriff, A. (2013). Feed Corpus: an ever growing up-to-date corpus. In Proceedings of the 8th Web as Corpus Workshop (pp. 1-4). ACL SIGWAC.

  • Schäfer, R., Barbaresi, A., & Bildhauer, F. (2014). Focused web corpus crawling. In Proceedings of the 9th Web as Corpus Workshop (WaC-9) (pp. 9-15).