Finding URLs for web corpora

Webpages are used as sources to build corpora and also to gather links to start crawls from (i.e. “seeds”). Corpora can focus on a given range of websites or topics, or be merely language-centered and opportunistically take all encountered texts into account. In the latter case, drawing URL seeds from diverse sources helps to avoid introducing unknown biases.

Existing corpus resources

URL lists from corpus linguistic projects can be a starting point, either to recreate existing corpora or to re-crawl the websites and find new content. If the websites don’t exist anymore, the links can still be useful, as the corresponding web pages can often be retrieved from archives.
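To check whether a now-defunct page is still retrievable, the Internet Archive offers the Wayback Machine availability API. Here is a minimal sketch: the query-building and answer-parsing helpers are hypothetical names, and the network call itself is left commented out.

```python
from typing import Optional
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url: str) -> str:
    """Build a Wayback Machine availability query for a (possibly dead) URL."""
    return f"{WAYBACK_API}?{urlencode({'url': url})}"

def closest_snapshot(api_response: dict) -> Optional[str]:
    """Extract the closest archived snapshot URL from the API's JSON answer."""
    snap = api_response.get("archived_snapshots", {}).get("closest", {})
    return snap.get("url") if snap.get("available") else None

# Example (network call, commented out to keep the sketch offline):
# import json
# from urllib.request import urlopen
# with urlopen(availability_query("http://example.com/")) as resp:
#     print(closest_snapshot(json.load(resp)))
```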

URL directories

DMOZ (at the time of the experiments) and Wikipedia work quite well as primary sources.
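External links can be harvested from Wikipedia articles through the MediaWiki API (`prop=extlinks`). A sketch, assuming the standard JSON answer format; the helper names are illustrative, and paging through long link lists is omitted.

```python
from typing import List
from urllib.parse import urlencode

def extlinks_query(title: str, lang: str = "en") -> str:
    """Build a MediaWiki API query for the external links of one article."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": "max",
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

def parse_extlinks(api_response: dict) -> List[str]:
    """Collect the URLs from a prop=extlinks JSON answer."""
    urls = []
    for page in api_response.get("query", {}).get("pages", {}).values():
        for link in page.get("extlinks", []):
            urls.append(link.get("*", ""))
    return urls
```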

Searching for URLs

The Common Crawl is a good place to start looking for already known URLs, and possibly for the corresponding pages stored by the project. So is the Internet Archive (with a different focus).
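Known captures for a URL pattern can be looked up in the Common Crawl CDX index (index.commoncrawl.org), which answers with newline-delimited JSON records. A minimal sketch, with illustrative helper names; the crawl identifier used as a default is an assumption and should be replaced by a current one.

```python
import json
from typing import List
from urllib.parse import urlencode

def cc_index_query(pattern: str, crawl: str = "CC-MAIN-2023-50") -> str:
    """Build a query against the Common Crawl CDX index for a URL pattern.

    The crawl name is an example; pick an existing crawl identifier."""
    params = {"url": pattern, "output": "json"}
    return f"https://index.commoncrawl.org/{crawl}-index?{urlencode(params)}"

def parse_index_lines(ndjson: str) -> List[str]:
    """Extract the captured URLs from the newline-delimited JSON answer."""
    return [json.loads(line)["url"] for line in ndjson.splitlines() if line.strip()]
```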

Related info: storing web documents in Internet archives before retrieving them can be fruitful, see for instance the tool archivenow.

#!/bin/sh
# Download a Common Crawl WAT file and store the URLs it mentions.
# The base URL below is an assumption; adjust it to the current
# Common Crawl data endpoint if needed.
set -e
url="https://data.commoncrawl.org/$1"
dir="$(dirname "$1")"
name="$(basename "$1")"
fpath="$dir/$name"

mkdir -p "$dir"
if [ ! -r "$fpath" ]; then
  curl -s --retry 5 "$url" \
    | zcat \
    | grep -i 'WARC-TARGET-URI:' \
    | awk '{print $2}' \
    | gzip > "$fpath"
fi

If saved as dl-wat, one could then run it as follows:

$ zcat wat.paths.gz | xargs -P32 -n1 dl-wat


Social Networks

Series of surface scrapers crawl the networks without even logging in, thus circumventing API restrictions. Development of such software is fast-paced, so no links are listed here at the moment.

Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitter in bulk; see for instance:

Links can be extracted from tweets with a regular expression such as re.findall(r'https://[^ ]+', text). They probably need to be resolved first to get actual link targets and not just shortened URLs (like…).
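A sketch of both steps: extraction with the regular expression above, and resolution by following redirects with a HEAD request (urllib follows redirects by default, and the final response reports the landing URL). The helper names are illustrative.

```python
import re
from typing import List
from urllib.request import Request, urlopen

def extract_links(text: str) -> List[str]:
    """Pull candidate URLs out of a tweet's text."""
    return re.findall(r'https://[^ ]+', text)

def resolve(url: str, timeout: float = 10.0) -> str:
    """Follow the redirects of a (possibly shortened) URL and return the target.

    A HEAD request avoids downloading the document body."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return resp.geturl()

# resolve() performs a network request, e.g.:
# resolve("https://short.example/abc")  # hypothetical shortened URL
```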

For further ideas from previous projects see:

Search Engines

The BootCat approach (Baroni & Bernardini 2004) rests on the assumption that search engine queries made of randomly combined words will lead to cross-domain text collections. The queries consist of several word seeds combined at random, first coming from an initial list and later from unigrams extracted from the corpus itself. The resulting seed URLs are then used as a starting point for web crawlers.
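The query-generation step can be sketched as follows — a minimal illustration of the idea, not the BootCaT tool itself; the function name and parameters are assumptions.

```python
import random
from typing import List, Optional

def make_queries(seeds: List[str], n_queries: int = 10,
                 words_per_query: int = 3,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Generate BootCaT-style queries: random word tuples drawn from the seeds."""
    rng = rng or random.Random()
    return [" ".join(rng.sample(seeds, words_per_query))
            for _ in range(n_queries)]
```

Each query would then be sent to a search engine, and the returned URLs collected as crawl seeds; unigrams from the downloaded pages can feed the next round of queries.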

Because of the increasing limitations of search engine APIs, the querying process is impractical or slow on a very limited financial budget. All in all, the APIs may be too expensive and/or too unstable over time to support large-scale corpus building projects. Moreover, the question whether the method used so far provides a good overview of a language is still open. Other technical difficulties include diverse and partly unknown search biases related to search engine optimization tricks, as well as undocumented PageRank adjustments.

Marco Baroni and Silvia Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004.

Selecting random documents from the Web

A model for web texts, along with some experiments, is described in the PhD thesis preceding the work on this library. Here are criteria you could use:

  • General text form, line and sentence lengths, etc.

  • Proportion of discourse and temporal markers

For more see Indicators for intrinsic quality assessment (section of PhD thesis).
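The criteria above can be approximated with a few simple indicators — a rough sketch with hypothetical names, using a tiny illustrative marker list (a real one would be language-specific and much longer):

```python
import re
from statistics import mean
from typing import Dict

# Illustrative only -- a real marker list would be language-specific.
DISCOURSE_MARKERS = {"however", "therefore", "moreover", "yesterday", "today"}

def text_indicators(text: str) -> Dict[str, float]:
    """Compute rough intrinsic-quality indicators for a document."""
    lines = [l for l in text.splitlines() if l.strip()]
    sentences = [s for s in re.split(r'[.!?]+\s*', text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    markers = sum(1 for w in words if w in DISCOURSE_MARKERS)
    return {
        "avg_line_length": mean(len(l) for l in lines) if lines else 0.0,
        "avg_sentence_length": mean(len(s.split()) for s in sentences) if sentences else 0.0,
        "marker_proportion": markers / len(words) if words else 0.0,
    }
```

Documents scoring far from the values typical of running text (e.g. very short lines, or no discourse markers at all) are candidates for exclusion.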

Remarks and references

A crawling method using diverse seeds for corpus building can yield better results and notably ensure better randomness in a population of web documents (see Henzinger et al. 2000).

Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web conference on Computer Networks, pages 295–308. North-Holland Publishing Company.