Working with corpus data#

After gathering texts from the Web, what comes next? This page lists options for working with the output generated by Trafilatura.

Generic solutions in Python#

Data science#

Natural language processing#

For a hands-on approach to NLP pipelines, see TextBlob or the Natural Language Toolkit (NLTK).
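
Before reaching for one of these libraries, the first steps of such a pipeline (tokenization, normalization, counting) can be sketched with the standard library alone. The regular-expression tokenizer, the function name and the sample sentence below are illustrative assumptions; the libraries mentioned above ship more robust tokenizers.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=2):
    """Lowercase the text, split it into word tokens and count them."""
    # Naive tokenization (assumption): every run of word characters is a token.
    tokens = re.findall(r"\w+", text.lower())
    # Return the top_n most frequent tokens with their counts.
    return Counter(tokens).most_common(top_n)


sample = "The cat sat on the mat. The dog sat too."
print(word_frequencies(sample))
```

The same function could be applied to the content of a TXT file produced during extraction, for instance after reading it with `open(...).read()`.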

Accessible tutorials:
Specific tools:
  • Topic modeling, including word2vec models: Gensim tutorials

  • Scattertext is a tool for finding distinguishing terms in corpora, and presenting them in an interactive scatter plot.

Formats and software used in corpus linguistics#

Input/Output formats: TXT, XML and XML-TEI are quite frequent in corpus linguistics.

The XML and XML-TEI formats#

See A Gentle Introduction to XML, or the Python package xmltodict, which provides a way to read the files directly and work with the data as if it were in JSON format.
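
To illustrate the idea of treating XML output as JSON-like data, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` instead of xmltodict so that it runs without extra dependencies. The sample document is a hypothetical, simplified imitation of an XML output file, and the function name `xml_to_record` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating a simplified XML output file; real files
# may carry more attributes and elements.
SAMPLE_XML = """<doc title="Example page" url="https://example.org/page">
  <main>
    <head>First section</head>
    <p>This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </main>
</doc>"""


def xml_to_record(xml_string):
    """Turn one XML document into a plain, JSON-like dictionary."""
    root = ET.fromstring(xml_string)
    return {
        # Metadata is read from the attributes of the root element.
        "metadata": dict(root.attrib),
        # Collect the text of every <p> element, wherever it is nested.
        "paragraphs": ["".join(p.itertext()).strip() for p in root.iter("p")],
    }


record = xml_to_record(SAMPLE_XML)
print(record["metadata"]["title"])
print(len(record["paragraphs"]))
```

With xmltodict installed, `xmltodict.parse(xml_string)` returns a nested dictionary in one call, with attribute names prefixed by `@` by default. The explicit version above trades that convenience for control over which elements are kept.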

Corpus analysis tools#

  • AntConc is expected to work with TXT files

  • CorpusExplorer supports CSV, TXT and various XML formats

  • Corpus Workbench (CWB) uses verticalized texts whose origin can be in TXT or XML format

  • LancsBox supports various formats, notably TXT & XML

  • TXM (textometry platform) can take TXT, XML & XML-TEI files as input

  • Voyant supports various formats, notably TXT, XML & XML-TEI

  • Wmatrix can work with TXT and XML

  • WordSmith supports TXT and XML
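
Verticalized text, as used by tools such as Corpus Workbench, places one token per line, with structural tags on lines of their own. The following is a rough sketch of such a conversion from plain text. The regular expressions for tokens and sentence boundaries, the `<text>` and `<s>` tag names, and the function name `verticalize` are simplifying assumptions, not the format required by any particular tool.

```python
import re
from xml.sax.saxutils import quoteattr

# Simplistic tokenizer (assumption): runs of word characters, or single
# punctuation marks. Real projects would use a proper tokenizer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def verticalize(text, text_id):
    """Return the text with one token per line, wrapped in <text> and <s> tags."""
    lines = [f"<text id={quoteattr(text_id)}>"]
    for sentence in SENTENCE_RE.split(text.strip()):
        tokens = TOKEN_RE.findall(sentence)
        if not tokens:
            continue  # skip empty fragments
        lines.append("<s>")
        lines.extend(tokens)
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines)


print(verticalize("Hello world. How are you?", "doc1"))
```

Tokens that themselves look like markup (for example a literal `<`) are not escaped here and may need extra handling before such a file is passed to an encoder.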

Further corpus analysis software can be found on

Generic NLP solutions#

For natural language processing, see this list of open-source/off-the-shelf NLP tools for German and further lists for other languages.