Working with corpus data#
What comes next after gathering texts from the Web? This page lists options for working with the output generated by Trafilatura.
Generic solutions in Python#
Natural language processing#
Formats and software used in corpus linguistics#
Input/Output formats: TXT, XML and XML-TEI are quite frequent in corpus linguistics.
Han, N.-R. (2022). “Transforming Data”, The Open Handbook of Linguistic Data Management.
The XML and XML-TEI formats#
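As a minimal sketch of how an XML export can be turned into plain text for tools that only read TXT, the following uses only the Python standard library. The sample document, its element names (`doc`, `main`, `head`, `p`, `hi`) and the `title` attribute are assumptions chosen for illustration and may not match the output of a given Trafilatura version. The helper `xml_to_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for an XML export; the element and
# attribute names are assumptions, not a guaranteed schema.
SAMPLE_XML = """<doc title="Example page">
  <main>
    <head>Introduction</head>
    <p>First paragraph with <hi>inline</hi> markup.</p>
    <p>Second paragraph.</p>
  </main>
</doc>"""


def xml_to_text(xml_string, tags=("head", "p")):
    """Return the text of the selected elements, one element per line."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if element.tag in tags:
            # itertext() also collects text nested in inline child elements;
            # split/join normalizes internal whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                lines.append(text)
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_to_text(SAMPLE_XML))
```

Selecting elements by tag name keeps the structural information available for as long as possible, so the same XML file can feed both XML-aware tools and TXT-only tools.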
Corpus analysis tools#
AntConc is expected to work with TXT files
CorpusExplorer supports CSV, TXT and various XML formats
Corpus Workbench (CWB) uses verticalized texts whose origin can be in TXT or XML format
LancsBox supports various formats, notably TXT & XML
TXM (textometry platform) can take TXT, XML & XML-TEI files as input
Voyant supports various formats, notably TXT, XML & XML-TEI
Wmatrix can work with TXT and XML
WordSmith supports TXT and XML
Further corpus analysis software can be found on corpus-analysis.com.
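The verticalized texts mentioned above for the Corpus Workbench place one token per line, with structural tags on lines of their own. The sketch below illustrates the idea with a deliberately naive regular-expression tokenizer. The function `to_vertical`, the `<text>` and `<p>` tags and the `id` attribute are assumptions made for illustration, and the exact layout expected by a given CWB setup (attributes, sentence tags, tokenization, escaping of special characters) may differ.

```python
import re
from xml.sax.saxutils import quoteattr

# Naive tokenizer: runs of word characters, or single punctuation marks.
# A real pipeline would likely use a dedicated tokenizer instead.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def to_vertical(text, text_id="doc1"):
    """Convert plain text to a one-token-per-line layout.

    Each non-empty input line is treated as a paragraph and wrapped
    in <p> tags; the whole document is wrapped in a <text> tag.
    """
    lines = [f"<text id={quoteattr(text_id)}>"]
    for paragraph in text.split("\n"):
        tokens = TOKEN_PATTERN.findall(paragraph)
        if not tokens:
            continue  # skip empty lines
        lines.append("<p>")
        lines.extend(tokens)
        lines.append("</p>")
    lines.append("</text>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_vertical("Hello, world!\n\nSecond line."))
```

Keeping paragraph boundaries as structural tags means they remain available as query restrictions later on, which plain token lists would lose.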