Working with corpus data
========================

.. meta::
    :description lang=en:
        After gathering texts from the Web, what to do next? This page lists options to work with output generated by Trafilatura.


After gathering texts from the Web, what to do next? This page lists options to work with output generated by Trafilatura.


Generic solutions in Python
---------------------------

Data science
~~~~~~~~~~~~

- Load the input into the data analysis library Pandas:

  - `read_csv `_
  - `read_json `_


Natural language processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a hands-on approach to NLP pipelines, see `Textblob `_ or the `Natural Language Toolkit (NLTK) `_.

Accessible tutorials:

- `Part-of-Speech Tagging `_
- `TF-IDF with Scikit-Learn `_

Specific tools:

- Topic modeling, including word2vec models: `Gensim tutorials `_
- `Scattertext `_ is a tool for finding distinguishing terms in corpora and presenting them in an interactive scatter plot.


Formats and software used in corpus linguistics
-----------------------------------------------

Input/Output formats: TXT, XML and XML-TEI are quite frequent in corpus linguistics.

- Han, N.-R. (2022). "`Transforming Data `_", The Open Handbook of Linguistic Data.


The XML and XML-TEI formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~

See `A Gentle Introduction to XML `_ for background, or use the Python package `xmltodict `_, which reads the files directly and lets you work with the data as if it were in JSON format.

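As a rough illustration of treating XML data like JSON, the sketch below converts one document into a plain dictionary. It relies only on the standard-library module ``xml.etree.ElementTree`` rather than ``xmltodict``. The sample document, its element names (``doc``, ``main``, ``p``, ``hi``) and the helper ``xml_to_record`` are assumptions made for this example, so the paths need adapting to the files you actually have.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a simplified stand-in for an XML export,
# not a guaranteed description of any tool's real output layout.
SAMPLE_XML = """<doc title="Example page" date="2022-01-01">
  <main>
    <head>Introduction</head>
    <p>First paragraph.</p>
    <p>Second paragraph with <hi>emphasis</hi> inside.</p>
  </main>
</doc>"""


def xml_to_record(xml_string):
    """Turn one XML document into a flat, JSON-like dictionary."""
    root = ET.fromstring(xml_string)
    # Attributes of the root element become top-level keys.
    record = dict(root.attrib)
    # itertext() also collects text nested in inline elements such as <hi>.
    record["paragraphs"] = [
        "".join(p.itertext()).strip() for p in root.iter("p")
    ]
    return record


record = xml_to_record(SAMPLE_XML)
print(record["title"])
print(len(record["paragraphs"]))
```

A list of such dictionaries can then be serialized with the ``json`` module or passed to a data analysis library.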

Corpus analysis tools
~~~~~~~~~~~~~~~~~~~~~

- `Antconc `_ is expected to work with TXT files
- `CorpusExplorer `_ supports CSV, TXT and various XML formats
- `Corpus Workbench (CWB) `_ uses verticalized texts whose origin can be in TXT or XML format
- `LancsBox `_ supports various formats, notably TXT & XML
- `TXM `_ (textometry platform) can take TXT, XML & XML-TEI files as input
- `Voyant `_ supports various formats, notably TXT, XML & XML-TEI
- `Wmatrix `_ can work with TXT and XML
- `WordSmith `_ supports TXT and XML

Further corpus analysis software can be found on `corpus-analysis.com `_.


Generic NLP solutions
---------------------

For natural language processing, see this list of open-source/off-the-shelf `NLP tools for German `_ and `further lists for other languages `_.