Tutorial: From a list of links to a frequency list
==================================================

.. meta::
    :description lang=en: This how-to explains how to easily extract text from HTML web pages and compute a list of the most frequent word forms.


Get your system up and running
------------------------------

- Installation: see `dedicated page `_
- Ensure you have installed the latest version: ``pip install -U trafilatura``
- Additional software for this tutorial: ``pip install -U SoMaJo``

The following consists of `command-line instructions `_. For an introduction, see the `page on command-line usage `_.


Process a list of links
-----------------------

For the collection and filtering of links, see `this tutorial `_ and this `blog post `_.

Two options are necessary here:

- ``-i`` or ``--input-file`` to select an input list to read links from
- ``-o`` or ``--output-dir`` to define a directory in which to store the results

The input list is read sequentially, and only lines beginning with a valid URL are taken into account; any other information contained in the file is discarded. The output directory can be created on demand, but it has to be writable.

.. code-block:: bash

    $ trafilatura -i list.txt -o txtfiles        # output as raw text
    $ trafilatura --xml -i list.txt -o xmlfiles  # output in XML format

The second instruction creates a collection of `XML files `_ which can be edited with a basic text editor or a full-fledged text-editing package or IDE such as `Atom `_.


Build frequency lists
---------------------

Step-by-step
~~~~~~~~~~~~

Tokenization
^^^^^^^^^^^^

The `SoMaJo `_ tokenizer splits text into words and sentences. It works with Python and gives good results when applied to texts in German and English.

Assuming the output directory you are working with is called ``txtfiles``:

.. code-block:: bash

    # concatenate all files
    $ cat txtfiles/*.txt > txtfiles/all.txt
    # output all tokens
    $ somajo-tokenizer txtfiles/all.txt > tokens.txt
    # sort the tokens by decreasing frequency and display the 10 most frequent ones
    $ sort tokens.txt | uniq -c | sort -nrk1 | head -10

Filtering words
^^^^^^^^^^^^^^^

.. code-block:: bash

    # further filtering: remove punctuation, delete empty lines and convert to lowercase
    $ < tokens.txt sed -e "s/[[:punct:]]//g" -e "/^$/d" -e "s/.*/\L\0/" > tokens-filtered.txt
    # display the most frequent tokens
    $ < tokens-filtered.txt sort | uniq -c | sort -nrk1 | head -20
    # store frequency information in a CSV file
    $ < tokens.txt sort | uniq -c | sort -nrk1 | sed -e "s|^ *||g" -e "s| |\t|" > txtfiles/frequencies.csv

Further filtering steps:

- with a list of stopwords: ``egrep -vixFf stopwords.txt``
- alternative way to convert to lower case: ``uconv -x lower``

Collocations and multi-word units
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # word bigrams
    $ < tokens-filtered.txt tr "\n" " " | awk '{for (i=1; i<NF; i++) print $i, $(i+1)}' | sort | uniq -c | sort -nrk1 | head -20
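The same idea extends to longer sequences. The command below is only a sketch of how trigrams could be counted in the same way; it assumes the ``tokens-filtered.txt`` file produced in the filtering step and simply widens the window of the bigram one-liner above to three tokens.

.. code-block:: bash

    # word trigrams, counted in the same way as the bigrams above (sketch)
    $ < tokens-filtered.txt tr "\n" " " | awk '{for (i=1; i<NF-1; i++) print $i, $(i+1), $(i+2)}' | sort | uniq -c | sort -nrk1 | head -20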
Further reading
~~~~~~~~~~~~~~~

- `Unix for Poets `_ (count and sort words, compute ngram statistics, make a Concordance)
- `Word analysis and N-grams `_
- `N-Grams with NLTK `_ and `collocations howto `_
- `Analyzing Documents with Term Frequency - Inverse Document Frequency (tf-idf) `_, both a corpus exploration method and a pre-processing step for many other text-mining measures and models


Additional information for XML files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assuming the output directory you are working with is called ``xmlfiles``:

.. code-block:: bash

    # tokenize a file
    $ somajo-tokenizer --xml xmlfiles/filename.xml
    # remove tags
    $ somajo-tokenizer --xml xmlfiles/filename.xml | sed -e "s|<[^>]*>||g" -e "/^$/d"
    # continue with the steps above...
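To cover the whole collection rather than a single document, one possible approach is to run the tokenizer over every XML file in a shell loop and gather all tokens in one file, mirroring the ``cat``-based concatenation used for the raw text files above. This is only a sketch: it assumes the ``xmlfiles`` directory from the previous step and reuses the tag-removal filter shown above.

.. code-block:: bash

    # tokenize every XML file and collect all tokens in one file (sketch)
    $ for f in xmlfiles/*.xml; do somajo-tokenizer --xml "$f" | sed -e "s|<[^>]*>||g" -e "/^$/d"; done > tokens.txt
    # then continue with the filtering and counting steps described earlier
    $ < tokens.txt sort | uniq -c | sort -nrk1 | head -10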