On the command-line


Trafilatura includes a command-line interface and can be conveniently used without writing code.

For the very first steps, please refer to this multilingual, step-by-step Introduction to the command-line interface and this section of the Introduction to Cultural Analytics & Python.



URLs can be used directly (-u/--URL):

$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main text with basic XML structure ...
$ trafilatura -h
# displays help message

You can also pipe an HTML document (or a response body) to trafilatura:

# use the contents of an already existing file
$ cat myfile.html | trafilatura
# alternative syntax
$ < myfile.html trafilatura
# use a custom download utility and pipe it to trafilatura
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura

Extraction parameters

Choice of HTML elements

Several elements can be included or discarded (see list of options below):

  • Text elements: comments, tables

  • Structural elements: formatting, images, links

Comments and text from HTML <table> elements are extracted by default; the options --no-comments and --no-tables deactivate this behavior.

Further options:

  • --formatting: Keep structural elements related to formatting (<b>/<strong>, <i>/<em> etc.)

  • --links: Keep link targets (in href="...")

  • --images: Keep track of images along with their targets (<img> attributes: alt, src, title)


Certain elements are only visible in the output if the chosen format supports them (images, for instance, require an XML-based format).

Including extra elements works best with conversion to XML/XML-TEI.
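As a sketch of how these options combine, the following commands run the extractor on a made-up local sample file (invented here for illustration; trafilatura must be installed and on the PATH):

```shell
# create a small sample page with a table, inline formatting and a link
cat > sample.html <<'EOF'
<html><body><article>
<p>Main text with <b>bold words</b> and a
<a href="https://example.org/">link</a> to an external page.</p>
<table><tr><td>a table cell</td></tr></table>
</article></body></html>
EOF

# default extraction: tables kept, formatting and links dropped
trafilatura < sample.html

# keep formatting and link targets, drop tables; XML preserves the structure
trafilatura --xml --formatting --links --no-tables < sample.html
```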

Output format

Plain text (TXT) without metadata is the default output; another format can be selected in two different ways:

  • --csv, --json, --xml or --xmltei

  • -out or --output-format {txt,csv,json,xml,xmltei}


Combining TXT, CSV and JSON formats with certain structural elements (e.g. formatting or links) triggers output in TXT+Markdown format.
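As a quick sketch, both ways of selecting a format are equivalent; the sample file below is invented for illustration:

```shell
# sample input file (contents made up for this example)
cat > page.html <<'EOF'
<html><body><article><p>A short sample paragraph used to compare output formats.</p></article></body></html>
EOF

# long option with explicit choice
trafilatura --output-format json < page.html
# equivalent shorthand flag
trafilatura --json < page.html
```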

Process files locally

In case web pages have already been downloaded and stored, it’s possible to process single files or directories as a whole.

Two major command-line arguments are necessary here:

  • --inputdir to select a directory to read files from

  • -o or --outputdir to define a directory in which to store the results


If no output directory is selected, results are printed to standard output (STDOUT, i.e. the terminal window).
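A minimal sketch of batch processing, assuming downloaded pages sit in a directory named html_files (the directory and file names are invented for illustration):

```shell
# prepare an input directory with one stored page
mkdir -p html_files output
cat > html_files/page1.html <<'EOF'
<html><body><article><p>Previously downloaded page content.</p></article></body></html>
EOF

# read every file from html_files/ and write the results to output/
trafilatura --inputdir html_files -o output
```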


Text extraction can be parametrized by providing a custom configuration file (that is a variant of settings.cfg) with the --config-file option, which overrides the standard settings.
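For illustration, a custom configuration file might override a few of the defaults. The key names below follow the settings.cfg template shipped with the package; the values and the file name my_settings.cfg are illustrative, not recommendations:

```ini
[DEFAULT]
# discard extractions shorter than this many characters
MIN_EXTRACTED_SIZE = 250
# seconds to wait between download requests
SLEEP_TIME = 5
# seconds before a download attempt is abandoned
DOWNLOAD_TIMEOUT = 30
```

It would then be passed on the command line, e.g. trafilatura --config-file my_settings.cfg -u "https://example.org/".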

Further information

For all usage instructions see trafilatura -h:

trafilatura [-h] [-i INPUTFILE | --inputdir INPUTDIR | -u URL]
               [--parallel PARALLEL] [-b BLACKLIST] [--list]
               [-o OUTPUTDIR] [--backup-dir BACKUP_DIR] [--keep-dirs]
               [--hash-as-name] [--feed [FEED] | --sitemap [SITEMAP] |
               --crawl [CRAWL] | --explore [EXPLORE]] [--archived]
               [--url-filter URL_FILTER [URL_FILTER ...]] [-f]
               [--formatting] [--links] [--images] [--no-comments]
               [--no-tables] [--only-with-metadata]
               [--target-language TARGET_LANGUAGE] [--deduplicate]
               [--config-file CONFIG_FILE]
               [-out {txt,csv,json,xml,xmltei} | --csv | --json | --xml | --xmltei]
               [--validate-tei] [-v]

Command-line interface for Trafilatura

optional arguments:

-h, --help
    show this help message and exit
-v, --verbose
    increase logging verbosity (-v or -vv)

URLs, files or directories to process

-i INPUTFILE, --inputfile INPUTFILE
    name of input file for batch processing
--inputdir INPUTDIR
    read files from a specified directory (relative path)
-u URL, --URL URL
    custom URL download
--parallel PARALLEL
    specify a number of cores/threads for downloads and/or processing
-b BLACKLIST, --blacklist BLACKLIST
    file containing unwanted URLs to discard during processing

Determines if and how files will be written

--list
    display a list of URLs without downloading them
-o OUTPUTDIR, --outputdir OUTPUTDIR
    write results in a specified directory (relative path)
--backup-dir BACKUP_DIR
    preserve a copy of downloaded files in a backup directory
--keep-dirs
    keep input directory structure and file names
--hash-as-name
    use hash value as output file name instead of random default

Link discovery and web crawling

--feed URL
    look for feeds and/or pass a feed URL as input
--sitemap URL
    look for sitemaps for the given website and/or enter a sitemap URL
--crawl URL
    crawl a fixed number of pages within a website starting from the given URL
--explore URL
    explore the given websites (combination of sitemap and crawl)
--archived
    try to fetch URLs from the Internet Archive if downloads fail
--url-filter URL_FILTER [URL_FILTER ...]
    only process/output URLs containing these patterns (space-separated strings)

Customization of text and metadata processing

-f, --fast
    fast (without fallback detection)
--formatting
    include text formatting (bold, italic, etc.)
--links
    include links along with their targets (experimental)
--images
    include image sources in output (experimental)
--no-comments
    don’t output any comments
--no-tables
    don’t output any table elements
--only-with-metadata
    only output those documents with title, URL and date (for formats supporting metadata)
--target-language TARGET_LANGUAGE
    select a target language (ISO 639-1 codes)
--deduplicate
    filter out duplicate documents and sections
--config-file CONFIG_FILE
    override standard extraction parameters with a custom config file

Selection of the output format

-out {txt,csv,json,xml,xmltei}, --output-format {txt,csv,json,xml,xmltei}
    determine output format, possible choices: txt, csv, json, xml, xmltei
--csv
    CSV output
--json
    JSON output
--xml
    XML output
--xmltei
    XML TEI output
--validate-tei
    validate XML TEI output