Usage on the command line

Introduction

Trafilatura includes a command-line interface and can be conveniently used without writing code.

Quickstart

$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...
$ trafilatura --xml --nocomments -u "URL..."
# outputs main content without comments as XML ...
$ trafilatura -h
# displays help message

Usage

URLs can be used directly (-u/--URL):

$ trafilatura -u https://de.creativecommons.org/index.php/was-ist-cc/
# outputs main content in plain text format ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main text with basic XML structure ...

You can also pipe an HTML document (or a response body) to trafilatura:

$ cat myfile.html | trafilatura # use the contents of an already existing file
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura # use a custom download
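
The piping pattern above can be tried offline. The following is a minimal sketch, not an official recipe: it writes a throwaway HTML page to a temporary directory and feeds it to trafilatura on standard input, skipping the extraction step if the command is not installed. The file name sample.html and its contents are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample page; any saved HTML document would do.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<html><head><title>Sample page</title></head>
<body><article><h1>Sample page</h1>
<p>This paragraph stands in for the main content of a real web page.</p>
</article></body></html>
EOF

# Run the extraction only if the trafilatura command is available.
if command -v trafilatura >/dev/null 2>&1; then
    # Input redirection has the same effect as "cat file | trafilatura".
    trafilatura --xml < "$workdir/sample.html" > "$workdir/sample.xml"
fi
```

Redirecting the output to a file, as done here, keeps the extracted text for later inspection instead of printing it to the terminal.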

The -i/--inputfile option enables bulk download and processing of URLs listed in a file, one link per line.

Please observe scraping etiquette: a server may block you if you download too many pages from the same website or domain within a short period of time. In addition, some websites may block the user-agent string used for the requests. Trafilatura therefore waits a few seconds between requests by default.
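
As an illustration of the input file format, here is a small sketch that prepares a list with one URL per line before handing it to -i/--inputfile. The file names (raw-urls.txt, urls.txt), the output directory and the run_batch helper are assumptions made for this example. The cleaning step is a convenience of the sketch, not something trafilatura requires.

```shell
#!/bin/sh
# Work in a scratch directory so that relative paths can be used throughout.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A hand-made list containing a blank line and a duplicate entry.
cat > raw-urls.txt <<'EOF'
https://de.creativecommons.org/index.php/was-ist-cc/

https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/
https://de.creativecommons.org/index.php/was-ist-cc/
EOF

# Keep one URL per line: drop blank lines, then drop repeated entries
# while preserving the original order.
grep -v '^[[:space:]]*$' raw-urls.txt | awk '!seen[$0]++' > urls.txt

# Hypothetical helper: download and process every URL in the list and
# write the results to the output directory.
run_batch() {
    mkdir -p output
    trafilatura -i urls.txt -o output/
}
```

Calling run_batch afterwards starts the actual downloads, so it is left out of the script body. Removing duplicates beforehand avoids requesting the same page twice.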

For all usage instructions see trafilatura -h:

usage: trafilatura [-h] [-v] [-vv] [-i INPUTFILE] [--inputdir INPUTDIR]
               [-o OUTPUTDIR] [-u URL] [--feed [FEED]]
               [--sitemap [SITEMAP]] [--list] [-b BLACKLIST]
               [--backup-dir BACKUP_DIR] [--timeout] [--parallel PARALLEL]
               [--keep-dirs] [-out {txt,csv,json,xml,xmltei}] [--csv]
               [--json] [--xml] [--xmltei] [--validate] [-f]
               [--formatting] [--nocomments] [--notables]
               [--with-metadata] [--target-language TARGET_LANGUAGE]
               [--deduplicate]

Command-line interface for Trafilatura

optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity
-vv, --very-verbose
 maximum output verbosity

I/O:

Input and output options affecting processing

-i, --inputfile INPUTFILE
 name of input file for batch processing
--inputdir INPUTDIR
 read files from a specified directory (relative path)
-o, --outputdir OUTPUTDIR
 write results in a specified directory (relative path)
-u, --URL URL custom URL download
--feed FEED look for feeds and/or pass a feed URL as input
--sitemap SITEMAP
 look for sitemaps URLs for the given website
--list return a list of URLs without downloading them
-b, --blacklist BLACKLIST
 name of file containing already processed or unwanted URLs to discard during batch processing
--backup-dir BACKUP_DIR
 Preserve a copy of downloaded files in a backup directory
--timeout Use timeout for file conversion to prevent bugs
--parallel PARALLEL
 Specify a number of cores/threads for parallel downloads and/or processing
--keep-dirs Keep input directory structure and file names

Format:

Selection of the output format

-out, --output-format {txt,csv,json,xml,xmltei}
 determine output format
--csv CSV output
--json JSON output
--xml XML output
--xmltei XML TEI output
--validate validate TEI output

Extraction:

Customization of text and metadata extraction

-f, --fast fast (without fallback detection)
--formatting include text formatting (bold, italic, etc.)
--nocomments don’t output any comments
--notables don’t output any table elements
--with-metadata
 only output those documents with necessary metadata: title, URL and date (CSV and XML formats)
--target-language TARGET_LANGUAGE
 select a target language (ISO 639-1 codes)
--deduplicate Filter out duplicate documents and sections
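
The options listed in the help message can be combined. The sketch below wraps two plausible combinations in shell functions. The function names are hypothetical and only the trafilatura options are taken from the help text above. Nothing runs until a function is called.

```shell
#!/bin/sh
# Hypothetical helper names; only the options come from the help message.

# Print the URLs found through a website's sitemaps without downloading
# the pages themselves (--sitemap combined with --list).
list_sitemap_urls() {
    trafilatura --sitemap "$1" --list
}

# Convert a directory of saved HTML files to XML and leave out comments
# (--inputdir and --outputdir take relative paths).
convert_directory() {
    trafilatura --inputdir "$1" --outputdir "$2" --xml --nocomments
}
```

A possible use, with a placeholder address and placeholder directory names:

$ list_sitemap_urls "https://www.example.org/" > links.txt
$ convert_directory html-files/ xml-files/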