On the command-line#

Introduction#

Trafilatura offers a robust command-line interface and can be conveniently used without writing code.

For the very first steps, see the quickstart below.

Quickstart#

All instructions for the terminal window are followed by pressing the enter key.

URLs can be used directly (-u/--URL):

# outputs main content and comments as plain text ...
$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"

# outputs main text with basic XML structure ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"

# displays help message
$ trafilatura -h

You can also pipe an HTML document (or a response body) to trafilatura:

# use the contents of an already existing file
$ cat myfile.html | trafilatura

# alternative syntax
$ < myfile.html trafilatura

# use a custom download utility and pipe it to trafilatura
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura

Extraction parameters#

Choice of HTML elements#

Several elements can be included or discarded (see list of options below):

  • Text elements
    • Comments and tables are extracted by default.

    • --no-comments and --no-tables deactivate these settings.

  • Structural elements
    --formatting

    Keep structural elements related to formatting (<b>/<strong>, <i>/<emph> etc.)

    --links

    Keep link targets (in href="..."), converting relative URLs to absolute where possible

    --images

    Keep track of images along with their targets (<img> attributes: alt, src, title)

Note

Certain elements are only visible in the output if the chosen format allows it (e.g. images and XML). Including extra elements works best with conversion to XML/XML-TEI.

The heuristics used by the main algorithm change according to the presence of certain elements in the HTML. If the output seems odd, try removing a constraint (e.g. formatting) to improve the result.
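To illustrate how the structural options combine, here is a minimal sketch that keeps formatting, links and images while converting to XML. The file sample.html and its contents are made up for the example, and the guard simply skips the call when trafilatura is not installed. Very short documents may yield little or no output.

```shell
# Sketch only: sample.html is a made-up file created for this illustration.
workdir="$(mktemp -d)"
cat > "$workdir/sample.html" <<'EOF'
<html><body><article>
<h1>Sample title</h1>
<p>A paragraph with <b>bold text</b>, a <a href="/about">relative link</a>
and an image: <img src="photo.jpg" alt="A photo" title="Photo"/></p>
</article></body></html>
EOF

# Keep formatting, link targets and images; XML can represent all three.
# The guard skips the call if trafilatura is not installed.
if command -v trafilatura >/dev/null 2>&1; then
    trafilatura --xml --formatting --links --images < "$workdir/sample.html" \
        || echo "extraction returned a non-zero status" >&2
fi
```

If the result looks odd, dropping one of the three options (for instance --formatting) and running the command again is a quick way to check whether that constraint is the cause.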

Output format#

Output as TXT without metadata is the default. Another format can be selected in two different ways:

  • --csv, --html, --json, --markdown, --xml or --xmltei

  • --output-format {csv,json,html,markdown,txt,xml,xmltei}

Hint

Combining TXT, CSV and JSON formats with certain structural elements (e.g. formatting or links) triggers output in Markdown format. Selecting Markdown automatically includes text formatting.

HTML output is available from version 1.11, Markdown from version 1.9 onwards.
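The two ways of selecting a format can be sketched as follows, assuming a small placeholder file (page.html) and hypothetical output file names. Both calls request JSON, once through the shorthand flag and once through --output-format.

```shell
# Sketch only: page.html and the output file names are placeholders.
workdir="$(mktemp -d)"
printf '%s\n' '<html><body><article><p>Some example text.</p></article></body></html>' \
    > "$workdir/page.html"

# The guard skips the calls if trafilatura is not installed.
if command -v trafilatura >/dev/null 2>&1; then
    # Shorthand flag
    trafilatura --json < "$workdir/page.html" > "$workdir/short.json" || true
    # Long form with an explicit format name
    trafilatura --output-format json < "$workdir/page.html" > "$workdir/long.json" || true
fi
```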

Optimizing for precision and recall#

The arguments --precision or --recall can be passed to adjust the focus of the extraction process.

  • If your results contain too much noise, prioritize precision to focus on the most central and relevant elements.

  • If parts of your documents are missing, prioritize recall to take more elements into account.

  • If parts of the contents are still missing, see troubleshooting.
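One rough way to compare the two presets is to run both on the same document and look at how much text each returns. This is only a sketch: page.html is a placeholder file, and word count is a crude proxy rather than a quality measure.

```shell
# Sketch only: page.html is a placeholder; replace it with a real saved page.
workdir="$(mktemp -d)"
printf '%s\n' '<html><body><article><p>Main text of the page.</p></article><footer><p>Footer text.</p></footer></body></html>' \
    > "$workdir/page.html"

# The guard skips the comparison if trafilatura is not installed.
if command -v trafilatura >/dev/null 2>&1; then
    for mode in --precision --recall; do
        # Count the words extracted with each preset
        words="$(trafilatura "$mode" < "$workdir/page.html" | wc -w | tr -d ' ')"
        echo "$mode: $words words"
    done
fi
```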

Language identification#

Passing the argument --target-language along with a 2-letter code (ISO 639-1) will trigger language filtering of the output if the identification component has been installed and if the target language is available.

Note

Additional components are required: pip install trafilatura[all]. This feature currently uses the py3langid package and is dependent on language availability and performance of the original model.
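A minimal sketch of language filtering, assuming the additional components are installed and using a made-up German sample file (seite.html). With a target language of "de", documents identified as another language are expected to be filtered out.

```shell
# Sketch only: seite.html is a made-up German sample document.
workdir="$(mktemp -d)"
printf '%s\n' '<html><body><article><p>Dies ist ein kurzer Beispieltext in deutscher Sprache.</p></article></body></html>' \
    > "$workdir/seite.html"

# Requires the language identification component (pip install trafilatura[all]).
# The guard skips the call if trafilatura is not installed.
if command -v trafilatura >/dev/null 2>&1; then
    trafilatura --target-language de < "$workdir/seite.html" \
        || echo "no output (filtered, too short, or extraction failed)" >&2
fi
```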

Changing default settings#

See documentation page on settings.

Process files locally#

In case web pages have already been downloaded and stored, it is possible to process single files or directories as a whole. It can be especially helpful to separate download and extraction to circumvent blocking mechanisms, either by scrambling IPs used to access the pages or by using web browser automation software to bypass issues related to cookies and paywalls.

Trafilatura works just as well on local files, provided they are web pages (HTML documents). Two major command-line arguments are needed:

  • --input-dir to select a directory to read files from

  • -o or --output-dir to define a directory to eventually store the results

Note

In case no directory is selected, results are printed to standard output (STDOUT, e.g. in the terminal window).
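The directory workflow can be sketched as follows. The directory names html and txt and the two sample files are invented for the example. The help text describes both directories as relative paths, so the sketch changes into the working directory first.

```shell
# Sketch only: directory and file names are placeholders.
workdir="$(mktemp -d)"
mkdir -p "$workdir/html"
for name in first second; do
    printf '<html><body><article><p>Text of the %s page.</p></article></body></html>\n' "$name" \
        > "$workdir/html/$name.html"
done

# The guard skips the call if trafilatura is not installed.
if command -v trafilatura >/dev/null 2>&1; then
    (
        # Use relative paths, as described in the help text
        cd "$workdir" || exit 1
        trafilatura --input-dir html --output-dir txt
    ) || echo "trafilatura returned a non-zero status" >&2
fi
```

Leaving out --output-dir in the call above prints the results to standard output instead of writing files.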

Deprecations#

The following arguments have been deprecated:

  • --nocomments and --notables → --no-comments and --no-tables

  • --inputfile, --inputdir, and --outputdir → --input-file, --input-dir, and --output-dir

  • -out → --output-format

  • --hash-as-name → hashes used by default

  • --with-metadata (include metadata) once had the effect of today’s --only-with-metadata (only output documents with the necessary metadata)

Further information#

For all usage instructions see trafilatura -h:

trafilatura [-h] [-i INPUTFILE | --input-dir INPUTDIR | -u URL]
               [--parallel PARALLEL] [-b BLACKLIST] [--list]
               [-o OUTPUTDIR] [--backup-dir BACKUP_DIR] [--keep-dirs]
               [--feed [FEED] | --sitemap [SITEMAP] | --crawl [CRAWL] |
               --explore [EXPLORE] | --probe [PROBE]] [--archived]
               [--url-filter URL_FILTER [URL_FILTER ...]] [-f]
               [--formatting] [--links] [--images] [--no-comments]
               [--no-tables] [--only-with-metadata] [--with-metadata]
               [--target-language TARGET_LANGUAGE] [--deduplicate]
               [--config-file CONFIG_FILE] [--precision] [--recall]
               [--output-format {csv,json,html,markdown,txt,xml,xmltei} |
               --csv | --html | --json | --markdown | --xml | --xmltei]
               [--validate-tei] [-v] [--version]

Command-line interface for Trafilatura

optional arguments:
-h, --help

show this help message and exit

-v, --verbose

increase logging verbosity (-v or -vv)

--version

show version information and exit

Input:

URLs, files or directories to process

-i INPUT_FILE, --input-file INPUT_FILE

name of input file for batch processing

--input-dir INPUT_DIR

read files from a specified directory (relative path)

-u URL, --URL URL

custom URL download

--parallel PARALLEL

specify a number of cores/threads for downloads and/or processing

-b BLACKLIST, --blacklist BLACKLIST

file containing unwanted URLs to discard during processing

Output:

Determines if and how files will be written

--list

display a list of URLs without downloading them

-o OUTPUT_DIR, --output-dir OUTPUT_DIR

write results in a specified directory (relative path)

--backup-dir BACKUP_DIR

preserve a copy of downloaded files in a backup directory

--keep-dirs

keep input directory structure and file names

Navigation:

Link discovery and web crawling

--feed [FEED]         look for feeds and/or pass a feed URL as input
--sitemap [SITEMAP]   look for sitemaps for the given website and/or enter a sitemap URL
--crawl [CRAWL]       crawl a fixed number of pages within a website starting from the given URL
--explore [EXPLORE]   explore the given websites (combination of sitemap and crawl)
--probe [PROBE]       probe for extractable content (works best with target language)
--archived            try to fetch URLs from the Internet Archive if downloads fail
--url-filter URL_FILTER [URL_FILTER ...]
                      only process/output URLs containing these patterns (space-separated strings)

Extraction:

Customization of text and metadata processing

-f, --fast

fast (without fallback detection)

--formatting

include text formatting (bold, italic, etc.)

--links

include links along with their targets (experimental)

--images

include image sources in output (experimental)

--no-comments

don’t output any comments

--no-tables

don’t output any table elements

--only-with-metadata

only output those documents with title, URL and date

--with-metadata

extract and add metadata to the output

--target-language TARGET_LANGUAGE

select a target language (ISO 639-1 codes)

--deduplicate

filter out duplicate documents and sections

--config-file CONFIG_FILE

override standard extraction parameters with a custom config file

--precision

favor extraction precision (less noise, possibly less text)

--recall

favor extraction recall (more text, possibly more noise)

Format:

Selection of the output format

--output-format {csv,json,html,markdown,txt,xml,xmltei}
                      determine output format
--csv                 shorthand for CSV output
--html                shorthand for HTML output
--json                shorthand for JSON output
--markdown            shorthand for MD output
--xml                 shorthand for XML output
--xmltei              shorthand for XML TEI output
--validate-tei        validate XML TEI output
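
As an illustration of the navigation options listed above, the following sketch wraps a link discovery call in a small shell function. The function name list_sitemap_urls and the example arguments are hypothetical. The call combines --sitemap, --list and --url-filter so that URLs are only displayed, not downloaded.

```shell
# Sketch only: list_sitemap_urls is a hypothetical helper, not part of trafilatura.
# $1: website or sitemap URL, $2: pattern that listed URLs must contain
list_sitemap_urls() {
    trafilatura --sitemap "$1" --list --url-filter "$2"
}

# Example call (placeholder values, requires network access):
# list_sitemap_urls "https://www.example.org/" "/blog/"
```

Redirecting the listing to a file gives a URL list that can later be passed back with -i for batch processing.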