Settings and customization#

Tailor Trafilatura to your needs, its modular design and configuration options allow for extensive customization. In a nutshell, there are two main files which can be edited in order to modify the default download and extraction behavior:

  1. settings.cfg (values designed to be adapted by the user)

  2. settings.py (package-wide settings, advanced)

Configuration file#

Text extraction can be parametrized by providing a custom configuration file which overrides the standard settings. Useful adjustments include download parameters, minimal extraction length, or de-duplication settings.

File structure#

The default file included in the package is settings.cfg . Important values include:

  • Download
    • DOWNLOAD_TIMEOUT = 30 the time (in seconds) before requests are dropped

    • SLEEP_TIME = 5 time between requests (higher is better to avoid detection)

    • USER_AGENTS and COOKIE are empty by default

  • Input
    • MAX_FILE_SIZE = 20000000 maximum acceptable size of input (in bytes)

    • MIN_FILE_SIZE = 10 minimum acceptable size of input (in bytes)

  • Extraction
    • MIN_EXTRACTED_SIZE = 250 acceptable size in characters (used to trigger fallbacks)

    • MIN_OUTPUT_SIZE = 1 absolute acceptable minimum for main text output

    • MIN_EXTRACTED_COMM_SIZE and MIN_OUTPUT_COMM_SIZE work the same for comment extraction

    • EXTRACTION_TIMEOUT = 30 now only affects processing on the command-line: drop extraction after 30 seconds to prevent malicious HTML bombs. Set to 0 if you see errors related to the signal module and/or use a module such as defusedxml

  • Deduplication (not active by default)
    • MIN_DUPLCHECK_SIZE = 100 minimum size in characters to run deduplication on

    • MAX_REPETITIONS = 2 maximum number of duplicates allowed

  • Metadata
    • EXTENSIVE_DATE_SEARCH = on set to off to deactivate htmldate’s opportunistic search (lower recall, higher precision)

  • Navigation
    • EXTERNAL_URLS = off do not take URLs from other websites in feeds and sitemaps (CLI mode)

    • MAX_REDIRECTS = 2: maximum number of URL redirections to be followed. Set to 0 to not follow any redirection.

Using a custom file on the command-line#

With the --config-file option, followed by the file name or path. All the required variables have to be present in the custom file.

Adapting settings in Python#

The standard settings file can be modified, or a custom configuration file can be provided with the config parameter to the bare_extraction() and extract() functions.

In the following, a single default value is changed, which has an immediate effect on extraction. The resulting text is indeed too short and ends up being discarded. On the contrary, lowering default values can trigger a more opportunistic extraction.

# load necessary functions and data
>>> from copy import deepcopy
>>> from trafilatura import extract
>>> from trafilatura.settings import DEFAULT_CONFIG

# a very short HTML file
>>> my_html = "<html><body><p>Text.</p></body></html>"

# load the configuration and change the minimum output length
>>> my_config = deepcopy(DEFAULT_CONFIG)
>>> my_config['DEFAULT']['MIN_OUTPUT_SIZE'] = '1000'

# apply new settings, extraction will fail
>>> extract(my_html, config=my_config)
>>>
# default extraction works
>>> extract(my_html)
'Text.'

Alternatively, it is possible to override all standard settings by loading a new configuration file where all necessary values have been specified.

# load the required functions
>>> from trafilatura import extract
>>> from trafilatura.settings import use_config

# load the new settings by providing a file name
>>> newconfig = use_config("myfile.cfg")

# use with a previously downloaded document
>>> extract(downloaded, config=newconfig)

# provide a file name directly (can be slower)
>>> extract(downloaded, settingsfile="myfile.cfg")

Note

Useful adjustments include download parameters, minimal extraction length, or de-duplication settings. User agent settings can also be specified in a custom settings.cfg file.

Package settings#

For further configuration it is possible to edit package-wide variables contained in the settings.py file provided with Trafilatura.

These settings notably include:

  • Lists of HTML elements to accept or to discard

  • Configuration of parallel processing

  • Further download and deduplication settings

  • Files written in CLI mode

Here is how to change them:

  1. Find the locally installed version of the package or clone the repository

  2. Edit settings.py

  3. Reinstall the package locally: pip install --no-deps -U . in the home directory of the cloned repository

These remaining variables greatly alter the functioning of the package!