There are two different files which can be edited in order to modify the default download and extraction settings:
settings.cfg (values designed to be adapted by the user)
settings.py (package-wide settings, advanced)
Text extraction can be parametrized by providing a custom configuration file which overrides the standard settings. Useful adjustments include download parameters, minimal extraction length, or de-duplication settings.
The default file included in the package is settings.cfg. Important values include:
DOWNLOAD_TIMEOUT = 30: the time (in seconds) before requests are dropped
SLEEP_TIME = 5: time between requests (higher is better to avoid detection)
COOKIE: empty by default
MAX_FILE_SIZE = 20000000: maximum acceptable size of input (in bytes)
MIN_FILE_SIZE = 10: minimum acceptable size of input (in bytes)
MIN_EXTRACTED_SIZE = 250: acceptable size in characters (used to trigger fallbacks)
MIN_OUTPUT_SIZE = 1: absolute acceptable minimum for main text output
MIN_OUTPUT_COMM_SIZE: works the same way for comment extraction
EXTRACTION_TIMEOUT = 30: now only affects processing on the command line, where extraction is dropped after 30 seconds to prevent malicious HTML bombs. Set to 0 if you see errors related to the signal module and/or use a module such as defusedxml
- Deduplication (not active by default)
MIN_DUPLCHECK_SIZE = 100: minimum size in characters to run deduplication on
MAX_REPETITIONS = 2: maximum number of duplicates allowed
EXTENSIVE_DATE_SEARCH = on: set to off to deactivate htmldate's opportunistic search (lower recall, higher precision)
EXTERNAL_URLS = off: do not take URLs from other websites in feeds and sitemaps (CLI mode)
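As an illustration of how such values can be read, here is a minimal sketch using only the standard library. It assumes that the file follows the INI format understood by Python's configparser and that the values sit in a DEFAULT section, as the Python examples on this page suggest. The file content below is an abridged, hypothetical excerpt built from the values listed above, not the complete settings.cfg.

```python
from configparser import ConfigParser

# Abridged, hypothetical excerpt using values from the list above.
# A real custom file has to contain every required variable.
CUSTOM_SETTINGS = """
[DEFAULT]
DOWNLOAD_TIMEOUT = 30
SLEEP_TIME = 5
MIN_EXTRACTED_SIZE = 250
MIN_OUTPUT_SIZE = 1
EXTENSIVE_DATE_SEARCH = on
EXTERNAL_URLS = off
"""

config = ConfigParser()
config.read_string(CUSTOM_SETTINGS)
defaults = config["DEFAULT"]

# Values are stored as strings; typed getters convert them on access.
timeout = defaults.getint("DOWNLOAD_TIMEOUT")
date_search = defaults.getboolean("EXTENSIVE_DATE_SEARCH")  # "on" is read as True
external_urls = defaults.getboolean("EXTERNAL_URLS")        # "off" is read as False
raw_min_output = defaults["MIN_OUTPUT_SIZE"]                # plain access returns a string

print(timeout, date_search, external_urls, repr(raw_min_output))
```

Because plain access returns strings, new values are best assigned as strings too.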
Using a custom file on the command-line
Use the --config-file option, followed by the file name or path. All required variables have to be present in the custom file.
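Since every required variable has to be present, it can help to compare a custom file against a reference before passing it on the command line. The following is a rough sketch, not part of Trafilatura: the helper missing_settings is hypothetical, and it assumes both files use the INI format with a DEFAULT section.

```python
from configparser import ConfigParser


def missing_settings(reference_text, custom_text):
    """Return the names present in the reference but absent from the custom file."""
    reference = ConfigParser()
    reference.read_string(reference_text)
    custom = ConfigParser()
    custom.read_string(custom_text)
    # configparser lowercases option names, so compare them in that form
    missing = set(reference["DEFAULT"]) - set(custom["DEFAULT"])
    return sorted(name.upper() for name in missing)


# Hypothetical reference and custom contents, for illustration only
reference_text = "[DEFAULT]\nDOWNLOAD_TIMEOUT = 30\nSLEEP_TIME = 5\nMIN_OUTPUT_SIZE = 1\n"
custom_text = "[DEFAULT]\nDOWNLOAD_TIMEOUT = 10\nMIN_OUTPUT_SIZE = 100\n"

gaps = missing_settings(reference_text, custom_text)
print(gaps)
```

In practice the reference text would be read from the settings.cfg shipped with the package and the custom text from the file given to --config-file.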
Adapting settings in Python
The standard settings file can be modified, or a custom configuration file can be provided with the config parameter of extraction functions such as extract().
In the following example, a single default value is changed, which has an immediate effect on extraction: the resulting text is shorter than the new minimum and ends up being discarded. Conversely, lowering default values can trigger a more opportunistic extraction.
```python
# load necessary functions and data
>>> from copy import deepcopy
>>> from trafilatura import extract
>>> from trafilatura.settings import DEFAULT_CONFIG

# a very short HTML file
>>> my_html = "<html><body><p>Text.</p></body></html>"

# load the configuration and change the minimum output length
>>> my_config = deepcopy(DEFAULT_CONFIG)
>>> my_config['DEFAULT']['MIN_OUTPUT_SIZE'] = '1000'

# apply new settings, extraction will fail
>>> extract(my_html, config=my_config)
>>>

# default extraction works
>>> extract(my_html)
'Text.'
```
Alternatively, it is possible to override all standard settings by loading a new configuration file where all necessary values have been specified.
```python
# load the required functions
>>> from trafilatura import extract
>>> from trafilatura.settings import use_config

# load the new settings by providing a file name
>>> newconfig = use_config("myfile.cfg")

# use with a previously downloaded document
>>> extract(downloaded, config=newconfig)

# provide a file name directly (can be slower)
>>> extract(downloaded, settingsfile="myfile.cfg")
```
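One way to obtain a complete custom file is to start from a copy of the defaults and override only a few values before writing it out. The sketch below uses a small stand-in configuration so that it runs without Trafilatura installed; the helper write_custom_config and the stand-in values are hypothetical, and the approach assumes the configuration behaves like a standard configparser object.

```python
import tempfile
from configparser import ConfigParser
from copy import deepcopy
from pathlib import Path

# Stand-in for the package defaults (assumption: with Trafilatura installed,
# DEFAULT_CONFIG from trafilatura.settings would be used here instead)
defaults = ConfigParser()
defaults.read_string("[DEFAULT]\nMIN_EXTRACTED_SIZE = 250\nMIN_OUTPUT_SIZE = 1\n")


def write_custom_config(base, overrides, path):
    """Copy the base configuration, apply overrides and save it to path."""
    custom = deepcopy(base)  # leave the base configuration untouched
    for key, value in overrides.items():
        custom["DEFAULT"][key] = str(value)  # configparser only stores strings
    with open(path, "w", encoding="utf-8") as handle:
        custom.write(handle)
    return custom


target = Path(tempfile.mkdtemp()) / "myfile.cfg"
write_custom_config(defaults, {"MIN_OUTPUT_SIZE": 1000}, target)

# Read the file back to check that all values are present
reloaded = ConfigParser()
reloaded.read(target, encoding="utf-8")
print(dict(reloaded["DEFAULT"]))
```

The resulting file keeps every value from the base configuration, so it satisfies the requirement that all necessary values are specified.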
User agent settings can also be specified in a custom configuration file.
For further configuration it is possible to edit package-wide variables contained in the settings.py file provided with Trafilatura.
These settings notably include:
Lists of HTML elements to accept or to discard
Configuration of parallel processing
Further download and deduplication settings
Files written in CLI mode
Here is how to change them:
Find the locally installed version of the package or clone the repository
Edit settings.py
Reinstall the package locally: run pip install --no-deps -U . in the home directory of the cloned repository
These remaining variables greatly alter the functioning of the package!
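To find the locally installed copy of a file such as settings.py, the standard library can report where a package lives on disk. This is a minimal sketch with a hypothetical helper, locate_module_file. It assumes the file sits at the top level of the package directory, and it is demonstrated on the standard-library json package so that it runs anywhere.

```python
from importlib.util import find_spec
from pathlib import Path


def locate_module_file(package, filename):
    """Return the path of filename inside an installed package, or None."""
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package not installed, or no file location available
    candidate = Path(spec.origin).parent / filename
    return candidate if candidate.exists() else None


# Demonstration on a standard-library package
found = locate_module_file("json", "decoder.py")
print(found)

# With Trafilatura installed, the equivalent call would be:
# locate_module_file("trafilatura", "settings.py")
```

The function returns None when the package or the file cannot be found, so the result should be checked before opening the file for editing.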