With Python#

The Python programming language#

Python can be easy to pick up whether you’re a first-time programmer or you’re experienced with other languages.

Step-by-step#

Quickstart#

For the basics, see the quickstart documentation page.

Note

For a hands-on tutorial see also the Python Notebook Trafilatura Overview.

Extraction functions#

The functions can be imported using from trafilatura import ... and used on raw documents (strings) or parsed HTML (LXML elements), as shown below.

Main text extraction, good balance between precision and recall:

  • extract: Wrapper function, easiest way to perform text extraction and conversion

  • bare_extraction: Internal function returning bare Python variables

Additional fallback functions:

  • baseline: Faster extraction function targeting text paragraphs and/or JSON metadata

  • html2txt: Extract all text in a document, maximizing recall
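All of the functions listed above can be imported in one go:

>>> from trafilatura import extract, bare_extraction, baseline, html2txt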

Output#

By default, the output is in plain text (TXT) format without metadata. The following additional formats are available:

  • CSV

  • HTML (from version 1.11 onwards)

  • JSON

  • Markdown (from version 1.9 onwards)

  • XML and XML-TEI (following the guidelines of the Text Encoding Initiative)

To specify the output format, use one of the following strings: "csv", "json", "html", "markdown", "txt", "xml", "xmltei".

The bare_extraction function also accepts python as an additional output format, returning Python objects instead of a serialized document.
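A minimal sketch of this variant:

# return Python objects instead of a serialized document
>>> bare_extraction(downloaded, output_format="python")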

To extract and include metadata in the output, use the with_metadata=True argument.

Examples#

# some formatting preserved in basic XML structure
>>> extract(downloaded, output_format="xml")

# output in JSON format with metadata extracted
>>> extract(downloaded, output_format="json", with_metadata=True)

Note that combining TXT, CSV and JSON formats with certain structural elements (e.g. formatting or links) triggers output in Markdown format (plain text with additional elements).
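For instance, requesting plain text while keeping formatting elements yields Markdown:

# formatting preserved, so the plain text output switches to Markdown
>>> extract(downloaded, output_format="txt", include_formatting=True)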

Choice of HTML elements#

Customize the extraction process by including or excluding specific HTML elements:

  • Text elements:
    include_comments=True

    Include comment sections at the bottom of articles.

    include_tables=True

    Extract text from HTML <table> elements.

  • Structural elements:
    include_formatting=True

Keep structural elements related to formatting (<b>/<strong>, <i>/<em>, etc.)

    include_links=True

    Keep link targets (in href="...")

    include_images=True

    Keep track of images along with their targets (<img> attributes: alt, src, title)

To operate on these elements, pass the corresponding parameters to the extract() function:

# exclude comments from the output
>>> result = extract(downloaded, include_comments=False)

# skip tables and include links in the output
>>> result = extract(downloaded, include_tables=False, include_links=True)

# convert relative links to absolute links where possible
>>> result = extract(downloaded, output_format="xml", include_links=True, url=url)

Important notes#

  • include_comments and include_tables are activated by default.

  • Including extra elements works best with conversion to XML formats or using bare_extraction(). This allows for direct display and manipulation of the elements.

  • Certain elements may not be visible in the output if the chosen format does not allow it.

  • Selecting Markdown automatically includes text formatting.

Hint

The heuristics used by the main algorithm change according to the presence of certain elements in the HTML. If the output seems odd, try removing a constraint (e.g. formatting) to improve the result.

The precision and recall presets#

The main extraction functions offer two presets to adjust the focus of the extraction process: favor_precision and favor_recall.

These parameters allow you to change the balance between accuracy and comprehensiveness of the output.

>>> result = extract(downloaded, url, favor_precision=True)

Precision#

  • If your results contain too much noise, prioritize precision to focus on the most central and relevant elements.

  • Additionally, you can use the prune_xpath parameter to target specific HTML elements using a list of XPath expressions.
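For instance, unwanted page sections can be pruned before extraction; the XPath expressions below are hypothetical and depend on the structure of the page at hand:

# prune hypothetical sidebar and comment containers, then extract
>>> result = extract(downloaded, favor_precision=True, prune_xpath=["//aside", "//div[@class='comments']"])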

Recall#

  • If parts of your documents are missing, try this preset to take more elements into account.

  • If content is still missing, refer to the troubleshooting guide.
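The corresponding parameter works the same way as its precision counterpart:

# take more elements into account, possibly at the cost of more noise
>>> result = extract(downloaded, favor_recall=True)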

Additional functions for text extraction#

The html2txt and baseline functions offer simpler approaches to extracting text from HTML content, prioritizing performance over precision.

html2txt()#

The html2txt function serves as a last resort for extracting text from HTML content. It emulates the behavior of similar functions in other packages and can be used to output all possible text from a given HTML source, maximizing recall. However, it may not always produce accurate or meaningful results, as it does not consider the context of the extracted sections.

>>> from trafilatura import html2txt
>>> html2txt(downloaded)

baseline()#

For a better balance between precision and recall, as well as improved performance, consider using the baseline function instead. This function returns a tuple containing an LXML element with the body, the extracted text as a string, and the length of the text. It uses a set of heuristics to extract text from the HTML content, which generally produces more accurate results than html2txt.

>>> from trafilatura import baseline
>>> postbody, text, len_text = baseline(downloaded)

For more advanced use cases, consider using other functions in the package that provide more control and customization over the text extraction process.

Guessing if text can be found#

The function is_probably_readerable(), ported from Mozilla’s Readability.js and available from version 1.10 onwards, provides a way to guess if a page probably has a main text to extract.

>>> from trafilatura.readability_lxml import is_probably_readerable
>>> is_probably_readerable(html)  # HTML string or already parsed tree
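This allows for skipping pages which are unlikely to contain a main text, here as a sketch:

# only run the costlier extraction where a main text is likely
>>> if is_probably_readerable(html):
...     result = extract(html)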

Language identification#

The target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match. No such filtering takes place if the language identification component has not been installed (see installation instructions above) or if the target language is not available.

>>> result = extract(downloaded, url, target_language="de")

Note

Additional components are required: pip install trafilatura[all]. This feature currently uses the py3langid package and is dependent on language availability and performance of the original model.
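To inspect the detected language directly, the underlying py3langid package can also be used on its own, assuming the additional components are installed:

>>> import py3langid as langid
# classify() returns a tuple of language code and score
>>> langid.classify("Dies ist ein deutscher Text.")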

Optimizing for speed#

Execution speed not only depends on the platform and on supplementary packages (trafilatura[all], htmldate[speed]), but also on the extraction strategy.

The available fallbacks make extraction more precise but also slower. These fallback algorithms can be bypassed in fast mode, which should make extraction about twice as fast:

# skip algorithms used as fallback
>>> result = extract(downloaded, fast=True)

The following combination usually leads to shorter processing times:

>>> result = extract(downloaded, include_comments=False, include_tables=False, fast=True)

Extraction settings#

Hint

See also the settings page.

Function parameters#

Starting from version 1.9, the Extractor class provides a convenient way to define and manage extraction parameters. It allows users to customize all options used by the extraction functions and offers a convenient shortcut compared to multiple function parameters.

Here is how to use the class:

# import the Extractor class from the settings module
>>> from trafilatura.settings import Extractor

# set multiple options at once
>>> options = Extractor(output_format="json", with_metadata=True)

# add or adjust settings as needed
>>> options.formatting = True  # same as include_formatting
>>> options.source = "My Source"  # useful for debugging

# use the options in an extraction function
>>> extract(my_doc, options=options)

See the settings.py file for a full example.

Metadata extraction#

  • with_metadata=True: extract metadata fields and include them in the output

  • only_with_metadata=True: only output documents featuring all essential metadata (date, title, url)
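Both options can be combined, as in this sketch:

# include metadata and discard documents lacking the essential fields
>>> result = extract(downloaded, output_format="json", with_metadata=True, only_with_metadata=True)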

Date#

Among the metadata, dates are handled by an external module: htmldate. By default, the focus is on original dates and the extraction replicates the fast/no_fallback option.

Custom parameters can be passed through the extraction function or through the extract_metadata function in trafilatura.metadata, most notably:

  • extensive_search (boolean), to activate further heuristics (higher recall, lower precision)

  • original_date (boolean), to look for the original publication date

  • outputformat (string), to provide a custom datetime format

  • max_date (string), to set the latest acceptable date manually (YYYY-MM-DD format)

# import the extract() function, use a previously downloaded document
# pass the new parameters as dict
>>> extract(downloaded, output_format="xml", date_extraction_params={
        "extensive_search": True, "max_date": "2018-07-01"
    })

URL#

Even if the page to process has already been downloaded, it can still be useful to pass the URL as an argument. See this previous bug for an example:

# define a URL and download the example
>>> url = "https://web.archive.org/web/20210613232513/https://www.thecanary.co/feature/2021/05/19/another-by-election-headache-is-incoming-for-keir-starmer/"
>>> downloaded = fetch_url(url)

# content discarded since necessary metadata couldn't be extracted
>>> bare_extraction(downloaded, only_with_metadata=True)
>>>

# date found in URL, extraction successful
>>> bare_extraction(downloaded, only_with_metadata=True, url=url)

Memory use#

Trafilatura uses caches to speed up extraction and cleaning processes. This may lead to memory leaks in some cases, particularly in large-scale applications. If that happens you can reset all cached information in order to release RAM:

# import the function
>>> from trafilatura.meta import reset_caches

# use it at any given point
>>> reset_caches()

Input/Output types#

Python objects as output#

The extraction can be customized using a series of parameters; for more, see the core functions page.

The function bare_extraction can be used to bypass output conversion: it returns Python variables for the metadata (dictionary) as well as the main text and comments (both LXML objects).

>>> from trafilatura import bare_extraction
>>> bare_extraction(downloaded)
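In recent versions the result is a Document object which can be converted with its .as_dict() method; accessing the "title" key below assumes the metadata could be extracted:

>>> doc = bare_extraction(downloaded)
# convert the Document object to a dictionary to access single fields
>>> doc.as_dict()["title"]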

Raw HTTP response objects#

The fetch_response() function returns a raw response object which can be passed straight to the extraction.

This can be useful to get the final redirection URL with response.url and then pass it directly as a URL argument to the extraction function:

# necessary components
>>> from trafilatura import fetch_response, bare_extraction

# load an example
>>> response = fetch_response("https://www.example.org")

# perform extract() or bare_extraction() on Trafilatura's response object
>>> bare_extraction(response.data, url=response.url)  # here is the redirection URL

LXML objects#

The input can consist of a previously parsed tree (i.e. an lxml.html object), which is then handled seamlessly:

# define document and load it with LXML
>>> from lxml import html
>>> my_doc = """<html><body><article><p>
                Here is the main text.
                </p></article></body></html>"""
>>> mytree = html.fromstring(my_doc)

# extract from the already loaded LXML tree
>>> extract(mytree)
'Here is the main text.'

Interaction with BeautifulSoup#

Here is how to convert a BS4 object to LXML format in order to use it with Trafilatura:

>>> from bs4 import BeautifulSoup
>>> from lxml.html.soupparser import convert_tree
>>> from trafilatura import extract

>>> soup = BeautifulSoup("<html><body><time>The date is Feb 2, 2024</time></body></html>", "lxml")
>>> lxml_tree = convert_tree(soup)[0]
>>> extract(lxml_tree)

Deprecations#

The following functions and arguments are deprecated:

  • extraction:
    • process_record() function → use extract() instead

    • csv_output, json_output, tei_output, xml_output → use output_format parameter instead

    • bare_extraction(as_dict=True) → the function returns a Document object, use .as_dict() method on it

    • bare_extraction() and extract(): no_fallback → use fast instead

    • max_tree_size parameter moved to settings.cfg file

  • downloads: decode argument in fetch_url() → use fetch_response instead

  • utils: decode_response() function → use decode_file() instead

  • metadata: with_metadata (include metadata) once had the effect of today’s only_with_metadata (only output documents featuring the necessary metadata)
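As an example of migration, a call relying on a deprecated output argument can be rewritten as follows:

# before (deprecated)
>>> result = extract(downloaded, json_output=True)
# after
>>> result = extract(downloaded, output_format="json")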