Core functions

Extraction

extract()

trafilatura.extract(filecontent: Any, url: str | None = None, record_id: str | None = None, fast: bool = False, no_fallback: bool = False, favor_precision: bool = False, favor_recall: bool = False, include_comments: bool = True, output_format: str = 'txt', tei_validation: bool = False, target_language: str | None = None, include_tables: bool = True, include_images: bool = False, include_formatting: bool = False, include_links: bool = False, deduplicate: bool = False, date_extraction_params: dict[str, Any] | None = None, with_metadata: bool = False, only_with_metadata: bool = False, max_tree_size: int | None = None, url_blacklist: set[str] | None = None, author_blacklist: set[str] | None = None, settingsfile: str | None = None, prune_xpath: str | list[str] | None = None, config: ConfigParser = <configparser.ConfigParser object>, options: Extractor | None = None) → str | None
Main function exposed by the package: a wrapper for text extraction and conversion to the chosen output format.

Parameters:
  • filecontent – HTML code as string.

  • url – URL of the webpage.

  • record_id – Add an ID to the metadata.

  • fast – Use faster heuristics and skip backup extraction.

  • no_fallback – Deprecated, use “fast” instead.

  • favor_precision – Prefer less text but correct extraction.

  • favor_recall – When unsure, prefer more text.

  • include_comments – Extract comments along with the main text.

  • output_format – Define an output format: “csv”, “html”, “json”, “markdown”, “txt”, “xml”, and “xmltei”.

  • tei_validation – Validate the XML-TEI output with respect to the TEI standard.

  • target_language – Define a language to discard invalid documents (ISO 639-1 format).

  • include_tables – Take into account information within the HTML <table> element.

  • include_images – Take images into account (experimental).

  • include_formatting – Keep structural elements related to formatting (only valuable if output_format is set to XML).

  • include_links – Keep links along with their targets (experimental).

  • deduplicate – Remove duplicate segments and documents.

  • date_extraction_params – Provide extraction parameters to htmldate as dict().

  • with_metadata – Extract metadata fields and add them to the output.

  • only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).

  • url_blacklist – Provide a blacklist of URLs as set() to filter out documents.

  • author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.

  • settingsfile – Use a configuration file to override the standard settings.

  • prune_xpath – Provide an XPath expression to prune the tree before extraction. Can be a str or a list of str.

  • config – Directly provide a configparser configuration.

  • options – Directly provide a whole extractor configuration.

Returns:

A string in the desired format or None.

bare_extraction()

trafilatura.bare_extraction(filecontent: Any, url: str | None = None, fast: bool = False, no_fallback: bool = False, favor_precision: bool = False, favor_recall: bool = False, include_comments: bool = True, output_format: str = 'python', target_language: str | None = None, include_tables: bool = True, include_images: bool = False, include_formatting: bool = False, include_links: bool = False, deduplicate: bool = False, date_extraction_params: dict[str, Any] | None = None, with_metadata: bool = False, only_with_metadata: bool = False, max_tree_size: int | None = None, url_blacklist: set[str] | None = None, author_blacklist: set[str] | None = None, as_dict: bool = False, prune_xpath: str | list[str] | None = None, config: ConfigParser = <configparser.ConfigParser object>, options: Extractor | None = None) → Document | dict[str, Any] | None

Internal function for text extraction returning bare Python variables.

Parameters:
  • filecontent – HTML code as string.

  • url – URL of the webpage.

  • fast – Use faster heuristics and skip backup extraction.

  • no_fallback – Deprecated, use “fast” instead.

  • favor_precision – Prefer less text but correct extraction.

  • favor_recall – Prefer more text even when unsure.

  • include_comments – Extract comments along with the main text.

  • output_format – Define an output format, Python being the default and the interest of this internal function. Other values: “csv”, “html”, “json”, “markdown”, “txt”, “xml”, and “xmltei”.

  • target_language – Define a language to discard invalid documents (ISO 639-1 format).

  • include_tables – Take into account information within the HTML <table> element.

  • include_images – Take images into account (experimental).

  • include_formatting – Keep structural elements related to formatting (present in XML format, converted to markdown otherwise).

  • include_links – Keep links along with their targets (experimental).

  • deduplicate – Remove duplicate segments and documents.

  • date_extraction_params – Provide extraction parameters to htmldate as dict().

  • with_metadata – Extract metadata fields and add them to the output.

  • only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).

  • url_blacklist – Provide a blacklist of URLs as set() to filter out documents.

  • author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.

  • as_dict – Deprecated, use the .as_dict() method instead.

  • prune_xpath – Provide an XPath expression to prune the tree before extraction. Can be a str or a list of str.

  • config – Directly provide a configparser configuration.

  • options – Directly provide a whole extractor configuration.

Returns:

A Document object containing all the extracted information (a Python dict if the deprecated as_dict option is set), or None.

Raises:

ValueError – Extraction problem.

baseline()

trafilatura.baseline(filecontent: Any) → tuple[_Element, str, int]

Use baseline extraction function targeting text paragraphs and/or JSON metadata.

Parameters:

filecontent – HTML code as binary string or string.

Returns:

A LXML <body> element containing the extracted paragraphs, the main text as string, and its length as integer.

html2txt()

trafilatura.html2txt(content: Any, clean: bool = True) → str

Run basic html2txt on a document.

Parameters:
  • content – HTML document as string or LXML element.

  • clean – Remove potentially undesirable elements.

Returns:

The extracted text in the form of a string or an empty string.

try_readability()

trafilatura.external.try_readability(htmlinput: HtmlElement) → HtmlElement

Safety net: try with the generic algorithm readability.

try_justext()

trafilatura.external.try_justext(tree: HtmlElement, url: str | None, target_language: str | None) → _Element

Second safety net: try with the generic algorithm justext.

extract_metadata()

trafilatura.extract_metadata(filecontent: HtmlElement | str, default_url: str | None = None, date_config: dict[str, Any] | None = None, extensive: bool = True, author_blacklist: set[str] | None = None) → Document

Main process for metadata extraction.

Parameters:
  • filecontent – HTML code as string or parsed tree.

  • default_url – Previously known URL of the downloaded document.

  • date_config – Provide extraction parameters to htmldate as dict().

  • extensive – Use extensive search for date extraction.

  • author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.

Returns:

A trafilatura.settings.Document containing the extracted metadata information. The Document class has an .as_dict() method that returns a copy as a dict.

extract_comments()

trafilatura.core.extract_comments(tree: HtmlElement, options: Extractor) → tuple[_Element, str, int, HtmlElement]

Try to extract comments out of potential sections in the HTML.

Helpers

fetch_url()

trafilatura.fetch_url(url: str, no_ssl: bool = False, config: ConfigParser = <configparser.ConfigParser object>, options: Extractor | None = None) → str | None

Downloads a web page and seamlessly decodes the response.

Parameters:
  • url – URL of the page to fetch.

  • no_ssl – Do not try to establish a secure connection (to prevent SSLError).

  • config – Pass configuration values for output control.

  • options – Extraction options (supersedes config).

Returns:

Unicode string or None in case of failed downloads and invalid results.

fetch_response()

trafilatura.fetch_response(url: str, *, decode: bool = False, no_ssl: bool = False, with_headers: bool = False, config: ConfigParser = <configparser.ConfigParser object>) → Response | None

Downloads a web page and returns a full response object.

Parameters:
  • url – URL of the page to fetch.

  • decode – Decode the data and store the result in the html attribute of the response (boolean).

  • no_ssl – Don’t try to establish a secure connection (to prevent SSLError).

  • with_headers – Keep track of the response headers.

  • config – Pass configuration values for output control.

Returns:

Response object or None in case of failed downloads and invalid results.

decode_file()

trafilatura.utils.decode_file(filecontent: bytes | str) → str

Check if the bytestring could be GZip-compressed and decompress it if so, then guess the bytestring encoding and try to decode it to a Unicode string. Resort to destructive conversion otherwise.

load_html()

trafilatura.load_html(htmlobject: Any) → HtmlElement | None

Load object given as input and validate its type (accepted: lxml.html tree, trafilatura/urllib3 response, bytestring and string).

sanitize()

trafilatura.utils.sanitize(text: str, preserve_space: bool = False, trailing_space: bool = False) → str | None

Convert text and discard incompatible and invalid characters.

trim()

trafilatura.utils.trim(string: str) → str

Remove unnecessary spaces within a text string.

XML processing

xmltotxt()

trafilatura.xml.xmltotxt(xmloutput: _Element | None, include_formatting: bool) → str

Convert to plain text format and optionally preserve formatting as markdown.

validate_tei()

trafilatura.xml.validate_tei(xmldoc: _Element) → bool

Check if an XML document conforms to the guidelines of the Text Encoding Initiative.