Core functions#
Extraction#
extract()#
- trafilatura.extract(filecontent: Any, url: str | None = None, record_id: str | None = None, fast: bool = False, no_fallback: bool = False, favor_precision: bool = False, favor_recall: bool = False, include_comments: bool = True, output_format: str = 'txt', tei_validation: bool = False, target_language: str | None = None, include_tables: bool = True, include_images: bool = False, include_formatting: bool = False, include_links: bool = False, deduplicate: bool = False, date_extraction_params: dict[str, Any] | None = None, with_metadata: bool = False, only_with_metadata: bool = False, max_tree_size: int | None = None, url_blacklist: set[str] | None = None, author_blacklist: set[str] | None = None, settingsfile: str | None = None, prune_xpath: str | list[str] | None = None, config: ConfigParser = <configparser.ConfigParser object>, options: Extractor | None = None) str | None[source]#
- Main function exposed by the package:
Wrapper for text extraction and conversion to chosen output format.
- Parameters:
filecontent – HTML code as string.
url – URL of the webpage.
record_id – Add an ID to the metadata.
fast – Use faster heuristics and skip backup extraction.
no_fallback – Deprecated, use “fast” instead.
favor_precision – Prefer less text but more accurate extraction.
favor_recall – When unsure, prefer more text.
include_comments – Extract comments along with the main text.
output_format – Define an output format: “csv”, “html”, “json”, “markdown”, “txt”, “xml”, and “xmltei”.
tei_validation – Validate the XML-TEI output with respect to the TEI standard.
target_language – Define a target language; documents in other languages are discarded (ISO 639-1 format).
include_tables – Take into account information within the HTML <table> element.
include_images – Take images into account (experimental).
include_formatting – Keep structural elements related to formatting (only valuable if output_format is set to XML).
include_links – Keep links along with their targets (experimental).
deduplicate – Remove duplicate segments and documents.
date_extraction_params – Provide extraction parameters to htmldate as dict().
with_metadata – Extract metadata fields and add them to the output.
only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).
url_blacklist – Provide a blacklist of URLs as set() to filter out documents.
author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.
settingsfile – Use a configuration file to override the standard settings.
prune_xpath – Provide an XPath expression to prune the tree before extraction. Can be a str or a list of str.
config – Directly provide a configparser configuration.
options – Directly provide a whole extractor configuration.
- Returns:
A string in the desired format or None.
bare_extraction()#
- trafilatura.bare_extraction(filecontent: Any, url: str | None = None, fast: bool = False, no_fallback: bool = False, favor_precision: bool = False, favor_recall: bool = False, include_comments: bool = True, output_format: str = 'python', target_language: str | None = None, include_tables: bool = True, include_images: bool = False, include_formatting: bool = False, include_links: bool = False, deduplicate: bool = False, date_extraction_params: dict[str, Any] | None = None, with_metadata: bool = False, only_with_metadata: bool = False, max_tree_size: int | None = None, url_blacklist: set[str] | None = None, author_blacklist: set[str] | None = None, as_dict: bool = False, prune_xpath: str | list[str] | None = None, config: ConfigParser = <configparser.ConfigParser object>, options: Extractor | None = None) Document | dict[str, Any] | None[source]#
Internal function for text extraction returning bare Python variables.
- Parameters:
filecontent – HTML code as string.
url – URL of the webpage.
fast – Use faster heuristics and skip backup extraction.
no_fallback – Deprecated, use “fast” instead.
favor_precision – Prefer less text but more accurate extraction.
favor_recall – Prefer more text even when unsure.
include_comments – Extract comments along with the main text.
output_format – Define an output format, Python being the default and the interest of this internal function. Other values: “csv”, “html”, “json”, “markdown”, “txt”, “xml”, and “xmltei”.
target_language – Define a target language; documents in other languages are discarded (ISO 639-1 format).
include_tables – Take into account information within the HTML <table> element.
include_images – Take images into account (experimental).
include_formatting – Keep structural elements related to formatting (present in XML format, converted to markdown otherwise).
include_links – Keep links along with their targets (experimental).
deduplicate – Remove duplicate segments and documents.
date_extraction_params – Provide extraction parameters to htmldate as dict().
with_metadata – Extract metadata fields and add them to the output.
only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).
url_blacklist – Provide a blacklist of URLs as set() to filter out documents.
author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.
as_dict – Deprecated, use the .as_dict() method instead.
prune_xpath – Provide an XPath expression to prune the tree before extraction. Can be a str or a list of str.
config – Directly provide a configparser configuration.
options – Directly provide a whole extractor configuration.
- Returns:
A Document object containing all the extracted information (a dict if the deprecated as_dict option is used), or None.
- Raises:
ValueError – Extraction problem.
baseline()#
- trafilatura.baseline(filecontent: Any) tuple[_Element, str, int][source]#
Use baseline extraction function targeting text paragraphs and/or JSON metadata.
- Parameters:
filecontent – HTML code as binary string or string.
- Returns:
An LXML <body> element containing the extracted paragraphs, the main text as string, and its length as integer.
html2txt()#
try_readability()#
try_justext()#
extract_metadata()#
- trafilatura.extract_metadata(filecontent: HtmlElement | str, default_url: str | None = None, date_config: dict[str, Any] | None = None, extensive: bool = True, author_blacklist: set[str] | None = None) Document[source]#
Main process for metadata extraction.
- Parameters:
filecontent – HTML code as string or parsed tree.
default_url – Previously known URL of the downloaded document.
date_config – Provide extraction parameters to htmldate as dict().
extensive – Use extensive search for date extraction.
author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.
- Returns:
A trafilatura.settings.Document containing the extracted metadata information. The Document class has an .as_dict() method that returns a copy as a dict.
extract_comments()#
Link discovery#
sitemap_search()#
- trafilatura.sitemaps.sitemap_search(url: str, target_lang: str | None = None, external: bool = False, sleep_time: float = 2.0, max_sitemaps: int = 10000) list[str][source]#
Look for sitemaps for the given URL and gather links.
- Parameters:
url – Webpage or sitemap URL as string. Triggers URL-based filter if the webpage isn’t a homepage.
target_lang – Define a language to filter URLs based on heuristics (two-letter string, ISO 639-1 format).
external – Keep only URLs from similar hosts (False) or also accept external URLs (True); boolean, defaults to False.
sleep_time – Wait between requests on the same website.
max_sitemaps – Maximum number of sitemaps to process.
- Returns:
The extracted links as a list (sorted list of unique links).
find_feed_urls()#
- trafilatura.feeds.find_feed_urls(url: str, target_lang: str | None = None, external: bool = False, sleep_time: float = 2.0) list[str][source]#
Try to find feed URLs.
- Parameters:
url – Webpage or feed URL as string. Triggers URL-based filter if the webpage isn’t a homepage.
target_lang – Define a language to filter URLs based on heuristics (two-letter string, ISO 639-1 format).
external – Keep only URLs from similar hosts (False) or also accept external URLs (True); boolean, defaults to False.
sleep_time – Wait between requests on the same website.
- Returns:
The extracted links as a list (sorted list of unique links).
focused_crawler()#
- trafilatura.spider.focused_crawler(homepage: str, max_seen_urls: int = 10, max_known_urls: int = 100000, todo: list[str] | None = None, known_links: list[str] | None = None, lang: str | None = None, config: ConfigParser = <configparser.ConfigParser object>, rules: RobotFileParser | None = None, prune_xpath: str | None = None) tuple[list[str], list[str]][source]#
Basic crawler targeting pages of interest within a website.
- Parameters:
homepage – URL of the first page to fetch, preferably the homepage of a website.
max_seen_urls – maximum number of pages to visit, stop iterations at this number or at the exhaustion of pages on the website, whichever comes first.
max_known_urls – stop if the total number of pages “known” exceeds this number.
todo – provide a previously generated list of pages to visit / crawl frontier.
known_links – provide a list of previously known pages.
lang – try to target links according to language heuristics.
config – use a different configuration (configparser format).
rules – provide politeness rules (urllib.robotparser.RobotFileParser() format).
prune_xpath – remove unwanted elements from the HTML pages using XPath.
- Returns:
A tuple of two lists: the pages still to visit (possibly empty if there are no further pages to visit) and the known links.
Helpers#
fetch_url()#
- trafilatura.fetch_url(url: str, no_ssl: bool = False, config: ConfigParser = <configparser.ConfigParser object>, options: Extractor | None = None) str | None[source]#
Downloads a web page and seamlessly decodes the response.
- Parameters:
url – URL of the page to fetch.
no_ssl – Do not try to establish a secure connection (to prevent SSLError).
config – Pass configuration values for output control.
options – Extraction options (supersedes config).
- Returns:
Unicode string or None in case of failed downloads and invalid results.
fetch_response()#
- trafilatura.fetch_response(url: str, *, decode: bool = False, no_ssl: bool = False, with_headers: bool = False, config: ConfigParser = <configparser.ConfigParser object>) Response | None[source]#
Downloads a web page and returns a full response object.
- Parameters:
url – URL of the page to fetch.
decode – Decode the data and store the result in the html attribute of the response (boolean).
no_ssl – Don’t try to establish a secure connection (to prevent SSLError).
with_headers – Keep track of the response headers.
config – Pass configuration values for output control.
- Returns:
Response object or None in case of failed downloads and invalid results.