With Python
===========

.. meta::
    :description lang=en:
        This tutorial focuses on text extraction from web pages with Python code snippets.
        Data mining with this library encompasses HTML parsing and language identification.


The Python programming language
-------------------------------

Python can be easy to pick up whether you are a first-time programmer or experienced with other languages:

- Official `Python Tutorial `_
- `The Hitchhiker’s Guide to Python `_
- `The Best Python Tutorials (freeCodeCamp) `_


Step-by-step
------------

Quickstart
^^^^^^^^^^

For the basics, see the `quickstart documentation page `_.

.. note::
    For a hands-on tutorial see also the Python Notebook `Trafilatura Overview `_.


Extraction functions
^^^^^^^^^^^^^^^^^^^^

The functions can be imported using ``from trafilatura import ...`` and used on raw documents (strings) or parsed HTML (LXML elements).

Main text extraction, good balance between precision and recall:

- ``extract``: wrapper function, the easiest way to perform text extraction and conversion
- ``bare_extraction``: internal function returning bare Python variables

Additional fallback functions:

- ``baseline``: faster extraction function targeting text paragraphs and/or JSON metadata
- ``html2txt``: extracts all the text in a document, maximizing recall


Output
^^^^^^

By default, the output is in plain text (TXT) format without metadata. The following additional formats are available:

- CSV
- HTML (from version 1.11 onwards)
- JSON
- Markdown (from version 1.9 onwards)
- XML and XML-TEI (following the guidelines of the Text Encoding Initiative)

To specify the output format, use one of the following strings: ``"csv", "json", "html", "markdown", "txt", "xml", "xmltei"``. The ``bare_extraction`` function also accepts an additional ``python`` format for working with Python objects directly.

To extract and include metadata in the output, use the ``with_metadata=True`` argument.

Examples
~~~~~~~~

.. code-block:: python

    # some formatting preserved in basic XML structure
    >>> extract(downloaded, output_format="xml")
    # output in JSON format with metadata extracted
    >>> extract(downloaded, output_format="json", with_metadata=True)

Note that combining the TXT, CSV and JSON formats with certain structural elements (e.g. formatting or links) triggers output in Markdown format (plain text with additional elements).


Choice of HTML elements
^^^^^^^^^^^^^^^^^^^^^^^

Customize the extraction process by including or excluding specific HTML elements:

- Text elements:

  - ``include_comments=True``: include comment sections at the bottom of articles
  - ``include_tables=True``: extract text from HTML ``<table>`` elements

- Structural elements:

  - ``include_formatting=True``: keep structural elements related to formatting (``<b>``/``<strong>``, ``<i>``/``<em>`` etc.)
  - ``include_links=True``: keep link targets (in ``href="..."``)
  - ``include_images=True``: keep track of images along with their targets (``<img>`` attributes: alt, src, title)

To operate on these elements, pass the corresponding parameters to the ``extract()`` function:

.. code-block:: python

    # exclude comments from the output
    >>> result = extract(downloaded, include_comments=False)
    # skip tables and include links in the output
    >>> result = extract(downloaded, include_tables=False, include_links=True)
    # convert relative links to absolute links where possible
    >>> extract(downloaded, output_format='xml', include_links=True, url=url)
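The formatting and image parameters listed above work the same way. As a further sketch (assuming the same ``downloaded`` document as in the previous examples), they can be combined with XML output, which preserves these elements best:

.. code-block:: python

    # keep formatting and image information in the output
    # shown here with XML output, which retains these elements
    >>> result = extract(downloaded, output_format="xml", include_formatting=True, include_images=True)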
Important notes
~~~~~~~~~~~~~~~

- ``include_comments`` and ``include_tables`` are activated by default.
- Including extra elements works best with conversion to XML formats or with ``bare_extraction()``, which allows for direct display and manipulation of the elements.
- Certain elements may not be visible in the output if the chosen format does not allow it.
- Selecting Markdown automatically includes text formatting.

.. hint::
    The heuristics used by the main algorithm change according to the presence of certain elements in the HTML. If the output seems odd, try removing a constraint (e.g. formatting) to improve the result.


The precision and recall presets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The main extraction functions offer two presets to adjust the focus of the extraction process: ``favor_precision`` and ``favor_recall``. These parameters allow you to change the balance between accuracy and comprehensiveness of the output.

.. code-block:: python

    >>> result = extract(downloaded, url, favor_precision=True)

Precision
~~~~~~~~~

- If your results contain too much noise, prioritize precision to focus on the most central and relevant elements.
- Additionally, you can use the ``prune_xpath`` parameter to target specific HTML elements using a list of XPath expressions.

Recall
~~~~~~

- If parts of your documents are missing, try this preset to take more elements into account.
- If content is still missing, refer to the `troubleshooting guide `_.


Additional functions for text extraction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``html2txt`` and ``baseline`` functions offer simpler approaches to extracting text from HTML content, prioritizing performance over precision.

html2txt()
~~~~~~~~~~

The ``html2txt`` function serves as a last resort for extracting text from HTML content. It emulates the behavior of similar functions in other packages and can be used to output all possible text from a given HTML source, maximizing recall. However, it may not always produce accurate or meaningful results, as it does not consider the context of the extracted sections.

.. code-block:: python

    >>> from trafilatura import html2txt
    >>> html2txt(downloaded)

baseline()
~~~~~~~~~~

For a better balance between precision and recall, as well as improved performance, consider using the ``baseline`` function instead. It returns a tuple containing an LXML element with the body, the extracted text as a string, and the length of the text. It uses a set of heuristics to extract text from the HTML content, which generally produces more accurate results than ``html2txt``.

.. code-block:: python

    >>> from trafilatura import baseline
    >>> postbody, text, len_text = baseline(downloaded)

For more advanced use cases, consider using other functions in the package that provide more control and customization over the text extraction process.


Guessing if text can be found
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The function ``is_probably_readerable()``, ported from Mozilla's Readability.js and available from version 1.10 onwards, provides a way to guess whether a page contains a main text to extract.

.. code-block:: python

    >>> from trafilatura.readability_lxml import is_probably_readerable
    >>> is_probably_readerable(html)  # HTML string or already parsed tree


Language identification
^^^^^^^^^^^^^^^^^^^^^^^

The target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match. No such filtering takes place if the language identification component has not been installed (see the `installation instructions `_ above) or if the target language is not available.
.. code-block:: python

    >>> result = extract(downloaded, url, target_language="de")

.. note::
    Additional components are required: ``pip install trafilatura[all]``. This feature currently uses the `py3langid package `_ and depends on the availability of languages and the performance of the original model.


Optimizing for speed
^^^^^^^^^^^^^^^^^^^^

Execution speed depends not only on the platform and on supplementary packages (``trafilatura[all]``, ``htmldate[speed]``) but also on the extraction strategy. The available fallbacks make extraction more precise but also slower. The use of fallback algorithms can be bypassed in *fast* mode, which should make extraction about twice as fast:

.. code-block:: python

    # skip algorithms used as fallback
    >>> result = extract(downloaded, no_fallback=True)

The following combination usually leads to shorter processing times:

.. code-block:: python

    >>> result = extract(downloaded, include_comments=False, include_tables=False, no_fallback=True)


Extraction settings
-------------------

.. hint::
    See also the `settings page `_.

Function parameters
^^^^^^^^^^^^^^^^^^^

Starting from version 1.9, the ``Extractor`` class provides a convenient way to define and manage extraction parameters. It allows users to customize all options used by the extraction functions and offers a shortcut compared to passing multiple function parameters. Here is how to use the class:

.. code-block:: python

    # import the Extractor class from the settings module
    >>> from trafilatura.settings import Extractor
    # set multiple options at once
    >>> options = Extractor(output_format="json", with_metadata=True)
    # add or adjust settings as needed
    >>> options.formatting = True  # same as include_formatting
    >>> options.source = "My Source"  # useful for debugging
    # use the options in an extraction function
    >>> extract(my_doc, options=options)

See the ``settings.py`` file for a full example.


Metadata extraction
^^^^^^^^^^^^^^^^^^^

- ``with_metadata=True``: extract metadata fields and include them in the output
- ``only_with_metadata=True``: only output documents featuring all essential metadata (date, title, url)

Date
~~~~

Among the metadata, dates are handled by an external module: `htmldate `_. By default, the focus is on original dates and the extraction replicates the *fast/no_fallback* option.

`Custom parameters `_ can be passed through the extraction function or through the ``extract_metadata`` function in ``trafilatura.metadata`` (a short sketch follows the example below), most notably:

- ``extensive_search`` (boolean), to activate further heuristics (higher recall, lower precision)
- ``original_date`` (boolean), to look for the original publication date
- ``outputformat`` (string), to provide a custom datetime format
- ``max_date`` (string), to set the latest acceptable date manually (YYYY-MM-DD format)

.. code-block:: python

    # import the extract() function, use a previously downloaded document
    # pass the new parameters as a dict
    >>> extract(downloaded, output_format="xml",
                date_extraction_params={"extensive_search": True, "max_date": "2018-07-01"})
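Besides passing parameters through ``extract()``, the metadata component can also be used on its own. A minimal sketch, assuming the ``extract_metadata`` function in ``trafilatura.metadata`` is called on a previously downloaded document:

.. code-block:: python

    # hedged sketch: run metadata extraction only, without the main text
    >>> from trafilatura.metadata import extract_metadata
    >>> extract_metadata(downloaded)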
URL
~~~

Even if the page to process has already been downloaded, it can still be useful to pass the URL as an argument. See this `previous bug `_ for an example:

.. code-block:: python

    # define a URL and download the example
    >>> url = "https://web.archive.org/web/20210613232513/https://www.thecanary.co/feature/2021/05/19/another-by-election-headache-is-incoming-for-keir-starmer/"
    >>> downloaded = fetch_url(url)
    # content discarded since necessary metadata couldn't be extracted
    >>> bare_extraction(downloaded, only_with_metadata=True)
    >>> # date found in URL, extraction successful
    >>> bare_extraction(downloaded, only_with_metadata=True, url=url)


Memory use
^^^^^^^^^^

Trafilatura uses caches to speed up extraction and cleaning processes. This may lead to memory leaks in some cases, particularly in large-scale applications. If that happens, you can reset all cached information in order to release RAM:

.. code-block:: python

    # import the function
    >>> from trafilatura.meta import reset_caches
    # use it at any given point
    >>> reset_caches()


Input/Output types
------------------

Python objects as output
^^^^^^^^^^^^^^^^^^^^^^^^

The extraction can be customized using a series of parameters; for more information see the `core functions `_ page. The function ``bare_extraction`` can be used to bypass output conversion: it returns Python variables for the metadata (dictionary) as well as the main text and comments (both LXML objects).

.. code-block:: python

    >>> from trafilatura import bare_extraction
    >>> bare_extraction(downloaded)

Raw HTTP response objects
^^^^^^^^^^^^^^^^^^^^^^^^^

The ``fetch_response()`` function returns a response object which can be passed straight to the extraction. This can be useful to retrieve the final redirection URL with ``response.url`` and then pass it directly as a URL argument to the extraction function:

.. code-block:: python

    # necessary components
    >>> from trafilatura import fetch_response, bare_extraction
    # load an example
    >>> response = fetch_response("https://www.example.org")
    # perform extract() or bare_extraction() on Trafilatura's response object
    >>> bare_extraction(response.data, url=response.url)  # here is the redirection URL

LXML objects
^^^^^^^^^^^^

The input can consist of a previously parsed tree (i.e. an *lxml.html* object), which is then handled seamlessly:

.. code-block:: python

    # define a document and load it with LXML
    >>> from lxml import html
    >>> my_doc = """<html>
    ... <body>
    ...     <article>
    ...         <p>Here is the main text.</p>
    ...     </article>
    ... </body>
    ... </html>"""
    >>> mytree = html.fromstring(my_doc)
    # extract from the already loaded LXML tree
    >>> extract(mytree)
    'Here is the main text.'

Interaction with BeautifulSoup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is how to convert a BS4 object to LXML format in order to use it with Trafilatura:

.. code-block:: python

    >>> from bs4 import BeautifulSoup
    >>> from lxml.html.soupparser import convert_tree
    >>> from trafilatura import extract
    >>> soup = BeautifulSoup("<html><body><p>Here is some text.</p></body></html>", "lxml")
    >>> lxml_tree = convert_tree(soup)[0]
    >>> extract(lxml_tree)


Navigation
----------

Three potential navigation strategies are currently available: feeds (mostly for fresh content), sitemaps (for exhaustiveness: all potential pages as listed by the website owners) and discovery by web crawling (i.e. by following internal links, more experimental).

Feeds
^^^^^

The function ``find_feed_urls`` is an all-in-one utility that attempts to discover feeds from a webpage if required and/or downloads and parses feeds. It returns the extracted links as a sorted list of unique links.

.. code-block:: python

    # import the feeds module
    >>> from trafilatura import feeds
    # use the homepage to automatically retrieve feeds
    >>> mylist = feeds.find_feed_urls('https://www.theguardian.com/')
    >>> mylist
    ['https://www.theguardian.com/international/rss', '...']  # and so on
    # use a predetermined feed URL directly
    >>> mylist = feeds.find_feed_urls('https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml')
    >>> mylist != []
    True  # it's not empty

.. note::
    The links are seamlessly filtered for patterns given by the user, e.g. using ``https://www.un.org/en/`` as argument implies taking all URLs corresponding to this category.

An optional argument ``target_lang`` makes it possible to filter links according to their expected target language. A series of heuristics is applied to the link path and parameters to try to discard unwanted URLs, thus saving processing time and download bandwidth.

.. code-block:: python

    # the feeds module has to be imported
    # search for feeds in English
    >>> mylist = feeds.find_feed_urls('https://www.un.org/en/rss.xml', target_lang='en')
    >>> mylist != []
    True  # links found as expected
    # target_lang set to Japanese, the English links are discarded
    >>> mylist = feeds.find_feed_urls('https://www.un.org/en/rss.xml', target_lang='ja')
    >>> mylist
    []

For more information about feeds and web crawling see:

- This blog post: `Using RSS and Atom feeds to collect web pages with Python `_
- This YouTube tutorial: `Extracting links from ATOM and RSS feeds `_

Sitemaps
^^^^^^^^

- YouTube tutorial: `Learn how to process XML sitemaps to extract all texts present on a website `_

.. code-block:: python

    # load the sitemaps module
    >>> from trafilatura import sitemaps
    # automatically find sitemaps by providing the homepage
    >>> mylinks = sitemaps.sitemap_search('https://www.theguardian.com/')
    # the target_lang argument works as explained above
    >>> mylinks = sitemaps.sitemap_search('https://www.un.org/', target_lang='en')

The links are also seamlessly filtered for patterns given by the user, e.g. using ``https://www.theguardian.com/society`` as argument implies taking all URLs corresponding to the society category.

Web crawling
^^^^^^^^^^^^

See the `documentation page on web crawling `_ for more information; a minimal sketch follows the hint below.

.. hint::
    For more information on how to refine and filter a URL collection, see the underlying `courlan `_ library.
Deprecations
------------

The following functions and arguments are deprecated (a short migration sketch follows the list):

- extraction:

  - ``process_record()`` function → use ``extract()`` instead
  - ``csv_output``, ``json_output``, ``tei_output``, ``xml_output`` → use the ``output_format`` parameter instead
  - ``bare_extraction(as_dict=True)`` → the function returns a ``Document`` object, use its ``.as_dict()`` method
  - ``bare_extraction()`` and ``extract()``: ``no_fallback`` → use ``fast`` instead
  - ``max_tree_size`` parameter: moved to the ``settings.cfg`` file

- downloads: ``decode`` argument in ``fetch_url()`` → use ``fetch_response`` instead
- utils: ``decode_response()`` function → use ``decode_file()`` instead
- metadata: ``with_metadata`` (include metadata) once had the effect of today's ``only_with_metadata`` (only output documents featuring the necessary metadata)
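The following sketch illustrates the migration path for the extraction items above. It assumes a previously downloaded document and a recent version of the library; the deprecated calls are shown as comments only.

.. code-block:: python

    >>> from trafilatura import bare_extraction, extract

    # deprecated: process_record(downloaded), xml_output=True, no_fallback=True
    # current equivalents according to the list above:
    >>> result = extract(downloaded, output_format="xml", fast=True)

    # bare_extraction() returns a Document object, convertible to a dictionary
    >>> doc = bare_extraction(downloaded)
    >>> doc.as_dict()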