With R ====== .. meta:: :description lang=en: Trafilatura extends its download and extractions capabilities to the R community. Discover how to use Trafilatura in your R projects with this dedicated guide. Introduction ------------ R is a free software environment for statistical computing and graphics. `Reticulate `_ is an R package that enables easy interoperability between R and Python. With Reticulate, you can import Python modules as if they were R packages and call Python functions from R. This allows R users to leverage the vast array of Python packages and tools and basically allows for execution of Python code inside an R session. Python packages can then be used with minimal adaptations rather than having to go back and forth between languages and environments. Installation ------------ Both R and Python installed on your system, for the latter see `installation page `_. Reticulate ^^^^^^^^^^ The reticulate package can be easily installed from CRAN and then loaded into your R session: .. code-block:: R > install.packages("reticulate") > library(reticulate) A recent version of Python is necessary. Some systems already have such an environment installed, to check it just run the following command in a terminal window: .. code-block:: bash $ python3 --version Python 3.10.12 # version 3.10 or higher is fine By default, reticulate will use the Python executable found on your system's PATH. You can use the ``use_python()`` function to set the Python version and path that you want to use: ``use_python("/path/to/python/executable")``. You can also use the ``py_config()`` function to check your current Python configuration. Trafilatura ^^^^^^^^^^^ The most convenient way to install Python packages is to use the `reticulate::py_install() `_ function. Of course ``Trafilatura`` can also be installed with `pip `_ as any other Python package. Skip the installation of Miniconda if it doesn't seem necessary, you should only be prompted once; or see `Installing Python Packages `_. Here is a simple example using the ``py_install()`` function included in ``reticulate``: .. code-block:: R > library(reticulate) > py_install("trafilatura") Here is how to do it with the ``pip`` Python package manager: .. code-block:: bash $ pip install trafilatura Once you have installed a Python package, you can use it from R with the ``import()`` function. Download and extraction ----------------------- This section demonstrates how to use functions in a R environment. Beyond the examples below, all functions in these documentation pages should be available as well. R syntax ^^^^^^^^ Text extraction from HTML documents (including downloads) is available in a straightforward way: .. code-block:: R # getting started > install.packages("reticulate") > library(reticulate) # loading the Trafilatura module > trafilatura <- import("trafilatura") # fetching a web page > url <- "https://example.org/" > downloaded <- trafilatura$fetch_url(url) # extracting the text content > text <- trafilatura$extract(downloaded) > cat(text) [1] "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\nMore information..." # extraction with arguments > trafilatura$extract(downloaded, output_format="xml", url=url) [1] "\n
\n
\n Example Domain\n

This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.

\n

More information...

\n
\n
\n \n
" For a full list of arguments see `extraction documentation `_. Already stored documents can also be read directly from R, for example with CSV/TSV output and ``read_delim()``, see information on `data import in R `_. The ``html2txt`` function extracts all possible text on the webpage, it can be used as follows: .. code-block:: R > trafilatura$html2txt(downloaded) Python syntax ^^^^^^^^^^^^^ You can also use Python functions and objects from R. For example: .. code-block:: R > py_run_string("import trafilatura") > url <- "https://www.example.com" > py_df <- py_run_string("trafilatura.extract(url)") > df <- py_to_r(py_df) Other functions --------------- Specific parts of the package can also be imported on demand, which provides access to functions not directly exported by the package. For a list of relevant functions and arguments see `core functions `_. .. code-block:: R # using the code for link discovery in sitemaps > sitemapsfunc <- py_run_string("from trafilatura.sitemaps import sitemap_search") > sitemapsfunc$sitemap_search("https://www.sitemaps.org/") [1] "https://www.sitemaps.org" [2] "https://www.sitemaps.org/protocol.html" [3] "https://www.sitemaps.org/faq.html" [4] "https://www.sitemaps.org/terms.html" # and so on... # import the metadata part of the package as a function > metadatafunc <- py_run_string("from trafilatura.metadata import extract_metadata") > downloaded <- trafilatura$fetch_url("https://github.com/rstudio/reticulate") > metadatafunc$extract_metadata(downloaded) $title [1] "rstudio/reticulate" $author [1] "Rstudio" $url [1] "https://github.com/rstudio/reticulate" $hostname [1] "github.com" # and so on... Going further ------------- By combining the web scraping capabilities of Trafilatura with the data analysis capabilities of R, you can create powerful workflows for extracting and analyzing data from web pages. Further resources: - Complete vignette: `Calling Python from R `_. - Tutorial showing how to import a Python scraper and use the results directly with the usual R syntax: `Web scraping with R: Text and metadata extraction `_. Working with the content: - `Basic Text Processing in R `_