R is a free software environment for statistical computing and graphics. The reticulate package provides a comprehensive set of tools for interoperability between Python and R: it allows Python code to be executed inside an R session, so that Python packages can be used with minimal adaptation. This is ideal for users who prefer to work from R rather than switching back and forth between languages and environments.
The package provides several ways to integrate Python code into R projects:
Python in R Markdown
Importing Python modules
Sourcing Python scripts
An interactive Python console within R
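Assuming reticulate and a Python interpreter are installed, these integration modes can be sketched as follows (the temporary script written here is purely illustrative):

```r
library(reticulate)

# import a Python module as an R object
os <- import("os")
print(os$getcwd())

# source a Python script: its functions become callable from R
pyfile <- file.path(tempdir(), "greet.py")
writeLines("def greet(name):\n    return 'Hello ' + name", pyfile)
source_python(pyfile)
print(greet("R"))

# open an interactive Python console (run manually in a session)
# repl_python()
```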
Complete vignette: Calling Python from R.
This tutorial shows how to import a Python scraper straight from R and use the results directly with the usual R syntax: Web scraping with R: Text and metadata extraction.
The reticulate package can be easily installed from CRAN as follows:
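At the R prompt:

```r
> install.packages("reticulate")
```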
A recent version of Python 3 is required. Some systems already have such an environment installed; to check, run the following command in a terminal window:

$ python3 --version
Python 3.8.6  # version 3.6 or higher is fine
In case Python is not installed, please refer to the excellent Djangogirls tutorial: Python installation.
Here is a simple example using the py_install() function included in reticulate:

> library(reticulate)
> py_install("trafilatura")
Download and extraction
Text extraction from HTML documents (including downloads) is available in a straightforward way:
# getting started
> install.packages("reticulate")
> library(reticulate)
> trafilatura <- import("trafilatura")
# get an HTML document as a string
> url <- "https://example.org/"
> downloaded <- trafilatura$fetch_url(url)
# extraction
> trafilatura$extract(downloaded)
[1] "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\nMore information..."
# extraction with arguments
> trafilatura$extract(downloaded, output_format="xml", url=url)
[1] "<doc sitename=\"example.org\" title=\"Example Domain\" source=\"https://example.org/\" hostname=\"example.org\" categories=\"\" tags=\"\" fingerprint=\"lxZaiIwoxp80+AXA2PtCBnJJDok=\">\n <main>\n <div>\n <head>Example Domain</head>\n <p>This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.</p>\n <p>More information...</p>\n </div>\n </main>\n <comments/>\n</doc>"
For a full list of arguments see extraction documentation.
Already stored documents can also be read directly from R, for example with CSV/TSV output and read_delim(); see information on data import in R.
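A minimal sketch of this round trip, using base R only (readr::read_delim() works analogously); the column names, file path, and sample text are illustrative:

```r
# store extraction results as TSV
results <- data.frame(
  url  = "https://example.org/",
  text = "This domain is for use in illustrative examples in documents.",
  stringsAsFactors = FALSE
)
tsvfile <- file.path(tempdir(), "extracts.tsv")
write.table(results, tsvfile, sep = "\t", row.names = FALSE, quote = TRUE)

# read the stored documents back into a data frame
stored <- read.delim(tsvfile, stringsAsFactors = FALSE)
print(stored)
```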
The html2txt function extracts all possible text from a webpage; it can be used as follows:
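A short sketch, assuming trafilatura has been imported via reticulate as shown above; the HTML string is a toy example:

```r
library(reticulate)
trafilatura <- import("trafilatura")

# html2txt returns all text content from the document, not just the main text
html <- "<html><body><header>Menu</header><p>This is all the text.</p></body></html>"
alltext <- trafilatura$html2txt(html)
print(alltext)
```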
Specific parts of the package can also be imported on demand, which provides access to functions not directly exported by the package. For a list of relevant functions and arguments see core functions.
# using the code for link discovery in sitemaps
> sitemapsfunc <- py_run_string("from trafilatura.sitemaps import sitemap_search")
> sitemapsfunc$sitemap_search("https://www.sitemaps.org/")
[1] "https://www.sitemaps.org"
[2] "https://www.sitemaps.org/protocol.html"
[3] "https://www.sitemaps.org/faq.html"
[4] "https://www.sitemaps.org/terms.html"
# and so on...

# import the metadata part of the package as a function
> metadatafunc <- py_run_string("from trafilatura.metadata import extract_metadata")
> downloaded <- trafilatura$fetch_url("https://github.com/rstudio/reticulate")
> metadatafunc$extract_metadata(downloaded)
$title
[1] "rstudio/reticulate"

$author
[1] "Rstudio"

$url
[1] "https://github.com/rstudio/reticulate"

$hostname
[1] "github.com"

# and so on...