With R#

Introduction#

R is a free software environment for statistical computing and graphics. Trafilatura extends its capabilities to the R community. Discover how to use Trafilatura in your R projects with this dedicated guide.

The reticulate package provides a comprehensive set of tools for seamless interoperability between Python and R. It allows Python code to be executed inside an R session, so that Python packages can be used with minimal adaptation. This is ideal for those who would rather work from R than switch back and forth between languages and environments.

The package provides several ways to integrate Python code into R projects:

  • Python in R Markdown

  • Importing Python modules

  • Sourcing Python scripts

  • An interactive Python console within R

Complete vignette: Calling Python from R.
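
As a minimal sketch of these entry points (the os module and the script name my_script.py are only placeholders):

# minimal sketch: import a module, source a script, open a Python console
> library(reticulate)
> os <- import("os")               # import a Python module as an R object
> os$getcwd()                      # call its functions with the usual R syntax
> source_python("my_script.py")    # run a Python script and expose its objects in R
> repl_python()                    # start an interactive Python console within R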

This tutorial shows how to import a Python scraper straight from R and use the results directly with the usual R syntax: Web scraping with R: Text and metadata extraction.

Installation#

The reticulate package can be easily installed from CRAN as follows:

> install.packages("reticulate")

A recent version of Python 3 is necessary. Some systems already have such an environment installed; to check, run the following command in a terminal window:

$ python3 --version
Python 3.8.6 # version 3.6 or higher is fine

In case Python is not installed, please refer to the excellent Djangogirls tutorial: Python installation.

Trafilatura has to be installed with pip, conda, or py_install. You can skip the installation of Miniconda if it doesn’t seem necessary; you should only be prompted once. For more options, see Installing Python Packages.

Here is a simple example using the py_install() function included in reticulate:

> library(reticulate)
> py_install("trafilatura")
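
Alternatively, the package can be installed with pip from a terminal window outside of R; the exact command may vary with your Python setup:

$ pip3 install trafilatura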

Download and extraction#

Text extraction from HTML documents, including their download, works in a straightforward way:

# getting started
> install.packages("reticulate")
> library(reticulate)
> trafilatura <- import("trafilatura")

# get an HTML document as a string
> url <- "https://example.org/"
> downloaded <- trafilatura$fetch_url(url)

# extraction
> trafilatura$extract(downloaded)
[1] "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\nMore information..."

# extraction with arguments
> trafilatura$extract(downloaded, output_format="xml", url=url)
[1] "<doc sitename=\"example.org\" title=\"Example Domain\" source=\"https://example.org/\" hostname=\"example.org\" categories=\"\" tags=\"\" fingerprint=\"lxZaiIwoxp80+AXA2PtCBnJJDok=\">\n  <main>\n    <div>\n      <head>Example Domain</head>\n      <p>This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.</p>\n      <p>More information...</p>\n    </div>\n  </main>\n  <comments/>\n</doc>"

For a full list of arguments, see the extraction documentation.
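
As an illustration, several arguments can be combined in a single call; the argument names below follow the Python function and are given as an assumption, as they may change between versions:

# extraction with additional arguments (R booleans are converted automatically)
> trafilatura$extract(downloaded, output_format="txt", include_comments=FALSE, include_tables=TRUE)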

Documents that are already stored can also be read directly from R, for example as CSV/TSV output with read_delim(); see the information on data import in R.
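
A minimal sketch, assuming a locally stored HTML file and a TSV file produced beforehand (both file names are placeholders):

# extract from a file stored on disk
> html_string <- paste(readLines("stored_page.html", warn=FALSE), collapse="\n")
> trafilatura$extract(html_string)

# read previously generated CSV/TSV output back into R
> library(readr)
> results <- read_delim("trafilatura_output.tsv", delim="\t")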

The html2txt function extracts all text present on the web page; it can be used as follows:

> trafilatura$html2txt(downloaded)

Other functions#

Specific parts of the package can also be imported on demand, which gives access to functions that are not directly exported by the package. For a list of relevant functions and their arguments, see core functions.

# using the code for link discovery in sitemaps
> sitemapsfunc <- py_run_string("from trafilatura.sitemaps import sitemap_search")
> sitemapsfunc$sitemap_search("https://www.sitemaps.org/")
[1] "https://www.sitemaps.org"
[2] "https://www.sitemaps.org/protocol.html"
[3] "https://www.sitemaps.org/faq.html"
[4] "https://www.sitemaps.org/terms.html"
# and so on...

# import the metadata part of the package as a function
> metadatafunc <- py_run_string("from trafilatura.metadata import extract_metadata")
> downloaded <- trafilatura$fetch_url("https://github.com/rstudio/reticulate")
> metadatafunc$extract_metadata(downloaded)
$title
[1] "rstudio/reticulate"

$author
[1] "Rstudio"

$url
[1] "https://github.com/rstudio/reticulate"

$hostname
[1] "github.com"
# and so on...
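
The same functions can also be obtained by importing the submodules directly with import(), as an alternative to py_run_string(); the sketch below assumes the objects defined earlier in this section:

# alternative: import the submodules as R objects
> sitemaps <- import("trafilatura.sitemaps")
> sitemaps$sitemap_search("https://www.sitemaps.org/")
> metadata <- import("trafilatura.metadata")
> metadata$extract_metadata(downloaded)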

Going further#