With R
======

.. meta::
    :description lang=en:
        Trafilatura extends its download and extraction capabilities to the R community.
        Discover how to use Trafilatura in your R projects with this dedicated guide.


Introduction
------------

`R `_ is a free software environment for statistical computing and graphics. Trafilatura extends its download and extraction capabilities to the R community, and this guide shows how to use it in R projects.

The `reticulate `_ package provides a comprehensive set of tools for interoperability between Python and R. It runs Python code inside an R session, so that Python packages can be used with minimal adaptation. This suits users who would rather work from R than switch back and forth between languages and environments.

The package provides several ways to integrate Python code into R projects:

- Python in R Markdown
- Importing Python modules
- Sourcing Python scripts
- An interactive Python console within R.

Complete vignette: `Calling Python from R `_.

This tutorial shows how to import a Python scraper straight from R and use the results directly with the usual R syntax: `Web scraping with R: Text and metadata extraction `_.


Installation
------------

The reticulate package can be installed from CRAN as follows:

.. code-block:: R

    > install.packages("reticulate")

A recent version of Python 3 is necessary. Some systems already have such an environment installed. To check, run the following command in a terminal window:

.. code-block:: bash

    $ python3 --version
    Python 3.8.6 # version 3.6 or higher is fine

In case Python is not installed, please refer to the excellent `Djangogirls tutorial: Python installation `_.

``Trafilatura`` has to be installed with `pip `_, `conda `_, or `py_install `_. You can skip the installation of Miniconda if it does not seem necessary (you should only be prompted once), or see `Installing Python Packages `_.

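
Because ``reticulate`` may bind to a different Python installation than the one found from the terminal, it can help to check which interpreter it will use. The following is a minimal sketch, assuming ``reticulate`` is already installed; ``py_discover_config()`` and ``py_module_available()`` belong to ``reticulate``, not to Trafilatura:

.. code-block:: R

    > library(reticulate)
    # show which Python installation reticulate would bind to
    > py_discover_config()
    # check whether the module can be imported (FALSE until it has been installed)
    > py_module_available("trafilatura")
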
Here is a simple example using the ``py_install()`` function included in ``reticulate``:

.. code-block:: R

    > library(reticulate)
    > py_install("trafilatura")


Download and extraction
-----------------------

Text extraction from HTML documents (including downloads) is straightforward:

.. code-block:: R

    # getting started
    > install.packages("reticulate")
    > library(reticulate)
    > trafilatura <- import("trafilatura")

    # get an HTML document as a string
    > url <- "https://example.org/"
    > downloaded <- trafilatura$fetch_url(url)

    # extraction
    > trafilatura$extract(downloaded)
    [1] "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\nMore information..."

    # extraction with arguments
    > trafilatura$extract(downloaded, output_format="xml", url=url)
    # returns the document as a single XML string: the title ("Example Domain")
    # followed by the two paragraphs of text shown above

For a full list of arguments see `extraction documentation `_.

Already stored documents can also be read directly from R, for example with CSV/TSV output and ``read_delim()``; see information on `data import in R `_.

The ``html2txt`` function extracts all possible text on the webpage and can be used as follows:

.. code-block:: R

    > trafilatura$html2txt(downloaded)


Other functions
---------------

Specific parts of the package can also be imported on demand, which provides access to functions not directly exported by the package. For a list of relevant functions and arguments see `core functions `_.

.. code-block:: R

    # using the code for link discovery in sitemaps
    > sitemapsfunc <- py_run_string("from trafilatura.sitemaps import sitemap_search")
    > sitemapsfunc$sitemap_search("https://www.sitemaps.org/")
    [1] "https://www.sitemaps.org"
    [2] "https://www.sitemaps.org/protocol.html"
    [3] "https://www.sitemaps.org/faq.html"
    [4] "https://www.sitemaps.org/terms.html"
    # and so on...

    # import the metadata part of the package as a function
    > metadatafunc <- py_run_string("from trafilatura.metadata import extract_metadata")
    > downloaded <- trafilatura$fetch_url("https://github.com/rstudio/reticulate")
    > metadatafunc$extract_metadata(downloaded)
    $title
    [1] "rstudio/reticulate"

    $author
    [1] "Rstudio"

    $url
    [1] "https://github.com/rstudio/reticulate"

    $hostname
    [1] "github.com"
    # and so on...


Going further
-------------

- `Basic Text Processing in R `_
- `Quanteda `_ is an R package for managing and analyzing text