Installation
============
.. meta::
    :description lang=en:
        Setting up Trafilatura is straightforward. This installation guide walks you through the process step-by-step.
This guide covers all the steps needed to set up Trafilatura.
Python
------
Trafilatura runs on Python, currently one of the most widely used programming languages.
It is tested on Linux, macOS and Windows, and on all recent versions of Python.
- `Djangogirls Installation Tutorial: Python installation `_.
- `Installing Python 3 `_
Some systems already have Python installed. To check, run the following command in a terminal window:
.. code-block:: bash

    $ python3 --version         # python can also work
    Python 3.10.12              # version 3.8 or higher is fine
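The same check can be made from within Python, for instance at the top of a setup script. This is only a sketch: ``is_supported`` is a hypothetical helper, and the minimum of 3.8 is an assumption inferred from the list of older Python versions further down in this guide, not a value read from the package.

```python
import sys

# Assumption: minimum inferred from this guide; adjust to your Trafilatura release.
ASSUMED_MINIMUM = (3, 8)


def is_supported(version_info, minimum=ASSUMED_MINIMUM):
    """Return True if the (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "is" if is_supported(sys.version_info) else "is not"
    # str.format is used so that the check also runs on very old interpreters
    print("Python {} {} recent enough".format(current, status))
```

Comparing tuples avoids the pitfalls of comparing version strings, where ``"3.10"`` would sort before ``"3.8"``.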
Trafilatura package
-------------------
Trafilatura is packaged as a software library available from the package repository `PyPI `_. It can therefore be installed with a package manager such as ``pip`` or ``pipenv``.
Installing Python packages
~~~~~~~~~~~~~~~~~~~~~~~~~~
- Straightforward: `Official documentation `_
- Advanced: `Pipenv & Virtual Environments `_
Basics
~~~~~~
Here is how to install Trafilatura using pip:
1. Open a terminal or command prompt. Please refer to `this section `_ for an introduction to command-line usage.
2. Type the following command: ``pip install trafilatura`` (``pip3`` where applicable)
3. Press *Enter*: pip will download and install Trafilatura and its dependencies.
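When several Python versions are installed, ``pip`` and ``pip3`` may belong to different interpreters. A common way around this is to call pip as a module of the interpreter you intend to use. The following is a minimal sketch of that idea; ``pip_command`` is a hypothetical helper and not part of Trafilatura.

```python
import sys


def pip_command(package, upgrade=False):
    """Build a pip invocation tied to the interpreter running this script."""
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        command.append("--upgrade")
    command.append(package)
    return command


if __name__ == "__main__":
    # Only prints the command; nothing is installed by this sketch
    print(" ".join(pip_command("trafilatura")))
```

To actually run the installation, the resulting list can be passed to ``subprocess.run(..., check=True)``.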
This project is under active development. Keep it up to date to benefit from the latest improvements:
.. code-block:: bash

    # to make sure you have the latest version
    $ pip install --upgrade trafilatura
    # latest available code base
    $ pip install --force-reinstall -U git+https://github.com/adbar/trafilatura
.. hint::
    Installation on macOS is generally easier with `brew `_.
Older Python versions
~~~~~~~~~~~~~~~~~~~~~
On older Python versions, pip normally selects the last compatible release of Trafilatura. In case this does not happen automatically, specify the version number:
``pip install trafilatura==number``
- Last version for Python 3.6 and 3.7: ``1.12.2``
- Last version for Python 3.5: ``0.9.3``
- Last version for Python 3.4: ``0.8.2``
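The list above can be turned into a small lookup, for example to generate the right requirement in a provisioning script. This is an illustrative sketch: ``requirement_for`` and ``LAST_RELEASES`` are hypothetical names, and interpreters that are not in the list are simply left unpinned.

```python
import sys

# Last releases per Python version, as given in the list above
LAST_RELEASES = {
    (3, 7): "1.12.2",
    (3, 6): "1.12.2",
    (3, 5): "0.9.3",
    (3, 4): "0.8.2",
}


def requirement_for(version_info):
    """Return a pip requirement string suited to the given Python version."""
    pinned = LAST_RELEASES.get(tuple(version_info[:2]))
    if pinned is None:
        # Not in the list above: leave the choice of release to pip
        return "trafilatura"
    # Plain concatenation keeps the snippet usable on Python 3.4 and 3.5
    return "trafilatura==" + pinned


if __name__ == "__main__":
    print(requirement_for(sys.version_info))
```

The returned string can be passed directly to ``pip install``.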
Command-line tool
~~~~~~~~~~~~~~~~~
If you installed the library successfully but cannot start the command-line tool, try adding the user-level ``bin`` directory to your ``PATH`` environment variable.
If you are using a Unix-like system (e.g. Linux, macOS), you can achieve this by running the following command: ``export PATH="$HOME/.local/bin:$PATH"``.
For local or user installations where trafilatura cannot be used from the command-line, please refer to `the official Python documentation `_ and this page on `finding executables from the command-line `_.
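To diagnose such cases, it can help to ask Python where the executable is found and whether the user-level ``bin`` directory is part of ``PATH``. The sketch below assumes a Unix-like system; ``user_bin_dir`` and ``is_on_path`` are hypothetical helper names.

```python
import os
import shutil
import site


def user_bin_dir():
    """User-level script directory on Unix-like systems, e.g. ~/.local/bin on Linux."""
    return os.path.join(site.getuserbase(), "bin")


def is_on_path(directory, path_value=None):
    """Check whether a directory is listed in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = [os.path.normpath(entry) for entry in path_value.split(os.pathsep) if entry]
    return os.path.normpath(directory) in entries


if __name__ == "__main__":
    # shutil.which returns None if no executable with that name is found
    print("trafilatura executable:", shutil.which("trafilatura"))
    print("user bin directory:", user_bin_dir())
    print("listed in PATH:", is_on_path(user_bin_dir()))
```

If the executable is reported as ``None`` and the user directory is not listed, extending ``PATH`` as described above is the likely fix.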
Additional functionality
------------------------
Compression
~~~~~~~~~~~
Trafilatura works best if the compression modules of the Python standard library are available. If they are not, the following functionality is affected: processing of compressed HTML data (less coverage), backup HTML storage (CLI), and ``UrlStore`` in the underlying courlan library (reduced capacity).
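A quick way to see whether a Python build lacks compression support is to try importing the modules in question. The list of candidates below is an assumption (common standard-library compression modules), since the exact set Trafilatura relies on is not named here; ``missing_modules`` is a hypothetical helper.

```python
import importlib

# Assumption: typical standard-library compression modules, not an official list
CANDIDATES = ("bz2", "gzip", "lzma", "zlib")


def missing_modules(names=CANDIDATES):
    """Return the names of the modules that cannot be imported."""
    missing = []
    for name in names:
        try:
            # A real import is needed: the module file may exist
            # while its compiled backend is absent
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing compression modules:", missing_modules() or "none")
```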
Optional modules
~~~~~~~~~~~~~~~~
A few additional libraries can be installed for extended functionality and faster processing, e.g. language detection with ``py3langid`` and faster encoding detection with ``faust-cchardet``.
.. code-block:: bash

    $ pip install faust-cchardet     # single package only
    $ pip install trafilatura[all]   # all additional functionality
*For information on dependency management of Python packages see* `this discussion thread `_.
.. hint::
    Everything works even if not all packages are installed (e.g. because installation fails).

    You can also install or update relevant packages separately: *trafilatura* will detect which ones are present on your system and opt for the best available combination.
brotli
    Additional compression algorithm for downloads
faust-cchardet
    Faster encoding detection, also possibly more accurate (especially for encodings used in Asia)
htmldate[all] / htmldate[speed]
    Faster and more precise date extraction with a series of dedicated packages
py3langid
    Language detection on extracted main text
pycurl
    Faster downloads, useful where urllib3 fails
urllib3[socks]
    Downloads through SOCKS proxy with urllib3
zstandard
    Additional compression algorithm for downloads
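To see which of these optional packages are present in the current environment, the package metadata can be queried. This is a sketch for Python 3.8 or later and does not reflect how Trafilatura itself detects them; ``installed_versions`` is a hypothetical helper, and ``urllib3[socks]`` is left out because it is an extra rather than a package of its own.

```python
from importlib import metadata

# Distribution names taken from the list above
OPTIONAL_PACKAGES = (
    "brotli",
    "faust-cchardet",
    "htmldate",
    "py3langid",
    "pycurl",
    "zstandard",
)


def installed_versions(names=OPTIONAL_PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print("{:<16} {}".format(name, version or "not installed"))
```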