Installation
============

.. meta::
    :description lang=en:
        Setting up Trafilatura is straightforward. This installation guide walks you through the process step-by-step.


Setting up Trafilatura is straightforward. This installation guide walks you through the process step-by-step.


Python
------

Trafilatura runs using `Python `_, currently one of the most frequently used programming languages.

This software library/package is tested on Linux, macOS and Windows systems. It is compatible with all recent versions of Python:

-  `Installing Python 3 on Mac OS X `_ (& `official documentation for Mac `_)
-  `Installing Python 3 on Windows `_ (& `official documentation for Windows `_)
-  `Installing Python 3 on Linux `_ (& `official documentation for Unix `_)
-  Beginners guide: `downloading Python `_

You then need a version of Python to interact with, as well as the Python packages needed for the task. A recent version of Python 3 is necessary. Some systems already have such an environment installed. To check, run the following command in a terminal window:

.. code-block:: bash

    $ python3 --version
    Python 3.8.6 # version 3.6 or higher is fine

If Python is not installed, please refer to the excellent `Djangogirls tutorial: Python installation `_.


Trafilatura package
-------------------

Trafilatura is packaged as a software library available from the package repository `PyPI `_. As such, it can be installed with ``pip`` or ``pipenv``.


Installing Python packages
~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Straightforward: `Installing packages in python using pip `_ (& `official documentation `_)

   -  `Using pip on Windows `_

-  Advanced: `Pipenv & Virtual Environments `_


Basics
~~~~~~

Please refer to `this section `_ for an introduction to command-line usage.

.. code-block:: bash

    $ pip install trafilatura # pip3 where applicable

This project is under active development. Please keep it up-to-date to benefit from the latest improvements:

.. code-block:: bash

    # to make sure you have the latest version
    $ pip install -U trafilatura
    # latest available code base
    $ pip install --force-reinstall -U git+https://github.com/adbar/trafilatura

On **macOS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information see this `help page on SSL errors `_.

.. hint::
    Installation on macOS is generally easier with `brew `_.


Older Python versions
~~~~~~~~~~~~~~~~~~~~~

- Last version for Python 3.5: ``pip install trafilatura==0.9.3``
- Last version for Python 3.4: ``pip install trafilatura==0.8.2``


Command-line tool
~~~~~~~~~~~~~~~~~

If you installed the library successfully but cannot start the command-line tool, try adding the user-level ``bin`` directory to your ``PATH`` environment variable.
If you are using a Unix derivative (e.g. Linux, macOS), you can achieve this by running the following command: ``export PATH="$HOME/.local/bin:$PATH"``.

For local or user installations where trafilatura cannot be used from the command-line, please refer to `the official Python documentation `_ and this page on `finding executables from the command-line `_.


Additional functionality
------------------------

Optional modules
~~~~~~~~~~~~~~~~

A few additional libraries can be installed for extended functionality and faster processing, such as language detection and faster encoding detection. The ``cchardet`` package may not work on all systems, but it is highly recommended.

.. code-block:: bash

    $ pip install cchardet # single package only
    $ pip install "trafilatura[all]" # all additional functionality

*For information on dependency management of Python packages see* `this discussion thread `_.

.. hint::
    Everything works even if not all packages are installed (e.g. because installation fails).

    You can also install or update relevant packages separately; *trafilatura* will detect which ones are present on your system and opt for the best available combination.

brotli
    Additional compression algorithm for downloads
cchardet / faust-cchardet (Python >= 3.11)
    Faster encoding detection, also possibly more accurate (especially for encodings used in Asia)
htmldate[all] / htmldate[speed]
    Faster and more precise date extraction with a series of dedicated packages
py3langid
    Language detection on extracted main text
pycurl
    Faster downloads, possibly less robust though


Graphical user interface
------------------------

.. toctree::
   :maxdepth: 2

   installation-gui