Installing Python packages
Over the past decade, the Python programming language and Scientific Python packages like NumPy, SciPy, JAX, and PyTorch have gained a lot of popularity in the data science and HPC communities.
A Python installation usually consists of the Python interpreter, the Python standard library, and one or more third-party Python packages. Such packages may include both compiled code and many so-called Python modules, i.e. small files containing Python code. A typical Conda environment tends to contain tens to hundreds of thousands of relatively small files, which count against your file quota. Additionally, installing such a large number of small files on our storage can put a lot of strain on the filesystem and can lead to suboptimal cluster performance.
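To get a feeling for how many files an environment contributes, a short script can walk a directory tree and count them. The following is a minimal sketch rather than an official tool; the function name count_files is hypothetical, and the demonstration runs on a temporary directory instead of a real environment.

```python
import os
import tempfile


def count_files(root):
    """Return (number of files, total size in bytes) below `root`."""
    n_files = 0
    n_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Symbolic links are skipped; only regular files are counted.
            if os.path.islink(path):
                continue
            n_files += 1
            n_bytes += os.path.getsize(path)
    return n_files, n_bytes


if __name__ == "__main__":
    # Demonstration on a throwaway directory; point `root` at an
    # environment directory of your own to inspect a real installation.
    with tempfile.TemporaryDirectory() as root:
        for i in range(3):
            with open(os.path.join(root, f"module_{i}.py"), "w") as fh:
                fh.write("x = 1\n")
        print(count_files(root))
```

Running the function on an environment directory before and after installing a package gives a rough idea of how many files that package adds.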
Expert tip: Use containerized environments
In order to circumvent the issue of “lots of tiny files” and potentially degraded performance, you can wrap your Python environment in a container and run your simulations from within this isolated environment. Note that this approach is especially useful if the Python environment is very static and is going to be used over long periods of time!
Anaconda (conda)
Conda is an open source environment and package management system. With Conda you can create independent environments, where you can install applications such as Python and R, together with any packages which will be used by these applications. The environments are independent, with the Conda package manager managing the binaries, resolving dependencies, and ensuring that packages used in multiple environments are stored only once. In a typical setting, each user has their own installation of Conda and a set of personal environments.
Generic binaries
Conda installs generic binaries that may perform suboptimally on the UBELIX cluster. In most situations it is recommended to use the pre-installed libraries rather than the versions that can be installed from conda.
Channels
Conda channels are the locations where packages are stored. There are multiple channels, some important ones being:
defaults, the default channel,
anaconda, a mirror of the default channel,
bioconda, a distribution of bioinformatics software, and
conda-forge, a community-led collection of recipes, build infrastructure, and distributions for the conda package manager.
One of the most useful channels is conda-forge. Channels are usually hosted on the official Anaconda page, but on rare occasions custom channels may be used. For instance, the default channel is hosted independently from the official Anaconda page. Many channels also maintain web pages with documentation both for their usage and for the packages they distribute.
Loading the Anaconda3 module
The Anaconda3 distribution is provided as a module on UBELIX. To use any conda commands, load the module using
module load Anaconda3
Using conda
When using conda the system may complain about:
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
Please do not run conda init. Instead, initialise the conda environment using:
module load Anaconda3
eval "$(conda shell.bash hook)"
This should also be used in your batch submission scripts when working with conda environments.
Managing environments
As an example, the creation and use of an environment for pandas jobs is presented. The command
conda create --name pandas
creates an environment named pandas. The environment is activated with the command
conda activate pandas
Next, install pandas and any other packages required by the project. To install packages, first ensure that the pandas environment is active, and then install any package with the command
conda install <package_name>
To install a package from a specific channel, use
conda install --channel <channel_name> <package_name>
For instance, to install pandas from the conda-forge channel:
conda install --channel conda-forge pandas
Packages in the conda-forge channel come with instructions for their installation. Quite often the channel is specified in the installation instructions, as -c conda-forge or --channel conda-forge.
After work in an environment is complete, deactivate the environment with
conda deactivate
Using environments in submission scripts
Since all computationally heavy operations must be performed on compute nodes, Conda environments are also used in jobs submitted to the queuing system. Returning to the pandas example, a submission script running a single-core pandas job can use the pandas environment as follows:
#!/bin/bash
#SBATCH --job-name pandas-test-job
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 1
#SBATCH --time=0-02:00:00
#SBATCH --partition epyc2,bdw
echo "Launched at $(date)"
echo "Job ID: ${SLURM_JOBID}"
echo "Node list: ${SLURM_NODELIST}"
echo "Submit dir.: ${SLURM_SUBMIT_DIR}"
echo "Numb. of cores: ${SLURM_CPUS_PER_TASK}"
module load Anaconda3
eval "$(conda shell.bash hook)"
conda activate pandas
python pandas_test.py
conda deactivate
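The contents of pandas_test.py are not shown in this guide. As a minimal sketch, assuming you want the job log to confirm that the intended environment is in use, such a script could begin by reporting which interpreter is running. The function name describe_runtime is hypothetical; CONDA_DEFAULT_ENV and CONDA_PREFIX are environment variables set by conda activate.

```python
import importlib.util
import os
import sys


def describe_runtime():
    """Collect basic facts about the interpreter running this script."""
    return {
        "python": sys.executable,
        "prefix": sys.prefix,
        # Set by `conda activate`; None when no environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        # True if pandas can be found by this interpreter.
        "pandas_available": importlib.util.find_spec("pandas") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

If the reported interpreter is not located inside the pandas environment, the activation step in the submission script probably did not take effect.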
Cleaning up package data
The Conda environment manager downloads and stores a sizable amount of data in order to provide packages to the various environments. Even though the package data are shared between the various environments, they still consume space in your or your project’s account.
There are two main sources of unused data, the compressed archives of the packages that Conda stores in its cache when downloading a package, and the data of removed packages. All unused data can be removed with the command
conda clean --all
(The package data are stored in the Conda directory, for instance ${HOME}/conda.)
Updating environments to remove old package versions
As we create new environments, we often install the latest version of each package. However, if the environments are not updated regularly, we may end up with different versions of the same package across multiple environments. If we have the same version of a package installed in all environments, we can save space by removing unused older versions.
To update a package across all environments, use the command
for e in $(conda env list | awk 'FNR>2 {print $1}'); do conda update --name $e <package name>; done
WARNING: Ensure this is really what you want! Sometimes you need a specific (older) version of a package in an environment because of compatibility!
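To see which environments the loop above would touch, it can help to reproduce the awk filter in isolation. The sketch below mimics awk 'FNR>2 {print $1}' on a sample listing; the function name env_names is hypothetical, and the sample text with its paths is illustrative only, not output captured from UBELIX.

```python
def env_names(listing):
    """Mimic `awk 'FNR>2 {print $1}'`: skip two lines, keep the first field."""
    names = []
    for line in listing.splitlines()[2:]:
        fields = line.split()
        # Blank lines yield no name, as they would after shell word splitting.
        if fields:
            names.append(fields[0])
    return names


# Illustrative listing in the general layout printed by `conda env list`.
SAMPLE = """\
# conda environments:
#
base                     /software/Anaconda3
pandas                *  /home/user/.conda/envs/pandas
mkdocs                   /home/user/.conda/envs/mkdocs
"""

if __name__ == "__main__":
    print(env_names(SAMPLE))
```

Note that the result includes base, so the loop also attempts to update the package in the base environment.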
After updating packages, the clean command can be called to remove the data of unused older package versions.
Pip
In some cases Python packages are not available through the conda channels and need to be installed through pip. In this case simply ensure that pip is indeed set to the pip executable within the environment
which pip
~/.conda/envs/<env name>/bin/pip
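The same check can be made from Python, which is convenient inside a script. This is a minimal sketch, assuming it is run with the Python interpreter of the active environment; the helper name is_within is hypothetical.

```python
import os
import shutil
import sys


def is_within(path, prefix):
    """Return True if `path` lies inside the directory `prefix`."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


if __name__ == "__main__":
    # shutil.which searches PATH in the same way the shell does.
    pip_path = shutil.which("pip")
    if pip_path is None:
        print("no pip found on PATH")
    else:
        # sys.prefix is the root directory of the running Python installation.
        print(pip_path, "inside", sys.prefix, "->", is_within(pip_path, sys.prefix))
```

If the check reports False, the pip found on PATH belongs to a different Python installation than the one running the script.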
For instance, assume that a mkdocs project requires the following packages:
mkdocs
mkdocs-minify-plugin
The package mkdocs-minify-plugin is less popular and thus is not available through a Conda channel, but it is available in PyPI. Activate your conda environment and install the required packages with pip
pip install --upgrade mkdocs mkdocs-minify-plugin
inside the environment. The packages will be installed inside a directory that conda created for the Conda environment, for instance ${HOME}/conda/envs/mkdocs, alongside the packages installed by conda. As a result, ‘system-wide’ installations with pip inside a Conda environment do not interfere with system packages.
Do not install packages in Conda environments with pip as a user
User-installed packages (e.g. pip install --user --upgrade mkdocs-minify-plugin) are installed in the same directory for all environments, typically in ~/.local/, and can interfere with other versions of the same package installed from other Conda environments.
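Whether a given interpreter actually picks up user-installed packages can be inspected with the standard site module. The following is a minimal sketch; the function name user_site_report is hypothetical.

```python
import os
import site
import sys


def user_site_report():
    """Describe the per-user site-packages directory for this interpreter."""
    user_site = site.getusersitepackages()
    return {
        "user_site": user_site,
        # ENABLE_USER_SITE is False or None when the directory is disabled.
        "enabled": bool(site.ENABLE_USER_SITE),
        "exists": os.path.isdir(user_site),
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_site_report().items():
        print(f"{key}: {value}")
```

If on_sys_path is True, packages installed with pip install --user are visible to this interpreter and may shadow or conflict with packages installed in the environment.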
Python Virtual Environments (virtualenv)
The default Python is the OS Python
When you log into UBELIX, running python without loading a module or using a container will result in using the operating system Python installation. This is Python 3.9, which can’t be upgraded! Make sure this is what you want before continuing with this section!
Virtualenv is Python’s native way of isolating a particular Python environment from the default one. Each environment resides in a self-contained directory, so multiple virtualenvs can co-exist side by side with different versions of tools or dependencies installed. The downside of this approach is that the Python version can’t be changed. If you need to be able to run specific versions of Python, please use the Conda installation method.
Creating a virtualenv
We can create a virtualenv:
python -m venv --system-site-packages venvs/venv-for-demo
or
virtualenv --system-site-packages venvs/venv-for-demo
We recommend using the --system-site-packages option, as in the examples above. This ensures that already installed packages are used.
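The effect of the option can be seen in the pyvenv.cfg file that is written into the new environment. The sketch below uses the standard venv module to create an environment programmatically and prints that file; the function name create_env is hypothetical, and pip is deliberately not bootstrapped to keep the example quick.

```python
import os
import tempfile
import venv


def create_env(env_dir):
    """Create a virtual environment that can also see system packages."""
    venv.create(
        env_dir,
        system_site_packages=True,
        # Mirror the command-line default: symlinks except on Windows.
        symlinks=(os.name != "nt"),
        # Skipping pip keeps the sketch fast; `python -m venv` installs it.
        with_pip=False,
    )
    return os.path.join(env_dir, "pyvenv.cfg")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(os.path.join(tmp, "venv-for-demo"))
        with open(cfg) as fh:
            print(fh.read())
```

The printed configuration should contain a line setting include-system-site-packages to true, which is what makes the already installed packages visible inside the virtualenv.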
Using a virtualenv
To use a virtualenv it is necessary to activate it:
source venvs/venv-for-demo/bin/activate
(venv-for-demo) [user@cluster:~]$
Once activated the prompt changes and any Python related commands that follow will refer to the Python installation and packages contained/linked in the virtualenv.
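Besides the changed prompt, a script can check for itself whether it is running inside a virtual environment. This is a minimal sketch using only the sys module; the function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the underlying installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("sys.prefix:", sys.prefix)
    print("sys.base_prefix:", sys.base_prefix)
```

Such a check can be placed at the top of a job script to fail early when the activation step was forgotten.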
Installing new packages
To install packages in the virtualenv, it needs to be activated. While it is active, any packages installed with pip are installed inside the virtualenv itself:
source venvs/venv-for-demo/bin/activate
(venv-for-demo) [user@cluster:~]$ pip install biopython
Stop using a virtualenv
To stop using a virtualenv one needs to deactivate it:
(venv-for-demo) [user@cluster:~]$ deactivate
[user@cluster:~]$
Removing a virtualenv
To permanently remove a virtualenv one simply deletes the directory which contains it.
rm -rf venvs/venv-for-demo/
Discouraged installation methods
We strongly discourage installing Python packages directly through the OS Python with pip without using virtual environments as described above.