TensorFlow#
TensorFlow is a machine learning framework.
Security issue of TensorBoard on multi-user systems
For security reasons, do not run TensorBoard on a multi-user system, like our cluster frontends or GPU nodes. Anyone with access to such a system can attach to your TensorBoard port and run code under your account. This is by design and we cannot mitigate the issue.
We recommend running TensorBoard on your local machine and mounting the corresponding filesystem ($HOME, $HPCVAULT, $WORK). For mounting you can use, for example, SSHFS.
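A minimal SSHFS sketch (user, host, and paths are placeholders, substitute your own):
# mount the remote $WORK directory on a local mount point
mkdir -p ~/cluster-work
sshfs <user>@<cluster-frontend>:/path/to/your/work ~/cluster-work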
Workarounds found on GitHub do not fix the problem on multi-user systems:
- Using --host localhost is insecure.
- Using a unique address is insecure.
Install or build packages inside an interactive job on the target cluster
Install or build packages using an interactive job on the target cluster (Alex, Fritz, TinyGPU, Woody) to make sure GPU support and hardware features can be used properly.
For internet access on the compute node, configure a proxy by executing the following lines in your shell inside the interactive job:
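The exact proxy host and port are site-specific; the following is only a sketch with a placeholder address, substitute the proxy documented for your cluster:
export http_proxy=http://<proxy-host>:<port>    # placeholder, use your site's proxy
export https_proxy=http://<proxy-host>:<port>   # placeholder, use your site's proxy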
Installation via pip/conda#
Preparation#
- Start an interactive job on a cluster node. See the cluster documentation page for hints.
- Load the Python module by executing the command sketched after this list. Without loading the Python module, the system-installed Python will be used, which is quite old.
- Optional: create a virtual environment:
  - For conda see conda environments.
  - For Python venv see Virtual environments with venv.
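A sketch of loading the Python module, assuming the module is simply named python (check module avail python for the exact name and versions on your cluster):
module avail python   # list available Python modules
module load python    # load the default Python module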
Installation#
The officially documented ways to install TensorFlow are via pip or via a Docker container.
- pip: install TensorFlow with CUDA support, but without the CUDA Toolkit. Before you can use TensorFlow in your script, the correct versions of the cuda, cudnn, and optionally tensorrt environment modules must be loaded. Which CUDA module version is required you can find:
  - in this table, or
  - by running the code sketched after this list once TensorFlow is installed; its last line prints the required CUDA version.
- pip: install TensorFlow with CUDA support and the CUDA Toolkit (see the sketch after this list).
  - Supported for TensorFlow versions newer than 2.13.1.
  - Does not require loading any environment modules.
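A minimal sketch of the two pip variants (pin the TensorFlow version as needed):
# variant 1: TensorFlow with CUDA support; CUDA/cuDNN are provided by environment modules at runtime
pip install tensorflow
# variant 2: TensorFlow with CUDA support and bundled CUDA libraries (TensorFlow releases newer than 2.13.1)
pip install 'tensorflow[and-cuda]'
The code for printing the required CUDA version can be as simple as the following one-liner, assuming a CUDA-enabled build (see also Determine the needed CUDA version for TensorFlow below):
python3 -c 'import tensorflow as tf; print(tf.sysconfig.get_build_info()["cuda_version"])'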
Installing TensorFlow via conda is not officially mentioned, but conda-forge provides GPU-enabled builds (see the sketch after this list):
- conda over the conda-forge channel: explicitly installing the GPU package. You might need to prefix the command with CONDA_OVERRIDE_CUDA="12.2" and adjust the CUDA version accordingly.
- conda over the conda-forge channel: installing the generic package, but forcing the GPU build. Adjust the CUDA version, here 12.2, as needed.
- Do not install TensorFlow with conda over the Anaconda channel, as TensorFlow GPU is only supported there up to version 2.4.1 (2021).
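The corresponding conda-forge commands might look like the following sketch (build-string selection syntax; adjust the CUDA version, here 12.2, as needed):
# explicitly install the GPU package from conda-forge
CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge tensorflow-gpu
# install the generic package, but force a CUDA-enabled build via the build string
CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge "tensorflow=*=cuda*"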
Test installation#
To check that your TensorFlow installation is functional and detects the GPU(s), execute the following command on a compute node, after loading the Python module and activating the virtual environment you installed it into:
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
# output when GPUs are usable:
## [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
If no GPUs are found, or the cuda and/or cudnn modules are not loaded, a message like the following will be printed:
Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Determine the needed CUDA version for TensorFlow#
Each TensorFlow version requires a corresponding CUDA version. You can look up the specific version in this table or, after you have installed TensorFlow, run one of the following:
- Print the needed CUDA version only; a one-liner for this is sketched after the script below.
- Print the needed CUDA version and other information with the following Python script.
Python script printing the CUDA version and other information:
#!/usr/bin/env python3
import tensorflow as tf

print(f"tf_version: {tf.__version__}")

# query the build information of the installed TensorFlow binary
bi = tf.sysconfig.get_build_info()
for k in ['cuda_version', 'cudnn_version', 'is_cuda_build', 'is_rocm_build']:
    # print each key, or 'not defined' if it is missing from the build info
    print(f"{k}: {bi[k] if k in bi else 'not defined'}")
Troubleshooting#
No GPUs found although a CUDA module is loaded#
You are on a cluster node with GPUs and have a CUDA module loaded, but TensorFlow still does not find any GPUs, as indicated by the output:
... I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
This can happen if the version of the loaded CUDA module is incompatible with the installed TensorFlow. The solution is to load a CUDA module that is compatible with your TensorFlow version. See Determine the needed CUDA version for TensorFlow above for the procedure.
Fix warning "Can't find libdevice
directory" or "libdevice
not found at ./libdevice.10.bc
"#
If your output contains warnings like the following:
... I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x14eb1660f180 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
... I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (0): NVIDIA A40, Compute Capability 8.6
... I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (1): NVIDIA A40, Compute Capability 8.6
... I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
... W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
...llowing directories:
...irectory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
... W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
... W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
... W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
you can fix this issue, as indicated in the output, by setting the environment variable XLA_FLAGS and pointing --xla_gpu_cuda_data_dir= to the installation path of CUDA.
On our systems with a loaded CUDA module this can be done as shown below.
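Assuming the loaded CUDA module exports its installation path in $CUDA_HOME (check module show cuda for the exact variable name on your system), a sketch would be:
export XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_HOME"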
For more information regarding XLA see https://www.tensorflow.org/xla.
Fix warning "Could not find TensorRT"#
If the output contains a warning like "Could not find TensorRT", then load the tensorrt module that matches the version of the loaded CUDA module.
List the available tensorrt modules:
module avail tensorrt
# possible output:
# tensorrt/7.2.3.4-cuda11.0-cudnn8.1 tensorrt/8.5.3.1-cuda11.8-cudnn8.6
If you are using, for example, the CUDA 11.8 module, then load the corresponding tensorrt module.
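Based on the module list above, this would be:
module load tensorrt/8.5.3.1-cuda11.8-cudnn8.6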
Using Docker images#
It is possible to use pre-built Docker images of TensorFlow, e.g. from DockerHub or Nvidia NGC.
However, instead of Docker, Apptainer is used on our systems.
Apptainer can download a Docker container and convert it into its own sif file format.
From DockerHub#
To download and convert the latest TensorFlow container from DockerHub run:
cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull tensorflow-latest.sif docker://tensorflow/tensorflow:latest-gpu
rm -r "$APPTAINER_CACHEDIR"
Valid tags can be found under https://www.tensorflow.org/install/docker if you want to install a different one than latest-gpu.
From Nvidia NGC#
To download and convert the TensorFlow container from Nvidia NGC run:
cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull tensorflow-ngc-23.11-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:23.11-tf2-py3
rm -r "$APPTAINER_CACHEDIR"
To get the latest container, replace the tag 23.11-tf2-py3 with the newest one found on the tag page.
Using the imported container#
Within your job script, you use the container as follows:
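A minimal sketch, assuming your Python script is called train.py (a placeholder name) and GPU access is requested with --nv:
apptainer exec --nv tensorflow-latest.sif python3 train.py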
Replace tensorflow-latest.sif with the filename you used when pulling the container.
The file systems /home and /apps are automatically bind-mounted into the container, i.e. $HOME, $HPCVAULT, and $WORK are available.
For a simple test run:
apptainer exec tensorflow-latest.sif python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
# output when GPUs are usable:
# INFO: underlay of /usr/bin/nvidia-smi required more than 50 (452) bind mounts
# 2023-06-27 17:33:48.814135: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
# To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
## [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]