TensorFlow#
TensorFlow is a machine learning framework.
Security issue of TensorBoard on multi-user systems
For security reasons, do not run TensorBoard on a multi-user system, like our cluster frontends or GPU nodes. Anyone with access to such a system can attach to your TensorBoard port and run code under your account. This is by design and we cannot mitigate the issue.
We recommend running TensorBoard on your local machine and mounting the corresponding filesystem ($HOME, $HPCVAULT, $WORK). For mounting you can use, for example, SSHFS.
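A minimal SSHFS sketch (user, host, and paths are placeholders, substitute your own):
# mount the remote $WORK directory on a local mount point
mkdir -p ~/cluster-work
sshfs <user>@<cluster-frontend>:/path/to/your/work ~/cluster-work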
Workarounds found on GitHub do not fix the problem on multi-user systems:
- Using --host localhost is insecure.
- Using a unique address is insecure.
Install or build packages inside an interactive job on the target cluster
Install or build packages using an interactive job on the target cluster (Alex, Fritz, TinyGPU, Woody) to make sure GPU support and hardware features can be used properly.
For internet access on the compute node, configure a proxy by executing the following lines in your shell inside the interactive job:
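The exact proxy host and port are site-specific; the following is only a sketch with a placeholder address, substitute the proxy documented for your cluster:
export http_proxy=http://<proxy-host>:<port>    # placeholder, use your site's proxy
export https_proxy=http://<proxy-host>:<port>   # placeholder, use your site's proxy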
Installation via pip/conda#
Preparation#
- Start an interactive job on a cluster node. See the cluster documentation page for hints.
- Load the Python module by executing the command sketched after this list. Without loading the Python module, the system-installed Python will be used, which is quite old.
- Optional: create a virtual environment:
  - For conda see conda environments.
  - For Python venv see Virtual environments with venv.
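A sketch of loading the Python module, assuming the module is simply named python (check module avail python for the exact name and versions on your cluster):
module avail python   # list available Python modules
module load python    # load the default Python module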
Installation#
The officially documented ways to install TensorFlow are via pip or via a Docker container.
- pip: install TensorFlow with CUDA support, but without the CUDA Toolkit. Before you can use TensorFlow in your script, the correct versions of the cuda, cudnn, and optionally tensorrt environment modules must be loaded. Which CUDA module version is required you can find:
  - in this table, or
  - by running the code sketched after this list once TensorFlow is installed; its last line prints the required CUDA version.
- pip: install TensorFlow with CUDA support and the CUDA Toolkit (see the sketch after this list).
  - Supported for TensorFlow versions newer than 2.13.1.
  - Does not require loading any environment modules.
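A minimal sketch of the two pip variants (pin the TensorFlow version as needed):
# variant 1: TensorFlow with CUDA support; CUDA/cuDNN are provided by environment modules at runtime
pip install tensorflow
# variant 2: TensorFlow with CUDA support and bundled CUDA libraries (TensorFlow releases newer than 2.13.1)
pip install 'tensorflow[and-cuda]'
The code for printing the required CUDA version can be as simple as the following one-liner, assuming a CUDA-enabled build (see also Determine the needed CUDA version for TensorFlow below):
python3 -c 'import tensorflow as tf; print(tf.sysconfig.get_build_info()["cuda_version"])'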
Installing TensorFlow via conda is not officially mentioned, but conda-forge provides GPU-enabled builds (see the sketch after this list):
- conda over the conda-forge channel: explicitly installing the GPU package. You might need to prefix the command with CONDA_OVERRIDE_CUDA="12.2" and adjust the CUDA version accordingly.
- conda over the conda-forge channel: installing the generic package, but forcing the GPU build. Adjust the CUDA version, here 12.2, as needed.
- Do not install TensorFlow with conda over the Anaconda channel, as TensorFlow GPU is only supported there up to version 2.4.1 (2021).
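The corresponding conda-forge commands might look like the following sketch (build-string selection syntax; adjust the CUDA version, here 12.2, as needed):
# explicitly install the GPU package from conda-forge
CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge tensorflow-gpu
# install the generic package, but force a CUDA-enabled build via the build string
CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge "tensorflow=*=cuda*"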
Test installation#
To check that your TensorFlow installation is functional and detects the GPU(s), execute the following command on a compute node, after loading the Python module and activating the virtual environment you installed it into:
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
# output when GPUs are usable:
## [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
If no GPUs are found, or the cuda and/or cudnn modules are not loaded, a message like the following will be printed:
Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Determine the needed CUDA version for TensorFlow#
Each TensorFlow version requires a corresponding CUDA version. You can look up the specific version in this table or, after you have installed TensorFlow, run one of the following:
- Print the needed CUDA version only; a one-liner for this is sketched after the script below.
- Print the needed CUDA version and other information with the following Python script.
Python script printing the CUDA version and other information:
#!/usr/bin/env python3
import tensorflow as tf

print(f"tf_version: {tf.__version__}")

# query the build information of the installed TensorFlow binary
bi = tf.sysconfig.get_build_info()
for k in ['cuda_version', 'cudnn_version', 'is_cuda_build', 'is_rocm_build']:
    # print each key, or 'not defined' if it is missing from the build info
    print(f"{k}: {bi[k] if k in bi else 'not defined'}")
Troubleshooting#
No GPUs found although a CUDA module is loaded#
You are on a cluster node with GPUs and have a CUDA module loaded, but TensorFlow still does not find any GPUs, as indicated by the output:
... I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
This can happen if the version of the loaded CUDA module is incompatible with the installed TensorFlow. The solution is to load a CUDA module that is compatible with your TensorFlow version. See Determine the needed CUDA version for TensorFlow above for the procedure.
Fix warning "Can't find libdevice
directory" or "libdevice
not found at ./libdevice.10.bc
"#
If your output contains warnings like the following:
... I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x14eb1660f180 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
... I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (0): NVIDIA A40, Compute Capability 8.6
... I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (1): NVIDIA A40, Compute Capability 8.6
... I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
... W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
...llowing directories:
...irectory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
... W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
... W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
... W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
you can fix this issue, as indicated in the output, by setting the environment variable XLA_FLAGS and pointing --xla_gpu_cuda_data_dir= to the installation path of CUDA.
On our systems with a loaded CUDA module this can be done as shown below.
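Assuming the loaded CUDA module exports its installation path in $CUDA_HOME (check module show cuda for the exact variable name on your system), a sketch would be:
export XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_HOME"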
For more information regarding XLA see https://www.tensorflow.org/xla.
Fix warning "Could not find TensorRT"#
If the output contains a warning like "Could not find TensorRT", then load the tensorrt module that matches the version of the loaded CUDA module.
List the available tensorrt modules:
module avail tensorrt
# possible output:
# tensorrt/7.2.3.4-cuda11.0-cudnn8.1 tensorrt/8.5.3.1-cuda11.8-cudnn8.6
If you are using, for example, the CUDA 11.8 module, then load the corresponding tensorrt module.
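Based on the module list above, this would be:
module load tensorrt/8.5.3.1-cuda11.8-cudnn8.6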
Using Docker images#
It is possible to use pre-built Docker images of TensorFlow, e.g. from DockerHub or Nvidia NGC.
However, instead of Docker, Apptainer is used on our systems.
Apptainer can download a Docker container and convert it into its own sif file format.
From DockerHub#
To download and convert the latest TensorFlow container from DockerHub run:
cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull tensorflow-latest.sif docker://tensorflow/tensorflow:latest-gpu
rm -r "$APPTAINER_CACHEDIR"
Valid tags can be found under https://www.tensorflow.org/install/docker if you want to install a different one than latest-gpu.
From Nvidia NGC#
To download and convert the TensorFlow container from Nvidia NGC run:
cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull tensorflow-ngc-23.11-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:23.11-tf2-py3
rm -r "$APPTAINER_CACHEDIR"
To get the latest container, replace the tag 23.11-tf2-py3 with the newest one found on the tag page.
Using the imported container#
Within your job script, you use the container as follows:
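A minimal sketch, assuming your Python script is called train.py (a placeholder name) and GPU access is requested with --nv:
apptainer exec --nv tensorflow-latest.sif python3 train.py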
Replace tensorflow-latest.sif with the filename you used when pulling the container.
The file systems /home and /apps are automatically bind-mounted into the container, i.e. $HOME, $HPCVAULT, and $WORK are available.
For a simple test run:
apptainer exec tensorflow-latest.sif python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
# output when GPUs are usable:
# INFO: underlay of /usr/bin/nvidia-smi required more than 50 (452) bind mounts
# 2023-06-27 17:33:48.814135: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
# To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
## [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]