DLProf#

DLProf (dlprof) is Nvidia's Deep Learning Profiler. It profiles deep learning scripts written in Python with PyTorch or TensorFlow. DLProf Viewer (dlprofviewer) provides a browser-based dashboard for visually analyzing results from dlprof.

For details, see Nvidia's DLProf documentation and release notes.

DLProf's PyTorch support is limited to PyTorch versions < 2.0.

Installation#

Prerequisites

A Python module must be loaded, e.g. by executing

module add python

Installation steps

  1. Optional: load or create a conda environment or virtual environment to install DLProf into.

  2. DLProf requires nvidia-pyindex to be installed first. This package adds an Nvidia repository to pip.

    pip install nvidia-pyindex
    

  3. Install DLProf for the framework you are targeting, i.e. PyTorch or TensorFlow. Choose one:

    # for PyTorch:
    pip install nvidia-dlprof[pytorch]
    # for TensorFlow:
    pip install nvidia-dlprof[tensorflow]
    
  4. Install DLProf Viewer.

    pip install nvidia-dlprofviewer
    

The installation steps as one code snippet:

pip install nvidia-pyindex
# for PyTorch:
pip install nvidia-dlprof[pytorch]
# for TensorFlow:
pip install nvidia-dlprof[tensorflow]
pip install nvidia-dlprofviewer

Usage#

Only profile for a short amount of time.

Profiling can create a lot of data and slow down training or inference. Hence, limit profiling to a short time span: run the profiler only for one epoch, or for a limited number of batches within an epoch.
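One way to limit profiling from the command line is to restrict it to a fixed iteration range. A sketch, assuming DLProf's documented --iter_start/--iter_stop options and a hypothetical train.py:

```shell
# Profile only iterations 50 through 70; earlier iterations serve as
# warm-up and are excluded from the collected data.
dlprof --mode=pytorch --iter_start=50 --iter_stop=70 python train.py
```

Check `dlprof --help` for how iterations are detected on your DLProf version.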

Depending on whether you want to profile PyTorch or TensorFlow, the required steps differ.

Profiling PyTorch scripts#

For PyTorch, add the following code snippet to the script you want to profile:

import torch
import nvidia_dlprof_pytorch_nvtx

# enable NVTX annotations for PyTorch operations
nvidia_dlprof_pytorch_nvtx.init()
...
# put around your training/inference loop
# or the part you want to profile
with torch.autograd.profiler.emit_nvtx():
    <training/inference loop>

This imports the corresponding modules and profiles everything inside the with torch.autograd.profiler.emit_nvtx(): block. You can also place the with torch.autograd.profiler.emit_nvtx(): statement inside the training/inference loop and use it only for a limited number of batches.

Run your script by prepending dlprof --mode=pytorch:

dlprof --mode=pytorch [more dlprof args] <python pytorch script> [args to pytorch script]

Several *.sqlite and *.qdrep files will be created.
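To keep the working directory clean, these artifacts can be redirected into a separate directory. A sketch, assuming DLProf's documented --output_path option and a hypothetical train.py:

```shell
# write all *.sqlite and *.qdrep artifacts to ./dlprof_out
# instead of the current working directory
dlprof --mode=pytorch --output_path=./dlprof_out python train.py
```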

To show a textual report use:

dlprof --report

Profiling TensorFlow (TF) scripts#

For TensorFlow scripts, just prepend dlprof when invoking your script:

# for TensorFlow 1.x
dlprof --mode=tensorflow1 <python tf script> [args to tf script]

# for TensorFlow 2.x
dlprof --mode=tensorflow2 <python tf script> [args to tf script]

Several *.sqlite and *.qdrep files will be created.

To show a textual report use:

dlprof --report

Passing options to Nvidia Nsight Systems (nsys)#

DLProf uses Nvidia Nsight Systems (nsys) for profiling.

You can add options to nsys via --nsys_opts flags:

dlprof --nsys_opts=<nsys options> ...
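For example, nsys's trace selection can be passed through this way. A sketch with a hypothetical train.py, assuming the nsys -t/--trace option:

```shell
# forward "-t cuda,nvtx" to nsys so that only CUDA and NVTX
# activity is traced
dlprof --mode=pytorch --nsys_opts="-t cuda,nvtx" python train.py
```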

Troubleshooting#

Overwrite existing files#

DLProf does not overwrite existing tracing files by default. Specify the flag --force=true to do so:

dlprof --mode=... --force=true ...

Nsight Systems (nsys) errors after profiling#

Errors of the following form from Nsight Systems can indicate that Nsight Systems is incompatible with the CUDA version in use. DLProf installs its own version of Nsight Systems via the nvidia-nsys-cli package; this version is unrelated to the Nsight Systems that ships with the (possibly different) CUDA Toolkit version your application uses.

As a workaround, a newer Nsight Systems version is required. Two options are available:

  • Load a newer CUDA module.
    • Even if your application does not use this CUDA module, DLProf will use the Nsight Systems version it provides.
  • Manually install a newer nvidia-nsys-cli package, i.e. download the nvidia_nsys_cli ... .whl file from https://developer.download.nvidia.com/devtools/nsight-systems/ and install it via pip.
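The second option could look like the following sketch; the exact wheel file name is version-dependent and must be taken from the download page:

```shell
# download the wheel from the Nsight Systems download page, then:
pip install nvidia_nsys_cli-<version>-py3-none-any.whl
```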

Errors from Nsight Systems, indicating an incompatibility with the used CUDA version:

Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/build/agent/work/20a3cfcd1c25021d/QuadD/Host/Analysis/Modules/StringStorage.cpp(149): Throw in function QuadDCommon::StringId QuadDAnalysis::StringStorage::GetKeyForExterior
Id(QuadDAnalysis::GlobalProcess, QuadDAnalysis::StringStorage::ExteriorId) const\nDynamic exception type: boost::exception_detail::clone_impl<QuadDCommon::LogicException>\nstd::exception::wh
at: LogicException\n[QuadDCommon::tag_message*] = Cannot find bucket for a bucket index\n"
      }
    }
  }
}
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
  }
}
Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/build/agent/work/20a3cfcd1c25021d/QuadD/Host/Analysis/Modules/TraceProcessEvent.cpp(45): Throw in function const string& {anonymous}::GetCudaCallbackName(bool, uint32_t, con
st QuadDAnalysis::MoreInjection&)\nDynamic exception type: boost::exception_detail::clone_impl<QuadDCommon::InvalidArgumentException>\nstd::exception::what: InvalidArgumentException\n[QuadDC
ommon::tag_message*] = Unknown runtime API function index: 430\n"
      }
    }
  }
}

Viewing profile results with DLProf Viewer#

Warning

Running dlprofviewer opens a dashboard at http://127.0.0.1:8000, which is accessible to all users who currently have jobs scheduled on the same compute node.

Run dlprofviewer:

dlprofviewer <path>/dlprof_dldb.sqlite

This starts the dashboard as a web application that you can access with your browser at http://127.0.0.1:8000.

For more details, see Nvidia's official documentation.

Viewing remote dashboards#

If you are running dlprofviewer on a front end or cluster node and want to view the dashboard in your local browser, you have to set up port forwarding via SSH. The port forwarding tunnels the connection to the dashboard from your local machine to the corresponding cluster node.

Prerequisites

If you are running dlprofviewer on a cluster node, make sure you have configured the Template for connecting to cluster nodes in your local ~/.ssh/config and are able to connect to a cluster node from your local machine.

Setup

For setting up the port forwarding the steps are:

  1. On the front end or cluster node, obtain the fully qualified domain name (FQDN), referred to below as remote-fqdn:

    hostname
    # output can look like one of the following lines:
    # <name>.nhr.fau.de  
    # <name>.rrze.fau.de
    # <name>.rrze.uni-erlangen.de
    
  2. On the front end or cluster node, if not already started, start dlprofviewer:

    dlprofviewer <path>/dlprof_dldb.sqlite
    
  3. On your local machine create a port forwarding to the remote machine:

    ssh -L 8000:localhost:8000 <remote-fqdn>
    

    This opens a new SSH session to the remote machine and forwards local port 8000 to it. When you exit the terminal, i.e. when the last connection terminates, the port forwarding stops as well.

  4. Open a browser on your local machine and navigate to http://localhost:8000. You should now see the DLProf Viewer dashboard.