Benchmark_nnU-Net_for_PyTorch
Benchmark of nnU-Net for PyTorch on Berzelius
It is based on the NVIDIA NGC implementation of nnU-Net for PyTorch.
Latest Version 21.11.0
Modified February 3, 2022
See NVIDIA Deep Learning Examples.
On local computer
- Clone the repository
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Segmentation/nnUNet
- Build the nnU-Net PyTorch NGC container
docker build -t nnunet .
- Push the container to Docker Hub
docker tag nnunet:latest xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
docker push xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
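Optionally, before pushing, it may be worth checking that the freshly built image can see the GPUs (this assumes the NVIDIA Container Toolkit is installed locally); docker push also requires being logged in to the target Docker Hub account:
docker run --rm --gpus all nnunet nvidia-smi
docker login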
On Berzelius
- Clone the repository
cd /proj/nsc/xuan/ngc
git clone https://github.com/NVIDIA/DeepLearningExamples
cd /proj/nsc/xuan/ngc/DeepLearningExamples/PyTorch/Segmentation/nnUNet
- Pull from xuagu37 and run the image
singularity pull nvidia_nnu-net_for_pytorch.sif docker://xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
singularity shell -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif
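The -B bind mounts assume that data and results directories already exist on the host; if not, create them first:
mkdir -p ${PWD}/data ${PWD}/results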
- Prepare BraTS dataset
python download.py --task 01
python preprocess.py --task 01 --dim 2
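These two commands are run inside the container (e.g. from the singularity shell session above). Alternatively, they can be run non-interactively with singularity exec, using the same bind mounts:
singularity exec -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif python download.py --task 01
singularity exec -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif python preprocess.py --task 01 --dim 2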
- Run the script.
You need to modify the script as needed, e.g. the reservation name, the number of nodes, the batch size, etc.; a sketch of what such a script might contain is shown after the commands below.
cd /proj/nsc/xuan/ngc/DeepLearningExamples/PyTorch/Segmentation/nnUNet
sbatch benchmark_nnunet_pytorch_berzelius.sh
sbatch benchmark_nnunet_pytorch_berzelius_multi_node.sh
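As a rough illustration, the single-node script could look like the following sketch. The reservation name, Slurm resource flags, and benchmark arguments are assumptions; the --mode/--gpus/--dim/--batch_size flags follow the NVIDIA repository's scripts/benchmark.py and should be checked against your version.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus=8
#SBATCH --time=00:30:00
#SBATCH --reservation=<your_reservation>

cd /proj/nsc/xuan/ngc/DeepLearningExamples/PyTorch/Segmentation/nnUNet

# Launch the benchmark inside the container; flags are assumptions based on the repo README.
singularity exec -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif \
    python scripts/benchmark.py --mode train --gpus 8 --dim 2 --batch_size 128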
Results
We collect benchmark results of throughput (images/sec) for
- Precisions = TF32, AMP
- Dimension = 2
- Nodes = 1, 2
- GPUs = 1 - 8 (for 1 node), 16 (for 2 nodes)
- Batch size = 1, 2, 4, 8, 16, 32, 64, 128
TF32 (TensorFloat-32) mode accelerates FP32 convolutions and matrix multiplications and is the default mode for 32-bit training on the Ampere GPU architecture.
AMP (Automatic Mixed Precision) offers significant computational speedup by performing operations in half precision (FP16), while keeping critical parts of the network in single precision (FP32) to retain as much information as possible.
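For reference, TF32 is on by default on the A100 GPUs and can be disabled globally to force true FP32 math via the NVIDIA_TF32_OVERRIDE environment variable, while AMP is selected with the benchmark's --amp flag (flag name per the NVIDIA repository README; verify against your version). A minimal sketch:
# Force true FP32 (disable TF32) for a comparison run
export NVIDIA_TF32_OVERRIDE=0

# Run the same benchmark with AMP enabled
python scripts/benchmark.py --mode train --gpus 8 --dim 2 --batch_size 128 --amp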
We run 100 iterations for each set of parameters.
Observation 1: when batch_size is small (1, 2, 4, 8), throughput_amp ≈ throughput_tf32;
when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.

Observation 2: The coefficient of variation of throughput for 100 iterations is smallest when batch_size = 128.

Benchmarking with dim = 2, nodes = 1, 2, gpus = 8, batch_size = 128 can be used as a node health check (a sketch of an automated check follows below).
- The expected throughput for dim = 2, nodes = 1, gpus = 8, batch_size = 128 is 4700 ± 500 images/sec (TF32).
- The expected throughput for dim = 2, nodes = 2, gpus = 16, batch_size = 128 is 9250 ± 150 images/sec (TF32).
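A minimal sketch of how such a health check could be automated against the single-node band above; the MEASURED value is a placeholder that would be parsed from the benchmark output:
MEASURED=4650                      # throughput (images/sec) parsed from the benchmark output
LOW=4200; HIGH=5200                # 4700 +/- 500 for TF32, dim=2, nodes=1, gpus=8, batch_size=128
if [ "$MEASURED" -ge "$LOW" ] && [ "$MEASURED" -le "$HIGH" ]; then
    echo "Node health check: PASS (${MEASURED} images/sec)"
else
    echo "Node health check: FAIL (${MEASURED} images/sec outside ${LOW}-${HIGH})"
fi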
Observation 3: Ideally, throughput would scale linearly as batch_size increases. In practice, throughput stays below the ideal curve when batch_size > 16.

Observation 4: Ideally, throughput would scale linearly as the number of GPUs increases. In practice, throughput stays below the ideal curve as more GPUs are added.

Notes
- It seems that running the benchmark interactively via singularity shell gives worse performance (observed when working from home); run it via an sbatch script instead.
- It took around a week to finish 100 iterations of benchmarking for all sets of parameters.
- For multi-node benchmarking, we need to launch with the "srun" command and add the line "#SBATCH --ntasks-per-node=8"; otherwise the process will hang. See the sketch after this list.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128 takes about 2 minutes. To finish within a minute, reduce the number of batches from the default of 150 to a smaller value, or try a smaller dataset.
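For the multi-node note above, the relevant Slurm lines could look like the sketch below; the resource flags and the benchmark arguments (including --nodes) are assumptions and should be adapted to the actual benchmark_nnunet_pytorch_berzelius_multi_node.sh.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00

# srun is required for the multi-node launch; without it (and --ntasks-per-node=8) the job hangs.
srun singularity exec -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif \
    python scripts/benchmark.py --mode train --gpus 8 --dim 2 --batch_size 128 --nodes 2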