Berzelius nnU-Net Benchmark
The benchmarking is based on the NVIDIA NGC nnU-Net for PyTorch container, version 21.11.0.
VERSION=21.11.0
On a local computer (optional)
- Download the code
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/nnunet_for_pytorch/versions/$VERSION/zip -O /tmp/nnunet_for_pytorch_$VERSION.zip
mkdir ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION
unzip /tmp/nnunet_for_pytorch_$VERSION.zip -d ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
cd ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
- Build the nnU-Net PyTorch Docker container
Change the pytorch-lightning version to 1.5.10 to avoid the "from torchmetrics.utilities.data import get_num_classes as _get_num_classes" error. Add ENV PYTHONNOUSERSITE=True to the Dockerfile to disable user site-packages (a sketch of both changes follows the build command below).
docker build -t nnunet .
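The two edits can be sketched as follows and should be applied before running docker build; the assumption that the pytorch-lightning pin lives in requirements.txt is ours, so adjust the file name if it differs:
# Pin pytorch-lightning to 1.5.10 (assumed to be listed in requirements.txt).
sed -i 's/^pytorch-lightning.*/pytorch-lightning==1.5.10/' requirements.txt
# Disable user site-packages inside the image.
echo 'ENV PYTHONNOUSERSITE=True' >> Dockerfile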
- Push the container to Docker Hub
docker tag nnunet:latest berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
docker push berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
On Berzelius
- Create directories
cd /proj/nsc_testing/xuan/DeepLearningExamples/
git clone https://gitlab.liu.se/xuagu37/Berzelius-nnU-Net-Benchmark.git
cd Berzelius-nnU-Net-Benchmark
mkdir data results
Docker is not available on Berzelius. We use Apptainer or Enroot.
- Prepare the dataset
With Apptainer
apptainer pull nvidia_nnu-net_for_pytorch_$VERSION.sif docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python download.py --task 01"
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python preprocess.py --task 01 --dim 2 && python preprocess.py --task 01 --dim 3"
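With Enroot, the same preparation can be sketched as below; the nnunet.sqsh image name and the container name are our own placeholders, and the flags may need adjusting for the local Enroot configuration:
enroot import --output nnunet.sqsh docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
enroot create --name nnunet nnunet.sqsh
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python download.py --task 01"
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python preprocess.py --task 01 --dim 2 && python preprocess.py --task 01 --dim 3"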
- Submit the job to Berzelius
You can choose either Apptainer (Singularity) or Enroot in the scripts benchmark_single_node.sbatch and benchmark_multi_node.sbatch.
Change the following settings in benchmark_sbatch_submit.sh (example values are sketched after the list):
- Data dimension,
- Number of nodes,
- Number of GPUs used per node,
- Number of iterations for each parameter setting,
- Batch size.
We average the benchmark performance over the iterations. The maximum usable batch size (without an OOM error) is 256 for single-node and 128 for multi-node runs.
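For example, the settings in benchmark_sbatch_submit.sh might look like the sketch below; the variable names are illustrative, so check the script for the actual ones:
dim=2            # data dimension (2 or 3)
nodes=1          # number of nodes
gpus=8           # GPUs used per node
iterations=100   # iterations per parameter setting
batch_size=256   # 256 for single node, 128 for multi-node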
mkdir -p sbatch_out
bash scripts/run_benchmark.sh
Results
We collect benchmark results of throughput (images/sec) for:
- Precisions = TF32, AMP
- Dimension = 2
- Nodes = 1, 2, 3, 4, 5, 6, 7, 8
- GPUs = 1-8 (for 1 node), all GPUs (for multi-node)
- Batch size = 1, 2, 4, 8, 16, 32, 64, 128, 256
TF32 (TensorFloat-32) mode accelerates FP32 convolutions and matrix multiplications; it is the default option for AI training with 32-bit variables on the NVIDIA Ampere GPU architecture.
AMP (Automatic Mixed Precision) offers significant computational speedup by performing operations in half-precision (FP16) format, while storing minimal information in single precision (FP32) to retain as much information as possible in critical parts of the network.
We run 100 iterations for each set of parameters. Please see the results in benchmark_table.xlsx.
Observation 1: Ideally, the improvement in throughput would be linear as the number of GPUs increases.
In practice, throughput stays below the ideal curve as the number of GPUs increases.

Observation 2: when batch_size is small (1, 2, 4, 8), throughput_amp ≈ throughput_tf32;
when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.

Observation 3: Benchmark results are more stable with larger batch_size.

The coefficient of variation is calculated as the ratio of the standard deviation to the mean; it shows the extent of variability in relation to the mean of the population.
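As a quick check, the coefficient of variation of a column of throughput values (collected here into a hypothetical throughput.txt, one value per line) can be computed with awk:
# Population standard deviation of column 1 divided by its mean.
awk '{ s += $1; ss += $1*$1; n++ } END { m = s/n; print sqrt(ss/n - m*m)/m }' throughput.txt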
Observation 4: Ideally, the improvement of throughput would be linear when batch_size increases. In practice, throughput stays below the ideal curve when batch_size > 16.

Notes
- Line 116 of DeepLearningExamples/PyTorch/Segmentation/nnUNet/main.py should be changed:
trainer.test(model, test_dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)
to
trainer.test(model, dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)
Ref: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html
- Running the benchmark directly via singularity shell seems to give worse performance (observed while working from home); run it via an sbatch script instead.
- It took around a week to finish 100 iterations of benchmarking for all sets of parameters.
- For multi-node benchmarking, we need to use the srun command; also, the line "#SBATCH --ntasks-per-node=8" has to be added, otherwise the process will hang. A minimal sketch is included at the end of these notes.
- Use as large a batch_size as possible for a more stable benchmark result. For single node, use 256; for multi-node, use 128.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128 or 256 takes about 2 minutes.
- Specify the paths for enroot cache and data, see this page.
- (20220222) srun enroot ... stopped working for the multi-node case. Use pyxis instead; see the script benchmark_multi_node.sbatch.
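For reference, a minimal multi-node sbatch sketch using srun with the pyxis plugin; the resource flags, time limit, and training command are indicative only, and benchmark_multi_node.sbatch in this repository is the authoritative version:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # one task per GPU; required, otherwise the process hangs
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
# pyxis pulls the container and mounts the project directories for every task.
srun --container-image=berzeliushub/nvidia_nnu-net_for_pytorch:21.11.0 \
     --container-mounts=${PWD}/data:/data,${PWD}/results:/results \
     --container-workdir=/workspace/nnunet_pyt \
     bash -c "python main.py --exec_mode train --task 01 --dim 2 --nodes 2 --gpus 8 --batch_size 128 --amp --benchmark"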