Berzelius nnU-Net Benchmark

The benchmark is based on NVIDIA NGC nnU-Net for PyTorch v21.11.0.


On local computer (optional)

  • Download the code
wget --content-disposition$VERSION/zip -O /tmp/nnunet_for_pytorch_$
mkdir ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION
unzip /tmp/nnunet_for_pytorch_$ -d ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/ 
cd ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/ 
  • Build the nnU-Net PyTorch Docker container

Change the pytorch-lightning version to 1.5.10 to avoid the import error raised by the statement from ... import get_num_classes as _get_num_classes.

Add ENV PYTHONNOUSERSITE=True to the Dockerfile so that Python ignores packages installed in the user site-packages directory.

docker build -t nnunet .
  • Push the container to Docker Hub
docker tag nnunet:latest berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
docker push berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION

On Berzelius

  • Create directories
cd /proj/nsc_testing/xuan/DeepLearningExamples/
git clone
cd Berzelius-nnU-Net-Benchmark
mkdir data results

Docker is not available on Berzelius. We use Apptainer or Enroot.

  • Prepare the dataset

With Apptainer

apptainer pull nvidia_nnu-net_for_pytorch_$VERSION.sif docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python --task 01"
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python --task 01 --dim 2 && python --task 01 --dim 3"
  • Submit the job to Berzelius

You can choose either singularity or enroot in the scripts benchmark_single_node.sbatch and benchmark_multi_node.sbatch.

Change the following settings in

  1. Data dimension,
  2. Number of nodes,
  3. Number of GPUs used per node,
  4. Number of iterations for each parameter setting,
  5. Batch size.

We average the benchmark performance over the iterations. The maximum usable batch size (without an OOM error) is 256 for single-node runs and 128 for multi-node runs.

mkdir -p sbatch_out
bash scripts/


We collect benchmark results of throughput (images/sec) for

  • Precisions = TF32, AMP
  • Dimension = 2
  • Nodes = 1, 2, 3, 4, 5, 6, 7, 8
  • GPUs = 1 - 8 (for 1 node), all GPUs (for multi-node)
  • Batch size = 1, 2, 4, 8, 16, 32, 64, 128, 256
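The parameter sets listed above can be enumerated as a grid. The following is a minimal sketch, not the code used by the benchmark scripts: the dictionary keys and function names are hypothetical, and it assumes 8 GPUs per node, that multi-node runs start at 2 nodes, and that multi-node runs are capped at batch size 128 as stated earlier.

```python
from itertools import product

PRECISIONS = ["tf32", "amp"]
DIM = 2
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def single_node_grid():
    """One node, 1 to 8 GPUs, every listed batch size."""
    return [
        {"precision": p, "dim": DIM, "nodes": 1, "gpus": g, "batch_size": b}
        for p, g, b in product(PRECISIONS, range(1, 9), BATCH_SIZES)
    ]


def multi_node_grid(gpus_per_node=8, max_batch_size=128):
    """2 to 8 nodes, all GPUs on every node, batch size capped (assumption)."""
    sizes = [b for b in BATCH_SIZES if b <= max_batch_size]
    return [
        {"precision": p, "dim": DIM, "nodes": n, "gpus": gpus_per_node, "batch_size": b}
        for p, n, b in product(PRECISIONS, range(2, 9), sizes)
    ]


if __name__ == "__main__":
    print(len(single_node_grid()), "single-node parameter sets")
    print(len(multi_node_grid()), "multi-node parameter sets")
```

Counting the parameter sets this way gives a rough idea of how much total runtime to expect before submitting the jobs.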

TF32 (TensorFloat-32) mode accelerates FP32 convolutions and matrix multiplications. It is the default mode for AI training with 32-bit variables on the Ampere GPU architecture.

AMP (Automatic Mixed Precision) offers a significant computational speedup by performing operations in half-precision (FP16) format, while keeping a minimal amount of data in single-precision (FP32) to retain as much information as possible in critical parts of the network.
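To make the difference between the two number formats concrete: TF32 keeps the 8 exponent bits of FP32 but only 10 mantissa bits, while FP16 has 5 exponent bits and 10 mantissa bits. The sketch below emulates both on the CPU with the standard library. It is an illustration only, with hypothetical helper names; in particular, to_tf32 simply truncates the low mantissa bits, which is an assumption and not necessarily how the GPU rounds.

```python
import struct


def to_tf32(x):
    """Approximate TF32 by zeroing the low 13 mantissa bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp16(x):
    """Round-trip a value through IEEE half precision (raises OverflowError if too large)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    small_step = 1.0 + 2.0 ** -11  # needs 11 mantissa bits
    print("tf32:", to_tf32(small_step), "fp16:", to_fp16(small_step))

    large = 70000.0  # beyond the FP16 range, well inside the TF32 range
    print("tf32:", to_tf32(large))
    try:
        print("fp16:", to_fp16(large))
    except OverflowError as exc:
        print("fp16 overflow:", exc)
```

Both formats lose the same small increments, but only FP16 runs out of range for large values, which is one reason AMP keeps part of the computation in FP32.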

We run 100 iterations for each set of parameters. Please see the results in benchmar_table.xlsx.

Observation 1: Ideally, throughput would increase linearly with the number of GPUs.
In practice, throughput stays below the ideal curve as the number of GPUs increases.
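One way to quantify the gap to the ideal curve is scaling efficiency: measured throughput divided by the throughput that linear scaling from the smallest configuration would give. The following is a minimal sketch; the function name is hypothetical and the numbers in the example are placeholders, not measured results.

```python
def scaling_efficiency(throughput_by_gpus):
    """Map GPU count -> measured throughput / ideal (linear) throughput.

    The ideal curve is extrapolated linearly from the smallest GPU count present.
    """
    base_gpus = min(throughput_by_gpus)
    per_gpu = throughput_by_gpus[base_gpus] / base_gpus
    return {
        gpus: throughput / (per_gpu * gpus)
        for gpus, throughput in sorted(throughput_by_gpus.items())
    }


if __name__ == "__main__":
    # Placeholder throughputs (images/sec) for illustration only.
    example = {1: 100.0, 2: 190.0, 4: 360.0, 8: 640.0}
    for gpus, efficiency in scaling_efficiency(example).items():
        print(f"{gpus} GPUs: {efficiency:.0%} of ideal")
```

An efficiency of 1.0 means perfect linear scaling; values below 1.0 correspond to the measured curve falling under the ideal one.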

Observation 2: when batch_size is small (1, 2, 4, 8), throughput_amp ≈ throughput_tf32;
when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.

Observation 3: Benchmark results are more stable with a larger batch_size.

The coefficient of variation is the ratio of the standard deviation to the mean. It shows the extent of variability relative to the mean of the population; a lower value means more stable results.
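The calculation can be sketched as follows for the throughput values collected over the iterations of one parameter set. This is an illustration with a hypothetical function name; it uses the population standard deviation (pstdev), which is an assumption, and statistics.stdev could be substituted for the sample standard deviation.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(samples):
    """Return std / mean for a sequence of throughput measurements."""
    mean = fmean(samples)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(samples) / mean


if __name__ == "__main__":
    # Placeholder throughputs (images/sec) for illustration only.
    iterations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(f"mean = {fmean(iterations):.2f}")
    print(f"cv   = {coefficient_of_variation(iterations):.2f}")
```

Because the value is normalized by the mean, it can be compared across batch sizes whose absolute throughputs differ widely.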

Observation 4: Ideally, throughput would increase linearly with batch_size. In practice, throughput stays below the ideal curve when batch_size > 16.


  • Line 116 of DeepLearningExamples/PyTorch/Segmentation/nnUNet/ should be changed from
    trainer.test(model, test_dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)
    to
    trainer.test(model, dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)
  • Running the benchmark interactively via singularity shell seems to give worse performance (observed while working from home). Run it via an sbatch script instead.
  • It took around a week to finish 100 iterations of benchmarking for all sets of parameters.
  • For multi-node benchmarking, the srun command must be used, and the line "#SBATCH --ntasks-per-node=8" has to be added. Otherwise the process will hang.
  • Use as large a batch_size as possible for a more stable benchmark result. For single node, use 256; for multi-node, use 128.
  • Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128, 256 takes about 2 minutes.
  • Specify the paths for enroot cache and data, see this page.
  • (2022-02-22) srun enroot ... stopped working for the multi-node case. Use pyxis instead. See the script benchmark_multi_node.sbatch.