# Berzelius nnU-Net Benchmark
The benchmarking is based on [Nvidia NGC nnU-net for Pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/resources/nnunet_for_pytorch) v21.11.0. The commands below refer to this version through the shell variable `VERSION`.
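Set the variable once in your shell before running the commands:
```
VERSION=21.11.0
```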
### On a local computer (optional)
- Download the code
```
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/nnunet_for_pytorch/versions/$VERSION/zip -O /tmp/nnunet_for_pytorch_$VERSION.zip
mkdir ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION
unzip /tmp/nnunet_for_pytorch_$VERSION.zip -d ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
cd ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
```
- Build the nnU-Net PyTorch Docker container
```
docker build -t nnunet .
```
- Push the container to Docker Hub
```
docker tag nnunet:latest berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
docker push berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
```
### On Berzelius
- Create directories
```
cd /proj/nsc_testing/xuan/DeepLearningExamples/
git clone https://gitlab.liu.se/xuagu37/Berzelius-nnU-Net-Benchmark.git
cd Berzelius-nnU-Net-Benchmark
mkdir data results
```
<!-- - Clone the repository
```
cd /proj/nsc_testing/xuan/ngc
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Segmentation/nnUNet
mkdir data results
``` -->
Docker is not available on Berzelius. We use Apptainer or Enroot.
- Prepare the dataset
With Apptainer
```
apptainer pull nvidia_nnu-net_for_pytorch.sif docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python download.py --task 01 && python preprocess.py --task 01 --dim 2"
```
With Enroot
```
enroot import "docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION"
enroot create --name nnunet berzeliushub+nvidia_nnu-net_for_pytorch+$VERSION.sqsh
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python download.py --task 01 && python preprocess.py --task 01 --dim 2"
```
<!-- Using singularity
```
singularity pull nvidia_nnu-net_for_pytorch.sif docker://xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
singularity shell -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif
```
Or using enroot
```
enroot import 'docker://xuagu37/nvidia_nnu-net_for_pytorch:21.11.0'
enroot create --name nnunet xuagu37+nvidia_nnu-net_for_pytorch+21.11.0.sqsh
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet
```
- Prepare BraTS dataset (within the image)
```
python download.py --task 01
python preprocess.py --task 01 --dim 2
```
Exit the image.
-->
- For benchmarking purposes, we use copies of a single image
```
bash scripts/copy_data_for_benchmark.sh
```

The input arguments are:

We will average the benchmark performance over the iterations. The maximum usable batch size (without an OOM error) is 256 for single-node and 128 for multi-node runs.
```
cd Berzelius-nnU-Net-Benchmark && mkdir -p sbatch_out
bash scripts/benchmark_sbatch_submit.sh 1 8 100 128
```
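For illustration, the per-iteration JSON logs could be averaged along these lines. This is a sketch, not a script from the repository, and the `throughput_train` field name is an assumption about the log format:
```
# Sketch: average throughput over the TF32 iterations of one configuration.
# Assumes each result file is a JSON object with a "throughput_train" field (assumed name).
for f in results/benchmark_dim2_nodes1_gpus8_batchsize128_tf32_iteration*.json; do
  jq '.throughput_train' "$f"
done | awk '{ sum += $1; n++ } END { if (n > 0) print "mean throughput:", sum / n }'
```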
We run 100 iterations for each set of parameters. Please see the results in benc
**Observation 1**: Ideally, the improvement of throughput would be linear when the number of GPUs increases.
In practice, throughput stays below the ideal curve as the number of GPUs increases.
<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/74d9160cec1caaf2c4531db3ae6096b518229b32/figures/benchmark_throughput_gpus_ideal.png" width="800">
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/74d9160cec1caaf2c4531db3ae6096b518229b32/figures/benchmark_throughput_gpus_ideal.png" width="800">
**Observation 2**: when batch_size is small (1, 2, 4, 8), throughput_amp ≈ throughput_tf32;
when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.
<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/3a4941c09c5280ef3749d44b3af14dcccacc38f7/figures/benchmark_throughput_batch_size.png" width="800">
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/3a4941c09c5280ef3749d44b3af14dcccacc38f7/figures/benchmark_throughput_batch_size.png" width="800">
**Observation 3**: Benchmark results are more stable with a larger batch_size.
<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/e62617c63bfb4d167a78faf84156956bbc8f52bb/figures/benchmark_throughput_cv.png" width="800">
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/e62617c63bfb4d167a78faf84156956bbc8f52bb/figures/benchmark_throughput_cv.png" width="800">
The coefficient of variation is the ratio of the standard deviation to the mean; it shows the extent of variability relative to the mean of the measurements.
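In symbols, with $\sigma$ the standard deviation and $\mu$ the mean of the measured throughput:

$$c_v = \frac{\sigma}{\mu}$$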
**Observation 4**: Ideally, the improvement of throughput would be linear when batch_size increases. In practice, throughput stays below the ideal curve when batch_size > 16.
<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/ec0f070f718c05d46c6090cc3f8d6ebb29f93725/figures/benchmark_throughput_batch_size_ideal.png" width="800">
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/ec0f070f718c05d46c6090cc3f8d6ebb29f93725/figures/benchmark_throughput_batch_size_ideal.png" width="800">
Ref: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html
- For multi-node benchmarking, we need to use the `srun` command and add the line `#SBATCH --ntasks-per-node=8`; otherwise the process will hang.
- Use as large a batch_size as possible for a more stable benchmark result: 256 for single-node, 128 for multi-node.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128, 256 takes ~2mins.
- Specify the paths for the Enroot cache and data; see this [page](https://gitlab.liu.se/berzeliushub/run-pytorch-and-tensorflow-containers-with-nvidia-enroot#set-path-to-user-container-storage).
- (20220222) ```srun enroot ...``` stopped working for the multi-node case. Use pyxis instead; see the script ```benchmark_multi_node.sbatch``` and the sketch below.
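For reference, a minimal sketch of what such a pyxis-based multi-node job script could look like; the image reference, mounts, and benchmark arguments mirror the single-node script below and are assumptions, not the verbatim contents of ```benchmark_multi_node.sbatch```:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8   # required for multi-node runs, otherwise the process hangs
#SBATCH --time=0-0:20:00

# pyxis pulls the container and starts one task per GPU; dim/batch_size are example values.
srun --container-image=berzeliushub/nvidia_nnu-net_for_pytorch:21.11.0 \
     --container-mounts=${PWD}/data:/data,${PWD}/results:/results \
     bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus 8 --dim 2 --batch_size 128 --nodes 2 --logname='benchmark_multi_node_example.json'"
```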
### scripts/benchmark_single_node.sbatch

```
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --time=0-0:10:00
#####SBATCH --reservation=bt-xuan_1node_20221020_0900
#SBATCH --reservation=devel
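# Positional arguments: $1 = dim, $2 = nodes, $3 = gpus, $4 = batch_size, $5 = iteration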
# For apptainer
#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
#apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname="benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json""
rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json'"
#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
#apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname="benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json"
rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json'"
# For enroot
rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json'"
#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
#enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json'"
rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json'"
#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
#enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json'"
```