diff --git a/README.md b/README.md
index 8238bd2e77bfa62ad6ea5604ceda9fe11a68c6f5..2c3ec9558a32dfe8920769077c5af09b4bb8c3fc 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,15 @@
 # Berzelius nnU-Net Benchmark

-The benchmarking is based on [Nvidia NGC nnU-net for Pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/resources/nnunet_for_pytorch) v21.11.0.
+The benchmarking is based on [Nvidia NGC nnU-net for Pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/resources/nnunet_for_pytorch) v$VERSION.
+Set the shell variable `VERSION=21.11.0` before running the commands below.

 ### On local computer (optional)

 - Download the code
 ```
-wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/nnunet_for_pytorch/versions/21.11.0/zip -O /tmp/nnunet_for_pytorch_21.11.0.zip
-mkdir /samsung1t/ngc/nnunet_for_pytorch_21.11.0
-unzip /tmp/nnunet_for_pytorch_21.11.0.zip -d /samsung1t/ngc/nnunet_for_pytorch_21.11.0/
-cd /samsung1t/ngc/nnunet_for_pytorch_21.11.0/
+wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/nnunet_for_pytorch/versions/$VERSION/zip -O /tmp/nnunet_for_pytorch_$VERSION.zip
+mkdir ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION
+unzip /tmp/nnunet_for_pytorch_$VERSION.zip -d ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
+cd ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
 ```

 - Build the nnU-Net PyTorch Docker container
@@ -18,63 +19,37 @@ docker build -t nnunet .

 - Push the container to Docker Hub
 ```
-docker tag nnunet:latest xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
-docker push xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
+docker tag nnunet:latest berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
+docker push berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
 ```

 ### On Berzelius

 - Create directories
 ```
-cd /proj/nsc_testing/xuan
+cd /proj/nsc_testing/xuan/DeepLearningExamples/
 git clone https://gitlab.liu.se/xuagu37/Berzelius-nnU-Net-Benchmark.git
 cd Berzelius-nnU-Net-Benchmark
 mkdir data results
 ```
-<!-- - Clone the repository
-```
-cd /proj/nsc_testing/xuan/ngc
-git clone https://github.com/NVIDIA/DeepLearningExamples
-cd DeepLearningExamples/PyTorch/Segmentation/nnUNet
-mkdir data results
-``` -->
-Docker is not available on Berzelius. We us Apptainer or Enroot.
+Docker is not available on Berzelius. We use Apptainer or Enroot.
 - Prepare the dataset

 With Apptainer
 ```
-apptainer pull nvidia_nnu-net_for_pytorch.sif docker://xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
+apptainer pull nvidia_nnu-net_for_pytorch.sif docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
 apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python download.py --task 01 && python preprocess.py --task 01 --dim 2"
 ```

 With Enroot
 ```
-enroot import 'docker://xuagu37/nvidia_nnu-net_for_pytorch:21.11.0'
-enroot create --name nnunet xuagu37+nvidia_nnu-net_for_pytorch+21.11.0.sqsh
+enroot import "docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION"
+enroot create --name nnunet berzeliushub+nvidia_nnu-net_for_pytorch+$VERSION.sqsh
 enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python download.py --task 01 && python preprocess.py --task 01 --dim 2"
 ```
-<!-- Using singularity
-```
-singularity pull nvidia_nnu-net_for_pytorch.sif docker://xuagu37/nvidia_nnu-net_for_pytorch:21.11.0
-singularity shell -B ${PWD}/data:/data -B ${PWD}/results:/results --nv nvidia_nnu-net_for_pytorch.sif
-```
-Or using enroot
-```
-enroot import 'docker://xuagu37/nvidia_nnu-net_for_pytorch:21.11.0'
-enroot create --name nnunet xuagu37+nvidia_nnu-net_for_pytorch+21.11.0.sqsh
-enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet
-```
-- Prepare BraTS dataset (within the image)
-```
-python download.py --task 01
-python preprocess.py --task 01 --dim 2
-```
-Exit the image.
--->

 - For benchmarking purposes, we use copies of a single image
 ```
 bash scripts/copy_data_for_benchmark.sh
 ```
@@ -90,6 +65,7 @@ The input arguments are:
 We will average the benchmark performance over the iterations.
 The maximum usable (without an OOM error) batch size is 256 and 128 for single- and multi-node, respectively.
 ```
+cd
 cd Berzelius-nnU-Net-Benchmark && mkdir -p sbatch_out
 bash scripts/benchmark_sbatch_submit.sh 1 8 100 128
 ```
@@ -110,22 +86,22 @@ We run 100 iterations for each set of parameters. Please see the results in benc

 **Observation 1**: Ideally, the improvement of throughput would be linear when the number of GPUs increases. In practice, throughput stays below the ideal curve as the number of GPUs increases.

-<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/74d9160cec1caaf2c4531db3ae6096b518229b32/figures/benchmark_throughput_gpus_ideal.png" width="800">
+<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/74d9160cec1caaf2c4531db3ae6096b518229b32/figures/benchmark_throughput_gpus_ideal.png" width="800">

 **Observation 2**: When batch_size is small (1, 2, 4, 8), throughput_amp ≈ throughput_tf32; when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.

-<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/3a4941c09c5280ef3749d44b3af14dcccacc38f7/figures/benchmark_throughput_batch_size.png" width="800">
+<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/3a4941c09c5280ef3749d44b3af14dcccacc38f7/figures/benchmark_throughput_batch_size.png" width="800">

 **Observation 3**: Benchmark results are more stable with larger batch_size.
-<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/e62617c63bfb4d167a78faf84156956bbc8f52bb/figures/benchmark_throughput_cv.png" width="800"> +<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/e62617c63bfb4d167a78faf84156956bbc8f52bb/figures/benchmark_throughput_cv.png" width="800"> Coefficient of variation is calculated as the ratio of the standard deviation to the mean. It shows the extent of variability in relation to the mean of the population. **Observation 4**: Ideally, the improvement of throughput would be linear when batch_size increases. In practice, throughtput stays below the ideal curve when batch_size > 16. -<img src="https://gitlab.liu.se/xuagu37/Benchmark_nnU-Net_for_PyTorch/-/raw/ec0f070f718c05d46c6090cc3f8d6ebb29f93725/figures/benchmark_throughput_batch_size_ideal.png" width="800"> +<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/ec0f070f718c05d46c6090cc3f8d6ebb29f93725/figures/benchmark_throughput_batch_size_ideal.png" width="800"> @@ -140,5 +116,5 @@ Ref: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html - For multi-node benchmarking, we need to use "srun" command; also, the line "#SBATCH --ntasks-per-node=8" has to be added. Otherwise the process will hang. - Use as large batch_size as possible for a more stable benchmark result. For single node, use 256; for multi-node, use 128. - Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128, 256 takes ~2mins. -- Specify the paths for enroot cache and data, see this [page](https://gitlab.liu.se/xuagu37/run-pytorch-and-tensorflow-containers-with-nvidia-enroot#set-path-to-user-container-storage). +- Specify the paths for enroot cache and data, see this [page](https://gitlab.liu.se/berzeliushub/run-pytorch-and-tensorflow-containers-with-nvidia-enroot#set-path-to-user-container-storage). - (20220222) ```srun enroot ...```stopped working for multi-node case. Use pyxis instead. See the script ```benchmark_multi_node.sbatch```. 
diff --git a/scripts/benchmark_single_node.sbatch b/scripts/benchmark_single_node.sbatch
index 55c8d1ea8412a225b4fc93dce69d433bc4d20289..a52be2b5a74e9dfb05ea90a342fffdbd901d6e93 100644
--- a/scripts/benchmark_single_node.sbatch
+++ b/scripts/benchmark_single_node.sbatch
@@ -4,20 +4,20 @@
 #SBATCH --nodes=1
 #SBATCH --gres=gpu:8
 #SBATCH --time=0-0:10:00
-#####SBATCH --reservation=bt-xuan_1node_20221020_0900
+#SBATCH --reservation=devel

 # For apptainer
-#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
-#apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname="benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json""
+rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
+apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json'"

-#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
-#apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname="benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json"
+rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
+apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch.sif bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json'"

 # For enroot
-rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
-enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname="benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json""
+#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json
+#enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_tf32_iteration${5}.json'"

-rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
-enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname="benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json""
+#rm -f results/benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json
+#enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus ${3} --dim ${1} --batch_size ${4} --nodes ${2} --amp --logname='benchmark_dim${1}_nodes${2}_gpus${3}_batchsize${4}_amp_iteration${5}.json'"
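
The README note above states that multi-node benchmarking needs `srun`, the line `#SBATCH --ntasks-per-node=8`, and pyxis. Purely as an illustration of that pattern, a minimal two-node sketch could look like the following; the container image reference, mounts, and parameter values are assumptions, and the repository's `benchmark_multi_node.sbatch` remains the authoritative version:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8   # required for multi-node runs, otherwise the process hangs
#SBATCH --time=0-0:10:00

# Illustrative srun + pyxis invocation; adjust the container image reference
# and mounts to match the local enroot/pyxis setup.
srun --container-image=berzeliushub/nvidia_nnu-net_for_pytorch:21.11.0 \
     --container-mounts=${PWD}/data:/data,${PWD}/results:/results \
     bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train \
       --gpus 8 --dim 2 --batch_size 128 --nodes 2 --amp \
       --logname=benchmark_dim2_nodes2_gpus8_batchsize128_amp_iteration1.json"
```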