# MaskRCNN
### Create directories
```
MODEL_VERSION=latest
MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.12-py3.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/${MODEL_NAME}_${MODEL_VERSION}.def
WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/object_detection
mkdir -p $WORK_DIR/data $WORK_DIR/results
```
### Build the container
```
apptainer build $MODEL_BASE docker://nvcr.io/nvidia/pytorch:21.12-py3
apptainer build $CONTAINER_DIR $DEF_DIR
```
### Make a copy of the code
```
apptainer exec $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/object_detection"
```
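Note that `cp` requires the target directory to exist; the `mkdir -p` in the setup above only creates `data` and `results`, so create the target first if needed:
```
# Create the copy target before running the cp command above (assumes the
# object_detection subdirectory is not created elsewhere).
mkdir -p ${WORK_DIR}/object_detection
```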
### Downloading and preprocessing the data
```
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/data --pwd /data $CONTAINER_DIR bash -c "cp /workspace/object_detection/hashes.md5 /data/ && bash /workspace/object_detection/download_dataset.sh /data"
```
### Benchmarking
AMP can be enabled by setting `DTYPE` to `float16`.
```
apptainer exec --nv $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/object_detection/"
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/train_benchmark.sh float16 1 True True
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/inference_benchmark.sh float16 1
```
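The TF32 baseline presumably reuses the same commands with `DTYPE` set to `float32`; a hedged example, with the argument order copied from the `float16` invocations above:
```
# Same training benchmark with TF32 (float32) instead of AMP; other arguments unchanged.
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/train_benchmark.sh float32 1 True True
```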
# nnU-Net
### Create directories
```
MODEL_NAME=nnunet_for_pytorch
MODEL_VERSION=latest
MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.11-py3.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/nnUNet/${MODEL_NAME}_${MODEL_VERSION}.def
WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/nnUNet/nnunet_pyt
mkdir -p $WORK_DIR/data $WORK_DIR/results
```
### Build the container
```
apptainer build $MODEL_BASE docker://nvcr.io/nvidia/pytorch:21.11-py3
apptainer build $CONTAINER_DIR $DEF_DIR
```
### Make a copy of the code
```
apptainer exec $CONTAINER_DIR bash -c "cp -a /workspace/nnunet_pyt/* ${WORK_DIR}"
```
### Downloading and preprocessing the data
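A sketch of this step, assuming the same `download.py`/`preprocess.py` entry points that the original benchmark instructions below invoke inside the container:
```
# Download task 01 and preprocess it for both the 2D and 3D pipelines
# (hypothetical adaptation of the commands used later in this README).
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results $CONTAINER_DIR bash -c "cd /workspace/nnunet_pyt && python download.py --task 01"
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results $CONTAINER_DIR bash -c "cd /workspace/nnunet_pyt && python preprocess.py --task 01 --dim 2 && python preprocess.py --task 01 --dim 3"
```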
# Berzelius Benchmarks
The benchmarking is based on [Nvidia NGC nnU-Net for PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/resources/nnunet_for_pytorch) v21.11.0.
```
VERSION=21.11.0
```
### On local computer (optional)
- Download the code
```
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/nnunet_for_pytorch/versions/$VERSION/zip -O /tmp/nnunet_for_pytorch_$VERSION.zip
mkdir ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION
unzip /tmp/nnunet_for_pytorch_$VERSION.zip -d ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
cd ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
```
- Build the nnU-Net PyTorch Docker container
Change the pytorch-lightning version to 1.5.10 to avoid the `from torchmetrics.utilities.data import get_num_classes as _get_num_classes` import error.
Add `ENV PYTHONNOUSERSITE=True` to the `Dockerfile` to disable user site-packages (see the sketch after the build command).
```
docker build -t nnunet .
```
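A sketch of those two edits applied from the shell before building; it assumes pytorch-lightning is pinned in the repository's `requirements.txt`, which may not match your checkout:
```
# Pin pytorch-lightning to 1.5.10 (assumes a requirements.txt entry for it).
sed -i 's/^pytorch-lightning.*/pytorch-lightning==1.5.10/' requirements.txt

# Have Python inside the image ignore ~/.local user packages. Appending is a
# simplification; placing the ENV line early in the Dockerfile also covers
# later build steps.
printf '\nENV PYTHONNOUSERSITE=True\n' >> Dockerfile
```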
- Push the container to Docker Hub
```
docker tag nnunet:latest berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
docker push berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
```
### On Berzelius
- Create directories
```
cd /proj/nsc_testing/xuan/DeepLearningExamples/
git clone https://gitlab.liu.se/xuagu37/Berzelius-nnU-Net-Benchmark.git
cd Berzelius-nnU-Net-Benchmark
mkdir data results
```
Docker is not available on Berzelius. We use Apptainer or Enroot.
- Prepare the dataset
With Apptainer
```
apptainer pull nvidia_nnu-net_for_pytorch_$VERSION.sif docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python download.py --task 01"
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python preprocess.py --task 01 --dim 2 && python preprocess.py --task 01 --dim 3"
```
<!-- With Enroot
```
enroot import docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
enroot create --name nnunet berzeliushub+nvidia_nnu-net_for_pytorch+$VERSION.sqsh
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python download.py --task 01 && python preprocess.py --task 01 --dim 2"
```
-->
- Submit the job to Berzelius
You can choose either Singularity or Enroot in the scripts ```benchmark_single_node.sbatch``` and ```benchmark_multi_node.sbatch```.
Change the following settings in `benchmark_sbatch_submit.sh` (a multi-node sketch follows the commands below):
1. Data dimension,
2. Number of nodes,
3. Number of GPUs used per node,
4. Number of iterations for each parameter setting,
5. Batch size.
We average the benchmark performance over the iterations. The maximum usable batch size (without an OOM error) is 256 for single-node and 128 for multi-node runs.
```
mkdir -p sbatch_out
bash scripts/run_benchmark.sh
```
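For orientation, a minimal multi-node sbatch sketch; the resource lines and the benchmark invocation are assumptions modeled on the settings above, not a copy of the real ```benchmark_multi_node.sbatch```. Note `--ntasks-per-node=8` and `srun`, which the notes below flag as required for multi-node runs:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8   # required for multi-node; without it the job hangs
#SBATCH --time=00:30:00
#SBATCH --output=sbatch_out/benchmark_%j.out

# Hypothetical invocation mirroring the settings listed above:
# dim=2, 2 nodes, 8 GPUs per node, batch_size=128, AMP enabled.
srun singularity exec --nv \
  -B ${PWD}/data:/data -B ${PWD}/results:/results \
  nvidia_nnu-net_for_pytorch_21.11.0.sif \
  bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus 8 --nodes 2 --dim 2 --batch_size 128 --amp"
```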
### Results
We collect benchmark results of throughput (images/sec) for
- Precisions = TF32, AMP
- Dimension = 2
- Nodes = 1, 2, 3, 4, 5, 6, 7, 8
- GPUs = 1 - 8 (for 1 node), all GPUs (for multi-node)
- Batch size = 1, 2, 4, 8, 16, 32, 64, 128, 256
TF32 (TensorFloat-32) mode accelerates FP32 convolutions and matrix multiplications and is the default for AI training with 32-bit variables on the Ampere GPU architecture.
AMP (Automatic Mixed Precision) offers a significant computational speedup by performing operations in half precision (FP16) while keeping critical parts of the network in single precision (FP32) to retain as much information as possible.
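As a quick way to compare against true FP32, TF32 can be disabled globally with NVIDIA's documented `NVIDIA_TF32_OVERRIDE` environment variable; a diagnostic sketch, not part of the benchmark scripts:
```
# Force genuine FP32 math (disable TF32 tensor-core paths) for one run.
NVIDIA_TF32_OVERRIDE=0 bash scripts/run_benchmark.sh
```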
We run 100 iterations for each set of parameters. Please see the results in `benchmark_table.xlsx`.
**Observation 1**: Ideally, throughput would scale linearly with the number of GPUs. In practice, throughput stays below the ideal curve as the number of GPUs increases.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/74d9160cec1caaf2c4531db3ae6096b518229b32/figures/benchmark_throughput_gpus_ideal.png" width="800">
**Observation 2**: When batch_size is small (1, 2, 4, 8), throughput_amp ≈ throughput_tf32; when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/3a4941c09c5280ef3749d44b3af14dcccacc38f7/figures/benchmark_throughput_batch_size.png" width="800">
**Observation 3**: Benchmark results are more stable with larger batch_size.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/e62617c63bfb4d167a78faf84156956bbc8f52bb/figures/benchmark_throughput_cv.png" width="800">
The coefficient of variation is the ratio of the standard deviation to the mean; it shows the extent of variability relative to the mean of the samples.
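In symbols, for the throughput samples of one parameter setting:
```math
c_v = \frac{\sigma}{\mu}
```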
**Observation 4**: Ideally, throughput would scale linearly with batch_size. In practice, throughput stays below the ideal curve when batch_size > 16.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/ec0f070f718c05d46c6090cc3f8d6ebb29f93725/figures/benchmark_throughput_batch_size_ideal.png" width="800">
#### Notes
- Line 116 of DeepLearningExamples/PyTorch/Segmentation/nnUNet/main.py should be changed from
```trainer.test(model, test_dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)```
to
```trainer.test(model, dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)```
(the `test_dataloaders` argument was renamed to `dataloaders` in newer PyTorch Lightning versions). Ref: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html
- Running the benchmark directly in a `singularity shell` session seems to give worse performance (observed when working from home); run it via an sbatch script instead.
- It took around a week to finish 100 iterations of benchmarking for all sets of parameters.
- For multi-node benchmarking, we need to use the `srun` command, and the line `#SBATCH --ntasks-per-node=8` has to be added; otherwise the process hangs.
- Use as large a batch_size as possible for a more stable benchmark result: 256 for single-node, 128 for multi-node.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, and batch_size = 128 or 256 takes about 2 minutes.
- Specify the paths for the enroot cache and data (a sketch follows these notes); see this [page](https://gitlab.liu.se/berzeliushub/run-pytorch-and-tensorflow-containers-with-nvidia-enroot#set-path-to-user-container-storage).
- (20220222) ```srun enroot ...``` stopped working for the multi-node case. Use pyxis instead; see the script ```benchmark_multi_node.sbatch```.
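A minimal sketch of those enroot path settings; the locations are hypothetical, while the variable names are enroot's standard configuration variables:
```
# Hypothetical project-storage locations; adjust to your own project area.
export ENROOT_CACHE_PATH=/proj/nsc_testing/xuan/enroot/cache
export ENROOT_DATA_PATH=/proj/nsc_testing/xuan/enroot/data
mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH"
```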
| Category | Model | Framework | Base Image | Multi-GPU | Multi-Node |
|-------------------------------|----------|-----------|-------------------|-----------|------------|
| NVIDIA Deep Learning Examples | nnU-Net  | PyTorch   | pytorch:21.11-py3 | Yes       | Yes        |
| | MaskRCNN | PyTorch | pytorch:21.12-py3 | Yes | No |