# MaskRCNN
### Create directories
```
MODEL_VERSION=latest
MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.12-py3.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/${MODEL_NAME}_${MODEL_VERSION}.def
WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/object_detection
mkdir -p $WORK_DIR/data $WORK_DIR/results
```
### Build the container
```
apptainer build $MODEL_BASE docker://nvcr.io/nvidia/pytorch:21.12-py3
apptainer build $CONTAINER_DIR $DEF_DIR
```
### Make a copy of the code
```
apptainer exec $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/object_detection"
```
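Note that `cp` requires the target directory to exist; the `mkdir -p` in the setup above only creates `data` and `results`, so create the target first if needed:
```
# Create the copy target before running the cp command above (assumes the
# object_detection subdirectory is not created elsewhere).
mkdir -p ${WORK_DIR}/object_detection
```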
### Downloading and preprocessing the data
```
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/data --pwd /data $CONTAINER_DIR bash -c "cp /workspace/object_detection/hashes.md5 /data/ && bash /workspace/object_detection/download_dataset.sh /data"
```
### Benchmarking
AMP can be enabled by setting `DTYPE` to `float16`.
```
apptainer exec --nv $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/object_detection/"
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/train_benchmark.sh float16 1 True True
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/inference_benchmark.sh float16 1
```
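The TF32 baseline presumably reuses the same commands with `DTYPE` set to `float32`; a hedged example, with the argument order copied from the `float16` invocations above:
```
# Same training benchmark with TF32 (float32) instead of AMP; other arguments unchanged.
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/train_benchmark.sh float32 1 True True
```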
# nnU-Net
### Create directories
```
MODEL_NAME=nnunet_for_pytorch
MODEL_VERSION=latest
MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.11-py3.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/nnUNet/${MODEL_NAME}_${MODEL_VERSION}.def
WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/nnUNet/nnunet_pyt
mkdir -p $WORK_DIR/data $WORK_DIR/results
```
### Build the container
```
apptainer build $MODEL_BASE docker://nvcr.io/nvidia/pytorch:21.11-py3
apptainer build $CONTAINER_DIR $DEF_DIR
```
### Make a copy of the code
```
apptainer exec $CONTAINER_DIR bash -c "cp -a /workspace/nnunet_pyt/* ${WORK_DIR}"
```
### Downloading and preprocessing the data
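A sketch of this step, assuming the same `download.py`/`preprocess.py` entry points that the original benchmark instructions below invoke inside the container:
```
# Download task 01 and preprocess it for both the 2D and 3D pipelines
# (hypothetical adaptation of the commands used later in this README).
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results $CONTAINER_DIR bash -c "cd /workspace/nnunet_pyt && python download.py --task 01"
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results $CONTAINER_DIR bash -c "cd /workspace/nnunet_pyt && python preprocess.py --task 01 --dim 2 && python preprocess.py --task 01 --dim 3"
```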
# Berzelius Benchmarks
The benchmarking is based on [Nvidia NGC nnU-Net for PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/resources/nnunet_for_pytorch) v21.11.0.
```
VERSION=21.11.0
```
### On local computer (optional)
- Download the code
```
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/nnunet_for_pytorch/versions/$VERSION/zip -O /tmp/nnunet_for_pytorch_$VERSION.zip
mkdir ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION
unzip /tmp/nnunet_for_pytorch_$VERSION.zip -d ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
cd ~/DeepLearningExamples/nnunet_for_pytorch_$VERSION/
```
- Build the nnU-Net PyTorch Docker container
Change the pytorch-lightning version to 1.5.10 to avoid the `from torchmetrics.utilities.data import get_num_classes as _get_num_classes` import error.
Add `ENV PYTHONNOUSERSITE=True` to the `Dockerfile` to disable user site-packages (see the sketch after the build command).
```
docker build -t nnunet .
```
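A sketch of those two edits applied from the shell before building; it assumes pytorch-lightning is pinned in the repository's `requirements.txt`, which may not match your checkout:
```
# Pin pytorch-lightning to 1.5.10 (assumes a requirements.txt entry for it).
sed -i 's/^pytorch-lightning.*/pytorch-lightning==1.5.10/' requirements.txt

# Have Python inside the image ignore ~/.local user packages. Appending is a
# simplification; placing the ENV line early in the Dockerfile also covers
# later build steps.
printf '\nENV PYTHONNOUSERSITE=True\n' >> Dockerfile
```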
- Push the container to Docker Hub
```
docker tag nnunet:latest berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
docker push berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
```
### On Berzelius
- Create directories
```
cd /proj/nsc_testing/xuan/DeepLearningExamples/
git clone https://gitlab.liu.se/xuagu37/Berzelius-nnU-Net-Benchmark.git
cd Berzelius-nnU-Net-Benchmark
mkdir data results
```
Docker is not available on Berzelius. We use Apptainer or Enroot.
- Prepare the dataset
With Apptainer
```
apptainer pull nvidia_nnu-net_for_pytorch_$VERSION.sif docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python download.py --task 01"
apptainer exec --nv -B ${PWD}/data:/data -B ${PWD}/results:/results nvidia_nnu-net_for_pytorch_$VERSION.sif bash -c "cd /workspace/nnunet_pyt && python preprocess.py --task 01 --dim 2 && python preprocess.py --task 01 --dim 3"
```
<!-- With Enroot
```
enroot import docker://berzeliushub/nvidia_nnu-net_for_pytorch:$VERSION
enroot create --name nnunet berzeliushub+nvidia_nnu-net_for_pytorch+$VERSION.sqsh
enroot start --rw --mount ${PWD}/data:/data --mount ${PWD}/results:/results nnunet bash -c "cd /workspace/nnunet_pyt && python download.py --task 01 && python preprocess.py --task 01 --dim 2"
```
-->
- Submit the job to Berzelius
You can choose either Singularity or Enroot in the scripts ```benchmark_single_node.sbatch``` and ```benchmark_multi_node.sbatch```.
Change the following settings in `benchmark_sbatch_submit.sh` (a multi-node sketch follows the commands below):
1. Data dimension,
2. Number of nodes,
3. Number of GPUs used per node,
4. Number of iterations for each parameter setting,
5. Batch size.
We average the benchmark performance over the iterations. The maximum usable batch size (without an OOM error) is 256 for single-node and 128 for multi-node runs.
```
mkdir -p sbatch_out
bash scripts/run_benchmark.sh
```
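For orientation, a minimal multi-node sbatch sketch; the resource lines and the benchmark invocation are assumptions modeled on the settings above, not a copy of the real ```benchmark_multi_node.sbatch```. Note `--ntasks-per-node=8` and `srun`, which the notes below flag as required for multi-node runs:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8   # required for multi-node; without it the job hangs
#SBATCH --time=00:30:00
#SBATCH --output=sbatch_out/benchmark_%j.out

# Hypothetical invocation mirroring the settings listed above:
# dim=2, 2 nodes, 8 GPUs per node, batch_size=128, AMP enabled.
srun singularity exec --nv \
  -B ${PWD}/data:/data -B ${PWD}/results:/results \
  nvidia_nnu-net_for_pytorch_21.11.0.sif \
  bash -c "cd /workspace/nnunet_pyt && python scripts/benchmark.py --mode train --gpus 8 --nodes 2 --dim 2 --batch_size 128 --amp"
```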
### Results
We collect benchmark results of throughput (images/sec) for
- Precisions = TF32, AMP
- Dimension = 2
- Nodes = 1, 2, 3, 4, 5, 6, 7, 8
- GPUs = 1 - 8 (for 1 node), all GPUs (for multi-node)
- Batch size = 1, 2, 4, 8, 16, 32, 64, 128, 256
TF32 (TensorFloat-32) mode accelerates FP32 convolutions and matrix multiplications and is the default for AI training with 32-bit variables on the Ampere GPU architecture.
AMP (Automatic Mixed Precision) offers a significant computational speedup by performing operations in half precision (FP16) while keeping critical parts of the network in single precision (FP32) to retain as much information as possible.
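As a quick way to compare against true FP32, TF32 can be disabled globally with NVIDIA's documented `NVIDIA_TF32_OVERRIDE` environment variable; a diagnostic sketch, not part of the benchmark scripts:
```
# Force genuine FP32 math (disable TF32 tensor-core paths) for one run.
NVIDIA_TF32_OVERRIDE=0 bash scripts/run_benchmark.sh
```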
We run 100 iterations for each set of parameters. Please see the results in `benchmark_table.xlsx`.
**Observation 1**: Ideally, throughput would scale linearly with the number of GPUs. In practice, throughput stays below the ideal curve as the number of GPUs increases.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/74d9160cec1caaf2c4531db3ae6096b518229b32/figures/benchmark_throughput_gpus_ideal.png" width="800">
**Observation 2**: When batch_size is small (1, 2, 4, 8), throughput_amp ≈ throughput_tf32; when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/3a4941c09c5280ef3749d44b3af14dcccacc38f7/figures/benchmark_throughput_batch_size.png" width="800">
**Observation 3**: Benchmark results are more stable with larger batch_size.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/e62617c63bfb4d167a78faf84156956bbc8f52bb/figures/benchmark_throughput_cv.png" width="800">
The coefficient of variation is the ratio of the standard deviation to the mean; it shows the extent of variability relative to the mean of the samples.
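In symbols, for the throughput samples of one parameter setting:
```math
c_v = \frac{\sigma}{\mu}
```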
**Observation 4**: Ideally, throughput would scale linearly with batch_size. In practice, throughput stays below the ideal curve when batch_size > 16.
<img src="https://gitlab.liu.se/berzeliushub/Benchmark_nnU-Net_for_PyTorch/-/raw/ec0f070f718c05d46c6090cc3f8d6ebb29f93725/figures/benchmark_throughput_batch_size_ideal.png" width="800">
#### Notes
- Line 116 of DeepLearningExamples/PyTorch/Segmentation/nnUNet/main.py should be changed from
```trainer.test(model, test_dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)```
to
```trainer.test(model, dataloaders=data_module.test_dataloader(), ckpt_path=ckpt_path)```
(the `test_dataloaders` argument was renamed to `dataloaders` in newer PyTorch Lightning versions). Ref: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html
- Running the benchmark directly in a `singularity shell` session seems to give worse performance (observed when working from home); run it via an sbatch script instead.
- It took around a week to finish 100 iterations of benchmarking for all sets of parameters.
- For multi-node benchmarking, we need to use the `srun` command, and the line `#SBATCH --ntasks-per-node=8` has to be added; otherwise the process hangs.
- Use as large a batch_size as possible for a more stable benchmark result: 256 for single-node, 128 for multi-node.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, and batch_size = 128 or 256 takes about 2 minutes.
- Specify the paths for the enroot cache and data (a sketch follows these notes); see this [page](https://gitlab.liu.se/berzeliushub/run-pytorch-and-tensorflow-containers-with-nvidia-enroot#set-path-to-user-container-storage).
- (20220222) ```srun enroot ...``` stopped working for the multi-node case. Use pyxis instead; see the script ```benchmark_multi_node.sbatch```.
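A minimal sketch of those enroot path settings; the locations are hypothetical, while the variable names are enroot's standard configuration variables:
```
# Hypothetical project-storage locations; adjust to your own project area.
export ENROOT_CACHE_PATH=/proj/nsc_testing/xuan/enroot/cache
export ENROOT_DATA_PATH=/proj/nsc_testing/xuan/enroot/data
mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH"
```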
| Category | Model | Framework | Base Image | Multi-GPU | Multi-Node |
|-------------------------------|----------|-----------|-------------------|-----------|------------|
| NVIDIA Deep Learning Examples | nnU-Net  | PyTorch   | pytorch:21.11-py3 | Yes       | Yes        |
| | MaskRCNN | PyTorch | pytorch:21.12-py3 | Yes | No |