Commit 56a041e3 authored by Xuan Gu's avatar Xuan Gu

Update 4 files

- /NVIDIA/DeepLearningExamples/PyTorch/README.md
- /MLPerf/training/image_segmentation/pytorch/README.md
- /MLPerf/training/image_segmentation/pytorch/generate_benchmark_jobs.sh
- /MLPerf/training/image_segmentation/pytorch/submit_benchmark_jobs.sh
parent 8e340b74
@@ -5,7 +5,7 @@ The U-Net3D from MLPerf has no version control.
```
-MODEL_NAME=nnunet_for_pytorch
+MODEL_NAME=U-Net3D
MODEL_BASE=/proj/nsc_testing/xuan/containers/pytorch_1.7.1-cuda11.0-cudnn8-runtime.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/MLPerf/training/image_segmentation/pytorch/${MODEL_NAME}.def
@@ -44,3 +44,9 @@ apptainer exec --nv -B ${WORK_DIR}/raw-data:/raw-data -B ${WORK_DIR}/data:/data
```
bash submit_benchmark_jobs.sh
```
### Known issues
#### Issue 1
Line 23 in `main.py` tries to create a file inside the container at `/workspace/unet3d`, which causes a write-permission error. We comment out this line.
\ No newline at end of file
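The workaround above can be scripted so the patch is applied non-interactively. A minimal sketch using `sed`'s line addressing; the stand-in file `/tmp/main_demo.py` and its contents are illustrative only, the real target is line 23 of `main.py` inside the container:

```shell
# Comment out a single line by number with sed.
# Stand-in file; in practice the target is line 23 of
# /workspace/unet3d/main.py, patched before building the image.
printf 'keep_1\noffending_line\nkeep_3\n' > /tmp/main_demo.py
sed -i '2s/^/# /' /tmp/main_demo.py   # line 2 stands in for line 23
sed -n '2p' /tmp/main_demo.py
```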
#!/bin/bash
# generate_benchmark_jobs.sh
# Args: $1=dim $2=nodes $3=gpus $4=batch_size $5=benchmark_mode (train|predict) $6=node_type (thin|fat)
SBATCH_DIR=$WORK_DIR/sbatch_scripts/benchmark_${6}_${5}_dim${1}_nodes${2}_gpus${3}_batchsize_${4}.sbatch
SBATCH_OUT_DIR=$WORK_DIR/sbatch_out/benchmark_${6}_${5}_dim${1}_nodes${2}_gpus${3}_batchsize_${4}.out
LOG_DIR=benchmark_${6}_${5}_dim${1}_nodes${2}_gpus${3}_batchsize_${4}_amp.log
cat <<EOT > $SBATCH_DIR
#!/bin/bash
#SBATCH -A nsc
#SBATCH --nodes=${2}
#SBATCH --gpus=${3}
#SBATCH --time=0-0:20:00
#SBATCH --output=$SBATCH_OUT_DIR
EOT
if [ "${6}" = "thin" ]; then
cat <<EOT >> $SBATCH_DIR
#SBATCH -C "thin"
#SBATCH --reservation=$GPU_RESERVATION
EOT
else
cat <<EOT >> $SBATCH_DIR
#SBATCH -C "fat"
EOT
fi
cat <<EOT >> $SBATCH_DIR
rm -f $WORK_DIR/results/$LOG_DIR
apptainer exec --nv -B ${WORK_DIR}/raw-data:/raw-data -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results $CONTAINER_DIR bash -c "cd /workspace/unet3d && bash run_and_time.sh 1"
mv ${WORK_DIR}/results/unet3d.log ${WORK_DIR}/results/$LOG_DIR
EOT
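For reference, the sbatch filename the generator produces can be reconstructed from the six positional arguments; the values here mirror the thin-node train case driven by `submit_benchmark_jobs.sh`:

```shell
# Rebuild the generated sbatch filename for one parameter set:
# $1=dim $2=nodes $3=gpus $4=batch_size $5=benchmark_mode $6=node_type
dim=2; nodes=1; gpus=8; batch_size=512; mode=train; node_type=thin
name="benchmark_${node_type}_${mode}_dim${dim}_nodes${nodes}_gpus${gpus}_batchsize_${batch_size}.sbatch"
echo "$name"
```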
#!/bin/bash
# submit_benchmark_jobs.sh
set -e
export MODEL_NAME=nnunet_for_pytorch
export WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/MLPerf/training/image_segmentation/pytorch
export CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}.sif
export GPU_RESERVATION=nodeimage
mkdir -p $WORK_DIR/sbatch_out $WORK_DIR/sbatch_scripts $WORK_DIR/results
benchmark_modes=("train" "predict")
# node_types=("thin" "fat")
node_types=("thin")
dim=2
for nodes in {1..1}; do
    for gpus in {1..8}; do
        for benchmark_mode in "${benchmark_modes[@]}"; do
            for node_type in "${node_types[@]}"; do
                if [ "${node_type}" = "thin" ]; then
                    batch_size=512
                else
                    batch_size=1024
                fi
                echo dim ${dim}, nodes ${nodes}, gpus ${gpus}, batch_size ${batch_size}, benchmark_mode ${benchmark_mode}, node_type ${node_type}
                # For single node
                bash $WORK_DIR/generate_benchmark_jobs.sh ${dim} ${nodes} ${gpus} ${batch_size} ${benchmark_mode} ${node_type}
                SBATCH_DIR=$WORK_DIR/sbatch_scripts/benchmark_${node_type}_${benchmark_mode}_dim${dim}_nodes${nodes}_gpus${gpus}_batchsize_${batch_size}.sbatch
                # sbatch $SBATCH_DIR
                # sleep 1
            done
        done
    done
done
@@ -39,3 +39,41 @@ apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --
bash submit_benchmark_jobs.sh
```
### Known issues
#### Issue 1
https://github.com/NVIDIA/DeepLearningExamples/issues/1113
When running the container, an error occurred:
```
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)
```
Solution 1 (not working): `pip install pytorch-lightning==1.5.10`.
Another error was raised when benchmarking predict:
```
Traceback (most recent call last):
File "main.py", line 110, in <module>
trainer.current_epoch = 1
AttributeError: can't set attribute
```
Solution 2: `pip install torchmetrics==0.6.0`.
Another error was raised:
```
File "main.py", line 34, in <module>
set_affinity(int(os.getenv("LOCAL_RANK", "0")), args.gpus, mode=args.affinity)
File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 376, in set_affinity
set_socket_unique_affinity(gpu_id, nproc_per_node, cores, "contiguous", balanced)
File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 263, in set_socket_unique_affinity
os.sched_setaffinity(0, ungrouped_affinities[gpu_id])
OSError: [Errno 22] Invalid argument
```
We comment out lines 32-33 in `main.py` to fix it.
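Commenting out a line range can likewise be scripted with `sed`'s range address. A minimal sketch; the stand-in file `/tmp/affinity_demo.py` and its contents are illustrative only, the real target is lines 32-33 of `main.py` in the container:

```shell
# Comment out a two-line range (2,3 stands in for lines 32-33
# of main.py in the container).
printf 'a\nset_affinity_line\nsecond_line\nd\n' > /tmp/affinity_demo.py
sed -i '2,3s/^/# /' /tmp/affinity_demo.py
sed -n '2,3p' /tmp/affinity_demo.py
```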
#### Issue 2
Multi-node jobs are not supported in 21.11.0, only in the most recent code on GitHub.