MODEL_NAME=maskrcnn_for_pytorch
MODEL_VERSION=latest
MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.12-py3.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/${MODEL_NAME}_${MODEL_VERSION}.def
WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN
mkdir -p $WORK_DIR/data $WORK_DIR/results
apptainer build $CONTAINER_DIR $DEF_DIR
```
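Before launching a long build it can help to confirm that the definition file is present and that the target image will not be overwritten. The helper below is a minimal sketch; `can_build` is a hypothetical name and not part of the repository.

```shell
# Hypothetical pre-build check (not part of the repository).
# $1 = target image path, $2 = definition file path
can_build() {
    if [ ! -f "$2" ]; then
        echo "definition file not found: $2" >&2
        return 1
    fi
    if [ -e "$1" ]; then
        echo "image already exists: $1" >&2
        return 2
    fi
    return 0
}

# Example use with the variables defined above:
#   can_build "$CONTAINER_DIR" "$DEF_DIR" && apptainer build "$CONTAINER_DIR" "$DEF_DIR"
```

The function only inspects the filesystem, so it can be run on a login node before any build resources are used.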
### Downloading and preprocessing the data
```
apptainer exec --nv -B ${WORK_DIR}/data:/data --pwd /data $CONTAINER_DIR bash -c "cp /workspace/object_detection/hashes.md5 /data/ && bash /workspace/object_detection/download_dataset.sh /data"
apptainer exec --nv $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/"
apptainer exec --nv -B ${WORK_DIR}/data:/datasets/data -B ${WORK_DIR}/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/train_benchmark.sh fp16 1 True True
```
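Since `hashes.md5` is copied into the data directory, the downloaded files can be re-checked later without downloading them again. The sketch below assumes that `hashes.md5` uses the standard `md5sum` format with paths relative to the data directory; `verify_dataset` is a hypothetical helper name.

```shell
# Re-check downloaded files against a checksum list.
# Assumption: hashes.md5 is in `md5sum` format (<hash>  <relative path>).
verify_dataset() {
    # $1 = directory containing hashes.md5 and the downloaded files
    ( cd "$1" && md5sum --quiet -c hashes.md5 )
}

# Example use:
#   verify_dataset "${WORK_DIR}/data" && echo "dataset OK"
```

With `--quiet`, `md5sum` prints only the files that fail verification, and the function returns a non-zero status if any file does not match.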
### Running benchmarking
```
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python scripts/benchmark.py --mode train --gpus 1 --dim 2 --batch_size 256 --amp
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python scripts/benchmark.py --mode predict --gpus 1 --dim 2 --batch_size 256 --amp
```
### Running benchmarking using batch jobs
```
bash submit_benchmark_jobs.sh
```
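The contents of `submit_benchmark_jobs.sh` are not shown here. As a rough illustration of how a sweep over GPU counts might be submitted to Slurm, the sketch below prints one `sbatch` command per GPU count so the sweep can be inspected before anything is submitted. The function name, the job names, and `benchmark_job.sh` are all hypothetical placeholders, not the actual script.

```shell
# Hypothetical sketch; NOT the contents of submit_benchmark_jobs.sh.
# Prints one sbatch command per GPU count given as an argument.
print_benchmark_jobs() {
    for gpus in "$@"; do
        # benchmark_job.sh is a placeholder for a job script that takes the GPU count.
        echo "sbatch --gpus=${gpus} --job-name=bench_${gpus}gpu benchmark_job.sh ${gpus}"
    done
}

# Example use (dry run):
#   print_benchmark_jobs 1 2 4 8
```

Printing the commands first keeps the sketch free of side effects; piping the output to `bash` would submit the jobs.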
### Known issues
#### Issue 1 (21.11.0)
Running the container raises the following error (see https://github.com/NVIDIA/DeepLearningExamples/issues/1113):
```
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)
```
Solution 1 (does not work): `pip install pytorch-lightning==1.5.10`.
Another error is then raised when benchmarking in predict mode:
```
Traceback (most recent call last):
File "main.py", line 110, in <module>
trainer.current_epoch = 1
AttributeError: can't set attribute
```
Solution 2: `pip install torchmetrics==0.6.0`.
Another error is then raised:
```
File "main.py", line 34, in <module>
set_affinity(int(os.getenv("LOCAL_RANK", "0")), args.gpus, mode=args.affinity)
File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 376, in set_affinity
set_socket_unique_affinity(gpu_id, nproc_per_node, cores, "contiguous", balanced)
File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 263, in set_socket_unique_affinity
os.sched_setaffinity(0, ungrouped_affinities[gpu_id])
OSError: [Errno 22] Invalid argument
```
Commenting out lines 32-33 of `main.py` fixes it.
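The edit can be scripted with `sed`. The sketch below assumes a writable copy of `main.py` (the SIF image itself is read-only) and that the affected lines are top-level statements, so prefixing them with `# ` leaves valid Python. `comment_out_lines` is a hypothetical helper name.

```shell
# Comment out an inclusive line range of a file in place.
# The original file is kept next to it with a .bak suffix.
comment_out_lines() {
    # $1 = file, $2 = first line, $3 = last line
    sed -i.bak "${2},${3}s/^/# /" "$1"
}

# Example use on a writable copy of the script:
#   comment_out_lines main.py 32 33
```

Because line numbers can shift between container versions, it is worth checking the two lines (for example with `sed -n '32,33p' main.py`) before applying the change.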
#### Issue 2 (21.11.0)
Multi-node jobs are not supported in 21.11.0; they are supported only in the latest (nightly) version.
#### Issue 3 (latest)
```
ImportError: cannot import name '_compare_version' from 'torchmetrics.utilities.imports'
```
Solution: `pip install torchmetrics==0.11.4`.
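The two `torchmetrics` pins listed under Issue 1 and Issue 3 depend on the container version, so they can be selected from `MODEL_VERSION`. The helper below is a minimal sketch; `torchmetrics_pin` is a hypothetical name, and the mapping only covers the two versions mentioned in this section.

```shell
# Map a container version to the torchmetrics pin noted in the known issues.
# Returns non-zero for versions not covered in this section.
torchmetrics_pin() {
    case "$1" in
        21.11.0) echo "0.6.0" ;;
        latest)  echo "0.11.4" ;;
        *)       return 1 ;;
    esac
}

# Example use:
#   pip install "torchmetrics==$(torchmetrics_pin "$MODEL_VERSION")"
```

Keeping the mapping in one place makes it easier to update when a new container version needs a different pin.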