### Setting paths

```
MODEL_NAME=nnunet_for_pytorch
MODEL_VERSION=latest
MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.11-py3.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/nnUNet/${MODEL_NAME}_${MODEL_VERSION}.def
WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/nnUNet/nnunet_pyt

mkdir -p $WORK_DIR/data $WORK_DIR/results
```

### Building the container

```
apptainer build $MODEL_BASE docker://nvcr.io/nvidia/pytorch:21.11-py3
apptainer build $CONTAINER_DIR $DEF_DIR
```
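
The definition file referenced by `$DEF_DIR` is part of this repository and is the authoritative version. As a hedged sketch of the general shape such a file can take, assuming it builds on the local base image and places the NVIDIA nnU-Net example under `/workspace/nnunet_pyt`:

```
Bootstrap: localimage
From: /proj/nsc_testing/xuan/containers/nvidia_pytorch_21.11-py3.sif

%post
    # Hypothetical steps: fetch the nnU-Net example and install its requirements
    git clone https://github.com/NVIDIA/DeepLearningExamples.git /tmp/DeepLearningExamples
    cp -r /tmp/DeepLearningExamples/PyTorch/Segmentation/nnUNet /workspace/nnunet_pyt
    pip install --no-cache-dir -r /workspace/nnunet_pyt/requirements.txt
```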

### Make a copy of the code

```
apptainer exec $CONTAINER_DIR bash -c "cp -a /workspace/nnunet_pyt/* ${WORK_DIR}"
```

### Downloading and preprocessing the data

```
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python download.py --task 01
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python preprocess.py --task 01 --dim 2
```
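
Because `/data` and `/results` are bind mounts, the downloaded and preprocessed files end up in the host directories created earlier; a quick sanity check:

```
ls ${WORK_DIR}/data ${WORK_DIR}/results
```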

### Running the benchmarks

```
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python scripts/benchmark.py --mode train --gpus 1 --dim 2 --batch_size 256 --amp
apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python scripts/benchmark.py --mode predict --gpus 1 --dim 2 --batch_size 256 --amp
```
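
To benchmark several configurations in one go, the same command can be wrapped in a small shell loop. The batch sizes below are illustrative, not the repository's official sweep:

```
# Illustrative sweep over batch sizes for the 2D training benchmark
for bs in 64 128 256; do
  apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results \
    --pwd /workspace/nnunet_pyt $CONTAINER_DIR \
    python scripts/benchmark.py --mode train --gpus 1 --dim 2 --batch_size $bs --amp
done
```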

### Running the benchmarks as batch jobs

```
bash $WORK_DIR/submit_benchmark_jobs.sh
```
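
`submit_benchmark_jobs.sh` ships with this repository and is the authoritative version. As a hedged sketch, a single-node Slurm job wrapping the same benchmark command (resource requests are illustrative; the paths mirror the "Setting paths" section) could look like:

```
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00

WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/nnUNet/nnunet_pyt
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/nnunet_for_pytorch_latest.sif

apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results \
  --pwd /workspace/nnunet_pyt $CONTAINER_DIR \
  python scripts/benchmark.py --mode train --gpus 1 --dim 2 --batch_size 256 --amp
```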

### Known issues

#### Issue 1 (21.11.0)
https://github.com/NVIDIA/DeepLearningExamples/issues/1113

When running the container, the following error occurs:
```
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)
```

Solution 1 (does not work): `pip install pytorch-lightning==1.5.10`.

Another error is raised when benchmarking prediction:
```
Traceback (most recent call last):
  File "main.py", line 110, in <module>
    trainer.current_epoch = 1
AttributeError: can't set attribute
```

Solution 2: `pip install torchmetrics==0.6.0`. 

Another error is then raised:
```
  File "main.py", line 34, in <module>
    set_affinity(int(os.getenv("LOCAL_RANK", "0")), args.gpus, mode=args.affinity)
  File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 376, in set_affinity
    set_socket_unique_affinity(gpu_id, nproc_per_node, cores, "contiguous", balanced)
  File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 263, in set_socket_unique_affinity
    os.sched_setaffinity(0, ungrouped_affinities[gpu_id])
OSError: [Errno 22] Invalid argument
```
To fix this, comment out lines 32-33 in `main.py`.
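
One hedged way to apply that edit from the shell: patch the writable copy of the code in `${WORK_DIR}` (the container image itself is read-only) and bind-mount it over the container's code directory when running. The bind mount and `sed` range below are an assumption sketched from the paths above, not the repository's documented procedure.

```
# Comment out lines 32-33 in the host-side copy of main.py
sed -i '32,33 s/^/# /' ${WORK_DIR}/main.py

# Run with the patched copy mounted over /workspace/nnunet_pyt
apptainer exec --nv -B ${WORK_DIR}:/workspace/nnunet_pyt \
  -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results \
  --pwd /workspace/nnunet_pyt $CONTAINER_DIR \
  python scripts/benchmark.py --mode train --gpus 1 --dim 2 --batch_size 256 --amp
```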

#### Issue 2 (21.11.0)

Multi-node jobs are not supported in 21.11.0; they are only supported in the latest (nightly) version.


#### Issue 3 (latest)

```
ImportError: cannot import name '_compare_version' from 'torchmetrics.utilities.imports'
```

Solution: `pip install torchmetrics==0.11.4`.
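
The container image is read-only, so these `pip install` fixes have to land somewhere writable. One hedged option, assuming Apptainer's default bind-mounting of `$HOME`, is to install into the user site, which then shadows the container's own copy:

```
# Install the pinned torchmetrics into ~/.local, visible to the container's Python
apptainer exec $CONTAINER_DIR pip install --user torchmetrics==0.11.4
```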

#### Issue 4 

For multi-node jobs, request GPUs with `#SBATCH --gres=gpu:8`; `#SBATCH --gpus=8` will not work.

Also add `#SBATCH --ntasks-per-node=${gpus}`, as in the sketch below.
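
A hedged example header for a two-node job with 8 GPUs per node (the node count is illustrative):

```
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
```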