### Setting paths

```
MODEL_NAME=maskrcnn_for_pytorch
MODEL_VERSION=latest
MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.12-py3.sif
CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/${MODEL_NAME}_${MODEL_VERSION}.def
WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/object_detection

mkdir -p $WORK_DIR/data $WORK_DIR/results
```
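
Despite the `_DIR` suffix, `CONTAINER_DIR` and `DEF_DIR` hold file paths (a `.sif` image and a `.def` definition file), not directories. The short sketch below shows how the image path is composed from the model name and version. `CONTAINER_ROOT` is a hypothetical helper variable introduced only for this illustration.

```bash
MODEL_NAME=maskrcnn_for_pytorch
MODEL_VERSION=latest
# CONTAINER_ROOT is an assumption for illustration, not part of the original setup
CONTAINER_ROOT=/proj/nsc_testing/xuan/containers
# Same composition as in the setup above: <root>/<name>_<version>.sif
CONTAINER_DIR=${CONTAINER_ROOT}/${MODEL_NAME}_${MODEL_VERSION}.sif
echo "$CONTAINER_DIR"
# prints /proj/nsc_testing/xuan/containers/maskrcnn_for_pytorch_latest.sif
```
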
### Building the container

```
apptainer build $MODEL_BASE docker://nvcr.io/nvidia/pytorch:21.12-py3
apptainer build $CONTAINER_DIR $DEF_DIR
```
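
Both builds can take a while, so it may be convenient to skip an image that is already on disk. The following is a minimal sketch of that idea. `build_if_missing` is a hypothetical helper, not part of the original instructions.

```bash
# build_if_missing IMAGE SOURCE
# Hypothetical helper: run `apptainer build` only when IMAGE does not exist yet.
build_if_missing() {
    local image="$1" source_spec="$2"
    if [ -e "$image" ]; then
        echo "skip: $image already exists"
    else
        apptainer build "$image" "$source_spec"
    fi
}

# Example calls, using the variables from "Setting paths":
# build_if_missing "$MODEL_BASE" docker://nvcr.io/nvidia/pytorch:21.12-py3
# build_if_missing "$CONTAINER_DIR" "$DEF_DIR"
```
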

### Make a copy of the code

```
apptainer exec $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/object_detection"
```

### Downloading and preprocessing the data

```
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/data --pwd /data $CONTAINER_DIR bash -c "cp /workspace/object_detection/hashes.md5 /data/ && bash /workspace/object_detection/download_dataset.sh /data"  
```
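
The command above copies `hashes.md5` next to the downloaded data. If the download needs to be re-checked later, something like the sketch below could be used on the host. It assumes that `hashes.md5` is in the standard `md5sum` format and lists paths relative to the data directory. Neither assumption is confirmed here, and `verify_dataset` is a hypothetical helper.

```bash
# verify_dataset DATA_DIR
# Hypothetical helper: re-check the files in DATA_DIR against DATA_DIR/hashes.md5.
# Assumes md5sum-format entries with paths relative to DATA_DIR.
verify_dataset() {
    local data_dir="$1"
    # Subshell keeps the caller's working directory unchanged
    ( cd "$data_dir" && md5sum -c hashes.md5 )
}

# Example call, using the host directory bound to /data above:
# verify_dataset "${WORK_DIR}/object_detection/data"
```

The function returns a non-zero status when any listed file is missing or does not match its checksum.
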



### Running the benchmarks

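Automatic mixed precision (AMP) can be enabled by setting `DTYPE`, passed as the first argument to the benchmark scripts in the commands below, to `float16`.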

```
apptainer exec --nv $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/object_detection/"
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/train_benchmark.sh float16 1 True True
apptainer exec --nv -B ${WORK_DIR}/object_detection/data:/datasets/data -B ${WORK_DIR}/object_detection/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/inference_benchmark.sh float16 1
```
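
The training and inference commands above share the same bind mounts and working directory, so they can be wrapped in a small function. This is only a sketch: `run_benchmark` and the `DRY_RUN` switch are hypothetical conveniences, and the script arguments are passed through unchanged.

```bash
# run_benchmark SCRIPT [ARGS...]
# Hypothetical wrapper around the benchmark commands above.
# DRY_RUN=1 (an assumed convenience switch) prints the command instead of running it.
run_benchmark() {
    local script="$1"; shift
    # Rebuild the argument list as the full apptainer command line
    set -- apptainer exec --nv \
        -B "${WORK_DIR}/object_detection/data:/datasets/data" \
        -B "${WORK_DIR}/object_detection/results:/results" \
        --pwd "${WORK_DIR}" "${CONTAINER_DIR}" bash "scripts/${script}" "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example calls, equivalent to the two benchmark commands above:
# run_benchmark train_benchmark.sh float16 1 True True
# run_benchmark inference_benchmark.sh float16 1
```

Setting `DRY_RUN=1` first makes it possible to review the exact command line before starting a run.
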

### Running the benchmarks as batch jobs

```
bash $WORK_DIR/submit_benchmark_jobs.sh
```


### Known issues

#### Issue 1

The checkpoint file `results/last_checkpoint` must be removed before starting a new training benchmark run.
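
A minimal sketch of that cleanup step follows. `clear_last_checkpoint` is a hypothetical helper, and the example path assumes that the host directory bound to `/results` in the benchmark commands is where the file ends up.

```bash
# clear_last_checkpoint RESULTS_DIR
# Hypothetical helper: remove the checkpoint marker left by an earlier training run.
clear_last_checkpoint() {
    local results_dir="$1"
    # -f keeps this quiet when no earlier run left a marker behind
    rm -f -- "${results_dir}/last_checkpoint"
}

# Example call (path is an assumption based on the /results bind mount):
# clear_last_checkpoint "${WORK_DIR}/object_detection/results"
```
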

#### Issue 2

Don't use `srun` to run `apptainer exec`.