    
    ### Setting paths
    
    ```
    
    MODEL_NAME=maskrcnn_for_pytorch
    MODEL_VERSION=latest
    MODEL_BASE=/proj/nsc_testing/xuan/containers/nvidia_pytorch_21.12-py3.sif
    
    # Despite the _DIR suffix, this is the path of the container image file to build
    CONTAINER_DIR=/proj/nsc_testing/xuan/containers/${MODEL_NAME}_${MODEL_VERSION}.sif
    
    # DEF_DIR is the path of the definition file; WORK_DIR is the directory that contains it
    DEF_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/${MODEL_NAME}_${MODEL_VERSION}.def
    WORK_DIR=/proj/nsc_testing/xuan/berzelius-benchmarks/NVIDIA/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN
    
    # Create the directories that are later bind-mounted for the dataset and the results
    mkdir -p "$WORK_DIR/data" "$WORK_DIR/results"
    
    ```
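Every later command depends on these variables, so it can help to confirm that the paths exist before building. The following is a minimal sketch; `check_paths` is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper (not part of the repository): report which of the
# given paths are missing. Returns non-zero if at least one is missing.
check_paths() {
    local missing=0
    local p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with the variables defined above:
# check_paths "$DEF_DIR" "$WORK_DIR/data" "$WORK_DIR/results"
```

A non-zero return value means that at least one path was reported as missing on stderr.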
    ### Building the container
    
    ```
    
    # Pull the NVIDIA PyTorch 21.12 image and convert it to a SIF file
    apptainer build $MODEL_BASE docker://nvcr.io/nvidia/pytorch:21.12-py3
    
    # Build the model container from the definition file
    apptainer build $CONTAINER_DIR $DEF_DIR
    ```
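Both builds can take a while, so one option is to skip a build when the target image already exists. This is a sketch under that assumption; `build_if_missing` and the `APPTAINER` override variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: build an image only if the target file does not exist.
# Set APPTAINER=echo to print the build command instead of running it.
build_if_missing() {
    local image="$1"
    local source="$2"
    if [ -e "$image" ]; then
        echo "skip: $image already exists"
    else
        "${APPTAINER:-apptainer}" build "$image" "$source"
    fi
}

# Example usage with the variables defined above:
# build_if_missing "$MODEL_BASE" docker://nvcr.io/nvidia/pytorch:21.12-py3
# build_if_missing "$CONTAINER_DIR" "$DEF_DIR"
```

Delete the existing `.sif` file first if the definition file has changed and a rebuild is wanted.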
    
    
    ### Downloading and preprocessing the data
    
    ```
    
    # Copy the checksum file into the data directory, then download the dataset there
    apptainer exec --nv -B ${WORK_DIR}/data:/data --pwd /data $CONTAINER_DIR bash -c "cp /workspace/object_detection/hashes.md5 /data/ && bash /workspace/object_detection/download_dataset.sh /data"
    
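    # Copy the model code out of the container into the working directory
    apptainer exec --nv $CONTAINER_DIR bash -c "cp -a /workspace/object_detection/* ${WORK_DIR}/"
    # Run the training benchmark script from the working directory
    apptainer exec --nv -B ${WORK_DIR}/data:/datasets/data -B ${WORK_DIR}/results:/results --pwd ${WORK_DIR} $CONTAINER_DIR bash scripts/train_benchmark.sh fp16 1 True True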
    
    ```
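Since `hashes.md5` is copied into the data directory, the downloaded files could be re-checked later without repeating the download. The sketch below assumes that `hashes.md5` is in the format read by `md5sum -c`, with file names relative to the data directory, and that the listed files are still present after the download script finishes. None of this is confirmed by this README, and `verify_checksums` is a hypothetical helper.

```shell
# Hypothetical helper: verify the files listed in a checksum manifest.
# Assumes the manifest uses the md5sum format with paths relative to the directory.
verify_checksums() {
    local dir="$1"
    local manifest="${2:-hashes.md5}"
    # Run in a subshell so the caller's working directory is unchanged.
    # --quiet suppresses the "OK" line for each file that verifies.
    ( cd "$dir" && md5sum --quiet -c "$manifest" )
}

# Example usage with the variables defined above:
# verify_checksums "$WORK_DIR/data"
```

The function returns non-zero if any listed file is missing or does not match its checksum.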
    
    
    
    ### Running the benchmarks
    
    ```
    # Training benchmark
    apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python scripts/benchmark.py --mode train --gpus 1 --dim 2 --batch_size 256 --amp
    # Inference (predict) benchmark
    apptainer exec --nv -B ${WORK_DIR}/data:/data -B ${WORK_DIR}/results:/results --pwd /workspace/nnunet_pyt $CONTAINER_DIR python scripts/benchmark.py --mode predict --gpus 1 --dim 2 --batch_size 256 --amp
    ```
    
    ### Running the benchmarks as batch jobs
    
    ```
    bash submit_benchmark_jobs.sh
    ```
    
    
    ### Known issues
    
    #### Issue 1 (21.11.0)
    https://github.com/NVIDIA/DeepLearningExamples/issues/1113
    
    Running the container fails with the following error:
    ```
    ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)
    ```
    
    
    Solution 1 (does not work): `pip install pytorch-lightning==1.5.10`.
    
    Another error is raised when benchmarking in `predict` mode:
    ```
    Traceback (most recent call last):
      File "main.py", line 110, in <module>
        trainer.current_epoch = 1
    AttributeError: can't set attribute
    ```
    
    Solution 2: `pip install torchmetrics==0.6.0`. 
    
    Another error is raised:
      File "main.py", line 34, in <module>
        set_affinity(int(os.getenv("LOCAL_RANK", "0")), args.gpus, mode=args.affinity)
      File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 376, in set_affinity
        set_socket_unique_affinity(gpu_id, nproc_per_node, cores, "contiguous", balanced)
      File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 263, in set_socket_unique_affinity
        os.sched_setaffinity(0, ungrouped_affinities[gpu_id])
    OSError: [Errno 22] Invalid argument
    
    To fix this, comment out lines 32-33 of `main.py`.
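One way to apply this edit without opening an editor is `sed`. This is a minimal sketch; it assumes a writable copy of `main.py`, and `comment_out_lines` is a hypothetical helper. The line numbers may differ between versions of the file, so check them before running it.

```shell
# Hypothetical helper: comment out an inclusive range of lines in a file.
# The original file is kept next to it with a .bak suffix.
comment_out_lines() {
    local first="$1"
    local last="$2"
    local file="$3"
    sed -i.bak "${first},${last}s/^/# /" "$file"
}

# Example usage for the fix described above:
# comment_out_lines 32 33 main.py
```

To undo the change, restore the file from `main.py.bak`.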
    
    #### Issue 2 (21.11.0)
    
    Multi-node jobs are not yet supported in 21.11.0; they are supported only in the latest (nightly) version.
    
    
    #### Issue 3 (latest)
    
    ```
    ImportError: cannot import name '_compare_version' from 'torchmetrics.utilities.imports'
    ```
    
    Solution: `pip install torchmetrics==0.11.4`.