- Use as large batch_size as possible for a more stable benchmark result. For single node, use 256; for multi-node, use 128.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128, 256 takes ~2mins.
- Specify the paths for enroot cache and data, see this [page](https://gitlab.liu.se/xuagu37/run-pytorch-and-tensorflow-containers-with-nvidia-enroot#set-path-to-user-container-storage).
- (20220222) For single node case, command ```srun singularity/enroot``` gives slightly worse throughput performance compared with running ```singularity/enroot``` directly. Command ```srun enroot``` stopped working for multi-node case.
- (20220222) ```srun enroot ...```stopped working for multi-node case. Use pyxis instead. See the script ```benchmark_multi_node.sbatch```.