Commit d1e33495 (unverified) authored by Xuan Gu, committed by GitHub

Update README.md

parent 928785af
@@ -102,6 +102,6 @@ when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.
#### Notes
- Running directly via singularity shell seems to give worse performance (observed when I WFH). We should run it via an sbatch script instead.
-- It took around a week to finish all iterations of benchmarking.
+- It took around a week to finish 100 iterations of benchmarking for all sets of parameters.
- For multi-node benchmarking, we need to use the "srun" command; the line "#SBATCH --ntasks-per-node=8" also has to be added. Otherwise the process will hang.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128 takes ~2 minutes. If we want to finish it within a minute, we can reduce the number of batches from 150 (the default value) to a smaller number.
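
The multi-node note above can be sketched as a small helper that renders the sbatch script text for a given node count. This is only an illustration, not the script used for these benchmarks: the `--ntasks-per-node=8` directive and the use of `srun` come from the notes, while the function name `render_sbatch`, the job name, and the command `./run_benchmark.sh` are hypothetical placeholders.

```python
# Sketch: render a multi-node sbatch script as text.
# Only "--ntasks-per-node=8" and the use of srun come from the notes above;
# the job name and the benchmark command are hypothetical placeholders.


def render_sbatch(nodes: int, tasks_per_node: int, command: str,
                  job_name: str = "benchmark") -> str:
    """Return the text of an sbatch script that launches `command` with srun."""
    if nodes < 1 or tasks_per_node < 1:
        raise ValueError("nodes and tasks_per_node must be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # Per the notes, the multi-node run hangs without this directive.
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        "",
        # srun starts one copy of the command per task on the allocated nodes.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical command; substitute the real benchmark invocation.
script = render_sbatch(nodes=2, tasks_per_node=8, command="./run_benchmark.sh")
print(script)
```

Writing the returned text to a file and submitting it with `sbatch` would be the usual next step; generating the script this way may also help when sweeping over many parameter sets, since each combination gets the same directives.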