Unverified Commit 3a90bed7 authored by Xuan Gu, committed by GitHub

Update README.md

parent b25bc9d9
@@ -118,3 +118,4 @@ when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.
- For multi-node benchmarking, we need to use the "srun" command; also, the line "#SBATCH --ntasks-per-node=8" has to be added. Otherwise the process will hang.
- Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128 takes ~2 minutes.
If we want to finish it within a minute, we can reduce the number of batches from 150 (the default value) to a smaller number, or try a smaller dataset.
- On a single node, the max batch_size is 256; on multiple nodes, the max batch_size is 128.
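
The multi-node note above can be sketched as a Slurm batch script. This is only a minimal sketch, not the script used for these benchmarks: the node count, the `--gpus-per-node` line, and the `BENCH_CMD` placeholder are assumptions for illustration, and the real benchmark command has to be filled in.

```shell
#!/bin/bash
#SBATCH --nodes=2               # assumption: two nodes, adjust as needed
#SBATCH --ntasks-per-node=8     # one task per GPU; without this line the process hangs (see note above)
#SBATCH --gpus-per-node=8       # assumption: 8 GPUs per node, as in the single-node runs

# Hypothetical placeholder: replace with the actual benchmark command.
BENCH_CMD=(echo "replace with the real benchmark command")

if command -v srun >/dev/null 2>&1; then
    # srun starts one copy of the command per task across all allocated nodes.
    srun "${BENCH_CMD[@]}"
else
    # Not on a Slurm system: only show what would be launched.
    echo "srun not found; would run: srun ${BENCH_CMD[*]}"
fi
```

Submitting this with `sbatch` lets Slurm read the `#SBATCH` lines, while running it directly with `bash` on a machine without Slurm only prints the command that would be launched.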