diff --git a/README.md b/README.md
index 7bae338c726a3157877c9c95a2fc4bfe7ca583ff..4f274f935b8e29794e10c8ea3a6fd244b74be468 100644
--- a/README.md
+++ b/README.md
@@ -118,3 +118,4 @@ when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.
 - For multi-node benchmarking, we need to use "srun" command; also, the line "#SBATCH --ntasks-per-node=8" has to been added. Otherwise the process will hang.
 - Benchmarking with dim = 2, nodes = 1, gpus = 8, batch_size = 128 takes ~2mins. If we want to finish it within a minute, we can change the number of batches from 150 (the default value) to a smaller number. Or we can try some smaller datasets.
+- On single node, max batch_size is 256; on multi-node, max batch_size is 128.