- The expected throughput for dim = 2, nodes = 1, gpus = 1, batch_size = 256 would be 670 ± 10 (TF32).
- The expected throughput for dim = 2, nodes = 1, gpus = 8, batch_size = 256 would be 5130 ± 180 (TF32).
- The expected throughput for dim = 2, nodes = 1, gpus = 4, batch_size = 256 would be 2600 ± 100 (TF32).
- The expected throughput for dim = 2, nodes = 2, gpus = 16, batch_size = 128 would be 9300 ± 70 (TF32).
- The expected throughput for dim = 2, nodes = 1, gpus = 8, batch_size = 256 would be 5150 ± 150 (TF32).
- The expected throughput for dim = 2, nodes = 3, gpus = 24, batch_size = 128 would be 13880 ± 85 (TF32).
- The expected throughput for dim = 2, nodes = 2, gpus = 16, batch_size = 128 would be 9250 ± 150 (TF32).
- The expected throughput for dim = 2, nodes = 4, gpus = 32, batch_size = 128 would be 18500 ± 90 (TF32).
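
To put the figures above in perspective, the following minimal sketch computes how close the single-node results come to ideal linear scaling from one GPU. It is only an illustration: the helper name `scaling_efficiency` is an assumption and not part of this repository, and only the single-node, batch_size = 256 entries are used so that the comparison against the 1-GPU baseline stays like-for-like.

```python
def scaling_efficiency(throughput, gpus, single_gpu_throughput):
    """Ratio of measured throughput to ideal linear scaling from one GPU."""
    return throughput / (gpus * single_gpu_throughput)


# Single-node, batch_size = 256, TF32 figures taken from the list above.
# Where the list gives two values for one configuration, the first is used.
single_node = {1: 670, 4: 2600, 8: 5130}

baseline = single_node[1]
efficiency = {
    gpus: scaling_efficiency(throughput, gpus, baseline)
    for gpus, throughput in single_node.items()
}

for gpus, eff in sorted(efficiency.items()):
    print(f"{gpus} GPU(s): {eff:.1%} of linear scaling")
```

A value of 100% would mean perfectly linear scaling. The multi-node entries use batch_size = 128, so they would need a matching single-GPU baseline before the same ratio is meaningful.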
**Observation 3**: Ideally, throughput would increase linearly with batch_size. In practice, throughput stays below the ideal curve when batch_size > 16.