@@ -71,8 +71,12 @@ when batch_size is large (16, 32, 64, 128), throughput_amp > throughput_tf32.
- Observation 2: The coefficient of variation of throughput for the 100 iterations is smallest when batch_size = 128.
-Benchmarking with batch_size = 128 can be used for node health check. For example, the expected throughput for dim = 2, nodes = 1, gpus = 4, batch_size = 128, tf32 would be 2500 ± 200.
+Benchmarking with batch_size = 128 can be used for node health check. For example, the expected throughput for dim = 2, nodes = 1, gpus = 4, batch_size = 128 would be 2500 ± 200 (TF32) and 4100 ± 400 (AMP).
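
The node health check described here could be scripted roughly as follows. This is a minimal sketch, not part of the benchmark itself: the names `EXPECTED`, `coefficient_of_variation`, and `node_is_healthy` are hypothetical, and it assumes the ± figures are an acceptance band around the mean of the per-iteration throughputs.

```python
import statistics

# Expected throughput (center, tolerance) for dim = 2, nodes = 1, gpus = 4,
# batch_size = 128, taken from the figures quoted in the text.
# Assumption: "±" is treated as an acceptance band around the mean.
EXPECTED = {"tf32": (2500.0, 200.0), "amp": (4100.0, 400.0)}


def coefficient_of_variation(samples):
    """Return the sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def node_is_healthy(samples, precision):
    """Return True if the mean throughput lies within the expected band."""
    expected, tolerance = EXPECTED[precision]
    return abs(statistics.mean(samples) - expected) <= tolerance
```

A caller would pass the per-iteration throughputs from a benchmark run, for example `node_is_healthy(throughputs, "amp")`, and could log `coefficient_of_variation(throughputs)` alongside the result to spot unusually noisy runs.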