# Benchmarks

Performance benchmarks covering strong scaling on Gadi, NCI's supercomputer, and a CPU vs GPU runtime comparison.


## System

| Property | Value |
|---|---|
| System | Gadi (NCI) |
| CPU | Intel Xeon Platinum 8274 (Cascade Lake), 3.2 GHz |
| Cores per node | 48 (2 × 24-core sockets) |
| Compiler | gcc 14.1.0 |

## Strong Scaling

A fixed problem size of $256^3$ is distributed across an increasing number of threads. Runtime is wall-clock time in seconds.

### Results

| Threads | Runtime (s) | Speedup | Efficiency (%) |
|---|---|---|---|
| 1 | 253.6 | 1.00 | 100 |
| 2 | 123.9 | 2.00 | 100 |
| 4 | 64.3 | 3.94 | 98.6 |
| 8 | 32.8 | 7.73 | 96.7 |
| 16 | 16.8 | 15.11 | 94.4 |
| 32 | 8.7 | 29.12 | 91.1 |
| 48 | 6.5 | 38.99 | 81.2 |

$$\text{Speedup} = \frac{T_1}{T_N}, \qquad \text{Efficiency} = \frac{\text{Speedup}}{N} \times 100\%$$
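As a sanity check, the speedup and efficiency columns can be recomputed directly from the measured runtimes. A minimal Python sketch (runtimes taken from the table above; the last digit may differ slightly from the table, which was presumably computed from unrounded timings):

```python
# Recompute speedup and efficiency from the measured wall-clock runtimes.
runtimes = {1: 253.6, 2: 123.9, 4: 64.3, 8: 32.8, 16: 16.8, 32: 8.7, 48: 6.5}

t1 = runtimes[1]  # single-thread baseline T_1
for n, tn in runtimes.items():
    speedup = t1 / tn                # S_N = T_1 / T_N
    efficiency = speedup / n * 100   # in percent
    print(f"{n:>2} threads: speedup {speedup:6.2f}, efficiency {efficiency:5.1f}%")
```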

*Figure: strong scaling speedup versus thread count.*

The dashed line shows ideal linear speedup. Deviation from ideal is expected due to communication overhead and non-parallelisable portions of the code (Amdahl’s Law).
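Amdahl's Law can also be inverted to estimate the serial fraction implied by the measured scaling. The sketch below is an illustration, not part of the benchmark code; it uses the 48-thread measurement and the model $S(N) = 1/(s + (1-s)/N)$:

```python
# Estimate the serial fraction s from Amdahl's Law:
#   S(N) = 1 / (s + (1 - s)/N)  =>  s = (1/S - 1/N) / (1 - 1/N)
N = 48
S = 253.6 / 6.5  # measured speedup at 48 threads

s = (1 / S - 1 / N) / (1 - 1 / N)
print(f"implied serial fraction: {s:.3%}")     # roughly 0.5%
print(f"Amdahl ceiling (N -> inf): {1 / s:.0f}x")
```

Even a serial fraction of half a percent is enough to pull efficiency down to ~81% at 48 threads, consistent with the table.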


## CPU vs GPU Runtime

The following benchmark measures the wall-clock runtime of a fixed test run across three platforms:

  • Grid size: $256^3$
  • Steps: 1000

All runs use the same initial conditions, output settings, and build flags, differing only in the platform and parallelisation backend.

### Results

| Platform | Compiler / Runtime | Threads | Runtime (s) |
|---|---|---|---|
| Apple M4 (2024) | clang 17 | 10 | 1050 |
| Intel Xeon 8268 (2019) | gcc 14 | 12 | 1183 |
| Nvidia V100-SXM2 (2018) | CUDA 12.9 | – | 27.3 |

*Figure: CPU vs GPU runtime comparison.*

### GPU Speedup

The V100 GPU completes the same run in about 27 s, compared with more than 1000 s on the Apple M4 and the Xeon 8268. This corresponds to a speedup of roughly 38× over the M4 and 43× over the Xeon.
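The quoted speedups follow directly from the runtime table. A quick check, using the figures above:

```python
# GPU speedup relative to each CPU platform (runtimes from the table above).
gpu = 27.3  # V100 runtime in seconds
cpus = {"Apple M4": 1050, "Intel Xeon 8268": 1183}

for name, t in cpus.items():
    print(f"{name}: {t / gpu:.1f}x faster on the V100")
```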

The speedup is driven primarily by the GPU-accelerated FFT (cuFFT), which dominates the cost of each time step at this grid size. The benefit is expected to grow further at larger $N$, where the FFT cost scales as $N^3 \log N$ and the GPU’s parallelism is more fully utilised.
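To illustrate the $N^3 \log N$ growth: doubling the grid edge from $256$ to $512$ increases the FFT work by roughly 9×. A back-of-envelope sketch (the $512^3$ grid size is illustrative, not a benchmarked configuration):

```python
import math

def fft_work(n):
    # O(N^3 log N) operation count for a 3-D FFT on an N^3 grid
    return n**3 * math.log2(n)

print(fft_work(512) / fft_work(256))  # 9.0
```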


## Conclusions

  • Strong scaling is efficient (≥90%) up to 32 threads; at the full 48-core node, efficiency drops to ~81% as communication overhead and the non-parallelisable fraction of the code begin to dominate.
  • On a $256^3$ grid, the V100 GPU outperforms both CPU platforms by roughly 40×, driven primarily by the cuFFT-accelerated FFT.
