ECCOMAS 2024

Performance optimality in GPU-based, mesh-refined LBM codes

  • Latt, Jonas (Université de Genève)
  • Coreixas, Christophe (Université de Genève)

Lattice Boltzmann applications are well known to be memory bound on GPUs. The achieved performance is therefore very similar across a variety of collision models, as their computation is masked by memory access, and also across different streaming schemes, as long as they limit the volume of memory access to an equal extent. The performance of an LBM code executed on a non-uniform, octree-type mesh is, however, more intricate. In this case, a potentially severe overhead can be incurred by the handling of communication across the interfaces between mesh levels, the manipulation of neighbor lists for unstructured mesh communication, and complex memory access patterns that may not be suited to the memory access mechanisms of GPUs. In this work, we carefully analyze each of these elements and show how their impact can be substantially limited in most use scenarios, leading to close-to-ideal performance under the bandwidth-limited conditions of the GPU. To establish a baseline for the performance to be expected from a non-uniform GPU LB code, we use the straightforward grid-refinement approach by Rohde et al., which, at realistic Reynolds and Mach numbers, can lead to results of comparatively good quality. This is illustrated in the case of 2D aerodynamic and aeroacoustic studies. To address the most severe limitation to ideal performance in a mesh-refined simulation, the management of neighbor-link lists, we consider two solutions: (1) recursive parsing of an octree structure on the GPU, avoiding the storage of neighbor links altogether, and (2) a compact storage scheme for neighbor links, which limits their memory footprint substantially.
In the second approach, for example, we show that in weakly compressible flow simulations a non-uniform mesh performs only slightly worse than a uniform one in terms of raw cell-processing performance, with a performance loss (in lattice-site updates per second) of 25% at single precision and 10% at double precision. Our performance analysis is carried out on multiple 2D and 3D use scenarios. It shows that the overhead of using a non-uniform mesh can to a large extent be eliminated in a careful implementation of the LBM. This observation is further validated through the use of a compressed-memory streaming scheme, the AA-pattern, thus achieving optimality in terms of memory usage as well as processing performance.
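
The bandwidth-limited reasoning above can be made concrete with a back-of-the-envelope estimate. The following Python sketch is not taken from the authors' code. It assumes the common model in which each lattice-site update reads and writes all q populations once, and it adds a hypothetical, uncompressed neighbor list with one index per link. The function names, the D3Q19 lattice, the 4-byte index size and the 1000 GB/s bandwidth figure are all illustrative assumptions.

```python
def bytes_per_update(q, real_bytes, index_bytes_per_link=0):
    """Estimated memory traffic (bytes) for one lattice-site update.

    Assumes q populations are read once and written once, plus an
    optional neighbor index read for each of the q links.
    """
    return 2 * q * real_bytes + q * index_bytes_per_link


def peak_mlups(bandwidth_gb_s, q, real_bytes, index_bytes_per_link=0):
    """Bandwidth-limited ceiling in million lattice-site updates per second."""
    traffic = bytes_per_update(q, real_bytes, index_bytes_per_link)
    return bandwidth_gb_s * 1e9 / traffic / 1e6


def relative_loss(q, real_bytes, index_bytes_per_link):
    """Fraction of the uniform-mesh ceiling lost to neighbor-index traffic."""
    uniform = bytes_per_update(q, real_bytes)
    indexed = bytes_per_update(q, real_bytes, index_bytes_per_link)
    return 1.0 - uniform / indexed


if __name__ == "__main__":
    # Hypothetical GPU with 1000 GB/s of memory bandwidth, D3Q19 lattice.
    for name, real_bytes in (("single", 4), ("double", 8)):
        ceiling = peak_mlups(1000.0, 19, real_bytes)
        loss = relative_loss(19, real_bytes, index_bytes_per_link=4)
        print(f"{name}: ceiling {ceiling:.0f} MLUPS, "
              f"naive neighbor-list loss {loss:.0%}")
```

Under these assumptions, a naive list of 4-byte indices costs one third of the ceiling at single precision and one fifth at double precision. A fixed-size index therefore weighs more heavily when the populations are smaller, which is qualitatively in line with the larger single-precision loss quoted above. It also suggests why avoiding or compacting the neighbor-link storage is the main lever for approaching uniform-mesh performance.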