ECCOMAS 2024

GPU Optimisation of a Finite Element Code for Incompressible Flow

  • Owen, Herbert (Barcelona Supercomputing Centre (BSC))
  • Ernst, Dominik (NHR@FAU)
  • Lehmkuhl, Oriol (Barcelona Supercomputing Centre (BSC))
  • Hager, Georg (NHR@FAU)
  • Wellein, Gerhard (NHR@FAU)


We present a detailed description of the optimisation of the momentum assembly for the incompressible flow module of the Alya low-order finite element code. Alya is a high-performance computational mechanics code for complex coupled multi-physics problems, developed by engineers, physicists and computational experts at the Barcelona Supercomputing Center. It is one of the two CFD codes of the Unified European Applications Benchmark Suite (UEABS) and of the Accelerator benchmark suite of PRACE. In this work, we focus on scale-resolving simulations solved with a fractional step scheme that uncouples the momentum and continuity equations and treats the momentum equation explicitly [1]. For such problems, the two main computational kernels are the momentum assembly, analysed in this work, and the solution of the Poisson system for the pressure, for which we found that the optimal approach is to rely on external Algebraic Multigrid libraries such as PSCToolkit. The optimisation targets GPU architectures using OpenACC, but we have found that most of the work also benefits CPUs. The analysis shows that the large number of intermediate values, combined with the semantics of globally allocated temporary arrays, is the root of all performance problems on the GPU. The enhancements fall into three categories. Restructure: determine which values are computed at what time and in which order. Specialise: give up some generality that is rarely used or can be recovered at compile time. Privatise: keep intermediate results in private arrays instead of allocating large global vectors. A roofline model is used to show how each modification improves performance on the GPU. Combined, these improvements lead to a speedup of more than 50x on an NVIDIA A100 GPU and a 5x speedup on the CPU. The final version is much more energy efficient on the GPU than on the CPU, as one would expect.
We believe the observed anti-patterns and their solutions are transferable to other code bases with a similar development history.
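
As an illustration of the "specialise" and "privatise" categories described in the abstract, the following is a minimal C++ sketch and not code from Alya (which is not reproduced here). The toy element operation, the function names assemble_global_temps and assemble_private_temps, and the connectivity layout are all hypothetical. The first function stores its intermediates in one large array sized by the number of elements; the second fixes the number of nodes per element at compile time and keeps the intermediates in a small per-element array. In an OpenACC port the element loop of the second version would presumably be the offloaded loop, with atomic updates for the scatter; the directives are left out so that the sketch runs on any CPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy "assembly": each element gathers its nodal values,
// forms an intermediate per-node result, and scatters it into a global RHS.

// Baseline pattern: intermediates live in one large, globally allocated
// array of size nelem * nnode, and nnode is only known at run time.
std::vector<double> assemble_global_temps(const std::vector<int>& conn,
                                          const std::vector<double>& u,
                                          std::size_t nnode) {
    assert(nnode > 0 && conn.size() % nnode == 0);
    const std::size_t nelem = conn.size() / nnode;
    std::vector<double> tmp(nelem * nnode);  // large global temporary
    std::vector<double> rhs(u.size(), 0.0);

    // Pass 1: compute intermediates for all elements and store them.
    for (std::size_t e = 0; e < nelem; ++e) {
        double mean = 0.0;
        for (std::size_t a = 0; a < nnode; ++a) mean += u[conn[e * nnode + a]];
        mean /= static_cast<double>(nnode);
        for (std::size_t a = 0; a < nnode; ++a)
            tmp[e * nnode + a] = mean * u[conn[e * nnode + a]];
    }
    // Pass 2: read the intermediates back and scatter them.
    for (std::size_t e = 0; e < nelem; ++e)
        for (std::size_t a = 0; a < nnode; ++a)
            rhs[conn[e * nnode + a]] += tmp[e * nnode + a];
    return rhs;
}

// Specialised and privatised pattern: NNODE is a compile-time constant and
// the intermediates are a small per-element array that can stay in registers.
template <std::size_t NNODE>
std::vector<double> assemble_private_temps(const std::vector<int>& conn,
                                           const std::vector<double>& u) {
    static_assert(NNODE > 0, "an element needs at least one node");
    assert(conn.size() % NNODE == 0);
    const std::size_t nelem = conn.size() / NNODE;
    std::vector<double> rhs(u.size(), 0.0);

    for (std::size_t e = 0; e < nelem; ++e) {
        std::array<double, NNODE> ue;  // private to this element
        double mean = 0.0;
        for (std::size_t a = 0; a < NNODE; ++a) {
            ue[a] = u[conn[e * NNODE + a]];
            mean += ue[a];
        }
        mean /= static_cast<double>(NNODE);
        // Gather, compute and scatter happen in one pass over the element.
        for (std::size_t a = 0; a < NNODE; ++a)
            rhs[conn[e * NNODE + a]] += mean * ue[a];
    }
    return rhs;
}
```

Both functions produce the same right-hand side. For two 2-node elements with connectivity {0, 1, 1, 2} and nodal values {1, 2, 3}, calling assemble_global_temps(conn, u, 2) and assemble_private_temps<2>(conn, u) returns the same vector. The second version never writes its intermediates to a global array, which is the kind of memory traffic the abstract identifies as the main GPU bottleneck.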