ECCOMAS 2024

A Distributed Memory Tri-diagonal Solver Optimised for CPU and GPU Architectures

Akkurt, Semih (Imperial College London)
Laizet, Sylvain (Imperial College London)

In session: MS146B - Advanced Parallel Algorithms for Extreme-Scale Simulations II

Many discretisations commonly used to solve PDEs on structured grids result in tridiagonal systems of equations. A typical example is Alternating Direction Implicit (ADI) based methods, where a batch of tridiagonal systems is solved per direction. However, our primary focus in this work is compact finite difference schemes, where obtaining derivatives or interpolations requires the solution of a tridiagonal system due to the space-implicit coupling in the scheme [1]. Compact schemes are used in a number of frameworks, including Xcompact3D, where most operations require the solution of a batch of tridiagonal systems [2].

Xcompact3D uses a 2D-pencil domain decomposition strategy for parallelisation. Operations are carried out along a given direction using a sequential Thomas algorithm, and an all-to-all communication is then used to transpose the fields and the pencil decomposition so that operations in a different direction can again be carried out with the Thomas algorithm. Our recent work on porting Xcompact3D to GPUs necessitated a change in this strategy because node-to-node communication is a relative bottleneck on modern GPU clusters. We developed a customised distributed-memory tridiagonal solver based on existing strategies [3, 4] to eliminate the all-to-all communications, and proposed a data structure that enables a high-performance implementation of the algorithm on CPU and GPU architectures. Our performance analyses demonstrate that the distributed algorithm can utilise up to 75% of the available bandwidth on CPUs and GPUs. Moreover, strong scaling efficiency is 83% on ARCHER2 (2x AMD EPYC 7742 CPUs per node) from 1 to 64 nodes, and 68% on CIRRUS (4x NVIDIA V100 GPUs per node) from 1 to 16 nodes.
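
For readers unfamiliar with the sequential Thomas algorithm mentioned above, the following is a minimal sketch of the textbook method applied to a batch of right-hand sides that share one tridiagonal matrix. It is not the distributed solver described in this abstract, and the function names (thomas_solve, solve_batch) and the argument layout are illustrative assumptions rather than anything taken from Xcompact3D. The sketch does no pivoting, so it assumes a diagonally dominant matrix.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the sequential Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is ignored), diag[i] multiplies x[i],
    and upper[i] multiplies x[i+1] (upper[n-1] does not affect the result).
    Assumption: the matrix is diagonally dominant, since no pivoting is done.
    """
    n = len(diag)
    c_prime = [0.0] * n  # modified upper-diagonal coefficients
    d_prime = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the lower diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: recover the unknowns from the last row upwards.
    x = [0.0] * n
    x[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def solve_batch(lower, diag, upper, rhs_batch):
    """Solve many systems that share one matrix, one right-hand side at a time."""
    return [thomas_solve(lower, diag, upper, rhs) for rhs in rhs_batch]
```

Both sweeps carry a dependency from one row to the next, so the parallelism in a batch comes from the independent systems rather than from within a single solve. This is also why a line split across several processes cannot be handled by this sequential form without either transposing the data or using a distributed variant.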