ECCOMAS 2024

Lineal: An Efficient, Hybrid-Parallel Linear Algebra Library

  • Böhm, Kurt (TU Clausthal, Institute of Mathematics)
  • Ippisch, Olaf (TU Clausthal, Institute of Mathematics)

Please login to view abstract download link

Efficiently solving large sparse systems of linear equations arising from the discretization of PDEs is still a challenging problem. To solve large problems on attainable hardware, the new linear algebra library Lineal has been developed. Lineal uses the preconditioned CG method as its main solver, with an Algebraic Multigrid solver (an optimized version of the AMG from DUNE ISTL) as its main preconditioner. However, Lineal uses a number of techniques to achieve very low memory requirements while providing low runtimes as well as generic and extensible interfaces. For stencil-based problems, Lineal can compute the matrix elements on the finest grid on the fly and only needs to store the coarse grid hierarchy explicitly. In this case, only a single value (a single byte for some problems) per cell is needed on the finest grid, which drastically reduces memory consumption compared to explicit matrices. Additionally, matrix-vector products are computed using tiling to improve cache utilization. Alternatively, Compressed Row Storage (CRS) matrices can be used, which support indices consisting of an arbitrary number of bytes to reduce memory consumption. Furthermore, floating point types can be mixed (almost) arbitrarily to save memory. Elementary operations are represented as classes that perform element-wise computations, using inlining to combine operations efficiently. Almost all components are fully multithreaded and use explicit SIMD operations to improve performance. Additionally, recent work has added support for distributed memory parallelism using MPI, allowing for hybrid-parallel computations that utilize a compute cluster while minimizing communication costs. Lineal has been successfully used to simulate oxygen diffusion in X-ray scans of soil samples, solving instances with more than 10⁹ unknowns in 10 to 120 minutes on a single AMD EPYC system with 32 cores and 256 GB of RAM. Further tests using this problem show that Lineal performs well compared to existing libraries in terms of runtime and memory consumption. Tests demonstrating its scalability on larger compute clusters using hybrid parallelism will be shown as well.