ECCOMAS 2024

Hardware aware matrix-free approach for accelerating finite-element discretized eigenvalue problems: Application to large-scale Kohn-Sham density functional theory

Panigrahi, Gourab (Indian Institute of Science)
Motamarri, Phani (Indian Institute of Science)

In session: MS146B - Advanced Parallel Algorithms for Extreme-Scale Simulations II


The finite-element (FE) discretization of a partial differential equation usually involves constructing an FE discretized operator and computing its action on trial FE discretized fields, so that the resulting linear system of equations or eigenvalue problem can be solved with iterative solvers. This action is traditionally computed using global sparse matrix-vector multiplication algorithms. However, recent hardware-aware algorithms for evaluating such higher-order FE discretized matrix-vector multiplications suggest that computing the matrix-vector products on the fly, without building and storing the cell-level dense matrices, reduces both arithmetic complexity and memory footprint; these are referred to as matrix-free approaches. These approaches exploit the tensor-structured nature of the FE polynomial basis for evaluating the underlying integrals. The current state-of-the-art matrix-free implementations deal with the action of the FE discretized matrix on a single vector, and are neither optimal nor readily applicable for matrix multi-vector products involving a large number of vectors. We discuss a computationally efficient and scalable matrix-free algorithm and implementation strategies to compute the FE discretized matrix multi-vector products on multi-node GPU architectures. We use batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and allowing for parallelization over multiple batches. We devise an algorithm that overlaps compute and data movement and uses GPU shared memory, constant memory, and kernel fusion to reduce data accesses to and from device memory, together with registers to reduce bank conflicts. Further, we propose a strategy in which both registers and shared memory are utilized to mitigate memory constraints.
We first benchmark the performance of our implementation using a representative FE discretized matrix acting on multivectors of various sizes on multi-node GPU architectures and observe significant gains over the closest baseline implementation. Further, the usefulness of the proposed approach is demonstrated in accelerating the large-scale eigenvalue problem arising in FE discretized Kohn-Sham DFT, a challenging problem in quantum modeling of materials.
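
The tensor-structured, matrix-free idea described in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' algorithm or GPU implementation. It assumes a toy two-dimensional cell operator of Kronecker form A ⊗ B, and all function names (apply_matrix_free, apply_batched, kron, apply_assembled) are hypothetical. The sketch applies the operator to a block of vectors without ever forming the assembled matrix, processing the vectors in batches of a chosen size, and includes an assembled reference path for comparison.

```python
def matmul(P, Q):
    """Dense product of two small matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


def apply_matrix_free(A, B, x):
    """Return (A kron B) x without forming the Kronecker product.

    With row-major ordering x[i*n + j] = X[i][j], the product equals
    Y = A X B^T, which costs O(n^3) instead of O(n^4) operations.
    """
    n = len(A)
    X = [x[i * n:(i + 1) * n] for i in range(n)]   # reshape x to n x n
    Y = matmul(matmul(A, X), transpose(B))          # two small 1D contractions
    return [value for row in Y for value in row]    # flatten back to a vector


def apply_batched(A, B, vectors, batch_size):
    """Apply the operator to many vectors, one batch at a time.

    Assumption: batches are processed sequentially here. On a GPU each
    batch would map to independent parallel work; that is not modeled.
    """
    out = []
    for start in range(0, len(vectors), batch_size):
        batch = vectors[start:start + batch_size]
        out.extend(apply_matrix_free(A, B, x) for x in batch)
    return out


def kron(A, B):
    """Assembled Kronecker product, used only as a reference."""
    n, m = len(A), len(B)
    return [[A[r // m][c // m] * B[r % m][c % m] for c in range(n * m)]
            for r in range(n * m)]


def apply_assembled(K, x):
    """Reference matrix-vector product with the assembled matrix K."""
    return [sum(K[r][c] * x[c] for c in range(len(x))) for r in range(len(K))]


if __name__ == "__main__":
    A = [[2, 1], [1, 3]]
    B = [[1, 4], [0, 5]]
    vectors = [[1, 2, 3, 4], [0, 1, 0, 1], [5, 6, 7, 8]]
    result = apply_batched(A, B, vectors, batch_size=2)
    reference = [apply_assembled(kron(A, B), x) for x in vectors]
    print(result == reference)
```

The design point is that the assembled path stores an n² × n² matrix per cell, while the matrix-free path stores only the two n × n factors and reuses them across every vector in a batch. The batch size in this sketch is an arbitrary illustrative value rather than one tuned to any hardware.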