PPoPP 2021
Sat 27 February - Wed 3 March 2021
Tue 2 Mar 2021 13:06 - 13:12 - Session 6. Posters 1 Chair(s): Adam Morrison

Dense linear algebra kernels are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad family of programs. Based on the X-partitioning abstraction, our method explicitly captures inter-statement dependencies. Applying our analysis to LU factorization, we derive COnfLUX, an LU algorithm with a parallel I/O cost of $N^3 / (P \sqrt{S})$ communicated elements per processor, only $1/3\times$ over our established lower bound. We evaluate COnfLUX on various problem sizes, demonstrating empirical results that match our theoretical analysis: it communicates less than Cray ScaLAPACK, SLATE, and the asymptotically optimal CANDMC library. Running on $1{,}024$ nodes of Piz Daint, COnfLUX communicates $1.6\times$ less than the second-best implementation and is expected to communicate $2.1\times$ less on a full-scale run on Summit.
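
To give a feel for the cost expression $N^3 / (P \sqrt{S})$, the following minimal sketch evaluates it for a few parameter choices. The abstract does not define the symbols, so the sketch assumes the usual reading: $N$ is the matrix dimension, $P$ the number of processors, and $S$ the size of each processor's local memory in elements. The function name `conflux_io_per_processor` is hypothetical, and the sketch computes only the leading-order term stated above, not the lower bound or any measured volume.

```python
import math


def conflux_io_per_processor(n: int, p: int, s: int) -> float:
    """Leading-order I/O cost N^3 / (P * sqrt(S)), in elements per processor.

    Assumed meanings (not defined in the abstract):
      n: matrix dimension (N x N input)
      p: number of processors
      s: local memory size per processor, in elements
    """
    if n <= 0 or p <= 0 or s <= 0:
        raise ValueError("n, p, and s must all be positive")
    return n**3 / (p * math.sqrt(s))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the evaluation.
    base = conflux_io_per_processor(4096, 64, 2**20)
    more_procs = conflux_io_per_processor(4096, 128, 2**20)
    more_memory = conflux_io_per_processor(4096, 64, 2**22)
    print(f"N=4096, P=64,  S=2^20: {base:.0f} elements")
    print(f"N=4096, P=128, S=2^20: {more_procs:.0f} elements")
    print(f"N=4096, P=64,  S=2^22: {more_memory:.0f} elements")
```

Under these assumptions, doubling $P$ halves the per-processor volume, while halving it through memory alone requires quadrupling $S$, since the cost scales with $1/\sqrt{S}$.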