Dense linear algebra kernels are fundamental components of many scientific computing applications. In this work, we present a novel method for deriving parallel I/O lower bounds for this broad family of programs. Based on the X-partitioning abstraction, our method explicitly captures inter-statement dependencies. Applying our analysis to LU factorization, we derive COnfLUX: an LU algorithm with a parallel I/O cost of $N^3 / (P \sqrt{S})$ communicated elements per processor, only $1/3\times$ over our established lower bound. We evaluate COnfLUX on various problem sizes, demonstrating empirical results that match our theoretical analysis and communicating less than Cray ScaLAPACK, SLATE, and the asymptotically optimal CANDMC library. Running on $1{,}024$ nodes of Piz Daint, COnfLUX communicates $1.6\times$ less than the second-best implementation and is expected to communicate $2.1\times$ less on a full-scale run on Summit.
Marquita Ellis (University of California at Berkeley & Lawrence Berkeley National Lab); Aydın Buluç (University of California at Berkeley & Lawrence Berkeley National Lab); Katherine Yelick (University of California at Berkeley & Lawrence Berkeley National Lab)