Synthesizing Optimal Collective Algorithms
Collective communication algorithms are an important component of distributed computation. In deep learning, for example, collective communication is the Amdahl's-law bottleneck of data-parallel training.
This paper introduces SCCL (Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode the synthesis problem as a quantifier-free SMT formula that can be discharged to a theorem prover, and shows how this carefully built encoding enables SCCL to scale.
We synthesize novel latency- and bandwidth-optimal algorithms not seen in the literature on two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate performance competitive with hand-optimized collective communication libraries.
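To make the synthesis idea concrete, here is a toy illustration (not SCCL's actual SMT encoding, and `ring_allgather_steps` is a hypothetical name): a time-stepped model of allgather on a unidirectional ring, where each node starts with one chunk and may forward one chunk per step to its right neighbor. A solver-based synthesizer would declare "chunk c has reached node v by step t" as Boolean variables and ask the solver for the fewest steps; here the well-known ring schedule simply makes the step bound concrete.

```python
def ring_allgather_steps(n):
    """Simulate the ring allgather schedule on n nodes and return the
    number of steps until every node holds every chunk. Bandwidth
    constraint: each node sends at most one chunk per step, over the
    single link to its right neighbor."""
    have = [{i} for i in range(n)]  # chunks currently held by each node
    steps = 0
    while not all(len(h) == n for h in have):
        # At step t, node i forwards the chunk that originated t hops
        # to its left; it received that chunk on the previous step.
        for i in range(n):
            chunk = (i - steps) % n
            have[(i + 1) % n].add(chunk)
        steps += 1
    return steps

# On a unidirectional ring, some chunk must travel n-1 hops over
# one-chunk-per-step links, so n-1 steps is optimal for allgather.
for n in range(2, 8):
    assert ring_allgather_steps(n) == n - 1
```

The bandwidth constraint (one chunk per link per step) is exactly the kind of resource limit that, in a solver-based formulation, becomes a cardinality constraint over the send variables at each step.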
Mon 1 Mar, 11:10 - 12:10 (Eastern Time, US & Canada)
Session 2: Compilers, Analysis, Synthesis (Main Conference)
Chair(s): Milind Chabbi (Uber Technologies)
Synthesizing Optimal Collective Algorithms
Zixian Cai (Australian National University), Zhengyang Liu (University of Utah), Saeed Maleki (Microsoft Research), Madan Musuvathi (Microsoft Research), Todd Mytkowicz (Microsoft Research), Jacob Nelson (Microsoft Research), Olli Saarikivi (Microsoft Research, Redmond)
Parallel Binary Code Analysis
Xiaozhu Meng (Rice University), Jonathon Anderson (Rice University), John Mellor-Crummey (Rice University), Mark W. Krentel (Rice University), Barton P. Miller (University of Wisconsin - Madison), Srđan Milaković (Rice University)
Compiler Support for Near Data Computing
Mahmut Taylan Kandemir (Penn State University, USA), Jihyun Ryoo (Penn State University, USA), Xulong Tang (University of Pittsburgh, USA), Mustafa Karakoy (TUBITAK-BILGEM, Turkey)
Scaling Implicit Parallelism via Dynamic Control Replication
Michael Bauer (NVIDIA), Wonchan Lee (NVIDIA), Elliott Slaughter (SLAC National Accelerator Laboratory), Zhihao Jia (Carnegie Mellon University), Mario Di Renzo (Sapienza University of Rome), Manolis Papadakis (NVIDIA), Galen Shipman (Los Alamos National Laboratory), Patrick McCormick (Los Alamos National Laboratory), Michael Garland (NVIDIA), Alex Aiken (Stanford University)