Written by Di Wu, Hongshi Tan, Hanzhang Yang, Bingsheng He, and Qizhen Zhang.
MGI: A Communication Framework for Data Processing in Massive GPU Infrastructures
Abstract
This paper presents MGI, a general communication framework for performing data processing tasks in massive GPU infrastructures. Inter-GPU data transfer performance is crucial to multi-GPU data processing, and existing solutions repeatedly implement the same set of communication optimizations. MGI identifies these techniques and applies them judiciously behind a simple interface. MGI is enabled by (1) a central controller that models relevant hardware resources as an annotated graph and automates infrastructure-level optimizations to construct transfer plans, and (2) a scalable data plane in which buffers and executors are carefully designed to incorporate device- and link-level optimizations to execute data transfers efficiently. Our experiments on a variety of GPU infrastructures and workloads show that MGI significantly improves multi-GPU data processing performance compared to existing frameworks. This work is open-sourced on GitHub.
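To make the idea of an annotated hardware graph concrete, the following is a minimal sketch and not MGI's actual controller or API. It assumes a graph whose nodes are devices (GPUs and NICs) and whose edges are annotated with link bandwidth, and it picks the route with the largest bottleneck bandwidth (a widest-path search). The device names, the bandwidth figures, and the function `widest_path` are all hypothetical and chosen only for illustration.

```python
import heapq

# Hypothetical topology: two machines with two GPUs each, joined through NICs.
# Edge annotations are illustrative link bandwidths in GB/s, not measured values.
topology = {
    "gpu0": {"gpu1": 50.0, "nic0": 16.0},
    "gpu1": {"gpu0": 50.0, "nic0": 16.0},
    "nic0": {"gpu0": 16.0, "gpu1": 16.0, "nic1": 12.5},
    "nic1": {"nic0": 12.5, "gpu2": 16.0, "gpu3": 16.0},
    "gpu2": {"nic1": 16.0, "gpu3": 50.0},
    "gpu3": {"nic1": 16.0, "gpu2": 50.0},
}


def widest_path(graph, src, dst):
    """Return (bottleneck_bandwidth, path) for the route from src to dst
    whose slowest link is as fast as possible."""
    best = {src: float("inf")}   # best known bottleneck bandwidth per node
    prev = {}                    # predecessor on the best route found so far
    heap = [(-float("inf"), src)]  # max-heap via negated bandwidth

    while heap:
        neg_bw, node = heapq.heappop(heap)
        bw = -neg_bw
        if node == dst:
            break
        if bw < best.get(node, 0.0):
            continue  # stale heap entry
        for nbr, link_bw in graph.get(node, {}).items():
            cand = min(bw, link_bw)  # bottleneck if we extend through this link
            if cand > best.get(nbr, 0.0):
                best[nbr] = cand
                prev[nbr] = node
                heapq.heappush(heap, (-cand, nbr))

    if dst not in best:
        return 0.0, []  # unreachable

    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return best[dst], path


if __name__ == "__main__":
    print(widest_path(topology, "gpu0", "gpu1"))
    print(widest_path(topology, "gpu0", "gpu3"))
```

In this toy topology, a transfer between GPUs on the same machine uses the direct link, while a cross-machine transfer is limited by the slower NIC-to-NIC link. A real transfer plan would likely also account for factors this sketch ignores, such as contention between concurrent transfers and splitting one transfer across multiple routes.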