GraphPipe: Improving the Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Bibliographic Details
Main Author: Kim, Sunghyun
Other Authors: Alizadeh, Mohammad
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156292
Description
Summary: Deep neural networks (DNNs) continue to grow rapidly in size, making it infeasible to train them on a single device. To address this challenge, current DNN training systems apply pipeline-parallel techniques. They split a DNN into multiple stages, construct a pipeline of these stages, and assign each stage to a distinct device. Multiple devices, each storing a partial segment of the DNN, perform their respective operations in sequence to train the whole model. Applying pipeline-parallel techniques makes it feasible to train large-scale DNNs, yet there is still room for improvement. Existing approaches only consider sequential pipeline stages and thus ignore the inherent topology of the DNN being trained. For example, when a DNN's architecture has computationally independent parallel branches, the serial execution mandated by sequential pipeline stages unnecessarily lengthens the time needed to process training data. This shortcoming leaves model-parallel opportunities untapped, resulting in suboptimal training throughput. In this paper, we develop graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline schemes. By constructing the pipeline based on the DNN topology, GPP enables concurrent execution of computationally independent DNN segments. GPP then optimizes micro-batch schedules for these stages and parallelizes large-scale DNN training across multiple devices. We show that GPP achieves reduced memory consumption and improved training throughput. We also develop GraphPipe, a distributed system that leverages GPP strategies to enable performant and scalable DNN training. Evaluation on a variety of DNNs demonstrates that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6×. Despite the fact that GPP involves a much larger search space of parallelization strategies, GraphPipe reduces the search time by 9–21× compared to PipeDream and Piper.
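
To illustrate the core idea in the abstract, the following is a minimal sketch (not the GraphPipe implementation or its API): it models pipeline stages as nodes of a dependency DAG and compares the per-micro-batch completion time of a graph-aware schedule, where independent branches run concurrently on separate devices, against a purely sequential pipeline that serializes all stages. The stage names, execution times, device assignments, and use of Python's graphlib are illustrative assumptions.

```python
# Illustrative sketch only: a toy DAG of pipeline stages for one micro-batch.
# Stage names, times, and device assignments are hypothetical, not GraphPipe's.
from graphlib import TopologicalSorter

# Hypothetical DNN with two computationally independent branches.
# Each stage maps to: (dependencies, execution time per micro-batch, device).
stages = {
    "embed":    (set(),                    1.0, "gpu0"),
    "branch_a": ({"embed"},                2.0, "gpu1"),
    "branch_b": ({"embed"},                2.0, "gpu2"),
    "head":     ({"branch_a", "branch_b"}, 1.0, "gpu3"),
}

def dag_makespan(stages):
    """Earliest finish time when independent stages run concurrently on
    their own devices; dependencies are the only ordering constraint."""
    finish = {}
    order = TopologicalSorter({s: deps for s, (deps, _, _) in stages.items()})
    for s in order.static_order():
        deps, t, _ = stages[s]
        start = max((finish[d] for d in deps), default=0.0)
        finish[s] = start + t
    return max(finish.values())

def sequential_makespan(stages):
    """A sequential pipeline runs stages one after another, even when two
    stages (e.g., branch_a and branch_b) do not depend on each other."""
    return sum(t for _, t, _ in stages.values())

print("graph pipeline (GPP-style):", dag_makespan(stages))        # 4.0
print("sequential pipeline:       ", sequential_makespan(stages)) # 6.0
```

In this toy example, the graph-aware schedule finishes the micro-batch in 4 time units versus 6 for the serialized pipeline, which is the latency benefit the abstract attributes to executing independent branches concurrently. The actual system additionally searches over stage partitions, device assignments, and micro-batch schedules, and accounts for memory consumption, which this sketch does not model.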