Distributed deep learning training using silicon photonic switched architectures
The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to tackle the challenges and accelerate distributed deep learning training. In addition, our proposed system architecture utilizes a highly integrated operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate the feasibility of our proposal, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show up to 3.6× improvements over the static non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree and a further 11% improvement when higher-layer bandwidth steering is employed. The collective results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.
Main Authors: | Ziyi Zhu, Min Yee Teh, Zhenguo Wu, Madeleine Strom Glick, Shijia Yan, Maarten Hattink, Keren Bergman |
---|---|
Format: | Article |
Language: | English |
Published: | AIP Publishing LLC, 2022-03-01 |
Series: | APL Photonics |
Online Access: | http://dx.doi.org/10.1063/5.0070711 |
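The abstract mentions the ring all-reduce collective as one of the distributed training workloads evaluated on the testbed. As background, the sketch below simulates that communication pattern in a single process with NumPy; it is not the authors' testbed code, and all names (e.g., `ring_all_reduce`) are hypothetical.

```python
# Minimal, illustrative simulation of the ring all-reduce collective mentioned
# in the abstract. This is NOT the authors' testbed code; it runs the whole
# "ring" in one process with NumPy purely to show the communication pattern.
import numpy as np


def ring_all_reduce(grads):
    """Return the element-wise average of per-worker gradients via ring all-reduce."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: in each of n-1 steps, every worker sends one chunk to its
    # right-hand ring neighbour and accumulates the chunk arriving from its left.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for sender, idx, payload in sends:
            chunks[(sender + 1) % n][idx] += payload

    # All-gather: n-1 further steps circulate the fully reduced chunks so every
    # worker ends up with the complete result.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for sender, idx, payload in sends:
            chunks[(sender + 1) % n][idx] = payload

    # Average and reassemble the gradient on every worker.
    return [np.concatenate(c) / n for c in chunks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    per_worker = [rng.standard_normal(8) for _ in range(4)]  # 4 hypothetical workers
    reduced = ring_all_reduce(per_worker)
    expected = sum(per_worker) / 4
    assert all(np.allclose(r, expected) for r in reduced)
```

The point of the sketch is the traffic pattern: in each of the 2(N−1) steps, every worker exchanges one gradient chunk with its ring neighbor, so sustained throughput is set by the bandwidth available between neighboring servers, which is exactly the kind of link that the SiP switch-enabled server regrouping and bandwidth steering described in the abstract aim to provision.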
_version_ | 1828230242286698496 |
---|---|
author | Ziyi Zhu; Min Yee Teh; Zhenguo Wu; Madeleine Strom Glick; Shijia Yan; Maarten Hattink; Keren Bergman
author_facet | Ziyi Zhu; Min Yee Teh; Zhenguo Wu; Madeleine Strom Glick; Shijia Yan; Maarten Hattink; Keren Bergman
author_sort | Ziyi Zhu |
collection | DOAJ |
description | The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to tackle the challenges and accelerate distributed deep learning training. In addition, our proposed system architecture utilizes a highly integrated operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate the feasibility of our proposal, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show up to 3.6× improvements over the static non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree and a further 11% improvement when higher-layer bandwidth steering is employed. The collective results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training. |
first_indexed | 2024-04-12T18:45:33Z |
format | Article |
id | doaj.art-b8502b22779e440b851746d4484e75c6 |
institution | Directory Open Access Journal |
issn | 2378-0967 |
language | English |
last_indexed | 2024-04-12T18:45:33Z |
publishDate | 2022-03-01 |
publisher | AIP Publishing LLC |
record_format | Article |
series | APL Photonics |
spelling | doaj.art-b8502b22779e440b851746d4484e75c6 (2022-12-22T03:20:37Z), eng, AIP Publishing LLC, APL Photonics, ISSN 2378-0967, 2022-03-01, vol. 7, issue 3, pp. 030901–030901-11, doi: 10.1063/5.0070711. Distributed deep learning training using silicon photonic switched architectures. Ziyi Zhu, Min Yee Teh, Zhenguo Wu, Madeleine Strom Glick, Shijia Yan, Maarten Hattink, Keren Bergman (all: Department of Electrical Engineering, Columbia University, New York, New York 10027, USA). [Abstract as in the description field above.] http://dx.doi.org/10.1063/5.0070711
spellingShingle | Ziyi Zhu; Min Yee Teh; Zhenguo Wu; Madeleine Strom Glick; Shijia Yan; Maarten Hattink; Keren Bergman; Distributed deep learning training using silicon photonic switched architectures; APL Photonics
title | Distributed deep learning training using silicon photonic switched architectures |
title_full | Distributed deep learning training using silicon photonic switched architectures |
title_fullStr | Distributed deep learning training using silicon photonic switched architectures |
title_full_unstemmed | Distributed deep learning training using silicon photonic switched architectures |
title_short | Distributed deep learning training using silicon photonic switched architectures |
title_sort | distributed deep learning training using silicon photonic switched architectures |
url | http://dx.doi.org/10.1063/5.0070711 |
work_keys_str_mv | AT ziyizhu distributeddeeplearningtrainingusingsiliconphotonicswitchedarchitectures AT minyeeteh distributeddeeplearningtrainingusingsiliconphotonicswitchedarchitectures AT zhenguowu distributeddeeplearningtrainingusingsiliconphotonicswitchedarchitectures AT madeleinestromglick distributeddeeplearningtrainingusingsiliconphotonicswitchedarchitectures AT shijiayan distributeddeeplearningtrainingusingsiliconphotonicswitchedarchitectures AT maartenhattink distributeddeeplearningtrainingusingsiliconphotonicswitchedarchitectures AT kerenbergman distributeddeeplearningtrainingusingsiliconphotonicswitchedarchitectures |