Distributed deep learning training using silicon photonic switched architectures

The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to tackle the challenges and accelerate distributed deep learning training. In addition, our proposed system architecture utilizes a highly integrated operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate the feasibility of our proposal, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show up to 3.6× improvements over the static non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree and a further 11% improvement when higher-layer bandwidth steering is employed. The collective results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.


Bibliographic Details
Main Authors: Ziyi Zhu, Min Yee Teh, Zhenguo Wu, Madeleine Strom Glick, Shijia Yan, Maarten Hattink, Keren Bergman (all: Department of Electrical Engineering, Columbia University, New York, New York 10027, USA)
Format: Article
Language: English
Published: AIP Publishing LLC, 2022-03-01
Series: APL Photonics, Vol. 7, Iss. 3, 030901 (2022)
ISSN: 2378-0967
Collection: DOAJ (Directory of Open Access Journals)
Online Access: http://dx.doi.org/10.1063/5.0070711
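Among the workloads evaluated in the abstract is distributed ring all-reduce, the collective whose traffic pattern the reconfigurable fat tree is steered to serve. For orientation only, the sketch below shows the two-phase ring all-reduce schedule (reduce-scatter, then all-gather) on simulated ranks; the plain-Python ranks, chunk bookkeeping, and in-memory "sends" are illustrative assumptions and not the authors' testbed code.

```python
# Minimal sketch of the ring all-reduce pattern referenced in the abstract.
# The simulated ranks (plain Python lists) and the chunk bookkeeping are
# illustrative assumptions, not the authors' testbed implementation.

def ring_allreduce(grads):
    """Sum equal-length gradient vectors across n simulated ranks, in place."""
    n = len(grads)
    chunk = len(grads[0]) // n              # assume length divisible by n

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all "sends" within a step act concurrently.
        out = [list(grads[r][sl((r - step) % n)]) for r in range(n)]
        for r in range(n):
            dst, c = (r + 1) % n, (r - step) % n
            for i, v in enumerate(out[r]):
                grads[dst][c * chunk + i] += v

    # All-gather: circulate the fully reduced chunks for another n-1 steps.
    for step in range(n - 1):
        out = [list(grads[r][sl((r + 1 - step) % n)]) for r in range(n)]
        for r in range(n):
            dst, c = (r + 1) % n, (r + 1 - step) % n
            grads[dst][sl(c)] = out[r]
    return grads


if __name__ == "__main__":
    # Two ranks, two-element gradients: both end with the element-wise sum.
    print(ring_allreduce([[1.0, 2.0], [3.0, 4.0]]))   # [[4.0, 6.0], [4.0, 6.0]]
```

With n ranks and a gradient of length L, each rank transfers roughly 2L(n-1)/n elements per all-reduce, which is the kind of sustained, predictable traffic that the abstract's SiP bandwidth steering targets on tapered fat tree links.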