DPro-SM – A distributed framework for proactive straggler mitigation using LSTM

Distributed Deep Learning (DDL) is a recent advance in deep learning, driven by growth in big data and high-performance computing. The rapid rise in data volume and network complexity has led to significant growth in DDL. Distribution of the network leads to high communication and computation among...

Bibliographic Details
Main Authors: Aswathy Ravikumar, Harini Sriraman
Format: Article
Language: English
Published: Elsevier 2024-01-01
Series: Heliyon
Subjects: Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation
Online Access: http://www.sciencedirect.com/science/article/pii/S2405844023107754
_version_ 1797336992937148416
author Aswathy Ravikumar
Harini Sriraman
collection DOAJ
description Distributed Deep Learning (DDL) is a recent advance in deep learning, driven by growth in big data and high-performance computing. The rapid rise in data volume and network complexity has led to significant growth in DDL. Distributing the network leads to heavy communication and computation among the nodes, which increases training time and lowers accuracy. The primary cause of communication delay is the presence of straggler nodes, which create a bottleneck in communication. Because of the enormous volume of parameters transferred, data parallelism in DDL incurs substantial communication costs. Newly developed model-parallel methods may reduce the communication effort, but they result in load imbalance and severe straggler issues. The proposed model is DPro-SM, a distributed framework for proactive straggler mitigation using LSTM in distributed deep learning. DPro-SM uses an LSTM to predict the completion time of each worker and proactively allocates resources to reduce the overall training time. The results show that DPro-SM can significantly reduce training time and improve the scalability and efficiency of large-scale machine learning tasks.
first_indexed 2024-03-08T09:02:57Z
format Article
id doaj.art-dafa7d39619d4f1ba641eff444891b1d
institution Directory Open Access Journal
issn 2405-8440
language English
last_indexed 2024-03-08T09:02:57Z
publishDate 2024-01-01
publisher Elsevier
record_format Article
series Heliyon
spelling doaj.art-dafa7d39619d4f1ba641eff444891b1d; 2024-02-01T06:32:28Z; eng; Elsevier; Heliyon; 2405-8440; 2024-01-01; 10(1), e23567; DPro-SM – A distributed framework for proactive straggler mitigation using LSTM; Aswathy Ravikumar (School of Computer Science and Engineering, VIT, Chennai, 600127, India); Harini Sriraman (School of Computer Science and Engineering, VIT, Chennai, 600127, India; corresponding author); http://www.sciencedirect.com/science/article/pii/S2405844023107754; Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation
title DPro-SM – A distributed framework for proactive straggler mitigation using LSTM
topic Distributed deep learning
Data parallel
Stragglers
MPI
LSTM
Proactive mitigation
url http://www.sciencedirect.com/science/article/pii/S2405844023107754