DPro-SM – A distributed framework for proactive straggler mitigation using LSTM

Distributed Deep Learning (DDL) is a recent advancement in deep learning, driven by the growth of big data and high-performance computing. The rapid rise in data volume and network complexity has led to significant growth in DDL. Distributing the network leads to heavy communication and computation among...


Bibliographic Details
Main Authors:	Aswathy Ravikumar, Harini Sriraman
Format:	Article
Language:	English
Published:	Elsevier 2024-01-01
Series:	Heliyon
Subjects:	Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation
Online Access:	http://www.sciencedirect.com/science/article/pii/S2405844023107754

_version_	1797336992937148416
author	Aswathy Ravikumar, Harini Sriraman
collection	DOAJ
description	Distributed Deep Learning (DDL) is a recent advancement in deep learning, driven by the growth of big data and high-performance computing. The rapid rise in data volume and network complexity has led to significant growth in DDL. Distributing the network leads to heavy communication and computation among the nodes, which increases training time and lowers accuracy. The primary cause of communication delay is the presence of straggler nodes, which create a communication bottleneck. Because of the enormous volume of parameters transferred, data parallelism in DDL incurs substantial communication costs. Newly developed model-parallel methods may reduce the communication effort, but they introduce load imbalance and severe straggler issues. The proposed model, DPro-SM, is a distributed framework for proactive straggler mitigation using LSTM in distributed deep learning. DPro-SM uses an LSTM to predict the completion time of each worker and proactively allocates resources to reduce the overall training time. The results show that DPro-SM can significantly reduce training time and improve the scalability and efficiency of large-scale machine learning tasks.
first_indexed	2024-03-08T09:02:57Z
format	Article
id	doaj.art-dafa7d39619d4f1ba641eff444891b1d
institution	Directory of Open Access Journals
issn	2405-8440
language	English
last_indexed	2024-03-08T09:02:57Z
publishDate	2024-01-01
publisher	Elsevier
record_format	Article
series	Heliyon
spelling	doaj.art-dafa7d39619d4f1ba641eff444891b1d | 2024-02-01T06:32:28Z | eng | Elsevier | Heliyon | 2405-8440 | 2024-01-01 | 10 | 1 | e23567 | DPro-SM – A distributed framework for proactive straggler mitigation using LSTM | Aswathy Ravikumar (0), Harini Sriraman (1) | (0) School of Computer Science and Engineering, VIT, Chennai, 600127, India | (1) School of Computer Science and Engineering, VIT, Chennai, 600127, India; Corresponding author. | http://www.sciencedirect.com/science/article/pii/S2405844023107754 | Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation
title	DPro-SM – A distributed framework for proactive straggler mitigation using LSTM
topic	Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation
url	http://www.sciencedirect.com/science/article/pii/S2405844023107754


Similar Items