DPro-SM – A distributed framework for proactive straggler mitigation using LSTM
Distributed Deep Learning (DDL) is a recent advancement in deep learning, driven by the growth of big data and high-performance computing. The rapid rise in data volume and network complexity has led to significant growth in DDL. Distribution of the network leads to high communication and computation among...
Main Authors: | Aswathy Ravikumar, Harini Sriraman |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2024-01-01 |
Series: | Heliyon |
Subjects: | Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2405844023107754 |
_version_ | 1797336992937148416 |
---|---|
author | Aswathy Ravikumar; Harini Sriraman |
collection | DOAJ |
description | Distributed Deep Learning (DDL) is a recent advancement in deep learning, driven by the growth of big data and high-performance computing. The rapid rise in data volume and network complexity has led to significant growth in DDL. Distributing the network leads to heavy communication and computation among the nodes, which increases training time and lowers accuracy. The primary cause of communication delay is the presence of straggler nodes, which create a communication bottleneck. Because of the enormous volume of parameters transferred, data parallelism in DDL incurs substantial communication costs. Newly developed model-parallel methods may reduce the communication effort, but they result in load imbalance and severe straggler issues. This work proposes DPro-SM, a distributed framework for proactive straggler mitigation using LSTM in distributed deep learning. DPro-SM uses an LSTM to predict the completion time of each worker and proactively allocates resources to reduce the overall training time. The results show that DPro-SM can significantly reduce training time and improve the scalability and efficiency of large-scale machine learning tasks. |
first_indexed | 2024-03-08T09:02:57Z |
format | Article |
id | doaj.art-dafa7d39619d4f1ba641eff444891b1d |
institution | Directory Open Access Journal |
issn | 2405-8440 |
language | English |
last_indexed | 2024-03-08T09:02:57Z |
publishDate | 2024-01-01 |
publisher | Elsevier |
record_format | Article |
series | Heliyon |
spelling | doaj.art-dafa7d39619d4f1ba641eff444891b1d; 2024-02-01T06:32:28Z; eng; Elsevier; Heliyon; 2405-8440; 2024-01-01; 10; 1; e23567; DPro-SM – A distributed framework for proactive straggler mitigation using LSTM; Aswathy Ravikumar (School of Computer Science and Engineering, VIT, Chennai, 600127, India); Harini Sriraman (School of Computer Science and Engineering, VIT, Chennai, 600127, India; corresponding author); http://www.sciencedirect.com/science/article/pii/S2405844023107754; Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation |
title | DPro-SM – A distributed framework for proactive straggler mitigation using LSTM |
topic | Distributed deep learning; Data parallel; Stragglers; MPI; LSTM; Proactive mitigation |
url | http://www.sciencedirect.com/science/article/pii/S2405844023107754 |