A scheduling algorithm to maximize storm throughput in heterogeneous cluster

Abstract In the most popular distributed stream processing frameworks (DSPFs), programs are modeled as a directed acyclic graph. Using this model, a DSPF can benefit from the parallelism capabilities of distributed clusters. Choosing a reasonable number of vertices for each operator and mapping the...

Bibliographic Details
Main Authors:	Hamid Nasiri, Saeed Nasehi, Arman Divband, Maziar Goudarzi
Format:	Article
Language:	English
Published:	SpringerOpen 2023-06-01
Series:	Journal of Big Data
Subjects:	Stream processing Scheduling Heterogeneous Throughput Parallelism
Online Access:	https://doi.org/10.1186/s40537-023-00771-y

author	Hamid Nasiri Saeed Nasehi Arman Divband Maziar Goudarzi
collection	DOAJ
description	Abstract In the most popular distributed stream processing frameworks (DSPFs), programs are modeled as a directed acyclic graph. Using this model, a DSPF can benefit from the parallelism capabilities of distributed clusters. Choosing a reasonable number of vertices for each operator and mapping the vertices to the appropriate processing resources significantly affect the overall system performance. Due to the simplicity of the current DSPF schedulers, these frameworks perform poorly on large-scale clusters. In this paper, we present a heterogeneity-aware scheduling algorithm that finds the proper number of vertices for an application graph and maps them to the most suitable cluster nodes. We begin with a pre-processing step that allocates the vertices to the given cluster nodes using profiling data. Then, we gradually increase the topology input rate in order to scale up the application graph. Finally, using a CPU utilization model that predicts the CPU workload based on the input rate to vertices and the processing node’s CPU characteristics, we identify the bottlenecked vertices and allocate new instances derived from them to the least utilized processing resource. Our experimental results on Storm Micro-Benchmark show that (1) the prediction model estimates CPU utilization with 92% accuracy; (2) compared to the default scheduler of Storm, our scheduler improves throughput by 7 to 44%; and (3) the proposed method finds a solution within 4% (worst case) of the optimal scheduler, which obtains the best scheduling scenario using an exhaustive search over the problem design space.
first_indexed	2024-03-13T04:49:26Z
format	Article
id	doaj.art-60ed228f2a284a0c8c34ebcd558d58e8
institution	Directory of Open Access Journals
issn	2196-1115
language	English
last_indexed	2024-03-13T04:49:26Z
publishDate	2023-06-01
publisher	SpringerOpen
record_format	Article
series	Journal of Big Data
spelling	doaj.art-60ed228f2a284a0c8c34ebcd558d58e8; 2023-06-18T11:16:36Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2023-06-01; 101127; 10.1186/s40537-023-00771-y; A scheduling algorithm to maximize storm throughput in heterogeneous cluster; Hamid Nasiri 0; Saeed Nasehi 1; Arman Divband 2; Maziar Goudarzi 3; Department of Science, Sharif University of Technology; Department of Science, Sharif University of Technology; Department of Science, Sharif University of Technology; Department of Science, Sharif University of Technology; https://doi.org/10.1186/s40537-023-00771-y; Stream processing; Scheduling; Heterogeneous; Throughput; Parallelism
title	A scheduling algorithm to maximize storm throughput in heterogeneous cluster
topic	Stream processing Scheduling Heterogeneous Throughput Parallelism
url	https://doi.org/10.1186/s40537-023-00771-y
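
The abstract in this record outlines a three-step loop: place vertices using profiling data, raise the topology input rate step by step, and, when a CPU utilization model predicts a bottleneck, add an instance of the bottlenecked vertex on the least utilized node. The Python sketch below illustrates that loop only in outline and is not the authors' implementation. Everything in it is an assumption made for illustration: the names `Node`, `Operator`, `predict_utilization`, and `schedule`, the linear utilization model (rate times per-tuple cost divided by node capacity), the even split of an operator's input across its instances, and the simplification that every operator sees the full topology input rate. It also assumes `start_rate` is low enough to be feasible with one instance per operator.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: float  # hypothetical: CPU work units the node can process per second


@dataclass
class Operator:
    name: str
    cost: float  # hypothetical profiling result: CPU work units per tuple
    instances: list = field(default_factory=list)  # node name hosting each instance


def predict_utilization(nodes, operators, input_rate):
    """Assumed linear model: an operator's input is split evenly over its instances."""
    capacity = {n.name: n.capacity for n in nodes}
    util = {n.name: 0.0 for n in nodes}
    for op in operators:
        per_instance = input_rate / len(op.instances)
        for host in op.instances:
            util[host] += per_instance * op.cost / capacity[host]
    return util


def schedule(nodes, operators, start_rate, step, max_instances):
    """Greedy scale-up; returns the highest input rate predicted to be sustainable."""
    # Pre-processing: one instance per operator, each on the least loaded node so far.
    for op in operators:
        op.instances = []
    for op in operators:
        placed = [o for o in operators if o.instances]
        util = predict_utilization(nodes, placed, start_rate)
        op.instances = [min(util, key=util.get)]

    rate = start_rate
    while True:
        target = rate + step
        util = predict_utilization(nodes, operators, target)
        hot = max(util, key=util.get)
        if util[hot] <= 1.0:
            rate = target  # no node is predicted to saturate, so accept the higher rate
            continue
        if sum(len(op.instances) for op in operators) >= max_instances:
            return rate
        # Bottleneck: the operator with the heaviest single instance on the hot node.
        on_hot = [op for op in operators if hot in op.instances]
        bottleneck = max(on_hot, key=lambda op: op.cost / len(op.instances))
        bottleneck.instances.append(min(util, key=util.get))
        new_util = predict_utilization(nodes, operators, target)
        if max(new_util.values()) >= util[hot]:
            bottleneck.instances.pop()  # the extra instance did not lower the peak
            return rate
```

Because the assumed model is linear in the input rate, accepting a new instance only when it lowers the predicted peak keeps every node at or below full utilization at the returned rate. A real scheduler would replace `predict_utilization` with a model fitted to measured CPU data and would account for the topology's actual edges and selectivities.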

Similar Items