Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics

Modern enterprises often manage geographically distributed datacenters around the globe. In such an environment, datasets are naturally collected and stored in different datacenters and are later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims t...

Bibliographic Details
Main Authors: Jiao Huang, Jing Huang, Shang Gao, Bo Yang
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/8891688/
_version_ 1819169988993351680
author Jiao Huang
Jing Huang
Shang Gao
Bo Yang
author_facet Jiao Huang
Jing Huang
Shang Gao
Bo Yang
author_sort Jiao Huang
collection DOAJ
description Modern enterprises often manage geographically distributed datacenters around the globe. In such an environment, datasets are naturally collected and stored in different datacenters and are later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims to efficiently control data movements and achieve low latency for overall query processing, both constrained by limited and expensive network resources across datacenters. Previous papers focus on offline settings of single analytical queries and do not consider time in optimizing system performance, and therefore ignore the dynamics of data and task placement in terms of inter-DC bandwidth utilization. In this paper, we consider the online setting and formulate a cost-minimizing optimization problem over time for arbitrary Directed Acyclic Graph query processing. Considering the dynamics of network resource usage, we developed two online algorithms, Online Switch Resist (OSR) and Most Fixed Horizon Control (MFHC), with good competitive ratios. We performed extensive simulations and comparative studies using the TPC-CH benchmark and verified the efficacy of the proposed algorithms. The proposed algorithms perform better than the existing algorithm, and their performance approaches the theoretical optimal value.
first_indexed 2024-12-22T19:28:15Z
format Article
id doaj.art-a24a146f17a544fd9fa62373d8c791e6
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-22T19:28:15Z
publishDate 2019-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-a24a146f17a544fd9fa62373d8c791e6
2022-12-21T18:15:10Z
eng
IEEE
IEEE Access
2169-3536
2019-01-01
7
163515
163525
10.1109/ACCESS.2019.2951682
8891688
Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
Jiao Huang (https://orcid.org/0000-0002-9356-722X), Department of the College of Computer Science and Technology, Jilin University, Changchun, China
Jing Huang (https://orcid.org/0000-0003-2077-556X), Department of the College of Computer Science and Technology, Jilin University, Changchun, China
Shang Gao (https://orcid.org/0000-0002-1595-3176), Department of the College of Computer Science and Technology, Jilin University, Changchun, China
Bo Yang (https://orcid.org/0000-0003-1927-8419), Department of the College of Computer Science and Technology, Jilin University, Changchun, China
https://ieeexplore.ieee.org/document/8891688/
Approximate nested query
distributed stream processing
resource allocation
error guarantee
spellingShingle Jiao Huang
Jing Huang
Shang Gao
Bo Yang
Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
IEEE Access
Approximate nested query
distributed stream processing
resource allocation
error guarantee
title Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
title_full Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
title_fullStr Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
title_full_unstemmed Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
title_short Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
title_sort cost minimizing online algorithms for geo distributed data analytics
topic Approximate nested query
distributed stream processing
resource allocation
error guarantee
url https://ieeexplore.ieee.org/document/8891688/
work_keys_str_mv AT jiaohuang costminimizingonlinealgorithmsforgeodistributeddataanalytics
AT jinghuang costminimizingonlinealgorithmsforgeodistributeddataanalytics
AT shanggao costminimizingonlinealgorithmsforgeodistributeddataanalytics
AT boyang costminimizingonlinealgorithmsforgeodistributeddataanalytics