Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics
Modern enterprises often manage geographically distributed datacenters around the globe. In such an environment, datasets are naturally collected and stored in different datacenters and are later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims t...
Main Authors: | Jiao Huang, Jing Huang, Shang Gao, Bo Yang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2019-01-01 |
Series: | IEEE Access |
Subjects: | |
Online Access: | https://ieeexplore.ieee.org/document/8891688/ |
_version_ | 1819169988993351680 |
---|---|
author | Jiao Huang; Jing Huang; Shang Gao; Bo Yang |
author_sort | Jiao Huang |
collection | DOAJ |
description | Modern enterprises often manage geographically distributed datacenters around the globe. In such an environment, datasets are naturally collected and stored in different datacenters and are later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims to efficiently control data movement and achieve low latency for overall query processing, both constrained by limited and expensive network resources across datacenters. Previous papers focus on offline settings of single analytical queries and do not consider time when optimizing system performance, and therefore ignore the dynamics of data and task placement in terms of inter-datacenter (inter-DC) bandwidth utilization. In this paper, we consider the online setting and formulate a cost-minimizing optimization problem over time for arbitrary Directed Acyclic Graph (DAG) query processing. Considering the dynamics of network resource usage, we developed two online algorithms, Online Switch Resist (OSR) and Most Fixed Horizon Control (MFHC), with good competitive ratios. We performed extensive simulations and comparative studies using the TPC-CH benchmark and verified the efficacy of the proposed algorithms. The proposed algorithms perform better than the existing algorithm, and their performance approximates the theoretical optimal value. |
first_indexed | 2024-12-22T19:28:15Z |
format | Article |
id | doaj.art-a24a146f17a544fd9fa62373d8c791e6 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-22T19:28:15Z |
publishDate | 2019-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-a24a146f17a544fd9fa62373d8c791e6; 2022-12-21T18:15:10Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 163515; 163525; 10.1109/ACCESS.2019.2951682; 8891688; Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics; Jiao Huang; 0; https://orcid.org/0000-0002-9356-722X; Jing Huang; 1; https://orcid.org/0000-0003-2077-556X; Shang Gao; 2; https://orcid.org/0000-0002-1595-3176; Bo Yang; 3; https://orcid.org/0000-0003-1927-8419; Department of the College of Computer Science and Technology, Jilin University, Changchun, China; Department of the College of Computer Science and Technology, Jilin University, Changchun, China; Department of the College of Computer Science and Technology, Jilin University, Changchun, China; Department of the College of Computer Science and Technology, Jilin University, Changchun, China; Modern enterprises often manage geographically distributed datacenters around the globe. In such an environment, datasets are naturally collected and stored in different datacenters and are later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims to efficiently control data movement and achieve low latency for overall query processing, both constrained by limited and expensive network resources across datacenters. Previous papers focus on offline settings of single analytical queries and do not consider time when optimizing system performance, and therefore ignore the dynamics of data and task placement in terms of inter-datacenter (inter-DC) bandwidth utilization. In this paper, we consider the online setting and formulate a cost-minimizing optimization problem over time for arbitrary Directed Acyclic Graph (DAG) query processing. Considering the dynamics of network resource usage, we developed two online algorithms, Online Switch Resist (OSR) and Most Fixed Horizon Control (MFHC), with good competitive ratios. We performed extensive simulations and comparative studies using the TPC-CH benchmark and verified the efficacy of the proposed algorithms. The proposed algorithms perform better than the existing algorithm, and their performance approximates the theoretical optimal value.; https://ieeexplore.ieee.org/document/8891688/; Approximate nested query; distributed stream processing; resource allocation; error guarantee |
title | Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics |
title_sort | cost minimizing online algorithms for geo distributed data analytics |
topic | Approximate nested query; distributed stream processing; resource allocation; error guarantee |
url | https://ieeexplore.ieee.org/document/8891688/ |