Distributed systems for spatio-textual data streams

Due to the prosperity of social networks and smartphones, huge amounts of data with both spatial and textual information, e.g., geo-tagged tweets, are generated continuously and can be modelled as data streams. Such spatio-textual data streams contain valuable information for millions of users wi...

Bibliographic Details
Main Author: Chen, Zhida
Other Authors: Cong Gao
Format: Thesis
Language: English
Published: 2019
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/106448
http://hdl.handle.net/10220/47970
author Chen, Zhida
author2 Cong Gao
collection NTU
description Due to the prosperity of social networks and smartphones, huge amounts of data with both spatial and textual information, e.g., geo-tagged tweets, are generated continuously and can be modelled as data streams. Such spatio-textual data streams contain valuable information for millions of users with various interests in different keywords and locations. There has been increasing demand for efficiently exploring and processing spatio-textual data streams, which calls for systems that can provide real-time analytical results over the spatio-textual data. Publish/subscribe systems enable efficient and effective information distribution by allowing users to register continuous queries with both spatial and textual constraints. However, most existing publish/subscribe systems are centralized and run on a single machine to process all the incoming data. The explosive growth of data scale and user base poses challenges to the existing centralized publish/subscribe systems for spatio-textual data streams. To overcome these challenges, we propose a distributed publish/subscribe system, called PS2Stream, which digests a massive spatio-textual data stream and directs the stream to target users with registered interests. Compared with existing systems, PS2Stream achieves a better workload distribution, both minimizing the total workload and balancing the load across workers. To achieve this, we propose a new workload distribution algorithm that considers both the spatial and textual properties of the data. Additionally, PS2Stream supports dynamic load adjustments to adapt to changes in the workload. Extensive empirical evaluation on a commercial cloud computing platform with real data validates the superiority of our system design and the advantages of our techniques in improving system performance.
Publish/subscribe systems provide efficient ways to analyze spatio-textual data at the tuple level, returning in real time the set of spatio-textual objects that satisfy the continuous queries. However, in some scenarios, users are more interested in higher-level knowledge that can be extracted from the data. For instance, a marketing manager may want to know the popularity of a product in different regions in order to decide whether to adjust the advertising strategy. A data stream warehouse system (DSWS) offers efficient data ingestion and online analytical processing (OLAP) over streaming data. Unfortunately, existing DSWSs are not tailored for spatio-textual data, and adapting them requires significant effort. We develop a DSWS called STAR (Spatio-Textual Data Stream Warehouse). STAR is a distributed in-memory stream warehouse system that provides low-latency, up-to-date analytical results over a fast-arriving spatio-textual data stream. STAR facilitates the processing of ad-hoc aggregation queries with spatial or textual constraints by implementing a distributed view materialization algorithm. STAR adopts an effective workload partitioning strategy that partitions the workload composed of object processing, query processing and view maintenance. Additionally, STAR supports dynamic load adjustments, which make STAR scalable and adaptive. Extensive experiments on real data sets demonstrate the superior performance of STAR over existing systems.
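As a reading aid only, the two query styles named in the abstract can be pictured with a small single-machine sketch: a continuous query with a spatial and a textual constraint, and an aggregation restricted by keyword and region. This is not the PS2Stream or STAR implementation, and it does not attempt the distributed workload partitioning the thesis is about. The names `Rect`, `GeoObject`, `Subscription`, `publish` and `count_by_region` are hypothetical, and the matching rule (object inside a rectangle and containing every subscribed keyword) is an assumption made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle from (min_x, min_y) to (max_x, max_y)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, x: float, y: float) -> bool:
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y


@dataclass(frozen=True)
class GeoObject:
    """A spatio-textual object, e.g. a geo-tagged tweet (hypothetical model)."""
    x: float
    y: float
    terms: frozenset


@dataclass(frozen=True)
class Subscription:
    """Hypothetical continuous query: a region plus required keywords."""
    user: str
    region: Rect
    keywords: frozenset


def matches(sub: Subscription, obj: GeoObject) -> bool:
    # Assumed semantics: the object lies in the region AND carries
    # every keyword of the subscription.
    return sub.region.contains(obj.x, obj.y) and sub.keywords <= obj.terms


def publish(obj: GeoObject, subs: list) -> list:
    """Return the users whose subscriptions the incoming object satisfies."""
    return [s.user for s in subs if matches(s, obj)]


def count_by_region(stream, regions: dict, keyword: str) -> Counter:
    """Count objects mentioning `keyword` in each named region."""
    counts = Counter()
    for obj in stream:
        if keyword not in obj.terms:
            continue
        for name, rect in regions.items():
            if rect.contains(obj.x, obj.y):
                counts[name] += 1
    return counts
```

A linear scan over all subscriptions, as here, is exactly what stops scaling as the stream and user base grow, which is the motivation the abstract gives for distributing the work.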
first_indexed 2024-10-01T02:20:11Z
format Thesis
id ntu-10356/106448
institution Nanyang Technological University
language English
last_indexed 2024-10-01T02:20:11Z
publishDate 2019
record_format dspace
spelling ntu-10356/106448 2020-11-01T04:46:05Z Distributed systems for spatio-textual data streams Chen, Zhida Cong Gao Interdisciplinary Graduate School (IGS) DRNTU::Engineering::Computer science and engineering Doctor of Philosophy 2019-04-03T08:51:14Z 2019-12-06T22:12:01Z 2018 Thesis Chen, Z. (2018). Distributed systems for spatio-textual data streams. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/106448 http://hdl.handle.net/10220/47970 10.32657/10220/47970 en 117 p. application/pdf
title Distributed systems for spatio-textual data streams
topic DRNTU::Engineering::Computer science and engineering