Efficient spatial data partitioning for distributed $$k$$ k NN joins

Abstract Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scal...

Full description

Bibliographic Details
Main Authors:	Ayman Zeidan, Huy T. Vo
Format:	Article
Language:	English
Published:	SpringerOpen 2022-06-01
Series:	Journal of Big Data
Subjects:	Big data Spatial data Spatial query Indexing Partitioning Technique
Online Access:	https://doi.org/10.1186/s40537-022-00587-2

_version_	1818210184974041088
author	Ayman Zeidan Huy T. Vo
author_facet	Ayman Zeidan Huy T. Vo
author_sort	Ayman Zeidan
collection	DOAJ
description	Abstract Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability, accuracy, or prolonged runtimes. Spatial extensions remedy the problem and introduce spatial data recognition and operations. At the core of a spatial extension, a locality-preserving spatial partitioner determines how to spatially group the dataset’s objects into smaller chunks using the distributed system’s available resources. Existing spatial extensions rely on data sampling and often mismanage non-spatial data by either overlooking their memory requirements or excluding them entirely. This work discusses the various challenges that face spatial data partitioning and proposes a novel spatial partitioner for effectively processing spatial queries over large spatial datasets. For evaluation, the proposed partitioner is integrated with the well-known k-Nearest Neighbor ( $$k$$ k NN) spatial join query. Several experiments evaluate the proposal using real-world datasets. Our approach differs from existing proposals by (1) accounting for the dataset’s unique spatial traits without sampling, (2) considering the computational overhead required to handle non-spatial data, (3) minimizing partition shuffles, (4) computing the optimal utilization of the available resources, and (5) achieving accurate results. This contributes to the problem of spatial data partitioning through (1) providing a comprehensive discussion of the problems facing spatial data partitioning and processing, (2) the development of a novel spatial partitioning technique for in-memory distributed processing, (3) an effective, built-in, load-balancing methodology that reduces spatial query skews, and (4) a Spark-based implementation of the proposed work with an accurate $$k$$ k NN spatial join query. Experimental tests show up to $$1.48$$ 1.48 times improvement in runtime as well as the accuracy of results.
first_indexed	2024-12-12T05:12:35Z
format	Article
id	doaj.art-32c3a19672eb45bcba56451f6d8282f0
institution	Directory Open Access Journal
issn	2196-1115
language	English
last_indexed	2024-12-12T05:12:35Z
publishDate	2022-06-01
publisher	SpringerOpen
record_format	Article
series	Journal of Big Data
spelling	doaj.art-32c3a19672eb45bcba56451f6d8282f02022-12-22T00:36:51ZengSpringerOpenJournal of Big Data2196-11152022-06-019114210.1186/s40537-022-00587-2Efficient spatial data partitioning for distributed $$k$$ k NN joinsAyman Zeidan0Huy T. Vo1Department of Computer Science, CUNY Graduate CenterDepartment of Computer Science, CUNY City CollegeAbstract Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability, accuracy, or prolonged runtimes. Spatial extensions remedy the problem and introduce spatial data recognition and operations. At the core of a spatial extension, a locality-preserving spatial partitioner determines how to spatially group the dataset’s objects into smaller chunks using the distributed system’s available resources. Existing spatial extensions rely on data sampling and often mismanage non-spatial data by either overlooking their memory requirements or excluding them entirely. This work discusses the various challenges that face spatial data partitioning and proposes a novel spatial partitioner for effectively processing spatial queries over large spatial datasets. For evaluation, the proposed partitioner is integrated with the well-known k-Nearest Neighbor ( $$k$$ k NN) spatial join query. Several experiments evaluate the proposal using real-world datasets. Our approach differs from existing proposals by (1) accounting for the dataset’s unique spatial traits without sampling, (2) considering the computational overhead required to handle non-spatial data, (3) minimizing partition shuffles, (4) computing the optimal utilization of the available resources, and (5) achieving accurate results. This contributes to the problem of spatial data partitioning through (1) providing a comprehensive discussion of the problems facing spatial data partitioning and processing, (2) the development of a novel spatial partitioning technique for in-memory distributed processing, (3) an effective, built-in, load-balancing methodology that reduces spatial query skews, and (4) a Spark-based implementation of the proposed work with an accurate $$k$$ k NN spatial join query. Experimental tests show up to $$1.48$$ 1.48 times improvement in runtime as well as the accuracy of results.https://doi.org/10.1186/s40537-022-00587-2Big dataSpatial dataSpatial queryIndexingPartitioningTechnique
spellingShingle	Ayman Zeidan Huy T. Vo Efficient spatial data partitioning for distributed $$k$$ k NN joins Journal of Big Data Big data Spatial data Spatial query Indexing Partitioning Technique
title	Efficient spatial data partitioning for distributed $$k$$ k NN joins
title_full	Efficient spatial data partitioning for distributed $$k$$ k NN joins
title_fullStr	Efficient spatial data partitioning for distributed $$k$$ k NN joins
title_full_unstemmed	Efficient spatial data partitioning for distributed $$k$$ k NN joins
title_short	Efficient spatial data partitioning for distributed $$k$$ k NN joins
title_sort	efficient spatial data partitioning for distributed k k nn joins
topic	Big data Spatial data Spatial query Indexing Partitioning Technique
url	https://doi.org/10.1186/s40537-022-00587-2
work_keys_str_mv	AT aymanzeidan efficientspatialdatapartitioningfordistributedkknnjoins AT huytvo efficientspatialdatapartitioningfordistributedkknnjoins

Efficient spatial data partitioning for distributed $$k$$ k NN joins

Similar Items