Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

The design and implementation of new distributed spatial query algorithms is a current challenge for spatial query processing in distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users...

Bibliographic Details
Main Authors:	Panagiotis Moutafis, George Mavrommatis, Michael Vassilakopoulos, Antonio Corral
Format:	Article
Language:	English
Published:	MDPI AG 2021-11-01
Series:	ISPRS International Journal of Geo-Information
Subjects:	big spatial data; spatial query processing; group nearest-neighbor query; Apache Spark; spatial query evaluation
Online Access:	https://www.mdpi.com/2220-9964/10/11/763

collection	DOAJ
description	The design and implementation of new distributed spatial query algorithms is a current challenge for spatial query processing in distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism and fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, and our team recently proposed a distributed algorithm in Apache Hadoop. Since Apache Hadoop generally exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark, along with performance-improving techniques that are applicable in Apache Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and a clear winner in comparison to processing this query in Apache Hadoop.
id	doaj.art-828e8861446c43acbd59639ca6eeb6b4
institution	Directory Open Access Journal
issn	2220-9964
volume	10
issue	11
article_number	763
doi	10.3390/ijgi10110763
author_affiliation	Panagiotis Moutafis, George Mavrommatis, Michael Vassilakopoulos: Data Structuring & Engineering Lab, Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece
author_affiliation	Antonio Corral: Department of Informatics, University of Almeria, 04120 Almeria, Spain
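
The GKNN query defined in the description (the K Training points with the smallest sum of distances to every Query point) can be illustrated with a minimal Python sketch. This is not the authors' Spark algorithm and it uses none of the pruning heuristics the paper mentions. It is a brute-force reference, plus a generic "local top-K per partition, then merge" pattern that is only assumed here as an illustration of how such a query can be split across partitions. All function names are hypothetical, and Euclidean distance is an assumption.

```python
import heapq
import math


def sum_of_distances(point, query):
    """Sum of Euclidean distances from one Training point to every Query point."""
    return sum(math.dist(point, q) for q in query)


def gknn_brute_force(training, query, k):
    """Return the k Training points with the smallest sum of distances to Query.

    Brute force: every Training point is scored against every Query point.
    """
    return heapq.nsmallest(k, training, key=lambda p: sum_of_distances(p, query))


def gknn_partitioned(partitions, query, k):
    """Illustrative partition-and-merge variant (an assumption, not the paper's method).

    Each partition contributes its own local top-k candidates; the global
    answer is then the top-k among all local candidates. This is correct
    because any globally best point is also among the best k of its partition.
    """
    candidates = []
    for part in partitions:
        candidates.extend(gknn_brute_force(part, query, k))
    return gknn_brute_force(candidates, query, k)
```

For example, with `query = [(0, 0), (2, 0)]` and `training = [(1, 0), (1, 1), (5, 5), (-3, 0), (1, -2)]`, calling `gknn_brute_force(training, query, 2)` returns `[(1, 0), (1, 1)]`, and splitting the same Training points into two partitions gives the same result through `gknn_partitioned`.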

Similar Items