Towards advanced distributed data processing: framework, optimization, and application


Bibliographic Details
Main Author: Liu, Kaiqi
Other Authors: Mo Li
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access: https://hdl.handle.net/10356/177576
_version_ 1826128090461372416
author Liu, Kaiqi
author2 Mo Li
author_facet Mo Li
Liu, Kaiqi
author_sort Liu, Kaiqi
collection NTU
description The surge in available big data has drawn significant interest in distributed processing methods that can handle ever-expanding data volumes and increasing computational complexity efficiently and at scale. While existing distributed data processing frameworks, such as Apache Spark, have proven effective in various applications, there is still considerable room for improvement and exploration in this field. This thesis focuses on three key aspects of advancing distributed data processing using Apache Spark. First, a novel framework is introduced to extend Spark’s capabilities, enabling the efficient processing of large-scale spatio-temporal data to better serve machine-learning applications. This framework not only achieves high efficiency but also provides a user-friendly interface. Second, a deep-learning-based optimization approach tailored to enhance the efficiency of Spark SQL execution is proposed. The end-to-end system integration of this approach leads to practical performance gains. Last, a distributed solution for computationally intensive, large-scale microscopic crowd simulation is designed and implemented, aiming to improve the scalability and efficiency of such applications. These three works collectively expand the application of distributed computing and enhance efficiency through the implementation of state-of-the-art techniques.
first_indexed 2024-10-01T07:19:14Z
format Thesis-Doctor of Philosophy
id ntu-10356/177576
institution Nanyang Technological University
language English
last_indexed 2024-10-01T07:19:14Z
publishDate 2024
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/177576 2024-06-03T06:51:20Z Towards advanced distributed data processing: framework, optimization, and application Liu, Kaiqi Mo Li School of Computer Science and Engineering Alibaba-NTU Singapore Joint Research Institute limo@ntu.edu.sg Computer and Information Science Doctor of Philosophy 2024-05-29T04:45:33Z 2024-05-29T04:45:33Z 2024 Thesis-Doctor of Philosophy Liu, K. (2024). Towards advanced distributed data processing: framework, optimization, and application. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/177576 10.32657/10356/177576 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
spellingShingle Computer and Information Science
Liu, Kaiqi
Towards advanced distributed data processing: framework, optimization, and application
title Towards advanced distributed data processing: framework, optimization, and application
topic Computer and Information Science
url https://hdl.handle.net/10356/177576