Performance optimization for distributed machine learning and graph processing at scale over virtualized infrastructure

Bibliographic Details
Main Author: Sun, Peng
Other Authors: Wen, Yonggang
Format: Thesis
Language:English
Published: 2018
Institution: Nanyang Technological University
Degree: Doctor of Philosophy (IGS)
Subjects: DRNTU::Engineering::Computer science and engineering::Computer systems organization
Online Access: http://hdl.handle.net/10356/73229
DOI: 10.32657/10356/73229
Citation: Sun, P. (2018). Performance optimization for distributed machine learning and graph processing at scale over virtualized infrastructure. Doctoral thesis, Nanyang Technological University, Singapore.
Description

Nowadays, many real-world applications can be represented as machine learning and graph processing (MLGP) problems and require sophisticated analysis of massive datasets. Various distributed computing systems have been proposed to run MLGP applications in a cluster. These systems usually manage the input data in a distributed file system (DFS), perform data-parallel computation on multiple machines, and exchange intermediate data over the network. In this thesis, we focus on performance optimization of distributed MLGP over virtualized infrastructure.

First, we improve the resource utilization of a cluster shared by multiple distributed MLGP workloads. Organizations increasingly use a cluster management system (CMS) to run multiple distributed MLGP applications in a single cluster. Existing CMSs can only allocate a static partition of the cluster to each application, leading to poor cluster utilization. To address this problem, we propose a new CMS named Dorm, which leverages virtualization techniques to partition a cluster, runs one application per partition, and can dynamically resize each partition at application runtime to achieve high cluster utilization while meeting other performance constraints. Extensive performance evaluations have shown that Dorm can increase cluster utilization by a factor of up to 2.32.

Second, we improve metadata lookup performance for DFSs. Existing DFSs usually use a distributed hash table (DHT) to manage their metadata servers. When performing a metadata operation, a client must first use a lookup service to locate the desired metadata object; this lookup can reduce metadata operation throughput and increase latency. To address this problem, we design a new metadata lookup service called MetaFlow. MetaFlow leverages software-defined networking (SDN) techniques to move metadata lookup to the network layer, and generates appropriate flow tables for SDN-enabled switches by mapping the physical network topology to a logical B-tree. Extensive performance evaluations have shown that, compared to DHT-based approaches, MetaFlow can increase metadata throughput by a factor of up to 6.5 and reduce metadata latency by a factor of up to 5.

Third, we reduce the communication overhead of distributed machine learning (ML) based on the Parameter Server (PS) framework. The PS framework has a group of worker nodes performing data-parallel computation and a group of server nodes maintaining globally shared parameters. Each worker node continually pulls parameters from server nodes and pushes updates to them, resulting in high communication overhead. To address this problem, we design ParameterFlow (PF), a communication layer for the PS framework with an update-centric communication (UCC) model and a dynamic value-bounded filter (DVF). UCC introduces a broadcast/push model to exchange data between worker nodes and server nodes. DVF directly reduces network traffic and communication time by selectively dropping updates from network transmission. Experiments have shown that PF can speed up popular distributed ML applications by a factor of up to 4.3, compared to the conventional PS framework.
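The value-bounded filter can be pictured as dropping small-magnitude entries of each update before it is pushed over the network. The sketch below is a minimal Python/NumPy illustration, not the thesis's implementation: the function name `filter_updates`, the `keep_fraction` parameter, and the quantile-based bound are assumptions introduced here for clarity.

```python
# Minimal sketch of a value-bounded update filter, assuming updates are
# held in NumPy arrays keyed by parameter name. The thesis's DVF adapts
# its bound at runtime; here the bound is simply recomputed from the
# current update magnitudes (a hypothetical rule, for illustration only).
import numpy as np

def filter_updates(updates, keep_fraction=0.1):
    """Keep only the largest-magnitude entries of each update tensor,
    zeroing (i.e. dropping from transmission) the rest."""
    filtered = {}
    for name, delta in updates.items():
        flat = np.abs(delta).ravel()
        if flat.size == 0:
            filtered[name] = delta
            continue
        # Dynamic bound: the (1 - keep_fraction) quantile of |delta|.
        bound = np.quantile(flat, 1.0 - keep_fraction)
        mask = np.abs(delta) >= bound
        filtered[name] = np.where(mask, delta, 0.0)
    return filtered

# Example: only the few largest gradient entries survive for transmission.
grads = {"w": np.array([0.001, -0.5, 0.02, 0.8]), "b": np.array([0.0003])}
sparse = filter_updates(grads, keep_fraction=0.25)
```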
Last, we enable high-performance large-scale graph processing in small clusters with limited memory. When processing big graphs, existing in-memory graph processing systems can easily exceed the cluster's memory capacity. While out-of-core approaches can handle big graphs, they suffer from poor performance due to high disk I/O overhead. We design a new distributed graph processing system named GraphH with three techniques: a gather-apply-broadcast computation model, an edge cache system, and a hybrid communication mode. Experiments have shown that GraphH outperforms existing out-of-core systems by more than 100x when processing big graphs in small clusters with limited memory.

The proposed approaches and obtained results can provide guidelines for improving large-scale distributed MLGP applications over virtualized infrastructure.
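As a closing illustration of the gather-apply-broadcast model mentioned above, the sketch below runs one superstep on a tiny graph: every vertex gathers the values broadcast by its in-neighbours, applies an update function, and broadcasts the result along its out-edges. This is only an illustrative single-machine rendering under assumed names (`superstep`, `apply_fn`); GraphH itself partitions the graph across machines and adds the edge cache and hybrid communication described above.

```python
# Minimal sketch of one gather-apply-broadcast superstep on a tiny graph,
# assuming an adjacency-list representation; the vertex program shown
# (summing neighbour values) is a stand-in, not the thesis's GraphH code.
from collections import defaultdict

def superstep(out_edges, values, apply_fn):
    # Gather: collect each vertex's broadcast value at its out-neighbours.
    inbox = defaultdict(list)
    for src, dsts in out_edges.items():
        for dst in dsts:
            inbox[dst].append(values[src])
    # Apply: compute each vertex's new value from the gathered messages.
    new_values = {v: apply_fn(values[v], inbox[v]) for v in values}
    # Broadcast: the returned values are what the next superstep gathers.
    return new_values

# Example: every vertex adds up what its in-neighbours broadcast.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
vals = {"a": 1.0, "b": 2.0, "c": 3.0}
vals = superstep(graph, vals, lambda old, msgs: old + sum(msgs))
```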