Distributed large-scale graph processing on FPGAs


Bibliographic Details
Main Authors: Amin Sahebi, Marco Barbone, Marco Procaccini, Wayne Luk, Georgi Gaydadjiev, Roberto Giorgi
Format: Article
Language: English
Published: SpringerOpen 2023-06-01
Series: Journal of Big Data
Subjects:
Online Access: https://doi.org/10.1186/s40537-023-00756-x
_version_ 1797806627669147648
author Amin Sahebi
Marco Barbone
Marco Procaccini
Wayne Luk
Georgi Gaydadjiev
Roberto Giorgi
author_facet Amin Sahebi
Marco Barbone
Marco Procaccini
Wayne Luk
Georgi Gaydadjiev
Roberto Giorgi
author_sort Amin Sahebi
collection DOAJ
description Abstract Processing large-scale graphs is challenging because the computation induces irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGAs). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited device memory size, data needs to be repeatedly transferred to and from the FPGA on-chip memory, which makes data transfer time dominate over computation time. A possible way to overcome this resource limitation of FPGA accelerators is to employ a multi-FPGA distributed architecture together with an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and can use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host’s file system and for distributing them to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture leads to high performance, even when the graph has millions of vertices and billions of edges.
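The offline grid partitioning mentioned in the description can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' code: vertices are split into equal intervals, and each edge (u, v) is assigned to the block indexed by (interval of u, interval of v), so each block can later be streamed to an FPGA independently. The names `grid_partition` and `p` are illustrative assumptions.

```python
# Sketch of 2-D (grid) edge partitioning, assuming p equal-width vertex intervals.
# Illustrative only; the paper's framework uses its own offline partitioner.

def grid_partition(edges, num_vertices, p):
    """Assign each edge (u, v) to block (u's interval, v's interval)."""
    size = (num_vertices + p - 1) // p              # interval width (ceiling division)
    blocks = {(i, j): [] for i in range(p) for j in range(p)}
    for u, v in edges:
        blocks[(u // size, v // size)].append((u, v))
    return blocks

# Tiny 4-vertex example, split into a 2 x 2 grid of blocks.
edges = [(0, 3), (1, 2), (2, 3), (3, 0)]
blocks = grid_partition(edges, num_vertices=4, p=2)
# Block (0, 1) holds edges whose source is in {0, 1} and target in {2, 3}.
```

Because every edge lands in exactly one block, blocks can be shipped to different devices with no duplicated edge data, which is what makes such a scheme attractive for multi-FPGA distribution.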
In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, our implementation is the fastest compared to state-of-the-art CPU and GPU solutions, achieving a speedup of 13x compared to 8x and 3x, respectively. Moreover, for large-scale graphs, the GPU solution fails due to memory limitations, while the CPU solution achieves a speedup of 12x compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multiple FPGAs in a distributed system can further improve performance by about 12x. This highlights the efficiency of our implementation for large datasets that do not fit in the on-chip memory of a hardware device.
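The PageRank computation evaluated above can be sketched as a plain power iteration over an edge list. This is a minimal software sketch under the usual assumption of a damping factor of 0.85 (the paper's accelerator streams equivalent edge updates in hardware, and this sketch ignores dangling vertices, which the example graph does not have):

```python
# Minimal PageRank power iteration over an edge list, damping factor d = 0.85.
# Illustrative only; not the paper's FPGA implementation.

def pagerank(edges, n, d=0.85, iters=20):
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    rank = [1.0 / n] * n                      # start from a uniform distribution
    for _ in range(iters):
        contrib = [0.0] * n
        for u, v in edges:                    # each vertex spreads its rank
            contrib[v] += rank[u] / out_deg[u]
        rank = [(1 - d) / n + d * c for c in contrib]
    return rank

# Ring graph 0 -> 1 -> 2 -> 3 -> 0: by symmetry every vertex keeps rank 1/4.
ranks = pagerank([(0, 1), (1, 2), (2, 3), (3, 0)], n=4)
```

The inner loop over edges is exactly the irregular, data-dependent access pattern the abstract describes: `contrib[v]` is updated at scattered indices, which is why CPUs and GPUs struggle and why a streaming FPGA engine over pre-partitioned edge blocks can help.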
first_indexed 2024-03-13T06:10:09Z
format Article
id doaj.art-23c388678710405cacce6a11848c3aa6
institution Directory Open Access Journal
issn 2196-1115
language English
last_indexed 2024-03-13T06:10:09Z
publishDate 2023-06-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj.art-23c388678710405cacce6a11848c3aa6 2023-06-11T11:16:29Z eng SpringerOpen Journal of Big Data 2196-1115 2023-06-01 101128 10.1186/s40537-023-00756-x Distributed large-scale graph processing on FPGAs
Amin Sahebi, Department of Information Engineering and Mathematics, University of Siena
Marco Barbone, Department of Computing, Imperial College London
Marco Procaccini, Department of Information Engineering and Mathematics, University of Siena
Wayne Luk, Department of Computing, Imperial College London
Georgi Gaydadjiev, Department of Computing, Imperial College London
Roberto Giorgi, Department of Information Engineering and Mathematics, University of Siena
spellingShingle Amin Sahebi
Marco Barbone
Marco Procaccini
Wayne Luk
Georgi Gaydadjiev
Roberto Giorgi
Distributed large-scale graph processing on FPGAs
Journal of Big Data
Graph processing
Distributed computing
Grid partitioning
FPGA
Accelerators
title Distributed large-scale graph processing on FPGAs
title_full Distributed large-scale graph processing on FPGAs
title_fullStr Distributed large-scale graph processing on FPGAs
title_full_unstemmed Distributed large-scale graph processing on FPGAs
title_short Distributed large-scale graph processing on FPGAs
title_sort distributed large scale graph processing on fpgas
topic Graph processing
Distributed computing
Grid partitioning
FPGA
Accelerators
url https://doi.org/10.1186/s40537-023-00756-x
work_keys_str_mv AT aminsahebi distributedlargescalegraphprocessingonfpgas
AT marcobarbone distributedlargescalegraphprocessingonfpgas
AT marcoprocaccini distributedlargescalegraphprocessingonfpgas
AT wayneluk distributedlargescalegraphprocessingonfpgas
AT georgigaydadjiev distributedlargescalegraphprocessingonfpgas
AT robertogiorgi distributedlargescalegraphprocessingonfpgas