An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools t...

Full description

Bibliographic Details
Main Authors:	Kapre, Nachiket, Dehon, André
Other Authors:	School of Computer Engineering
Format:	Journal Article
Language:	English
Published:	2015
Subjects:	Computer Science and Engineering
Online Access:	https://hdl.handle.net/10356/81118 http://hdl.handle.net/10220/39124

_version_	1826115893455749120
author	Kapre, Nachiket Dehon, André
author2	School of Computer Engineering
author_facet	School of Computer Engineering Kapre, Nachiket Dehon, André
author_sort	Kapre, Nachiket
collection	NTU
description	Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.
first_indexed	2024-10-01T04:02:33Z
format	Journal Article
id	ntu-10356/81118
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T04:02:33Z
publishDate	2015
record_format	dspace
spelling	ntu-10356/811182020-05-28T07:18:18Z An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads Kapre, Nachiket Dehon, André School of Computer Engineering Computer Science and Engineering Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow. Published version 2015-12-17T04:05:09Z 2019-12-06T14:21:48Z 2015-12-17T04:05:09Z 2019-12-06T14:21:48Z 2011 Journal Article Kapre, N., & Dehon, A. (2011). An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads. International Journal of Reconfigurable Computing, 2011, 745147-. 1687-7195 https://hdl.handle.net/10356/81118 http://hdl.handle.net/10220/39124 10.1155/2011/745147 en International Journal of Reconfigurable Computing © 2011 Nachiket Kapre and André Dehon. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 14 p. application/pdf
spellingShingle	Computer Science and Engineering Kapre, Nachiket Dehon, André An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
title	An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
title_full	An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
title_fullStr	An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
title_full_unstemmed	An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
title_short	An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
title_sort	noc traffic compiler for efficient fpga implementation of sparse graph oriented workloads
topic	Computer Science and Engineering
url	https://hdl.handle.net/10356/81118 http://hdl.handle.net/10220/39124
work_keys_str_mv	AT kaprenachiket annoctrafficcompilerforefficientfpgaimplementationofsparsegraphorientedworkloads AT dehonandre annoctrafficcompilerforefficientfpgaimplementationofsparsegraphorientedworkloads AT kaprenachiket noctrafficcompilerforefficientfpgaimplementationofsparsegraphorientedworkloads AT dehonandre noctrafficcompilerforefficientfpgaimplementationofsparsegraphorientedworkloads

An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

Similar Items