BOOM: Broadcast Optimizations for On-chip Meshes

Bibliographic Details
Main Authors: Krishna, Tushar, Beckmann, Bradford M., Peh, Li-Shiuan, Reinhardt, Steven K.
Other Authors: Li-Shiuan Peh
Published: 2011
Subjects: Computer Architecture, multicore
Online Access: http://hdl.handle.net/1721.1/61695
author Krishna, Tushar
Beckmann, Bradford M.
Peh, Li-Shiuan
Reinhardt, Steven K.
author2 Li-Shiuan Peh
author_sort Krishna, Tushar
collection MIT
description Future many-core chips will require an on-chip network that can support broadcasts and multicasts at good power-performance. A vanilla on-chip network would send multiple unicast packets for each broadcast packet, resulting in latency, throughput, and power overheads. Recent research in on-chip multicast support has proposed forking broadcast/multicast packets within the network at the router buffers, but these techniques are far from ideal: they increase buffer occupancy, which lowers throughput, and packets incur delay and power penalties at each router. In this work, we analyze an ideal broadcast mesh; show the substantial gaps between state-of-the-art multicast NoCs and the ideal; and then propose BOOM, which comprises WHIRL, a routing protocol that ideally load-balances broadcast traffic; mXbar, a multicast crossbar circuit that enables multicast traversal at an energy-delay similar to that of unicasts; and speculative bypassing of buffering for multicast flits. Together, they enable broadcast packets to approach the delay, energy, and throughput of the ideal fabric. Our simulations show BOOM realizing an average network latency that is 5% off ideal and attaining 96% of ideal throughput, with energy consumption that is 9% above ideal. Evaluations using synthetic traffic show BOOM achieving a latency reduction of 61%, a throughput improvement of 63%, and a buffer power reduction of 80% compared to a baseline broadcast. Simulations with PARSEC benchmarks show BOOM reducing average request and network latency by 40% and 15%, respectively.
first_indexed 2024-09-23T10:43:17Z
id mit-1721.1/61695
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T10:43:17Z
publishDate 2011
record_format dspace
spelling mit-1721.1/61695 2019-04-11T10:31:06Z BOOM: Broadcast Optimizations for On-chip Meshes Krishna, Tushar Beckmann, Bradford M. Peh, Li-Shiuan Reinhardt, Steven K. Li-Shiuan Peh Computer Architecture multicore 2011-03-14T19:45:24Z 2011-03-14T19:45:24Z 2011-03-14 http://hdl.handle.net/1721.1/61695 MIT-CSAIL-TR-2011-013 Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported http://creativecommons.org/licenses/by-nc-nd/3.0/ 12 p.
application/pdf
title BOOM: Broadcast Optimizations for On-chip Meshes
topic multicore
url http://hdl.handle.net/1721.1/61695