BOOM: Broadcast Optimizations for On-chip Meshes
Main Authors: | Krishna, Tushar; Beckmann, Bradford M.; Peh, Li-Shiuan; Reinhardt, Steven K. |
---|---|
Other Authors: | Li-Shiuan Peh |
Published: | 2011 |
Subjects: | multicore |
Online Access: | http://hdl.handle.net/1721.1/61695 |
author | Krishna, Tushar; Beckmann, Bradford M.; Peh, Li-Shiuan; Reinhardt, Steven K. |
author2 | Li-Shiuan Peh |
collection | MIT |
description | Future many-core chips will require an on-chip network that can support broadcasts and multicasts at good power-performance. A vanilla on-chip network would send multiple unicast packets for each broadcast packet, resulting in latency, throughput and power overheads. Recent research in on-chip multicast support has proposed forking broadcast/multicast packets within the network at the router buffers, but these techniques are far from ideal: they increase buffer occupancy, which lowers throughput, and packets incur delay and power penalties at each router. In this work, we analyze an ideal broadcast mesh; show the substantial gaps between state-of-the-art multicast NoCs and the ideal; then propose BOOM, which comprises a WHIRL routing protocol that ideally load balances broadcast traffic, an mXbar multicast crossbar circuit that enables multicast traversal at an energy-delay similar to that of unicasts, and speculative bypassing of buffering for multicast flits. Together, they enable broadcast packets to approach the delay, energy, and throughput of the ideal fabric. Our simulations show BOOM realizing an average network latency that is 5% off ideal, attaining 96% of ideal throughput, with energy consumption that is 9% above ideal. Evaluations using synthetic traffic show BOOM achieving a latency reduction of 61%, throughput improvement of 63%, and buffer power reduction of 80% as compared to a baseline broadcast. Simulations with PARSEC benchmarks show BOOM reducing average request and network latency by 40% and 15%, respectively. |
institution | Massachusetts Institute of Technology |
contributor | Computer Architecture |
report number | MIT-CSAIL-TR-2011-013 |
date issued | 2011-03-14 |
license | Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported (http://creativecommons.org/licenses/by-nc-nd/3.0/) |
extent | 12 p. |
format | application/pdf |
topic | multicore |
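The abstract's point that a vanilla network sends one unicast per destination for each broadcast can be made concrete by counting link traversals on a k×k mesh. The sketch below is a hypothetical illustration, not BOOM's WHIRL routing or mXbar circuit: it compares one unicast per destination under minimal routing against a simple XY tree that forks copies inside the network (the general forking idea the abstract attributes to prior multicast work). The function names, the `Node` type, and the choice of an XY tree are all assumptions made for this example.

```python
from itertools import product

Node = tuple[int, int]  # hypothetical (x, y) coordinate of a mesh router


def unicast_link_traversals(k: int, src: Node) -> int:
    """Links crossed when src sends a separate unicast to every node.

    With minimal (e.g. XY) routing, each unicast crosses as many links as the
    Manhattan distance between source and destination (0 for src itself).
    """
    sx, sy = src
    return sum(abs(x - sx) + abs(y - sy) for x, y in product(range(k), repeat=2))


def xy_tree_edges(k: int, src: Node) -> set[tuple[Node, Node]]:
    """Undirected edges of a simple XY broadcast tree rooted at src.

    The packet first spreads along the source's row; every node in that row
    then forks a copy up and down its own column. Each edge carries one copy.
    """
    _, sy = src
    edges: set[tuple[Node, Node]] = set()
    for x in range(k - 1):  # links along the source's row
        edges.add(((x, sy), (x + 1, sy)))
    for x, y in product(range(k), range(k - 1)):  # links along every column
        edges.add(((x, y), (x, y + 1)))
    return edges


def tree_link_traversals(k: int, src: Node) -> int:
    """A spanning tree over k*k routers has k*k - 1 edges, each used once."""
    return len(xy_tree_edges(k, src))


if __name__ == "__main__":
    for k, src in [(4, (0, 0)), (8, (0, 0)), (8, (3, 3))]:
        print(k, src, unicast_link_traversals(k, src), tree_link_traversals(k, src))
```

For a corner source on an 8×8 mesh this counts 448 link traversals for the unicast scheme versus 63 for the tree, which is one way to see the throughput and power overhead the abstract describes. The count says nothing about the buffering, contention, or load-balancing effects that BOOM targets.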