A 64-TOPS Energy-Efficient Tensor Accelerator in 14nm With Reconfigurable Fetch Network and Processing Fusion for Maximal Data Reuse

For energy-efficient accelerators in data centers that leverage advances in the performance and energy efficiency of recent algorithms, flexible architectures are critical to support state-of-the-art algorithms for various deep learning tasks. Due to the matrix multiplication units at the core of te...

Bibliographic Details
Main Authors:	Sang Min Lee, Hanjoon Kim, Jeseung Yeon, Juyun Lee, Younggeun Choi, Minho Kim, Changjae Park, Kiseok Jang, Youngsik Kim, Yongseung Kim, Changman Lee, Hyuck Han, Won Eung Kim, Rui Tang, Joon Ho Baek
Format:	Article
Language:	English
Published:	IEEE 2022-01-01
Series:	IEEE Open Journal of the Solid-State Circuits Society
Subjects:	AI accelerators; convolutional neural networks; data reuse; depth-wise and group convolution; inference; ML accelerator
Online Access:	https://ieeexplore.ieee.org/document/9927346/

_version_	1797197424606838784
author	Sang Min Lee, Hanjoon Kim, Jeseung Yeon, Juyun Lee, Younggeun Choi, Minho Kim, Changjae Park, Kiseok Jang, Youngsik Kim, Yongseung Kim, Changman Lee, Hyuck Han, Won Eung Kim, Rui Tang, Joon Ho Baek
author_sort	Sang Min Lee
collection	DOAJ
description	For energy-efficient accelerators in data centers that leverage advances in the performance and energy efficiency of recent algorithms, flexible architectures are critical to support state-of-the-art algorithms for various deep learning tasks. Due to the matrix multiplication units at the core of tensor operations, most recent programmable architectures lack flexibility for layers with diminished dimensions, especially for inferences where a large batch axis is rarely allowed. In addition, exploiting the data reuse inherent within tensor operations for computing a single matrix multiplication is challenging. In this work, an extension of a vector processor in 14 nm is proposed, which is customized to tensor operations. The flexible architecture enables a tensorized loop to support various data layouts and different shapes and sizes of tensor operations. It also exploits all possible data reuse, including input, weight, and output. Based on the tensorized loop, fetch and reduction networks, which unicast or multicast with the ordering of both input data and processing data, can be simplified using a circuit-switching-like network with configured topology and flow control for each tensor operation. Two processing elements can be fused to optimize latency for a large model or can operate individually for throughput. As a result, various state-of-the-art models can be processed efficiently with straightforward compiler optimization, and the highest energy efficiency of 13.4 inferences/s/W on EfficientNetV2-S is demonstrated.
first_indexed	2024-04-24T06:43:45Z
format	Article
id	doaj.art-de8bda9e1c81437f81d2dbb7cc1dc338
institution	Directory Open Access Journal
issn	2644-1349
language	English
last_indexed	2024-04-24T06:43:45Z
publishDate	2022-01-01
publisher	IEEE
record_format	Article
series	IEEE Open Journal of the Solid-State Circuits Society
spelling	doaj.art-de8bda9e1c81437f81d2dbb7cc1dc338 | 2024-04-22T20:40:15Z | eng | IEEE | IEEE Open Journal of the Solid-State Circuits Society | ISSN 2644-1349 | 2022-01-01 | vol. 2, pp. 219-230 | DOI 10.1109/OJSSCS.2022.3216798 | article no. 9927346 | A 64-TOPS Energy-Efficient Tensor Accelerator in 14nm With Reconfigurable Fetch Network and Processing Fusion for Maximal Data Reuse | Authors: Sang Min Lee (https://orcid.org/0000-0002-8234-520X), Hanjoon Kim, Jeseung Yeon (https://orcid.org/0000-0003-3529-710X), Juyun Lee (https://orcid.org/0000-0002-4270-2508), Younggeun Choi (https://orcid.org/0000-0002-0887-7494), Minho Kim (https://orcid.org/0000-0002-8803-0348), Changjae Park, Kiseok Jang, Youngsik Kim, Yongseung Kim (https://orcid.org/0000-0001-7154-2358), Changman Lee, Hyuck Han, Won Eung Kim, Rui Tang, Joon Ho Baek | Affiliations: Rui Tang: Marketing, Strategy and Operations Department, MSQUARE Ltd., Shanghai, China; all other authors: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea | https://ieeexplore.ieee.org/document/9927346/ | AI accelerators; convolutional neural networks; data reuse; depth-wise and group convolution; inference; ML accelerator
title	A 64-TOPS Energy-Efficient Tensor Accelerator in 14nm With Reconfigurable Fetch Network and Processing Fusion for Maximal Data Reuse
title_sort	64 tops energy efficient tensor accelerator in 14nm with reconfigurable fetch network and processing fusion for maximal data reuse
topic	AI accelerators; convolutional neural networks; data reuse; depth-wise and group convolution; inference; ML accelerator
url	https://ieeexplore.ieee.org/document/9927346/

Similar Items