Energy efficient circuits and architectural design for machine learning on edge



Bibliographic Details
Main Author: Chong, Yi Sheng
Other Authors: Goh Wang Ling
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/168616
Description
Summary: The number of Internet of Things (IoT) devices around the world is forecast to reach 50 billion by the year 2025. IoT devices are commonly referred to as edge devices, as they connect to the Internet and operate at the edge of the network. They are also equipped with sensors to collect data and user input from their working environment. To process these data intelligently, neural networks, a well-known machine learning technique, are employed for their high accuracy. Since neural network algorithms are compute and memory intensive, IoT devices suffer high latency due to their limited on-board computation power. This thesis therefore explores custom circuit and architecture design to accelerate neural network computation on edge devices.

The first focus of this thesis is enabling convolutional neural network (CNN) computation on edge devices for image processing. Recent CNN models exploit a new convolution layer, depthwise separable convolution, to reduce model size and computation. In this layer, pointwise convolution becomes the dominant CNN workload, yet it is not well supported by existing CNN accelerators. A convolution (CONV) unit is therefore proposed to handle general CNN computation with dedicated support for pointwise convolution (a simple sketch follows the KWS discussion below). To achieve high energy efficiency, the proposed CONV unit employs a weight-stationary dataflow with input data reuse and computation parallelism. Implemented in a 40-nm technology node and operating at the nominal voltage of 0.85 V and a frequency of 100 MHz, the CONV unit attains an energy efficiency of 3.13 TOPS/W, the third best among state-of-the-art accelerators for recent CNNs such as MobileNet.

This thesis also explores speech processing on IoT devices, in particular keyword spotting (KWS), which detects keywords in the sound recorded by a microphone before activating the power-consuming speech recognition system. KWS must be always-on to continuously detect keywords in the user's voice input. A low-power neural-network-based KWS hardware is therefore proposed, not only to maximize the battery life of IoT devices but also to achieve high KWS accuracy. The proposed KWS engine comprises a Mel-frequency cepstral coefficients (MFCC) module and a long short-term memory (LSTM) accelerator. The MFCC module is optimized for low power through hardware-algorithm co-optimization, while the LSTM accelerator is designed to run a compact yet accurate KWS LSTM model. The LSTM model is optimized for small size using the novel enhanced top-k row pruning, together with compression and quantization, which in turn reduces the on-chip memory and area of the LSTM accelerator. Implemented in a 40-nm technology node and operating at a voltage of 0.6 V and a frequency of 400 kHz, the proposed KWS engine consumes 2.5 uW, which is 2.2 times lower than the state-of-the-art LSTM-based KWS.
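To make the workloads above concrete, two short Python sketches follow. The first illustrates why pointwise (1x1) convolution reduces to a per-pixel matrix-vector product over the channels, and how a weight-stationary loop order reuses input data; it is a minimal floating-point illustration, not the CONV unit's fixed-point datapath, and the function name and array shapes are assumptions for this sketch.

import numpy as np

def pointwise_conv(ifmap, weights):
    # 1x1 (pointwise) convolution: at every pixel, the output channels
    # are a matrix-vector product over the input channels -- no spatial
    # window is involved, so the channel dimension dominates the work.
    H, W, C_in = ifmap.shape            # ifmap: (H, W, C_in)
    C_out = weights.shape[0]            # weights: (C_out, C_in)
    ofmap = np.zeros((H, W, C_out))
    for co in range(C_out):
        w_row = weights[co]             # weight row stays "stationary"
        for y in range(H):              # ... while every input pixel is
            for x in range(W):          # streamed past it (input reuse)
                ofmap[y, x, co] = w_row @ ifmap[y, x]
    return ofmap

The second sketches plain magnitude-based top-k row pruning of an LSTM weight matrix. It is a simplified stand-in for the thesis's enhanced top-k row pruning and omits the accompanying compression, quantization, and retraining steps; it reuses the numpy import above.

def topk_row_prune(W, k):
    # Keep only the k largest-magnitude weights in each row of W and
    # zero the rest, yielding row-wise structured sparsity.
    pruned = np.zeros_like(W)
    for r in range(W.shape[0]):
        keep = np.argsort(np.abs(W[r]))[-k:]   # top-k indices by |w|
        pruned[r, keep] = W[r, keep]
    return pruned

Because every row keeps the same fixed budget of k weights, the sparsity pattern is regular, which is what lets an accelerator size its on-chip weight memory for exactly k weights per row.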
Furthermore, this thesis explores the emerging compute-in-memory (CIM) paradigm, which overcomes the memory bottleneck of the traditional von Neumann architecture by bringing computation close to the memory, thereby increasing energy efficiency. CIM is an attractive candidate for accelerating neural network computation because it is naturally suited to matrix-vector multiplication, the fundamental operation of neural networks. However, when a neural network is mapped onto CIM hardware, computation errors arise from the non-idealities and stochastic programming response of the CIM memory cells, leading to an accuracy drop. This thesis therefore proposes a chip-in-the-loop training scheme that helps the network adapt to the non-idealities and regain accuracy (a simplified sketch is given at the end of this summary). The proposed scheme considers only two-state resistive random access memory (RRAM) and binarized neural networks (BNNs). A BNN attains high accuracy even though its weights are only 1-bit, and those weights map easily onto RRAM-based CIM for computation. The proposed training scheme successfully adjusts the weights of a four-layer fully-connected network to regain accuracy.

In conclusion, this thesis investigates energy-efficient, low-power neural network hardware for the resource-constrained edge environment. The two proposed accelerators, the CONV unit and the KWS engine, have high potential to be integrated into edge devices as co-processors; given their high energy efficiency and low power consumption, both can cater to edge devices' need for long battery life and real-time response. In addition, this thesis tackles the accuracy drop caused by CIM non-idealities with the proposed network training scheme. Overcoming this challenge not only harvests the high energy efficiency brought by CIM, but also allows CIM to deliver accurate results when deployed for neural network based applications.
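As referenced above, the chip-in-the-loop idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's scheme: a single binarized layer stands in for the four-layer fully-connected network, a Gaussian perturbation stands in for the RRAM cells' stochastic programming response, and the tanh activation, mean-squared-error objective, and the names program_rram and chip_in_the_loop_epoch are all hypothetical choices.

import numpy as np

rng = np.random.default_rng(0)

def program_rram(w_bin):
    # Stand-in for programming +/-1 weights into two-state RRAM cells:
    # each cell lands near its target state, but with a stochastic
    # programming error (one of the non-idealities described above).
    return w_bin + rng.normal(0.0, 0.1, size=w_bin.shape)

def chip_in_the_loop_epoch(w_fp, xs, ys, lr=0.01):
    # Binarize the full-precision "shadow" weights, program them onto
    # the (simulated) chip, run the forward pass with the noisy on-chip
    # weights, and update the shadow copy -- so the network learns to
    # compensate for the device behavior it will actually see.
    w_bin = np.sign(w_fp)               # 1-bit weights for the RRAM CIM
    w_chip = program_rram(w_bin)        # what the hardware really holds
    for x, y in zip(xs, ys):
        out = np.tanh(w_chip @ x)       # forward pass "on chip"
        err = out - y                   # MSE error at the output
        grad = np.outer(err * (1.0 - out**2), x)
        w_fp -= lr * grad               # straight-through estimator update
    return w_fp

Because each epoch re-programs the binarized weights and takes gradients against the values the chip actually holds, the full-precision shadow weights drift toward settings whose binarized, noise-perturbed versions still produce correct outputs, which is how the network regains accuracy.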