Knowledge Distillation With Feature Self Attention

Bibliographic Details
Main Authors: Sin-Gu Park, Dong-Joong Kang
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10093872/
Description
Summary: With the rapid development of deep learning technology, network size and performance continue to grow, making network compression essential for commercial applications. In this paper, we propose a Feature Self Attention (FSA) module that extracts correlation information between the hidden features of a network, together with a new method for distilling these correlation features to compress the model. FSA does not require a special module or network to match features between the teacher and student models. By removing the multi-head structure and the repeated self-attention blocks of the standard self-attention mechanism, it keeps the number of added parameters to a minimum. Based on ResNet-18 and ResNet-34, the module adds only 2.00M parameters, and its training speed is the fastest among the compared benchmark models. Experiments demonstrate that a loss on the interrelationships between features is beneficial for training student models, indicating the importance of correlation information in deep neural network compression. This was further verified by training the vanilla student model from scratch, without pre-trained weights.
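The abstract describes a single-head self-attention applied to a network's hidden features, whose correlation map is then matched between teacher and student as a distillation loss. The following is a minimal sketch of that idea under stated assumptions: a PyTorch setup, global average pooling of each feature map before projection, and an MSE loss on the correlation maps. The names FeatureSelfAttention and correlation_distillation_loss, and all implementation details, are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureSelfAttention(nn.Module):
    """Single-head self-attention over a list of hidden feature maps.

    Hypothetical sketch: each feature map is pooled to a vector, projected
    to query/key spaces, and a correlation (attention) matrix over the
    selected layers is returned. No multi-head structure, no stacked blocks.
    """

    def __init__(self, channels, dim=128):
        super().__init__()
        # One linear projection per hooked layer (channel counts may differ).
        self.query = nn.ModuleList([nn.Linear(c, dim) for c in channels])
        self.key = nn.ModuleList([nn.Linear(c, dim) for c in channels])
        self.scale = dim ** -0.5

    def forward(self, features):
        # features: list of tensors with shape (B, C_i, H_i, W_i)
        pooled = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in features]
        q = torch.stack([proj(p) for proj, p in zip(self.query, pooled)], dim=1)  # (B, L, D)
        k = torch.stack([proj(p) for proj, p in zip(self.key, pooled)], dim=1)    # (B, L, D)
        # Correlation between hidden layers, normalized per row.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)          # (B, L, L)
        return attn


def correlation_distillation_loss(student_attn, teacher_attn):
    # Match the student's feature-correlation map to the (frozen) teacher's.
    return F.mse_loss(student_attn, teacher_attn.detach())


# Usage sketch (assumed training loop, not from the paper):
#   s_attn = student_fsa(student_features)
#   t_attn = teacher_fsa(teacher_features)
#   loss = task_loss + lambda_kd * correlation_distillation_loss(s_attn, t_attn)
```

Because the loss is defined on the correlation maps rather than on raw activations, the teacher and student feature maps need not share spatial sizes or channel counts, which is consistent with the abstract's claim that no extra matching module is required.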
ISSN:2169-3536