Searching for Efficient Multi-Stage Vision Transformers

Vision Transformer (ViT) demonstrates that the Transformer architecture from natural language processing can be applied to image classification tasks and achieve performance comparable to convolutional neural networks (CNNs), which have been studied in computer vision for years. This naturally raises the question of how the performance of ViT can be advanced with design techniques from CNNs. To this end, we propose to incorporate two techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS). First, we propose residual spatial reduction to decrease sequence lengths for deeper layers and utilize a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize the training of deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and utilize its sub-networks to define a search space. A super-network covering all sub-networks is then trained for fast evaluation of their performance. To efficiently train the super-network, we propose to sample and train multiple sub-networks with one forward-backward pass given a batch of examples. After training the super-network, evolutionary search is performed to discover high-performance network architectures. Experiments on ImageNet demonstrate the effectiveness of ViT-ResNAS. Compared to the original DeiT, ViT-ResNAS-Tiny achieves 8.6% higher accuracy than DeiT-Ti with slightly higher multiply-accumulate operations (MACs), and ViT-ResNAS-Small achieves similar accuracy to DeiT-B while having 6.3× fewer MACs and 3.7× higher throughput. Additionally, ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than other strong ViT baselines such as PVT and PiT, as well as high-performance CNNs like RegNet and ResNet-RS.
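
The residual spatial reduction block described above can be pictured as a small module between stages. The following is a minimal PyTorch sketch under assumed design choices (a strided 2×2 convolution to merge tokens on the main path, with average pooling and a linear projection on the shortcut); the exact layers in the thesis may differ, and the class token is omitted.

```python
import torch
import torch.nn as nn


class ResidualSpatialReduction(nn.Module):
    """Halve the spatial token grid between stages while widening channels.

    Hypothetical sketch of the idea in the abstract: the main path merges
    tokens to shorten the sequence, and a pooled, projected shortcut is
    added so that deeper multi-stage networks train more stably.
    """

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Main path: a strided 2x2 convolution merges each 2x2 group of tokens.
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)
        # Shortcut path: average-pool tokens and match the new channel width.
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.proj = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x, hw):
        # x: (batch, seq_len, in_dim) tokens lying on an hw[0] x hw[1] grid;
        # the class token is omitted in this sketch.
        b, n, c = x.shape
        h, w = hw
        grid = x.transpose(1, 2).reshape(b, c, h, w)

        main = self.reduce(grid).flatten(2).transpose(1, 2)            # (b, n/4, out_dim)
        skip = self.proj(self.pool(grid).flatten(2).transpose(1, 2))   # (b, n/4, out_dim)

        return self.norm(main + skip)  # skip connection around the reduction
```

The super-network training step with multi-architectural sampling can likewise be sketched. Here `super_net.sample_subnet()` and the architecture-conditioned forward are hypothetical placeholders for however the search space is actually encoded; the batch is split across the sampled sub-networks so the update still costs roughly one forward-backward pass.

```python
def train_super_network_step(super_net, images, labels, optimizer, criterion,
                             num_archs=4):
    """One weight-sharing update with multi-architectural sampling (sketch)."""
    optimizer.zero_grad()
    loss = 0.0
    for img, lbl in zip(images.chunk(num_archs), labels.chunk(num_archs)):
        arch = super_net.sample_subnet()   # randomly pick depth/width/heads
        logits = super_net(img, arch)      # forward only the chosen sub-network
        loss = loss + criterion(logits, lbl) / num_archs
    loss.backward()                        # single backward pass for all samples
    optimizer.step()
    return float(loss)
```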

Bibliographic Details
Main Author: Liao, Yi-Lun
Other Authors: Sze, Vivienne, Karaman, Sertac
Format: Thesis
Published: Massachusetts Institute of Technology, 2022
Online Access: https://hdl.handle.net/1721.1/140187
Degree: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, September 2021
Rights: In Copyright - Educational Use Permitted; Copyright MIT; http://rightsstatements.org/page/InC-EDU/1.0/