Searching for Efficient Multi-Stage Vision Transformers

Vision Transformer (ViT) demonstrates that the Transformer architecture from natural language processing can be applied to image classification tasks and achieve performance comparable to convolutional neural networks (CNNs), which have been studied in computer vision for years. This naturally raises the question of how the performance of ViT can be advanced with design techniques from CNNs. To this end, we propose to incorporate two techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS). First, we propose residual spatial reduction to decrease sequence lengths for deeper layers and utilize a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize the training of deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and utilize its sub-networks to define a search space. A super-network covering all sub-networks is then trained for fast evaluation of their performance. To efficiently train the super-network, we propose to sample and train multiple sub-networks with one forward-backward pass given a batch of examples. After training the super-network, evolutionary search is performed to discover high-performance network architectures. Experiments on ImageNet demonstrate the effectiveness of ViT-ResNAS. Compared to the original DeiT, ViT-ResNAS-Tiny achieves 8.6% higher accuracy than DeiT-Ti with slightly higher multiply-accumulate operations (MACs), and ViT-ResNAS-Small achieves similar accuracy to DeiT-B while having 6.3× fewer MACs and 3.7× higher throughput. Additionally, ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than other strong ViT baselines such as PVT and PiT, as well as high-performance CNNs like RegNet and ResNet-RS.
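
The residual spatial reduction block described above can be pictured as a small module between stages. The following is a minimal PyTorch sketch under assumed design choices (a strided 2×2 convolution to merge tokens on the main path, with average pooling and a linear projection on the shortcut); the exact layers in the thesis may differ, and the class token is omitted.

```python
import torch
import torch.nn as nn


class ResidualSpatialReduction(nn.Module):
    """Halve the spatial token grid between stages while widening channels.

    Hypothetical sketch of the idea in the abstract: the main path merges
    tokens to shorten the sequence, and a pooled, projected shortcut is
    added so that deeper multi-stage networks train more stably.
    """

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Main path: a strided 2x2 convolution merges each 2x2 group of tokens.
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)
        # Shortcut path: average-pool tokens and match the new channel width.
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.proj = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x, hw):
        # x: (batch, seq_len, in_dim) tokens lying on an hw[0] x hw[1] grid;
        # the class token is omitted in this sketch.
        b, n, c = x.shape
        h, w = hw
        grid = x.transpose(1, 2).reshape(b, c, h, w)

        main = self.reduce(grid).flatten(2).transpose(1, 2)            # (b, n/4, out_dim)
        skip = self.proj(self.pool(grid).flatten(2).transpose(1, 2))   # (b, n/4, out_dim)

        return self.norm(main + skip)  # skip connection around the reduction
```

The super-network training step with multi-architectural sampling can likewise be sketched. Here `super_net.sample_subnet()` and the architecture-conditioned forward are hypothetical placeholders for however the search space is actually encoded; the batch is split across the sampled sub-networks so the update still costs roughly one forward-backward pass.

```python
def train_super_network_step(super_net, images, labels, optimizer, criterion,
                             num_archs=4):
    """One weight-sharing update with multi-architectural sampling (sketch)."""
    optimizer.zero_grad()
    loss = 0.0
    for img, lbl in zip(images.chunk(num_archs), labels.chunk(num_archs)):
        arch = super_net.sample_subnet()   # randomly pick depth/width/heads
        logits = super_net(img, arch)      # forward only the chosen sub-network
        loss = loss + criterion(logits, lbl) / num_archs
    loss.backward()                        # single backward pass for all samples
    optimizer.step()
    return float(loss)
```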

Bibliographic Details
Main Author: Liao, Yi-Lun
Other Authors: Sze, Vivienne, Karaman, Sertac
Format: Thesis
Published: Massachusetts Institute of Technology, 2022
Online Access: https://hdl.handle.net/1721.1/140187
Degree: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, September 2021
Rights: In Copyright - Educational Use Permitted; Copyright MIT; http://rightsstatements.org/page/InC-EDU/1.0/