Summary: | Vision Transformer (ViT) demonstrates that the Transformer architecture developed for natural language processing can be applied to image classification tasks and achieve performance comparable to convolutional neural networks (CNNs), which have been studied in computer vision for years. This naturally raises the question of how the performance of ViT can be advanced with CNN design techniques. To this end, we incorporate two such techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS).
First, we propose residual spatial reduction to decrease sequence lengths for deeper layers, yielding a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize the training of deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and use its sub-networks to define a search space. A super-network covering all sub-networks is then trained for fast evaluation of their performance. To train the super-network efficiently, we sample and train multiple sub-networks with one forward-backward pass given a batch of examples. After training the super-network, evolutionary search is performed to discover high-performance network architectures.
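To make residual spatial reduction concrete, here is a minimal PyTorch sketch, assuming the general shape of the technique described above; the module and parameter names are hypothetical, not the paper's exact implementation. A strided projection merges tokens between stages, and a pooled-and-projected skip path keeps a residual connection across the reduction:

```python
import torch
import torch.nn as nn

class ResidualSpatialReduction(nn.Module):
    """Halve the token grid between stages, with a skip connection.

    Hypothetical sketch: names and layer choices are assumptions.
    """
    def __init__(self, dim_in, dim_out):
        super().__init__()
        # Strided conv merges each 2x2 patch of tokens into one token
        # and widens the channel dimension for the next stage.
        self.reduce = nn.Conv2d(dim_in, dim_out, kernel_size=2, stride=2)
        # Skip path: average-pool tokens and project channels so the
        # residual matches the reduced sequence length and width.
        self.skip_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.skip_proj = nn.Linear(dim_in, dim_out)
        self.norm = nn.LayerNorm(dim_out)

    def forward(self, x, h, w):
        # x: (batch, h * w, dim_in) token sequence; (h, w): token grid.
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        reduced = self.reduce(grid)                   # (b, dim_out, h/2, w/2)
        skip = self.skip_pool(grid)                   # (b, dim_in, h/2, w/2)
        reduced = reduced.flatten(2).transpose(1, 2)  # (b, n/4, dim_out)
        skip = self.skip_proj(skip.flatten(2).transpose(1, 2))
        return self.norm(reduced + skip), h // 2, w // 2
```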
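The multi-architectural sampling step can likewise be sketched. In the code below, `sample_subnet` and the `supernet(x, config=...)` interface are assumed, and splitting the batch across sampled sub-networks is one plausible reading of "one forward-backward pass given a batch of examples," not a confirmed detail of the paper:

```python
import torch

def train_step(supernet, optimizer, images, labels, criterion,
               sample_subnet, num_samples=4):
    """One super-network update with multi-architectural sampling.

    Hypothetical sketch: `sample_subnet` draws a random sub-network
    configuration from the search space (an assumed interface).
    """
    optimizer.zero_grad()
    loss = 0.0
    # Give each sampled sub-network its own slice of the batch, so all
    # of them are trained within a single forward-backward pass.
    for img, lbl in zip(images.chunk(num_samples), labels.chunk(num_samples)):
        config = sample_subnet()             # e.g., depth/width/head choices
        logits = supernet(img, config=config)
        loss = loss + criterion(logits, lbl)
    loss.backward()                          # one backward through shared weights
    optimizer.step()
    return loss.item()
```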
Experiments on ImageNet demonstrate the effectiveness of ViT-ResNAS. Compared to the original DeiT, ViT-ResNAS-Tiny achieves 8.6% higher accuracy than DeiT-Ti with slightly more multiply-accumulate operations (MACs), and ViT-ResNAS-Small achieves accuracy similar to DeiT-B while having 6.3× fewer MACs and 3.7× higher throughput. Additionally, ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than other strong ViT baselines such as PVT and PiT and high-performance CNNs like RegNet and ResNet-RS.