A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method

Traditional automatic pavement distress detection methods using convolutional neural networks (CNNs) require a great deal of time and resources for computing and are poor in terms of interpretability. Therefore, inspired by the successful application of Transformer architecture in natural language p...


Bibliographic Details
Main Authors:	Yihan Chen, Xingyu Gu, Zhen Liu, Jia Liang
Format:	Article
Language:	English
Published:	MDPI AG 2022-04-01
Series:	Remote Sensing
Subjects:	pavement distress; image classification; deep learning; vision transformer; LeViT; visual interpretation
Online Access:	https://www.mdpi.com/2072-4292/14/8/1877

_version_	1797434263893704704
author	Yihan Chen; Xingyu Gu; Zhen Liu; Jia Liang
author_facet	Yihan Chen; Xingyu Gu; Zhen Liu; Jia Liang
author_sort	Yihan Chen
collection	DOAJ
description	Traditional automatic pavement distress detection methods using convolutional neural networks (CNNs) require a great deal of computing time and resources and offer poor interpretability. Therefore, inspired by the successful application of the Transformer architecture in natural language processing (NLP) tasks, a novel Transformer method called LeViT was introduced for automatic asphalt pavement image classification. LeViT consists of convolutional layers, transformer stages in which multi-layer perceptron (MLP) and multi-head self-attention blocks alternate with residual connections, and two classifier heads. To evaluate the proposed method, pavement image datasets from three different sources and weights pre-trained on ImageNet were obtained. The performance of the proposed model was compared with six state-of-the-art (SOTA) deep learning models, all trained using a transfer learning strategy. LeViT has less than 1/8 of the parameters of the original Vision Transformer (ViT) and 1/2 of those of ResNet and InceptionNet. Experimental results show that after training for 100 epochs with a batch size of 16, the proposed method achieved 91.56% accuracy, 91.72% precision, 91.56% recall, and 91.45% F1-score on the Chinese asphalt pavement dataset and 99.17% accuracy, 99.19% precision, 99.17% recall, and 99.17% F1-score on the German asphalt pavement dataset, the best performance among all the tested SOTA models. Moreover, it is faster at inference (86 ms/step), taking approximately 25% of the time of the original ViT and 80% of the time of some prevailing CNN-based models, including DenseNet, VGG, and ResNet. Overall, the proposed method achieves competitive performance at lower computational cost. In addition, a visualization method combining Grad-CAM and Attention Rollout was proposed to analyze the classification results and explore what has been learned in every MLP and attention block of LeViT, which improved the interpretability of the proposed pavement image classification model.
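The abstract names Attention Rollout as one half of the visualization method but does not describe how it is computed or how it is combined with Grad-CAM. As a rough illustration only, the sketch below follows the generic published formulation of Attention Rollout (average the heads, mix in the identity to account for the residual connection, renormalize, and multiply layer by layer). The function names `attention_rollout` and `random_attention`, the 0.5/0.5 residual mix, and the random inputs are assumptions for the example, not the authors' implementation. It also assumes a constant number of tokens per layer, which does not hold across LeViT stages because the token grid shrinks between them.

```python
import numpy as np


def attention_rollout(attentions):
    """Generic Attention Rollout over a list of per-layer attention maps.

    attentions: list of arrays shaped (heads, tokens, tokens); each row of
    each head is assumed to sum to 1 (softmax output).
    Returns a (tokens, tokens) matrix whose row i estimates how much each
    input token contributes to output token i.
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)              # average over heads
        fused = 0.5 * fused + 0.5 * identity         # model the residual path
        fused = fused / fused.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = fused @ rollout                    # compose with earlier layers
    return rollout


def random_attention(rng, heads, tokens):
    """Hypothetical stand-in for real attention weights (softmax of noise)."""
    logits = rng.normal(size=(heads, tokens, tokens))
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
layers = [random_attention(rng, heads=4, tokens=16) for _ in range(3)]
rollout = attention_rollout(layers)
```

In practice one row of `rollout` would be reshaped to the token grid (here 4 x 4) and upsampled over the input image to form a heat map.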
first_indexed	2024-03-09T10:29:43Z
format	Article
id	doaj.art-00192c5e0e4a4a3db5661f0e397471ab
institution	Directory Open Access Journal
issn	2072-4292
language	English
last_indexed	2024-03-09T10:29:43Z
publishDate	2022-04-01
publisher	MDPI AG
record_format	Article
series	Remote Sensing
spelling	doaj.art-00192c5e0e4a4a3db5661f0e397471ab | 2023-12-01T21:22:05Z | eng | MDPI AG | Remote Sensing | 2072-4292 | 2022-04-01 | vol. 14, no. 8, art. 1877 | 10.3390/rs14081877 | A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method | Yihan Chen; Xingyu Gu; Zhen Liu; Jia Liang (all: Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China) | https://www.mdpi.com/2072-4292/14/8/1877 | pavement distress; image classification; deep learning; vision transformer; LeViT; visual interpretation
title	A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_full	A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_fullStr	A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_full_unstemmed	A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_short	A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_sort	fast inference vision transformer for automatic pavement image classification and its visual interpretation method
topic	pavement distress; image classification; deep learning; vision transformer; LeViT; visual interpretation
url	https://www.mdpi.com/2072-4292/14/8/1877


Similar Items