A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method

Bibliographic Details
Main Authors: Yihan Chen, Xingyu Gu, Zhen Liu, Jia Liang
Format: Article
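Language: English
Published: MDPI AG 2022-04-01
Series: Remote Sensing
Subjects: pavement distress, image classification, deep learning, vision transformer, LeViT, visual interpretation
Online Access: https://www.mdpi.com/2072-4292/14/8/1877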
_version_ 1797434263893704704
author Yihan Chen
Xingyu Gu
Zhen Liu
Jia Liang
author_facet Yihan Chen
Xingyu Gu
Zhen Liu
Jia Liang
author_sort Yihan Chen
collection DOAJ
description Traditional automatic pavement distress detection methods based on convolutional neural networks (CNNs) require substantial computing time and resources and offer poor interpretability. Therefore, inspired by the successful application of the Transformer architecture in natural language processing (NLP) tasks, a novel Transformer method called LeViT was introduced for automatic asphalt pavement image classification. LeViT consists of convolutional layers, transformer stages in which multi-layer perceptron (MLP) and multi-head self-attention blocks alternate with residual connections, and two classifier heads. To evaluate the proposed method, pavement image datasets from three different sources and weights pre-trained on ImageNet were obtained. The performance of the proposed model was compared with that of six state-of-the-art (SOTA) deep learning models, all trained with a transfer learning strategy. LeViT has less than 1/8 of the parameters of the original Vision Transformer (ViT) and 1/2 of those of ResNet and InceptionNet. Experimental results show that after training for 100 epochs with a batch size of 16, the proposed method achieved 91.56% accuracy, 91.72% precision, 91.56% recall, and 91.45% F1-score on the Chinese asphalt pavement dataset and 99.17% accuracy, 99.19% precision, 99.17% recall, and 99.17% F1-score on the German asphalt pavement dataset, the best performance among all the tested SOTA models. Moreover, it shows superior inference speed (86 ms/step), taking approximately 25% of the time of the original ViT method and 80% of that of some prevailing CNN-based models, including DenseNet, VGG, and ResNet. Overall, the proposed method achieves competitive performance at lower computational cost. In addition, a visualization method combining Grad-CAM and Attention Rollout was proposed to analyze the classification results and explore what has been learned in every MLP and attention block of LeViT, which improves the interpretability of the proposed pavement image classification model.
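The abstract names Attention Rollout as one half of the visualization method but does not spell it out. The following is a minimal sketch of plain Attention Rollout only, not the authors' combined Grad-CAM variant. It assumes every layer has the same number of tokens (LeViT's stages actually reduce the token count, which a real implementation would need to handle), and the function name `attention_rollout` and the random stand-in attention maps are hypothetical.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance map.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    where each row is a softmax distribution (sums to 1).
    Assumption: all layers share the same token count.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)           # average over heads
        fused = fused + np.eye(num_tokens)             # model the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # rows sum to 1 again
        rollout = fused @ rollout                      # propagate through this layer
    return rollout


# Hypothetical stand-in data: 3 layers, 4 heads, 16 tokens.
rng = np.random.default_rng(0)


def random_attention(heads, tokens):
    weights = np.exp(rng.normal(size=(heads, tokens, tokens)))
    return weights / weights.sum(axis=-1, keepdims=True)


layers = [random_attention(4, 16) for _ in range(3)]
rollout = attention_rollout(layers)
```

Each row of `rollout` can be reshaped to the patch grid and overlaid on the input image as a heat map; averaging heads is one common choice, and other fusion rules (for example, taking the maximum over heads) are equally plausible.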
first_indexed 2024-03-09T10:29:43Z
format Article
id doaj.art-00192c5e0e4a4a3db5661f0e397471ab
institution Directory of Open Access Journals
issn 2072-4292
language English
last_indexed 2024-03-09T10:29:43Z
publishDate 2022-04-01
publisher MDPI AG
record_format Article
series Remote Sensing
spelling doaj.art-00192c5e0e4a4a3db5661f0e397471ab
2023-12-01T21:22:05Z
eng
MDPI AG
Remote Sensing
2072-4292
2022-04-01
14 8 1877
10.3390/rs14081877
A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
Yihan Chen 0
Xingyu Gu 1
Zhen Liu 2
Jia Liang 3
Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China
Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China
Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China
Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China
https://www.mdpi.com/2072-4292/14/8/1877
pavement distress
image classification
deep learning
vision transformer
LeViT
visual interpretation
title A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_full A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_fullStr A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_full_unstemmed A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_short A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
title_sort fast inference vision transformer for automatic pavement image classification and its visual interpretation method
topic pavement distress
image classification
deep learning
vision transformer
LeViT
visual interpretation
url https://www.mdpi.com/2072-4292/14/8/1877