A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method
Traditional automatic pavement distress detection methods based on convolutional neural networks (CNNs) require substantial computing time and resources and offer poor interpretability. Therefore, inspired by the successful application of the Transformer architecture in natural language p...
Main Authors: | Yihan Chen, Xingyu Gu, Zhen Liu, Jia Liang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2022-04-01 |
Series: | Remote Sensing |
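Subjects: | pavement distress, image classification, deep learning, vision transformer, LeViT, visual interpretation |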
Online Access: | https://www.mdpi.com/2072-4292/14/8/1877 |
_version_ | 1797434263893704704 |
---|---|
author | Yihan Chen Xingyu Gu Zhen Liu Jia Liang |
author_facet | Yihan Chen Xingyu Gu Zhen Liu Jia Liang |
author_sort | Yihan Chen |
collection | DOAJ |
description | Traditional automatic pavement distress detection methods based on convolutional neural networks (CNNs) require substantial computing time and resources and offer poor interpretability. Therefore, inspired by the successful application of the Transformer architecture in natural language processing (NLP) tasks, a novel Transformer method called LeViT was introduced for automatic asphalt pavement image classification. LeViT consists of convolutional layers, transformer stages in which multi-layer perceptron (MLP) and multi-head self-attention blocks alternate with residual connections, and two classifier heads. To implement the proposed method, pavement image datasets from three different sources and weights pre-trained on ImageNet were obtained. The performance of the proposed model was compared with six state-of-the-art (SOTA) deep learning models, all trained with a transfer learning strategy. Compared to the tested SOTA methods, LeViT has less than 1/8 of the parameters of the original Vision Transformer (ViT) and 1/2 of those of ResNet and InceptionNet. Experimental results show that after training for 100 epochs with a batch size of 16, the proposed method achieved 91.56% accuracy, 91.72% precision, 91.56% recall, and 91.45% F1-score on the Chinese asphalt pavement dataset and 99.17% accuracy, 99.19% precision, 99.17% recall, and 99.17% F1-score on the German asphalt pavement dataset, the best performance among all the tested SOTA models. Moreover, it is faster at inference (86 ms/step), taking approximately 25% of the time of the original ViT method and 80% of that of some prevailing CNN-based models, including DenseNet, VGG, and ResNet. Overall, the proposed method achieves competitive performance at lower computational cost. In addition, a visualization method combining Grad-CAM and Attention Rollout was proposed to analyze the classification results and explore what has been learned in every MLP and attention block of LeViT, which improved the interpretability of the proposed pavement image classification model. |
first_indexed | 2024-03-09T10:29:43Z |
format | Article |
id | doaj.art-00192c5e0e4a4a3db5661f0e397471ab |
institution | Directory Open Access Journal |
issn | 2072-4292 |
language | English |
last_indexed | 2024-03-09T10:29:43Z |
publishDate | 2022-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Remote Sensing |
spelling | doaj.art-00192c5e0e4a4a3db5661f0e397471ab; 2023-12-01T21:22:05Z; eng; MDPI AG; Remote Sensing; 2072-4292; 2022-04-01; 14; 8; 1877; 10.3390/rs14081877; A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method; Yihan Chen 0; Xingyu Gu 1; Zhen Liu 2; Jia Liang 3; Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China; Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China; Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China; Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China; Traditional automatic pavement distress detection methods based on convolutional neural networks (CNNs) require substantial computing time and resources and offer poor interpretability. Therefore, inspired by the successful application of the Transformer architecture in natural language processing (NLP) tasks, a novel Transformer method called LeViT was introduced for automatic asphalt pavement image classification. LeViT consists of convolutional layers, transformer stages in which multi-layer perceptron (MLP) and multi-head self-attention blocks alternate with residual connections, and two classifier heads. To implement the proposed method, pavement image datasets from three different sources and weights pre-trained on ImageNet were obtained. The performance of the proposed model was compared with six state-of-the-art (SOTA) deep learning models, all trained with a transfer learning strategy. Compared to the tested SOTA methods, LeViT has less than 1/8 of the parameters of the original Vision Transformer (ViT) and 1/2 of those of ResNet and InceptionNet. Experimental results show that after training for 100 epochs with a batch size of 16, the proposed method achieved 91.56% accuracy, 91.72% precision, 91.56% recall, and 91.45% F1-score on the Chinese asphalt pavement dataset and 99.17% accuracy, 99.19% precision, 99.17% recall, and 99.17% F1-score on the German asphalt pavement dataset, the best performance among all the tested SOTA models. Moreover, it is faster at inference (86 ms/step), taking approximately 25% of the time of the original ViT method and 80% of that of some prevailing CNN-based models, including DenseNet, VGG, and ResNet. Overall, the proposed method achieves competitive performance at lower computational cost. In addition, a visualization method combining Grad-CAM and Attention Rollout was proposed to analyze the classification results and explore what has been learned in every MLP and attention block of LeViT, which improved the interpretability of the proposed pavement image classification model.; https://www.mdpi.com/2072-4292/14/8/1877; pavement distress; image classification; deep learning; vision transformer; LeViT; visual interpretation |
spellingShingle | Yihan Chen Xingyu Gu Zhen Liu Jia Liang A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method Remote Sensing pavement distress image classification deep learning vision transformer LeViT visual interpretation |
title | A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method |
title_full | A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method |
title_fullStr | A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method |
title_full_unstemmed | A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method |
title_short | A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method |
title_sort | fast inference vision transformer for automatic pavement image classification and its visual interpretation method |
topic | pavement distress, image classification, deep learning, vision transformer, LeViT, visual interpretation |
url | https://www.mdpi.com/2072-4292/14/8/1877 |
work_keys_str_mv | AT yihanchen afastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod AT xingyugu afastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod AT zhenliu afastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod AT jialiang afastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod AT yihanchen fastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod AT xingyugu fastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod AT zhenliu fastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod AT jialiang fastinferencevisiontransformerforautomaticpavementimageclassificationanditsvisualinterpretationmethod |
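
The description field in this record mentions Attention Rollout as one half of the article's visualization method. The block below is a minimal sketch of the generic Attention Rollout computation only, not the authors' implementation: it assumes every layer has the same number of tokens (LeViT itself reduces the token count between stages), it does not include the Grad-CAM combination, and the function name `attention_rollout` and its input layout are hypothetical choices made for illustration.

```python
import numpy as np


def attention_rollout(attentions):
    """Propagate attention through a stack of layers (generic sketch).

    attentions: non-empty list with one array per layer, each shaped
    (heads, tokens, tokens), where every row sums to 1.
    Assumption: the token count is identical in all layers.
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        # Average the attention maps over the heads of this layer.
        a = layer_attention.mean(axis=0)
        # Mix in the identity to model the residual connection.
        a = 0.5 * a + 0.5 * identity
        # Re-normalize so that each row is again a distribution.
        a = a / a.sum(axis=-1, keepdims=True)
        # Accumulate: later layers are applied on the left.
        rollout = a @ rollout
    return rollout


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.random((3, 4, 5, 5))  # 3 layers, 4 heads, 5 tokens
    layers = list(raw / raw.sum(axis=-1, keepdims=True))
    print(attention_rollout(layers).round(3))
```

Each row of the result describes how strongly an output token depends on each input token after all layers; for an image model, a row is typically reshaped to the patch grid and overlaid on the input image.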