ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation


Bibliographic Details
Main Authors: Haitao Dong, Chengjun Chen, Jinlei Wang, Feixiang Shen, Yong Pang
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Deep learning; vision Transformer; mechanical assembly monitoring; semantic segmentation
Online Access: https://ieeexplore.ieee.org/document/10108991/
_version_ 1797834825816604672
author Haitao Dong
Chengjun Chen
Jinlei Wang
Feixiang Shen
Yong Pang
author_facet Haitao Dong
Chengjun Chen
Jinlei Wang
Feixiang Shen
Yong Pang
author_sort Haitao Dong
collection DOAJ
description Semantic segmentation of mechanical assembly images provides an effective way to monitor the assembly process and improve product quality. Compared with other deep learning models, the Transformer has advantages in modeling global context, and it has been widely applied to computer vision tasks including semantic segmentation. However, the Transformer applies the same granularity of attention to every region of an image, which makes it difficult to apply to the semantic segmentation of mechanical assembly images, where mechanical parts differ greatly in size and information is unevenly distributed. This paper proposes a novel Transformer-based model called Vision Transformer with Self-Adaptive Patch Size (ViT-SAPS). ViT-SAPS perceives the detail information in an image and pays finer-grained attention to the regions where that detail is located, thus meeting the requirements of mechanical assembly semantic segmentation. Specifically, a self-adaptive patch splitting algorithm is proposed to split an image into patches of various sizes: the more detail information a region contains, the smaller the patches it is split into. Further, to handle these variable-size patches, a position encoding scheme and a non-uniform bilinear interpolation algorithm, applied after sequence decoding, are proposed. Experimental results show that ViT-SAPS has stronger detail segmentation ability than a model with a fixed patch size and achieves an impressive locality-globality trade-off. This study not only provides a practical method for mechanical assembly semantic segmentation but is also valuable for the application of vision Transformers in other fields. The code is available at: https://github.com/QDLGARIM/ViT-SAPS.
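The self-adaptive patch splitting described above lends itself to a quadtree-style recursion: a region is quartered whenever its detail measure exceeds a threshold, down to a minimum patch size. Below is a minimal Python sketch of that idea; the gradient-magnitude detail score, the threshold, and the size bounds are illustrative assumptions, not the authors' released implementation (see the linked GitHub repository for that).

```python
import numpy as np

def detail_score(region: np.ndarray) -> float:
    """Proxy for a region's 'information quantity': mean gradient
    magnitude of a grayscale patch. The paper's exact measure may differ."""
    gy, gx = np.gradient(region.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def split_adaptive(img: np.ndarray, y: int, x: int, size: int,
                   min_size: int, threshold: float, patches: list) -> None:
    """Recursively quarter a square region until its detail score falls
    below `threshold` or the minimum patch size is reached, mimicking
    'more detail -> smaller patches'. Assumes power-of-two sizes."""
    region = img[y:y + size, x:x + size]
    if size <= min_size or detail_score(region) < threshold:
        # Record (top-left y, top-left x, side length); a position
        # encoding for variable-size patches could be derived from these.
        patches.append((y, x, size))
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            split_adaptive(img, y + dy, x + dx, half,
                           min_size, threshold, patches)

# Usage: split a 256x256 grayscale image into variable-size patches.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
    patches = []
    split_adaptive(image, 0, 0, 256, min_size=16,
                   threshold=20.0, patches=patches)
    print(f"{len(patches)} patches; sizes:",
          sorted({s for _, _, s in patches}))
```

Each recorded (row, column, side) triple carries exactly the information a position encoding for variable-size patches would need, which is presumably why the abstract pairs the splitting algorithm with a dedicated encoding scheme and a non-uniform interpolation step at decoding time.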
first_indexed 2024-04-09T14:43:41Z
format Article
id doaj.art-9b08cbe781c94d909b2bfeea96ddcba8
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-09T14:43:41Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-9b08cbe781c94d909b2bfeea96ddcba8
  Record updated: 2023-05-02T23:00:53Z
  Language: eng
  Publisher: IEEE
  Series: IEEE Access, ISSN 2169-3536
  Published: 2023-01-01, vol. 11, pp. 41467-41479
  DOI: 10.1109/ACCESS.2023.3270807
  IEEE document number: 10108991
  Title: ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation
  Authors and affiliations:
    Haitao Dong (https://orcid.org/0000-0002-8937-0184) - School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China
    Chengjun Chen (https://orcid.org/0000-0003-3185-1062) - School of Mechanical and Automotive Engineering, Qingdao University of Technology, Qingdao, China
    Jinlei Wang (https://orcid.org/0000-0002-0024-6583) - School of Mechanical and Automotive Engineering, Qingdao University of Technology, Qingdao, China
    Feixiang Shen (https://orcid.org/0000-0002-1853-8488) - School of Mechanical and Automotive Engineering, Qingdao University of Technology, Qingdao, China
    Yong Pang (https://orcid.org/0000-0002-3068-7312) - School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China
  Abstract: as in the description field above.
  Online Access: https://ieeexplore.ieee.org/document/10108991/
  Topics: Deep learning; vision Transformer; mechanical assembly monitoring; semantic segmentation
spellingShingle Haitao Dong
Chengjun Chen
Jinlei Wang
Feixiang Shen
Yong Pang
ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation
IEEE Access
Deep learning
vision Transformer
mechanical assembly monitoring
semantic segmentation
title ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation
title_full ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation
title_fullStr ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation
title_full_unstemmed ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation
title_short ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation
title_sort vit saps detail aware transformer for mechanical assembly semantic segmentation
topic Deep learning
vision Transformer
mechanical assembly monitoring
semantic segmentation
url https://ieeexplore.ieee.org/document/10108991/
work_keys_str_mv AT haitaodong vitsapsdetailawaretransformerformechanicalassemblysemanticsegmentation
AT chengjunchen vitsapsdetailawaretransformerformechanicalassemblysemanticsegmentation
AT jinleiwang vitsapsdetailawaretransformerformechanicalassemblysemanticsegmentation
AT feixiangshen vitsapsdetailawaretransformerformechanicalassemblysemanticsegmentation
AT yongpang vitsapsdetailawaretransformerformechanicalassemblysemanticsegmentation