Scene Text Detection Based on Multi-Headed Self-Attention Using Shifted Windows

Bibliographic Details
Main Authors: Baohua Huang, Xiaoru Feng
Format: Article
Language: English
Published: MDPI AG, 2023-03-01
Series: Applied Sciences
Subjects: scene text detection; multi-headed self-attention; shifted window; multi-oriented; multi-language
Online Access: https://www.mdpi.com/2076-3417/13/6/3928
Collection: DOAJ
Description: Scene text detection has become a popular topic in computer vision research. Most current research is based on deep learning, using Convolutional Neural Networks (CNNs) to extract visual features from images. However, because of the limited size of convolution kernels, CNNs can only extract local features within small receptive fields and cannot capture more global features. In this paper, to improve the accuracy of scene text detection, a feature enhancement module is added to the text detection model. This module acquires global features of an image by computing multi-headed self-attention over the feature map. The improved model extracts local features with CNNs and global features with the feature enhancement module; the two sets of features are then fused so that visual features at different levels of the image are captured. A shifted window is used in the self-attention computation, which reduces the computational complexity from quadratic to linear in the product of the input image's width and height. Experiments are conducted on the multi-oriented text dataset ICDAR2015 and the multi-language text dataset MSRA-TD500. Compared with the baseline method DBNet, the F1-score improves by 0.5% on ICDAR2015 and 3.5% on MSRA-TD500, indicating the effectiveness of the improvement.
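The shifted-window self-attention the abstract describes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the window size, head count, shift amount, and identity q/k/v projections are simplifying assumptions, and the attention mask that Swin-style layers apply at the wrapped window boundaries is omitted for brevity.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win*win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_reverse(wins, win, H, W):
    """Inverse of window_partition: reassemble windows into an (H, W, C) map."""
    C = wins.shape[-1]
    x = wins.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def shifted_window_attention(x, win=4, heads=2, shift=2):
    """Multi-headed self-attention computed inside (cyclically shifted) windows.

    Each token attends only to the win*win tokens in its own window, so the
    cost is O(win^2 * H * W * C) -- linear in H*W -- versus O((H*W)^2 * C)
    for global self-attention. Identity q/k/v projections keep the sketch
    dependency-free; a real layer would use learned projection matrices.
    """
    H, W, C = x.shape
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))  # cyclic shift (Swin-style)
    wins = window_partition(x, win)                    # (nW, win*win, C)
    d = C // heads
    qkv = wins.reshape(wins.shape[0], wins.shape[1], heads, d)
    attn = softmax(np.einsum('nthd,nshd->nhts', qkv, qkv) / np.sqrt(d))
    out = np.einsum('nhts,nshd->nthd', attn, qkv).reshape(wins.shape)
    out = window_reverse(out, win, H, W)
    if shift:
        out = np.roll(out, (shift, shift), axis=(0, 1))  # undo the shift
    return out

feat = np.random.rand(8, 8, 4)        # toy feature map
out = shifted_window_attention(feat)  # same (8, 8, 4) shape as the input
```

Alternating layers with `shift=0` and `shift=win // 2` lets information propagate across window boundaries while keeping each attention computation local.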
ISSN: 2076-3417
DOI: 10.3390/app13063928
Author Affiliations: School of Computer and Electronic Information, Guangxi University, Nanning 530004, China (both authors)