Scene Text Detection Based on Multi-Headed Self-Attention Using Shifted Windows

Bibliographic Details
Main Authors: Baohua Huang, Xiaoru Feng
Format: Article
Language: English
Published: MDPI AG, 2023-03-01
Series: Applied Sciences
Subjects: scene text detection; multi-headed self-attention; shifted window; multi-oriented; multi-language
Online Access: https://www.mdpi.com/2076-3417/13/6/3928
Collection: DOAJ
Description: Scene text detection has become a popular topic in computer vision research. Most current research is based on deep learning, using Convolutional Neural Networks (CNNs) to extract visual features from images. However, because of the limited size of convolution kernels, CNNs can only extract local features within small receptive fields and cannot capture more global features. In this paper, to improve the accuracy of scene text detection, a feature enhancement module is added to the text detection model. This module acquires global features of an image by computing multi-headed self-attention over the feature map. The improved model extracts local features with CNNs and global features with the feature enhancement module; the two sets of features are then fused so that visual features at different levels of the image are captured. A shifted window is used in the self-attention computation, which reduces the computational complexity from quadratic to linear in the product of the input image's width and height. Experiments are conducted on the multi-oriented text dataset ICDAR2015 and the multi-language text dataset MSRA-TD500. Compared with the baseline method DBNet, the F1-score improves by 0.5% on ICDAR2015 and 3.5% on MSRA-TD500, indicating the effectiveness of the improvement.
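The shifted-window self-attention the abstract describes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the window size, head count, shift amount, and identity q/k/v projections are simplifying assumptions, and the attention mask that Swin-style layers apply at the wrapped window boundaries is omitted for brevity.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win*win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_reverse(wins, win, H, W):
    """Inverse of window_partition: reassemble windows into an (H, W, C) map."""
    C = wins.shape[-1]
    x = wins.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def shifted_window_attention(x, win=4, heads=2, shift=2):
    """Multi-headed self-attention computed inside (cyclically shifted) windows.

    Each token attends only to the win*win tokens in its own window, so the
    cost is O(win^2 * H * W * C) -- linear in H*W -- versus O((H*W)^2 * C)
    for global self-attention. Identity q/k/v projections keep the sketch
    dependency-free; a real layer would use learned projection matrices.
    """
    H, W, C = x.shape
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))  # cyclic shift (Swin-style)
    wins = window_partition(x, win)                    # (nW, win*win, C)
    d = C // heads
    qkv = wins.reshape(wins.shape[0], wins.shape[1], heads, d)
    attn = softmax(np.einsum('nthd,nshd->nhts', qkv, qkv) / np.sqrt(d))
    out = np.einsum('nhts,nshd->nthd', attn, qkv).reshape(wins.shape)
    out = window_reverse(out, win, H, W)
    if shift:
        out = np.roll(out, (shift, shift), axis=(0, 1))  # undo the shift
    return out

feat = np.random.rand(8, 8, 4)        # toy feature map
out = shifted_window_attention(feat)  # same (8, 8, 4) shape as the input
```

Alternating layers with `shift=0` and `shift=win // 2` lets information propagate across window boundaries while keeping each attention computation local.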
ISSN: 2076-3417
DOI: 10.3390/app13063928
Author Affiliations: School of Computer and Electronic Information, Guangxi University, Nanning 530004, China (both authors)