Language matters: a weakly supervised Vision-Language pre-training approach for scene text detection and spotting

Bibliographic Details
Main Authors: Xue, C, Hao, Y, Lu, S, Torr, P, Bai, S
Format: Conference item
Language: English
Published: Springer 2022
Collection: OXFORD
Description: Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively should help Optical Character Recognition (OCR) tasks given the rich visual and textual information in scene text images. However, these methods cannot cope well with OCR tasks because of the difficulty of both instance-level text encoding and image-text pair acquisition (i.e., images paired with the texts captured in them). This paper presents a weakly supervised pre-training method, oCLIP, which acquires effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction between textual and visual features for learning effective scene text representations. By learning textual features, the pre-trained model can attend to texts in images with character awareness. Moreover, these designs enable learning from weakly annotated texts (i.e., partial texts in images without text bounding boxes), which greatly mitigates the data annotation constraint. Experiments on the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks, respectively. In addition, the proposed method consistently outperforms existing pre-training techniques across multiple public datasets (e.g., +3.2% and +1.3% on Total-Text and CTW1500). (A schematic sketch of this architecture follows the record below.)
ID: oxford-uuid:e59da773-2019-4fc2-9877-45717d0ae984
Institution: University of Oxford
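
The description above names three components: an image encoder, a character-aware text encoder, and a visual-textual decoder trained from partial text annotations without bounding boxes. The following is a minimal sketch of how such a model could be wired together in PyTorch. It is not the authors' implementation: the class names, the tiny CNN backbone, the mean-pooled character aggregation, the dimensions, and the masked-character objective are all illustrative assumptions.

```python
# Minimal oCLIP-style sketch (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Tiny CNN standing in for the image backbone."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, images):                   # (B, 3, H, W)
        feats = self.net(images)                 # (B, D, H/4, W/4)
        return feats.flatten(2).transpose(1, 2)  # (B, N, D) visual tokens

class CharAwareTextEncoder(nn.Module):
    """Encodes each annotated text instance from its character sequence."""
    def __init__(self, vocab=97, max_len=25, dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(vocab, dim)  # id 0 reserved for [MASK]
        self.pos_emb = nn.Embedding(max_len, dim)

    def forward(self, chars):                    # (B, T, L) character ids
        pos = torch.arange(chars.size(-1), device=chars.device)
        x = self.char_emb(chars) + self.pos_emb(pos)
        return x.mean(dim=2)                     # (B, T, D), one vector per instance

class VisualTextualDecoder(nn.Module):
    """Text-instance queries attend to visual tokens via cross-attention."""
    def __init__(self, dim=256, heads=8, vocab=97):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, vocab)        # predicts the masked character

    def forward(self, text_q, visual_kv):
        out, _ = self.attn(text_q, visual_kv, visual_kv)
        return self.head(out)                    # (B, T, vocab)

# Weak supervision: mask one character of each partially annotated transcription
# and ask the model to recover it from the image; no bounding boxes are needed.
img_enc, txt_enc, dec = ImageEncoder(), CharAwareTextEncoder(), VisualTextualDecoder()
images = torch.randn(2, 3, 128, 128)             # a toy batch of two images
chars = torch.randint(1, 97, (2, 4, 25))         # 4 annotated text instances per image
targets = chars[:, :, 0].clone()                 # the characters to recover
chars[:, :, 0] = 0                               # replace them with [MASK]
logits = dec(txt_enc(chars), img_enc(images))
loss = nn.functional.cross_entropy(logits.reshape(-1, 97), targets.reshape(-1))
loss.backward()
```

The real system is certainly more elaborate; this sketch only mirrors the data flow stated in the description, namely that character-level text features query visual features, and that the supervision signal comes from recoverable characters rather than from box annotations.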