Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition

The goal of multimodal named entity recognition (MNER) is to detect entity spans in given image–text pairs and classify them into corresponding entity types. Despite the success of existing works that leverage cross-modal attention mechanisms to integrate textual and visual representations, we obser...

Bibliographic Details
Main Authors:	Li He, Qingxiang Wang, Jie Liu, Jianyong Duan, Hao Wang
Format:	Article
Language:	English
Published:	MDPI AG 2024-03-01
Series:	Applied Sciences
Subjects:	multimodal named entity recognition; contrastive learning; feature pyramid
Online Access:	https://www.mdpi.com/2076-3417/14/6/2333

_version_	1797242215645315072
author	Li He; Qingxiang Wang; Jie Liu; Jianyong Duan; Hao Wang
author_facet	Li He; Qingxiang Wang; Jie Liu; Jianyong Duan; Hao Wang
author_sort	Li He
collection	DOAJ
description	The goal of multimodal named entity recognition (MNER) is to detect entity spans in given image–text pairs and classify them into corresponding entity types. Despite the success of existing works that leverage cross-modal attention mechanisms to integrate textual and visual representations, we observe three key issues. Firstly, models are prone to misguidance when fusing unrelated text and images. Secondly, most existing visual features are not enhanced or filtered. Finally, due to the independent encoding strategies employed for text and images, a noticeable semantic gap exists between them. To address these challenges, we propose a framework called visual clue guidance and consistency matching (GMF). To tackle the first issue, we introduce a visual clue guidance (VCG) module designed to hierarchically extract visual information from multiple scales. This information is utilized as an injectable visual clue guidance sequence to steer text representations for error-insensitive prediction decisions. Furthermore, by incorporating a cross-scale attention (CSA) module, we successfully mitigate interference across scales, enhancing the image’s capability to capture details. To address the third issue of semantic disparity between text and images, we employ a consistency matching (CM) module based on the idea of multimodal contrastive learning, facilitating the collaborative learning of multimodal data. To validate the effectiveness of our proposed framework, we conducted comprehensive experimental studies, including extensive comparative experiments, ablation studies, and case studies, on two widely used benchmark datasets, demonstrating the efficacy of the framework.
first_indexed	2024-04-24T18:35:41Z
format	Article
id	doaj.art-75ee36711e754d879842b33b7682131f
institution	Directory Open Access Journal
issn	2076-3417
language	English
last_indexed	2024-04-24T18:35:41Z
publishDate	2024-03-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-75ee36711e754d879842b33b7682131f | 2024-03-27T13:19:21Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2024-03-01 | 14 | 6 | 2333 | 10.3390/app14062333 | Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition | Li He [0]; Qingxiang Wang [1]; Jie Liu [2]; Jianyong Duan [3]; Hao Wang [4] | [0, 1, 2, 3, 4] School of Information Science and Technology, North China University of Technology, Beijing 100144, China | The goal of multimodal named entity recognition (MNER) is to detect entity spans in given image–text pairs and classify them into corresponding entity types. Despite the success of existing works that leverage cross-modal attention mechanisms to integrate textual and visual representations, we observe three key issues. Firstly, models are prone to misguidance when fusing unrelated text and images. Secondly, most existing visual features are not enhanced or filtered. Finally, due to the independent encoding strategies employed for text and images, a noticeable semantic gap exists between them. To address these challenges, we propose a framework called visual clue guidance and consistency matching (GMF). To tackle the first issue, we introduce a visual clue guidance (VCG) module designed to hierarchically extract visual information from multiple scales. This information is utilized as an injectable visual clue guidance sequence to steer text representations for error-insensitive prediction decisions. Furthermore, by incorporating a cross-scale attention (CSA) module, we successfully mitigate interference across scales, enhancing the image’s capability to capture details. To address the third issue of semantic disparity between text and images, we employ a consistency matching (CM) module based on the idea of multimodal contrastive learning, facilitating the collaborative learning of multimodal data. To validate the effectiveness of our proposed framework, we conducted comprehensive experimental studies, including extensive comparative experiments, ablation studies, and case studies, on two widely used benchmark datasets, demonstrating the efficacy of the framework. | https://www.mdpi.com/2076-3417/14/6/2333 | multimodal named entity recognition; contrastive learning; feature pyramid
spellingShingle	Li He; Qingxiang Wang; Jie Liu; Jianyong Duan; Hao Wang | Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition | Applied Sciences | multimodal named entity recognition; contrastive learning; feature pyramid
title	Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition
title_sort	visual clue guidance and consistency matching framework for multimodal named entity recognition
topic	multimodal named entity recognition; contrastive learning; feature pyramid
url	https://www.mdpi.com/2076-3417/14/6/2333
work_keys_str_mv	AT lihe visualclueguidanceandconsistencymatchingframeworkformultimodalnamedentityrecognition AT qingxiangwang visualclueguidanceandconsistencymatchingframeworkformultimodalnamedentityrecognition AT jieliu visualclueguidanceandconsistencymatchingframeworkformultimodalnamedentityrecognition AT jianyongduan visualclueguidanceandconsistencymatchingframeworkformultimodalnamedentityrecognition AT haowang visualclueguidanceandconsistencymatchingframeworkformultimodalnamedentityrecognition