Dynamic Debiasing Network for Visual Commonsense Generation


Bibliographic Details
Main Authors:	Jungeun Kim, Jinwoo Park, Jaekwang Seok, Junyeong Kim
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Multimodal reasoning; visual commonsense generation; VisualCOMET; dataset bias; debiasing; causal inference
Online Access:	https://ieeexplore.ieee.org/document/10348563/

_version_	1797376349366648832
author	Jungeun Kim, Jinwoo Park, Jaekwang Seok, Junyeong Kim
author_facet	Jungeun Kim, Jinwoo Park, Jaekwang Seok, Junyeong Kim
author_sort	Jungeun Kim
collection	DOAJ
description	The task of Visual Commonsense Generation (VCG) delves into the deeper narrative behind a static image, aiming to comprehend not just its immediate content but also the surrounding context. A VCG model generates three types of captions for each image: 1) the events preceding the image, 2) the characters’ current intents, and 3) the anticipated subsequent events. However, a significant challenge in VCG research is the prevalent yet under-addressed issue of dataset bias, which can result in spurious correlations during model training. This occurs when a model, influenced by biased data, infers associations that frequently appear in the dataset but may not provide accurate or contextually appropriate interpretations. The issue becomes even more complex in multimodal tasks, where different types of data, such as text and images, each bring their own biases. When these modalities are combined as inputs to a model, one modality might exhibit a stronger bias than the others. To address this, we introduce the Dynamic Debiasing Network (DDNet) for Visual Commonsense Generation. DDNet is designed to identify the biased modality and dynamically counteract modality-specific biases using causal relationships. By considering biases from multiple modalities, DDNet avoids over-focusing on any single modality and effectively combines information from all modalities. The experimental results on the VisualCOMET dataset demonstrate that our proposed network fosters more accurate commonsense inferences. This emphasizes the critical need for debiasing in multimodal tasks and enhances the reliability of machine-generated commonsense narratives.
first_indexed	2024-03-08T19:37:16Z
format	Article
id	doaj.art-5b1a3e2a154149aea336a471112bcc05
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-03-08T19:37:16Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-5b1a3e2a154149aea336a471112bcc05; 2023-12-26T00:02:50Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11; pp. 139706-139714; DOI 10.1109/ACCESS.2023.3340705; IEEE document 10348563; Dynamic Debiasing Network for Visual Commonsense Generation; Jungeun Kim (https://orcid.org/0009-0006-9757-7149), Jinwoo Park (https://orcid.org/0000-0003-4927-1058), Jaekwang Seok, Junyeong Kim (https://orcid.org/0000-0002-7871-9627); all authors: Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea; https://ieeexplore.ieee.org/document/10348563/; Multimodal reasoning; visual commonsense generation; VisualCOMET; dataset bias; debiasing; causal inference
title	Dynamic Debiasing Network for Visual Commonsense Generation
topic	Multimodal reasoning; visual commonsense generation; VisualCOMET; dataset bias; debiasing; causal inference
url	https://ieeexplore.ieee.org/document/10348563/

Similar Items