AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval

Artificial intelligence research in natural language processing in the context of poetry struggles with the recognition of holistic content such as poetic symbolism, metaphor, and other fine-grained attributes. Given these challenges, multi-modal image–poetry reasoning and retrieval remain largely u...


Bibliographic Details
Main Authors: Muhammad Shahid Jabbar, Jitae Shin, Jun-Dong Cho
Format: Article
Language: English
Published: MDPI AG 2022-04-01
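Series: Electronics
Subjects: image-based poetry retrieval; fine-grained attribute recognition; accessibility; multi-modal attention; cross-encoder
Online Access: https://www.mdpi.com/2079-9292/11/8/1275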
author Muhammad Shahid Jabbar
Jitae Shin
Jun-Dong Cho
collection DOAJ
description Artificial intelligence research on natural language processing for poetry struggles to recognize holistic content such as poetic symbolism, metaphor, and other fine-grained attributes. Given these challenges, multi-modal image–poetry reasoning and retrieval remain largely unexplored. Our recent accessibility study indicates that poetry is an effective medium for conveying the attributes of visual artwork, improving artwork appreciation for people with visual impairments. We therefore introduce a deep learning approach for automatically retrieving poetry that suits an input image. The recent state-of-the-art CLIP model matches multi-modal visual and text features using cosine similarity. However, it lacks shared cross-modality attention features for modeling fine-grained relationships. The proposed approach takes advantage of the strong pre-training of the CLIP model and overcomes its limitations by introducing shared attention parameters that better model the fine-grained relationship between the two modalities. We test and compare the proposed approach on the expertly annotated MultiM-Poem dataset, which is considered the largest public image–poetry pair dataset for English poetry. The proposed approach aims to solve the problems of image-based attribute recognition and automatic retrieval of fine-grained poetic verses. The test results indicate that the shared attention parameters improve fine-grained attribute recognition, and that the proposed approach is a significant step towards automatic multi-modal retrieval for improved artwork appreciation by people with visual impairments.
first_indexed 2024-03-09T13:44:47Z
format Article
id doaj.art-f8d78dcd196f457ca8fc36c5b04a8f87
institution Directory Open Access Journal
issn 2079-9292
language English
last_indexed 2024-03-09T13:44:47Z
publishDate 2022-04-01
publisher MDPI AG
record_format Article
series Electronics
spelling doaj.art-f8d78dcd196f457ca8fc36c5b04a8f87; 2023-11-30T21:02:31Z; eng; MDPI AG; Electronics; ISSN 2079-9292; 2022-04-01; vol. 11, no. 8, article 1275; DOI 10.3390/electronics11081275
Authors: Muhammad Shahid Jabbar, Jitae Shin, Jun-Dong Cho (all: Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Korea)
title AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval
topic image-based poetry retrieval
fine-grained attribute recognition
accessibility
multi-modal attention
cross-encoder
url https://www.mdpi.com/2079-9292/11/8/1275
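The description above notes that CLIP matches visual and text features using cosine similarity. As a rough illustration of that matching step only, the sketch below ranks poems for each image by cosine similarity. It is not the authors' implementation and does not include their shared attention parameters. The function names (`cosine_similarity_matrix`, `retrieve_top_k`) and the toy three-dimensional vectors are hypothetical; in practice the embeddings would come from pre-trained image and text encoders.

```python
import numpy as np


def cosine_similarity_matrix(image_emb, poem_emb):
    """Return an (n_images, n_poems) matrix of cosine similarities."""
    image_emb = np.asarray(image_emb, dtype=float)
    poem_emb = np.asarray(poem_emb, dtype=float)
    # L2-normalise each row so that a dot product equals the cosine of the angle.
    image_unit = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    poem_unit = poem_emb / np.linalg.norm(poem_emb, axis=1, keepdims=True)
    return image_unit @ poem_unit.T


def retrieve_top_k(image_emb, poem_emb, k=2):
    """Return, for each image, the indices of the k most similar poems (best first)."""
    sims = cosine_similarity_matrix(image_emb, poem_emb)
    # argsort sorts ascending, so negate the scores to rank from most to least similar.
    return np.argsort(-sims, axis=1)[:, :k]


# Toy embeddings, made up for illustration (assumption: real ones come from encoders).
image_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 1.0]])
poem_embeddings = np.array([[0.9, 0.1, 0.0],
                            [0.0, 2.0, 2.0],
                            [-1.0, 0.0, 0.0]])

top = retrieve_top_k(image_embeddings, poem_embeddings, k=2)
print(top)
```

Because every vector is normalised first, the ranking depends only on direction, not magnitude: the second poem vector is a scaled copy of the second image vector and therefore scores as a perfect match for it.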