AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval
Artificial intelligence research in natural language processing struggles, in the context of poetry, to recognize holistic content such as poetic symbolism, metaphor, and other fine-grained attributes. Given these challenges, multi-modal image–poetry reasoning and retrieval remain largely unexplored.
Main Authors: | Muhammad Shahid Jabbar, Jitae Shin, Jun-Dong Cho |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-04-01 |
Series: | Electronics |
Subjects: | image-based poetry retrieval; fine-grained attribute recognition; accessibility; multi-modal attention; cross-encoder |
Online Access: | https://www.mdpi.com/2079-9292/11/8/1275 |
_version_ | 1797446731270455296 |
---|---|
author | Muhammad Shahid Jabbar; Jitae Shin; Jun-Dong Cho |
author_facet | Muhammad Shahid Jabbar; Jitae Shin; Jun-Dong Cho |
author_sort | Muhammad Shahid Jabbar |
collection | DOAJ |
description | Artificial intelligence research in natural language processing struggles, in the context of poetry, to recognize holistic content such as poetic symbolism, metaphor, and other fine-grained attributes. Given these challenges, multi-modal image–poetry reasoning and retrieval remain largely unexplored. Our recent accessibility study indicates that poetry is an effective medium for conveying visual artwork attributes and improving artwork appreciation for people with visual impairments. We therefore introduce a deep learning approach for automatically retrieving poetry suited to an input image. The recent state-of-the-art CLIP model matches multi-modal visual and text features using cosine similarity, but it lacks shared cross-modality attention and therefore struggles to model fine-grained relationships. The proposed approach takes advantage of CLIP's strong pre-training and overcomes this limitation by introducing shared attention parameters that better model the fine-grained relationship between the two modalities. We evaluate and compare the proposed approach on the expertly annotated MultiM-Poem dataset, the largest public image–poetry pair dataset for English poetry. The approach addresses image-based attribute recognition and automatic retrieval of fine-grained poetic verses. The test results show that the shared attention parameters improve fine-grained attribute recognition, and the proposed approach is a significant step towards automatic multi-modal retrieval for improved artwork appreciation by people with visual impairments. |
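The description outlines the method only at a high level: CLIP image and text embeddings matched by cosine similarity, augmented with attention parameters shared across the two modalities. The sketch below illustrates that idea under stated assumptions; it is not the authors' published architecture. The `SharedAttentionScorer` module, its dimensions, and the pairing of OpenAI's `clip` package with `torch.nn.MultiheadAttention` are illustrative choices, not details taken from the paper.

```python
# Hypothetical sketch: CLIP-based image-poem scoring with a shared
# cross-modal attention head (illustrative; not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git


class SharedAttentionScorer(nn.Module):
    """Scores an image against candidate poems.

    Baseline CLIP score: cosine similarity of pooled embeddings.
    Assumed addition: a single attention module whose parameters are
    reused for both modalities, so image and poem features pass through
    the same weights before cross-modal matching.
    """

    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # One attention module shared by both modalities -> shared parameters.
        self.shared_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, img_emb: torch.Tensor, poem_emb: torch.Tensor) -> torch.Tensor:
        # img_emb:  (B, D) pooled CLIP image features
        # poem_emb: (N, D) pooled CLIP text features for N candidate poems
        img = img_emb.unsqueeze(1)    # (B, 1, D): treat the pooled vector as a one-token sequence
        poem = poem_emb.unsqueeze(1)  # (N, 1, D)
        # The same attention weights are applied to both modalities.
        img_att, _ = self.shared_attn(img, img, img)
        poem_att, _ = self.shared_attn(poem, poem, poem)
        img_att = F.normalize(img_att.squeeze(1), dim=-1)    # (B, D)
        poem_att = F.normalize(poem_att.squeeze(1), dim=-1)  # (N, D)
        return img_att @ poem_att.t()  # (B, N) cosine-similarity score matrix


device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
scorer = SharedAttentionScorer().to(device)
# In practice the shared attention head would be trained on image-poem pairs.


def retrieve_poem(image, candidate_poems):
    """Return the candidate poem with the highest score for a PIL image."""
    with torch.no_grad():
        img_emb = model.encode_image(preprocess(image).unsqueeze(0).to(device)).float()
        txt_emb = model.encode_text(clip.tokenize(candidate_poems, truncate=True).to(device)).float()
        scores = scorer(img_emb, txt_emb)  # (1, N)
    return candidate_poems[scores.argmax(dim=-1).item()]
```

Reusing one attention module for both modalities is the simplest reading of "shared attention parameters"; the paper may combine or train these components differently.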
first_indexed | 2024-03-09T13:44:47Z |
format | Article |
id | doaj.art-f8d78dcd196f457ca8fc36c5b04a8f87 |
institution | Directory Open Access Journal |
issn | 2079-9292 |
language | English |
last_indexed | 2024-03-09T13:44:47Z |
publishDate | 2022-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Electronics |
doi | 10.3390/electronics11081275 |
citation | Electronics, vol. 11, no. 8, article 1275 (2022-04-01) |
author affiliations | Muhammad Shahid Jabbar; Jitae Shin; Jun-Dong Cho: Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Korea |
title | AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval |
topic | image-based poetry retrieval; fine-grained attribute recognition; accessibility; multi-modal attention; cross-encoder |
url | https://www.mdpi.com/2079-9292/11/8/1275 |