Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation
Text-to-Image (T2I) generation is the task of synthesizing images corresponding to a given text input. The recent innovations in artificial intelligence have enhanced the capacity of conventional T2I generation, yielding more and more powerful models day by day. However, their behavior is known to become unstable in the face of text inputs containing nonwords that have no definition within a language. This behavior not only results in situations where image generation does not match human expectations but also hinders these models from being utilized in psycholinguistic applications and simulations. This paper exploits the human tendency to associate nonwords with phonetically and phonologically similar words and uses it to propose a T2I generation framework robust against nonword inputs. The framework comprises a phonetics-aware language model as well as an adjusted T2I generation model. Our evaluations confirm that the proposed nonword-to-image generation synthesizes images that depict visual concepts of phonetically similar words more stably than comparative methods. We also assess how the image generation results match human expectations, showing a better agreement than the phonetics-blind baseline.
Main Authors: | Chihaya Matsuhira, Marc A. Kastner, Takahiro Komamizu, Takatsugu Hirayama, Keisuke Doman, Yasutomo Kawanishi, Ichiro Ide |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Access |
Subjects: | Nonwords; phonetics; pronunciation; psycholinguistics; text-to-image generation; vision and language |
Online Access: | https://ieeexplore.ieee.org/document/10473073/ |
_version_ | 1797243416818483200 |
---|---|
author | Chihaya Matsuhira; Marc A. Kastner; Takahiro Komamizu; Takatsugu Hirayama; Keisuke Doman; Yasutomo Kawanishi; Ichiro Ide |
author_facet | Chihaya Matsuhira; Marc A. Kastner; Takahiro Komamizu; Takatsugu Hirayama; Keisuke Doman; Yasutomo Kawanishi; Ichiro Ide |
author_sort | Chihaya Matsuhira |
collection | DOAJ |
description | Text-to-Image (T2I) generation is the task of synthesizing images corresponding to a given text input. The recent innovations in artificial intelligence have enhanced the capacity of conventional T2I generation, yielding more and more powerful models day by day. However, their behavior is known to become unstable in the face of text inputs containing nonwords that have no definition within a language. This behavior not only results in situations where image generation does not match human expectations but also hinders these models from being utilized in psycholinguistic applications and simulations. This paper exploits the human nature of associating nonwords with their phonetically and phonologically similar words and uses it to propose a T2I generation framework robust against nonword inputs. The framework comprises a phonetics-aware language model as well as an adjusted T2I generation model. Our evaluations confirm that the proposed nonword-to-image generation synthesizes images that depict visual concepts of phonetically similar words more stably than comparative methods. We also assess how the image generation results match human expectations, showing a better agreement than the phonetics-blind baseline. |
first_indexed | 2024-04-24T18:54:46Z |
format | Article |
id | doaj.art-1ae819dd8f17465891e7e43da4164d71 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-24T18:54:46Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | Record ID: doaj.art-1ae819dd8f17465891e7e43da4164d71. Indexed: 2024-03-26T17:44:21Z. Language: eng. Publisher: IEEE. Series: IEEE Access, ISSN 2169-3536. Published: 2024-01-01, vol. 12, pp. 41299–41316. DOI: 10.1109/ACCESS.2024.3378095. IEEE document: 10473073. Title: Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation. Authors: Chihaya Matsuhira (https://orcid.org/0000-0003-2453-4560), Graduate School of Informatics, Nagoya University, Nagoya, Aichi, Japan; Marc A. Kastner (https://orcid.org/0000-0002-9193-5973), Graduate School of Informatics, Kyoto University, Kyoto, Japan; Takahiro Komamizu (https://orcid.org/0000-0002-3041-4330), Mathematical and Data Science Center, Nagoya University, Nagoya, Aichi, Japan; Takatsugu Hirayama (https://orcid.org/0000-0001-6290-9680), Keisuke Doman (https://orcid.org/0000-0001-6040-4988), Yasutomo Kawanishi (https://orcid.org/0000-0002-3799-4550), and Ichiro Ide (https://orcid.org/0000-0003-3942-9296), Graduate School of Informatics, Nagoya University, Nagoya, Aichi, Japan. Abstract: Text-to-Image (T2I) generation is the task of synthesizing images corresponding to a given text input. The recent innovations in artificial intelligence have enhanced the capacity of conventional T2I generation, yielding more and more powerful models day by day. However, their behavior is known to become unstable in the face of text inputs containing nonwords that have no definition within a language. This behavior not only results in situations where image generation does not match human expectations but also hinders these models from being utilized in psycholinguistic applications and simulations. This paper exploits the human tendency to associate nonwords with phonetically and phonologically similar words and uses it to propose a T2I generation framework robust against nonword inputs. The framework comprises a phonetics-aware language model as well as an adjusted T2I generation model. Our evaluations confirm that the proposed nonword-to-image generation synthesizes images that depict visual concepts of phonetically similar words more stably than comparative methods. We also assess how the image generation results match human expectations, showing a better agreement than the phonetics-blind baseline. URL: https://ieeexplore.ieee.org/document/10473073/. Keywords: Nonwords; phonetics; pronunciation; psycholinguistics; text-to-image generation; vision and language |
spellingShingle | Chihaya Matsuhira; Marc A. Kastner; Takahiro Komamizu; Takatsugu Hirayama; Keisuke Doman; Yasutomo Kawanishi; Ichiro Ide; Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation; IEEE Access; Nonwords; phonetics; pronunciation; psycholinguistics; text-to-image generation; vision and language |
title | Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation |
title_full | Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation |
title_fullStr | Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation |
title_full_unstemmed | Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation |
title_short | Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation |
title_sort | interpolating the text to image correspondence based on phonetic and phonological similarities for nonword to image generation |
topic | Nonwords; phonetics; pronunciation; psycholinguistics; text-to-image generation; vision and language |
url | https://ieeexplore.ieee.org/document/10473073/ |
work_keys_str_mv | AT chihayamatsuhira interpolatingthetexttoimagecorrespondencebasedonphoneticandphonologicalsimilaritiesfornonwordtoimagegeneration AT marcakastner interpolatingthetexttoimagecorrespondencebasedonphoneticandphonologicalsimilaritiesfornonwordtoimagegeneration AT takahirokomamizu interpolatingthetexttoimagecorrespondencebasedonphoneticandphonologicalsimilaritiesfornonwordtoimagegeneration AT takatsuguhirayama interpolatingthetexttoimagecorrespondencebasedonphoneticandphonologicalsimilaritiesfornonwordtoimagegeneration AT keisukedoman interpolatingthetexttoimagecorrespondencebasedonphoneticandphonologicalsimilaritiesfornonwordtoimagegeneration AT yasutomokawanishi interpolatingthetexttoimagecorrespondencebasedonphoneticandphonologicalsimilaritiesfornonwordtoimagegeneration AT ichiroide interpolatingthetexttoimagecorrespondencebasedonphoneticandphonologicalsimilaritiesfornonwordtoimagegeneration |
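The abstract above describes associating a nonword with phonetically similar real words and interpolating the text-to-image correspondence between them. As a rough illustration of that idea only (not the authors' actual method; the toy phoneme lexicon, the Levenshtein phonetic distance, and the softmax weighting below are all assumptions made for the sketch), a nonword can be mapped to an embedding interpolated between its phonetically nearest dictionary words before being passed to a T2I model:

```python
# Minimal sketch: interpolate word embeddings by phonetic similarity so a
# nonword yields a stable input embedding. NOT the paper's implementation.
import numpy as np

def edit_distance(a, b):
    # Levenshtein distance over phoneme sequences (e.g., lists of ARPAbet symbols).
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,      # deletion
                          d[i, j - 1] + 1,      # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return int(d[m, n])

def nonword_embedding(nonword_phones, lexicon, embeddings, k=2, tau=1.0):
    """Blend the embeddings of the k phonetically closest real words.

    lexicon: dict word -> phoneme list; embeddings: dict word -> np.ndarray.
    Weights are a softmax over negative phonetic distances (an assumption).
    """
    dists = {w: edit_distance(nonword_phones, p) for w, p in lexicon.items()}
    nearest = sorted(dists, key=dists.get)[:k]
    w = np.exp([-dists[x] / tau for x in nearest])
    w /= w.sum()
    return sum(wi * embeddings[x] for wi, x in zip(w, nearest))

# Toy lexicon with 2-D stand-in "embeddings" for demonstration.
lexicon = {"cat": ["K", "AE", "T"], "dog": ["D", "AO", "G"]}
emb = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
v = nonword_embedding(["K", "AE", "G"], lexicon, emb)  # leans toward "cat"
```

In a real system the stand-in embeddings would come from the T2I model's text encoder, and the blended vector would replace the unstable embedding the encoder assigns to the out-of-vocabulary token.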