Realistic Image Generation from Text by Using BERT-Based Embedding
Recently, in the field of artificial intelligence, multimodal learning has received a lot of attention due to expectations for the enhancement of AI performance and potential applications. Text-to-image generation, which is one of the multimodal tasks, is a challenging topic in computer vision and n...
Main Authors: | Sanghyuck Na, Mirae Do, Kyeonah Yu, Juntae Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2022-03-01 |
Series: | Electronics |
Subjects: | text to image generation, multimodal data, BERT, GAN |
Online Access: | https://www.mdpi.com/2079-9292/11/5/764 |
_version_ | 1797475343071707136 |
---|---|
author | Sanghyuck Na, Mirae Do, Kyeonah Yu, Juntae Kim |
author_facet | Sanghyuck Na, Mirae Do, Kyeonah Yu, Juntae Kim |
author_sort | Sanghyuck Na |
collection | DOAJ |
description | Multimodal learning has recently attracted considerable attention in artificial intelligence because it is expected to improve AI performance and enable new applications. Text-to-image generation, one such multimodal task, is a challenging topic in computer vision and natural language processing. Text-to-image generation models based on generative adversarial networks (GANs) use a text encoder pre-trained on image-text pairs. However, such encoders cannot obtain rich information about texts not seen during pre-training, so it is hard to generate an image that semantically matches a given text description. In this paper, we propose a new text-to-image generation model using pre-trained BERT, which is widely used in natural language processing. The pre-trained BERT is fine-tuned on a large amount of text and used as the text encoder, so that it captures rich information about the text and is suitable for the image generation task. Through experiments on a multimodal benchmark dataset, we show that the proposed method improves on the baseline model both quantitatively and qualitatively. |
first_indexed | 2024-03-09T20:42:50Z |
format | Article |
id | doaj.art-7d7d595ed6844a4aa72aa96b9c1704c4 |
institution | Directory Open Access Journal |
issn | 2079-9292 |
language | English |
last_indexed | 2024-03-09T20:42:50Z |
publishDate | 2022-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Electronics |
spelling | doaj.art-7d7d595ed6844a4aa72aa96b9c1704c4; 2023-11-23T22:53:37Z; eng; MDPI AG; Electronics; 2079-9292; 2022-03-01; 11(5):764; 10.3390/electronics11050764; Realistic Image Generation from Text by Using BERT-Based Embedding; Sanghyuck Na (0), Mirae Do (1), Kyeonah Yu (2), Juntae Kim (3); (0) Department of Computer Science and Engineering, Dongguk University, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea; (1) Department of Computer Engineering, Duksung Women’s University, Samyang-ro 144-3gil, Dobong-gu, Seoul 01369, Korea; (2) Department of Computer Engineering, Duksung Women’s University, Samyang-ro 144-3gil, Dobong-gu, Seoul 01369, Korea; (3) Department of Computer Science and Engineering, Dongguk University, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea; https://www.mdpi.com/2079-9292/11/5/764; text to image generation; multimodal data; BERT; GAN |
title | Realistic Image Generation from Text by Using BERT-Based Embedding |
topic | text to image generation multimodal data BERT GAN |
url | https://www.mdpi.com/2079-9292/11/5/764 |