Summary: | Generating images from text descriptions lies at the intersection of natural language processing and computer vision. The task is to generate an image whose semantic details conform to an input text description. Generative Adversarial Networks (GANs), among the most popular generative models, are an important approach to generating images from text.
Text-to-image methods based on GANs have developed rapidly in recent years. They generally adopt a Conditional GAN (CGAN) architecture, feeding the textual description into the network as an additional feature so that images are generated under the text constraint. For our study, we choose the widely used Stacked GAN (StackGAN) model as the baseline and propose the following improvements to its model structure and training procedure as our contribution to this research area.
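To make the conditioning mechanism concrete, the following minimal PyTorch sketch shows the general CGAN idea of concatenating a text embedding with the noise vector; the class name, layer sizes, and dimensions are illustrative assumptions, not the architecture used in this work.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy conditional generator: the text embedding is concatenated
    with the noise vector so that generation is constrained by the text."""
    def __init__(self, noise_dim=100, text_dim=256, out_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, out_pixels),
            nn.Tanh(),
        )

    def forward(self, noise, text_embedding):
        # Condition on text by concatenating it with the noise input.
        x = torch.cat([noise, text_embedding], dim=1)
        return self.net(x).view(-1, 3, 64, 64)

# Usage: a batch of 4 noise vectors and pre-computed text embeddings.
g = ConditionalGenerator()
fake = g(torch.randn(4, 100), torch.randn(4, 256))  # -> (4, 3, 64, 64)
```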
Regarding the model structure, StackGAN consists of two stacked generators. We improve the textual conditioning by feeding the text embedding into each upsampling block as a multi-level input; in this way the image generators receive richer semantic information, improving image-text consistency. We also add Non-local blocks within the two generators so that the network can better integrate global information. Regarding the training procedure, we propose to train the network with the Wasserstein GAN (WGAN) method to alleviate the difficulty of training the original GAN. In particular, we use the Wasserstein distance as the guidance signal for GAN training, which also alleviates the common problem of mode collapse.
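The sketch below illustrates these three ideas in PyTorch-like code: a simplified Non-local (self-attention) block, an upsampling block that re-injects the text embedding at its own spatial scale (multi-level conditioning), and the basic WGAN critic/generator losses. All names, layer sizes, and the omission of WGAN weight clipping or gradient penalty are simplifying assumptions for illustration, not the exact implementation of this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Simplified Non-local (self-attention) block: every spatial position
    attends to every other position, letting the generator use global context."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 2, 1)
        self.phi = nn.Conv2d(channels, channels // 2, 1)
        self.g = nn.Conv2d(channels, channels // 2, 1)
        self.out = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C/2)
        k = self.phi(x).flatten(2)                      # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)        # (B, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)             # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.out(y)                          # residual connection

class TextConditionedUpBlock(nn.Module):
    """Upsampling block that re-injects the text embedding at its own scale,
    so every stage of the generator sees the textual condition."""
    def __init__(self, in_ch, out_ch, text_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + text_dim, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, text_embedding):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        # Broadcast the text embedding over the spatial grid and concatenate.
        t = text_embedding[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return F.relu(self.bn(self.conv(torch.cat([x, t], dim=1))))

def wgan_losses(critic, real, fake):
    """WGAN objective: the critic estimates the Wasserstein distance,
    which the generator then minimizes (Lipschitz constraint omitted here)."""
    d_loss = critic(fake.detach()).mean() - critic(real).mean()
    g_loss = -critic(fake).mean()
    return d_loss, g_loss

# Usage sketch: one stage applies the up-block and then global self-attention.
up = TextConditionedUpBlock(in_ch=64, out_ch=32)
attn = NonLocalBlock(32)
feat = attn(up(torch.randn(2, 64, 16, 16), torch.randn(2, 256)))  # -> (2, 32, 32, 32)
```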
To evaluate the impact of our modifications to the original StackGAN, we conduct ablation experiments on the CUB-200-2011 bird dataset. The results show that the revised network outperforms the original network.
|