Zero-Shot Image Classification with Rectified Embedding Vectors Using a Caption Generator

Bibliographic Details
Main Authors: Chan Hur, Hyeyoung Park (School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Republic of Korea)
Format: Article
Language: English
Published: MDPI AG, 2023-06-01
Series: Applied Sciences, vol. 13, no. 12, article 7071
DOI: 10.3390/app13127071
ISSN: 2076-3417
Collection: DOAJ (Directory of Open Access Journals)
Subjects: zero-shot learning; image captioning; joint-embedding; visual feature enhancement; textual feature generation
Online Access: https://www.mdpi.com/2076-3417/13/12/7071

Description:
Although image recognition technologies are developing rapidly with deep learning, conventional recognition models trained by supervised learning with class labels do not work well when they are given test inputs from untrained classes. For example, a recognizer trained to classify Asian bird species cannot recognize the kiwi, because the class label “kiwi” and its image samples were never seen during training. To overcome this limitation, zero-shot classification has been studied recently, and the joint-embedding-based approach has been suggested as one of the promising solutions. In this approach, image features and text descriptions belonging to the same class are trained to lie close together in a common joint-embedding space. Once we obtain an embedding function that captures the semantic relationship of the image–text pairs in the training data, test images and text descriptions (prototypes) of unseen classes can also be mapped to the joint-embedding space for classification. The main challenge of this approach is mapping inputs of two different modalities into a common space, and previous works suffer from an inconsistency between the distributions of the two feature sets that the heterogeneous inputs produce in the joint-embedding space. To address this problem, we propose a novel method that employs additional textual information to rectify the visual representation of input images. Since the conceptual information of test classes is generally given as text, we expect the additional descriptions produced by a caption generator to adjust the visual features so that they match the representations of the test classes more closely. We also propose using the generated textual descriptions to augment the training samples for learning the joint-embedding space. In experiments on two benchmark datasets, the proposed method shows significant performance improvements of 1.4% on the CUB dataset and 5.5% on the flower dataset in comparison to existing models.
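
To make the pipeline concrete, below is a minimal PyTorch sketch of the ideas the description outlines: a joint-embedding space for image and text features, rectification of the visual embedding with the embedding of a generated caption, caption-based augmentation of the training text samples, and nearest-prototype classification of unseen classes. The module names, dimensions, gated fusion, and contrastive loss here are illustrative assumptions, not the authors' exact architecture; the feature extractors and the caption generator itself are assumed to be given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Projects image and text features into a shared joint-embedding space."""

    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # visual branch
        self.txt_proj = nn.Linear(txt_dim, joint_dim)  # textual branch
        # Hypothetical rectification gate: fuses the generated-caption
        # embedding into the visual embedding, nudging it toward the
        # textual side of the joint space.
        self.gate = nn.Linear(2 * joint_dim, joint_dim)

    def embed_text(self, txt_feat):
        return F.normalize(self.txt_proj(txt_feat), dim=-1)

    def embed_image(self, img_feat, caption_feat):
        v = self.img_proj(img_feat)
        c = self.txt_proj(caption_feat)  # embedding of the generated caption
        v = v + torch.tanh(self.gate(torch.cat([v, c], dim=-1)))  # rectify
        return F.normalize(v, dim=-1)

def training_loss(model, img_feat, caption_feat, class_protos, labels, tau=0.07):
    """Contrastive loss pulling each image toward its class text prototype.

    The generated caption also serves as an extra text view of the class,
    mirroring the caption-based augmentation described above.
    """
    img_emb = model.embed_image(img_feat, caption_feat)   # (B, D)
    proto_emb = model.embed_text(class_protos)            # (C, D)
    loss = F.cross_entropy(img_emb @ proto_emb.t() / tau, labels)
    aug_emb = model.embed_text(caption_feat)              # captions as augmented text samples
    loss = loss + F.cross_entropy(aug_emb @ proto_emb.t() / tau, labels)
    return loss

def zero_shot_classify(model, img_feat, caption_feat, unseen_protos):
    """Predict the unseen class whose text prototype is nearest in joint space."""
    img_emb = model.embed_image(img_feat, caption_feat)   # (B, D)
    proto_emb = model.embed_text(unseen_protos)           # (C_unseen, D)
    return (img_emb @ proto_emb.t()).argmax(dim=-1)       # cosine similarity
```

The gated residual fusion is just one plausible way to "rectify" the visual vector; any fusion that shifts image embeddings toward the distribution of the text embeddings would fit the description above.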