Multi-granularity sequence generation for hierarchical image classification

Abstract: Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with labels at multiple granularities simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and they insufficiently consider relationships between the hierarchical multi-granularity labels.
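
The description field of this record says that the method traverses the taxonomic tree to organize multi-granularity labels into sequences, and that the decoder uses masked self-attention over those labels. The following is a minimal sketch of that idea only, not the authors' implementation: the toy taxonomy, the coarse-to-fine ordering, the `<bos>`/`<eos>` tokens, and the function names `label_sequence` and `causal_mask` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child label -> parent label (None marks a top-level label).
PARENT: Dict[str, Optional[str]] = {
    "bird": None,
    "albatross": "bird",
    "laysan_albatross": "albatross",
    "sooty_albatross": "albatross",
    "sparrow": "bird",
    "house_sparrow": "sparrow",
}


def label_sequence(leaf: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Walk from a fine-grained label up to the root, then emit coarse-to-fine.

    The start/end tokens are an assumption; the record does not say
    which special tokens, if any, the method uses.
    """
    path: List[str] = []
    node: Optional[str] = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()  # coarse label first, fine label last
    return ["<bos>"] + path + ["<eos>"]


def causal_mask(length: int) -> List[List[bool]]:
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]


seq = label_sequence("laysan_albatross", PARENT)
mask = causal_mask(len(seq))
```

Under these assumptions, a mask of this shape means each finer-grained label is predicted with access only to the coarser labels that precede it in the sequence, which is one way to read the "masked multi-head self-attention" mentioned in the description.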

Bibliographic Details
Main Authors:	Xinda Liu, Lili Wang
Format:	Article
Language:	English
Published:	SpringerOpen 2024-01-01
Series:	Computational Visual Media
Subjects:	hierarchical multi-granularity classification; vision and text transformer; sequence generation; fine-grained image recognition; cross-modality attention
Online Access:	https://doi.org/10.1007/s41095-022-0332-2

_version_	1827388253154574336
author	Xinda Liu, Lili Wang
author_facet	Xinda Liu, Lili Wang
author_sort	Xinda Liu
collection	DOAJ
description	Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with labels at multiple granularities simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and they insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, then vectorize them and add positional information. The proposed method builds a decoder that takes the visual representation sequences and semantic label embeddings as inputs and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgsg.
first_indexed	2024-03-08T16:15:01Z
format	Article
id	doaj.art-1f9ddbe7a6d34778be7b51a074acbfc8
institution	Directory Open Access Journal
issn	2096-0433, 2096-0662
language	English
last_indexed	2024-03-08T16:15:01Z
publishDate	2024-01-01
publisher	SpringerOpen
record_format	Article
series	Computational Visual Media
spelling	doaj.art-1f9ddbe7a6d34778be7b51a074acbfc8 | 2024-01-07T12:39:08Z | eng | SpringerOpen | Computational Visual Media | 2096-0433, 2096-0662 | 2024-01-01 | vol. 10, no. 2, pp. 243-260 | 10.1007/s41095-022-0332-2 | Multi-granularity sequence generation for hierarchical image classification | Xinda Liu (0), Lili Wang (1) | (0) State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; (1) State Key Laboratory of Virtual Reality Technology and Systems, Beihang University | https://doi.org/10.1007/s41095-022-0332-2
title	Multi-granularity sequence generation for hierarchical image classification
topic	hierarchical multi-granularity classification; vision and text transformer; sequence generation; fine-grained image recognition; cross-modality attention
url	https://doi.org/10.1007/s41095-022-0332-2

Similar Items