Bi-Modal Learning With Channel-Wise Attention for Multi-Label Image Classification

Bibliographic Details
Main Authors: Peng Li, Peng Chen, Yonghong Xie, Dezheng Zhang
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/8951081/
Description
Summary: Multi-label image classification is closer to real-world applications. The problem is difficult because the complex label space makes it hard to obtain label-level attention regions and to handle the semantic relationships among labels. Common deep network-based methods use a CNN to extract features and treat the labels as a sequence or a graph, handling the label correlations with an RNN or graph-theoretical algorithms. In this paper, we propose a novel CNN-RNN-based model, the bi-modal multi-label learning (BMML) framework. First, an improved channel-wise attention mechanism is presented to produce regional attention maps and connect them to the corresponding labels. Then, based on the assumption that objects in a semantic scene always have high-level relevance across the visual and textual corpora, we embed the labels through different pre-trained language models and determine the label sequence in a “semantic space” constructed on large-scale textual data, thereby handling the labels in their semantic context. In addition, a cross-modal feature aligning module is introduced in the BMML framework. Experimental results show that BMML achieves better accuracy than mainstream multi-label classification methods on several benchmark data sets.
ISSN: 2169-3536
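
Illustration: The summary mentions a channel-wise attention mechanism but gives no details of it. As a rough orientation only, the sketch below shows a generic squeeze-and-excitation style channel attention in plain Python. It is not the improved mechanism proposed in the paper. The function name `channel_attention`, the weight matrices `w1` and `w2`, the reduction ratio `r`, and the random inputs are all assumptions made for this example.

```python
import math
import random


def channel_attention(features, w1, w2):
    """Generic SE-style channel attention (illustrative, not the paper's method).

    features: list of C channels, each an H x W list of lists of floats.
    w1: (C // r) x C weight matrix, w2: C x (C // r) weight matrix (hypothetical).
    Returns the re-weighted feature map and the per-channel weights.
    """
    # Squeeze: global average pooling reduces each channel to one scalar.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features
    ]
    # Excite: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Rescale: multiply every value in a channel by that channel's weight.
    reweighted = [
        [[v * a for v in row] for row in ch] for ch, a in zip(features, weights)
    ]
    return reweighted, weights


# Toy input: 8 channels of 4 x 4 random activations, reduction ratio r = 2.
random.seed(0)
C, H, W, r = 8, 4, 4, 2
features = [[[random.gauss(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.gauss(0, 0.1) for _ in range(C // r)] for _ in range(C)]
out, weights = channel_attention(features, w1, w2)
```

Each entry of `weights` lies strictly between 0 and 1 and scales one whole channel. In a trained network `w1` and `w2` would be learned rather than sampled at random.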