Summary: | Multi-label image classification is better aligned with real-world applications than its single-label counterpart. The problem is difficult because the complex label space makes it hard to obtain label-level attention regions and to model the semantic relationships among labels. Common deep-network-based methods use a CNN to extract features and treat the labels as a sequence or a graph, handling label correlations with an RNN or with graph-theoretical algorithms. In this paper, we propose a novel CNN-RNN-based model, the bi-modal multi-label learning (BMML) framework. First, an improved channel-wise attention mechanism is presented to generate regional attention maps and associate them with the corresponding labels. Then, based on the assumption that objects in a semantic scene have high-level relevance across visual and textual corpora, we further embed the labels through different pre-trained language models and determine the label sequence in a “semantic space” constructed from large-scale textual data, thereby handling the labels in their semantic context. In addition, a cross-modal feature alignment module is introduced into the BMML framework. Experimental results show that BMML achieves higher accuracy than mainstream multi-label classification methods on several benchmark datasets.