An Integrated Transformer with Collaborative Tokens Mining for Fine-Grained Recognition

Bibliographic Details
Main Authors: Weiwei Yang, Jian Yin
Format: Article
Language: English
Published: MDPI AG 2023-06-01
Series: Electronics
Online Access: https://www.mdpi.com/2079-9292/12/12/2635
Description
Summary: Fine-grained recognition classifies images into hundreds of subcategory labels (e.g., Cape May warbler versus Magnolia warbler) by locating discriminative regions. Because locating regions with traditional backbone architectures is highly complex and non-differentiable, most existing approaches use multi-level reinforcement learning to distinguish sub-categories with similar appearance. These methods exploit incomplete information, using only the intra-class informative regions within one image or the inter-class and intra-class relationships between pairs of images, which tends to produce overlapping region locations. Since inter-class correlations and a backbone with complete contextual semantic information play important roles in distinguishing fine-grained classes, we propose a novel transformer with a collaborative token mining (TCTM) scheme that fully exploits the relationships between inter-class and intra-class regions. The proposed TCTM scheme, built on a new transformer backbone, consists of two modules that collaboratively explore spatially aware tokens: the Pyramid Tokens Multiplication (PTM) module, which exploits the integrated multi-stage inter-class and intra-class correlations from the new transformer architecture, and the Tokens Proposals Generation (TPG) module, which captures two groups of top-four discriminative tokens. The two PTMs extract contrastive tokens for each image and learn to rank them, assuming that tokens from the same class and the same module should have smaller distances. The TPGs further sort and update the candidate tokens from the extracted attention tokens by ranking their probabilities against the ground-truth subcategory labels. Through the collaboration between the PTM and TPG, our TCTM scheme takes the integrated correlations into account and mines the discriminative tokens for final fine-grained classification. Extensive experiments on four popular benchmarks show that our proposed TCTM outperforms state-of-the-art methods by a large margin.
ISSN: 2079-9292
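
Illustrative sketch: The summary describes two ideas that a small example may make more concrete: selecting the top-four tokens by their probability for the ground-truth label (the TPG step), and ranking tokens so that same-class tokens lie closer together (the PTM assumption). The following is a minimal sketch in plain Python, not the authors' implementation. The function names `top_k_tokens` and `margin_ranking_loss`, the Euclidean distance, the hinge form of the loss, the margin value, and the toy probabilities are all assumptions made for illustration.

```python
import math


def top_k_tokens(token_probs, label, k=4):
    """Return indices of the k tokens with the highest probability for `label`.

    token_probs: one list of class probabilities per token (hypothetical input).
    label: index of the ground-truth subcategory.
    """
    ranked = sorted(
        range(len(token_probs)),
        key=lambda i: token_probs[i][label],
        reverse=True,
    )
    return ranked[:k]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def margin_ranking_loss(anchor, same_class, other_class, margin=1.0):
    """Hinge loss (an assumed form): zero once the same-class token is
    closer to the anchor than the other-class token by at least `margin`."""
    gap = euclidean(anchor, same_class) - euclidean(anchor, other_class)
    return max(0.0, gap + margin)


# Toy data: six tokens scored over three classes (made-up numbers).
token_probs = [
    [0.10, 0.70, 0.20],
    [0.30, 0.30, 0.40],
    [0.05, 0.90, 0.05],
    [0.60, 0.20, 0.20],
    [0.20, 0.50, 0.30],
    [0.10, 0.80, 0.10],
]

selected = top_k_tokens(token_probs, label=1)
print(selected)  # [2, 5, 0, 4]

# Same-class token is much closer than the other-class token, so loss is 0.0.
print(margin_ranking_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

In a trained model the probabilities would come from a classifier head applied to each token and the distances from learned token embeddings; here both are hard-coded so the ranking logic can be read in isolation.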