Summary: | The joint use of hyperspectral image (HSI) and light detection and ranging (LiDAR) data has yielded significant performance gains in land-cover classification. Although spatial–spectral feature learning methods based on convolutional neural networks and transformer networks have achieved prominent advances, the contextual information captured by fixed convolutional kernels and by using all self-attention heads without selection has limited ability to characterize the detailed information and nonredundant features of land covers in multimodal data. In this article, a multiscale head selection transformer (MHST) network is proposed to fully explore detailed and nonredundant features in the spatial and spectral dimensions of HSI and LiDAR data. To better acquire detailed spatial and spectral features at different scales, a multiscale spectral–spatial feature extraction module, consisting of cascaded multiscale 3-D and 2-D convolutional layers, is inserted into MHST. In addition, an adaptive global feature extraction module based on a head selection pooling transformer is placed after the transformer encoder module to alleviate token redundancy in an adaptive computation manner. Finally, we develop a multimodal–multiscale feature fusion classification module that combines local features with the global class token to exploit a powerful global–local fusion style. Extensive experiments on three popular datasets demonstrate that MHST significantly outperforms other related networks.
|