Summary: | Few-shot human action recognition, a prominent area in computer vision, has attracted increasing attention and is seeing broader use in real-life scenarios. Extracting spatio-temporal skeletal information from videos of human movement offers interpretable and data-efficient features. However, existing spatio-temporal feature encoders struggle with sequence boundaries and with noise. To address these problems, this paper proposes a temporal complement method that optimizes the Dynamic Time Warping (DTW) algorithm based on the feature representation of the human skeleton sequence. DTW finds an optimal alignment between sequences by warping them in the time domain, which is especially useful when training data is limited. However, DTW has the drawback that the optimal alignment path is highly sensitive to errors in the time series distance matrix. Therefore, we apply a Virtual Adversarial Training method to improve the algorithm's robustness to noise: during training, we introduce adversarial perturbations to the time series distance matrix, incentivizing the model to be resilient to such noise. In the 5-way-1-shot setting, our method achieves higher accuracy than the ProtoNet, DTW, and DASTM methods on the NTU-S (77.7%) and Kinetics (41.2%) datasets. In the 5-way-5-shot setting, our method achieves the highest accuracy among these approaches on the Kinetics dataset (51.8%).
|
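As a rough illustration of the DTW step described in the summary, the sketch below computes the accumulated alignment cost over a pairwise distance matrix and shows how noise added to that matrix changes the result. It is a minimal textbook DTW, not the paper's temporal complement method; the helper names `pairwise_dist` and `dtw_cost`, the toy 1-D sequences, and the uniform random noise (a stand-in for the adversarial perturbation, which is not specified here) are all assumptions made for illustration.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance between every frame of a (n, d) and b (m, d)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def dtw_cost(dist):
    """Accumulated cost of the cheapest monotonic alignment path through dist."""
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor cells.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev
    return acc[n, m]


# Toy sequences: b repeats the first frame of a, so warping aligns them exactly.
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0], [3.0]])

dist = pairwise_dist(a, b)
clean = dtw_cost(dist)

# Assumption: uniform noise stands in for the adversarial perturbation.
rng = np.random.default_rng(0)
noisy = dtw_cost(dist + 0.1 * rng.random(dist.shape))

print(clean, noisy)
```

On the clean matrix the two toy sequences align at zero cost, while any non-negative noise on the distance matrix can only raise the accumulated cost, which is the sensitivity the summary refers to.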