A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning
Video summarization (VS) is a widely used technique for facilitating the effective reading, fast comprehension, and effective retrieval of video content. Certain properties of the new video data, such as a lack of prominent emphasis and a fuzzy theme development border, disturb the original thinking...
Main Authors: | Xiaoyu Teng, Xiaolin Gui, Pan Xu, Jianglei Tong, Jian An, Yang Liu, Huilan Jiang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-10-01 |
Series: | Sensors |
Subjects: | video summarization; spatial–temporal features; cross-attention |
Online Access: | https://www.mdpi.com/1424-8220/22/21/8275 |
_version_ | 1797466553262800896 |
---|---|
author | Xiaoyu Teng; Xiaolin Gui; Pan Xu; Jianglei Tong; Jian An; Yang Liu; Huilan Jiang |
author_sort | Xiaoyu Teng |
collection | DOAJ |
description | Video summarization (VS) is a widely used technique for facilitating effective reading, fast comprehension, and efficient retrieval of video content. Certain properties of new video data, such as a lack of prominent emphasis and fuzzy boundaries in theme development, disrupt the conventional approach based on video feature information and introduce new challenges for extracting depth and breadth features from video. In addition, the diversity of user requirements complicates accurate keyframe screening. To overcome these challenges, this paper proposes a hierarchical spatial–temporal cross-attention scheme for video summarization based on contrastive learning. Graph attention networks (GAT) and a multi-head convolutional attention cell are used to extract local and depth features, while a GAT-adjusted bidirectional ConvLSTM (DB-ConvLSTM) is used to extract global and breadth features. Furthermore, a spatial–temporal cross-attention-based ConvLSTM is developed to merge the hierarchical features and achieve more accurate screening within clusters of similar keyframes. Verification experiments and comparative analysis demonstrate that the proposed method outperforms state-of-the-art methods. |
first_indexed | 2024-03-09T18:40:27Z |
format | Article |
id | doaj.art-c9b85f1343f84bd0bf7a8f2dcb80ea6c |
institution | Directory of Open Access Journals |
issn | 1424-8220 |
language | English |
last_indexed | 2024-03-09T18:40:27Z |
publishDate | 2022-10-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
spelling | doaj.art-c9b85f1343f84bd0bf7a8f2dcb80ea6c; 2023-11-24T06:45:28Z; eng; MDPI AG; Sensors; ISSN 1424-8220; 2022-10-01; vol. 22, no. 21, art. 8275; DOI 10.3390/s22218275. Title: A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning. Authors: Xiaoyu Teng (0), Xiaolin Gui (1), Pan Xu (2), Jianglei Tong (3), Jian An (4), Yang Liu (5), Huilan Jiang (6). Affiliations, in record order: (0–4) Department of Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; (5) Medical College, Northwest Minzu University, Lanzhou 730030, China; (6) ONYCOM Co., Ltd., Seoul 04519, Korea. URL: https://www.mdpi.com/1424-8220/22/21/8275. Keywords: video summarization; spatial–temporal features; cross-attention |
title | A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning |
title_sort | hierarchical spatial temporal cross attention scheme for video summarization using contrastive learning |
topic | video summarization; spatial–temporal features; cross-attention |
url | https://www.mdpi.com/1424-8220/22/21/8275 |