A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

Video summarization (VS) is a widely used technique for facilitating the effective reading, fast comprehension, and effective retrieval of video content. Certain properties of the new video data, such as a lack of prominent emphasis and a fuzzy theme development border, disturb the original thinking...

Bibliographic Details
Main Authors: Xiaoyu Teng, Xiaolin Gui, Pan Xu, Jianglei Tong, Jian An, Yang Liu, Huilan Jiang
Format: Article
Language: English
Published: MDPI AG 2022-10-01
Series: Sensors
Subjects: video summarization; spatial–temporal features; cross-attention
Online Access: https://www.mdpi.com/1424-8220/22/21/8275
author Xiaoyu Teng
Xiaolin Gui
Pan Xu
Jianglei Tong
Jian An
Yang Liu
Huilan Jiang
collection DOAJ
description Video summarization (VS) is a widely used technique for facilitating the effective reading, fast comprehension, and effective retrieval of video content. Certain properties of the new video data, such as a lack of prominent emphasis and a fuzzy theme development border, disturb the original thinking mode based on video feature information. Moreover, these properties introduce new challenges to the extraction of video depth and breadth features. In addition, the diversity of user requirements further complicates accurate keyframe screening. To overcome these challenges, this paper proposes a hierarchical spatial–temporal cross-attention scheme for video summarization based on contrastive learning. Graph attention networks (GAT) and a multi-head convolutional attention cell are used to extract local and depth features, while a GAT-adjusted bidirectional ConvLSTM (DB-ConvLSTM) is used to extract global and breadth features. Furthermore, a spatial–temporal cross-attention-based ConvLSTM is developed to merge the hierarchical features and to screen more accurately within clusters of similar keyframes. Verification experiments and comparative analysis demonstrate that our method outperforms state-of-the-art methods.
first_indexed 2024-03-09T18:40:27Z
format Article
id doaj.art-c9b85f1343f84bd0bf7a8f2dcb80ea6c
institution Directory Open Access Journal
issn 1424-8220
language English
last_indexed 2024-03-09T18:40:27Z
publishDate 2022-10-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj.art-c9b85f1343f84bd0bf7a8f2dcb80ea6c
2023-11-24T06:45:28Z
eng
MDPI AG
Sensors, vol. 22, no. 21, art. 8275 (2022-10-01)
ISSN 1424-8220
DOI 10.3390/s22218275
A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning
Xiaoyu Teng, Xiaolin Gui, Pan Xu, Jianglei Tong, Jian An: Department of Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
Yang Liu: Medical College, Northwest Minzu University, Lanzhou 730030, China
Huilan Jiang: ONYCOM Co., Ltd., Seoul 04519, Korea
https://www.mdpi.com/1424-8220/22/21/8275
video summarization
spatial–temporal features
cross-attention
title A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning
topic video summarization
spatial–temporal features
cross-attention
url https://www.mdpi.com/1424-8220/22/21/8275