Self-supervised video representation learning by uncovering spatio-temporal statistics

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to yield these statistical summaries given the video frames as input. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field and needs only impressions of rough spatial locations to understand visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D, and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
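To make the pretext labels concrete, the sketch below shows one plausible way to compute the motion-based summary (block with the largest motion and its dominant direction). It is a minimal, hypothetical reconstruction, assuming OpenCV's Farneback optical flow, a uniform 4x4 block grid, and 8 direction bins; the helper `motion_statistics` and these choices are illustrative, not the authors' exact recipe (their implementation is at the linked repository).

```python
import cv2
import numpy as np

def motion_statistics(frames, grid=(4, 4), n_bins=8):
    """Toy motion-based pretext labels: find the grid block with the
    largest average motion magnitude over the clip, and the dominant
    (quantized) motion direction within that block.
    Illustrative sketch only, not the paper's exact procedure."""
    h, w = frames[0].shape[:2]
    mag_sum = np.zeros(grid)                # accumulated motion per block
    ang_hist = np.zeros(grid + (n_bins,))   # direction histogram per block
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        for i in range(grid[0]):
            for j in range(grid[1]):
                ys = slice(i * h // grid[0], (i + 1) * h // grid[0])
                xs = slice(j * w // grid[1], (j + 1) * w // grid[1])
                block_mag = mag[ys, xs]
                mag_sum[i, j] += block_mag.mean()
                # magnitude-weighted histogram of quantized flow directions
                bins = (ang[ys, xs] / (2 * np.pi / n_bins)).astype(int) % n_bins
                for b in range(n_bins):
                    ang_hist[i, j, b] += block_mag[bins == b].sum()
        prev = gray
    loc = np.unravel_index(mag_sum.argmax(), grid)  # block with largest motion
    direction = int(ang_hist[loc].argmax())         # dominant direction bin
    return loc, direction
```

The resulting (location, direction) pair would serve as the prediction target for the 3D backbone; the coarse block index here plays the role of the paper's spatial partitioning patterns, which replace exact Cartesian coordinates with rough locations. The color-diversity summary described in the abstract could be derived analogously from per-block color statistics along the temporal axis.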

Bibliographic Details
Main Authors: Wang, J, Jiao, J, Bao, L, He, S, Liu, W, Liu, YH
Format: Journal article
Language: English
Published: IEEE, 2021
Collection: OXFORD
ID: oxford-uuid:bce63cd8-4d01-41b6-ae2c-2d3752501966
Institution: University of Oxford