A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction

Video prediction is an essential vision task due to its wide applications in real-world scenarios. However, it remains challenging due to the inherent uncertainty and complex spatiotemporal dynamics of video content. Several state-of-the-art deep learning methods have achieved superior video prediction accuracy at the expense of huge computational cost.

Bibliographic Details
Main Authors: Mareeta Mathai, Ying Liu, Nam Ling
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: 3D separable convolution; deep learning; depthwise convolution; LSTM; pointwise convolution; self-attention
Online Access: https://ieeexplore.ieee.org/document/10464302/
_version_ 1797243290688421888
author Mareeta Mathai
Ying Liu
Nam Ling
author_facet Mareeta Mathai
Ying Liu
Nam Ling
author_sort Mareeta Mathai
collection DOAJ
description Video prediction is an essential vision task due to its wide applications in real-world scenarios. However, it remains challenging due to the inherent uncertainty and complex spatiotemporal dynamics of video content. Several state-of-the-art deep learning methods have achieved superior video prediction accuracy at the expense of huge computational cost. Hence, they are not suitable for devices with limited memory and computational resources. In light of Green Artificial Intelligence (AI), more environmentally friendly deep learning solutions are desired to tackle the problem of large models and high computational cost. In this work, we propose a novel video prediction network, 3DTransLSTM, which adopts a hybrid transformer-long short-term memory (LSTM) structure to inherit the merits of both self-attention and recurrence. Three-dimensional (3D) depthwise separable convolutions are used in this hybrid structure to extract spatiotemporal features while enhancing model efficiency. We conducted experimental studies on four popular video prediction datasets. Compared to existing methods, our proposed 3DTransLSTM achieved competitive frame prediction accuracy with significantly reduced model size, trainable parameters, and computational complexity. Moreover, we demonstrate the generalization ability of the proposed model by testing it on a dataset completely unseen during training.
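The efficiency claim in the abstract rests on replacing standard 3D convolutions with depthwise separable ones (a depthwise spatial kernel followed by a pointwise channel-mixing kernel). A minimal sketch of the parameter-count arithmetic behind that trade-off is below; the channel and kernel sizes are illustrative assumptions, not values taken from the paper.

```python
# Parameter-count comparison: standard 3D convolution vs. 3D depthwise
# separable convolution (depthwise + pointwise), the efficiency device
# the abstract describes. Bias terms are omitted for simplicity.

def conv3d_params(c_in: int, c_out: int, k: int) -> int:
    # A standard 3D conv mixes channels and space in one k*k*k kernel
    # per (input channel, output channel) pair.
    return c_in * c_out * k ** 3

def separable_conv3d_params(c_in: int, c_out: int, k: int) -> int:
    # Depthwise step: one k*k*k spatial kernel per input channel.
    depthwise = c_in * k ** 3
    # Pointwise step: a 1x1x1 conv that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel.
standard = conv3d_params(64, 64, 3)            # 64*64*27 = 110592
separable = separable_conv3d_params(64, 64, 3)  # 64*27 + 64*64 = 5824
print(standard, separable, round(standard / separable, 1))
```

For this hypothetical layer the separable form needs roughly 19x fewer weights, which is the kind of reduction that makes the reported drop in model size and complexity plausible.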
first_indexed 2024-04-24T18:52:46Z
format Article
id doaj.art-5b081dbb36034408bb4582646b8367f2
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-24T18:52:46Z
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-5b081dbb36034408bb4582646b8367f2 2024-03-26T17:47:52Z
Language: English; Publisher: IEEE; Series: IEEE Access; ISSN: 2169-3536; Published: 2024-01-01; Volume 12, pp. 39589-39602; DOI: 10.1109/ACCESS.2024.3375365; Document: 10464302
Title: A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
Authors: Mareeta Mathai (https://orcid.org/0009-0002-4488-5464), Ying Liu (https://orcid.org/0000-0003-3380-4243), Nam Ling (https://orcid.org/0000-0002-5741-7937), all with the Department of Computer Science and Engineering, Santa Clara University, Santa Clara, CA, USA
Abstract: Video prediction is an essential vision task due to its wide applications in real-world scenarios. However, it remains challenging due to the inherent uncertainty and complex spatiotemporal dynamics of video content. Several state-of-the-art deep learning methods have achieved superior video prediction accuracy at the expense of huge computational cost. Hence, they are not suitable for devices with limited memory and computational resources. In light of Green Artificial Intelligence (AI), more environmentally friendly deep learning solutions are desired to tackle the problem of large models and high computational cost. In this work, we propose a novel video prediction network, 3DTransLSTM, which adopts a hybrid transformer-long short-term memory (LSTM) structure to inherit the merits of both self-attention and recurrence. Three-dimensional (3D) depthwise separable convolutions are used in this hybrid structure to extract spatiotemporal features while enhancing model efficiency. We conducted experimental studies on four popular video prediction datasets. Compared to existing methods, our proposed 3DTransLSTM achieved competitive frame prediction accuracy with significantly reduced model size, trainable parameters, and computational complexity. Moreover, we demonstrate the generalization ability of the proposed model by testing it on a dataset completely unseen during training.
Online Access: https://ieeexplore.ieee.org/document/10464302/
Keywords: 3D separable convolution; deep learning; depthwise convolution; LSTM; pointwise convolution; self-attention
spellingShingle Mareeta Mathai
Ying Liu
Nam Ling
A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
IEEE Access
3D separable convolution
deep learning
depthwise convolution
LSTM
pointwise convolution
self-attention
title A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
title_full A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
title_fullStr A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
title_full_unstemmed A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
title_short A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
title_sort hybrid transformer lstm model with 3d separable convolution for video prediction
topic 3D separable convolution
deep learning
depthwise convolution
LSTM
pointwise convolution
self-attention
url https://ieeexplore.ieee.org/document/10464302/
work_keys_str_mv AT mareetamathai ahybridtransformerlstmmodelwith3dseparableconvolutionforvideoprediction
AT yingliu ahybridtransformerlstmmodelwith3dseparableconvolutionforvideoprediction
AT namling ahybridtransformerlstmmodelwith3dseparableconvolutionforvideoprediction
AT mareetamathai hybridtransformerlstmmodelwith3dseparableconvolutionforvideoprediction
AT yingliu hybridtransformerlstmmodelwith3dseparableconvolutionforvideoprediction
AT namling hybridtransformerlstmmodelwith3dseparableconvolutionforvideoprediction