Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization
Bridging distant space-time interactions is important for high-quality video inpainting with large moving masks. Most existing techniques exploit patch similarities within the frames or leverage large-scale training data to fill the hole along the spatial and temporal dimensions. Recent works introd...
Main Authors: | Taewan Kim, Jinwoo Kim, Heeseok Oh, Jiwoo Kang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2024-01-01 |
Series: | IEEE Access |
Subjects: | Video inpainting, video completion, free-form inpainting, object removal, adversarial learning |
Online Access: | https://ieeexplore.ieee.org/document/10418237/ |
_version_ | 1797311675166097408 |
author | Taewan Kim Jinwoo Kim Heeseok Oh Jiwoo Kang |
author_facet | Taewan Kim Jinwoo Kim Heeseok Oh Jiwoo Kang |
author_sort | Taewan Kim |
collection | DOAJ |
description | Bridging distant space-time interactions is important for high-quality video inpainting with large moving masks. Most existing techniques exploit patch similarities within the frames or leverage large-scale training data to fill the hole along the spatial and temporal dimensions. Recent works introduce the promising Transformer architecture into deep video inpainting to escape the dominance of nearby interactions and achieve superior performance over their baselines. However, such methods still struggle to complete larger holes containing complicated scenes. To alleviate this issue, we first employ fast Fourier convolutions, which cover a frame-wide receptive field, for token representation. Then, the tokens pass through the separated spatio-temporal transformer to explicitly model the long-range context relations and simultaneously complete the missing regions in all input frames. By formulating video inpainting as a directionless sequence-to-sequence prediction task, our model fills in visually consistent content, even under conditions such as large missing areas or complex geometries. Furthermore, our spatio-temporal transformer iteratively fills the hole from the boundary, enabling it to exploit rich contextual information. We validate the superiority of the proposed model using standard stationary masks and more realistic moving object masks. Both qualitative and quantitative results show that our model compares favorably against state-of-the-art algorithms. |
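The fast Fourier tokenization the abstract describes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function names, weight shapes, and toy sizes are illustrative assumptions. The key idea it demonstrates is why a fast Fourier convolution has a frame-wide receptive field: its spectral path applies a real 2-D FFT, so a channel mix in the frequency domain couples every output position to every input position, and this global path is summed with an ordinary local (here, 1x1) path.

```python
import numpy as np

def spectral_transform(x, w_freq):
    """Frequency-domain path of a fast Fourier convolution (toy version).

    x:      (C, H, W) feature map
    w_freq: (C, C) per-frequency channel mix

    Because the mix happens after a real 2-D FFT over the spatial axes,
    every output pixel depends on every input pixel: a frame-wide
    receptive field in a single layer.
    """
    C, H, W = x.shape
    X = np.fft.rfft2(x, axes=(-2, -1))            # (C, H, W//2 + 1), complex
    X = np.tensordot(w_freq, X, axes=([1], [0]))  # mix channels per frequency
    return np.fft.irfft2(X, s=(H, W), axes=(-2, -1))

def ffc_token(x, w_freq, w_local):
    """Toy FFC block: local 1x1 path plus global spectral path (no nonlinearity)."""
    local = np.tensordot(w_local, x, axes=([1], [0]))  # per-pixel channel mix
    return local + spectral_transform(x, w_freq)

# With half-identity weights on both paths, the block reduces to identity,
# which makes the round-trip easy to check.
x = np.random.randn(4, 8, 8)
tok = ffc_token(x, np.eye(4) * 0.5, np.eye(4) * 0.5)
```

In the paper's pipeline these FFC outputs would serve as the tokens fed to the spatio-temporal transformer; here the sketch stops at producing one token map per frame.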
first_indexed | 2024-03-08T02:04:10Z |
format | Article |
id | doaj.art-c86117bafb7f4ad1919228ed6483d1a4 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-08T02:04:10Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-c86117bafb7f4ad1919228ed6483d1a42024-02-14T00:01:23ZengIEEEIEEE Access2169-35362024-01-0112217232173610.1109/ACCESS.2024.336128310418237Deep Transformer Based Video Inpainting Using Fast Fourier TokenizationTaewan Kim0https://orcid.org/0000-0003-3319-7797Jinwoo Kim1Heeseok Oh2https://orcid.org/0000-0002-0920-7281Jiwoo Kang3https://orcid.org/0000-0002-0920-7281Data Science Major, Dongduk Women’s University, Seoul, South KoreaDepartment of Electrical and Electronic Engineering, Yonsei University, Seoul, South KoreaDepartment of Applied AI, Hansung University, Seoul, South KoreaDivision of Artificial Intelligence Engineering, Sookmyung Women’s University, Seoul, South KoreaBridging distant space-time interactions is important for high-quality video inpainting with large moving masks. Most existing techniques exploit patch similarities within the frames or leverage large-scale training data to fill the hole along the spatial and temporal dimensions. Recent works introduce the promising Transformer architecture into deep video inpainting to escape the dominance of nearby interactions and achieve superior performance over their baselines. However, such methods still struggle to complete larger holes containing complicated scenes. To alleviate this issue, we first employ fast Fourier convolutions, which cover a frame-wide receptive field, for token representation. Then, the tokens pass through the separated spatio-temporal transformer to explicitly model the long-range context relations and simultaneously complete the missing regions in all input frames. By formulating video inpainting as a directionless sequence-to-sequence prediction task, our model fills in visually consistent content, even under conditions such as large missing areas or complex geometries. Furthermore, our spatio-temporal transformer iteratively fills the hole from the boundary, enabling it to exploit rich contextual information.
We validate the superiority of the proposed model using standard stationary masks and more realistic moving object masks. Both qualitative and quantitative results show that our model compares favorably against state-of-the-art algorithms.https://ieeexplore.ieee.org/document/10418237/Video inpaintingvideo completionfree-form inpaintingobject removaladversarial learning |
spellingShingle | Taewan Kim Jinwoo Kim Heeseok Oh Jiwoo Kang Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization IEEE Access Video inpainting video completion free-form inpainting object removal adversarial learning |
title | Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization |
title_full | Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization |
title_fullStr | Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization |
title_full_unstemmed | Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization |
title_short | Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization |
title_sort | deep transformer based video inpainting using fast fourier tokenization |
topic | Video inpainting video completion free-form inpainting object removal adversarial learning |
url | https://ieeexplore.ieee.org/document/10418237/ |
work_keys_str_mv | AT taewankim deeptransformerbasedvideoinpaintingusingfastfouriertokenization AT jinwookim deeptransformerbasedvideoinpaintingusingfastfouriertokenization AT heeseokoh deeptransformerbasedvideoinpaintingusingfastfouriertokenization AT jiwookang deeptransformerbasedvideoinpaintingusingfastfouriertokenization |