Improving Video Vision Transformer for Deepfake Video Detection Using Facial Landmark, Depthwise Separable Convolution and Self Attention

This paper presents our research on video deepfake detection. We built a system that classifies a video as deepfake or real. Deepfake detection algorithms still struggle to reach sufficient accuracy, especially on challenging deepfake datasets. O...


Bibliographic Details
Main Authors:	Kurniawan Nur Ramadhani, Rinaldi Munir, Nugraha Priya Utama
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Deepfake detection; facial landmark; depthwise separable convolution; convolution block attention module; video vision transformer
Online Access:	https://ieeexplore.ieee.org/document/10388363/
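
The title and subject terms name depthwise separable convolution (DSC) as one building block of the system. As a rough illustration of why such a block is lighter than a standard convolution, the sketch below compares weight counts for the two. The layer sizes (64 input channels, 128 output channels, a 3 x 3 kernel) and the function names are hypothetical, chosen only for illustration; they are not taken from the article.

```python
def standard_conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise k x k conv plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, not values reported in the article.
c_in, c_out, k = 64, 128, 3

std = standard_conv_params(c_in, c_out, k)
dsc = depthwise_separable_params(c_in, c_out, k)

print("standard conv weights:  ", std)
print("depthwise separable:    ", dsc)
print("ratio (dsc / standard): ", round(dsc / std, 4))
```

For any such layer the ratio equals 1/c_out + 1/k^2, so with a 3 x 3 kernel the separable form needs a little over one ninth of the weights.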

collection	DOAJ
description	This paper presents our research on video deepfake detection. We built a system that classifies a video as deepfake or real. Deepfake detection algorithms still struggle to reach sufficient accuracy, especially on challenging deepfake datasets. Our system uses spatiotemporal features extracted with a Video Vision Transformer (ViViT). The main contribution of our research is a deepfake detection system that is based on the ViViT architecture and takes landmark area images as input. The system extracts its feature from a number of spatial features. Each spatial feature is extracted from a tubelet using a Depthwise Separable Convolution (DSC) block combined with a Convolution Block Attention Module (CBAM). A tubelet is a representation of a facial landmark area extracted from the input video. We use 25 facial landmark areas per input video. Our experiments use the Celeb-DF version 2 dataset because it is considered a challenging deepfake dataset. After augmenting the dataset, we obtained 8335 videos for the training set, 390 for the validation set, and 1123 for the testing set. We trained the system with the Adam optimizer, a learning rate of 10^-4, and 100 epochs. The experiment yielded an accuracy of 87.18% and an F1 score of 92.52%. We also conducted an ablation study to show the effect of each part of the model on overall system performance. The results show that, with landmark area images as input, our ViViT-based deepfake detection system performs well in detecting deepfake videos.
first_indexed	2024-03-08T12:09:33Z
format	Article
id	doaj.art-4b28f1b0d3694d048f51c9723456b820
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-03-08T12:09:33Z
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-4b28f1b0d3694d048f51c9723456b820; 2024-01-23T00:05:51Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12; pp. 8932-8939; DOI 10.1109/ACCESS.2024.3352890; IEEE document 10388363; Kurniawan Nur Ramadhani (https://orcid.org/0000-0002-5126-8213), Rinaldi Munir, Nugraha Priya Utama; Bandung Institute of Technology, Bandung, Indonesia
title	Improving Video Vision Transformer for Deepfake Video Detection Using Facial Landmark, Depthwise Separable Convolution and Self Attention


Similar Items