Transformer Models and Convolutional Networks with Different Activation Functions for Swallow Classification Using Depth Video Data


Bibliographic Details
Main Authors: Derek Ka-Hei Lai, Ethan Shiu-Wang Cheng, Bryan Pak-Hei So, Ye-Jiao Mao, Sophia Ming-Yan Cheung, Daphne Sze Ki Cheung, Duo Wai-Chi Wong, James Chung-Wai Cheung
Affiliations: The Hong Kong Polytechnic University (Department of Biomedical Engineering; Department of Electronic and Information Engineering; School of Nursing) and The Hong Kong University of Science and Technology (Department of Mathematics), Hong Kong, China
Format: Article
Language: English
Published: MDPI AG, 2023-07-01
Series: Mathematics, vol. 11, no. 14, article 3081 (ISSN 2227-7390)
DOI: 10.3390/math11143081
Subjects: dysphagia; aspiration pneumonia; computer-aided screening; gerontechnology; deep learning
Online Access: https://www.mdpi.com/2227-7390/11/14/3081
Description: Dysphagia is a common geriatric syndrome that can induce serious complications and death. Standard diagnostics using the Videofluoroscopic Swallowing Study (VFSS) or Fiberoptic Evaluation of Swallowing (FEES) are expensive and expose patients to risks, while bedside screening is subjective and might lack reliability. An affordable and accessible instrumented screening method is therefore needed. This study evaluated the classification performance of Transformer models and convolutional networks in identifying swallowing and non-swallowing tasks from depth video data. Different activation functions (ReLU, LeakyReLU, GELU, ELU, SiLU, and GLU) were then evaluated on the best-performing model. Sixty-five healthy participants (n = 65) were invited to perform swallowing tasks (eating a cracker and drinking water) and non-swallowing tasks (taking a deep breath and pronouncing the vowels “/eɪ/”, “/iː/”, “/aɪ/”, “/oʊ/”, and “/uː/”). Swallowing and non-swallowing were classified by Transformer models (TimeSFormer and the Video Vision Transformer (ViViT)) and by convolutional neural networks (SlowFast, X3D, and R(2+1)D). In general, the convolutional neural networks outperformed the Transformer models. X3D was the best model, with good-to-excellent performance (F1-score: 0.920; adjusted F1-score: 0.885) in classifying swallowing and non-swallowing conditions. Moreover, X3D with its default activation function (ReLU) produced the best results, although LeakyReLU performed better on the deep-breathing and “/aɪ/”-pronunciation tasks. Future studies should consider collecting more data for pretraining, developing a hyperparameter-tuning strategy for activation functions, and addressing the high dimensionality of video data for Transformer models.
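The six activation functions compared in the study have standard closed-form definitions. As a quick reference, here is a minimal pure-Python sketch of each (illustrative only — the study applied them inside the X3D video network, not as standalone scalar functions, and the exact erf-based form of GELU is assumed here rather than whatever approximation the authors' framework used):

```python
import math

def relu(x):
    # max(0, x): zero output and zero gradient for negative inputs
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # small negative slope keeps a gradient flowing for x < 0
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    # smooth exponential saturation toward -alpha for large negative x
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):
    # Gaussian Error Linear Unit, exact form via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # x * sigmoid(x), also known as Swish
    return x / (1.0 + math.exp(-x))

def glu(v):
    # Gated Linear Unit: split the vector in half, gate the first
    # half elementwise with the sigmoid of the second half
    half = len(v) // 2
    a, b = v[:half], v[half:]
    return [ai / (1.0 + math.exp(-bi)) for ai, bi in zip(a, b)]
```

Note that, unlike the other five, GLU is not a pointwise function: it halves the channel dimension, which is one reason swapping it into an existing architecture requires adjusting layer widths.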