Sound Can Help Us See More Clearly

In the field of video action classification, existing network frameworks often use only video frames as input. When the object involved in an action does not appear prominently in the frame, such networks cannot classify it accurately. We introduce a new neural network structure that uses sound to assist with such tasks: the raw sound wave is converted into a sound texture that serves as the network's input. Furthermore, to exploit the rich multimodal information (images and sound) in video, we designed a two-stream framework. In this work, we hypothesize that sound data can help solve action recognition tasks. To demonstrate this, we designed a neural network based on sound texture to perform video action classification, then fused it with a deep neural network that operates on continuous video frames, yielding a two-stream network called A-IN. Finally, on the Kinetics dataset, we compared the proposed A-IN with an image-only network. The experimental results show that the recognition accuracy of the two-stream model using sound features is 7.6% higher than that of the network using video frames alone, demonstrating that making full use of the rich information in video can improve classification performance.
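
The abstract describes two components: converting the raw sound wave into a sound texture that an audio network can consume, and fusing that audio stream with a frame-based image stream into the network the authors call A-IN. The sketch below illustrates this general two-stream pattern in PyTorch under stated assumptions: the log-mel spectrogram stand-in for the paper's sound texture, every name (waveform_to_texture, AudioStream, ImageStream, TwoStream), the tiny backbones, and the score-averaging late fusion are illustrative guesses, not the authors' actual architecture.

```python
# Minimal sketch of the two-stream idea from the abstract.
# NOTE: all names and layer choices here are assumptions for
# illustration; this is not the paper's A-IN implementation.
import torch
import torch.nn as nn
import torchaudio

def waveform_to_texture(waveform, sample_rate=16000):
    """Turn a raw mono waveform into a 2D time-frequency 'texture'.
    The paper computes a sound texture; a log-mel spectrogram is
    used here only as a simple stand-in for that representation."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, n_mels=64)(waveform)
    return torch.log(mel + 1e-6)  # shape (1, 64, time), image-like

class AudioStream(nn.Module):
    """Small CNN over the sound texture, one logit per action class."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, num_classes)
    def forward(self, x):                     # x: (B, 1, 64, T)
        return self.fc(self.features(x).flatten(1))

class ImageStream(nn.Module):
    """Placeholder frame branch: averages per-frame CNN features.
    The paper uses a deep network over continuous frames."""
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, num_classes)
    def forward(self, frames):                # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).flatten(1)
        return self.fc(feats.view(b, t, -1).mean(1))

class TwoStream(nn.Module):
    """Late fusion of the two streams by averaging class scores
    (the paper's actual fusion scheme may differ)."""
    def __init__(self, num_classes):
        super().__init__()
        self.audio = AudioStream(num_classes)
        self.image = ImageStream(num_classes)
    def forward(self, frames, texture):
        return 0.5 * (self.image(frames) + self.audio(texture))

# Toy usage: two 8-frame clips with one second of shared mono audio.
frames = torch.randn(2, 8, 3, 112, 112)
wave = torch.randn(1, 16000)
texture = waveform_to_texture(wave).unsqueeze(0).expand(2, -1, -1, -1)
model = TwoStream(num_classes=400)  # e.g. Kinetics-400 label set
scores = model(frames, texture)     # (2, 400) fused class logits
```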

Bibliographic Details
Main Authors: Yongsheng Li, Tengfei Tu, Hua Zhang, Jishuai Li, Zhengping Jin, Qiaoyan Wen
Format: Article
Language: English
Published: MDPI AG 2022-01-01
Series: Sensors
Subjects: sound texture; two-stream network; computer vision
Online Access:https://www.mdpi.com/1424-8220/22/2/599
Citation: Sensors 2022, 22(2), 599
DOI: 10.3390/s22020599
ISSN: 1424-8220
Author Affiliations: all six authors are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China