Feature Fusion Based on Main-Auxiliary Network for Speech Emotion Recognition


Bibliografische gegevens
Main authors: Desheng HU, Xueying ZHANG, Jing ZHANG, Baoyun LI
Format: Article
Language: English
Published in: Editorial Office of Journal of Taiyuan University of Technology, 2021-09-01
Series: Taiyuan Ligong Daxue Xuebao
Online access: https://tyutjournal.tyut.edu.cn/englishpaper/show-331.html
Description
Abstract: Speech emotion recognition is an important research direction in human-computer interaction, and effective feature extraction and fusion are key to improving recognition accuracy. This paper proposes a speech emotion recognition algorithm that fuses deep features through a main-auxiliary network. First, segment-level features are fed into a BLSTM network with attention as the main network; the attention mechanism focuses on the emotionally salient information in the speech signal. Then, Mel-spectrogram features are fed into a Convolutional Neural Network with Global Average Pooling (GAP) as the auxiliary network; GAP reduces the overfitting introduced by fully connected layers. Finally, the two networks are combined in main-auxiliary form to address the unsatisfactory recognition results caused by directly fusing heterogeneous features. Comparative experiments with four models on the IEMOCAP dataset show that the weighted accuracy (WA) and unweighted accuracy (UA) of the main-auxiliary deep feature fusion improve to varying degrees.
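The fusion scheme the abstract describes can be illustrated with a minimal NumPy sketch: attention-weighted pooling over BLSTM hidden states as the main branch, global average pooling over CNN feature maps as the auxiliary branch, and concatenation before classification. All shapes, variable names, and the random weights below are illustrative assumptions, not the paper's actual trained model or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical dimensions (not taken from the paper) ---
T, H = 50, 128       # time steps; BLSTM hidden size (bidirectional -> 2H)
C, FH, FW = 64, 8, 8 # CNN channels and spatial size of the last feature map
N_CLASSES = 4        # e.g. four IEMOCAP emotion categories

# Main branch: BLSTM hidden states over segment-level features (stand-in values)
h = rng.standard_normal((T, 2 * H))

# Attention pooling: a scalar score per frame, softmax-normalized over time
w_att = rng.standard_normal(2 * H)
scores = h @ w_att
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
main_vec = alpha @ h                   # (2H,) attention-weighted summary

# Auxiliary branch: CNN feature maps from the Mel spectrogram (stand-in values)
feat_maps = rng.standard_normal((C, FH, FW))
aux_vec = feat_maps.mean(axis=(1, 2))  # GAP: one value per channel -> (C,)

# Main-auxiliary fusion: concatenate the deep features, then classify
fused = np.concatenate([main_vec, aux_vec])    # (2H + C,)
w_out = rng.standard_normal((N_CLASSES, fused.size))
logits = w_out @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(fused.shape, probs.shape)
```

GAP here replaces a fully connected layer with a single mean per channel, which is the overfitting-reduction point the abstract makes: the classifier sees 2H + C fused values rather than the full flattened feature map.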
ISSN:1007-9432