Summary: Speech emotion recognition (SER) refers to the use of machines to recognize the emotions of a speaker from speech. SER is an important part of human-computer interaction (HCI), but many problems remain in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion, based on the multi-head attention mechanism, to improve the accuracy of SER. We implement an attention-based convolutional neural network (ACNN) model and conduct experiments on the interactive emotional dyadic motion capture (IEMOCAP) dataset. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this dataset (76.4% WA and 70.1% UA), we achieve an absolute UA improvement of about 6% while achieving a similar WA. Furthermore, we conduct empirical experiments by injecting the speech data with 50 types of common noise. We vary the noise intensity, time-shift the noises, and mix different noise types to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work will also help researchers and engineers properly augment their training data with speech containing appropriate types of noise, alleviating the problem of insufficient high-quality data.