Abstract: Emotion recognition models often suffer from low recognition accuracy caused by interference such as data redundancy and irrelevant features. In this paper, we propose a speech emotion recognition (SER) method based on an attentional convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-GRU) that fuses visual information. First, log-mel spectrograms are fed into a ResNet-based attentional convolutional neural network (RACNN), pretrained to extract speech features. Second, static facial appearance features extracted by a CNN are fused with the speech features using a deep Bi-GRU to obtain speech-appearance features. A series of gated recurrent units with attention mechanisms (AGRUs) is used to extract facial geometric features. Hybrid features are then obtained by combining the fused speech-appearance features with the facial geometric features, and kernel linear discriminant analysis (KLDA) is applied to discriminate among emotion classes. The proposed method achieves accuracies of 87.92% and 89.65% on the RAVDESS and eNTERFACE'05 emotion databases, respectively. The experimental results demonstrate that the method effectively improves the accuracy and robustness of SER.