Abstract: Speech emotion recognition (SER) is extremely challenging due to the problems of vanishing or exploding gradients and weak spatiotemporal correlations. To address these issues, a new approach, the 3D attentional convolutional recurrent neural network based on residual networks (Res3DACRNN), is proposed to learn deep emotional features. The Res3DCNN component extracts deep multiscale spectral-temporal features of emotional speech from spectrograms. The introduction of a residual network compensates for the features lost by traditional CNNs during convolution and prevents gradient vanishing or explosion. An attention-based recurrent neural network (ARNN) then extracts the long-term dependencies of these features, alleviating the weak spatiotemporal correlation problem. To reduce computational complexity, this paper improves the forget gate of the LSTM and proposes a novel post-forgetting gate structure. Finally, a softmax layer is used for emotion classification. Experimental results on the EMO-DB and IEMOCAP emotional corpora show that the proposed model significantly outperforms current mainstream deep learning methods.
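The abstract only names the pipeline stages (residual 3D convolutions over spectrogram cubes, an attention-weighted recurrent layer, and a softmax classifier), so the following is a minimal PyTorch sketch of that flow under assumed layer sizes. The channel counts, kernel sizes, and the use of a standard bidirectional LSTM (rather than the paper's modified post-forgetting-gate LSTM) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """3D convolutional block with a skip connection (assumed structure)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(out_ch)
        # 1x1x1 projection so the skip path matches the output channel count
        self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.proj(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # residual addition is what mitigates vanishing/exploding gradients
        return self.relu(out + identity)


class Res3DACRNN(nn.Module):
    """Sketch: Res3DCNN features -> attention-weighted BiLSTM -> classifier logits."""
    def __init__(self, n_classes=4, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            Residual3DBlock(1, 32),
            nn.MaxPool3d((1, 2, 2)),
            Residual3DBlock(32, 64),
            nn.AdaptiveAvgPool3d((None, 4, 4)),   # keep the time axis, pool freq/context
        )
        self.rnn = nn.LSTM(64 * 4 * 4, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)      # one attention score per time step
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, 1, time, freq, context) spectrogram cubes
        feats = self.cnn(x)                                    # (B, C, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        seq, _ = self.rnn(feats)                               # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(seq), dim=1)         # temporal attention weights
        pooled = (weights * seq).sum(dim=1)                    # attention-weighted utterance summary
        return self.classifier(pooled)                         # logits; softmax applied in the loss


if __name__ == "__main__":
    model = Res3DACRNN(n_classes=4)
    dummy = torch.randn(2, 1, 20, 64, 8)   # toy batch of spectrogram segments
    print(model(dummy).shape)               # torch.Size([2, 4])
```

In this sketch the attention layer replaces plain temporal pooling, so frames that carry stronger emotional cues receive larger weights before classification; the actual weighting scheme and the post-forgetting gate would follow the paper's definitions.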