Abstract: Traditional speaker recognition methods reduce the feature signal from high to low dimensionality, which often discards some speaker information and lowers the recognition rate. To address this problem, this paper proposes a model that combines a 3D convolutional neural network (3DCNN) with a long short-term memory (LSTM) network. First, the model takes fixed-step speech feature vectors as the 3DCNN input, converting text-independent speaker recognition into a "semi-text-dependent" mode; this largely preserves the speaker's speech features and thus enlarges the differences between the characteristics of different speakers. Second, the 3D convolution kernel designed in this paper extracts speaker personality characteristics along different dimensions to further distinguish speakers. The convolutional output is then fed, as a time series, into the LSTM network to strengthen the contextual links in the speaker's voice, and the classification result is finally labeled to realize a complete speaker recognition system. Experimental results show that, compared with traditional algorithms and popular embedding features, the proposed model improves the speaker recognition rate on the AISHELL-1 dataset for short utterances, and the system is more robust over time.
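To make the data flow of a 3DCNN-plus-LSTM pipeline concrete, the sketch below traces the tensor shapes from a fixed-step feature stack through one 3D convolution and into an LSTM-style sequence. All kernel sizes, strides, and feature dimensions here are illustrative assumptions for exposition, not the paper's actual configuration.

```python
# Shape arithmetic for a hypothetical 3DCNN -> LSTM speaker-recognition
# pipeline. Kernel sizes and input dimensions below are assumptions.

def conv3d_out(size, kernel, stride=1, pad=0):
    """Output length of one convolution dimension (standard formula)."""
    return (size + 2 * pad - kernel) // stride + 1

# Assumed input: a fixed-step stack of utterance segments, organized as
# (depth = stacked frames, height = frequency bins, width = time steps).
depth, height, width = 20, 80, 40
k_d, k_h, k_w = 3, 5, 5  # hypothetical 3D kernel

d = conv3d_out(depth, k_d)
h = conv3d_out(height, k_h)
w = conv3d_out(width, k_w)
print((d, h, w))  # spatio-temporal feature map after the 3D convolution

# The LSTM then consumes a sequence of `d` feature vectors of size h * w,
# preserving the contextual (temporal) links the abstract describes.
seq_len, feat_dim = d, h * w
print(seq_len, feat_dim)
```

The point of stacking frames into a third dimension before convolving is that the kernel sees several consecutive segments at once, so speaker-specific spectral patterns are captured jointly across time, frequency, and segment index before the LSTM models the longer-range context.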