Abstract: Depression is becoming a social problem as the number of sufferers steadily increases. To address this, this paper proposes an attention-based multimodal depression detection model that simultaneously uses voice and text data obtained from users. The proposed model consists of a Bidirectional Encoder Representations from Transformers-Convolutional Neural Network (BERT-CNN) for natural language analysis, a CNN-Bidirectional Long Short-Term Memory (CNN-BiLSTM) network for voice signal processing, and a multimodal analysis and fusion model for depression detection. The experiments in this paper are conducted on the DAIC-WOZ dataset, a corpus of clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. During preprocessing, the voice data were segmented into 4-second clips and the number of mel filters was set to 128. For the text data, we used the subjects' interview transcripts and derived embedding vectors using a Transformers tokenizer. The proposed BERT-CNN and CNN-BiLSTM models were applied to their respective modalities and their outputs were fused to classify depression. Experiments compared accuracy and loss between the multimodal setting and the single-modality settings, confirming that the multimodal approach improves on the lower accuracy of the single-modality models.
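A minimal sketch of the preprocessing stage described above, assuming librosa for the 128-band mel spectrogram and a Hugging Face Transformers tokenizer; the sampling rate, the `bert-base-uncased` checkpoint, and the token sequence length are illustrative assumptions, not details confirmed by the paper.

```python
import librosa
import numpy as np
from transformers import AutoTokenizer

SAMPLE_RATE = 16000   # assumed sampling rate; not specified in the abstract
CLIP_SECONDS = 4      # 4-second voice segments, as stated in the abstract
N_MELS = 128          # 128 mel filters, as stated in the abstract


def preprocess_voice(wav_path: str) -> np.ndarray:
    """Load audio, pad/trim to 4 s, and compute a 128-band log-mel spectrogram."""
    signal, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    target_len = SAMPLE_RATE * CLIP_SECONDS
    signal = librosa.util.fix_length(signal, size=target_len)  # pad or truncate to 4 s
    mel = librosa.feature.melspectrogram(y=signal, sr=SAMPLE_RATE, n_mels=N_MELS)
    return librosa.power_to_db(mel)  # shape: (128, time_frames), input to CNN-BiLSTM


def preprocess_text(transcript: str):
    """Tokenize an interview transcript into BERT input IDs for the BERT-CNN branch."""
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
    return tokenizer(transcript, truncation=True, padding="max_length",
                     max_length=128, return_tensors="pt")  # max_length is an assumption
```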