Abstract: In this paper, we propose MSAHP, a neural network architecture for hand pose estimation that substantially improves PCK (percentage of correct keypoints) by fusing a self-attention module into a CNN (convolutional neural network). The network is built on a ResNet (residual neural network) backbone and concatenates discriminative features from feature maps at multiple scales; a multi-head self-attention module then focuses on the salient regions of the feature maps. In recent years, self-attention mechanisms have been widely applied in NLP and speech recognition, where they have greatly improved key metrics. In computer vision, however, and in hand pose estimation in particular, we did not find such an application. Experiments on hand pose estimation data show that MSAHP achieves a higher PCK than existing state-of-the-art hand pose estimation methods. Specifically, the proposed method achieves a PCK score of 93.68% on our mixed test dataset.
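For readers unfamiliar with the metric, PCK can be sketched as the fraction of predicted keypoints that fall within a distance threshold of their ground-truth positions. The abstract does not state the threshold or how it is normalized, so the function name `pck`, the `threshold` parameter, and the sample coordinates below are illustrative assumptions rather than the evaluation protocol used for MSAHP.

```python
import math


def pck(pred, gt, threshold):
    """Fraction of keypoints predicted within `threshold` of ground truth.

    pred, gt: equal-length sequences of (x, y) keypoint coordinates.
    threshold: maximum Euclidean distance counted as correct. This is an
    assumed, unnormalized pixel threshold; the abstract does not specify one.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Count keypoints whose prediction error is within the threshold.
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(pred)


# Hypothetical example: four keypoints with errors of 1, 3, 0, and 10 pixels.
gt_points = [(10, 10), (20, 20), (30, 30), (40, 40)]
pred_points = [(11, 10), (20, 23), (30, 30), (50, 40)]
score = pck(pred_points, gt_points, threshold=5.0)
print(f"PCK@5px = {score:.2%}")
```

In practice the threshold is often expressed relative to a reference length such as the hand bounding-box size, in which case each distance would be divided by that length before the comparison.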