Article Information

Title: Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding
Authors: Milan Sečujski; Darko Pekar; Siniša Suzić et al.
Journal: Journal of Universal Computer Science
Print ISSN: 0948-6968
Publication year: 2020
Volume: 26
Issue: 4
Pages: 434-453
Publisher: Graz University of Technology and Know-Center
Abstract: The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style, based on a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e., mapping of discrete variables into continuous vectors in a low-dimensional space, which has been shown to be a very successful universal deep learning technique. In this particular case, different speaker/style combinations are mapped into different points in a low-dimensional space, which enables the network to capture the similarities and differences between speakers and speaking styles more efficiently. The initial model from which speaker/style adaptation was carried out was a multi-speaker/multi-style model based on 8.5 hours of American English speech data, corresponding to 16 different speaker/style combinations. The results of the experiments show that both versions of the obtained system, one using 10 minutes and the other as little as 30 seconds of target data, outperform the state of the art in parametric speaker/style-dependent speech synthesis. This opens a wide range of applications of speaker/style-dependent speech synthesis based on small quantities of training data, in domains ranging from customer interaction in call centers to robot-assisted medical therapy.
Keywords: deep neural networks; embedding; speaker adaptation; text-to-speech synthesis
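
The embedding idea described in the abstract, mapping each discrete speaker/style combination to a point in a low-dimensional continuous space, can be illustrated with a minimal sketch. This is not the authors' implementation. The count of 16 combinations comes from the abstract; the embedding dimension, the random initialization, the function names, and the choice to concatenate the embedding with linguistic features are all assumptions made for illustration.

```python
import random

NUM_COMBINATIONS = 16  # speaker/style combinations in the initial model (from the abstract)
EMBEDDING_DIM = 4      # assumed; the abstract only says "low-dimensional"


def make_embedding_table(num_ids, dim, seed=0):
    """Return one small random vector per discrete ID.

    In a real model these vectors would be trainable parameters;
    here they are fixed random values for illustration only.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_ids)]


def embed(table, combo_id):
    """Embedding lookup: map a discrete speaker/style ID to its continuous vector."""
    if not 0 <= combo_id < len(table):
        raise IndexError(f"unknown speaker/style ID: {combo_id}")
    return table[combo_id]


def network_input(table, combo_id, linguistic_features):
    """Concatenate linguistic features with the speaker/style vector.

    Concatenation is an assumed way of feeding the embedding to the
    network; the abstract does not specify how the vector is used.
    """
    return list(linguistic_features) + embed(table, combo_id)


table = make_embedding_table(NUM_COMBINATIONS, EMBEDDING_DIM)
frame = network_input(table, combo_id=3, linguistic_features=[0.0, 1.0, 0.5])
print(len(frame))  # 3 linguistic features + 4 embedding dimensions = 7
```

Under these assumptions, adapting to a new speaker/style would amount to learning a new row of the table from the small amount of target data, which is one plausible reading of why so little data is needed.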