Utterance Copy in Formant-based Speech Synthesizers Using LSTM Neural Networks


Utterance copy, also known as speech imitation, is the task of estimating the parameters of an input (target) speech signal in order to artificially reconstruct, at the output, another signal with the same properties. This is a difficult inverse problem: the input-output relationship is often non-linear, and many parameters must be estimated and adjusted. This work describes the development of an application that uses a long short-term memory (LSTM) neural network to learn how to estimate the input parameters of the formant-based Klatt speech synthesizer. Formant-based synthesizers do not reach state-of-the-art performance in text-to-speech (TTS) applications, but they remain an important tool for linguistic studies due to the high interpretability of their input parameters. The proposed system was compared to the WinSnoori baseline software on both artificially-produced target utterances, generated by the DECtalk TTS system, and natural ones. Results show that our system outperforms the baseline for synthetic voices on the PESQ, SNR, RMSE and LSD metrics. For natural voices, the experiments indicate the need for an architecture that does not depend on labeled data, such as reinforcement learning.
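The core idea, mapping a sequence of acoustic frames to a sequence of synthesizer control parameters with an LSTM, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimension, hidden size, number of Klatt parameters, and the random (untrained) weights are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical dimensions: 13 acoustic features per frame in, 48
# Klatt-style control parameters out (e.g. formant frequencies,
# bandwidths, voicing amplitudes). Real values would differ.
N_IN, N_HID, N_OUT = 13, 32, 48

rng = np.random.default_rng(0)

# One LSTM layer with a linear readout; weights here are random
# placeholders, whereas the paper's network is trained on labeled data.
W = rng.standard_normal((4 * N_HID, N_IN + N_HID)) * 0.1  # gates stacked
b = np.zeros(4 * N_HID)
W_out = rng.standard_normal((N_OUT, N_HID)) * 0.1
b_out = np.zeros(N_OUT)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_parameters(frames):
    """Map a (T, N_IN) sequence of acoustic frames to a (T, N_OUT)
    sequence of per-frame synthesizer parameter estimates."""
    h = np.zeros(N_HID)
    c = np.zeros(N_HID)
    outputs = []
    for x in frames:
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = np.split(z, 4)          # input/forget/cell/output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(W_out @ h + b_out)
    return np.stack(outputs)

# 50 frames of dummy features stand in for a real analysis front end.
params = estimate_parameters(rng.standard_normal((50, N_IN)))
print(params.shape)  # (50, 48)
```

The per-frame outputs would then drive the Klatt synthesizer to resynthesize the target utterance, and training would minimize a loss between estimated and reference parameter trajectories.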

In Proceedings of the 8th Brazilian Conference on Intelligent Systems (2019)