BroutonLab built a custom Text-to-Speech model trained on a combination of several open-source datasets, which required denoising and pre-processing of the speech samples. A custom TTS architecture was developed to support multi-speaker embeddings, and we fully trained the encoder, synthesizer, and vocoder models. The result is a real-time voice cloning implementation with high-quality speech and style transfer.
The client is a US-based startup that develops augmented and virtual reality applications for the healthcare industry, engaging medical staff and patients to convey information through lifelike 3D avatars.
The client wanted to integrate a 3D animated avatar inside their application that can synthesize speech based on an input audio clip. The avatar explains various medical procedures and terms to patients, so the generated voice must sound realistic, clearly articulated, and precise even with difficult medical terminology.
They reached out to BroutonLab to develop an MVP Text-to-Speech engine custom-made for their use case, and the development was carried out within time and budget constraints.
For voice synthesis, the biggest challenge is finding a large enough dataset. After that, it is important to choose an architecture that will deliver the best results with the available data.
To train the models, several open-source datasets were explored, pre-processed, and combined. Considerable denoising, cleaning, and data transformation went into making sure the training dataset was high quality.
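For illustration, pre-processing a single clip might look roughly like the sketch below. The library choices (librosa, noisereduce, soundfile), parameter values, and the `preprocess_clip` helper are assumptions made for this example, not the project's actual pipeline.

```python
import librosa
import noisereduce as nr
import soundfile as sf

def preprocess_clip(path, out_path, target_sr=22050, top_db=30):
    """Illustrative pre-processing of a raw speech clip: resampling,
    denoising, silence trimming, and peak normalization."""
    wav, sr = librosa.load(path, sr=target_sr)         # load and resample
    wav = nr.reduce_noise(y=wav, sr=sr)                # spectral-gating denoise
    wav, _ = librosa.effects.trim(wav, top_db=top_db)  # trim leading/trailing silence
    wav = wav / (abs(wav).max() + 1e-8) * 0.95         # peak-normalize
    sf.write(out_path, wav, sr)
```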
A voice encoder was used to extract the voice embeddings. The architecture was trained on a speaker verification task. The network consists of a stack of LSTM layers, each followed by a linear projection with a tanh activation. The final embedding is created by L2-normalizing the output of the top layer at the final frame.
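A minimal sketch of such a speaker encoder, assuming PyTorch, is shown below. For brevity, a single stacked `nn.LSTM` with one final projection stands in for per-layer projections, and all layer sizes are illustrative rather than the values used in the project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Sketch of an LSTM speaker encoder producing a fixed-size voice embedding."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256, num_layers=3):
        super().__init__()
        # Stack of LSTM layers over mel-spectrogram frames.
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=num_layers, batch_first=True)
        # Linear projection with a tanh activation on the LSTM output.
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):                     # mels: (batch, frames, n_mels)
        out, _ = self.lstm(mels)
        emb = torch.tanh(self.proj(out[:, -1]))  # output of the top layer at the final frame
        return F.normalize(emb, p=2, dim=1)      # L2-normalized embedding
```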
For the first text-to-speech model, an open-source, MIT-licensed TTS model based on diffusion probabilistic modeling was modified. It uses an encoder-decoder architecture in which a score-based decoder produces mel spectrograms by gradually denoising Gaussian noise parameterized by the encoder output, with text and audio aligned using Monotonic Alignment Search.
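Conceptually, sampling a spectrogram amounts to running a reverse diffusion loop; the rough Euler-style sketch below illustrates the idea. The `score_model` interface, noise schedule, and step count are placeholders for illustration, not the modified model's actual code.

```python
import torch

@torch.no_grad()
def sample_mel(score_model, mu, n_steps=50, beta_min=0.05, beta_max=20.0):
    """Euler sketch of score-based mel generation.

    `mu` is the aligned encoder output (batch, n_mels, frames) and
    `score_model(xt, mu, t)` is a hypothetical network that predicts the
    score of the noisy spectrogram `xt` at diffusion time `t`.
    """
    xt = mu + torch.randn_like(mu)           # start from the terminal noise distribution
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * h                      # integrate from t=1 down to t=0
        beta_t = beta_min + t * (beta_max - beta_min)
        t_batch = torch.full((xt.size(0),), t)
        score = score_model(xt, mu, t_batch)
        # One reverse-time Euler step: move xt toward a clean spectrogram.
        xt = xt - 0.5 * h * beta_t * (mu - xt - score)
    return xt                                # generated mel spectrogram
```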
For the vocoder, a GAN-based architecture consisting of a generator and multiple discriminators was used. The generator is a fully convolutional neural network: transposed convolutional layers upsample the input mel spectrogram, and each is followed by a Multi-Receptive Field (MRF) block, as sketched below.
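A rough PyTorch sketch of such a generator follows. Channel counts, kernel sizes, and upsample rates are illustrative, and the simplified `ResBlock` stands in for a full MRF block.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block standing in for an MRF block."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=d * (kernel_size - 1) // 2)
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(nn.functional.leaky_relu(x, 0.1))
        return x

class Generator(nn.Module):
    """Sketch of a fully convolutional mel-to-waveform generator."""
    def __init__(self, n_mels=80, base_channels=128, upsample_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, base_channels, 7, padding=3)
        ups, ch = [], base_channels
        for r in upsample_rates:
            # Transposed convolution upsamples the time axis by factor r,
            # then a residual (MRF-like) block refines the result.
            ups += [nn.ConvTranspose1d(ch, ch // 2, r * 2, stride=r, padding=r // 2),
                    ResBlock(ch // 2)]
            ch //= 2
        self.ups = nn.Sequential(*ups)
        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, mel):                 # mel: (batch, n_mels, frames)
        x = self.ups(self.pre(mel))
        return torch.tanh(self.post(x))     # raw waveform in [-1, 1]
```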
The GAN architecture uses two types of discriminators: the Multi-Period Discriminator (MPD), which operates on raw waveforms, and the Multi-Scale Discriminator (MSD), which operates on smoothed waveforms.
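The sketch below shows only how the two discriminator types view the waveform: each period sub-discriminator of the MPD sees a 2D reshaping of the raw signal, while the MSD sub-discriminators see progressively average-pooled versions. The convolutional sub-discriminators themselves are omitted, and the pooling parameters are illustrative.

```python
import torch.nn.functional as F

def mpd_view(wav, period):
    """Reshape a raw waveform (batch, 1, T) into (batch, 1, T // period, period),
    the 2D layout examined by one period sub-discriminator of an MPD."""
    b, c, t = wav.shape
    if t % period:                                           # pad so T divides evenly
        wav = F.pad(wav, (0, period - t % period), mode="reflect")
        t = wav.shape[-1]
    return wav.reshape(b, c, t // period, period)

def msd_views(wav, n_scales=3):
    """Produce the progressively smoothed (average-pooled) waveforms
    examined by the scale sub-discriminators of an MSD."""
    views = [wav]
    for _ in range(n_scales - 1):
        wav = F.avg_pool1d(wav, kernel_size=4, stride=2, padding=2)
        views.append(wav)
    return views
```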
The deliverable included a Flask server app containerized with Docker, exposing a simple REST API that manages access to the Text-to-Speech engine. The input is an audio sample with the target voice, along with the sentences to be synthesized in that voice. Users of the server app can choose between two synthesis models: the diffusion probabilistic Grad-TTS model and WaveGrad2.
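The API surface can be pictured roughly as follows; the route name, form fields, and `synthesize` placeholder are hypothetical and do not reflect the delivered app's actual interface.

```python
from flask import Flask, request, send_file

app = Flask(__name__)

def synthesize(voice_sample, sentences, model_name):
    """Placeholder for the TTS engine call: speaker encoding, spectrogram
    generation with the chosen model, and vocoding to a wav file."""
    raise NotImplementedError

@app.route("/synthesize", methods=["POST"])
def synthesize_route():
    voice_sample = request.files["voice_sample"]          # audio clip with the target voice
    sentences = request.form.getlist("sentences")         # text to synthesize
    model_name = request.form.get("model", "grad-tts")    # "grad-tts" or "wavegrad2"
    wav_path = synthesize(voice_sample, sentences, model_name)
    return send_file(wav_path, mimetype="audio/wav")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```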
The project was delivered within time and budget as an MVP, and it showcases the kind of results that can be achieved in a short time with limited resources. Future work will consist of acquiring custom datasets with high-quality speech to further improve the models.