BroutonLab built a custom Text-to-Speech model trained on a combination of several open-source datasets, which required denoising and pre-processing of the speech samples. A custom TTS architecture was developed to support multi-speaker embeddings, and we fully trained the encoder, synthesizer, and vocoder models. The result is a real-time voice cloning implementation with high-quality speech and style transfer.
The client is a US-based startup that develops augmented and virtual reality applications for the healthcare industry, engaging medical staff and patients to convey information through lifelike 3D avatars.
The client wanted to integrate a 3D animated avatar inside their application that can synthesize speech based on an input audio clip. The avatar explains various medical procedures and terms to patients, so the generated voice must sound realistic, clearly articulated, and precise even with difficult medical terminology.
They reached out to BroutonLab to develop an MVP Text-to-Speech engine custom-made for their use case, and the development was carried out within time and budget constraints.
For voice synthesis, the biggest challenge is finding a large enough dataset. After that, it is important to choose an architecture that will deliver the best results with the available data.
To train the models, several open-source datasets were explored, pre-processed, and combined. Considerable denoising, cleaning, and data transformation went into making sure the training dataset was high quality.
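For illustration, pre-processing a single clip might look roughly like the sketch below. The library choices (librosa, noisereduce, soundfile), parameter values, and the `preprocess_clip` helper are assumptions made for this example, not the project's actual pipeline.

```python
import librosa
import noisereduce as nr
import soundfile as sf

def preprocess_clip(path, out_path, target_sr=22050, top_db=30):
    """Illustrative pre-processing of a raw speech clip: resampling,
    denoising, silence trimming, and peak normalization."""
    wav, sr = librosa.load(path, sr=target_sr)         # load and resample
    wav = nr.reduce_noise(y=wav, sr=sr)                # spectral-gating denoise
    wav, _ = librosa.effects.trim(wav, top_db=top_db)  # trim leading/trailing silence
    wav = wav / (abs(wav).max() + 1e-8) * 0.95         # peak-normalize
    sf.write(out_path, wav, sr)
```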
A voice encoder was used to extract the voice embeddings. The architecture was trained on a speaker verification task. The network consists of a stack of LSTM layers, each followed by a linear projection with a tanh activation. The final embedding is created by L2-normalizing the output of the top layer at the final frame.
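A minimal sketch of such a speaker encoder, assuming PyTorch, is shown below. For brevity, a single stacked `nn.LSTM` with one final projection stands in for per-layer projections, and all layer sizes are illustrative rather than the values used in the project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Sketch of an LSTM speaker encoder producing a fixed-size voice embedding."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256, num_layers=3):
        super().__init__()
        # Stack of LSTM layers over mel-spectrogram frames.
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=num_layers, batch_first=True)
        # Linear projection with a tanh activation on the LSTM output.
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):                     # mels: (batch, frames, n_mels)
        out, _ = self.lstm(mels)
        emb = torch.tanh(self.proj(out[:, -1]))  # output of the top layer at the final frame
        return F.normalize(emb, p=2, dim=1)      # L2-normalized embedding
```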
For the first text-to-speech model, an open-source, MIT-licensed TTS model based on diffusion probabilistic modeling was modified. It uses an encoder-decoder architecture in which a score-based decoder produces mel spectrograms by gradually denoising Gaussian noise parameterized by the encoder output, with text and audio aligned using Monotonic Alignment Search.
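Conceptually, sampling a spectrogram amounts to running a reverse diffusion loop; the rough Euler-style sketch below illustrates the idea. The `score_model` interface, noise schedule, and step count are placeholders for illustration, not the modified model's actual code.

```python
import torch

@torch.no_grad()
def sample_mel(score_model, mu, n_steps=50, beta_min=0.05, beta_max=20.0):
    """Euler sketch of score-based mel generation.

    `mu` is the aligned encoder output (batch, n_mels, frames) and
    `score_model(xt, mu, t)` is a hypothetical network that predicts the
    score of the noisy spectrogram `xt` at diffusion time `t`.
    """
    xt = mu + torch.randn_like(mu)           # start from the terminal noise distribution
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * h                      # integrate from t=1 down to t=0
        beta_t = beta_min + t * (beta_max - beta_min)
        t_batch = torch.full((xt.size(0),), t)
        score = score_model(xt, mu, t_batch)
        # One reverse-time Euler step: move xt toward a clean spectrogram.
        xt = xt - 0.5 * h * beta_t * (mu - xt - score)
    return xt                                # generated mel spectrogram
```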
For the vocoder, a GAN-based architecture consisting of a generator and multiple discriminators was used. The generator is a fully convolutional neural network: transposed convolutional layers upsample the input mel spectrogram, and each is followed by a Multi-Receptive Field (MRF) block, as sketched below.
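A rough PyTorch sketch of such a generator follows. Channel counts, kernel sizes, and upsample rates are illustrative, and the simplified `ResBlock` stands in for a full MRF block.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block standing in for an MRF block."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=d * (kernel_size - 1) // 2)
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(nn.functional.leaky_relu(x, 0.1))
        return x

class Generator(nn.Module):
    """Sketch of a fully convolutional mel-to-waveform generator."""
    def __init__(self, n_mels=80, base_channels=128, upsample_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, base_channels, 7, padding=3)
        ups, ch = [], base_channels
        for r in upsample_rates:
            # Transposed convolution upsamples the time axis by factor r,
            # then a residual (MRF-like) block refines the result.
            ups += [nn.ConvTranspose1d(ch, ch // 2, r * 2, stride=r, padding=r // 2),
                    ResBlock(ch // 2)]
            ch //= 2
        self.ups = nn.Sequential(*ups)
        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, mel):                 # mel: (batch, n_mels, frames)
        x = self.ups(self.pre(mel))
        return torch.tanh(self.post(x))     # raw waveform in [-1, 1]
```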
The GAN architecture uses two types of discriminators: the Multi-Period Discriminator (MPD), which operates on raw waveforms, and the Multi-Scale Discriminator (MSD), which operates on smoothed waveforms.
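The sketch below shows only how the two discriminator types view the waveform: each period sub-discriminator of the MPD sees a 2D reshaping of the raw signal, while the MSD sub-discriminators see progressively average-pooled versions. The convolutional sub-discriminators themselves are omitted, and the pooling parameters are illustrative.

```python
import torch.nn.functional as F

def mpd_view(wav, period):
    """Reshape a raw waveform (batch, 1, T) into (batch, 1, T // period, period),
    the 2D layout examined by one period sub-discriminator of an MPD."""
    b, c, t = wav.shape
    if t % period:                                           # pad so T divides evenly
        wav = F.pad(wav, (0, period - t % period), mode="reflect")
        t = wav.shape[-1]
    return wav.reshape(b, c, t // period, period)

def msd_views(wav, n_scales=3):
    """Produce the progressively smoothed (average-pooled) waveforms
    examined by the scale sub-discriminators of an MSD."""
    views = [wav]
    for _ in range(n_scales - 1):
        wav = F.avg_pool1d(wav, kernel_size=4, stride=2, padding=2)
        views.append(wav)
    return views
```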
The deliverable included a Flask server app containerized with Docker, exposing a simple REST API that manages access to the Text-to-Speech engine. The input is an audio sample with the target voice, along with the sentences to be synthesized in that voice. Users of the server app can choose between two synthesis models: the diffusion probabilistic Grad-TTS model and WaveGrad2.
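The API surface can be pictured roughly as follows; the route name, form fields, and `synthesize` placeholder are hypothetical and do not reflect the delivered app's actual interface.

```python
from flask import Flask, request, send_file

app = Flask(__name__)

def synthesize(voice_sample, sentences, model_name):
    """Placeholder for the TTS engine call: speaker encoding, spectrogram
    generation with the chosen model, and vocoding to a wav file."""
    raise NotImplementedError

@app.route("/synthesize", methods=["POST"])
def synthesize_route():
    voice_sample = request.files["voice_sample"]          # audio clip with the target voice
    sentences = request.form.getlist("sentences")         # text to synthesize
    model_name = request.form.get("model", "grad-tts")    # "grad-tts" or "wavegrad2"
    wav_path = synthesize(voice_sample, sentences, model_name)
    return send_file(wav_path, mimetype="audio/wav")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```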
The project was delivered within time and budget as an MVP, and it showcases the kind of results that can be achieved in a short time with limited resources. Future work will consist of acquiring custom datasets with high-quality speech to further improve the models.