Generate Text to Speech Online with Any Voice

Generative Adversarial Networks (GANs), which appeared in 2014, made it possible not only to perform image style transfer but also to generate text to speech with any voice.

How did this all start?

In 2014, Ian Goodfellow, then a PhD student at the Université de Montréal, and his colleagues designed a class of machine learning frameworks called GANs. A GAN trains neural networks through an adversarial process so that they generate new images and objects resembling the data used for training.

It is one of the most exciting artificial intelligence case studies. Here is an example of image style transfer done using a GAN architecture. The network extracts the painting style from a well-known Van Gogh painting and applies it to high-resolution images.


Examples of artwork generated with the Image Style Transfer algorithm

In this article, we will go through a text to speech demo using Yoda's voice as an example. We will walk you through a tutorial of voice generation and show you how to train a model to produce Yoda's voice from a given text:


broutonlab · Yoda text to speech


What's more, we will share a pipeline that lets you generate any voice, even your own, with just a few samples of the voice you want to produce.


Table of contents:


  • Exploring theory behind the scenes
  • Getting familiar with the solution
  • Practice generating Yoda's voice yourself


If you want to jump straight to the fun part, start from here


At BroutonLab Data Science Consulting, we used Real-Time-Voice-Cloning, an implementation of the paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS), paired with a vocoder that works in real time.

This research paper aims to build a TTS system that can generate natural speech mimicking a variety of speakers in a data-efficient manner by addressing a zero-shot learning setting. In this setting, a few seconds of untranscribed reference audio from a target speaker are enough to synthesize new speech in that speaker's voice, without updating any model parameters.

Exploring theory behind the scenes

But how can the underlying idea of Image Style Transfer be applied to sound?

There is a way of converting audio signals into an image-like 2-dimensional representation called a "spectrogram." It is the key to using algorithms designed for computer vision on audio-related tasks.

Spectrogram


Let's take a closer look at what a spectrogram is and its role in text to speech generation. Given a one-dimensional time-domain signal, we want to obtain a 2-dimensional time-frequency representation. To achieve that, our data scientists apply the Short-Time Fourier Transform to the audio signal with a window of a certain length and keep only the squared magnitude of the result.
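To make this step concrete, here is a minimal sketch of a squared-magnitude STFT. It uses only the Python standard library and a naive DFT so that it runs anywhere; the frame length, hop size, Hann window, and the function name `power_spectrogram` are our assumptions for illustration, not the settings used in the project.

```python
import cmath
import math


def power_spectrogram(signal, frame_len=64, hop=32):
    """Squared-magnitude STFT: one row per time frame, one column per frequency bin."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep the non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT of the windowed frame at frequency bin k.
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(abs(coeff) ** 2)
        frames.append(row)
    return frames


# A 1 kHz tone sampled at 8 kHz: its energy should concentrate in one bin.
sample_rate = 8000
frame_len = 64
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = power_spectrogram(tone, frame_len=frame_len, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
```

Each row of `spec` is one time frame and each column one frequency bin, which is exactly the image-like layout described above. In practice you would use an FFT-based routine from a numerical library rather than this naive loop.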

Illustration of how time and frequency correlate, from the MelNet paper page

To make spectrograms even more useful for text to speech generation, we converted each "pixel" (or magnitude value) to the decibel scale by taking the logarithm of each value.
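The decibel conversion can be sketched in a few lines. This is an illustration under our own assumptions: values are expressed relative to the loudest "pixel", and a small `floor` avoids taking the logarithm of zero. Neither choice is specified in this article.

```python
import math


def power_to_db(spectrogram, floor=1e-10):
    """Convert power values to decibels relative to the loudest value."""
    peak = max(max(row) for row in spectrogram)
    reference = max(peak, floor)
    return [
        [10.0 * math.log10(max(value, floor) / reference) for value in row]
        for row in spectrogram
    ]


# Toy 2x2 power spectrogram (made-up numbers).
powers = [[1.0, 0.1], [0.01, 0.0]]
db = power_to_db(powers)
```

With this convention the loudest value maps to 0 dB, and every tenfold drop in power subtracts 10 dB, so quiet details stay visible instead of being crushed near zero.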

Finally, by mapping the frequency axis to the mel scale with a mel filter bank, we get "mel-spectrograms":

Examples of mel-spectrograms
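Here is a rough sketch of what a mel filter bank looks like. It assumes the common formula mel = 2595 * log10(1 + f / 700) and simple triangular filters. The function names and the tiny sizes (8 filters, 64-point frames) are hypothetical and chosen only to keep the example small.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    top_mel = hz_to_mel(sample_rate / 2)
    # n_mels filters need n_mels + 2 edge points: left, center, right for each.
    edges = [mel_to_hz(top_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filter_bank(bank, spectrogram):
    """Project each frame (a list of per-bin powers) onto the mel filters."""
    return [
        [sum(w * p for w, p in zip(filt, frame)) for filt in bank]
        for frame in spectrogram
    ]


bank = mel_filter_bank(n_mels=8, n_fft=64, sample_rate=8000)
flat_frame = [1.0] * 33  # one time frame with a flat spectrum
mel_frame = apply_filter_bank(bank, [flat_frame])[0]
```

Because the filters get wider toward high frequencies, the mel axis spends more resolution on the low frequencies, where human hearing is most sensitive.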

Proposed Text to Speech Solution

Our data scientists trained the three parts of the SV2TTS model separately.

This way, each part is trained on independent data, reducing the need to obtain high-quality multispeaker data.

The general SV2TTS architecture

The Speaker Encoder

The speaker encoder receives input audio from a given speaker, encoded as mel spectrogram frames, and produces an embedding that captures "how the speaker sounds."

It ignores the language, the spoken words, and background noise. Instead, the speaker encoder captures features of the speaker's voice such as pitch, accent, and tone.

These features are combined into a low-dimensional vector, known formally as the d-vector or informally as the speaker embedding.


Visualization of speaker embeddings extracted from LibriSpeech utterances. Each color corresponds to a different speaker
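To see why such embeddings cluster by speaker, consider this small sketch. It assumes (this is not stated above) that an utterance embedding is the average of embeddings computed on short windows, rescaled to unit length, and that two voices are compared by cosine similarity. All vectors are made up and far smaller than real speaker embeddings.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def utterance_embedding(partial_embeddings):
    """Average window-level embeddings, then rescale to unit length."""
    count = len(partial_embeddings)
    dims = len(partial_embeddings[0])
    mean = [sum(e[d] for e in partial_embeddings) / count for d in range(dims)]
    return l2_normalize(mean)


def cosine_similarity(a, b):
    # Both inputs are unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 3-dimensional embeddings for two Yoda clips and another speaker.
yoda_a = utterance_embedding([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
yoda_b = utterance_embedding([[0.7, 0.3, 0.1]])
other = utterance_embedding([[0.1, 0.2, 0.9]])

same_speaker = cosine_similarity(yoda_a, yoda_b)
different_speaker = cosine_similarity(yoda_a, other)
```

Clips from the same voice should score higher than clips from different voices, which is what the clusters in the visualization reflect.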




The Synthesizer

The synthesizer takes a text sequence, mapped to phonemes, together with the embedding produced by the speaker encoder. Phonemes are the smallest units of speech sound, such as the sound you make when saying 'a'.

Then the synthesizer uses the Tacotron 2 architecture to generate frames of a mel spectrogram recurrently.
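The sketch below shows one way the two inputs can be combined. In the SV2TTS paper, the speaker embedding is concatenated with the synthesizer's encoder output at every time step. Everything else here is a stand-in: the tiny phoneme table, the fake "encoder outputs", and the function names are hypothetical and only illustrate the data flow.

```python
# Toy phoneme inventory (hypothetical; real systems use a full phoneme set).
PHONEME_IDS = {"HH": 0, "AH": 1, "L": 2, "OW": 3}


def phonemes_to_ids(phonemes):
    """Map phoneme symbols to the integer IDs a model would consume."""
    return [PHONEME_IDS[p] for p in phonemes]


def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to the encoder output at every time step."""
    return [list(step) + list(speaker_embedding) for step in encoder_outputs]


ids = phonemes_to_ids(["HH", "AH", "L", "OW"])  # "hello"

# Stand-in for the text encoder: one 2-dimensional vector per phoneme.
encoder_outputs = [[float(i), 1.0] for i in ids]

# Made-up 3-dimensional speaker embedding.
speaker_embedding = [0.6, 0.8, 0.0]

decoder_inputs = condition_on_speaker(encoder_outputs, speaker_embedding)
```

Since every time step carries the same voice information, the decoder can generate mel spectrogram frames for any text in the target speaker's voice.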


Synthesizer training


Neural Vocoder

Our data scientists used a vocoder to convert a mel spectrogram made by the synthesizer into raw audio waves.

It's based on DeepMind's WaveNet model, which generates raw audio waveforms autoregressively, one sample at a time. At one point, this model was state-of-the-art for TTS systems.
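One detail from the WaveNet paper helps explain how a network can predict raw audio one sample at a time: each sample is compressed with mu-law companding and quantized to 256 classes, so the model predicts a class rather than a real number. The sketch below illustrates that encoding. Whether the vocoder in the notebook uses exactly this scheme is not stated here, so treat it as background rather than a description of our pipeline.

```python
import math

MU = 255  # 8-bit mu-law, as described in the WaveNet paper


def mu_law_encode(sample, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    compressed = math.copysign(
        math.log1p(mu * abs(sample)) / math.log1p(mu), sample
    )
    return int(round((compressed + 1) / 2 * mu))


def mu_law_decode(index, mu=MU):
    """Invert mu_law_encode, up to quantization error."""
    compressed = 2 * index / mu - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed
    )


loud = mu_law_encode(1.0)
quiet = mu_law_encode(-1.0)
restored = mu_law_decode(mu_law_encode(0.5))
```

The logarithmic compression gives quiet samples finer resolution than loud ones, which suits speech better than uniform 8-bit quantization.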


Now Generate Yoda Voice Yourself!

Now that you are familiar with the theory and how it is implemented in practice, we will run the solution in a Google Colab notebook so that you get immediate results on a GPU, no matter what device or hardware you use.

To synthesize Master Yoda's voice, we will use this YouTube video:


The playground script is based on the Real-Time Voice Cloning Jupyter notebook, with various enhancements.

Follow the instructions in the notebook below to synthesize Yoda's voice. By the end, you will also have the opportunity to synthesize your own voice, so have fun!
