Generate Yoda's voice from your text in 5 minutes

What if progress in Neural Networks has come so far that we can now build systems that excel not only at classifying or detecting, but at generating something unique in the style of a given object, for example turning your text into Yoda's speech?

Since the first work on Generative Adversarial Networks (GANs for short) appeared in 2014, this field has made huge steps forward and produced some astonishing breakthroughs. We have all heard about image style transfer: extracting the style from a famous painting and applying it to another image.

Examples of artwork generated with the Image Style Transfer algorithm

But today we will look deeper into voice generation with a tutorial showing how to train a Yoda text-to-speech model, and by the end we will have a complete pipeline that turns any given text into Yoda's voice:

We will also show you a pipeline to generate the voice of any character from just a few samples of their voice, and even your own.

Master Yoda giphy image

    Table of contents:

  • Exploring theory behind the scenes
  • Getting familiar with proposed solution
  • Hands on generating Yoda's voice

If you want to jump straight into the fun part, start from here.

We will use Real-Time-Voice-Cloning, an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time.

The goal of the underlying paper was to build a TTS system that can generate natural speech for a variety of speakers in a data-efficient manner. It addresses a zero-shot learning setting, where a few seconds of untranscribed reference audio from a target speaker is used to synthesize new speech in that speaker's voice, without updating any model parameters.

Exploring theory behind the scenes

But how can the underlying idea of Image Style Transfer be applied to sound? There is a way of converting audio signals into an image-like 2-dimensional representation called a "spectrogram", and it is the key to applying specifically designed computer vision algorithms to audio-related tasks.

Spectrogram

Spectrograms

Let's take a closer look at what a spectrogram really is. Given a 1-dimensional time-domain signal, we want to obtain a 2-dimensional time-frequency representation. To achieve that, a Short-Time Fourier Transform with a window of a certain length is applied to the audio signal, keeping only the squared magnitude of the result.
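For illustration, here is roughly how this step looks with librosa (the audio file name and the window parameters below are placeholders, not values from the paper):

```python
import numpy as np
import librosa

# Load a mono audio clip (librosa resamples to 22050 Hz by default).
y, sr = librosa.load("yoda_sample.wav", sr=22050)

# Short-Time Fourier Transform: slide a fixed-length window over the signal
# and take the FFT of each windowed frame.
stft = librosa.stft(y, n_fft=1024, hop_length=256)

# Keep only the squared magnitude -> a (freq_bins, time_frames) power spectrogram.
power_spec = np.abs(stft) ** 2
print(power_spec.shape)
```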

Illustration of how Time and Frequency correlate, from the MelNet paper page

To make spectrograms even more useful for our task, each "pixel" (magnitude value) is converted to the decibel scale by taking the log of each value.

Finally, by applying a mel filter bank to map the spectrogram onto the mel scale, we get "mel-spectrograms".
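Putting the last two steps together, a mel-spectrogram can be computed in a couple of lines (again, the file name and parameters such as the 80 mel bands are illustrative choices, not values prescribed by the paper):

```python
import numpy as np
import librosa

y, sr = librosa.load("yoda_sample.wav", sr=22050)  # same placeholder clip as above

# Apply a mel filter bank to the power spectrogram, then move to the decibel (log) scale.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

# This 2D array is the image-like representation the models work with.
print(mel_spec_db.shape)  # (80, time_frames)
```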

Examples of mel-spectrograms

Getting familiar with proposed solution

The SV2TTS model is composed of three parts, each trained individually.

This allows each part to be trained on independent data, reducing the need to obtain high quality, multispeaker data.

The general SV2TTS architecture

The Speaker Encoder

The speaker encoder receives input audio from a given speaker, encoded as mel spectrogram frames, and produces an embedding that captures "how the speaker sounds."
It doesn't care about the words or the background noise, only about the voice characteristics of the speaker such as a high- or low-pitched voice, accent, tone, etc.

All of these features are combined into a low-dimensional vector, known formally as the d-vector, or informally as the speaker embedding.
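If you want to poke at this part yourself, here is a minimal sketch of computing a speaker embedding with the encoder module from the Real-Time-Voice-Cloning repo (the weights path and reference audio file are placeholders, and exact paths may differ between repo versions):

```python
from pathlib import Path
import numpy as np
from encoder import inference as encoder  # module from the Real-Time-Voice-Cloning repo

# Load the pretrained speaker encoder (placeholder path).
encoder.load_model(Path("saved_models/default/encoder.pt"))

# A few seconds of untranscribed reference audio from the target speaker.
wav = encoder.preprocess_wav("yoda_reference.wav")

# The d-vector / speaker embedding: a fixed-size unit-norm vector.
embed = encoder.embed_utterance(wav)
print(embed.shape, np.linalg.norm(embed))
```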

Synthesizer

The synthesizer takes a sequence of text mapped to phonemes (the smallest units of human sound, e.g. the sound you make when saying 'a'), along with the embedding produced by the speaker encoder, and uses the Tacotron 2 architecture to generate frames of a mel spectrogram recurrently.
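Continuing the sketch, the synthesizer turns text plus the speaker embedding from the previous snippet into mel spectrogram frames (again assuming the repo's module layout; the weights path is a placeholder):

```python
from pathlib import Path
from synthesizer.inference import Synthesizer  # module from the Real-Time-Voice-Cloning repo

# Load the Tacotron-based synthesizer (placeholder path).
synthesizer = Synthesizer(Path("saved_models/default/synthesizer.pt"))

text = "Do or do not. There is no try."

# One mel spectrogram is generated per (text, speaker embedding) pair.
specs = synthesizer.synthesize_spectrograms([text], [embed])
mel = specs[0]  # 2D array: (n_mels, frames)
```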

Neural Vocoder

To convert the mel spectrogram produced by the synthesizer into a raw audio waveform, the authors use a vocoder.

It is based on DeepMind's WaveNet model, which generates raw audio waveforms and was at one point the state of the art for TTS systems.
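To close the loop of the sketch, the vocoder takes the mel spectrogram `mel` from the previous snippet and produces a waveform you can listen to (the weights path is again a placeholder):

```python
from pathlib import Path
import soundfile as sf
from vocoder import inference as vocoder  # module from the Real-Time-Voice-Cloning repo

# Load the pretrained neural vocoder (placeholder path).
vocoder.load_model(Path("saved_models/default/vocoder.pt"))

# Turn the synthesizer's mel spectrogram into a raw waveform and save it.
wav_out = vocoder.infer_waveform(mel)
sf.write("yoda_output.wav", wav_out, synthesizer.sample_rate)
```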

Visualization of vocoder flow

Hands on generating Yoda's voice

Now that you are familiar with the theory and how it's implemented in practice, we will run the solution in a Google Colab notebook, so you get immediate results with a GPU on board, no matter what device or hardware you use.

To synthesize Master Yoda's voice we will use this YouTube video:

The playground script is based on the Real-Time Voice Cloning Jupyter notebook, with various enhancements.

Follow the instructions in the notebook below to synthesize Yoda's voice. At the end of it, you will also have the opportunity to synthesize your own voice, so have fun with it!
