Generate Yoda's voice from your text in 5 minutes
What if progress in neural networks has come so far that we can now build systems that excel not only at classifying or detecting, but at generating something unique in the style of a given subject, for example turning your text into Yoda's speech?
Since the first works on Generative Adversarial Networks (GANs for short) appeared in 2014, this field has made huge steps forward and achieved some astonishing breakthroughs. We have all heard about image style transfer: extracting the style from a famous painting and applying it to another image.

Examples of artwork generated with the Image Style Transfer algorithm
But today we will dive deeper into voice generation with a tutorial showing how to train a Yoda text-to-speech model; by the end, we will have a complete Yoda voice generated from given text:
We will also show you a pipeline to generate the voice of any character from just a few audio samples, and even your own voice.
Table of contents:
- Exploring the theory behind the scenes
- Getting familiar with the proposed solution
- Hands-on: generating Yoda's voice
If you want to jump straight to the fun part, start from here.
We will use Real-Time-Voice-Cloning, an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time.
The goal of the underlying paper was to build a TTS system that can generate natural speech for a variety of speakers in a data-efficient manner, addressing a zero-shot setting in which a few seconds of untranscribed reference audio from a target speaker are used to synthesize new speech in that speaker's voice, without updating any model parameters.
Exploring the theory behind the scenes
But how can the underlying idea of image style transfer be applied to sound? There is a way of converting audio signals into an image-like 2-dimensional representation called a "spectrogram", and it is the key to applying specifically designed computer vision algorithms to audio-related tasks.
Let's take a closer look at what a spectrogram really is. Given a 1-dimensional time-domain signal, we want to obtain a 2-dimensional time-frequency representation. To achieve that, we apply the Short-Time Fourier Transform (STFT) with a window of a certain length to the audio signal, keeping only the squared magnitude of the result.
To make spectrograms even more useful for our task, each "pixel" (magnitude value) is converted to the decibel scale by taking the log of each value.
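The STFT-then-decibel conversion described above can be sketched in a few lines of NumPy. This is a minimal illustration; the window and hop sizes are illustrative choices, not the ones used by SV2TTS, and real pipelines typically rely on a library such as librosa.

```python
import numpy as np

def power_spectrogram_db(signal, win_length=512, hop_length=128):
    """Short-Time Fourier Transform -> squared magnitude -> decibel scale.

    A minimal NumPy sketch of the steps described above.
    """
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # One FFT per windowed frame; keep only the non-negative frequencies
    stft = np.fft.rfft(frames, axis=1)
    power = np.abs(stft) ** 2  # squared magnitude
    # Convert each "pixel" to decibels (log scale); floor avoids log(0)
    return 10.0 * np.log10(np.maximum(power, 1e-10))

# A 1-second, 440 Hz sine wave at a 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
spec = power_spectrogram_db(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (122, 257): (frames, win_length // 2 + 1)
```

The resulting 2-D array is exactly the "image" that vision-style models consume: time along one axis, frequency along the other.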
Finally, by converting frequencies to the mel scale, i.e. applying a mel filter bank to the spectrogram, we get "mel spectrograms".
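The mel filter bank itself is just a matrix of triangular filters spaced evenly on the mel scale. Here is a minimal sketch, assuming the common 2595·log10(1 + f/700) mel formula and illustrative parameter values:

```python
import numpy as np

def hz_to_mel(f):
    # A widely used mel-scale formula (librosa's "htk" variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr=16000, n_fft=512, n_mels=40):
    """Triangular filters equally spaced on the mel scale.

    Multiplying a (frames, n_fft//2 + 1) power spectrogram by the
    transpose of this (n_mels, n_fft//2 + 1) matrix yields a
    mel spectrogram.
    """
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0, sr / 2, n_bins)
    # Filter edges: equally spaced in mel, then converted back to Hz
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        up = (fft_freqs - left) / (center - left)      # rising slope
        down = (right - fft_freqs) / (right - center)  # falling slope
        bank[i] = np.maximum(0.0, np.minimum(up, down))
    return bank

bank = mel_filter_bank()
print(bank.shape)  # (40, 257)
```

In practice a library routine (e.g. librosa's mel spectrogram utilities) handles normalization details that this sketch omits.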
Getting familiar with the proposed solution
The SV2TTS model is composed of three parts, each trained individually.
This allows each part to be trained on independent data, reducing the need for high-quality multispeaker data.
The Speaker Encoder
The speaker encoder receives input audio from a given speaker, encoded as mel spectrogram frames, and produces an embedding that captures "how the speaker sounds."
It doesn't care about the words or the background noise; rather, it captures the voice characteristics of the speaker, such as a high- or low-pitched voice, accent, tone, etc.
All of these features are combined into a low-dimensional vector, known formally as the d-vector, or informally as the speaker embedding.
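To build intuition for what the speaker embedding buys us: two clips of the same speaker should map to nearby vectors, while different speakers land far apart, typically measured with cosine similarity. A toy illustration with random vectors standing in for real d-vectors (the 256-dimensional size here is an assumption for illustration only):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two speaker embeddings (d-vectors)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Hypothetical 256-dimensional d-vectors for illustration
yoda_clip_1 = rng.normal(size=256)
# Same speaker, slight variation between recordings
yoda_clip_2 = yoda_clip_1 + 0.1 * rng.normal(size=256)
other_voice = rng.normal(size=256)

print(cosine_similarity(yoda_clip_1, yoda_clip_2))  # close to 1
print(cosine_similarity(yoda_clip_1, other_voice))  # near 0
```

This is exactly the property speaker verification training encourages, and it is what lets the rest of the pipeline condition on an unseen voice.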
The Synthesizer
The synthesizer takes a sequence of text mapped to phonemes (the smallest units of human sound, e.g., the sound you make when saying 'a'), along with the embedding produced by the speaker encoder, and uses the Tacotron 2 architecture to generate frames of a mel spectrogram recurrently.
The Vocoder
To convert the mel spectrogram produced by the synthesizer into raw audio waveforms, the authors use a vocoder. It is based on DeepMind's WaveNet model, which generates raw audio waveforms and was at one point the state of the art for TTS systems.
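Putting the three parts together, inference in the Real-Time-Voice-Cloning repository follows roughly the pseudocode below. Module paths, function names, and checkpoint locations are modeled on the repository's demo scripts, but they may change between versions, so treat this as an approximate sketch and check the repository's README for the current API.

```
# Pseudocode sketch of the three-stage SV2TTS pipeline
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# 1. Load the three pretrained models
encoder.load_model("encoder/saved_models/pretrained.pt")
synthesizer = Synthesizer("synthesizer/saved_models/pretrained/")
vocoder.load_model("vocoder/saved_models/pretrained.pt")

# 2. Embed a few seconds of the target speaker (e.g. a Yoda clip)
wav = encoder.preprocess_wav("yoda_sample.wav")
embedding = encoder.embed_utterance(wav)

# 3. Synthesize a mel spectrogram for new text, conditioned on the embedding
texts = ["Do or do not, there is no try."]
specs = synthesizer.synthesize_spectrograms(texts, [embedding])

# 4. The vocoder turns the spectrogram into a waveform
generated_wav = vocoder.infer_waveform(specs[0])
```

Note that no model parameters are updated anywhere in this flow: the few seconds of reference audio only pass through the encoder, which is what makes the approach zero-shot.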
Hands-on: generating Yoda's voice
Now that you are familiar with the theory and how it is implemented in practice, we will run the solution in a Google Colab notebook to get immediate results with a GPU on board, no matter what device or hardware you use.
To synthesize Master Yoda's voice, we will use this YouTube video:
Follow the instructions in the notebook below to synthesize Yoda's voice. Moreover, at the end of the notebook you will have the opportunity to synthesize your own voice, so have fun with it!