
Tacotron-2-Multispeaker:

Tensorflow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in the paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

A multispeaker implementation for multispeaker, multilingual speech synthesis and cross-language voice cloning. Module details follow the architecture in Google's paper: Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

Forked from https://github.com/Rayhane-mamah/Tacotron-2 at the commit of 2018-10-07 (https://github.com/Rayhane-mamah/Tacotron-2/tree/970b0803bb41e68cbac854dc958dbb03f34f9604).

Only the Griffin-Lim vocoder is kept; the WaveNet vocoder has been removed.
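For reference, the following is a minimal sketch (not this repository's actual code) of how the multispeaker conditioning described in the paper above can be wired in: trainable speaker and language embedding tables are looked up per utterance and concatenated to every encoder time step before attention. All names, shapes, and table sizes here are illustrative assumptions; see tacotron/models for the real implementation.

import tensorflow as tf  # TF1-style graph code, matching the repository


def condition_encoder_outputs(encoder_outputs, speaker_ids, language_ids,
                              num_speakers=100, num_languages=2,
                              speaker_dim=64, language_dim=8):
    """encoder_outputs: [batch, time, channels] from the Tacotron encoder."""
    # Trainable lookup tables for speakers and languages (sizes are assumed).
    speaker_table = tf.get_variable('speaker_embedding',
                                    [num_speakers, speaker_dim])
    language_table = tf.get_variable('language_embedding',
                                     [num_languages, language_dim])
    spk = tf.nn.embedding_lookup(speaker_table, speaker_ids)    # [batch, speaker_dim]
    lng = tf.nn.embedding_lookup(language_table, language_ids)  # [batch, language_dim]
    cond = tf.concat([spk, lng], axis=-1)
    # Broadcast the per-utterance conditioning vector over every encoder time
    # step, then concatenate it to the outputs the attention mechanism consumes.
    time_steps = tf.shape(encoder_outputs)[1]
    cond = tf.tile(tf.expand_dims(cond, 1), [1, time_steps, 1])
    return tf.concat([encoder_outputs, cond], axis=-1)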

Repository Structure:

Tacotron-2
├── datasets
├── LJSpeech-1.1	(0)
│   └── wavs
├── logs-Tacotron	(2)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output	(3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
└── training_data	(1)
    ├── audio
    ├── linear
    └── mels

The previous tree shows the current state of the repository (separate training, one step at a time).

  • Step (0): Get your dataset; here the example used is LJSpeech.
  • Step (1): Preprocess your data. This will give you the training_data folder.
  • Step (2): Train your Tacotron model. Yields the logs-Tacotron folder.
  • Step (3): Synthesize/Evaluate the Tacotron model. Gives the tacotron_output folder.

Note:

  • Our preprocessing only supports LJSpeech and LJSpeech-like datasets (such as the M-AILABS speech data)! If your dataset is stored differently, you will probably need to write your own preprocessing script.
  • In the previous tree, files are not represented, and the maximum depth was set to 3 for simplicity.
  • If you run training of both models at the same time, the repository structure will differ.

Model Architecture:

The model described by the authors can be divided into two parts:

  • Spectrogram prediction network
  • Wavenet vocoder

For an in-depth exploration of the model architecture, training procedure, and preprocessing logic, refer to our wiki.

How to start:

First, you need Python 3 installed along with TensorFlow.

Next, you can install the requirements. If you are an Anaconda user (otherwise, replace pip with pip3 and python with python3):

pip install -r requirements.txt

For more details about environment setup, please visit the Environment Setup page: SETUP.md.

Dataset:

We tested the code above on the LJSpeech dataset, which has nearly 24 hours of labeled recordings from a single actress. (Further information on the dataset is available in the README file included with the download.)

We are also running tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (more than 80 GB of data) across more than 10 languages.

After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.
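For LJSpeech, for example, this boils down to something like the following (the archive name is an assumption; use whatever file you actually downloaded):

tar -xjf LJSpeech-1.1.tar.bz2
mv LJSpeech-1.1 Tacotron-2/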

Hparams setting:

Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all directly in the hparams.py file.

To pick optimal FFT parameters, I have made a griffin_lim_synthesis_tool notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in hparams.py and have meaningful names, so you can experiment with them.
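If you want a quick sanity check outside the notebook, the snippet below does a similar round trip. This is an illustrative sketch, assuming librosa and soundfile are installed; it is not the notebook's code, and the n_fft/hop_length/n_mels values below are placeholders that must be matched to your hparams.py.

import librosa
import soundfile as sf

# Load a reference wav and extract a mel spectrogram with candidate parameters.
wav, sr = librosa.load('LJSpeech-1.1/wavs/LJ001-0001.wav', sr=22050)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                     hop_length=275, n_mels=80)
# Invert the mel spectrogram back to audio via Griffin-Lim phase estimation,
# then listen to the result to judge how lossy the chosen parameters are.
recovered = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=2048,
                                                 hop_length=275)
sf.write('griffin_lim_check.wav', recovered, sr)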

Preprocessing:

Before running the following steps, please make sure you are inside the Tacotron-2 folder:

cd Tacotron-2

Preprocessing can then be started using:

python preprocess.py

The dataset can be chosen using the --dataset argument. If using the M-AILABS dataset, you need to provide the language, voice, reader, merge_books, and book arguments to match your needs. The default is LJSpeech.
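An M-AILABS run might look like the following (the argument values are illustrative; pick the ones matching your download):

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'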

This should take no longer than a few minutes.

Training:

Train the Tacotron-2 model using:

python train.py

Checkpoints are saved every 5000 steps and stored under the logs-Tacotron folder.

Note:

  • Please refer to train arguments under train.py for a set of options you can use.
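For example, a run that names the model explicitly might look like this (the flag is taken from the upstream fork; verify it against train.py):

python train.py --model='Tacotron'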

Synthesis:

Synthesize audio using:

python synthesize.py

Note:

  • Please refer to synthesis arguments under synthesize.py for a set of options you can use.
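For example, an evaluation-mode run might look like this (the flags are taken from the upstream fork; verify them against synthesize.py):

python synthesize.py --model='Tacotron' --mode='eval'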

References and Resources:

  • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (the Tacotron-2 paper)
  • Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
  • Original implementation: https://github.com/Rayhane-mamah/Tacotron-2
