
Tacotron-2-Multispeaker:

Tensorflow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in the paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

A multispeaker implementation for multispeaker, multilingual speech synthesis and cross-language voice cloning. Module details follow the architecture in Google's paper: Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

Forked from https://github.com/Rayhane-mamah/Tacotron-2 at the commit of 2018-10-07 (https://github.com/Rayhane-mamah/Tacotron-2/tree/970b0803bb41e68cbac854dc958dbb03f34f9604).

Only the Griffin-Lim vocoder is kept; the WaveNet vocoder has been removed.
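For reference, the following is a minimal sketch (not this repository's actual code) of how the multispeaker conditioning described in the paper above can be wired in: trainable speaker and language embedding tables are looked up per utterance and concatenated to every encoder time step before attention. All names, shapes, and table sizes here are illustrative assumptions; see tacotron/models for the real implementation.

import tensorflow as tf  # TF1-style graph code, matching the repository


def condition_encoder_outputs(encoder_outputs, speaker_ids, language_ids,
                              num_speakers=100, num_languages=2,
                              speaker_dim=64, language_dim=8):
    """encoder_outputs: [batch, time, channels] from the Tacotron encoder."""
    # Trainable lookup tables for speakers and languages (sizes are assumed).
    speaker_table = tf.get_variable('speaker_embedding',
                                    [num_speakers, speaker_dim])
    language_table = tf.get_variable('language_embedding',
                                     [num_languages, language_dim])
    spk = tf.nn.embedding_lookup(speaker_table, speaker_ids)    # [batch, speaker_dim]
    lng = tf.nn.embedding_lookup(language_table, language_ids)  # [batch, language_dim]
    cond = tf.concat([spk, lng], axis=-1)
    # Broadcast the per-utterance conditioning vector over every encoder time
    # step, then concatenate it to the outputs the attention mechanism consumes.
    time_steps = tf.shape(encoder_outputs)[1]
    cond = tf.tile(tf.expand_dims(cond, 1), [1, time_steps, 1])
    return tf.concat([encoder_outputs, cond], axis=-1)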

Repository Structure:

Tacotron-2
├── datasets
├── LJSpeech-1.1	(0)
│   └── wavs
├── logs-Tacotron	(2)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output	(3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
└── training_data	(1)
    ├── audio
    ├── linear
    └── mels

The previous tree shows the current state of the repository (separate training, one step at a time).

  • Step (0): Get your dataset; here the example used is LJSpeech.
  • Step (1): Preprocess your data. This will give you the training_data folder.
  • Step (2): Train your Tacotron model. Yields the logs-Tacotron folder.
  • Step (3): Synthesize/Evaluate the Tacotron model. Gives the tacotron_output folder.

Note:

  • Our preprocessing only supports LJSpeech and LJSpeech-like datasets (such as the M-AILABS speech data)! If your dataset is stored differently, you will probably need to write your own preprocessing script.
  • In the previous tree, files are not represented, and the maximum depth was set to 3 for simplicity.
  • If you run training of both models at the same time, the repository structure will differ.

Model Architecture:

The model described by the authors can be divided into two parts:

  • Spectrogram prediction network
  • Wavenet vocoder

For an in-depth exploration of the model architecture, training procedure, and preprocessing logic, refer to our wiki.

How to start:

First, you need Python 3 installed along with TensorFlow.

Next, you can install the requirements. If you are an Anaconda user (otherwise, replace pip with pip3 and python with python3):

pip install -r requirements.txt

For more details about environment setup, please visit the Environment Setup page: SETUP.md.

Dataset:

We tested the code above on the LJSpeech dataset, which has nearly 24 hours of labeled recordings from a single actress. (Further information on the dataset is available in the README file included with the download.)

We are also running tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (more than 80 GB of data) across more than 10 languages.

After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.
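For LJSpeech, for example, this boils down to something like the following (the archive name is an assumption; use whatever file you actually downloaded):

tar -xjf LJSpeech-1.1.tar.bz2
mv LJSpeech-1.1 Tacotron-2/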

Hparams setting:

Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all directly in the hparams.py file.

To pick optimal FFT parameters, I have made a griffin_lim_synthesis_tool notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in hparams.py and have meaningful names, so you can experiment with them.
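If you want a quick sanity check outside the notebook, the snippet below does a similar round trip. This is an illustrative sketch, assuming librosa and soundfile are installed; it is not the notebook's code, and the n_fft/hop_length/n_mels values below are placeholders that must be matched to your hparams.py.

import librosa
import soundfile as sf

# Load a reference wav and extract a mel spectrogram with candidate parameters.
wav, sr = librosa.load('LJSpeech-1.1/wavs/LJ001-0001.wav', sr=22050)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                     hop_length=275, n_mels=80)
# Invert the mel spectrogram back to audio via Griffin-Lim phase estimation,
# then listen to the result to judge how lossy the chosen parameters are.
recovered = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=2048,
                                                 hop_length=275)
sf.write('griffin_lim_check.wav', recovered, sr)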

Preprocessing:

Before running the following steps, please make sure you are inside the Tacotron-2 folder:

cd Tacotron-2

Preprocessing can then be started using:

python preprocess.py

The dataset can be chosen using the --dataset argument. If using the M-AILABS dataset, you need to provide the language, voice, reader, merge_books, and book arguments to match your needs. The default is LJSpeech.
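An M-AILABS run might look like the following (the argument values are illustrative; pick the ones matching your download):

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'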

This should take no longer than a few minutes.

Training:

Train the Tacotron-2 model using:

python train.py

Checkpoints are saved every 5000 steps and stored under the logs-Tacotron folder.

Note:

  • Please refer to train arguments under train.py for a set of options you can use.
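For example, a run that names the model explicitly might look like this (the flag is taken from the upstream fork; verify it against train.py):

python train.py --model='Tacotron'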

Synthesis:

Synthesize audio using:

python synthesize.py

Note:

  • Please refer to synthesis arguments under synthesize.py for a set of options you can use.
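For example, an evaluation-mode run might look like this (the flags are taken from the upstream fork; verify them against synthesize.py):

python synthesize.py --model='Tacotron' --mode='eval'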

References and Resources:

  • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (the Tacotron-2 paper)
  • Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
  • Original implementation: https://github.com/Rayhane-mamah/Tacotron-2
