The gist of this post is that every day for the past few weeks I have gone home and talked to my computer. Yes, you read that correctly. I have gone home and talked to my computer. Also, don’t let the title fool you. The beef of this post is setting up an environment and a process to help me develop a custom text to speech program. I would only call this a mere glance into the Text-To-Speech Algorithms.
I need the future, now. This project is to build a text-to-speech system trained on my own voice, and I am very excited to build it from the ground up.
I will be using a handful of Artificial Intelligence libraries and tools to ensure this process goes as smoothly as possible. Some include Mozilla’s TTS, Gentle for speech mapping, SoX for data cutting, and of course Python…
Picking a Library
After evaluating the libraries that are out there, I decided to go with Mozilla’s TTS library. It looked relatively easy to put together, and I felt I could reproduce my own voice from text with it.
Building a Dataset
It appears having a good, no scratch that, perfect, dataset is the most important part of building any decent text to speech application.
I looked at the LJ Speech Dataset and decided that about 24 hours (roughly 13,100 utterances/audio clips) of my own time would be needed to record and collect data. I am sure it will take upwards of 40-50 hours to ensure the data is properly Extracted, Transformed and Loaded. I plan to use the aeneas library to match my speech to text. To confirm this is the correct way to build a model, I also looked at how other datasets are built. The other popular ones are the Blizzard Challenge dataset and the M-AILABS Speech Dataset. I could have used these, but the LJ Speech Dataset seemed easier to replicate.
Mozilla provides a great article here on how to build a custom voice Text to Speech Application. The article mentioned will be one of many I will be using to learn more about building my custom TTS program.
There’s really just no chance
There is not a fucking chance I sit down and read 13,000 one-liners back to back. I need to find an already existing text that is broken up, and I need it matched to the wav file and broken down into 5 second wav files. Enter Python. This isn’t too crazy, but my plan is to basically read a chapter a day until I am done with three books. This should get me to ~15,000 sentences, which should be all that’s needed for the training model. I will feed the model more data if I feel it is necessary.
The process will look like the following:
1) Find a Full Plain Text Book Online
2) Parse Text Sentence by Sentence into a single file (python..)
3) Read and Record the Single file to a single wav file
4) Use Python Library Aeneas to match text to speech (still in bigger file)
5) Use Python to break up the large wav file into smaller wav files using ffmpeg
6) Aeneas to Create the LJ wav folder and
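Step 2 of the list above can be sketched in a few lines. This is a minimal, naive splitter and not a finished tool: the function name `split_sentences` is made up for illustration, and it assumes sentences end with `.`, `!` or `?` followed by whitespace, so abbreviations like "Mr." will be split incorrectly.

```python
import re


def split_sentences(text):
    """Naively split text into sentences, one string per sentence."""
    # Collapse newlines and runs of spaces so hard-wrapped lines rejoin.
    flat = re.sub(r"\s+", " ", text).strip()
    if not flat:
        return []
    # Split on the whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [part for part in parts if part]


if __name__ == "__main__":
    sample = "It was a dark night. Did anyone\nsee it? Nobody did!"
    # Print one sentence per line, ready to be read aloud and recorded.
    for sentence in split_sentences(sample):
        print(sentence)
```

Writing one sentence per line keeps the script easy to read from while recording, and the same file can later be handed to the alignment tool as the transcript.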
So I thought…
The 6 step process above is nice, but almost a little unrealistic and way too time consuming for someone as lazy as myself. If it can be better I will make it better. Here is the new process:
1) Find Text/Plaintext Script (Movie Scripts are fun to read) ->
2) Record on Audacity and save it to Wav Format ->
3) Upload Text & Unbroken Large Wav File to Gentle for alignment ->
4) Parse JSON returned from Gentle and break large file with sox into LJSpeech Dataset (a wavs folder and a csv mapping to the files) ->
5) Pass LJSpeech Dataset to TTS Model
Seems Easy Enough.
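Step 4 is where most of the Python lives, so here is a rough sketch of it. It assumes Gentle's JSON has a top-level `transcript` string and a `words` list whose entries carry `case`, `start`, `end`, `startOffset` and `endOffset`, and that the offsets index into `transcript` as characters (true for plain ASCII text). The helper names `clips_from_alignment` and `sox_command` and the file names are hypothetical. The sketch only builds the sox command lines and does not run them.

```python
SENTENCE_END = ".!?"


def _close(group, transcript):
    """Turn a run of aligned words into (start_sec, end_sec, text)."""
    first, last = group[0], group[-1]
    # +1 keeps the punctuation mark that follows the last word.
    text = transcript[first["startOffset"]:last["endOffset"] + 1].strip()
    return first["start"], last["end"], text


def clips_from_alignment(alignment):
    """Group successfully aligned words into sentence-like clips."""
    transcript = alignment["transcript"]
    clips, group = [], []
    for word in alignment["words"]:
        if word.get("case") != "success":
            continue  # skip words the aligner could not find in the audio
        group.append(word)
        following = transcript[word["endOffset"]:word["endOffset"] + 1]
        if following and following in SENTENCE_END:
            clips.append(_close(group, transcript))
            group = []
    if group:  # trailing words with no closing punctuation
        clips.append(_close(group, transcript))
    return clips


def sox_command(src, dst, start, end):
    """Build a sox call; the trim effect takes a start and a duration."""
    return ["sox", src, dst, "trim", f"{start:.3f}", f"{end - start:.3f}"]


if __name__ == "__main__":
    # Hand-made stand-in for the JSON returned by the aligner.
    alignment = {
        "transcript": "Hello world. Bye now!",
        "words": [
            {"case": "success", "start": 0.1, "end": 0.5,
             "startOffset": 0, "endOffset": 5},
            {"case": "success", "start": 0.6, "end": 1.0,
             "startOffset": 6, "endOffset": 11},
            {"case": "not-found-in-audio",
             "startOffset": 13, "endOffset": 16},
            {"case": "success", "start": 1.8, "end": 2.2,
             "startOffset": 17, "endOffset": 20},
        ],
    }
    for i, (start, end, text) in enumerate(clips_from_alignment(alignment), 1):
        cmd = sox_command("recording.wav", f"wavs/clip{i:04d}.wav", start, end)
        print(" ".join(cmd), "#", text)
```

One caveat with this grouping: if the last word of a sentence fails to align, that sentence runs into the next one, so it is worth listening to a sample of the clips before training on them.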
Microsoft’s site says, “the data needs to be a collection (.zip) of audio files (.wav) as individual utterances. Each audio file should be 15 seconds or less in length, paired with a formatted transcript (.txt).” They are basically correct with the information they provide.
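A length limit like the one quoted above is easy to check with Python's standard `wave` module. This is a small sketch that assumes uncompressed PCM .wav files; the function names and the 15 second default are only there to mirror the quote.

```python
import wave


def wav_duration(path):
    """Return the length of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clips_over_limit(paths, limit_seconds=15.0):
    """Return the paths whose audio runs longer than limit_seconds."""
    return [path for path in paths if wav_duration(path) > limit_seconds]
```

Running `clips_over_limit` over the wavs folder after cutting gives a quick list of clips that need to be split again.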
This dataset example for Mozilla’s TTS is what the custom dataset should look like. I found the link on the Mozilla forum here. There are good forum posts here and here that go over training a custom voice.
In the /custom-dataset-sample/ directory there exists a wavs directory and a metadata_sample.csv file. The wavs directory stores .wav files, and metadata_sample.csv is structured to map each file such as wavs/file1.wav to the text spoken in it.
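For reference, the LJ Speech metadata file uses one pipe-separated line per clip: the clip ID (the file name without `.wav`), the transcription, and a normalized transcription. The sketch below writes lines in that shape. It assumes the custom dataset follows the same layout, reuses the raw text as the normalized text, and uses a made-up helper name, `metadata_line`.

```python
def metadata_line(clip_id, text):
    """Build one LJ Speech style line: id|transcription|normalized."""
    # Collapse whitespace so a transcript never spans several lines.
    clean = " ".join(text.split())
    if "|" in clean:
        raise ValueError("transcript must not contain the '|' delimiter")
    # Assumption: no separate normalization step, so the text is repeated.
    return f"{clip_id}|{clean}|{clean}"


if __name__ == "__main__":
    rows = [("file1", "Hello there."), ("file2", "This is my own voice.")]
    with open("metadata.csv", "w", encoding="utf-8") as handle:
        for clip_id, text in rows:
            handle.write(metadata_line(clip_id, text) + "\n")
```

Plain string formatting is used instead of the `csv` module on purpose, since transcripts often contain quotation marks and the quoting rules of a CSV writer would alter them.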
Writing a Preprocessor
Because we will be using a similar format to the LJ Dataset, I will need to make sure the preprocessor uses the correct data processor. This could really fuck my model otherwise.
Training the Model
Looking at this Tacotron example, it appears the LJ Speech Dataset went through 441k steps and the results sound decent. I will be using the Tacotron2 library.
Currently I know the process I am going to follow to achieve this goal of having my voice used by a computer. My plan is to write part 2 of this series after I am done with all the data collection.
This will allow me to really dive deep into curve fitting and understand the specifics of how ML/AI works. I plan to have a demystified understanding of AI/ML when I return for the second post.
- FFMPEG Python Library
- Text To Speech Deep Learning Architectures
- Github Aeneas - A set of tools to automagically synchronize audio and text
- Aeneas Docs
- Github - Tacotron2 Nvidia
- Github - Tacotron2
- The M-AILABS Speech Dataset
- Tacotron-2 Implementation Status and planned TODOs
- WaveNet vocoder
- Tacotron Audio Example
- Tacotron 2 Quick Observations Sharing
- The LJ Speech Dataset
- What makes a good Dataset
- Expressive Speech Synthesis with Tacotron
- HN - Building a dataset
- Microsoft - Data Types
- Microsoft - Create A Custom Voice