Learning To Speak: Key Components of a Neural TTS

A neural TTS system generally has three key components: a text analysis module, an acoustic model, and a vocoder.

A previous post discussed the functions, importance, and subcomponents of a text analysis module, but it looked at the system from a traditional TTS perspective. In this post, I will try to explain, as best I can, how an entire neural TTS system works, from raw text to the final output: a waveform.

The illustration below visualizes the neural TTS process, which is divided into 4 steps or subprocesses.


Figure 1: The data flow from text to waveform, from [1]

Before going deeper, here is a brief overview of how the process works from the first component to the last:

  1. The text analysis module processes and normalizes the raw text and extracts information from it.
  2. The acoustic model generates acoustic features from the output of the previous module, which is either phonemes, linguistic features, or both.
  3. The vocoder receives the acoustic features generated by the acoustic model and turns them into speech, that is, a waveform.
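The three steps above can be sketched as a chain of functions. This is only a toy illustration of how data moves between the components, not a working synthesizer: every function name, parameter, and the "features" themselves are hypothetical stand-ins, and a real system would use trained neural networks for the acoustic model and the vocoder.

```python
from typing import List

# All names here are hypothetical placeholders, not part of any real TTS library.

def analyze_text(raw_text: str) -> List[str]:
    """Toy text analysis: normalize the text and emit one token per character.

    A real module would produce phonemes and/or linguistic features.
    """
    normalized = " ".join(raw_text.lower().split())  # lowercase, collapse whitespace
    return [ch for ch in normalized if ch.isalpha() or ch == " "]

def acoustic_model(tokens: List[str], n_mels: int = 4,
                   frames_per_token: int = 2) -> List[List[float]]:
    """Toy acoustic model: emit a fixed number of feature frames per token.

    Each frame has n_mels values, mimicking the shape of a mel-spectrogram.
    """
    frames = []
    for tok in tokens:
        value = (ord(tok) % 32) / 32.0  # arbitrary deterministic stand-in value
        for _ in range(frames_per_token):
            frames.append([value] * n_mels)
    return frames

def vocoder(frames: List[List[float]], hop_length: int = 3) -> List[float]:
    """Toy vocoder: expand every feature frame into hop_length waveform samples."""
    samples = []
    for frame in frames:
        level = sum(frame) / len(frame)
        samples.extend([level] * hop_length)
    return samples

def synthesize(raw_text: str) -> List[float]:
    """Chain the three components: text -> tokens -> acoustic features -> waveform."""
    tokens = analyze_text(raw_text)
    frames = acoustic_model(tokens)
    return vocoder(frames)

waveform = synthesize("Hi  There")
print(len(waveform), "samples")
```

The point of the sketch is the shape of the interfaces: each component only needs to understand the output of the component before it, and the sequence gets longer at each stage (tokens, then frames, then samples).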

To get the full picture of how a neural TTS system works, it is important to understand the type of data each module receives as input from other modules and the type of data it produces as output.


Figure 2: An illustration of a waveform, from [2]
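To make the final data type concrete: in digital form, a waveform is simply a sequence of amplitude samples taken at a fixed sample rate. The minimal sketch below builds a short sine tone that way. The function name and the default values (440 Hz, 16 kHz, 10 ms) are assumptions chosen purely for illustration.

```python
import math
from typing import List

def sine_waveform(freq_hz: float = 440.0, sample_rate: int = 16000,
                  duration_s: float = 0.01) -> List[float]:
    """Return amplitude samples in [-1, 1] for a pure sine tone."""
    n_samples = round(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

samples = sine_waveform()
print(len(samples), "samples for 10 ms of audio at 16 kHz")
```

A vocoder's job is to produce a sequence like this, except that the samples encode speech rather than a single tone.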

Even among neural-based TTS systems, there can be different data flows from text to waveform:

  1. character -> linguistic features -> acoustic features -> waveform
  2. character -> phoneme -> acoustic features -> waveform
  3. character -> linguistic features -> waveform
  4. character -> phoneme -> acoustic features -> waveform
  5. character -> phoneme -> waveform
  6. character -> waveform

Figure 3: Different neural TTS system architectures, from [1]
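One way to compare these flows is to write them down as plain data and look at which intermediate representations each one keeps. The sketch below encodes the distinct flows from the list above as tuples. The helper names are hypothetical, and the "fully end-to-end" label is the term commonly used for systems that map characters directly to a waveform.

```python
from typing import List, Tuple

Flow = Tuple[str, ...]

# The distinct data flows from the list above, from raw characters to waveform.
FLOWS: List[Flow] = [
    ("character", "linguistic features", "acoustic features", "waveform"),
    ("character", "phoneme", "acoustic features", "waveform"),
    ("character", "linguistic features", "waveform"),
    ("character", "phoneme", "waveform"),
    ("character", "waveform"),
]

def intermediate_representations(flow: Flow) -> List[str]:
    """Everything between the raw characters and the final waveform."""
    return list(flow[1:-1])

def is_fully_end_to_end(flow: Flow) -> bool:
    """True when characters map straight to a waveform with nothing in between."""
    return len(flow) == 2

for flow in FLOWS:
    label = "fully end-to-end" if is_fully_end_to_end(flow) else "staged"
    print(" -> ".join(flow), f"({label})")
```

Fewer intermediate representations means fewer separately trained components, which is the main axis along which the architectures in the figure differ.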

Other posts will discuss text analysis modules, acoustic models, and vocoders in much greater depth.

References

  1. https://doi.org/10.1007/978-981-99-0827-1
  2. https://en.wikipedia.org/wiki/Waveform