An Overview of Speech Synthesis

A project log for μTTS

Speech synthesis on a microcontroller. Talking projects, cheap as chips!

Greg Kennedy, 09/16/2014 at 19:48

Starting from the beginning: Human speech is complicated.

---

There are a lot of noisemakers in the body. Air from the lungs drives the vocal folds to produce harmonically rich sounds, which are then modified and shaped by the glottis, larynx, tongue, lips, jaw, and others. We can also create various plosives (popping, clicking, and other one-shot noises), as well as an approximation of white noise by blowing air. Languages choose from these sounds to build letters (or groups of letters), and from the letters we make words.

Fortunately for me, the groundwork has already been laid. The International Phonetic Association has categorized the possible human speech sounds (or at least, those used in known languages).

http://www.internationalphoneticalphabet.org/ipa-sounds/ipa-chart-with-sounds/

Using these charts, we arrive at the set of phones that would be needed to implement any given language. (Languages select and group phones into phonemes, the relevant "atoms" that make up words. Most languages don't use all the phones, and group certain others together in particular ways.)

There are quite a few of these. For recognizable speech, it may not be necessary to implement all of them (say, the difference between an "m" made by closing the lips and an "m" made by touching the lips to the teeth).
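To make that concrete, here's a rough sketch (in C, since that's what the engine is written in) of what a reduced phone set might look like as a table. The selection and names below are my own illustration, not the project's actual list:

/* A hypothetical reduced phone set -- enough for intelligible
 * speech.  The grouping and names here are illustrative only. */
enum phone {
    PH_SIL,             /* silence / pause            */
    /* vowels (formant pairs)                          */
    PH_IY, PH_IH, PH_EH, PH_AE, PH_AA, PH_AO, PH_UW,
    /* voiced consonants                               */
    PH_M,  PH_N,  PH_L,  PH_R,
    /* noise-based consonants                          */
    PH_S,  PH_SH, PH_F,
    /* plosives (one-shot bursts)                      */
    PH_P,  PH_T,  PH_K,
    PH_COUNT
};

Each entry would index into the parameter tables (formant frequencies, noise settings, and so on) discussed below.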

---

Vowel production is done by synthesizing formants - "the spectral peaks of the sound of the human voice". The interaction of the vocal cords and internal structures creates resonances (some tones louder than others). A human voice has four to six such peaks, but research on sound synthesis has found that just two formants are enough for listeners to distinguish one vowel from another.

http://auditoryneuroscience.com/topics/two-formant-artificial-vowels

Further, there are "average formant frequency" tables available from the IPA. Implementing these on the chip is just a lookup table driving the wave synths.
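As a sketch of that lookup-table approach: two sine oscillators tuned from a per-vowel formant table, summed into a buffer. The F1/F2 values are rough textbook averages for adult male speakers, the 8 kHz sample rate is an assumption, and the table layout and function name are mine, not μTTS's:

#include <math.h>
#include <stdint.h>
#include <stddef.h>

#define SAMPLE_RATE 8000
#define TWO_PI      6.28318530717958647692f

/* Rough average F1/F2 values (Hz) for a few English vowels;
 * a real table would be filled from the published formant data. */
static const struct { float f1, f2; } vowel_formants[] = {
    { 270.0f, 2290.0f },  /* "ee" as in beet   */
    { 660.0f, 1720.0f },  /* "a"  as in bat    */
    { 730.0f, 1090.0f },  /* "ah" as in father */
    { 300.0f,  870.0f },  /* "oo" as in boot   */
};

/* Fill buf with n samples of a crude two-formant vowel:
 * just two sines summed, which is enough to hear vowel identity. */
static void synth_vowel(int vowel, int16_t *buf, size_t n)
{
    const float f1 = vowel_formants[vowel].f1;
    const float f2 = vowel_formants[vowel].f2;
    for (size_t i = 0; i < n; i++) {
        float t = (float)i / SAMPLE_RATE;
        float s = 0.5f * sinf(TWO_PI * f1 * t)
                + 0.5f * sinf(TWO_PI * f2 * t);
        buf[i] = (int16_t)(s * 16000.0f);
    }
}

Even this crude two-sine version makes the vowels distinguishable from one another, which matches the two-formant result linked above.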

Ever noticed how DTMF tones sound a bit like vowels? Like formants, they're just two tones played together - the ratio between them, and the overall pitch, determine which vowel they seem to resemble.

---

Noise production - making an "s" sound is pretty straightforward: generate white noise. "sh" is also noisy, but with the high frequencies rolled off. Several phones can be produced this way.
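On a small chip that might look like the following: a 16-bit Galois LFSR as the cheap white-noise source for "s", and a one-pole low-pass over the same noise to roll off the highs for "sh". The filter coefficient is an untuned guess, not a measured value:

#include <stdint.h>

/* 16-bit Galois LFSR: cheap white noise for "s"-like phones. */
static uint16_t lfsr = 0xACE1u;

static int16_t noise_sample(void)
{
    lfsr = (lfsr >> 1) ^ (-(lfsr & 1u) & 0xB400u);
    return (int16_t)lfsr;
}

/* One-pole low-pass over the noise rolls off the highs,
 * turning "s" into something closer to "sh".
 * alpha is an untuned guess; smaller = darker. */
static int16_t sh_sample(void)
{
    static float y = 0.0f;
    const float alpha = 0.15f;
    y += alpha * ((float)noise_sample() - y);
    return (int16_t)y;
}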

---

Other consonant production - to be determined : )

---

Putting it all together, what I intend to build is more of a "phone synthesizer", capable of producing the phones necessary to build speech. It has no knowledge of letters or words and must be hand-fed the proper pronunciation strings by the host.

To make this easier, I've built a cross-platform C application that combines the synth engine with a loadable per-language dictionary. The dictionary holds IPA pronunciation keys for lists of words, plus phoneme groupings for the language, so it can guess at the pronunciation of unknown words. (As an example: there is some debate about the number of phonemes used in English, but a reasonable estimate is about 42.)

http://www.auburn.edu/academic/education/reading_genie/spellings.html
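In code, that lookup-then-guess flow might look like this; the struct layout and the guess_pronunciation() fallback are hypothetical stand-ins for the real dictionary format and spelling rules:

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical entry: a word and its IPA pronunciation key. */
struct dict_entry {
    const char *word;
    const char *ipa;
};

static const struct dict_entry dict[] = {
    { "hello", "həˈloʊ" },
    { "chips", "tʃɪps"  },
};

/* Stand-in for the per-language spelling rules: real code would
 * map letter groups to phonemes; this stub just echoes the word. */
static const char *guess_pronunciation(const char *word)
{
    return word;
}

/* Exact dictionary hit first, spelling-based guess otherwise. */
static const char *pronounce(const char *word)
{
    for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(dict[i].word, word) == 0)
            return dict[i].ipa;
    return guess_pronunciation(word);
}

int main(void)
{
    printf("%s\n", pronounce("hello"));   /* found: IPA key    */
    printf("%s\n", pronounce("zorble"));  /* missing: guessed  */
    return 0;
}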

The C app lets users preview and tweak the sound, dump a .wav of the sample, or retrieve the phone string - and then compile that string into their own application, "baking in" the phrases.
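The "baked in" result might look like this on the target - the desktop tool's phone string pasted in as a const array and handed to the synth. The byte encoding, the phone names, and the utts_say() entry point are all hypothetical, for illustration only:

#include <stdint.h>
#include <stdio.h>

/* Phone codes as in the earlier sketch (values are illustrative). */
enum { PH_SIL, PH_HH, PH_EH, PH_L, PH_OW };

/* Stub for the synth entry point; the real one would drive a DAC
 * or PWM pin rather than print. */
static void utts_say(const uint8_t *phones, unsigned len)
{
    for (unsigned i = 0; i < len; i++)
        printf("phone %u\n", phones[i]);
}

/* The phrase "hello", dumped by the desktop tool and baked into
 * the firmware as plain data. */
static const uint8_t phrase_hello[] = {
    PH_HH, PH_EH, PH_L, PH_OW, PH_SIL
};

int main(void)
{
    utts_say(phrase_hello, sizeof phrase_hello);
    return 0;
}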

I may further put the phoneme translation into a standalone C module, for use in e.g. Arduino and friends, as a way to speak arbitrary phrases.
