What is speech synthesis?
Speech synthesis is the artificial, computer-generated production of human speech. It is pretty much the counterpart of speech or voice recognition. A computer system used for speech synthesis is known as a speech computer or a speech synthesizer. It can be implemented in hardware as well as software products. A text-to-speech (TTS) system transforms natural language text into speech. Other kinds of systems convert symbolic linguistic representations such as phonetic transcriptions into speech.
The first speech synthesis effort came in 1779, when professor Christian Kratzenstein, working at the Russian Imperial Academy of Sciences, created an apparatus modeled on the human vocal tract to demonstrate the physiological differences in the generation of the five long vowel sounds.
Homer Dudley's VODER (Voice Operating Demonstrator) was the world’s first fully functional voice synthesizer and was showcased at the 1939 World's Fair. It was based on Bell Laboratories' mid-thirties vocoder (voice coder) research.
In his TED Talk, Roger Ebert proposed the Ebert Test for speech synthesis systems. This test gauges whether a computer-based synthesized voice has the capability to tell a joke with enough skill to make people laugh.
What is speech prosthesis?
Speech prosthesis refers to the generation of speech via a computer for people with physical conditions that inhibit their ability to speak intelligibly. A lot of the research in this domain integrates text generation as well as speech generation, because many of the conditions that inhibit speech also make it difficult to enter text. The major challenge in speech prosthesis is to overcome these difficulties while keeping up with the speed and fluidity of human conversation.
The overarching research objective in speech prosthesis is to design and create a prosthetic system that will resemble natural speech as much as possible, with minimum input needed from the user.
More broadly, speech synthesis also enables visually impaired people to use computers through screen readers.
What is Multimodal speech synthesis?
Multimodal speech synthesis is also known as audio-visual speech synthesis. It involves adding an animated face that is synchronized with, and complements, the synthesized speech. The difficulties that impair a person's speech also tend to limit their ability to communicate via facial expressions.
Synthesized speech is becoming increasingly lifelike, but it could take a while before it is able to handle the nuances of natural speech.
Multimodal speech synthesis even makes it possible to add non-verbal cues like nodding or shaking the head, smiling, winking, etc. to clarify the user’s meaning as far as possible.
Why is speech synthesis used?
Synthetic speech has various applications. The quality of speech synthesis is also improving rapidly. Here are some of the areas in which speech synthesis is used:
Applications for the Blind
Creating reading and communication aids for the blind is one of the biggest and most important applications of speech synthesis. Before synthesized speech, if a blind person wanted to read, they needed to make use of audiobooks. Turning a large book into an audiobook could be a rather time-consuming and expensive task.
Getting information from a computer with speech synthesis is also much easier and more affordable than using a special Bliss symbol keyboard, an input interface based on a pictographic symbol system, or reading Braille.
The Kurzweil reading machine for the blind was possibly the first commercially available text-to-speech application. It was made up of an optical scanner and text recognition software and was able to produce rather intelligible speech from written multi-font text.
When it comes to reading machines, the most important factor is speech intelligibility. This should be maintained at speaking rates ranging from less than half to at least three times the normal rate. Naturalness is also important in making synthetic speech more acceptable, but sometimes it is important that the listener can tell that the speech comes from a machine.
Applications for the Deafened and Vocally Handicapped
People who are born deaf cannot learn to speak by imitating what they hear, and people with hearing difficulties often have speaking difficulties as well.
Synthesized speech gives people who are deaf or vocally impaired the chance to communicate with people who do not understand sign language. With multimodal speech synthesis, it is possible to enhance the quality of speech even further because visual information is very important for the hearing and speech impaired.
Educational Applications
Speech synthesis can also be used in several educational situations. It can make it possible for computers to teach students around the world, around the clock, all year round.
This can be very useful for students who have dyslexia because they might feel embarrassed or uncomfortable with asking an actual teacher for help.
Applications for Telecommunications and Multimedia
Synthesized speech has been used for a very long time in telephone inquiry systems like IVRs. However, the quality was not that good back in the day. Today, the quality has improved vastly.
Speech synthesis could also be used to read out text messages and emails on mobile phones and computers.
It is also used widely in other interactive multimedia applications.
Other Applications
Theoretically, speech synthesis could pretty much be used for all types of human-computer interactions. It could be used in warning and alarm systems to give you a better, more accurate understanding of the situation.
In the future, it could even be used in language interpretation, video conferencing, and similar applications.
How does speech synthesis work?
There are basically three stages in speech synthesis: turning text into words, turning words into phonemes, and turning phonemes into sound.
Text to words
This phase involves preprocessing, or normalization. It focuses on reducing ambiguity by narrowing down the various ways in which a piece of text could be read, leaving only the most appropriate one.
It involves cleaning up the text so that the computer makes fewer mistakes while reading the words aloud. Numbers, dates, times, abbreviations, acronyms, special characters, etc. need to be converted into words.
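For a sense of what this cleanup step looks like, here is a deliberately tiny normalization sketch in Python. The abbreviation and number tables are made up for illustration; a real TTS front end handles far more cases (dates, times, currencies, acronyms, and so on) and resolves ambiguous expansions from context.

```python
import re

# Toy normalizer: expands a few abbreviations and single digits into words.
# The tables below are illustrative only, not taken from any real TTS system.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}

def normalize(text: str) -> str:
    # Expand known abbreviations first, then spell out single digits.
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return re.sub(r"\d", lambda m: NUMBER_WORDS.get(m.group(), m.group()), text)

print(normalize("Dr. Smith lives at 5 Main St."))
# -> Doctor Smith lives at five Main Street
```

Even this toy version shows why normalization is hard: "Dr." might mean "Doctor" or "Drive", and "St." might mean "Street" or "Saint", so the right expansion depends on context.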
Computers use Hidden Markov Models or neural networks to find the most appropriate pronunciation.
Preprocessing also has to deal with homographs, words that are spelt the same but pronounced differently depending on the meaning, such as "read" or "bass".
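To make the homograph problem concrete, here is a toy Python function that picks a pronunciation for "read" from a single, crude context cue. The cue words and the ARPAbet-style output are purely illustrative; as noted above, real systems learn these decisions from data with statistical models rather than hand-written rules.

```python
# Toy homograph disambiguation for "read": choose the past-tense
# pronunciation after perfect-tense cue words, otherwise the present-tense one.
PAST_TENSE_CUES = {"had", "has", "have", "was", "already"}

def pronounce_read(previous_word: str) -> str:
    # "R EH1 D" ~ past tense ("I have read it"); "R IY1 D" ~ present ("I will read it").
    return "R EH1 D" if previous_word.lower() in PAST_TENSE_CUES else "R IY1 D"

print(pronounce_read("have"))  # R EH1 D
print(pronounce_read("will"))  # R IY1 D
```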
Words to phonemes
After figuring out the words, the synthesizer has to generate the speech sounds that make up those words. For every word, the computer needs a list of the phonemes that make up that word.
As an alternative, the computer could also break written words down into their graphemes. These are the written component units of a word, usually the individual letters or syllables that make it up. The synthesizer then generates phonemes that correspond to the graphemes by applying a set of letter-to-sound rules.
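A minimal sketch of this stage might look like the following. The tiny lexicon and letter-to-sound fallback rules are invented for illustration; real synthesizers rely on large pronunciation dictionaries (such as CMUdict) combined with trained grapheme-to-phoneme models for unknown words.

```python
# Toy word-to-phoneme lookup with a naive grapheme fallback (ARPAbet symbols).
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}
LETTER_RULES = {"a": "AE1", "b": "B", "e": "EH1", "o": "OW1", "t": "T"}

def to_phonemes(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]          # known word: use the dictionary entry
    # Unknown word: map each grapheme (letter) to a phoneme by rule.
    return [LETTER_RULES.get(letter, letter.upper()) for letter in word]

print(to_phonemes("hello"))  # ['HH', 'AH0', 'L', 'OW1']
print(to_phonemes("bot"))    # ['B', 'OW1', 'T']
```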
Phonemes to sound
There are three main approaches to turning phonemes into the sound that the computer reads aloud when converting text to speech.
- Concatenative synthesis: This stitches speech together from a preloaded library of short recorded snippets of human speech, which the system rearranges as needed.
- Formant synthesis: This generates the speech sounds the system needs from scratch, much like a music synthesizer (a toy example follows below).
- Articulatory synthesis: This generates speech by modeling the intricate human vocal apparatus itself.
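To make the "from scratch" idea behind formant synthesis concrete, here is a crude Python sketch that mixes sine waves near typical formant frequencies of an "ah" vowel and writes half a second of audio to a WAV file. It assumes NumPy and SciPy are installed, and it is only a caricature: a real formant synthesizer shapes a glottal source with resonant filters and varies the formants over time.

```python
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16000
DURATION = 0.5  # seconds
t = np.linspace(0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)

# Rough formant frequencies (Hz) and relative amplitudes for an "ah"-like vowel.
formants = [(700, 1.0), (1200, 0.5), (2600, 0.2)]
signal = sum(amp * np.sin(2 * np.pi * freq * t) for freq, amp in formants)
signal = signal / np.max(np.abs(signal))  # normalize to [-1, 1]

# Write 16-bit PCM audio; playing vowel_ah.wav gives a buzzy, vowel-like tone.
wavfile.write("vowel_ah.wav", SAMPLE_RATE, (signal * 32767).astype(np.int16))
```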