Speech Recognition HOWTO

Stephen Cook


Revision History
Revision v2.0, April 19, 2002. Revised by: scc
Changed license information (now GFDL) and added a new publication.
Revision v1.2, February 5, 2002. Revised by: scc
Added more commercial software listings (sent by Mayur Patel).
Revision v1.1, October 5, 2001. Revised by: scc
Added info for Vocalis Speechware. Fixed/Updated various other items.
Revision v1.0, November 20, 2000. Revised by: scc
Added info on L and H and HTK
Revision v0.5, September 13, 2000. Revised by: scc
Initial HOWTO Submission

Table of Contents
1. Legal Notices
1.1. Copyright/License
1.2. Disclaimer
1.3. Trademarks
2. Foreword
2.1. About This Document
2.2. Acknowledgements
2.3. Comments/Updates/Feedback
2.4. ToDo
2.5. Revision History
3. Introduction
3.1. Speech Recognition Basics
3.2. Types of Speech Recognition
3.3. Uses and Applications
4. Hardware
4.1. Sound Cards
4.2. Microphones
4.3. Computers/Processors
5. Speech Recognition Software
5.1. Free Software
5.2. Commercial Software
6. Inside Speech Recognition
6.1. How Recognizers Work
6.2. Digital Audio Basics
7. Publications
7.1. Books
7.2. Internet

1. Legal Notices

2. Foreword

3. Introduction

3.1. Speech Recognition Basics

Speech recognition is the process by which a computer (or other type of machine) identifies spoken words. Basically, it means talking to your computer and having it correctly recognize what you are saying.

The following definitions are the basics needed for understanding speech recognition technology.


Utterances

An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.

Speaker Dependence

Speaker dependent systems are designed around a specific speaker. They generally are more accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker will speak in a consistent voice and tempo. Speaker independent systems are designed for a variety of speakers. Adaptive systems usually start as speaker independent systems and utilize training techniques to adapt to the speaker to increase their recognition accuracy.


Vocabularies

Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single word. An entry can be as long as a sentence or two. Smaller vocabularies can have as few as 1 or 2 recognized utterances (e.g. "Wake Up"), while very large vocabularies can have a hundred thousand or more!


Accuracy

The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying if the spoken utterance is not in its vocabulary. Good automatic speech recognition (ASR) systems have an accuracy of 98% or more! The acceptable accuracy of a system really depends on the application.
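As a rough illustration of how accuracy might be scored, here is a minimal sketch in Python. The function name and the trial data are made up for this example. It assumes each trial is a (spoken, recognized) pair, with None standing for "not in the vocabulary" on the spoken side and "rejected" on the recognized side, so that correctly rejecting an unknown utterance counts as a correct result.

```python
def recognition_accuracy(trials):
    """Fraction of trials where the recognizer's answer matches what was said.

    Each trial is a (spoken, recognized) pair. `spoken` is None when the
    speaker said something outside the vocabulary, and `recognized` is None
    when the recognizer rejected the input, so (None, None) is a correct result.
    """
    if not trials:
        raise ValueError("need at least one trial")
    correct = sum(1 for spoken, recognized in trials if spoken == recognized)
    return correct / len(trials)


# Hypothetical test run against a three-entry vocabulary.
trials = [
    ("wake up", "wake up"),      # correct match
    ("open file", "open file"),  # correct match
    ("close", "open file"),      # wrong utterance recognized
    (None, None),                # out-of-vocabulary input, correctly rejected
    (None, "wake up"),           # out-of-vocabulary input, falsely accepted
]
accuracy = recognition_accuracy(trials)
print(f"accuracy = {accuracy:.0%}")
```

Real evaluations usually break the errors down further (wrong word, false acceptance, false rejection), since which kind of error matters most depends on the application.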


Training

Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy.

Training can also be used by speakers that have difficulty speaking, or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.

3.2. Types of Speech Recognition

Speech recognition systems can be separated into several classes according to the types of utterances they can recognize. These classes exist because one of the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they're using.

Isolated Words

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on BOTH sides of the sample window. This doesn't mean that they accept only single words, but they do require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). Isolated Utterance might be a better name for this class.

Connected Words

Connected word systems (or more correctly 'connected utterances') are similar to isolated word systems, but allow separate utterances to be 'run-together' with a minimal pause between them.

Continuous Speech

Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it's computer dictation.

Spontaneous Speech

There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

Voice Verification/Identification

Some ASR systems have the ability to identify specific users. This document doesn't cover verification or security systems.

4. Hardware

5. Speech Recognition Software

5.1. Free Software

Much of the free software listed here is available for download at: http://sunsite.uio.no/pub/Linux/sound/apps/speech/

5.1.11. More Free Software?

If you know of free software that isn't included in the above list, please send me a note at: scook@gear21.com. If you're in the mood, you can also send me where to get a copy of the software, and any impressions you may have about it. Thanks!

5.2. Commercial Software

6. Inside Speech Recognition

6.1. How Recognizers Work

Recognition systems can be broken down into two main types. Pattern Recognition systems compare incoming patterns to known/trained patterns to determine a match. Acoustic Phonetic systems use knowledge of the human body (speech production and hearing) to compare speech features (phonetics such as vowel sounds). Most modern systems focus on the pattern recognition approach because it combines nicely with current computing techniques and tends to have higher accuracy.

Most recognizers can be broken down into the following steps:

  1. Audio recording and Utterance detection

  2. Pre-Filtering (pre-emphasis, normalization, banding, etc.)

  3. Framing and Windowing (chopping the data into a usable format)

  4. Filtering (further filtering of each window/frame/freq. band)

  5. Comparison and Matching (recognizing the utterance)

  6. Action (Perform function associated with the recognized pattern)

Although each step seems simple, each one can involve a multitude of different (and sometimes completely opposite) techniques.

(1) Audio/Utterance Recording can be accomplished in a number of ways. Starting points can be found by comparing ambient audio levels (acoustic energy in some cases) with the sample just recorded. Endpoint detection is harder because speakers tend to leave "artifacts" including breathing/sighing, teeth chatters, and echoes.
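One simple way to find utterance boundaries by comparing against the ambient level can be sketched as below. This is only an illustration, not how any particular package does it. The function names, the frame length, the number of leading "noise only" frames, and the threshold factor are all assumptions picked for the example, and the test signal is synthetic.

```python
import math


def frame_energy(samples):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in samples) / len(samples)


def detect_utterance(samples, frame_len=80, noise_frames=5, factor=4.0):
    """Return (start, end) sample indices spanning the loud frames, or None.

    The ambient level is estimated from the first `noise_frames` frames,
    which are assumed to contain only background noise. A frame counts as
    speech when its energy exceeds `factor` times that ambient level.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [frame_energy(f) for f in frames]
    ambient = sum(energies[:noise_frames]) / noise_frames
    threshold = max(ambient * factor, 1e-9)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic signal at 8000 samples/sec: quiet hum, a loud tone, quiet hum.
quiet = [0.01 * math.sin(2 * math.pi * 50 * n / 8000) for n in range(800)]
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(1600)]
signal = quiet + loud + quiet
print(detect_utterance(signal))
```

A fixed threshold like this finds the start fairly well, but it will happily include a breath or an echo in the utterance, which is exactly why endpoint detection needs more care in practice.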

(2) Pre-Filtering is accomplished in a variety of ways, depending on other features of the recognition system. The most common methods are the "Bank-of-Filters" method, which utilizes a series of audio filters to prepare the sample, and the Linear Predictive Coding method, which uses a prediction function to calculate differences (errors). Different forms of spectral analysis are also used.
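The two simplest pre-filtering operations named in the step list, pre-emphasis and normalization, can be sketched in a few lines. This does not show the Bank-of-Filters or Linear Predictive Coding methods. The function names and the 0.95 coefficient are assumptions chosen for illustration; samples are assumed to be floating point values.

```python
def pre_emphasize(samples, coeff=0.95):
    """Boost high frequencies with a first-order filter.

    Computes y[n] = x[n] - coeff * x[n-1]; the first sample is passed through.
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def normalize_peak(samples, peak=1.0):
    """Scale the samples so the largest absolute value equals `peak`."""
    largest = max((abs(s) for s in samples), default=0.0)
    if largest == 0.0:
        return list(samples)  # silence: nothing to scale
    return [s * peak / largest for s in samples]


# A constant (zero frequency) input is almost entirely removed by pre-emphasis.
flat = pre_emphasize([1.0, 1.0, 1.0])
print(flat)
print(normalize_peak([0.5, -0.25]))
```

Pre-emphasis is used because speech carries less energy at high frequencies; boosting them evens out the spectrum before later analysis.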

(3) Framing/Windowing involves separating the sample data into specific sizes. This is often rolled into step 2 or step 4. This step also involves preparing the sample boundaries for analysis (removing edge clicks, etc.)
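Here is a minimal sketch of framing and windowing. It assumes overlapping frames of 25 ms taken every 10 ms at 8000 samples/sec and a Hamming window to taper the frame edges; those sizes, the window choice, and the function names are assumptions for the example rather than values taken from any specific recognizer.

```python
import math


def hamming(length):
    """Hamming window coefficients; tapers both ends of a frame toward 0.08."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def split_frames(samples, frame_len, hop):
    """Chop samples into overlapping frames and apply the window to each.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


rate = 8000                      # samples per second
frame_len = rate * 25 // 1000    # 25 ms -> 200 samples
hop = rate * 10 // 1000          # 10 ms -> 80 samples
frames = split_frames([1.0] * 1000, frame_len, hop)
print(len(frames), len(frames[0]))
```

Tapering the edges is one way of "removing edge clicks": without it, the abrupt start and end of each frame show up as false high-frequency content in later analysis.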

(4) Additional Filtering is not always present. It is the final preparation for each window before comparison and matching. Often this consists of time alignment and normalization.

(5) Comparison and Matching can draw on a huge number of techniques. Most involve comparing the current window with known samples. There are methods that use Hidden Markov Models (HMM), frequency analysis, differential analysis, linear algebra techniques/shortcuts, spectral distortion, and time distortion methods. All these methods are used to generate a score that says how probable and how close a match is.
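As one example of a time distortion method, the sketch below uses dynamic time warping (DTW) to compare an input against stored templates. DTW is my choice for illustration; the text above does not single it out. The sketch assumes each utterance has already been reduced to one number per frame (say, a frame energy), which is far simpler than real feature vectors, and the templates and function names are invented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    Allows one sequence to be stretched or squeezed in time relative to
    the other; the cost of aligning two frames is their absolute difference.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def best_match(sample, templates):
    """Name of the template with the smallest DTW distance to the sample."""
    return min(templates, key=lambda name: dtw_distance(sample, templates[name]))


# Hypothetical one-value-per-frame templates for a two-word vocabulary.
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 2, 1, 2, 4]}
spoken = [1, 1, 3, 4, 4, 3, 1]   # "yes" said more slowly
print(best_match(spoken, templates))
```

A real matcher would also set a distance limit above which the input is rejected as not in the vocabulary, instead of always returning the nearest template.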

(6) Actions can be just about anything the developer wants. *GRIN*
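A common way to wire recognized utterances to actions is a simple lookup table. The sketch below is only an illustration; the utterances, the function names, and the returned strings are all made up.

```python
def open_editor():
    return "starting editor"


def lock_screen():
    return "locking screen"


# Hypothetical mapping from recognized utterance to the function to run.
ACTIONS = {
    "open editor": open_editor,
    "lock screen": lock_screen,
}


def perform(utterance):
    """Run the action tied to a recognized utterance; None if there is none."""
    action = ACTIONS.get(utterance)
    if action is None:
        return None
    return action()


print(perform("open editor"))
print(perform("make coffee"))
```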

6.2. Digital Audio Basics

Audio is inherently an analog phenomenon. Recording a digital sample is done by converting the analog signal from the microphone to a digital signal through the A/D converter in the sound card. When a microphone is operating, sound waves vibrate the magnetic element in the microphone, causing an electrical current to flow to the sound card (think of a speaker working in reverse). Basically, the A/D converter records the value of the electrical voltage at specific intervals.

There are two important factors during this process. First is the "sample rate", or how often to record the voltage values. Second is the "bits per sample", or how accurately the value is recorded. A third item is the number of channels (mono or stereo), but for most ASR applications mono is sufficient. Most applications use pre-set values for these parameters and users shouldn't change them unless the documentation suggests it. Developers should experiment with different values to determine what works best with their algorithms.
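These three parameters together decide how much raw data the recognizer has to chew through. A quick bit of arithmetic for uncompressed audio (the function name is just for this example):

```python
def bytes_per_second(sample_rate, bits_per_sample, channels=1):
    """Raw data rate of uncompressed linear audio, in bytes per second."""
    return sample_rate * channels * bits_per_sample // 8


print(bytes_per_second(8000, 8))        # 8 kHz, 8-bit, mono
print(bytes_per_second(16000, 16))      # 16 kHz, 16-bit, mono
print(bytes_per_second(44100, 16, 2))   # 44.1 kHz, 16-bit, stereo
```

Going from 8 kHz/8-bit to 16 kHz/16-bit quadruples the amount of data for the same length of speech.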

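So what is a good sample rate for ASR? A sample rate can only capture frequencies up to half its value, so 8000 samples/sec (8kHz) records speech content up to 4kHz. Speech as a whole is relatively low bandwidth (mostly between 100Hz and 8kHz), and telephone-quality audio gets by with the 8kHz rate, so 8kHz is sufficient for most basic ASR. But some people prefer 16000 samples/sec (16kHz) because it covers the range up to 8kHz and so provides more accurate high frequency information. If you have the processing power, use 16kHz. For most ASR applications, sampling rates higher than about 22kHz are a waste.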
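The highest frequency a sample rate can represent is half that rate (the Nyquist limit). The sketch below shows what happens above the limit: at 8000 samples/sec, a 5 kHz tone produces exactly the same sample values as a 3 kHz tone, so the recognizer cannot tell them apart. The function names are made up for the example, and cosine tones are used so the two sets of samples line up exactly.

```python
import math


def nyquist_limit(rate_hz):
    """Highest frequency that a given sample rate can represent."""
    return rate_hz / 2


def sample_tone(freq_hz, rate_hz, count):
    """Sample a cosine tone of `freq_hz` at `rate_hz`, returning `count` values."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]


rate = 8000
below = sample_tone(3000, rate, 16)   # under the 4 kHz limit
above = sample_tone(5000, rate, 16)   # over the limit, aliases down to 3 kHz
max_diff = max(abs(a - b) for a, b in zip(below, above))
print(nyquist_limit(rate), max_diff < 1e-9)
```

This is why sound hardware normally filters out everything above the limit before sampling.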

And what is a good value for "bits per sample"? 8 bits per sample will record values between 0 and 255, which means that the position of the microphone element is recorded as one of 256 positions. 16 bits per sample divides the element position into 65536 possible values. Similar to sample rate, if you have enough processing power and memory, go with 16 bits per sample. For comparison, an audio Compact Disc is encoded with 16 bits per sample at 44.1kHz.
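To see what the extra bits buy, the following sketch rounds a value to 8 and to 16 bits and measures the error. It assumes the input level is a float between -1.0 and 1.0 and maps it onto unsigned integer codes; the function names and that input range are assumptions made for the example.

```python
def quantize(value, bits):
    """Map a level in [-1.0, 1.0] to an unsigned integer code of `bits` bits."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, value))
    return min(levels - 1, int((clipped + 1.0) / 2.0 * levels))


def dequantize(code, bits):
    """Return the level at the middle of the step that `code` stands for."""
    levels = 2 ** bits
    return (code + 0.5) / levels * 2.0 - 1.0


level = 0.3
for bits in (8, 16):
    code = quantize(level, bits)
    error = abs(dequantize(code, bits) - level)
    print(bits, code, error)
```

The worst-case error is half of one step, so each extra bit halves it. With 16 bits the rounding error is 256 times smaller than with 8.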

The encoding format used should be simple - linear signed or unsigned. Using a U-Law/A-Law algorithm or some other compression scheme is usually not worth it, as it will cost you in computing power, and not gain you much.
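Part of the appeal of linear encodings is that converting between them is trivial. The sketch below converts between unsigned 8-bit samples (silence at 128, as in 8-bit WAV files) and signed 16-bit samples (silence at 0). The function names are invented for the example.

```python
def unsigned8_to_signed16(samples):
    """Convert linear unsigned 8-bit samples (0..255) to signed 16-bit."""
    return [(s - 128) * 256 for s in samples]


def signed16_to_unsigned8(samples):
    """Convert linear signed 16-bit samples to unsigned 8-bit, dropping the low byte."""
    return [(s >> 8) + 128 for s in samples]


wide = unsigned8_to_signed16([0, 128, 255])
print(wide)
print(signed16_to_unsigned8(wide))
```

Compare that with a compressed format, where every sample has to go through a decoding step before any arithmetic can be done on it.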

7. Publications

If there is a publication that is not on this list, that you think should be, please send the information to me at: scook@gear21.com.

7.2. Internet


Newsgroup dedicated to computer and speech.


Newsgroup dedicated to users of speech software.


Newsgroup dedicated to speech software and hardware research.


Newsgroup dedicated to digital signal processing.


Newsgroup dedicated to the physics of sound.

DDLinux Email List

Speech Recognition on Linux Mailing List.

Linux Software Repository for speech applications


Russ Wilcox's List of Speech Recognition Links

(excellent) http://www.tiac.net/users/rwilcox/speech.html

Online Bibliography

Online Bibliography of Phonetics and Speech Technology Publications. http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html

MIT's Spoken Language Systems Homepage


Oregon Graduate Institute

Center for Spoken Language Understanding at Oregon Graduate Institute. An excellent location for developers and researchers. http://cslu.cse.ogi.edu/

IBM's ViaVoice Linux SDK


Mississippi State

Mississippi State Institute for Signal and Information Processing homepage with a large amount of useful information for developers. http://www.isip.msstate.edu/projects/speech/

Speech Technology

ASR software and accessories. http://www.speechtechnology.com

Speech Control

Speech Controlled Computer Systems. Microphones, headsets, and wireless products for ASR. http://www.speechcontrol.com


Microphones and accessories for ASR. http://www.microphones.com

21st Century Eloquence

"Speech Recognition Specialists." http://voicerecognition.com

Computing Out Loud

Primarily for Windows users, but good info. http://www.out-loud.com

Say I Can.com

"The Speech Recognition Information Source." http://www.sayican.com