There are two types of automatic speech recognition: Grammar ASR and Transcription ASR.
Grammar ASR uses a closed set of rules (a grammar) that includes all possible inputs from the user. Think of the audio phone trees that give you several options to direct your call to the right department.
Grammar ASR doesn’t need to understand the universe of possible words and numbers. The user is prompted with a closed set of options: “say 1 for accounts receivable or 2 for all other inquiries.”
With grammar ASR, it’s possible to achieve an extremely high level of accuracy because there is a limited number of possible choices for each utterance, which reduces the probability of error.
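To make the idea of a closed grammar concrete, here is a minimal sketch in Python. It works on text that has already been recognized, whereas a real grammar ASR engine scores audio against the grammar, so treat it as an illustration of the closed-set idea only. The `GRAMMAR` table, the option names, and the `match_grammar` helper are all hypothetical and are not part of any particular product's API.

```python
import re

# Hypothetical grammar for the phone-tree prompt quoted above: every accepted
# phrase maps to the option it selects. Anything else is out of grammar.
GRAMMAR = {
    "1": "accounts_receivable",
    "one": "accounts_receivable",
    "accounts receivable": "accounts_receivable",
    "2": "other_inquiries",
    "two": "other_inquiries",
    "all other inquiries": "other_inquiries",
}


def normalize(utterance: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", utterance.lower())
    return " ".join(cleaned.split())


def match_grammar(utterance: str):
    """Return the selected option, or None if the utterance is out of grammar."""
    return GRAMMAR.get(normalize(utterance))


print(match_grammar("One."))                      # accounts_receivable
print(match_grammar("I have a billing question"))  # None
```

Because only six phrases are valid, the matcher never has to choose among thousands of similar-sounding words, which is the intuition behind the high accuracy figures for grammar ASR.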
In grammar ASR, a good engine can achieve upwards of 96% accuracy, while a great one can reach 98 to 99% accuracy, or a Word Error Rate (WER) of under 2%.
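Word Error Rate is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the engine's output, divided by the number of words in the reference. Accuracy is then roughly 1 minus WER. The sketch below assumes simple whitespace tokenization and lowercasing. The function name `word_error_rate` and the sample sentences are illustrative, and production scoring tools usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "say one for accounts receivable or two for other inquiries"
hypothesis = "say one for accounts receivable or too for other inquiries"
print(word_error_rate(reference, hypothesis))  # 0.1
```

One wrong word out of ten gives a WER of 10%, which corresponds to about 90% accuracy.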
Transcription ASR
Transcription ASR is much more challenging. In transcription ASR, the engine has to recognize every possible word, in every available dialect.
Achieving high levels of accuracy in Transcription ASR requires a language model that covers every regional dialect. It is a massive data problem.
Many of the leading Transcription ASR engines have been stuck below 90% accuracy for years.
Recently, there have been stunning breakthroughs in transcription ASR thanks to deep neural networks (DNNs). DNN-based engines can achieve transcription ASR accuracy of over 90%, or a WER of less than 10%.
The key to understanding the two types of speech recognition is that one is not necessarily better than the other. It depends on what you want to do with it.
Grammar ASR vs. Transcription ASR
Grammar ASR tends to be the best choice for Interactive Voice Response (IVR) systems.
If you have a limited set of options you want to give a caller, you’ll want to use grammar ASR.
If you want to use automatic speech recognition to transcribe live or recorded audio, the type of speech recognition you’ll want to use is Transcription ASR. Voicebots also use Transcription ASR.
Transcription ASR will be able to accurately recognize and transcribe words from the entire language.
Speech recognition and regional accents
If your system needs to understand a range of dialects and accents, you should use a Transcription ASR engine that uses deep neural networks trained on large, all-encompassing data sets. Such an engine should handle the variability without placing limits on the number of ways a word can be pronounced. This approach will be more efficient for your business than deploying multiple languages to account for every dialect.
Whether you need grammar ASR, transcription ASR, or a hybrid of these two types of speech recognition, LumenVox by Capacity can help. Book a free consultation to discuss which type of speech recognition is right for your business.