Speech Recognition

A sub-field of computational linguistics that develops methodologies and technologies enabling computers to recognise spoken language and translate it into text

What is Speech Recognition?

Speech recognition is the task of detecting spoken words, but there is more to it than recognising individual sounds in the audio: sequences of sounds need to match existing words, and sequences of words should make sense in the language. This is called “language modelling.” Language models are typically trained on very large corpora of text, often orders of magnitude larger than the acoustic data.
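The idea of language modelling can be illustrated with a toy bigram model (a sketch only: real ASR language models use far larger corpora plus smoothing). Two acoustically similar hypotheses can receive very different language-model scores:

```python
from collections import Counter

def train_bigram(corpus):
    """Count bigram and unigram frequencies from whitespace-tokenised sentences."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, unigrams

def sequence_prob(sentence, bigrams, unigrams):
    """P(sentence) as a product of bigram probabilities P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = ["recognise speech", "recognise speech today", "wreck a nice beach"]
bigrams, unigrams = train_bigram(corpus)
# The sound-alike hypotheses score very differently under the model:
print(sequence_prob("recognise speech", bigrams, unigrams))       # high
print(sequence_prob("wreck a nice speech", bigrams, unigrams))    # 0.0
```

An ASR decoder combines scores like these with acoustic scores to prefer word sequences that are plausible in the language.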

While speech recognition has been around for decades, recent advances in deep learning have finally made it accurate enough to be useful outside of carefully controlled environments. Speech recognition is built into our phones, our game consoles and our smart watches. It’s even automating our homes.

Common Tools and Libraries

AI Speech Lab

AI Singapore (AISG) has set up an AI Speech Lab to develop a speech recognition system that can interpret and process conversations in the unique vocabulary used by Singaporeans, including Singlish and dialects.

SpeechLab technology is available as a service for both batch and near-real-time processing. Please contact AI Singapore for further information.

Kaldi

Kaldi is an open-source toolkit for working with speech data. It is used in voice-related applications, mostly for speech recognition, but also for other tasks such as speaker recognition and speaker diarisation.
Developer's Resources:
https://github.com/kaldi-asr/kaldi
Kaldi GStreamer server: https://github.com/jcsilva/docker-kaldi-gstreamer-server

Picovoice/Porcupine

Porcupine is a self-service, highly-accurate, and lightweight wake word (voice control) engine. It enables developers to build always-listening voice-enabled applications/platforms.

Developer's Resource: https://github.com/Picovoice/Porcupine
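Porcupine's actual API differs (see the repository above), but the always-listening pattern such engines implement, where short fixed-length PCM frames are fed in and each frame yields a detection result, can be sketched with a hypothetical energy-threshold detector standing in for the real engine:

```python
# Sketch of an always-listening frame loop. EnergyDetector is a hypothetical
# stand-in for a real wake word engine such as Porcupine, which similarly
# consumes short fixed-length PCM frames and reports a detection per frame.

FRAME_LENGTH = 512  # samples per frame (real engines also use a fixed frame length)

class EnergyDetector:
    """Toy detector: 'wakes' when mean absolute amplitude crosses a threshold."""
    def __init__(self, threshold=1000):
        self.threshold = threshold

    def process(self, frame):
        energy = sum(abs(s) for s in frame) / len(frame)
        return energy >= self.threshold

def listen(frames, detector):
    """Return the index of the first frame that triggers the detector, or -1."""
    for i, frame in enumerate(frames):
        if detector.process(frame):
            return i
    return -1

# Simulated audio: two quiet frames followed by a loud one.
quiet = [10] * FRAME_LENGTH
loud = [5000] * FRAME_LENGTH
print(listen([quiet, quiet, loud], EnergyDetector()))  # → 2
```

In a real deployment the frames would come from a microphone stream, and a detection would hand the subsequent audio off to a full speech recogniser.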

Google Speech-To-Text

Speech-to-text conversion powered by machine learning and available for short-form or long-form audio.

Developer's Resource:
https://cloud.google.com/speech-to-text/

Azure Cognitive Services

Create apps, websites and bots with intelligent algorithms to see, hear, speak, understand and interpret your user needs through natural methods of communication.

Developer's Resource:
https://azure.microsoft.com/en-us/services/cognitive-services/

DeepSpeech2

An open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on Baidu's Deep Speech 2 paper and built on the PaddlePaddle platform

Developer's Resource:
https://github.com/PaddlePaddle/DeepSpeech

100E Use Cases

  1. MSF – Support an automated hotline that enables citizens to ask questions about available programs and enrollment processes. Increase hotline capacity with improved accuracy and performance, as well as add a sentiment-monitoring capability.
  2. SCDF – Use SpeechLab technology to support verbatim transcription of calls, so that call-takers can focus on listening rather than typing, and translation into English, so that call-takers can better understand the conversation. Transcripts will also be used for further analysis.
  3. MCI – Use SpeechLab batch transcription services to support government meetings and events.
  4. Socibot – The AI demonstration platform will use SpeechLab technology, integrated with an Azure Cognitive Services knowledge base, to answer questions from local Singaporeans more accurately. Socibot also uses Porcupine for wake word detection to reduce latency and improve the user experience.

Open Datasets

National Speech Corpus

Contains 2,000 hours of locally accented audio and text transcriptions

Free Spoken Digit Dataset

A simple audio/speech dataset consisting of recordings of spoken digits in wav files at 8kHz
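Files in this format can be inspected with Python's standard-library `wave` module. As a sketch, the snippet below writes a synthetic one-second 8 kHz tone in memory as a stand-in for a dataset file, then reads its header back (the same reading code works with a path on disk):

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # the dataset's files are 8 kHz wavs

# Write a synthetic one-second 440 Hz tone (stand-in for a dataset file).
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)       # mono
    out.setsampwidth(2)       # 16-bit samples
    out.setframerate(SAMPLE_RATE)
    samples = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
               for t in range(SAMPLE_RATE)]
    out.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read it back and inspect the header, as you would with a dataset file.
buf.seek(0)
with wave.open(buf, "rb") as inp:
    rate, nframes = inp.getframerate(), inp.getnframes()
print(rate, nframes)  # → 8000 8000
```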

LibriSpeech

A large-scale corpus of around 1,000 hours of English speech
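Corpora like LibriSpeech are standard benchmarks, and ASR accuracy on them is usually reported as word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # → 0.16666666666666666
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.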

The Spoken Wikipedia Corpora

Corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia

TIMIT

A collection of recordings of 630 speakers of American English

Google Audioset

Large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos

Related Articles

  1. How to start with Kaldi and Speech Recognition
    1. Link to article: https://towardsdatascience.com/how-to-start-with-kaldi-and-speech-recognition-a9b7670ffff6
  2. Simple guide to Kaldi – an efficient open source speech recognition tool for extreme beginners
    1. Link to article: https://medium.com/@nikhilamunipalli/simple-guide-to-kaldi-an-efficient-open-source-speech-recognition-tool-for-extreme-beginners-98a48bb34756
  3. Creating voice assistant for games tutorial for Fifa
    1. Link to article: https://towardsdatascience.com/creating-voice-assistant-for-games-tutorial-for-fifa-71cfbe428bd1