Thursday, 23 June 2011

Kinect SDK: Speech Recognition using the Microsoft Speech API


A previous post examined using the Kinect audio API to access the Kinect audio stream, and associated beamforming data. A key aspect of a natural user interface (NUI) is speech recognition. The Kinect sensor’s microphone array can be used as an input device for speech recognition-based applications.

In order to develop a speech recognition-based application, the following components must be installed:
  1. The Microsoft Speech Platform SDK.
  2. The Microsoft Speech Platform Server Runtime.
  3. The Kinect for Windows Runtime Language Pack.
The first step in implementing this application (after creating a new WPF project) is to include a reference to Microsoft.Research.Kinect. This assembly is installed in the GAC, and calls unmanaged functions from managed code. I then developed a basic UI, using XAML, that displays recognized and hypothesized words. The code is shown below. MainWindow.xaml is wired up to the Window_Closed event, and binds to three properties in the code-behind that contain the recognized word, the hypothesized word, and whether the word was rejected.

<Window x:Class="Speech_Demo.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:conv="clr-namespace:Speech_Demo"
        x:Name="_this"
        Title="MainWindow" ResizeMode="NoResize" SizeToContent="Height" Width="300"
        Closed="Window_Closed">
    <Window.Resources>
        <conv:BooleanToStringConverter x:Key="boolStr" />
    </Window.Resources>
    <StackPanel>
        <StackPanel Margin="10" Orientation="Horizontal">
            <TextBlock Text="Recognized word: " />
            <TextBlock Text="{Binding ElementName=_this, Path=RecognizedWord}" />
        </StackPanel>
        <StackPanel Margin="10" Orientation="Horizontal">
            <TextBlock Text="Hypothesized word: " />
            <TextBlock Text="{Binding ElementName=_this, Path=HypothesizedWord}" />
        </StackPanel>
        <StackPanel Margin="10" Orientation="Horizontal">
            <TextBlock Text="Word rejected: " />
            <TextBlock Text="{Binding ElementName=_this, Path=WordRejected, 
                Converter={StaticResource boolStr}}" />
        </StackPanel>
        <TextBlock FontSize="16"
                   Text="Say Content Master rocks!" />
    </StackPanel>
</Window>
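The resources reference a BooleanToStringConverter, which is not shown in the listing. A minimal sketch of such a converter follows; the "Yes"/"No" display strings are an assumption, not taken from the original code.

```csharp
using System;
using System.Globalization;
using System.Windows.Data;

namespace Speech_Demo
{
    // Converts the WordRejected boolean into a display string for the UI.
    // A hypothetical sketch; the actual strings used by the post are unknown.
    public class BooleanToStringConverter : IValueConverter
    {
        public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
        {
            return (value is bool && (bool)value) ? "Yes" : "No";
        }

        public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
        {
            // The binding is one-way, so conversion back is not required.
            throw new NotSupportedException();
        }
    }
}
```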

The following namespaces must be imported into this application. The key ones are Microsoft.Research.Kinect.Audio, Microsoft.Speech.AudioFormat, and Microsoft.Speech.Recognition.
using System;
using System.ComponentModel;
using System.IO;
using System.Linq;
using System.Threading;
using System.Windows;
using Microsoft.Research.Kinect.Audio;
using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;

MainWindow.xaml.cs contains the following class-level declarations.
        private Thread t;
        private const string RecognizerId = "SR_MS_en-US_Kinect_10.0";
        private string recognizedWord;
        private string hypothesizedWord;
        private bool wordRejected;
        private KinectAudioSource source;
        private SpeechRecognitionEngine sre;
        private Stream stream;

        public string RecognizedWord
        {
            get { return this.recognizedWord; }
            set { this.recognizedWord = value; this.OnPropertyChanged("RecognizedWord"); }
        }

        public string HypothesizedWord
        {
            get { return this.hypothesizedWord; }
            set { this.hypothesizedWord = value; this.OnPropertyChanged("HypothesizedWord"); }
        }

        public bool WordRejected
        {
            get { return this.wordRejected; }
            set { this.wordRejected = value; this.OnPropertyChanged("WordRejected"); }
        }

The managed Kinect audio API runs the DirectX Media Object (DMO) on a background thread, which requires the multithreaded apartment (MTA) threading model; otherwise, the interop layer throws an InvalidCastException. Because WPF applications require single-threaded apartment (STA) threading, a separate MTA thread must be used for Kinect audio. Therefore, the constructor initializes and starts an MTA thread that performs the audio capture.
        public MainWindow()
        {
            InitializeComponent();
            this.t = new Thread(new ThreadStart(this.CaptureAudio));
            this.t.SetApartmentState(ApartmentState.MTA);
            this.t.Start();
        }

The next step is to create and configure a KinectAudioSource object, which represents the Kinect sensor’s microphone array. The KinectAudioSource object is configured so that feature mode is enabled, automatic gain control (AGC) is disabled (which is required for speech recognition), and the system mode is set to an adaptive beam without acoustic echo cancellation (AEC). In this mode, the microphone array functions as a single directional microphone that is pointed within a few degrees of the audio source. A speech recognition engine is then created; a LINQ query is used to obtain the ID of the first matching recognizer in the list of installed recognizers. This ID is used to create the SpeechRecognitionEngine object. A grammar is then created and loaded that contains the words to be recognized. The GrammarBuilder object specifies the culture to match that of the recognizer.

The speech recognition engine raises three events:
  1. SpeechHypothesized. This event occurs for each attempted word. It passes the event handler an object that contains the best-fitting word from the word set, and a measure of the estimate’s confidence.
  2. SpeechRecognized. This event occurs when an attempted word is recognized as a member of the word set. It passes the event handler an object that contains the recognized word.
  3. SpeechRecognitionRejected. This event occurs when an attempted word is rejected because it does not match any member of the word set.
After the speech recognition engine is configured, the speech recognition process begins by starting the audio capture stream, feeding the stream into the speech recognition engine, and starting the recognition process asynchronously on a background thread.
        private void CaptureAudio()
        {
            this.source = new KinectAudioSource();
            this.source.FeatureMode = true;
            this.source.AutomaticGainControl = false;
            this.source.SystemMode = SystemMode.OptibeamArrayOnly;

            RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers().
                Where(r => r.Id == RecognizerId).FirstOrDefault();
            if (ri == null)
            {
                return;
            }

            this.sre = new SpeechRecognitionEngine(ri.Id);

            // The words in the grammar, matching the prompt in the UI.
            var words = new Choices();
            words.Add("content");
            words.Add("master");
            words.Add("rocks");

            var gb = new GrammarBuilder();
            gb.Culture = ri.Culture;
            gb.Append(words);
            var g = new Grammar(gb);
            this.sre.LoadGrammar(g);

            this.sre.SpeechRecognized += 
                new EventHandler<SpeechRecognizedEventArgs>(this.sre_SpeechRecognized);
            this.sre.SpeechHypothesized += 
                new EventHandler<SpeechHypothesizedEventArgs>(this.sre_SpeechHypothesized);
            this.sre.SpeechRecognitionRejected += 
                new EventHandler<SpeechRecognitionRejectedEventArgs>(this.sre_SpeechRecognitionRejected);

            this.stream = this.source.Start();
            this.sre.SetInputToAudioStream(this.stream, new SpeechAudioFormatInfo(
                EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
            this.sre.RecognizeAsync(RecognizeMode.Multiple);
        }

The three event handlers are shown below. Each simply updates one of the properties that the UI binds to.
        private void sre_SpeechRecognitionRejected(object sender, 
            SpeechRecognitionRejectedEventArgs e)
        {
            this.WordRejected = true;
        }

        private void sre_SpeechHypothesized(object sender, 
            SpeechHypothesizedEventArgs e)
        {
            this.HypothesizedWord = e.Result.Text;
        }

        private void sre_SpeechRecognized(object sender, 
            SpeechRecognizedEventArgs e)
        {
            this.RecognizedWord = e.Result.Text;
            this.WordRejected = false;
        }

The Window_Closed event handler stops the audio capture from the Kinect sensor, disposes of the capture stream, aborts the background thread, and cancels the recognition process before disposing of the speech recognition engine.
        private void Window_Closed(object sender, EventArgs e)
        {
            if (this.source != null)
                this.source.Stop();
            if (this.stream != null)
                this.stream.Dispose();
            if (this.t != null)
                this.t.Abort();
            if (this.sre != null)
            {
                this.sre.RecognizeAsyncCancel();
                this.sre.Dispose();
            }
        }

The change notification code is omitted from the listings above because it is so standard.
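For completeness, a conventional INotifyPropertyChanged implementation (a standard sketch, not copied from the original post) looks roughly like this:

```csharp
using System.ComponentModel;
using System.Windows;

namespace Speech_Demo
{
    // Standard INotifyPropertyChanged plumbing for MainWindow; the property
    // setters call OnPropertyChanged so that the WPF bindings refresh.
    public partial class MainWindow : Window, INotifyPropertyChanged
    {
        public event PropertyChangedEventHandler PropertyChanged;

        protected void OnPropertyChanged(string propertyName)
        {
            PropertyChangedEventHandler handler = this.PropertyChanged;
            if (handler != null)
            {
                handler(this, new PropertyChangedEventArgs(propertyName));
            }
        }
    }
}
```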

The application is shown below. Each time you speak a word, the speech recognition engine compares your speech with the templates for the words in the grammar.



The Kinect for Windows SDK beta from Microsoft Research is a starter kit for application developers. It enables access to the Kinect sensor, and experimentation with its features. In order to develop a speech recognition-based application, you must install the Microsoft Speech Platform SDK, the Microsoft Speech Platform Server Runtime, and the Kinect for Windows Runtime Language Pack.
