Tuesday, 21 June 2011

Kinect SDK: Beamforming

Introduction

My previous post examined the basics of accessing the video and depth streams from the Kinect sensor. In this post I’ll focus on accessing the audio stream from the Kinect sensor.

Background

The Kinect sensor includes a four-element linear microphone array, which uses a 24-bit ADC and provides built-in signal processing, including noise suppression and echo cancellation. An array of microphones has the following advantages over a single microphone:

  • Improved audio quality. Microphone arrays can support more effective noise reduction and acoustic echo cancellation (AEC) algorithms than are possible with a single microphone.
  • Beamforming and source localization. By using the fact that the sound from a particular audio source arrives at each microphone in the array at a slightly different time, beamforming allows applications to determine the direction of the audio source and use the microphone array as a directional microphone.

For a detailed discussion of Beamforming, click here.
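To make the arrival-time idea concrete, here is a minimal sketch of how a delay between two microphones maps to a direction. It is not how the Kinect DMO works internally; it assumes a far-field source, a pair of microphones, and a speed of sound of 343 m/s. The class name, method name, and the microphone spacing used in the usage note are hypothetical and chosen only for illustration.

```csharp
using System;

// Hypothetical helper, not part of the Kinect SDK.
static class ArrivalAngle
{
    // Assumed speed of sound in air at roughly 20 degrees Celsius, in metres per second.
    const double SpeedOfSound = 343.0;

    // Estimates the direction of a far-field source from the delay between
    // two microphones. Returns the angle in radians, measured from the
    // direction straight ahead of the array (broadside).
    public static double FromDelay(double delaySeconds, double micSpacingMetres)
    {
        // Extra path length to the far microphone, relative to the spacing.
        double ratio = SpeedOfSound * delaySeconds / micSpacingMetres;

        // Clamp so that measurement noise cannot push the ratio outside [-1, 1].
        ratio = Math.Max(-1.0, Math.Min(1.0, ratio));

        return Math.Asin(ratio);
    }
}
```

With an illustrative spacing of 0.1 m, `ArrivalAngle.FromDelay(0.0, 0.1)` gives 0 radians (the source is straight ahead), and the angle grows towards ±π/2 as the delay approaches the time sound takes to travel the full spacing.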

The Kinect SDK exposes the sensor's audio processing as a DirectX Media Object (DMO). A DMO is a standard COM object that can be incorporated into a DirectShow graph or a Microsoft Media Foundation topology. The SDK also includes a managed audio API that allows applications to configure the DMO and perform operations such as starting, capturing, and stopping the audio stream. The managed API also includes events that provide the source and beam directions to the application.

Implementation

The application documented here gets audio data from the Kinect sensor and displays the source and beam directions of the audio in a WPF application. The first step in implementing this application (after creating a new WPF project) is to add a reference to Microsoft.Research.Kinect. This assembly is in the GAC, and calls unmanaged functions from managed code. To use the audio API you must import the Microsoft.Research.Kinect.Audio namespace into your application.

I then built a basic UI, using XAML, that displays the sound source position and the microphone array beam angle. The code can be seen below. MainWindow.xaml registers a handler for the Closed event, and binds to two properties in the code-behind that contain the sound source position data and the microphone array beam angle data.

<Window x:Class="BeamForming_Demo.MainWindow"
        x:Name="_this"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        Title="Beam Forming Demo" ResizeMode="NoResize" SizeToContent="Height" Width="300"
        Closed="Window_Closed">
    <StackPanel>   
        <StackPanel Margin="10" 
                    Orientation="Horizontal">
            <TextBlock Text="Sound Source Position: " />
            <TextBlock Text="{Binding ElementName=_this, Path=SoundSourcePosition}" />
        </StackPanel>
        <StackPanel Margin="10" 
                    Orientation="Horizontal">
            <TextBlock Text="Mic Array Beam Angle: " />
            <TextBlock Text="{Binding ElementName=_this, Path=MicArrayBeamAngle}"/>
        </StackPanel>
    </StackPanel>
</Window>

MainWindow.xaml.cs contains the following class-level declarations:

        private string micArrayBeamAngle = "0.000 radians";
        private string soundSourcePosition = "0.000 radians";
        private Thread t;
        public string MicArrayBeamAngle
        {
            get { return this.micArrayBeamAngle; }
            private set
            {
                this.micArrayBeamAngle = value;
                this.OnPropertyChanged("MicArrayBeamAngle");
            }
        }
        public string SoundSourcePosition
        {
            get { return this.soundSourcePosition; }
            private set
            {
                this.soundSourcePosition = value;
                this.OnPropertyChanged("SoundSourcePosition");
            }
        }

The managed Kinect audio API runs the DMO on a background thread, which requires the multithreaded apartment (MTA) threading model; otherwise, the interop layer throws an exception. Because this is a WPF application, which requires single-threaded apartment (STA) threading, a separate MTA thread must be used for Kinect audio. The constructor therefore initializes and starts an MTA thread that will do the audio capture.

        public MainWindow()
        {
            InitializeComponent();
            t = new Thread(new ThreadStart(CaptureAudio));
            t.SetApartmentState(ApartmentState.MTA);
            t.Start();
        }

The CaptureAudio method creates a buffer to store the audio stream and creates a new KinectAudioSource object, which is then configured. The KinectAudioSource object is started, and the audio data is read into the buffer. Reading continues until about 30 seconds of audio (16-bit samples at 16 kHz) has been captured. Whenever the confidence level of the sound source position is greater than 0.9, the microphone array beam angle and sound source position properties are updated, and change notification causes the UI to be updated.

        private void CaptureAudio()
        {
            var buffer = new byte[4096];
            const int time = 30;
            const int length = time * 2 * 16000; // 30 seconds, 16 bits per sample, 16kHz
            using (var source = new KinectAudioSource())
            {
                source.SystemMode = SystemMode.OptibeamArrayOnly;
                source.BeamChanged += source_BeamChanged;
                using (var audioStream = source.Start())
                {
                    int count, totalCount = 0;
                    while ((count = audioStream.Read(buffer, 0, buffer.Length)) > 0 && totalCount < length)
                    {
                        totalCount += count;
                        if (source.SoundSourcePositionConfidence > 0.9)
                        {
                            this.MicArrayBeamAngle = source.MicArrayBeamAngle.ToString() + " radians";
                            this.SoundSourcePosition = source.SoundSourcePosition.ToString() + " radians";
                        }
                    }
                }
            }
        }

The BeamChanged event handler ensures that when the beam angle of the microphone array changes, the appropriate property is updated (and, in turn, the UI is updated through change notification).

        private void source_BeamChanged(object sender, BeamChangedEventArgs e)
        {
            this.MicArrayBeamAngle = e.Angle.ToString() + " radians";
        }
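The properties start out as "0.000 radians", but calling ToString() on the raw values produces the full double precision. As an optional refinement, a small formatting helper keeps the display consistent. This is a sketch under my own assumptions: the AngleFormatter class and its method names are hypothetical and not part of the Kinect SDK, and the degree conversion is added purely for readability.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the Kinect SDK.
static class AngleFormatter
{
    // Formats an angle with three decimal places, matching the
    // "0.000 radians" initial values used by the properties.
    public static string FormatRadians(double radians)
    {
        return radians.ToString("F3", CultureInfo.InvariantCulture) + " radians";
    }

    // Converts the same angle to degrees, with one decimal place.
    public static string FormatDegrees(double radians)
    {
        double degrees = radians * 180.0 / Math.PI;
        return degrees.ToString("F1", CultureInfo.InvariantCulture) + " degrees";
    }
}
```

In the handler above, the assignment could then read `this.MicArrayBeamAngle = AngleFormatter.FormatRadians(e.Angle);`, assuming e.Angle is a double as its use with ToString() suggests.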

Finally, the Window_Closed event handler ensures that when the window is closed the thread is terminated if it’s still active.

        private void Window_Closed(object sender, EventArgs e)
        {
            // Only abort the capture thread if it is still running.
            if (t != null && t.IsAlive)
            {
                t.Abort();
            }
        }
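Thread.Abort is an abrupt way to end a thread, so a gentler alternative is to ask the capture loop to stop and then wait for it. The sketch below shows that pattern in isolation. The StoppableWorker class is hypothetical, and the Thread.Sleep call stands in for the audioStream.Read call in CaptureAudio; in the real application the flag would only be checked between reads, and the thread would still need the MTA apartment state set as in the constructor shown earlier.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the Kinect SDK: owns a worker thread
// that can be asked to stop instead of being aborted.
class StoppableWorker
{
    // volatile so that the worker thread sees the update promptly.
    private volatile bool stopRequested;
    private readonly Thread thread;

    public StoppableWorker()
    {
        thread = new Thread(Run);
        // Background threads do not keep the process alive after the window closes.
        thread.IsBackground = true;
    }

    public bool IsRunning
    {
        get { return thread.IsAlive; }
    }

    public void Start()
    {
        thread.Start();
    }

    // Asks the loop to finish and waits up to the given timeout.
    // Returns true if the thread ended in time. Call only after Start().
    public bool Stop(TimeSpan timeout)
    {
        stopRequested = true;
        return thread.Join(timeout);
    }

    private void Run()
    {
        while (!stopRequested)
        {
            // Stand-in for one audioStream.Read(...) call in CaptureAudio.
            Thread.Sleep(10);
        }
    }
}
```

With this approach, Window_Closed would call something like `worker.Stop(TimeSpan.FromSeconds(2))` rather than aborting the thread, which gives the using blocks in the capture method a chance to dispose the audio stream and source normally.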

The code implementing change notification is not shown because it follows the standard INotifyPropertyChanged pattern.
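For readers who have not met it before, change notification of the kind the properties above rely on typically looks like the following. This is a minimal sketch, not the code from the application itself: the AngleModel class is a hypothetical stand-in for MainWindow so that the example is self-contained, and only one of the two properties is shown.

```csharp
using System.ComponentModel;

// Hypothetical stand-in for the window class; in the application the
// interface would be implemented by MainWindow itself.
public class AngleModel : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    private string micArrayBeamAngle = "0.000 radians";

    public string MicArrayBeamAngle
    {
        get { return this.micArrayBeamAngle; }
        set
        {
            this.micArrayBeamAngle = value;
            this.OnPropertyChanged("MicArrayBeamAngle");
        }
    }

    // Raises PropertyChanged so that bindings re-read the named property.
    protected void OnPropertyChanged(string propertyName)
    {
        // Copy to a local so the null check and the call use the same delegate.
        PropertyChangedEventHandler handler = this.PropertyChanged;
        if (handler != null)
        {
            handler(this, new PropertyChangedEventArgs(propertyName));
        }
    }
}
```

WPF bindings subscribe to PropertyChanged on the binding source, so setting the property is enough to refresh the bound TextBlock.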

The application is shown below. Although it only captures data, it hints at what this technology could be used for. A natural next step is to extend this example to incorporate speech recognition.


[Screenshot: beamformingapp]

Conclusion


The Kinect for Windows SDK beta from Microsoft Research is a starter kit for application developers. It enables access to the Kinect sensor, and experimentation with its features. The audio API is simple to use and provides easy access to the beamforming data.

4 comments:

Anonymous said...

It's not Microsoft.Speech, change it to System.Speech and install the Speech SDK Version 5.1 instead of your version... it has the same classes...

David Britch said...

v5.1 of the Speech SDK dates back to 2009. I used v10.2 of the Speech Platform SDK, available from:

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=14373

Anonymous said...

the demo here
http://www.youtube.com/user/steve118x#p/u/0/j4oxq4o04HI

has a visualization of the beam following someone talking as they move.

steve118x

My journey said...

Hi,
I am a new user of Kinect for Windows. I am interested in getting the raw data from the microphone array for further processing. I am not able to read the out.wav file obtained by running the "AudioCaptureRaw" code present in the SDK.
Please help me with how to extract the individual mic signals for my processing.

thank you.