Speech Recognition API

Session 509 WWDC 2016

iOS 10 brings a brand new Speech Recognition API that allows you to perform rapid and contextually informed speech recognition in both file-based and realtime scenarios. In this video, you will learn all about the new API and how to bring advanced speech recognition services into your apps.

[ Music ]

Hi. I'm Henry Mason, an engineer working on speech recognition for Siri.

Today we're incredibly excited to announce a brand new API which will let our speech recognition solve problems for your apps too.

A quick overview of what speech recognition is.

Speech recognition is the automatic process of converting audio of human speech into text.

It depends on the language of the speech.

English will be recognized differently than Chinese, for example.

On iOS, most people think of Siri but speech recognition is also useful for many other tasks.

Since Siri was released with iPhone 4S, iOS has also featured keyboard dictation.

That little microphone button next to your iOS keyboard spacebar triggers speech recognition for any UI kit text input.

Tens of thousands of apps every day use this feature.

In fact, about a third of all requests come from third party apps.

It's extremely easy to use.

It handles audio recording and recording interruption.

It displays a user interface.

It doesn't require you to write anymore code than you would to support any text input.

And it's been available since iOS 5.

But with its simplicity comes many limitations.

It often doesn't make sense for your user interface to require a keyboard.

You can't control when the audio recording starts.

There's no control over which language is used.

It just happens to use the system's keyboard language.

There isn't even a way to know if the dictation button is available.

The default audio recording might not make sense for your use case and you may want more information than just text.

So now in iOS 10, we're introducing a new speech framework.

It uses the same underlying technology that we use in Siri and Dictation.

It provides fast and accurate results which are transparently customized to the user without you having to collect any user data.

The framework also provides more information about recognition than just text.

For example, we also provide alternative interpretations of what your users might have said, confidence levels, and timing information.

Audio for the API can be provided from either pre-recorded files or a live source like a microphone.

iOS 10 supports over 50 languages and dialects from Arabic to Vietnamese.

Any device which runs iOS 10 is supported.

The speech recognition API typically does its heavy lifting on our big servers which requires an internet connection.

However, some newer devices do support speech recognition all the time.

We provide an availability API to determine if a given language is available at the moment.

Use this rather than looking for internet connectivity explicitly.

Since speech recognition requires transmitting the user's audio over the internet, the user must explicitly provide permission to your application before speech recognition can be used.

There are four major steps to adopting speech recognition in your app.

First, provide a usage description in your app's info.plist.

For example, your camera app, Phromage, might have used a usage description for speech recognition of this will allow you to take a photo just by saying cheese.

Second, request authorization using the request authorization class method.

The explanation you provided earlier will be presented to the user in a familiar dialogue.

And the user will be able to decide if they want to provide your app to speech recognition.

Next create a speech recognition request.

If you already have recorded audio file, use the SFSpeechURLRecognitionRequest class.

Otherwise, you'll want to use the SFSpeechAudioBuffer RecognitionRequest class.

Finally hand the recognition request to an SFSpeech recognizer to begin recognition.

You can optionally hold onto the returned recognition task which can useful for monitoring and controlling recognition progress.

Let's see what this all looks like in code.

We'll assume we've already updated our info.plist with an accurate description of how will it be used.

Our next step is to request authorization.

It may be a good idea to wait to do this until the user has invoked a feature of your app which depends on speech recognition.

The request authorization class method takes the completion handler which doesn't guarantee a particular execution context.

Apps will typically want to dispatch to the main queue if they're going to do something like enable or disable a user interface button.

If your authorization handler has given authorize status, you should be ready to start recognition.

If not, recognition won't be available to your app.

It's important to gracefully disable necessary functionality when the user makes this decision or when the device is otherwise restricted from accessing speech recognition.

Authorization can be changed later in the device's privacy settings.

Let's see what it looks like to recognize a prerecorded audio file.

We'll assume we already have a file url.

Recognition requires a speech recognizer which only recognizes a single language.

The default initializer for SFSpeechRecognizer is failable.

So it'll return nil if the locale is not supported.

The default initializer uses device's current locale.

In this function, we'll just return one in this case.

While this speech recognition may be supported, it may not be available, perhaps due to having no internet connectivity.

Use the is available property on your recognizer to monitor this.

Now we create a recognition request with the recorded file's url and give that to the recognition task method of the recognizer.

This method takes completion handler with two optional arguments, result and error.

If result is nil, that means recognition has failed for some reason.

Check the error parameter for an explanation.

Otherwise, we can read the speech we recognize so far by looking at results.

Note that the completion handler may be called more than once as speech is recognized incrementally.

You can tell the recognition is finished by checking the is final property of the result.

Here we'll just print the text of the final recognition.

Recognizing live audio from the device's microphone is very similar but requires a few changes.

We'll make an audio buffer recognition request instead.

This allows us to provide a sequence of in memory audio buffers instead of a file on disc.

Here we'll use AVAudioEngine to get a stream of audio buffers.

Then append them to the request.

Note that it's totally okay to append audio buffers to a recognition request both before and after starting recognition.

One difference here is that we no longer ignore the return value of the recognition task method.

Instead we store it in a variable property.

We'll see why in a moment.

When we're done recording, we need to inform the request that no more audio is coming so that it can finish recognition.

Use the end audio method to do this.

But what if the user cancels our recording or the audio recording gets interrupted?

In this case, we really don't care about the results and we should free up any resources still being used by speech recognition.

Just cancel the recognition task that we started we stored when we started recognition.

This can also be done for prerecorded audio recognition.

Now just a quick talk about some best practices.

We're making speech recognition available for free to all apps but we do have some reasonable limits in place so that the service remains available to everyone.

Individual devices may be limited in the amount of recognitions that can be performed per day.

Apps may also be throttled globally on a request per day basis.

Like other service backed APIs, for example CLGO Coder, be prepared to handle network and rate limiting failures.

If you find that you're routinely hitting your throttling limits, please let us know.

It's also important to be aware that speech recognition can have a high cost in terms of battery drain and network traffic.

For iOS 10 we're starting with a strict audio duration limit of about one minute which is similar to that of keyboard dictation.

A few words about being transparent and respecting your user's privacy.

If you're recording the user's speech, it's a great idea to make this crystal clear in your user interface.

Playing recording sounds and/or displaying a visual recording indicator makes it clear to your users that they're being recorded.

Some speech is a bad candidate for recognition.

Passwords, health data, financial information, and other sensitive speech should not be given to speech recognition.

Displaying speech as it is recognized like Siri and Dictation do can also help users understand what your app is doing.

It can also be helpful for users so that they can see when recognition errors occur.

So developers.

Your apps now have free access to high performance speech recognition dozens of languages.

But it's important to gracefully handle cases where it's not available or the user doesn't want your app to use it.

Transparency is the best policy.

Make it clear to the user when speech recognition is being used.

We're incredibly excited to see what new uses for speech recognition you come up with.

For more information and some sample code, check out this session's webpage.

You might also be interested in some sessions on SiriKit.

There's one on Wednesday and the more advanced one later on Thursday.

Thank you for your time and have a great WWDC.

Apple, Inc. AAPL
1 Infinite Loop Cupertino CA 95014 US