Beyond Dictation — Enhanced Voice-Control for macOS apps

Session 717 WWDC 2016

Using the native speech recognition capabilities of macOS, you can dictate a letter or message, or perform command-and-control actions like launching and switching between apps, finding and creating documents, and more. But even more is possible when you combine dictation with the built-in automation technologies of macOS to create personalized, voice-triggered workflows. Learn how to dynamically create dictation command sets for yourself, your application, and your customers.

[ Music ]

I guess I'm on.

[ Applause ]

Thank you.

We're going to be moving the attendee bash to my house over in Berkeley, it's a lot more fun over there.

So welcome to Session 717, Beyond Dictation, Enhanced Voice-Control for macOS Applications.

I'm Sal Soghoian, I'm the Product Manager for Automation Technologies at Apple.

Today we're going to be talking about the Mac, macOS and what happens when you combine the speech recognition architecture in the macOS with the native automation frameworks that are available to you to produce some really incredible results.

Now when you're talking about speech recognition and dictation it's grouped into four different categories.

The first, dictation, is basically about transcription, about you speaking something and that getting transcribed and put into the current text field.

Enhanced dictation advances that ability by adding the ability to edit the text and to navigate within a text field.

Advanced commands include the ability to control an application's interface, as well as perform dictation tasks.

You can click buttons, tabs, menus, those kinds of things.

And finally, user commands are custom dictation commands you create yourself to perform the kind of things that aren't a part of the standard library of commands that come with the computer.

So in macOS in the latest version a lot of things got moved around, so what I'm going to do is take a very short couple minutes and review where all these different technologies live, how to turn them on and then we're going to have some real fun.

So to begin with, we'll take a look at basic dictation.

Now if you open up the System Preferences app the controls for speech and dictation used to be in this window, but instead of where they were there is now the Siri preference.

And if we open the Siri preference you'll see that there are no controls in there for dictation and speech.

It's important to understand that Siri and dictation and speech are two different technologies.

They have different functions, different uses, and different purposes.

So if we close this and we go back out to the main preference window and we search for the word dictation in the search field at the top right you'll see that interestingly the keyboard preference is now selected.

And if we open that you'll see a new button added to the Tab bar of that window at the end and it's for dictation.

If we select that, there are the dictation controls, which in macOS are now within the Keyboard System Preference pane.

There's a couple of radio buttons there initially for turning it off and on.

And when you turn on dictation, basic dictation, you'll see this sheet drop down and it has two bits of information that are important to understand.

The first is that what you say gets recorded by the computer and sent to Apple servers so that it's transcribed and that text is then sent back to your computer.

In addition, for accuracy they send up the names and addresses from your contacts list so that they can better match some of the phrases that you're saying.

If those terms are okay with you, then you just click the Enable Dictation button and now dictation is on.

And you can dictate into any text field like a TextEdit document, into a Messages window, into a Mail message, and even into a URL input field for Safari.

You can use dictation anywhere on the computer where you have a text field.

And it's good for doing short things, names, addresses, some short phrases, it's really good for that.

But it does have some limitations in that you do need an active Internet connection to connect to the servers.

And it's also not intended for use with long dictations like if you're doing a business letter, it's not really designed for that kind of thing.

You'll have to keep activating the dictation command all the time and there really are no editing controls for dealing with mistakes or changes you want to make to the dictated text.

Which brings us to the next category, which is enhanced dictation.

Not only can you dictate into text fields, but you could navigate between the different text fields of a window and edit the text that you have there as well.

That was a pretty nice little animation there, did you like that?

So if we go back to our Preferences window here, below the On Off button that we just turned off and on is a checkbox called Use Enhanced Dictation.

And when you select this three things happen.

The first is it allows your computer to be used offline.

You no longer require an Internet connection to Apple servers.

All the transcription is done locally on your computer.

In addition, you can dictate continuously.

So you can sit there over a 20 minute period to dictate a letter and work continuously with the computer so that it will actually grab what you say and type it in for you.

And the third thing is that the computer can now give you live feedback.

It can give you an audio sound or a visual cue that it has heard you, that it has recognized what you have to say.

So we're going to turn that on and then go back out to the main Preferences window.

Now to see what kind of commands are available to us we're going to go to the Accessibility System Preference pane and scroll down that list on the left and select the thing that says Dictation.

Your preferences for using dictation on the computer in an enhanced mode are now displayed here on the right.

You have an option to mute the audio when you're dictating.

So if you have iTunes playing and you started dictation it automatically mutes it.

You can have it play a confirmation sound, which is usually like a little [swish sound] just to let you know that it heard you.

You'll hear this being used when I do some demos here for you.

And you can also assign a key word so that the computer will not react to what you say until it hears that key word first, like "Computer, what time is it?"

So to see the commands that you've now enabled we're going to click the Dictation Commands button and we get this sheet that comes down and it lists all the commands.

Commands are grouped into suites and suites contain one or more commands.

You can search through all the available commands using the search field at the top.

And when you have a command selected you can turn it off or on by selecting that checkbox to the left of it.

And you can also see the syntactical options available to you on the right-hand side.

A lot of commands can be spoken a variety of different ways and all of those different ways will trigger the command to execute.
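As a rough illustration of the structure just described (suites containing commands, each command having several accepted phrasings and an on/off checkbox), here is a minimal Python sketch. The suite contents, phrasings, and function names are assumptions made up for the example; the session does not show how macOS stores or matches its commands.

```python
# Hypothetical model of suites and commands. The entries below are
# illustrative only and are not taken from the macOS command list.
SUITES = {
    "Selection": {
        "select_word": ["select word", "select the word"],
        "select_sentence": ["select sentence", "select the sentence"],
    },
    "Navigation": {
        "go_to_end": ["go to end", "go to the end", "move to end"],
        "scroll_down": ["scroll down"],
    },
}

# Commands the user has switched off with the checkbox.
DISABLED = {"scroll_down"}


def normalize(phrase):
    """Lowercase and collapse whitespace so matching ignores case and spacing."""
    return " ".join(phrase.lower().split())


def find_command(spoken):
    """Return (suite, command) for an enabled command, or None if nothing matches."""
    wanted = normalize(spoken)
    for suite, commands in SUITES.items():
        for name, phrasings in commands.items():
            if name in DISABLED:
                continue  # a disabled command never triggers
            if wanted in (normalize(p) for p in phrasings):
                return suite, name
    return None
```

Any of the listed phrasings resolves to the same command, which is the behavior the syntax options on the right-hand side of the sheet describe.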

So let's look at the first suite of commands, the Selection Suite.

This suite is designed to select text.

So you can select a word, a paragraph, a sentence and you'll notice that there is even a command for selecting a phrase.

The Navigation Suite is for navigating within your editing area.

So you can go to the end, go to the beginning, go to the end of the sentence that kind of thing.

You can scroll up, scroll down, move left, move right.

So it's really designed to help you move around a window between different fields and scrolling the view plane.

There is a short editing suite that contains the ability to cut, copy, paste, as well as perform some transformations like capitalize, lowercase, uppercase, those kinds of things.

And you'll notice that there is one special command there called Replace.

So you can say replace this phrase with this phrase, that's quite handy.

There's a small formatting suite for doing things like bold, italicize, underline and there is a small system suite for stopping dictation and for also displaying a floating HUD that lists all of the commands.

So if you're in the middle of a dictation and you're not sure what command is available, you can say Show Commands and this little HUD will appear.

So together these make a pretty comprehensive set for doing some basic good editing of what you're dictating.

And to show you, I recorded a little video of me doing it in Pages.

So I just said something it gets transcribed.

Replace Brazil with America.

Select power, capitalize that, go to end, new line, new line, the future of wind power is amazing period.

So I dictated to the computer, it transcribed what I said.

I said select... wait until you see me live.

So then I used the Replace command, I selected Brazil, changed it to America, I capitalized something, I navigated around.

So you get the idea that, you know, this is very powerful.

Enhanced dictation gives you many things.

It allows you to work continuously offline on your computer with no interaction necessary.

It's giving you live feedback, but it does have some limitations and, as I call it, it's procedural.

It means that it's like describing how to make a peanut butter sandwich.

You hold the jar with this hand, you turn the top this way.

So you have to guide the computer through the process of what you're doing.

You have to tell it to move to the end, new line, new line.

And that can take some getting used to in order to be able to do that smoothly.

The next category of recognition is advanced commands and that's where you augment the dictation abilities with the ability to control the application interface.

So not only are you able to dictate, but you can push buttons.

So if we go back to this accessibility sheet that we were looking at down at the bottom is a checkbox called Enable Advanced Commands.

And if you select that it turns on new suites and expands some of the suites of commands that are available.

So the Navigation Suite now expands to include things for controlling the application interface.

You can say show numbers and a number will be placed over every control in the application and then you can just speak the number and that button or menu will be pressed and activated.

The System Suite is now expanded to include the ability to search Spotlight and that is really useful.

You can also show commands and hide commands just the same way as before.

There is now an Application Suite for switching between applications, for quitting an application, and hiding an application.

And there's a small Document Suite that lets you open, create, save, close documents within an application.

So in addition to this existing set of commands that we have, the new set augments that and expands some of the key things there.

So let me show you how that works when you use it.

And in this case I'm going to use it to create a document in Keynote.

So switch to Keynote and Keynote will come forward, Show Commands and you'll get that floating HUD that has a list of all the commands that are available.

And if we look at the document section there, there's a command for new item.

And this is the same thing as going Command-N on the keyboard or selecting New from the File menu.

So I'll say New Item, it recognizes and since we're in Keynote it brings up the Template Picker so I can pick what kind of presentation I want to make.

I'm looking to make a wide presentation and at the top is a button that I would click if I was doing this by hand to see the wide presentation templates.

So I'll click Wide and that will press the button for me and reveal all the wide presentations.

Now I'm looking for Parchment, and it doesn't appear in this view.

So I'll have it scroll the window by saying Scroll Down and then there's Parchment at the very end.

And I can select that by saying Click Parchment.

And once I do it will open up a new Keynote document with the Parchment theme.

So you can see that advanced commands augment dictation by giving you control of the application interface.

All the buttons and tabs are available.

But again, like the other categories of speech recognition there are some limitations and it too is procedural.

You notice I had to walk through each step of doing the things required to create a Keynote document.

I'm basically using my voice as a mouse or a keyboard and not all applications have complete accessibility support.

Generally, they do have good support, but that could also be an issue.

You might not be able to give a command for the thing that you want it to do.

But here's the good news: user commands.

This is where we leave the world of accessibility behind.

This is where the power of automation and the power of speech recognition fuse together to produce incredible tools.

So let me show you how you access these on the computer.

So in this window is a secret button for turning on this power ability and it is right here, this + button.

It doesn't look like much, it looks like any other + button, but when you press that + button a new suite is added called User Commands and by default a new blank command is set up for you.

So now we can create our own command and to do that I'm going to describe some parameters on the left-hand side beginning with what is the phrase that I want to use as a command.

So let's say I want to create a command to tell the computer to take my picture right.

So I enter take my picture in there.

So when I say the words take my picture the computer will do the actions I want it to do.

Now you can determine if this command is always available or only within a specific application by selecting it from this pop-up.

I'm going to leave it set for Any Application.

And to perform the action you have a menu of possible things that the computer will do for you upon dictating that command.

You can have the computer open up certain Finder items like certain documents that you want opened.

You can have the computer open a specific URL.

You can have it paste text that you like, like your favorite legalese statement.

You can have it paste data like your favorite corporate logo.

You can have it press a keyboard shortcut.

You can have it select any menu in any application and you can even have it run an Automator workflow or AppleScript or JavaScript script.
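The parameters described for a user command (a spoken phrase, an application scope, and one action chosen from the menu) can be pictured as a small record. The Python sketch below is only a mental model: the class, field names, and payload strings are hypothetical and do not reflect how macOS actually stores these commands.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    """The kinds of action listed in the Perform menu, as named in the talk."""
    OPEN_FINDER_ITEMS = "open Finder items"
    OPEN_URL = "open URL"
    PASTE_TEXT = "paste text"
    PASTE_DATA = "paste data"
    PRESS_KEYBOARD_SHORTCUT = "press keyboard shortcut"
    SELECT_MENU = "select menu"
    RUN_WORKFLOW = "run workflow"  # Automator workflow, AppleScript, or JavaScript


@dataclass(frozen=True)
class UserCommand:
    phrase: str                # what you say, e.g. "take my picture"
    action: Action             # what the computer does
    payload: str               # hypothetical: a path, URL, text, or shortcut
    app: Optional[str] = None  # None stands for "Any Application"

    def is_available(self, frontmost_app):
        """A command scoped to one app is only offered while that app is in front."""
        return self.app is None or self.app == frontmost_app


# Example values are made up for illustration.
take_picture = UserCommand("take my picture", Action.RUN_WORKFLOW, "Take My Picture")
legalese = UserCommand("paste legalese", Action.PASTE_TEXT, "All rights reserved.", app="Pages")
```

With this model, `take_picture` would be offered in every application, while `legalese` would only be offered while Pages is frontmost.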

And as a matter of fact the OS ships with nine pre-done Automator workflows for you.

The first three here are for starting a recording in the QuickTime Player.

You can say new audio recording, new screen recording, new video recording.

There is one for going to the Apple website of course and there are five for navigating within iTunes.

So you can say show me the top family movies and it will go right to that page within the iTunes application.

And at the bottom is a choice for other that will let you pick your own Automator workflow or own JavaScript or AppleScript script and make that the action that gets run.

But you'll notice there is one for take my picture, so I'm going to select that and click the Done button to create this command.

And it automatically gets added to the list of commands.

So my user suite now has one command called take my picture and this is what it looks like when you run it.

So I start dictation, I go take my picture, it recognizes, brings up the camera picker, I smile and it takes my picture and puts it in Photos for me automatically.

So user commands are different than the other types of commands because they're task oriented.

They're not so much procedural where you're telling it to do this, then this, then this.

You're basically giving a command to execute this particular task no matter how many steps it takes, no matter what it needs to do.

And it really does expose all the power of automation that is available to you on the operating system.

And you still get all the other dictation abilities that you currently have with the other ones.

It's hands-free, always on, visual feedback, you don't need a net connection.

It is incredibly powerful, but you saw how many steps it took me to make one user dictation command. If I wanted to make a hundred of those to really do something with a particular app, it might be a little bit cumbersome to create them. Until today.

Today, there's a set of custom dictation commands that I'm going to give you that will let you control the Finder, iWork, Photos, and some of the other apps.

And they take advantage of this automation and speech recognition technology together.

So let me switch over here and we'll watch these in action.

Okay, so I'm getting over a cold and my voice is a bit lower, so we'll see if she even pays attention to me today.

And I have a mic set up here because we're in this huge boomy [phonetic] room.

But let me tell you, when I'm in the office at Apple I just use the computer, it's open there connected to my monitor, and I can talk to this thing from across the room and it works perfectly.

So I'm going to be using this today and I expect one or two errors, but let's see what happens.

So let's try that thing about creating a new presentation in Keynote.

Switch to Keynote, make a new presentation okay.

How about let's make a different one.

Make a wide presentation, oh okay.

How about one with a particular template.

Make a new wide presentation using the gradient template.

Oh that's pretty good, let's try a different theme.

Make a new standard presentation with the brush canvas theme.

Pretty good.

Close all without saving.

So the whole idea of having to be procedural to create a document, where you saw me go through all the steps, is now like, no.

I'm thinking about what I need to do: give me a new document with this template, this width, I'm done, I'm there.
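To make the contrast with the procedural approach concrete, here is a minimal Python sketch of how one spoken request could carry every parameter of the task at once. The session does not explain how the demonstrated commands are implemented, so the pattern and function name below are assumptions invented for illustration, not the actual mechanism.

```python
import re

# Hypothetical grammar for one task-oriented phrase. The real command set
# may recognize phrases in a completely different way.
PATTERN = re.compile(
    r"^make (?:a )?(?:new )?(?:(?P<width>wide|standard) )?presentation"
    r"(?: (?:using|with) the (?P<theme>.+?) (?:template|theme))?$"
)


def parse_request(spoken):
    """Return the requested width and theme, or None if the phrase does not match."""
    match = PATTERN.match(" ".join(spoken.lower().split()))
    if match is None:
        return None
    return {"width": match.group("width"), "theme": match.group("theme")}
```

A request such as "Make a new wide presentation using the gradient template" yields both the width and the theme in a single step, where the advanced-commands walkthrough needed a separate spoken instruction for each click and scroll.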

So that's good for creating a new document, let's see how you can work with an existing document.

So let me pull up a document that I have on the computer and let's work with it a little bit.

Search Spotlight for presentation the States of the United States, open result.

She's chugging today.

Start at the top.

[ Music ]

Oh I'm good with slides, I'm good with slides.

[ Music ]

Next slide.

Go to slide nine.

Read the presenter notes.

California's diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west and from the redwood forests of the northwest to the Mojave Desert areas in the southeast.

Tell me about this image.

Although Death Valley is a below sea level basin a great diversity of life flourishes in this land of extremes.

Take a break.

Export the presenter notes to mail.

Export presenter notes to mail.

That's pretty handy I could use that a couple times.

Switch to Keynote.

Show related items.

So we have a Spotlight window down here with all the items that are related to this.

There's a GarageBand file with the shaker song in it okay.

There's even the email that I just sent.

There's a bunch of images that I used.

There's a picture of the US flag.

There's also an audio recording that I put in right, that was the audio recording and some links.

So it was really easy for me to find things that are related to the document that I'm working on just by asking for them.

Close this without saving.

So you can see this concept of having task oriented things is really useful because it doesn't really interfere with the mental flow that you have going, especially when you're creating something.

So let me use user commands and dictation commands to create some content.

So we've seen it open a document, create a new document and we've seen it work with a document.

Let me actually create some content using these commands.

Switch to Photos.

These are some photos I took on my trip down the Rhone River.

I took one of those Viking type cruises.

Lots of fun I highly recommend that.

Select all photos.

Help me to add titles.

Enter the title for image one of five.

You know, normally doing this kind of thing is a real pain because you've got to get down there in the interface and look at these little things.

And, you know, being able to have this guide you through something sometimes is just the easiest way to do something.

On the Rhone River yeah, you need that special character there.

Rhone River.

Five of five.

This was an incredible Roman temple right in the middle of town.

Done.

So using that command I was able to add titles to a bunch of images really quickly.

Select all photos, make a new presentation with these.

Okay, so it just did [applause].

Yeah, I mean that's a pain right, who wants to do the 40 steps that it takes to do that.

I can think of it, so why can't I just ask the computer to do it? Obviously it can, right?

Go to slide one.

Change master slide to title center.

Scratch that.

Change master slide to title center.

And let's see vacation photos.

Select Photos, capitalize that, stop, edit.

Okay, let's see what I've got here.

Move the slide to the end.

Okay.

Done.

And then I want this to go down around there.

I think no, the lock came first and then that okay.

This needs some tweaking, this is bad.

Edit this in photos.

So the computer knows where that photo came from doesn't it.

Why should I have to go spend my time doing that kind of thing?

Let's set it to crop and I'll straighten this out a little bit and then as long as I'm adding I'm going to crop it to get rid of the guy standing there and make this a little bit more interesting, more focused on the lock rather than the rest of the stuff around it.

And Photos has some nice little tools for doing that, so that looks good.

Show this in Keynote.

It knows where it is in Keynote right.

Let me select it, update this image.

Update this image.

So I don't have to go through that whole process about doing stuff I'm just thinking, I'm creating here on the fly right.

Let's make this a little bit bigger so that it looks a little bit more interesting.

And then oh this is great, the Popes' Palace, it's really interesting.

Show this in Maps.

So here it is in Maps and let's use the nice 3-D feature of Apple Maps.

And let's see if we can get a nice background for that photo to show people the context and the fact that the thing's right in the middle of this old Roman part of town.

You can see the circular area that was like the original village there right.

Okay, export this map to Keynote.

All right, very good and let's kind of crop this in here a little bit.

There we go.

I haven't done the command yet for crop the slide.

It's a good idea though.

Let's go paste it and put it to the back here.

I'm going to copy this and put this on top of this, it's like a little trick right.

So that I can do like a magic move and then reveal.

So I need to apply a magic move.

Apply magic move.

Apply a dissolve.

Do that again.

Okay, so I got the transitions and then oh, Pont du Gard.

Okay, this one we need to fix a little bit.

Scale this to fit slide width.

All right, that's good.

And make a long panoramic sequence.

Done.

Okay and while we're at it put descriptions on top of every image.

That could take a while to do.

And so after this trip I was so impressed with these locations that I went on the Internet, I did some research as to how many people visit these sites every year.

And I have a spreadsheet that has that information.

Search Spotlight for spreadsheet tourism in France.

Open result.

Select the table.

Export this table to Keynote as a chart.

Good, that could be handy.

And let's scale this down a little bit so that it fits there nice.

Okay, start from the top.

All right, so this was a Roman temple right in the middle of town, it's been around forever.

They actually turned it into a movie theater for a while.

And this was us going through a lock on the Rhone River, there's about four or five locks during the course of that.

This is the power station that they use to provide power for Arles.

And this is Pont du Gard, which is this incredible Roman aqueduct that was built like 3,000 years ago.

And the fact that this huge thing that's three stories tall it's still here today and still in the condition it's in is absolutely amazing.

It spans this entire long valley and I'll bet if we try to do that today it would run in the billions of dollars.

But it's quite impressive.

And that's Pont du Gard.

And of course, there is the Popes' Palace and it's located right in the center of the old Roman town like you can see here.

And people find these sites really interesting, there's millions and millions of people that visit these every year.

Save this presentation.

And as a matter of fact, why don't I, while I'm at it... but I don't have a free port here.

Let's see if I can copy this to my thumb drive.

So I'm going to stick a thumb drive in here.

Save this presentation to my thumb drive and eject it.

Scratch that.

Save this presentation to my thumb drive and eject it.

Saving document vacation photos to drive digital briefcase.

Dismounting digital briefcase.

So this is the kind of thing that you can do where, you know, you're still in the application you don't have to leave what you're doing to dismount something, to eject something, to copy something.

Done.

That deals with something in the Finder.

I mean you just tell the computer to do it.

It knows how to do this stuff right.

Why shouldn't we just be able to tell it to do that?

So where do you get this stuff?

Add a blank slide.

Turn this into a QR code.

Scale down 10%.

There we go.

So all of this material is available to you today and I'm going to show you where to get it.

So let's go back to the slides here and my slide advancer.

So what are these types of commands useful for?

Well we saw that they're really good for when you want to remain in context and you want actions and tasks performed for you.

They are really good for performing multistep tasks, like when I had to go through and name all of those images so that I could use their descriptions later.

They're also good for tasks that require dexterity.

Can you imagine copying the description from an image file on a slide and then creating a text box, placing it just so on that and doing that for an entire presentation?

Boy, you have to be a wizard to be able to do that quickly.

But with a voice command, with a dictation command, that's just asking that it be done.

You can also move data.

So like I had the map in Maps, I wanted it over in Keynote.

So voice commands and dictation commands are perfect for that and excellent for data transformation.

So I had a table with data that I wanted as a chart in Keynote and I was able to give it a command and have that executed.

You can perform tasks that aren't available in the application UI that you're in.

For example, turn this into a QR code.

That's a perfect example of something you'd want to do in a presentation, but it's just not there.

And things that the user wants to do, but doesn't know how.

Man, all the people in my family do not know how to use the Clipboard, do not know how to do a screen capture, do not know how to use AirDrop.

So being able to do that with a spoken dictation command is perfect.

So what are they good for?

Dictation commands are good for solving all of these problems and for being a solution.

So how does it work?

Well, I only have a couple minutes so I'm going to kind of boot out of that for a second and jump past that section.

It's basically magic and I'm going to kick back in and tell you where to go to get these resources.

Everything you need to know about how they work and the collection I just showed you is available for you on dictationcommands.com.

So to summarize, the power of speech recognition and automation working together makes it so that dictation is no longer just another way to enter text into a text field.

And speech is no longer just an assistive technology.

And now voice is a peer to touch, keys and cursor.

You can use your voice the same way that you use the other inputs into the computer.

And this is what's possible when you have macOS and all of these technologies working together.

It's only something that can happen on a Mac.

So thank you so much for being part of my day.

Thank you for being here.

Thank you for being part of this session I appreciate it.

[ Applause ]
