Vision Framework: Building on Core ML

Session 506 WWDC 2017

Vision is a new, powerful, and easy-to-use framework that provides solutions to computer vision challenges through a consistent interface. Understand how to use the Vision API to detect faces, compute facial landmarks, track objects, and more. Learn how to take things even further by providing custom machine learning models for Vision tasks using Core ML.

Hello everyone.

[ Applause ]

I hope you're having a great time at WWDC so far.

Allow me to introduce myself.

My name is Brett Keating and I'm here with my colleague Frank Doepke and we're here to tell you about Apple's new Vision framework, so let's get started.

We're going to begin by showing you what Vision can do for your apps.

We're going to go through a few visual examples of the algorithms that are going to be made available in the Vision framework this year.

At which point I'll hand it off to Frank to talk about the concepts behind the Vision framework, why we designed things the way we did, what the mental model behind our API is.

And then we'll go a little deeper and go through a code example.

This code example brings together a few different technologies in our SDK, including Core Image, as well as the brand-new Core ML framework that we're offering this year which enables you to put in your own custom models and have them be accelerated using our hardware.

So, let's begin with what you can do with Vision.

Let's start off with face detection.

Now face detection is something we already have in our SDK, but new this year in the Vision framework we're offering face detection that's based on deep learning.

And you may already know that deep learning has made groundbreaking changes in the accuracy in what we can do with Vision technologies and face detection is no exception.

We're going to have higher precision, which means fewer false positives, but we are also going to have dramatically higher recall, which means we'll miss fewer faces.

So, let's look at some of the examples of faces that we will now be able to detect with the Vision framework.

For one thing, we'll be able to detect smaller faces.

We'll also be doing a better job of detecting strong profiles.

We'll also do a better job detecting more partially occluded faces and that includes things like hats and glasses.

Sticking with the faces theme for a little longer, new this year in the Vision framework we are offering face landmarks. What are face landmarks?

This is a constellation of points that we detect on the face, things like the corners of the eyes, the outline of the mouth, the contour of the chin.

Here's an example, here's another example, and one more example.

We're really excited about this. I think there are going to be some great apps created with this technology.

Next, also new this year in the Vision framework is image registration.

If you don't know what image registration is it's basically aligning two images based on the features that are present in those images.

You can use this for stitching images together for a panorama, kind of like this example, or for image stacking applications.

We have two different kinds, one that's translation only and one that gives you a full homography for greater accuracy.

We're also offering a few technologies that are already in our SDK through the CIDetector interface.

We're making them available in the Vision API as well.

That includes rectangle detection as you can see, we detect the sign in the picture.

We're also doing barcode detection and recognition in the Vision API and text detection as well.

Another new technology, brand-new in the Vision framework this year is object tracking.

You can use this to track a face if you've detected a face.

You can use that face rectangle as an initial condition to the tracking and then the Vision framework will track that square throughout the rest of your video.

We'll also track rectangles, and you can also define the initial condition yourself.

So that's what I mean by general templates, if you decide to for example, put a square around this wakeboarder as I have, you can then go ahead and track that.

You can see that we handle pretty large changes in scale, pretty large deformations fairly robustly with this technology.

Another really exciting technology that's new in Apple's SDK this year is Core ML, and you can integrate your Core ML models directly into Vision.

As I've mentioned, machine learning does great things for Computer Vision and you can use Core ML if you want to create your own models, do your own solution.

Perhaps for example, you want to create a wedding application where you're able to detect this part of the wedding is the reception, this part of the wedding is where the bride is walking down the aisle.

If you want to train your own model and you have the data to train your own model you can do that.

Core ML as I mentioned, provides native acceleration for custom models so they'll run really fast and Vision provides the imaging pipeline to support these models, so you won't have to do any rescaling or anything like that we'll take care of all that for you.

We know what your model is expecting and we'll put the image in the right format.

If you're interested in Core ML there's some sessions that you can go to, we've listed the labs down here for you.

One of them will be tomorrow morning and then another one on Friday afternoon.

So that's basically the features that are in the Vision framework.

Overall, what Apple's new Vision framework provides are high-level on-device solutions to Computer Vision problems through one simple API.

Now let me break this statement down just a little bit.

What do I mean by high-level solutions?

Well we don't want you to have to be a Computer Vision expert to put the magic of Computer Vision into your applications.

You don't want to necessarily have to know which feature detector you want to use in combination with what classifier or set of classifiers, we're going to handle that for you or whether or not you want to use machine learning for example.

If you're a developer you're probably thinking I just want to know where the faces are.

And so, we're going to handle all that complexity for you.

Depending on your use case we'll be using either a traditional approach, if that's what's needed for maybe real-time applications, or deep learning algorithms for higher accuracy.

Now I also mentioned that we're doing all these algorithms on the device, so let's talk a little bit about why we'd want to do things on device versus providing a cloud-based solution.

First of all, it's privacy.

As you know, Apple cares a lot about privacy, I care a lot about privacy working at Apple, sometimes it makes my job a little harder, but nonetheless keeping all your data on the device is the best way to protect your users' data privacy.

Furthermore, with certain cloud-based solutions there's a cost associated with it.

If you're a developer maybe you're paying usage fees to use a cloud-based solution.

Your users will have to transfer the data to that cloud.

All these costs they can add up for both the developers and the users.

So, when everything's on the device it's free.

And you can support real-time use cases like the tracking example I showed you.

Imagine trying to track something through a video by sending every frame to the cloud, I don't think that's going to work too well.

So, no latency, fast execution that's what we're offering with the Vision framework.

So, I hope you enjoyed that introduction, now we're going to go a little deeper and talk about the Vision concepts.

For this part of the presentation I'm going to hand it off to Frank.

[ Applause ]

Thank you Brett.

Hi, good afternoon, my name is Frank Doepke and I'm going to talk more about the technical details of what is part of our Vision framework.

So, what do we want to do, when we want to analyze an image we have three major tasks that we actually want to perform.

So, we [inaudible] finding out what is in the image and what do I want to know about it.

There's the machinery, somebody's got to do the work and we get some results out of it, at least we hope that's what's going to happen.

So, in Vision terminology, the asks are requests.

And I just did a few examples here like the barcode detection or face detection and we feed them into our request handler.

That's the one in this case to be an image request and [inaudible] hold on to the image and it's going to do all the work for us.

And as a result, we get back what we call observations, what did we observe in this image.

And these observations depend on what you asked us to do.

So, we have classification observation or detected objects.

Now when you want to track something in the sequence like the wakeboarder it's basically the same concept.

We have some asks, we have the machinery, and we get some results out of it in the end.

Again, the asks are requests.

Now since this changes with every frame the image actually travels with the request.

Our machinery is again [inaudible] request handler it's the sequence request handler and we get results which are observations that go with our requests.

So, let me talk a little bit more about these two image request handlers that I mentioned so far.

So, we have the image request handler that is mostly if you want to do something interactive with the image.

You want to do multiple Vision tasks on an image, sometimes you actually do one and then based on the results you then kick off the next one and that's what you want to use the image request handler for.

It'll hold on to the image that it's set up with for its lifecycle and that allows us under the cover to do performance optimizations by holding on to intermediates to make these requests perform faster.

On the flipside, if I want to track something we use the sequence request handler.

The sequence request handler allows us to keep tracking [inaudible] in the sequence request handler.

And it will not hold on to all the images that get fed into it over its lifecycle, so they get released earlier.

But that means on the flipside, you cannot do the same optimizations if you want to do multiple requests on the same image.

So how does this look in the code, we are developers that's what we want to see.

So, we start as a blank slate that's always good and then we create a request.

In this case, it's a face detection request.

Now we create the request handler, what I'm choosing here is a request handler based on the files so I have a file on disk that I want to use.

Now I ask myRequestHandler to perform my request and this in this case [inaudible] it's just one request I have my array, but it could be many and I get my observations back.

And this can be many faces that I detected.

Now the one thing I would like to highlight here is the results come back as part of the request that we actually set up initially.
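
The flow just described (create a request, create a handler for a file on disk, perform, read the results off the request) might look roughly like this in Swift. This is a minimal sketch, not the code from the slides: the function name `detectFaces(at:)` and its `imageURL` parameter are hypothetical, and error handling is left to the caller.

```swift
import Foundation
import Vision

/// Detects face bounding boxes in an image file on disk.
/// `imageURL` is a hypothetical parameter; the caller supplies a real file URL.
func detectFaces(at imageURL: URL) throws -> [VNFaceObservation] {
    // 1. The ask: a face rectangle detection request.
    let request = VNDetectFaceRectanglesRequest()

    // 2. The machinery: an image request handler set up with one image.
    let handler = VNImageRequestHandler(url: imageURL, options: [:])

    // 3. Perform an array of requests (just one here, but it could be many).
    try handler.perform([request])

    // 4. The results come back as part of the request itself.
    return request.results as? [VNFaceObservation] ?? []
}
```

Each returned observation carries a `boundingBox` in normalized image coordinates.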

How does it look when we want to track something?

We create a sequence request handler [inaudible] of course not set up as an image because we have to [inaudible] all the frames of the sequence.

So, I started with an observation that I got from the previous detection or I mark something up and I create my tracking request.

And I simply have to run the request.

And I feed in, in this case as a pixel buffer, the frame that is currently being tracked.

And out of it again I get some results.
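
As a rough sketch of that tracking flow, the class below keeps one sequence request handler and one tracking request alive across frames. The `ObjectTracker` type and its method names are assumptions made for illustration; reusing a single request and updating its `inputObservation` each frame is one possible arrangement, not necessarily the one shown in the session.

```swift
import CoreVideo
import Vision

/// Keeps the tracking state for one object across the frames of a sequence.
/// `ObjectTracker` is a hypothetical helper type, not part of Vision.
final class ObjectTracker {
    // The sequence request handler is created without an image;
    // each frame travels with the perform call instead.
    private let sequenceHandler = VNSequenceRequestHandler()
    private let request: VNTrackObjectRequest

    /// `initialObservation` comes from an earlier detection or a user-drawn box.
    init(initialObservation: VNDetectedObjectObservation) {
        request = VNTrackObjectRequest(detectedObjectObservation: initialObservation)
    }

    /// Feeds the next frame and returns the updated normalized bounding box, if any.
    func track(in pixelBuffer: CVPixelBuffer) throws -> CGRect? {
        try sequenceHandler.perform([request], on: pixelBuffer)

        guard let observation = request.results?.first as? VNDetectedObjectObservation else {
            return nil
        }
        // Seed the next frame with the latest result.
        request.inputObservation = observation
        return observation.boundingBox
    }
}
```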

So, now that we have talked about how this API is kind of structured I would like to guide you through some best practices so that you get, you know, the best experience out of Vision.

So, when we want to put together a Computer Vision task you have to think about a few things.

Number one, what is the right image type that I want to use.

Number two, what am I going to do with the image.

And number three, what performance do I need or want.

Of course, you always want fastest, but there are some tradeoffs that you have to think about.

So, let's talk about the image type.

Vision supports a number of image types, and they range from CVPixelBuffer and CGImage to, as we saw in the previous example, image data that I reference through an NSURL.

And we go over all these types in the following slides so that you know what to choose when.

Which to choose depends a lot on what you want to do.

If you run from a camera stream or if you run from files on disk you have to look at that [inaudible] kind of which type of image you want to use.

Now two important things to remember is we already have an imaging pipeline in the Vision framework you don't need to scale the images.

So, unless you already have a very small representation that you absolutely want to use, please don't pre-scale because we'll just do the work twice.

And mind the orientation.

Computer Vision algorithms are mostly sensitive to orientation, so you have to pass that in.

And that is an important part because if you pass in a portrait image that's actually lying on its side we will not find the faces and that's one of the common mistakes that usually happens.

So, I promised to go over the types.

When you want to do something streaming we want to use the CVPixelBuffer.

When you create a VideoDataOut [inaudible] capture you will get CMSampleBuffers and through those we get your CVPixelBuffers.

It's also a pretty good format if you already have something where you keep your image data raw in memory, like RGB pixels; wrap them into a CVPixelBuffer and this is a great format to pass into Vision.

When you get files from disk please use the URL or if it comes from the web use the NSData path.

The great thing about that is it really allows us to reduce the memory footprint in your application.

Vision will only read what it needs to perform the task.

If you think about wanting to do face detection on a 64-megapixel panorama, Vision will actually reduce your memory footprint by not reading the full file into memory, and that is an important thing to keep in mind.

We will read in this case the EXIF Orientation out of the file, but you can override it if you have to for those formats that don't support it.
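
A small sketch of the two options for a file on disk: let Vision read the orientation from the file, or override it. The helper name `makeHandler(for:knownOrientation:)` is hypothetical; the two `VNImageRequestHandler` initializers are the ones Vision provides for URLs.

```swift
import Foundation
import ImageIO
import Vision

/// Builds an image request handler for a file on disk.
/// Pass `knownOrientation` only when the file format carries no orientation
/// metadata, or when the metadata is known to be wrong.
func makeHandler(for url: URL,
                 knownOrientation: CGImagePropertyOrientation? = nil) -> VNImageRequestHandler {
    if let orientation = knownOrientation {
        // Explicit override supplied by the caller.
        return VNImageRequestHandler(url: url, orientation: orientation, options: [:])
    }
    // No override: the orientation is taken from the file where available.
    return VNImageRequestHandler(url: url, options: [:])
}
```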

If you're already using Core Image in your application, by all means pass in the CIImage.

This is also important when you want to actually do some preprocessing.

If you have some domain knowledge of what you want to do in your Computer Vision task you can do some preprocessing and try and enhance the image and, therefore, enhance the Vision results.

If you want to learn a bit more about Core Image, there's a session on Thursday at 1:50 and they will also show the integration with our Vision framework.

Last but not least, if you have all the images in your UI you can use the CGImage [inaudible] out of the NSImage or the UIImage, let's say it comes from the UIImagePicker, and pass those into Vision.

Now what am I going to do with the image and that's where we have to decide if I want to do something interactive with the image in that case I use my ImageRequestHandler.

It will hold on to the image for the time and I can do multiple passes on that image and get the best results out of that.

Now the CVPixelBuffer technically will allow you that you could change the pixels [inaudible], but we see them as immutable so don't do that because we'll get some strange results.

Next, if you want to track something we use the SequenceRequestHandler.

It allows us to keep the tracking state, and the lifecycle of my image is not tied to that request handler anymore, but just to how long it is needed for the tracking.

Performance, so these Vision tasks are computationally intensive very often and they do take time, so you have to think about that you want to actually run your task on a different queue not your main queue.

And think about if you want to do it in the background, which is a bit slower or if you need it very quickly use a more interactive quality of service to get the performance.

A good practice is to use the completion handler to get the results back, this is part of our API.

But keep in mind that this completion handler gets called on that queue in which you actually set it off.

So, if you need to update your UI you have to dispatch that back to the main queue.
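
One way to arrange this, sketched under assumptions: perform the request on a global queue with a chosen quality of service, receive the results in the request's completion handler, and hop back to the main queue before touching the UI. The function name and its `completion` parameter are hypothetical. In this sketch the callback is only invoked from the request's completion handler; a failure thrown by `perform` is just logged.

```swift
import Foundation
import Vision

/// Runs face detection off the main queue and reports results on the main queue.
func detectFacesInBackground(at imageURL: URL,
                             completion: @escaping ([VNFaceObservation]) -> Void) {
    let request = VNDetectFaceRectanglesRequest { request, error in
        // This runs on the queue that performed the request, not the main queue.
        let faces = request.results as? [VNFaceObservation] ?? []
        DispatchQueue.main.async {
            completion(faces)   // safe place to update the UI
        }
    }

    // Pick .background for slower, lower-priority work,
    // or .userInitiated / .userInteractive when results are needed quickly.
    DispatchQueue.global(qos: .userInitiated).async {
        let handler = VNImageRequestHandler(url: imageURL, options: [:])
        do {
            try handler.perform([request])
        } catch {
            print("Vision request failed: \(error)")
        }
    }
}
```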

So as Brett already highlighted, we have a new face detection and you might say oh God, yet another one.

But we have good reasons for this.

Vision uses deep learning and this gives us really a lot better precision and recall, therefore, much better results.

The downside of it, on older hardware it will run a bit slower.

So, let's look a little bit at our overall landscape of face detectors that we have available.

So, we have Vision which really gives us the best results and it's pretty fast and also pretty good in its power use as it is optimized for that.

And we have it available on all platforms except watchOS.

And this is the same in terms of availability for Core Image and it's a bit faster, but the results are not quite as good.

In the AV capture session which is only happening during the capture side we can actually use hardware so it's really fast in performance, but the results again are not as good as we get out of Vision.

So, you have to choose depending on your application what you want to do choose the right technology for the face detection.

Now I did mention that our quality is better, so let me try to prove that a little bit.

So, I have here an image and I ran the face detection through Core Image.

And we find two faces and we see roughly where the eyes and where the mouth are.

Now in Vision we find all four faces even the occluded ones and we get a whole lot more details with the visual landmarks.

Speaking of Core Image, I would like to highlight a little bit what's happening with the CIDetectors.

So, whoever uses them already can keep on using them, they are still in Core Image, but all new parts and all the improvements in terms of algorithms for Computer Vision moving forward will be in Vision; that's the new home for Computer Vision.

So, an awful lot of slides, how about a demo.

So, what I'm going to show you is an application that runs an AV capture session on the device if the demo Gods are with us.

And we will do a very simple rectangle detection request.

So, what do I have to do to set this up?

What you see here is I create my request, that's my simple rectangle detection request in this case.

I'm actually in the wrong sample that's why I'm getting confused here, my apologies.

Here we go.

Okay, we have our rectangle detection request and I'm setting some parameters just as an example here. I only want rectangles of a minimum size, and our coordinates are normalized, so I only want a minimum size of 10% of the image, and I just want 20 rectangles.

I could get more, but I want 20, I just picked a number.

I set up my array of the requests that I want to perform, and right here is our completion handler, and all that I'm going to do is draw my rectangles, but as you notice I'm dispatching it to the main queue to update our UI.
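
The demo source is not part of this transcript, so the following is only a sketch of a rectangle request configured with the two parameters mentioned (10% minimum size, at most 20 rectangles). The factory function name and the `onRectangles` callback are hypothetical; `minimumSize` and `maximumObservations` are properties of `VNDetectRectanglesRequest`.

```swift
import Foundation
import Vision

/// Builds a rectangle detection request that reports results on the main queue.
func makeRectangleRequest(
    onRectangles: @escaping ([VNRectangleObservation]) -> Void
) -> VNDetectRectanglesRequest {
    let request = VNDetectRectanglesRequest { request, _ in
        let rectangles = request.results as? [VNRectangleObservation] ?? []
        // Drawing touches the UI, so hand the results to the main queue.
        DispatchQueue.main.async {
            onRectangles(rectangles)
        }
    }
    request.minimumSize = 0.1          // normalized: 10% of the image
    request.maximumObservations = 20   // report at most 20 rectangles
    return request
}
```

The returned request can then be placed in the array handed to the request handler.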

Where do our images come from?

So, we look at the capture output here and as I promised, in the capture output we get our pixelBuffer from the CMSampleBuffer.

Right here I'm getting the cameraIntrinsics.

Now this is something that is important in some of these Computer Vision tasks, where we actually know what the camera is kind of looking at.

As I mentioned, we don't forget the EXIF orientation, and I create an image request handler and perform our tasks.
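
A sketch of what such a capture callback could look like. The `FrameAnalyzer` class and its `requests` property are hypothetical, and the fixed `.right` orientation is an assumption (a portrait device using the back camera); a real app would derive the orientation from the device state.

```swift
import AVFoundation
import ImageIO
import Vision

/// Receives camera frames and runs a set of Vision requests on each one.
final class FrameAnalyzer: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    /// Requests to run per frame, for example a rectangle detection request.
    var requests: [VNRequest] = []

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // The pixel buffer comes out of the CMSampleBuffer.
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        // Pass the camera intrinsics along when the capture pipeline provides them.
        var options: [VNImageOption: Any] = [:]
        if let intrinsics = CMGetAttachment(sampleBuffer,
                                            key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                            attachmentModeOut: nil) {
            options[.cameraIntrinsics] = intrinsics
        }

        // Assumption: portrait device, back camera.
        let orientation = CGImagePropertyOrientation.right

        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                            orientation: orientation,
                                            options: options)
        do {
            try handler.perform(requests)
        } catch {
            print("Vision request failed: \(error)")
        }
    }
}
```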

So, how does this look when we actually run it?

All right, so what we're going to see here is that now we tracked this rectangle and that's as simple as it is, we can find other rectangles.

If the cable is long enough we can actually look oh, there we find a computer with various rectangles.

Now I chose the yellow kind of on purpose because it's the same color as you saw in the demo during the keynote for the new document camera in Notes.

And I borrowed their color because they borrowed our code to do actually the rectangle detection.

[ Applause ]

Thank you.

[ Applause ]

Now that was simple, let's do a bit more.

So, how about we throw some machine learning at this as well just for the fun of it.

So, what I have to do is I have a little model that I just dragged into my project here.

And that is a classifier that will tell us a bit something about the image.

And we see when we look at this part here that we need to feed it an image of a very strange size and get out of it some classification.

Now you don't need to worry about that size because Vision will do the work for you.

So, what do I need to do?

I first need to create a Vision model and my request with that.

And that is the part that we have here, so I'm simply loading the inception model and I create my classification request.

Now it tells me there's something missing and I will get to that in just a moment.

The last thing I want to highlight here is that it says it wants a square image, but our cameras don't see squares, so I need to tell it how to handle, you know, the aspect ratio that I want to use.

And I say okay, I just want a center crop.
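
Wrapping a Core ML model for Vision and choosing how non-square frames are fitted might be sketched as below. The function name is hypothetical, and the `MLModel` is passed in by the caller because the demo's model class is not shown in this transcript.

```swift
import CoreML
import Vision

/// Wraps a Core ML model in a Vision request that center-crops the input image.
func makeClassificationRequest(
    for model: MLModel,
    completion: @escaping VNRequestCompletionHandler
) throws -> VNCoreMLRequest {
    // Vision inspects the model to learn what image size and format it expects.
    let visionModel = try VNCoreMLModel(for: model)

    let request = VNCoreMLRequest(model: visionModel, completionHandler: completion)

    // The camera frame is not square; crop the center to match the model input.
    request.imageCropAndScaleOption = .centerCrop
    return request
}
```

Other values of `VNImageCropAndScaleOption`, such as `.scaleFit` and `.scaleFill`, trade cropping for letterboxing or distortion.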

So, I need a completion handler for my task and I have that already pre canned here as well.

So, in this completion handler I simply look at my observation and I will get so this classifier can see a thousand different things and I don't want to show all of them I only show the ones that I care about.

So, what I'm doing is a little bit of filtering, I only take the top four and I only look at the ones that have a confidence of at least 30%, it just works well for my demo here, but you know you will figure out what kind of works well for your model.
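
The filtering step can be sketched on its own, independent of Vision. The `Classification` struct below is a hypothetical stand-in for the identifier and confidence carried by each `VNClassificationObservation`; the limit of four and the 30% threshold are the values mentioned for this demo and would be tuned per model.

```swift
/// Hypothetical stand-in for the identifier/confidence pair of a classification result.
struct Classification {
    let identifier: String
    let confidence: Float
}

/// Keeps the `limit` most confident classifications, then drops any below the threshold.
func topLabels(from classifications: [Classification],
               limit: Int = 4,
               minimumConfidence: Float = 0.3) -> [String] {
    return classifications
        .sorted { $0.confidence > $1.confidence }          // most confident first
        .prefix(limit)                                      // top four by default
        .filter { $0.confidence >= minimumConfidence }      // at least 30% by default
        .map { $0.identifier }
}
```

In a completion handler, the input array could be built by mapping the request's classification observations to `Classification(identifier:confidence:)`.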

And all I have to do next is add my classification request to my array of requests, and now I will actually run two requests.

So, I have this already loaded on my device, let's see how this actually looks.

Of course, you will see it when I switch to the correct machine, there we go.

Okay, so we have a coffee mug which is empty, somebody better fill that for me.

We have a ballpoint pen, we have a padlock and look an iPod.

Who has stolen those empty cards away and didn't realize it was an iPod?

[ Applause ]

All right, let's go back to the slides before we get to the next show-and-tell.

For my next demo, I want to do something a little bit more elaborate and with that I chose something that's called MNISTVision.

People in the machine learning community have already looked at that a little bit more.

MNIST is a dataset where a bunch of government employees and high school students wrote numbers down, and this was marked up, and people have trained classifiers on that.

Note this is basically like white numbers on black background, so I guess they've written it with chalk on an old blackboard.

So, in this sample code I'm going to show you I want to show a few concepts that are kind of important like making something a bit more elaborate with Vision.

First, we'll spin off multiple requests on top of each other, then we use Core Image in between to do some image processing, and last but not least, we use Core ML again for the machine learning part.

So, how is this going to work?

We have here an image on which we find a sticky note.

Well we find it by using the rectangle detector, there's our sticky note.

Now that is perspectively distorted and it's clearly not white text on a black background.

So, we use Core Image in the next step and we'll actually do the perspective correction of it, invert the color, and also enhance the contrast so that we get this black-and-white image.

And last but not least, I need to run my MNIST classifier on it and it should tell me that this is the number four and this has 80% confidence that this is the number four.

Again, let's see how this looks in the app.

So again, I start off as a rectangle detector request, it's my favorite I know.

But it's more interesting what I'm going to do in the completion handler.

So, I do some validation just to make sure that the rectangles I'm getting out of it are actually okay.

But the interesting part happens here.

I get the coordinates of the corners and feed them into CI to use the CIPerspectiveCorrection.

That allows me to take this perspectively distorted image and actually bring it upright as if I had the camera straight on.

I use the CIColorControls to really bring out the contrast of the image to make it kind of binarized.

And as I said, I have to color invert it.

Now the resulting image of that I feed into a new request in there because we have a new image on which I'll run the classification.
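
A sketch of that Core Image step, assuming the rectangle's normalized corners only need to be scaled to the image extent (Vision and Core Image both use a lower-left origin). The function name is hypothetical and the saturation and contrast values are illustrative choices, not values confirmed by the session.

```swift
import CoreImage
import Vision

/// Straightens the detected rectangle, pushes the contrast, and inverts the colors.
func preparedDigitImage(from image: CIImage,
                        rectangle: VNRectangleObservation) -> CIImage {
    let extent = image.extent

    // Vision corners are normalized (0...1); scale them into image coordinates.
    func scaled(_ point: CGPoint) -> CIVector {
        return CIVector(x: extent.origin.x + point.x * extent.width,
                        y: extent.origin.y + point.y * extent.height)
    }

    return image
        // Bring the distorted rectangle upright, as if photographed straight on.
        .applyingFilter("CIPerspectiveCorrection", parameters: [
            "inputTopLeft": scaled(rectangle.topLeft),
            "inputTopRight": scaled(rectangle.topRight),
            "inputBottomLeft": scaled(rectangle.bottomLeft),
            "inputBottomRight": scaled(rectangle.bottomRight)
        ])
        // Drop the color and raise the contrast to get a nearly binary image.
        .applyingFilter("CIColorControls", parameters: [
            kCIInputSaturationKey: 0,
            kCIInputContrastKey: 32
        ])
        // Dark ink on a light note becomes light digits on a dark background.
        .applyingFilter("CIColorInvert")
}
```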

So, how does the classification look like?

For the classification I use my MNIST model, which I actually have ready, and this is a small model that I trained on this laptop very easily with a few lines of script code, and then thanks to Core ML I can just drag that in and use it very easily.

So, I have my model here.

Now again, this one part I would like to highlight this, so that takes in this case a very small grayscale image.

So, an image 28 by 28 pixels it should be able to read these numbers.

So that's where my classification is coming from and now I need to feed in the image.

So, this sample code has been made available for the session, and to make it easy to run in the simulator, instead of running it live off of the camera I'm just going to use the UIImagePicker and feed the image into my VNImageRequestHandler and let it perform the rectangle request.

Now notice I buried the request for the classification into the completion handler of my rectangle detection and that allows us to basically cascade multiple requests on top of each other.
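
Cascading requests in this way could be sketched as follows. The function name and the `prepare` closure are hypothetical; `prepare` stands for whatever image processing sits between the two requests, and the classification request is assumed to have been created elsewhere with its own completion handler.

```swift
import CoreImage
import Vision

/// Builds a rectangle request whose completion handler kicks off a second request
/// on a new, processed image.
func makeCascadedRequest(
    inputImage: CIImage,
    classificationRequest: VNCoreMLRequest,
    prepare: @escaping (CIImage, VNRectangleObservation) -> CIImage
) -> VNDetectRectanglesRequest {
    return VNDetectRectanglesRequest { request, _ in
        // Stage 1 result: take the first detected rectangle, if there is one.
        guard let rectangle = (request.results as? [VNRectangleObservation])?.first else {
            return
        }

        // In between: produce the image the classifier should actually see.
        let processed = prepare(inputImage, rectangle)

        // Stage 2: a new image means a new image request handler.
        let handler = VNImageRequestHandler(ciImage: processed, options: [:])
        do {
            try handler.perform([classificationRequest])
        } catch {
            print("Classification failed: \(error)")
        }
    }
}
```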

So, let's try the demo for this.

Okay, so I have my app here and well [inaudible] giveaway.

Okay, so what I see here is again I have my image on the top, this was actually the photo that I took earlier.

We see it's correctly classified as a number one, with a really high confidence in this case.

And what you see on the bottom is just basically just to visualize that I took this intermediate image that we created in CI and show this as well.

Let's choose another number.

Yes, this is the number three.

Can we guess what this number is? It's the number four.

It works correctly.

All right, thank you.

[ Applause ]

Let me go back to our slides.

So that is our Vision framework.

Let's recap a little bit what we have really seen here.

So, Vision is a high-level framework for Computer Vision and it should really make it easy for you to use this in your applications even if you're not a Computer Vision expert.

We have various detectors and there's a whole variety of that and they all run through one consistent interface which would make it very easy to learn that set of APIs.

And last but not least, the integration with Core ML.

By bringing your own custom models you can do a lot in your application, you can find hotdogs and see if they are really hotdogs.

I had to make that joke.

So, if you want to learn more about our session, please go to our website and I would definitely highlight there are some related sessions that you should have watched perhaps in the past.

I'll read you the Core ML one, but you can find it on our website.

Please come for our get-together that we have at 6:30 today, chat about what we can do.

And for the little bit more advanced part of Core ML there's a session on Thursday, as well as we have a session with Core Image where they will also do some very fancy stuff with Core Image and Vision.

And with that I'd like to thank you for coming today and enjoy the rest of WWDC.

Thank you.

[ Applause ]
