Understanding Images in Vision Framework

Session 222 WWDC 2019

Learn all about the many advances in the Vision Framework including effortless image classification, image saliency, determining image similarity, and improvements in facial feature detection, and face capture quality scoring. This packed session will show you how easy it is to bring powerful computer vision techniques to your apps.

[ Music ]

[ Applause ]

Good morning.

My name is Brittany Weinert, and I'm a software engineer on the Vision Framework Team.

This year the Vision Team has a lot of exciting new updates that we think you're all going to love.

Because we have so much new stuff to cover, we're going to dive right into the new features.

If you're completely new to Vision, don't worry.

You should still be able to follow along, and our hope is that the new capabilities that we introduce today will motivate you to learn about Vision and to use it in your apps.

Today we'll be covering four completely new topics: saliency, image classification, Image Similarity, and face quality.

We also have some technology upgrades for the Object Tracker and Face Landmarks as well as new detectors and improved Core ML support.

Today, I'm going to be talking about saliency.

Let's start with a definition.

I'm about to show you a photo, and I want you to pay attention to where your eyes are first drawn.

When you first saw this photo of the three puffins sitting on a cliff, did you notice what stood out to you first?

According to our models, most of you looked at the puffins' faces first.

This is saliency.

There are two types of saliency, attention based and objectness based.

The overlay that you saw on the puffin image just now, called the heatmap, was generated by attention based saliency.

But before we get into more visual examples, I want to go over the basics of each algorithm.

Attention based saliency is a human aspected saliency, and by this, I mean that the attention based saliency models were generated by where people looked when they were shown a series of images.

This means that the heatmap reflects and highlights where people first look when they're shown an image.

Objectness based saliency on the other hand was trained on subject segmentation in an image with the goal to highlight the foreground objects or the subjects of an image.

So, in the heatmap, the subjects or foreground objects should be highlighted.

Let's look at some examples now.

So, here are the puffins from earlier.

Here's the attention based heatmap overlaid on the image, and here's the objectness based heatmap.

As I said, people tend to look at the puffins' faces first, so the area around the puffins' heads is very salient for the attention based heatmap.

For objectness, we're just trying to pick up the subjects, and in this case, it's the three puffins.

So, all the puffins are highlighted.

Let's look at how saliency works with images of people.

For attention based saliency, the areas around people's faces tend to be the most salient, unsurprisingly, because we tend to look at people's faces first.

For objectness based saliency, if the person is the subject of the image, the entire person should be highlighted.

So, attention based saliency though is the more complicated of the two saliencies, I'd say, because it is determined by a number of very human factors.

The main factors that determine attention based saliency, and what's salient or not, are contrast, faces, subjects, horizons, and light.

But interestingly enough, it can also be affected by perceived motion.

In this example, the umbrella colors really pop, so the area around the umbrella is salient, but the road is also salient because our eyes try to track where the umbrella is headed.

For objectness based saliency, we just pick up on the umbrella guy.

So, I could do this all day and show you more examples, but honestly, the best way to understand saliency is to try it out for yourself.

I encourage everybody to download the Saliency app and try it on their own photo libraries.

So, let's get into what's returned from the saliency request, mainly the heatmap.

So, in the images that I've been showing to you up until now, the heatmap has been scaled, colorized, and overlaid onto the image, but in actuality, the heatmap is a very small CVPixelBuffer that's made up of floats in the range of 0 to 1, with 0 designating nonsalient and 1 being most salient.

And there's extra code that you'd have to write to get the exact same effect like you see here.

But let's go into how to formulate a request at the very basic level.

Okay. So, first we start out with a VNImageRequestHandler to handle a single image.

Next, you choose the algorithm that you want to run, in this case, AttentionBasedSaliency, and set the revision if you always want to be using the same algorithm.

Next, you call perform request, like you usually would, and if it's successful, the results property on the request should be populated with a VNSaliencyImageObservation.

To access the heatmap, you call the pixelBuffer property on the VNSaliencyImageObservation like so.

If you wanted to do objectness based saliency, all you would have to do is change the request name and the revision to be objectness.

So, for attention, it's VNGenerateAttentionBasedSaliencyImageRequest, and for objectness, it's VNGenerateObjectnessBasedSaliencyImageRequest.
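The request steps just described can be sketched roughly as follows. This is a minimal sketch, not the code from the slides: the helper name `attentionHeatmap(forImageAt:)` is hypothetical, and the image is assumed to be loaded from a file URL.

```swift
import Foundation
import CoreVideo
import Vision

/// Runs attention based saliency on one image and returns the heatmap.
/// `attentionHeatmap(forImageAt:)` is a hypothetical helper, not a Vision API.
func attentionHeatmap(forImageAt imageURL: URL) throws -> CVPixelBuffer? {
    // A request handler performs requests against a single image.
    let handler = VNImageRequestHandler(url: imageURL, options: [:])

    // Choose the algorithm and pin the revision so behavior stays the same
    // when the code is later built against a newer SDK.
    let request = VNGenerateAttentionBasedSaliencyImageRequest()
    request.revision = VNGenerateAttentionBasedSaliencyImageRequestRevision1

    try handler.perform([request])

    // On success, the results contain a VNSaliencyImageObservation.
    let observation = request.results?.first as? VNSaliencyImageObservation

    // The heatmap is a small buffer of floats in the range 0 to 1.
    return observation?.pixelBuffer
}
```

For objectness based saliency, the same sketch should work with `VNGenerateObjectnessBasedSaliencyImageRequest` and `VNGenerateObjectnessBasedSaliencyImageRequestRevision1` swapped in.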

So, let's get into another tool other than the heatmap, the bounding box.

The bounding boxes encapsulate all the salient regions in an image.

For attention based saliency, you should always have one bounding box, and for objectness based saliency, you can have up to three bounding boxes.

The bounding boxes are in normalized coordinate space with respect to the image, the original image, and the lower left-hand corner is the origin point, much like bounding boxes returned by other algorithms in Vision.

So, I wrote up a small method to show how to access the bounding boxes and use them.

Here we have a VNSaliencyImageObservation, and all you have to do is access the salientObjects property on that observation, and you should get a list of bounding boxes, and you can access them like so.
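Because the bounding boxes are normalized with a lower-left origin, they usually need converting before drawing. The sketch below shows one way to do that conversion; `pixelRect(fromNormalized:imageWidth:imageHeight:)` is a hypothetical helper, and the sample rectangle stands in for the `boundingBox` of one entry in `salientObjects`.

```swift
import Foundation

/// Converts a normalized rectangle (lower-left origin, values in 0...1)
/// into pixel coordinates with a top-left origin.
/// This is a hypothetical helper, not part of Vision.
func pixelRect(fromNormalized rect: CGRect, imageWidth: Int, imageHeight: Int) -> CGRect {
    let width = CGFloat(imageWidth)
    let height = CGFloat(imageHeight)
    return CGRect(
        x: rect.origin.x * width,
        // Flip the vertical axis: measure from the top edge instead of the bottom.
        y: (1 - rect.origin.y - rect.size.height) * height,
        width: rect.size.width * width,
        height: rect.size.height * height
    )
}

// Stand-in for observation.salientObjects?.first?.boundingBox.
let normalizedBox = CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let pixels = pixelRect(fromNormalized: normalizedBox, imageWidth: 400, imageHeight: 200)
print(pixels)
```

If the lower-left origin should be kept, Vision's own `VNImageRectForNormalizedRect` function scales a normalized rectangle to image dimensions without flipping it.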

Okay. So, now that you know how to formulate a request and now that you know what saliency is, let's get into some of the use cases.

First, for a bit of fun, you can use saliency as a graphical mask to edit your photos with.

So, here you have the heatmaps.

On the left-hand side, I've desaturated all the nonsalient regions, and on the right-hand side, I've added a Gaussian blur to all the nonsalient regions.

It really makes the subjects pop.

Another use case of saliency is you can enhance your photo viewing experience.

So, let's say that you're at home.

You're sitting on the couch, and either your TV or your computer has gone into standby mode, and it's going through your photo library.

A lot of times, these photo-showing algorithms can be a little bit awkward.

They zoom into seemingly random parts of the image, and it's not always what you expect.

But with saliency, you always know where the subjects are, so you can get a more documentary-like effect like this.

Finally, saliency works really great with other vision algorithms.

Let's say we have an image, and we want to classify the objects in the image.

We can run objectness based saliency to pick up on the objects in the image, crop the image to the bounding boxes returned by objectness based saliency, and run these crops through an image classification algorithm to find out what the objects are.

So, not only do you know where they are in the image because of the bounding boxes, but it allows you to hone in on what the objects are by just picking out the crops that have those objects in them.
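One way to sketch this pipeline is shown below. Instead of cropping the image by hand, it sets `regionOfInterest` on the classification request, on the assumption that the classifier honors that property the way other image based requests do. The type `SalientRegionLabels` and the function name are hypothetical.

```swift
import Foundation
import Vision

/// Hypothetical container pairing a salient region with its classifications.
struct SalientRegionLabels {
    let box: CGRect
    let labels: [VNClassificationObservation]
}

/// Finds salient objects, then classifies each region separately.
func classifySalientObjects(inImageAt imageURL: URL) throws -> [SalientRegionLabels] {
    let handler = VNImageRequestHandler(url: imageURL, options: [:])

    // Step 1: objectness based saliency gives up to three bounding boxes.
    let saliency = VNGenerateObjectnessBasedSaliencyImageRequest()
    try handler.perform([saliency])
    let observation = saliency.results?.first as? VNSaliencyImageObservation
    let boxes = (observation?.salientObjects ?? []).map { $0.boundingBox }

    // Step 2: classify each region. regionOfInterest uses the same
    // normalized, lower-left-origin coordinates as the bounding boxes.
    var output: [SalientRegionLabels] = []
    for box in boxes {
        let classify = VNClassifyImageRequest()
        classify.regionOfInterest = box
        try handler.perform([classify])
        let labels = (classify.results as? [VNClassificationObservation]) ?? []
        output.append(SalientRegionLabels(box: box, labels: labels))
    }
    return output
}
```

Cropping the image to each box and creating a new request handler per crop would be the more literal version of the steps described above.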

Now, you can already classify things with Core ML, but this year, Vision has a new image classification technique that Rohan will now present to you.

[ Applause ]

Good morning.

My name is Rohan Chandra, and I'm a researcher on the Vision Team.

Today, I'm going to be talking about some of the new image classification requests we're introducing to the Vision API this year.

Now, image classification as a task is fundamentally meant to answer the question, what are the objects that appear in my image.

Many of you will already be familiar with image classification.

You may have used Create ML or Core ML to train your own classification networks on your own data as we showed in the Vision with Core ML talk last year.

Others of you may have been interested in image classification but felt you lacked the resources or the expertise to develop your own networks.

In practice, developing a large-scale classification network from scratch can take millions of images to annotate, thousands of hours to train, and very specialized domain expertise to develop.

We here at Apple have already gone through this process, and so we wanted to share our large-scale, on-device classification network with you so that you can leverage this technology without needing to invest a huge amount of time or resources into developing it yourself.

We've also strived to put tools in the API to help you contextualize and understand the results in a way that makes sense for your application.

Now, the network we're talking about exposing here is in fact the same network we ourselves use to power the photo search experience.

This is a network we've developed specifically to run efficiently on device without requiring any server-side processing.

We've also developed it to identify over a thousand different categories of objects.

Now, it's also important to note that this is a multi-label network capable of identifying multiple objects in a single image, in contrast to more typical mono-label networks that try to focus on identifying a single large central object in an image.

Now, as I talk about this new classification API, I think one of the first questions that comes to mind is what are the objects it can actually identify?

Well, the set of objects that a classifier can predict is known as the taxonomy.

The taxonomy has a hierarchical structure with directional relationships between classes.

These relationships are based upon shared semantic meaning.

For instance, a class like dog might have children like Beagle, Poodle, Husky, and other sub-breeds of dogs.

In this sense, a parent class tends to be more general while child classes are more specific instances of their parent.

You can of course see the entire taxonomy using VNClassifyImageRequest.knownClassifications.

Now, when we constructed the taxonomy, we had a few specific rules that we applied.

The first is that the classes must be visually identifiable.

That is, we avoid more abstract concepts like holiday or festival.

We also avoid any classes that might be considered controversial or offensive as well as those to do with proper names, nouns, excuse me, adjectives, or basic shapes.

Finally, we omit occupations, and this might seem odd at first.

But consider the range of answers you'd get if we asked something like what does an engineer look like.

There probably isn't a single concise description you could give that would apply to every engineer aside from sleep deprived and usually glued to a computer screen.

Let's take a look at the code you need to use in order to classify an image.

So, as usual, you form an ImageRequestHandler to your source image.

You then perform the VNClassifyImageRequest and retrieve your observations.

Now, in this case, you actually get an array of observations, one for every class in the taxonomy and its associated confidence.

In a mono-label problem, you'd probably expect that these probabilities sum up to 1, but this is a multi-label classification network, and each prediction is an independent confidence associated with a particular class.

As such, they won't sum to 1, and they're meant to be compared within the same class, not across different classes.

So we can't simply take the max amongst them in order to determine our final prediction.
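As a rough sketch of the classification steps above, assuming the image is read from a file URL (the function name `classify(imageAt:)` is hypothetical):

```swift
import Foundation
import Vision

/// Classifies one image and returns every observation in the taxonomy.
/// `classify(imageAt:)` is a hypothetical helper, not a Vision API.
func classify(imageAt imageURL: URL) throws -> [VNClassificationObservation] {
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    let request = VNClassifyImageRequest()
    try handler.perform([request])

    // One observation per class, each with an independent confidence.
    // The confidences are not expected to sum to 1.
    return (request.results as? [VNClassificationObservation]) ?? []
}
```

Each returned observation carries an `identifier` and a `confidence`; because the confidences are independent per class, they are filtered per class rather than reduced with a single max.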

You might be wondering then, how do I deal with so many classes and so many numbers.

Well, there are a few key tools in the API that we've implemented to help you make sense of the result.

Now, in order to talk about these tools in the API, we first need to define some basic terms.

The first is when you get a confidence for a class, we typically compare that to a class-specific threshold, which we refer to as an operating point.

If the class confidence is above the threshold, then we say that class is present in the image.

If the class confidence is below the class threshold, then we say that object is not present in the image.

In this sense, we want to pick thresholds such that objects with the target class typically have a confidence higher than the threshold, and images without the target class typically have a score lower than the threshold.
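The idea of a class-specific operating point can be illustrated with a small standalone sketch. All class names, confidences, and thresholds below are made up for illustration; the real operating points are internal to Vision.

```swift
/// Returns the names of classes whose confidence exceeds their own threshold.
/// This is an illustrative helper, not a Vision API.
func presentClasses(confidences: [String: Float],
                    operatingPoints: [String: Float]) -> [String] {
    let present = confidences.filter { entry in
        // A class with no known operating point is treated as not present.
        guard let threshold = operatingPoints[entry.key] else { return false }
        return entry.value > threshold
    }
    return present.keys.sorted()
}

// Hypothetical per-class thresholds and network outputs.
let operatingPoints: [String: Float] = ["motorcycle": 0.6, "moped": 0.4, "dog": 0.5]
let confidences: [String: Float] = ["motorcycle": 0.72, "moped": 0.31, "dog": 0.55]

let present = presentClasses(confidences: confidences, operatingPoints: operatingPoints)
print(present)
```

Note that each confidence is compared only against the threshold of its own class, never against the confidence of another class.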

However, machine learning is not infallible, and there will be instances where the network is unsure and the confidence is proportionally lower.

This can happen when objects are obfuscated, or appear in odd lighting or at odd angles, for instance.

So how do we pick our thresholds?

Well, there are essentially three different regimes we can be in depending on our choice of threshold that yield three different kinds of searches.

To make this a little more concrete, let's say I have a library of images for which I've already performed classification and stored the results.

Let's say in this particular case I'm looking for images of motorcycles.

Now, I want to pick my thresholds such that images with motorcycles typically have a confidence higher than this threshold and images without motorcycles typically have a score lower than this threshold.

So, what happens if I just pick a low threshold.

As you can see behind me, when I apply this low threshold, I do in fact get my motorcycle images, but I'm also getting these images of mopeds in the bottom right.

And if my users are motorcycle enthusiasts, they might be a little annoyed with that result.

When we talk about a search that tries to maximize the percentage of the target class retrieved amongst the entire library, and isn't as concerned with these mispredictions where we say the motorcycle is present when it actually isn't, we are typically talking about a high recall search.

Now, I could maximize recall by simply returning as many images as possible, but I would get a huge number of these false predictions where I say my target class is present when it actually isn't, and so we need to find a more balanced point of recall to operate at.

Let's take a look at how I need to change my code in order to perform this high recall search.

So, here I have the same code snippet as before, but this time I'm performing a filtering with hasMinimumPrecision and a specific recall value.

For each observation in my array of observations, the filter only retains it if the confidence associated with the class achieves the level of recall that I specified.

Now, the actual operating point needed to determine this is going to be different for every class, and it's something we've determined based on our internal tests of how the network performs on every class in the taxonomy.

However, the filter handles this for you automatically.

All you need to do is specify the level of recall you want to operate at.
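A high recall filter along these lines might look like the sketch below. The function name is hypothetical, and the values 0.01 and 0.7 are illustrative choices rather than recommended settings.

```swift
import Foundation
import Vision

/// Classifies an image and keeps the classes that pass a high recall filter.
/// `highRecallLabels(forImageAt:recall:)` is a hypothetical helper.
func highRecallLabels(forImageAt imageURL: URL, recall: Float = 0.7) throws -> [String] {
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    let request = VNClassifyImageRequest()
    try handler.perform([request])
    let observations = (request.results as? [VNClassificationObservation]) ?? []

    // A very low minimum precision (0.01, illustrative) leaves the choice of
    // operating point almost entirely to the requested recall level.
    return observations
        .filter { $0.hasMinimumPrecision(0.01, forRecall: recall) }
        .map { $0.identifier }
}
```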

So, we talked about a high recall search here, but what if I have an application that can't tolerate these false predictions where I'm saying motorcycles are present when they're not.

That is, I want to be absolutely sure that the images I retrieve actually do contain a motorcycle.

Well, let's come back to our library of images then and see what would happen if we applied the higher threshold.

As you can see behind me, when I apply my high threshold, I do in fact only get motorcycle images, but I get far fewer images overall.

When we talk about a search that tries to maximize the percentage of the target class amongst the retrieved images and isn't as concerned with overlooking some of the more ambiguous images that actually do contain the target class, we are typically talking about a high precision search.

Again, like with high recall, we need to find a more balanced operating point where I have an acceptable likelihood of my target class appearing in my results, but I'm not getting too few images.

So, let's take a look at how I need to modify my code in order to perform this high precision search.

So here's the same code snippet, but this time my filtering is done with hasMinimumRecall and a precision value I've specified.

Again, I only retain the observation if the confidence associated with it achieves the level of precision that I specified.

The actual threshold needed for this is going to be different for every class, but the filter handles that for me automatically.

All I need to do is tell it the level of precision I want to operate at.
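The high precision variant can be sketched as a filter over observations that have already been retrieved. The function name is hypothetical, and 0.01 and 0.9 are illustrative values.

```swift
import Vision

/// Keeps only the observations that pass a high precision filter.
/// `highPrecisionMatches(in:precision:)` is a hypothetical helper.
func highPrecisionMatches(in observations: [VNClassificationObservation],
                          precision: Float = 0.9) -> [VNClassificationObservation] {
    // A very low minimum recall (0.01, illustrative) leaves the choice of
    // operating point almost entirely to the requested precision level.
    return observations.filter { $0.hasMinimumRecall(0.01, forPrecision: precision) }
}
```

The per-class threshold is still chosen by the filter; only the target precision is supplied by the caller.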

So we've talked about two different extremes here, one of high recall and one of high precision, but in practice, it can be better to find a balanced tradeoff between the two.

So, let's see how we can go about doing that, and in order to understand what's happening, I first need to introduce something known as the precision and recall curve.

So, in practice, there is a tradeoff to be made where increasing one of precision and recall can lead to a decrease in the other.

I can represent this tradeoff as a graph, where for each operating point I can compute the corresponding precision and recall.

For instance, at the operating point where I achieve a recall of 0.7, I find that I get a corresponding precision of 0.74.

I can compute this for a multitude of operating points in order to form my full curve.
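To make the curve concrete, here is a small standalone sketch that computes precision and recall at several thresholds for one class. The six labeled samples are invented for illustration and are not from the talk.

```swift
/// One image's confidence for the target class, plus the ground truth.
struct Sample {
    let confidence: Float
    let isTarget: Bool
}

/// Computes precision and recall when everything at or above `threshold`
/// is retrieved. Returns nil when either ratio would be undefined.
func precisionRecall(at threshold: Float, in samples: [Sample]) -> (precision: Double, recall: Double)? {
    let retrieved = samples.filter { $0.confidence >= threshold }
    let totalTargets = samples.filter { $0.isTarget }.count
    guard !retrieved.isEmpty, totalTargets > 0 else { return nil }

    let truePositives = retrieved.filter { $0.isTarget }.count
    // Precision: share of retrieved images that really contain the target.
    let precision = Double(truePositives) / Double(retrieved.count)
    // Recall: share of all target images that were retrieved.
    let recall = Double(truePositives) / Double(totalTargets)
    return (precision: precision, recall: recall)
}

// Hypothetical library: four images with the target class, two without.
let samples = [
    Sample(confidence: 0.9, isTarget: true),
    Sample(confidence: 0.8, isTarget: true),
    Sample(confidence: 0.7, isTarget: false),
    Sample(confidence: 0.6, isTarget: true),
    Sample(confidence: 0.3, isTarget: false),
    Sample(confidence: 0.2, isTarget: true),
]

// Each threshold is one operating point on the curve.
for threshold: Float in [0.75, 0.5, 0.1] {
    if let point = precisionRecall(at: threshold, in: samples) {
        print(threshold, point.precision, point.recall)
    }
}
```

With this data, the high threshold retrieves only true motorcycles but misses half of them, while the low threshold retrieves all of them along with both false matches, which is the tradeoff the curve captures.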

As I said before, I want to find a balance point along this curve that achieves the level of recall and precision that makes sense for my application.

So let's see how I need to change my code in order to accomplish it and how the precision and recall curve plays into that.

So here I have a filtering with hasMinimumPrecision where I'm specifying the minimum precision and a recall value.

When I specify a MinimumPrecision, I'm actually selecting an area along the graph that I want to operate within.

When I select a recall point with forRecall, I'm choosing a point along the curve that will be my operating point.

Now, if the operating point is in the valid region that I selected, then that is the threshold that the filter will apply when looking at that particular class.

If the operating point is not in the valid region, then there is no operating point that meets the constraints I stated, and the class will always be filtered out of my results.

In this sense, all you need to do is provide the level of precision and recall that you want to operate at, and the filter will determine the necessary thresholds for you automatically.
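The selection logic described above can be modeled with a short standalone sketch. It is only a model of the described behavior, not Vision's implementation: the curve points and thresholds are hypothetical, and picking the curve point nearest to the requested recall is an assumption made for illustration.

```swift
/// One point on a class's precision and recall curve (hypothetical data).
struct OperatingPoint {
    let threshold: Float
    let precision: Float
    let recall: Float
}

/// Returns the threshold to apply for a class, or nil when no operating
/// point satisfies the constraints and the class is always filtered out.
func selectThreshold(on curve: [OperatingPoint],
                     minimumPrecision: Float,
                     forRecall recall: Float) -> Float? {
    // Assumption: use the curve point whose recall is closest to the request.
    guard let point = curve.min(by: { abs($0.recall - recall) < abs($1.recall - recall) }) else {
        return nil
    }
    // The point is only valid inside the region set by the minimum precision.
    return point.precision >= minimumPrecision ? point.threshold : nil
}

// Hypothetical curve; the middle point echoes the 0.7 recall, 0.74 precision example.
let curve = [
    OperatingPoint(threshold: 0.8, precision: 0.95, recall: 0.4),
    OperatingPoint(threshold: 0.5, precision: 0.74, recall: 0.7),
    OperatingPoint(threshold: 0.2, precision: 0.5, recall: 0.9),
]

let accepted = selectThreshold(on: curve, minimumPrecision: 0.7, forRecall: 0.7)
let rejected = selectThreshold(on: curve, minimumPrecision: 0.8, forRecall: 0.7)
print(accepted as Any, rejected as Any)
```

In the first call the point at recall 0.7 lies inside the valid region, so its threshold is used; in the second, its precision falls short of the minimum, so the class would never pass the filter.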

So, to summarize, the observation I get back when performing image classification is actually an array of observations, one for every class in the taxonomy.

Because this is a multi-label problem, the confidences will not sum to 1.

Instead, we have independent confidence values, one for every class, between 0 and 1, and we need to understand precision and recall and how they apply to our specific use case in order to apply a filtering with hasMinimumPrecision or hasMinimumRecall that makes sense for our application.

So, that concludes the portion on image classification.

I'd like to switch gears and talk about a related topic, Image excuse me.

Image Similarity.

When we talk about Image Similarity, what we really mean is a method to describe the content of an image and another method to compare those descriptions.

The most basic way in which I can describe the contents of an image is using the source pixels themselves.

That is, I can search for other images that have close to or exactly the same pixel values and retrieve them.

If I did a search in this fashion, however, it's extremely fragile, and it's easily fooled by small changes like rotations or lighting augmentations that drastically change the pixel values but not the semantic content in the image.

What I really want is a more high-level description of what the content of the image is, perhaps something like natural language.

I could make use of the image classification API I was describing previously in order to extract a set of words that describe my image.

I could then retrieve other images with a similar set of classifications.

I might even combine this with something like word vectors to account for similar but not exactly matching words like cat and kitten.

Well, if I performed a search like this, I might get similar objects in a very general sense, but the way in which those objects appear and the relationships between them could be very different.

As well, I would be limited by the taxonomy of my classifier.

That is, any object that appeared in my image that wasn't in my classification network's taxonomy couldn't be expressed in a search like this.

What I really want is a high-level description of the objects that appear in the image that isn't fixated on the exact pixel values but still cares about them.

I also want this to apply to any natural image and not just those within a specific taxonomy.

As it turns out, this kind of representation learning is something that's naturally engendered in our classification network as part of its training process.

The upper layers of the network contain all of the salient information necessary to perform classification while discarding any redundant or unnecessary information that doesn't aid it in that task.

We can make use of these upper layers then to act as our feature descriptor, and it's something we refer to as the feature print.

Now, the feature print is a vector that describes the content of the image that isn't constrained to a particular taxonomy, even the one that the classification network was trained on.

It simply leverages what the network has learned about images during its training process.

If we look at these pairs of images, we can compare how similar their feature prints are, and the smaller the value is, the more similar the two images are in a semantic sense.

We can see that even though the two images of the cats are visually dissimilar, they have a much more similar feature print than the visually similar pairs of different animals.

To make this a little more concrete, let's go through a specific example.

Let's say I have the source image on screen, and I want to find other semantically similar images to it.

I'm going to take a library of images and compute the feature print for each image and then retrieve those images with the most similar feature print to my source image.

When I do it for this image of the gentleman in the coffee shop, I find I get other images of people in coffee shop and restaurant settings.

If I focus on a crop of the newspaper, however, I get other images of newspapers.

And if I focus on the teapot, I get other images of teapots.

I'd like to now invite the Vision Team onstage to help me with a quick demonstration to expand a little more on how Image Similarity works.

[ Applause ]

Hello everyone.

My name is Brett, and we have a really fun way to demonstrate Image Similarity for you today.

We have very creatively called it the Image Similarity game.

And here is how you play.

You draw something on a piece of paper, then ask a few friends to re-create your original as close as possible.

So I will start by drawing the original.

Okay. Tap continue to scan it in as my original.

And then save.

Now, my team will act as contestants, and they will draw this as best as they can.

Now while they're drawing, I should tell you that this sample app is available to you now on the developer documentation website as sample code, and also, we are using the VisionKit document scanner to scan in our drawings, and you can learn more about that at our text recognition session.

Let's give them a few more seconds.

Five, four, three, okay, I guess they're done.

Okay. Let's bring them up and start scanning them in.

Contestant number one.

Pretty good [applause].

That might be a winner.

Let's see contestant number two.

Still pretty good.

Nicely done.

[ Applause ]

Contestant number three please.

[ Laughter and Applause ]

I think that's pretty good.

[ Applause ]

And contestant number four.

Well, I don't know about that, but we'll see how it goes.

[ Applause ]

All right.

So let's save those, and we find out that the winner is contestant number one.


[ Applause ]

Now I can swipe over, and we can see that the faces are more semantically similar that way.

They are closer to the original, while the tree, being semantically different, is much further away.

And that is the Image Similarity game, and back to Rohan.

[ Applause ]

Thanks everyone.

I want to take a quick look at a snippet from that demo application to show how we determined the winning contestant.

So here I have the portion of the code that compares the feature print of each contestant's drawing to the feature print of Brett's drawing.

Now, I extracted the contestant's feature print with a function we have defined in the application called featureprintObservationForImage.

Once I have each feature print, I then need to determine how similar it was to the original drawing, and I can do that using computeDistance, which returns me a floating-point value.

Now, the smaller the floating-point value, the more similar the two images are.

And so, once I've determined this for every contestant, I simply need to sort them in order to determine the winner.
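The comparison can be sketched roughly as follows. This is not the sample app's code: the helper names `featurePrint(forImageAt:)` and `rankBySimilarity(original:contestants:)` are hypothetical, and the drawings are assumed to be stored as image files.

```swift
import Foundation
import Vision

/// Computes the feature print for one image, or nil if Vision fails.
/// Hypothetical helper, similar in spirit to the one mentioned above.
func featurePrint(forImageAt url: URL) -> VNFeaturePrintObservation? {
    let handler = VNImageRequestHandler(url: url, options: [:])
    let request = VNGenerateImageFeaturePrintRequest()
    do {
        try handler.perform([request])
        return request.results?.first as? VNFeaturePrintObservation
    } catch {
        print("Vision error: \(error)")
        return nil
    }
}

/// Ranks contestant images by distance to the original, most similar first.
func rankBySimilarity(original: URL, contestants: [URL]) -> [(url: URL, distance: Float)] {
    guard let originalPrint = featurePrint(forImageAt: original) else { return [] }

    var ranked: [(url: URL, distance: Float)] = []
    for url in contestants {
        guard let contestantPrint = featurePrint(forImageAt: url) else { continue }
        var distance = Float(0)
        do {
            // A smaller distance means the two images are more similar.
            try contestantPrint.computeDistance(&distance, to: originalPrint)
            ranked.append((url: url, distance: distance))
        } catch {
            print("Distance error: \(error)")
        }
    }
    return ranked.sorted { $0.distance < $1.distance }
}
```

The first element of the returned array would be the winning contestant.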

Well, this concludes the portion on Image Similarity.

I'd now like to hand the mic over to Sergey to talk about some of the changes coming to Face Technologies.

[ Applause ]

Good morning everybody.

My name is Sergey Kamensky.

I'm a software engineer on the Vision Framework Team.

I'm excited to share with you today even more new features coming to the Framework this year.

Let's talk about Face Technology first.

Remember, two years ago when we introduced Vision Framework, we also talked about Face Landmark detector.

This year, we're coming with a new revision for this algorithm.

So, what are the changes?

Well, first, we now have a 76-point constellation, and this is versus the 65-point constellation that we had before.

The 76-point constellation gives us a greater density to represent different face regions.

Second, we now report confidence score per landmark point, and this is versus a single average confidence score, as we reported before.

But the biggest improvement comes in the pupil detection.

As you can see, the image on the right-hand side has pupils detected with much better accuracy.

Let's take a look at the client code sample.

This code snippet will repeat throughout the presentation so the first time we're going to go line by line.

Also, I use for [inaudible] in my samples.

This is just to simplify the slides; when you develop your apps, you probably should use proper error handling to avoid unwanted boundary conditions.

Let's get back to the sample.

In order to get your facial landmarks, first you need to create a DetectFaceLandmarksRequest.

Then, you need to create an ImageRequestHandler, passing into it the image that needs to be processed, and then you need to use that request handler to process your request.

Finally, you need to look at the results.

The results for everything that is human face related in Vision Framework will come in the form of face observations.

Face observation derives from detected object observation.

It inherits bounding box property, and it also adds several other properties on its level to describe human face.

This time we'll be interested in the landmarks property.

The landmarks property is of FaceLandmarks2D class.

The FaceLandmarks2D class consists of the confidence score, which is the single average confidence score for the entire set, and multiple face regions, where each face region is represented by the FaceLandmarkRegion2D class.

Let's take a closer look at the properties of this class.

First is pointCount.

PointCount will tell you how many points represent a particular face region.

This property will [inaudible] a different value depending on how you configure your request, with the 65-point constellation or the 76-point constellation.

The normalizedPoints property will represent the actual landmark points, and the precisionEstimatesPerPoint will represent the actual confidence score for each landmark point.

Let's take a look at the codes needed.

This is the same code snippet as in the previous slide, but now we're going to look at it from a slightly different perspective.

We want to see how revisioning of the algorithm works in Vision Framework.

If you take this code snippet and recompile it with the last [inaudible], what you will get is that the request object will be configured as follows: the revision property will be set to revision number 2, and the constellation property will be set to the constellation of 65 points.

Technically, we didn't have the constellation property last year, but if we did, we could have set it to a single value only.

Now, if on the other hand you take the same code snippet and recompile it with the current [inaudible], what you will get is that the revision property will be set to revision number 3, and the constellation property will be set to the constellation of 76 points.

This actually represents the philosophy of how Vision Framework handles revisions of algorithms by default.

If you don't specify a revision, we will give you the latest revision supported by the SDK your code is compiled and linked against.

Of course, we always recommend setting those properties explicitly.

This is just to guarantee deterministic behavior in the future.
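Setting the revision and constellation explicitly might look like the sketch below. The function names are hypothetical, and unlike the slides, the sketch propagates errors instead of ignoring them.

```swift
import Foundation
import Vision

/// Detects face landmarks with the revision and constellation pinned.
/// `detectLandmarks(inImageAt:)` is a hypothetical helper.
func detectLandmarks(inImageAt imageURL: URL) throws -> [VNFaceObservation] {
    let request = VNDetectFaceLandmarksRequest()
    // Set both properties explicitly for deterministic behavior across SDKs.
    request.revision = VNDetectFaceLandmarksRequestRevision3
    request.constellation = .constellation76Points

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])
    return (request.results as? [VNFaceObservation]) ?? []
}

/// Prints a few of the landmark properties discussed above for one face.
func describeLandmarks(of face: VNFaceObservation) {
    guard let landmarks = face.landmarks else { return }
    print("Bounding box:", face.boundingBox)
    print("Average confidence for the whole set:", landmarks.confidence)

    // Each face region is a VNFaceLandmarkRegion2D.
    if let pupil = landmarks.leftPupil {
        print("Left pupil point count:", pupil.pointCount)
        print("Left pupil points:", pupil.normalizedPoints)
        print("Per-point estimates:", pupil.precisionEstimatesPerPoint ?? [])
    }
}
```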

Let's take a look at a new metric that we developed this year, Face Capture Quality.

There are two images on the screen.

You can clearly see that one image was captured with better lighting and focusing conditions.

We wanted to develop the metric that looks at the image as a whole and gives you one score back saying how bad or good the capture quality was.

As a result, we came up with a Face Capture Quality metric.

We trained our models for this metric in such a way so they tend to score lower if the image was captured with low light or bad focus, or for example, if a person had negative expressions.

If we run this metric on these two images, we will get our scores back.

These are floating-point numbers.

You can compare them against each other, and you can say that the image that scored higher is the image that was captured with better quality.

Let's take a look at the code sample.

This is very similar to what we saw just a couple of slides ago, with the differences being in the request type and the results.

Since we're still dealing with faces, we're going to get our face observation back, but now we're going to look at a different property of the face observation, the Face Capture Quality property.

Let's take a look at the broader example.

Let's say I have a sequence of images that could have been obtained by using the burst mode on the selfie camera or in the photo burst, for example.

And you will ask yourself a question.

Which image was captured with the best quality?

What you can do now is run our algorithm on each image, assign scores, rank them, and the image that ends up at the top of the list is the image that was captured with the best quality.
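Ranking a burst of images in this way can be sketched as follows. The helper names are hypothetical, each image is assumed to contain one face of the same person, and only the first detected face is scored.

```swift
import Foundation
import Vision

/// Returns the capture quality of the first face found, or nil if none.
/// `faceCaptureQuality(forImageAt:)` is a hypothetical helper.
func faceCaptureQuality(forImageAt url: URL) -> Float? {
    let handler = VNImageRequestHandler(url: url, options: [:])
    let request = VNDetectFaceCaptureQualityRequest()
    do {
        try handler.perform([request])
    } catch {
        return nil
    }
    let faces = (request.results as? [VNFaceObservation]) ?? []
    return faces.first?.faceCaptureQuality
}

/// Ranks images of the same subject from best to worst capture quality.
func rankByCaptureQuality(_ imageURLs: [URL]) -> [(url: URL, quality: Float)] {
    var scored: [(url: URL, quality: Float)] = []
    for url in imageURLs {
        if let quality = faceCaptureQuality(forImageAt: url) {
            scored.append((url: url, quality: quality))
        }
    }
    // Higher scores mean better capture quality, so sort descending.
    return scored.sorted { $0.quality > $1.quality }
}
```

The sketch only compares scores against each other within one sequence; it deliberately applies no fixed threshold.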

Let's try to understand how we can interpret the results that are coming from the Face Capture Quality metric.

I have two sequences of images on the slide.

Each sequence is of the same person, and each sequence is represented by the images that score the lowest and the highest in the sequence with respect to Face Capture Quality.

What can we say about these ranges?

Well, there is some overlapping region, but there are also some regions that belong to one and don't belong to the other.

If you had yet another sequence, it could have happened that there was no overlapping region at all.

The point I'm trying to make here is that the Face Capture Quality should not be compared against a threshold.

In this particular example, if I picked 0.52, I would have missed all the images on the left, and I would pretty much get any image that's just past the midpoint on the right.

But then what is Face Capture Quality?

We define Face Capture Quality as a comparative or ranking measure of the same subject.
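To make the difference between a threshold and a per-subject ranking concrete, here is a small sketch. The `SubjectFrame` type and all scores are invented for illustration; only the 0.52 cutoff comes from the example above.

```swift
/// A scored frame tagged with the person it shows (hypothetical type).
struct SubjectFrame {
    let subject: String
    let name: String
    let quality: Float
}

/// Picks the best frame per subject by comparing only within that subject.
func bestFramePerSubject(_ frames: [SubjectFrame]) -> [String: SubjectFrame] {
    var best: [String: SubjectFrame] = [:]
    for frame in frames {
        if let current = best[frame.subject], current.quality >= frame.quality {
            continue
        }
        best[frame.subject] = frame
    }
    return best
}

// Invented scores: every frame of subject A scores below 0.52.
let subjectFrames = [
    SubjectFrame(subject: "A", name: "a0", quality: 0.21),
    SubjectFrame(subject: "A", name: "a1", quality: 0.47),
    SubjectFrame(subject: "B", name: "b0", quality: 0.50),
    SubjectFrame(subject: "B", name: "b1", quality: 0.83),
]

// A fixed cutoff drops subject A entirely.
let aboveCutoff = subjectFrames.filter { $0.quality > 0.52 }

// Ranking within each subject still yields one best frame per person.
let bestPerSubject = bestFramePerSubject(subjectFrames)
```

With these numbers the cutoff keeps only `b1`, while the per-subject ranking selects `a1` for subject A and `b1` for subject B.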

Now, comparative and same are the key words in this sentence.

If you're thinking, cool, I have this great new metric, I'm going to develop my beauty contest app.

Probably not a good idea.

In a beauty contest app, you would have to compare faces of different people, and that's not what this metric was developed and designed for.

And that's Face Technology.

Let's take a look at the new detectors we're adding this year.

We're introducing a Human Detector that detects the human upper body, which consists of the head and torso, and also a pet detector, an Animal Detector that detects cats and dogs.

The Animal Detector gives you bounding boxes back, and in addition to the bounding boxes it also gives you a label saying which animal was detected.

Let's take a look at the client code sample.

Two snippets, one for Human Detector, one for Animal Detector.

Very similar to what we had before.

Again, the differences are in the request types that you create and in the results.

Now, for Human Detector, all we care about is the bounding box.

So, for that we use DetectedObjectObservation.

For the Animal Detector, on the other hand, we also need the label, so we use RecognizedObjectObservation, which derives from DetectedObjectObservation.

It inherits bounding box, but it also adds a label property on the [inaudible].
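The two slide snippets are not reproduced in this transcript. A minimal sketch, assuming the request types are `VNDetectHumanRectanglesRequest` and `VNRecognizeAnimalsRequest` as in the shipping Vision SDK, and using a hypothetical helper that runs both requests on one image:

```swift
import Vision
import CoreGraphics

/// Hypothetical helper: runs both detectors on one image and returns
/// human bounding boxes plus (label, bounding box) pairs for animals.
func detectHumansAndAnimals(in image: CGImage) throws
    -> (humans: [CGRect], animals: [(label: String, box: CGRect)]) {
    let humanRequest = VNDetectHumanRectanglesRequest()
    let animalRequest = VNRecognizeAnimalsRequest()

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([humanRequest, animalRequest])

    // Humans: only the bounding box (normalized coordinates) is of interest.
    let humans = (humanRequest.results ?? [])
        .compactMap { $0 as? VNDetectedObjectObservation }
        .map { $0.boundingBox }

    // Animals: the observation inherits the bounding box and adds labels.
    // This sketch simply takes the first label, if there is one.
    let animals = (animalRequest.results ?? [])
        .compactMap { $0 as? VNRecognizedObjectObservation }
        .map { (label: $0.labels.first?.identifier ?? "unknown", box: $0.boundingBox) }

    return (humans, animals)
}
```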

And that's new detectors.

Let's take a look at what's new in tracking this year.

We're coming up with a new revision for the Tracker.

The changes are, we have improvements in the bounding box expansion area.

We can now handle occlusions better.

We are machine learning based this time.

And we can run with low power consumption on multiple [inaudible] devices.

Let's take a look at a sample.

I have a mini video clip where a man is running in the forest, and he appears sometimes behind the trees.

As you can see, the tracker is able to successfully recapture the tracked object and keep going with the tracking sequence.

[ Applause ]

Thank you.

[ Applause ]

Let's take a look at the client code sample.

This is exactly the same snippet that we showed last year.

It represents probably the simplest tracking sequence you can imagine.

It tracks your object of interest for five consecutive frames.

I won't go line-by-line, but I want to emphasize two points here.

First is we use our SequenceRequestHandler.

That is as opposed to the ImageRequestHandler that we have used so far throughout the presentation.

SequenceRequestHandler is used in Vision when you work with a sequence of frames and you need to cache some information from frame to frame.

Second point is when you implement your tracking sequence, you need to get your results from iteration number n and feed them as an input to iteration number n plus 1.

Of course, if you recompiled this code with the current [inaudible] SDK, the revision of the request will be set to revision number 2 by default.

But we also recommend setting it explicitly.
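The slide code is not included in the transcript. A minimal sketch of such a tracking loop, assuming the frames arrive as an array of `CVPixelBuffer` and using a hypothetical helper name, with the revision set explicitly via `VNTrackObjectRequestRevision2`:

```swift
import Vision
import CoreVideo

/// Hypothetical helper: tracks one object through the given frames and
/// returns the bounding box found in each frame.
func trackObject(startingFrom initial: VNDetectedObjectObservation,
                 through frames: [CVPixelBuffer]) throws -> [CGRect] {
    // The sequence handler caches state from frame to frame.
    let sequenceHandler = VNSequenceRequestHandler()
    var inputObservation = initial
    var boxes: [CGRect] = []

    for frame in frames {
        let request = VNTrackObjectRequest(detectedObjectObservation: inputObservation)
        // Set the revision explicitly instead of relying on the SDK default.
        request.revision = VNTrackObjectRequestRevision2

        try sequenceHandler.perform([request], on: frame)

        guard let result = request.results?.first as? VNDetectedObjectObservation else {
            break  // Nothing came back for this frame; stop the sequence.
        }
        boxes.append(result.boundingBox)

        // Feed the result of iteration n into iteration n + 1.
        inputObservation = result
    }
    return boxes
}
```

Passing five frames reproduces the five-iteration sequence described above.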

And that's the tracking.

Let's take a look at the news with respect to Vision and Core ML integration.

Last year, we presented integration with Vision and Core ML, and we showed how you can run Core ML models through Vision API.

The advantage of doing that was that you can use one of five different overloads of the image request handler to translate the image that you have in your hand to the image type, size, and color scheme that the Core ML model requires.

We will run the inference for you, and we'll pack the outputs or results coming from Core ML model into Vision observations.

Now, if you have a different task in mind, for example, if you want to do image style transfer, you need to have at least two images, the image content and the image style.

You may also need to have some mix ratio saying how much of the style needs to be applied to the content.

So, I have three parameters now.

Well, this year we're introducing an API where you can pass multiple inputs through Vision to Core ML, and that includes multi-image inputs.

Also, on the output section, this sample shows only one output.

But, for example, if you had more than one, especially if you had more than one of the same type, it would be hard to distinguish them when they come back in the form of observations later on.

So, what we do this year, we introduce a new field in the observation that maps exactly to the name that shows up here in the output section.

Let's take a look at the inputs and outputs.

We will use them in the next slide.

This is the code snippet that represents how to use Core ML through Vision.

The highlighted sections show what's new this year.

Let's skip them for now, and we'll go over the code, and we'll return to them later.

In order to run Core ML through Vision, first you need to load your Core ML model.

Then, you need to create a Vision CoreMLModel wrapper around it.

Then, you need to create a Vision CoreMLRequest and pass in that wrapper.

Then you create an ImageRequestHandler, you process your request, and you look at the results.

Now, with the new API that we added this year, the only image that you could use last year becomes the default, or main, image.

That is the image that is passed to the ImageRequestHandler, and it is also the image whose name needs to be assigned to the input image feature name field of the CoreMLModel wrapper.

All other parameters, whether images or not, have to be passed through the feature provider property of the CoreMLModel wrapper.

As you can see, image style and mix ratio are passed in that way.

Finally, when you look at the results, you can look at the feature name property of the observation that comes out, and you can compare it in this case against image result.

That's exactly the name that appears in the output section of Core ML, and that way you can process your results accordingly.
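The slide code is not part of the transcript. A minimal sketch of the flow described above, for a hypothetical style-transfer model: the feature names `imageContent`, `imageStyle`, `mixRatio`, and `imageResult` are assumptions modeled on the names spoken in the talk, and the helper name and model URL are placeholders.

```swift
import Vision
import CoreML
import CoreVideo

/// Hypothetical helper: runs a multi-input style-transfer model through
/// Vision and returns the output image, if the model produced one.
func stylize(content: CGImage,
             style: CVPixelBuffer,
             mixRatio: Double,
             modelURL: URL) throws -> CVPixelBuffer? {
    // 1. Load the compiled Core ML model and wrap it for Vision.
    let mlModel = try MLModel(contentsOf: modelURL)
    let visionModel = try VNCoreMLModel(for: mlModel)

    // 2. Name the model input that receives the main image.
    visionModel.inputImageFeatureName = "imageContent"  // assumed name

    // 3. Pass every other input through the feature provider.
    visionModel.featureProvider = try MLDictionaryFeatureProvider(dictionary: [
        "imageStyle": MLFeatureValue(pixelBuffer: style),  // assumed name
        "mixRatio": MLFeatureValue(double: mixRatio),      // assumed name
    ])

    // 4. Create the request and run it on the main image.
    let request = VNCoreMLRequest(model: visionModel)
    let handler = VNImageRequestHandler(cgImage: content, options: [:])
    try handler.perform([request])

    // 5. Pick the output by feature name rather than by position.
    let outputs = (request.results ?? []).compactMap { $0 as? VNPixelBufferObservation }
    return outputs.first(where: { $0.featureName == "imageResult" })?.pixelBuffer
}
```

Matching on `featureName` keeps the code correct if the model is later extended with a second image output.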

This slide actually concludes our presentation for today.

For more information you can refer to the links on the slide.

Thank you, and have a great rest of your WWDC.

[ Applause ]
