What’s New in Core ML, Part 1 

Session 708 WWDC 2018

Introduced just one year ago, Core ML has already revolutionized the way apps can benefit from machine learning, by enabling fast and private on device machine learning features for your app. Find out how new Core ML features let you reduce the size of models, make them more flexible, and dramatically improve performance.

[ Music ]

[ Applause ]

Good morning.

Welcome. I’m Michael, and this session is about what’s new in Core ML.

So introduced a year ago, Core ML is all about making it unbelievably simple for you to integrate machine learning models into your app.

It’s been wonderful to see the adoption over the past year.

We hope it has all of you thinking about what great new experiences you can enable if your app had the ability to do things like understand the content of images, or perhaps, analyze some text.

What could you do if your app could reason about audio or music, or interpret your users’ actions based on their motion activity, or even transform or generate new content for them?

All of this, and much, much more, is easily within reach.

And that’s because this type of functionality can be encoded in a Core ML model.

Now if we take a peek inside one of these, we may find a neural network, tree ensemble, or some other model architecture.

They may have millions of parameters, the values of which have been learned from large amounts of data.

But for you, you could focus on a single file.

You can focus on the functionality it provides and the experience it enables rather than those implementation details.

Adding a Core ML model to your app is simple as adding that file to your Xcode project.

Xcode will give you a simple view, describe what it does in terms of the inputs it requires and the outputs it provides.

Xcode will take this one step further and generate an interface for you, so that interacting with this model is just a few lines of code, one to load the model, one to make a prediction, and sometimes one to pull out the specific output you are interested in.

Note that in some cases you don’t even have to write back code because Core ML integrates with some of our higher-level APIs and allows you to customize their behavior if you give them a Core ML model.

So with Vision, this is done through the VNCoreML Request object.

And in the new Natural Language framework, you can instantiate an MLModel from a CoreML model.

So that’s Core ML in a nutshell.

But we are here to talk about what’s new.

We took all the great feedback we received from you over the past year and focused on some key enhancements to CoreML 2.

And we are going to talk about these in two sessions.

In the first session, the one you are all sitting in right now, we are going to talk about what’s new from the perspective of your app.

In the second session, which starts immediately after this at 10 a.m. after a short break, we are going to talk about tools and how you can update and convert models to take advantage of the new features in Core ML 2.

When it comes to your app, we are going to focus on three key areas.

The first is how you can reduce the size and number of models using your app while still getting the same functionality.

Then we’ll look about how you can get more performance out of a single model.

And then we’ll conclude about how using Core ML will allow you to keep pace with the state-of-the-art and rapidly moving field of machine learning.

So to kick it off, let’s talk about model size.

I’m going to hand it off to Francesco.

[ Applause ]

Thank you Michael.

Hello. Every ways to reduce the size of your Core ML app is very important.

My name is Francesco, and I am going to introduce quantization and flexible shapes, two new features in Core ML 2 that can help reduce your app size.

So Core ML, [inaudible] why should you learn in models and device.

This gives your app four key advantages compared to running them in the cloud.

First of all, user privacy is fully respected.

There are many machine-learning models on device.

We guarantee that the data never leaves the device of the user.

Second, it can help you achieve real-time performance.

And for silicon [phonetic] and devices are super- efficient for machine-learning workloads.

Furthermore, you don’t have to maintain and pay for Internet servers.

And Core ML inference is available anywhere at any time despite natural connectivity issues.

All these great benefits come with the fact that you now need to store your machine-learning models on device.

And if the machine-learning models are big, then you might be concerned about the size of your app.

For example, you are you have your [inaudible] map, and it’s full of cool features.

And your users are very happy about it.

And now you want to take advantage of the new opportunities offered by machine learning on device, and you want to add new amazing capabilities to your app.

So what you do, you train some Core ML models and you add them to your app.

What this means is that your app has become more awesome and your users are even happier.

But some of them might notice that your app has increased in size a little bit.

It’s not uncommon to see that apps grow up, either to tens or hundreds of megabytes after adding machine-learning capabilities to them.

And as you keep adding more and more features to your app, your app size might simply become out of control.

So that’s the first thing that you can do about it.

And if these machine-learning models are supporting other features to your app, you can keep them outside your initial bundle.

And then as the user uses the other features, you can download them on demand, compile them on device.

So this in this case, this user is happy in the beginning because the installation size is unchanged.

But since the user downloads and uses all the current Core ML functionality in your app, at the end of the day the size of the of your app is still large.

So wouldn’t it be better if, instead, we could tackle this problem by reducing the size of the models itself?

This would give us a smaller bundle in case we ship the models inside the app, faster and smaller downloads if instead of shipping the models in the app we download them.

And in any case, your app will enjoy a lower memory footprint.

Using less memory is better for the performance of your app and great for the system in general.

So let’s see how we can decompose the size of a Core ML app into factors to better tackle this problem.

First, there is the number of models.

This depends on how many machine-learning functionalities your app has.

Then there is the number of weights.

The number of weights depends on the architecture that you have chosen to solve your machine-learning problem.

As Michael was mentioning, the number of weight the weights are the place in which the machine-learning model stores the information that it has been learning during training.

So it is if it has been trained to do a complex task, it’s not uncommon to see a model requiring tens of millions of weights.

Finally, there is the size of the weight.

How are we storing these parameters that we are learning during training?

Let’s focus on this factor first.

For neural networks, we have several options to represent and store the weights.

And the first, really, is of Core ML in iOS 11.

Neural networks were stored using floating-point 32-bit weights.

In iOS 11.2, we heard your feedback and we introduced half precision floating-point 16 weight.

This gives your app half the storage required for the same accuracy.

But this year we wanted to take several steps further, and we are introducing quantized weights.

With quantized weights we are no longer restricted to use either Float 32 or Float 16 values.

But neural networks can be encoded using 8 bits, 4 bits, any bits all the way down to 1 bit.

So let’s now see what quantization here is.

Here we are representing a subset of the weights of our neural networks.

As we can see, these weights can take any value in a continuous range.

This means that in theory, a single weight can take an infinite number of possible values.

So in practice, in neural networks we store weights using floater 32 point Float 32 numbers.

This means that this weight can take billions of values to better represent the their continuous nature.

But it turns out the neural networks also work with lower precision weights.

Quantization is the process it takes to discontinue strings of values and constrains them to take a very small and discrete subset of possible values.

For example, here quantization has turned this continuous spectrum of weights into only 256 possible values.

So before quantization, the weights would take any possible values.

After quantization, they only have 256 options.

Now since its weight can be taken from this small set, Core ML now needs only 8 bits of stored information of a weight.

But nothing can stop us here.

We can go further.

And for example, we can constrain the network to take, instead of one of 56 different values, for example, just 8.

And since now we now have only 8 options, Core ML will need 3-bit values per weight to store your model.

There are now some details about how we are going to choose these values to represent the weights.

They can be uniformly distributed in this range, and in this case we have linear quantization instead in lookup table quantization, we can have these values scattered in this range in an arbitrary manner.

So let’s see practically how quantization can help us reduce the size of our model.

In this example, you are focusing on Resnet50, which is a common architecture used by many applications for many different tasks.

It includes 25 million trained parameters and this means that you have to use 32-bit floats to represent it.

Then the total model size is more than 100 megabytes.

If we quantize it to 8-bits, then the architecture hasn’t changed; we still have 25 million parameters.

But we are now using only 1 byte to store a single weight, and this means that the model size is reduced by a factor of 4x.

It’s only it now only takes 26 megabytes to store this model.

And we can go further.

We can use that quantized representation that only uses 4 bits per weight in this model and end up with a model that is even smaller.

[ Applause ]

And again, Core ML supports all the quantization modes all the way down to 8 bits.

Now quantization is a powerful technique to take an existing architecture and of a smaller version of it.

But how can you obtain quantized model?

If you have any neural networking in Core ML format, you can use Core ML Tools to obtain a quantized representation of it.

So Core ML 2 should quantize for you automatically.

Or you can train quantized models.

You can either train quantized with a quantization constraint from scratch, or retrain existing models with quantization constraints.

After you have obtained your quantized model with your training tools, you can then convert it to Core ML as usual.

And nothing will change in the app in the way you use the model.

Inside the model, the numbers are going to be stored in different precision, but the interface for using the model will not change at all.

However, we always have to consider that quantized models have lower-precision approximations of the original reference floating-point models.

And this means that quantized models come with an accuracy versus size-of-the-model tradeoff.

This tradeoff is model dependent and use case dependent.

And it’s also a very active area of research.

So it’s always recommended to check the accuracy of the quantized model and compare it with the referenced floating-point version for relevant this data and form metrics that are valid for your app and use case.

Now let’s see a demo of how we can use adopt quantized models to reduce the size of an app.

[ Applause ]

I would like to show you a style transfer app.

In style transfer, a neural network has been trained to render user images using styles that have been learned by watching paintings or other images.

So let me load my app.

As we can see, I am shipping this app with four styles; City, Glass, Oils and Waves.

And then I can pick images from the photo library of the users and then process them blending them in different styles right on device.

So this is the original image, and I am going to render the City style, Glass, Oils, and Waves.

Let’s see how this app has been built in Xcode.

This app uses Core ML and Vision API to perform this stylization.

And as we can see, we have four Core ML models here bundled in Xcode; City, Glass, Oils, and Waves, the same ones we are seeing in the app.

And we can see we can inspect this model.

These are seen as quantized model, so each one of these models is 6.7 megabytes of this space on disk.

We see that the models take an input image of a certain resolution and produce an image called Stylized of the same resolution.

Now we want to investigate how much storage space and memory space we can use by we can save by switching to quantized, models.

So I have been playing with Core ML Tools and obtained quantizer presentation for these models.

And for a tutorial about how to obtain these models, stay for Part 2 that is going to cover quantization with Core ML Tools in detail.

So I want to focus first on the Glass style and see how the different quantization versions work for these styles.

So all I have to do is drag these new models inside the Xcode project, and rerun the app.

And then we are going to see how these models behave.

First we can see that the size has been greatly reduced.

For example, the 8-bit version already from 6 or 7 megabytes went down to just 1.7.

[ Applause ]

In 4-bit, we can save even more, and now the model is less than 1 megabyte.

In 3-bit, it is that’s even smaller, at 49 kilobytes.

And so on.

Now let’s go back to the app.

Let’s make this same image for reference and apply the Glass style in the original version.

Still looks as before.

Now we can compare it with the 8-bit version.

And you can see nothing has changed.

This is because 8-bit quantization methods are very solid.

We can also venture further and try the 4-bit version of this model.

Wow. The results are still great.

And now let’s try the 3-bit version.

We see that there are we see the first color shift.

So it probably is good if we go and check with the designers if this effect is still acceptable.

And now, as we see the 2-bit version, this is not really what we were looking for.

Maybe we will save it for a horror app, but I am not going to show this to the designer.

[ Applause ]

Let’s go back to the 4-bit version and hide this one.

This was just a reminder that quantized models are approximation of the original models.

So it’s always recommended to check them and compare with the original versions.

Now for every model and quantization technique, there is always a point in which things start to mismatch.

Now we after some discussion with the designer, extensive evaluation of many images, we decided to ship the 4-bit version of this model, which is the smallest size for the best quality.

So let’s remove all the floating-point version of the models that were taking a lot of space in our app and replace them with the 4-bit version.

And now let’s run the app one last time.

OK. Let’s pick the same image again and show all the styles.

This was the City, Glass, Oils, and big Wave.

So in this demo we saw how we started with four models and they were huge, in 32-bit or total app size was 27 megabytes.

Then we evaluated the quality and switched to 4-bit models, and the total size of our app went down to just 3.4 megabytes.

Now [ Applause ]

This doesn’t cost us anything in terms of quality because all these versions these quantized versions look the same, and the quality is still amazing.

We showed how quantization can help us reduce the size of an app by reducing the size of the weight at the very microscopic level.

Now let’s see how we can reduce the number of models that your app needs.

In the most straightforward case, if your app has three machine-learning functionalities then you need three different machine-learning models.

But in some cases, it is possible to have the same model to support two different functions.

For example, you can train a multi-task model.

And multi-task models has been trained to perform multiple things at once.

There is an example about style transferring, the Turi Create session about multi-task models.

Or in some cases, you can use a yet new feature in Core ML called Flexible Shapes and Sizes.

Let’s go back to our Style Transfer demo.

In Xcode we saw that the size of the input image and the output image was encoded in part of the definition of the model.

But what if we want to run the same style on different image resolution?

What if we want to run the same network on different image sizes?

For example, the user might want to see a high-definition style transfer.

So they use they give us a high-definition image.

Now if I were Core ML model all it takes is a lower resolution as an input, all we can do as developers is size or resize the image down, process it, and then scale it back up.

This is not really going to amaze the user.

Even in the past, we could reship this model with Corel ML Tools and make it accept any resolution, in particular, a higher-resolution image.

So even in the past we could do this feature and feed directly the high-resolution image to a Corel ML model, producing a high-definition result.

This is because we wanted to introduce a finer detail in the stylization and the way finer strokes that are amazing when you zoom in, because they have they add a lot of work into the final image.

So in the past we could do it, but we could do it by duplicating the model and creating two different versions: one for the standard definition and one for the high definitions.

And this, of course, means that our app is twice as much the size besides the fact that the network has been trained to support any resolution.

Not anymore.

We are introducing flexible shapes.

And with flexible shapes, you have if you have you can have the single model to process more resolutions and many more resolutions.

So now in Xcode [ Applause ]

in Xcode you are going to see that this the input is still an image, but the size of the full resolution, the model also accepts flexible resolutions.

In this simple example, SD and HD.

This means that now you have to ship a single model.

You don’t have to have any redundant code.

And if you need to switch between standard definition and high definition, you can do it much faster because we don’t need to reload the model from scratch; we just need to resize it.

You have two options to specify the flexibility of the model.

You can define a range for its dimension, so you can define a minimal width and height and the maximum width and height.

And then at inference pick any value in between.

But there is also another way.

You can enumerate all the shapes that you are going to use.

For example, all different aspect ratios, all different resolutions, and this is better for performance.

Core ML knows more about your use case earlier, so it can it has the opportunities of performing more optimizations.

And it also gives your app a smaller tested surface.

Now which models are flexible?

Which models can be trained to support multiple resolutions?

Fully convolutional neural networks, commonly used for MS processing tasks such as style transfer, image enhancement, super resolution, and so on and some of the architecture.

Core ML Tools can check if a model has this capability for you.

So we still have the number of models Core ML uses in flexible sizes, and the size of the weights can be reduced by quantization.

But what about the number of weights?

Core ML, given the fact that it supports many, many different architecture at any framework, has always helped you choose the right the model of the right size for your machine-learning problem.

So Core ML can help you tackle the size of your app using this all these three factors.

In any case, the inference is going to be super performant.

And to introduce new features in performance and customization, let’s welcome Bill March.

Thank you.

[ Applause ]

Thank you.

One of the fundamental design principles of Core ML from the very beginning has been that it should give your app the best possible performance.

And in keeping with that goal, I’d like to highlight a new feature of Core ML to help ensure that your app will shine on any Apple device.

Let’s take a look at the style transfer example that Francesco showed us.

From the perspective of your app, it takes an image of an input and simply returns the stylized image.

And there are two key components that go into making this happen: first, the MLModel file, which stores the particular parameters needed to apply this style; and second, the inference engine, which takes in the MLModel and the image and performs the calculations necessary to produce the result.

So let’s peek under the hood of this inference engine and see how we leverage Apple’s technology to perform this style transfer efficiently.

This model is an example of a neural network, which consists of a series of mathematical operations called layers.

Each layer applies some transformation to the image, finally resulting in the stylized output.

The model stores weights for each layer which determine the particular transformation and the style that we are going to apply.

The Core ML neural network inference engine has highly optimized implementations for each of these layers.

On the GPU, we use MTL shaders.

On the CPU we can use Accelerate, the proficient calculation.

And we can dispatch different parts of the computation to different pieces of hardware dynamically depending on the model, the device state, and other factors.

We can also find opportunities to fuse layers in the network, resulting in fewer overall computations being needed.

We are able to optimize here because we know what’s going on.

We know the details of the model; they are contained in the MLModel file that you provided to us.

And we know the details of the inference engine and the device because we designed them.

We can take care of all of these optimizations for you, and you can focus on delivering the best user experience in your app.

But what about your workload?

What about, in particular, if you need to make multiple predictions?

If Core ML doesn’t know about it, then Core ML can’t optimize for it.

So in the past, if you had a workload like this, you needed to do something like this: a simple for loop wrapped around a call to the existing Core ML prediction API.

So you’d loop over some array of inputs and produce an array of outputs.

Let’s take a closer look at what happens under the hood when this when we are doing this.

For each image, we will need to do some kind of preprocessing work.

If nothing else, we need to send the data down to the GPU.

Once we have done that, we can do the calculation and produce the output image.

But then there is a postprocessing step in which we need to retrieve the data from the GPU and return it to your app.

The key to improving this picture is to eliminate the bubbles in the GPU pipeline.

This results in greater performance for two major reasons.

First, since there is no time when the GPU is idle the overall compute time is reduced.

And second, because the GPU is kept working continuously, it’s able to operate in a higher performance state and reduce the time necessary to compute each particular output.

But so much of the appeal in Core ML is that you don’t have to worry about any details like this at all.

In fact, for your app all you are really concerned with for your users is going from a long time to get results to a short time.

So this year we are introducing a new batch API that will allow you to do exactly this.

Where before you needed to loop over your inputs and call separate predictions, the new API is very simple.

One-line predictions, it consumes an input an array of inputs and produces an array of outputs.

Core ML will take care of the rest.

[ Applause ]

So let’s see it in action.

So in keeping with our style transfer example, let’s look at the case where we wanted to apply a style to our entire photo library.

So here I have a simple app that’s going to do just that.

I am going to apply a style to 200 images.

On the left, as in your left, there is an implementation using last year’s API in a for loop.

And on the right we have the new batch API.

So let’s get started.

We are off.

And we can see the new is already done.

We’ll wait a moment for last year’s technology, and there we go.

[ Applause ]

In this example we see a noticeable improvement with the new batch API.

And in general, the improvement you’ll see in your app depends on the model and the device and the workload.

But if you have a large number of predictions to call, use the new API and give Core ML every opportunity to accelerate your computation.

Of course, the most high-performance app in the world isn’t terribly exciting if it’s not delivering an experience that you want for your users.

We want to ensure that no matter what that experience is, or what it could be in the future, Core ML will be just as performant and simple to use as ever.

But the field of machine learning is growing rapidly.

How will we keep up?

And just how rapidly?

Well let me tell you a little bit of a personal story about that.

Let’s take a look at a deceptively simple question that we can answer with machine learning.

Given an image, what I want to know: Are there any horses in it?

So I think I heard a chuckle or two.

Maybe this seems like kind of a silly challenge problem.

Small children love this, by the way.

But so way, way back in the past, when I was first starting graduate school and I was thinking about this problem and first learning about machine learning, my insights on the top came down to something like this: I don’t know seems hard.

I don’t really have any good idea for you.

So a few years pass.

I get older, hopefully a little bit wiser.

But certainly the field is moving very, very quickly, because there started to be a lot of exciting new results using deep neural networks.

And so then my view on this problem changed.

And suddenly, wow, this cutting-edge research can really answer these kind of questions, and computers can catch up with small children and horse recognition technology.

What an exciting development.

[ Laughter ]

So a few more years pass.

Now I work at Apple, and my perspective on this problem has changed again.

Now, just grab Create ML.

The UI is lovely.

You’ll have a horse classifier in just a few minutes.

So, you know, if you are a machine learning expert, maybe you are looking at this and you are thinking, “Oh, this guy doesn’t know what he is talking about.

You know, in 2007 I knew how to solve that problem.

In 2012 I’d solved it a hundred times.”

Not my point.

If you are someone who cares about long-lasting, high-quality software, this should make you nervous, because in 11 years we have seen the entire picture of this problem turn over.

So let’s take a look at a few more features in Core ML that can help set your mind at ease.

To do that, let’s open the hood once again and peek at one of these new horse finder models, which is, once again, a neural network.

As we have illustrated before, the neural network consists of a series of highly optimized layers.

It is a series of layers, and we have highly optimized implementations for each of them in our inference engine.

Our list of supported operations is large and always growing, trying to keep up with new developments in the field.

But what if there is a layer that just isn’t supported in Core ML?

In the past, you either needed to wait or you needed a different model.

But what if this layer is the key horse-finding layer?

This is the breakthrough that your horse app was waiting for.

Can you afford to wait?

Given the speed of machine learning, this could be a serious obstacle.

So we introduced custom layers for neural network models.

Now if a neural network layer is missing, you can provide an implementation with will mesh seamlessly with the rest of the Core ML model.

Inside the model, the custom layer stores the name of an implementing class the AAPLCustomHorseLayer in this case.

The implementation class fills the role of the missing implementation in the inference engine.

Just like the layer is built into Core ML, the implementation provided here should be general and applicable to any instance of the new layer.

It simply needs to be included in your app at runtime.

Then the parameters for this particular layer are encapsulated in the ML model with the rest of the information about the model.

Implementing a custom layer is simple.

We expose an MLCustomLayer protocol.

You simply provide methods to initialize the layer based on the data stored in the ML model.

You’ll need to provide a method that tells us how much space to allocate for the outputs of the layer, and then a method that does the computation.

Plus, you can add this flexibility without sacrificing the performance of your model as a whole.

The protocol includes an optional method, which allows you to provide us with a MTL shader implementation of your model of the layer, excuse me.

If you give us this, then it can be encoded in the same command buffer as the rest of the Core ML computation.

So there is no extra overhead from additional encodings or multiple trips to and from the GPU.

If you don’t provide this, then we’ll simply evaluate the layer on the CPU with no other work on your part.

So no matter how quickly advancements in neural network models may happen, you have a way to keep up with Core ML.

But there are limitations.

Custom layers only work for neural network models, and they only take inputs and outputs which are ML MultiArrays.

This is a natural way to interact with neural networks.

But the machine learning field is hardly restricted to only advancing in this area.

In fact, when I was first learning about image recognition, almost no one was talking about neural networks as a solution to that problem.

And you can see today it’s the absolute state of the art.

And it’s not hard to imagine machine-learning-enabled app experiences where custom layers simply wouldn’t fit.

For instance, a machine-learning app might use a neural network to embed an image in some similarity space, then look up similar images using a nearest-neighbor method or locality-sensitive hashing or even some other approach.

A model might combine audio and motion data to provide a bit of needed encouragement to someone who doesn’t always close his rings.

Or even a completely new model type we haven’t even imagined yet that enables novel experiences for your users.

In all these cases, it would be great if we could have the simplicity and portability of Core ML without having to sacrifice the flexibility to keep up with the field.

So we are introducing custom models.

A Core ML custom model allows you to encapsulate the implementation of a part of a computation that’s missing inside Core ML.

Just like for custom layers, the model stores the name of an implementation class.

The class fills the role of the general inference engine for this type of model.

Then the parameters are stored in the ML Model just like before.

This allows the model to be updated as an asset in your app without having to touch code.

And implementing a custom model is simple as well.

We expose a protocol, MLCustomModel.

You provide methods to initialize based on the data stored in the ML Model.

And you provide a method to compute the prediction on an input.

There is an optional method to provide a batch implementation if there are opportunities in this particular model type to have optimizations there.

And if not, we’ll call the single prediction in a for loop.

And using a customized model in your app is largely the same workflow as any other Core ML model.

In Xcode, a model with customized components will have a dependency section listing the names of the implementations needed along with a short description.

Just include these in your app, and you are ready to go.

The prediction API is unchanged, whether for single predictions or batch.

So custom layers and custom models allow you to use the power and simplicity of Core ML without sacrificing the flexibility needed to keep up with the fast-paced area of machine learning.

For new neural network layers, custom layers allow you to make use of the many optimizations already present in the neural network inference engine in Core ML.

Custom models are more flexible for types and functionality, but they do require more implementation work on your part.

Both forms of customization allow you to encapsulate model parameters in an ML model, making the model portable and your code simpler.

And we’ve only been able to touch on a few of the great new features in Core ML 2.

Please download the beta, try them out for yourself.

Core ML has many great new features to reduce your app size, improve performance, and ensure flexibility and compatibility with the latest developments in machine learning.

We showed you how quantization can reduce model size, how the new batch API can enable more efficient processing, and how custom layers and custom models can help you bring cutting-edge machine learning to your app.

Combined with our great new tool for training models in Create ML, there are more ways than ever to add ML-enabled features to your app and support great new experiences for your users.

After just a short break, we’ll be back right here to take a deeper look at some of these features.

In particular, we’ll show you how to use our Core ML Tools software to start reducing model sizes and customizing your user experiences with Core ML today.

Thank you.

[ Applause ]

Apple, Inc. AAPL
1 Infinite Loop Cupertino CA 95014 US