# Metal for Accelerating Machine Learning

## Session 609 WWDC 2018

### Metal Performance Shaders (MPS) includes a highly tuned library of machine learning primitives leveraging the tremendous power of the GPU. With iOS 12 and macOS Mojave, MPS adds capabilities to accelerate the computationally intensive task of training a neural network. Learn performance optimizations for inference, and understand the training process for both convolutional and recurrent neural networks (CNNs and RNNs).

[ Music ]

[ Applause ]

Good afternoon, everyone.

Welcome to our talk on Metal for Accelerating Machine Learning.

My name is Anna Tikhonova, and I'm an Engineer on the GPU Software Team.

The Metal Performance Shaders Framework is built on top of Metal.

And it provides GPU accelerated primitives that are optimized for both iOS and macOS.

We provide primitives for image processing, linear algebra, and machine learning.

We talked extensively about inference in our past WWDC sessions, so I just want to highlight them here.

And this year, we've also added support for training on both iOS and macOS.

[ Applause ]

Thank you.

We've also added support for accelerated ray tracing on our platform, and we had an entire session on this topic earlier this week.

So, it was titled Metal for Ray Tracing Acceleration.

And the video for the session will be available online shortly.

In this session, I will talk primarily about machine learning, specifically training.

So, I mentioned training and inference.

Deep learning algorithms consist of these two phases.

The first phase is the training phase.

So, let's use an example where we want to train a model, to categorize images into classes like cats, dogs, giraffes, etcetera.

So, in order to train a model to recognize cats, we need to feed it a large number of labeled images of cats.

And same for the rabbits and all of the other animals you want your model to be able to recognize.

Training is a computationally expensive and time-consuming, iterative process.

The result of training are trained parameters.

The trained parameters are required for the next phase, the inference phase.

This is when your model is presented with a new image, that is never seen before, and it needs to classify it based on the trained parameters.

This is a cat.

We now provide GPU acceleration for both the training and inference phases.

But before I talk about training, let me tell you about some enhancements to CNN Inference we've added this year.

We now have support for FP16 accumulation for the convolution and convolution transpose primitives.

This new feature is available on devices with an Apple A11 Bionic GPU.

We find that using FP16 accumulation for inference workloads is more than sufficient in terms of precision for many commonly used neural networks.

FP16 accumulation offers better precision and significant power benefits.

So, please take advantage of it in your inference workloads.

And this is an example of how you can enable FP16 accumulation for a convolution primitive.

You just need to set the accumulatorPrecisionOption property.

And now, let's talk in depth about the main topic of this session, training neural networks.

And let's start with training convolutional neural networks.

So, here we have a simple, handwritten, digit recognition network that takes an image of a handwritten image as input, and assigns it to one of 10 classes, from 0 to 9.

In this example, we are correctly classifying this image as an image of a digit 7.

For inference, we initialize our network with trained parameters.

So, in this example, the trained parameters add the weights for the convolution, and fully connected primitives.

The goal of a training algorithm is to compute these trained parameters, so the network can use them to map inputs to correct outputs during inference.

When we start the training process, we don't have any weights.

We need to compute them.

So, the first step is to initialize the weights with small, random numbers.

And now we are ready to train this network.

So, let's take a look at all the steps involved in a training process.

Training is an iterative process, and each iteration of training can be divided into four steps.

The first step is the forward pass.

This is when we take an input and pass it to our network to produce an output.

It's very similar to inference.

And next, we need to compute loss.

So, loss intuitively measures the difference between the network's output and the ground truth.

The objective of a training algorithm is to minimize loss.

Our next step is the gradient pass.

This is when we propagate this difference when the network's output and the ground truth, back to the network to update weights.

The idea is that as training continues, our network is becoming better trained, so it's better able to map inputs to correct outputs, which in turn helps to minimize loss.

So, this is an overview and now let's take a look at each one of these steps in more detail.

The forward pass involves propagation forward to the network, to compute an output.

As you can see, during this first situation of training, our network is not doing so well.

It's outputting a result that's clearly wrong.

So, why is it doing so badly?

Well, this is expected.

We just initialized our weights with some random numbers.

We haven't trained the network to do any better yet.

So, now, we need some weight to quantify how well or how badly our network is currently doing.

So, we can use this information to improve our weights, so that hopefully after more iterations of training, the network can produce a better result.

But in order to know how well we're doing, we need to know what the right answers are.

So, the ground truth, which I will call labels from now on, is an input to the network along with the input image.

So, in this case, it's a vector of 10 values, where we have 1 for the correct class, Class 7, and zeros for all the other classes.

The output of the network is our 10 probabilities.

One per class.

So, as you can see in this first situation of training, the network is producing a very low result for the correct Class 7, and it's assigning the highest probability to a Class 9, which is why the network is returning 9 as the answer.

So, now we take all of this information and we pass it to our loss primitive.

And as I mentioned previously, loss measures the difference between the network's output and the ground truth.

And the objective of a training algorithm is to minimize loss.

And now, we also need the second half of the graph.

The second half of the graph contains gradient primitives for each responding forward primitive.

The gradient primitives compute gradients that are needed to update weights.

So, the loss primitive computes the first gradient, which is a derivative of a chosen loss function with respect to its inputs.

And then we take this gradient and back propagate it, backward through the network, backward through the first gradient primitive in the backward direction.

In this case, it's the SoftMax gradient primitive.

And we use the Chain Rule to do that.

So, the Chain Rule allows us to back propagate gradients, backwards through the network.

And we're computing these gradients so that we can update weights.

So, we're computing very small deltas to apply to the weights, in each iteration of training.

And then we can use these updated weights in the next iteration of training, to ideally produce a lower loss value, which is what we're trying to minimize.

In practice, any situation of training, we're not going to operate on a single image.

We're going to operate on a group or a batch of images.

For example, a batch of size 32 or 64.

And we need a corresponding batch of labels, for loss computation.

So, in this case, we have a batch of labels, was 1 for the correct class and zeroes everywhere else.

And in each situation of training, we're going to use a different batch of images and a responding batch of labels.

So, let's now run through several iterations of training.

For the first batch of images, we're doing a forward pass, computing loss, and doing a gradient pass.

And updating weights.

So, what happens with the second batch?

Exactly the same process.

The forward pass, then we compute loss, to the gradient pass, and update weights.

And as we go through iterations of training, we want the loss for our network to decrease and the accuracy of the network to increase.

And we continue training until the loss falls below a particular threshold and the accuracy of our network reaches a desired level.

And then we know that the network is fully trained and now we can use the computed trained parameters for inference.

And now, let's take a look at the steps necessary to train a neural network using the Metal Performance Shaders Framework.

Neural networks are often described using graph abstraction.

So, in MPS, we enable you to describe your networks as a graph.

So, the first step is to create a training graph.

Then we need to prepare our inputs.

We need to specify weights.

And then we execute the graph.

So, we run the forward paths, compute loss, do the gradient pass, and update weights.

And training is an iterative process.

It can take many iterations to train a network.

So, we'll also need to know when we can stop training.

So, let's now discuss each one of these topics in more detail.

Let's start with creating a training graph.

So, as I said, in MPS, we enable you to describe your networks as a graph using a neural network graph API.

So, here we again have a visualization of our handwritten, digit recognition network.

But in this visualization, you can also see the image notes.

They're the small white notes.

The image notes are for your data.

For your input, your outputs, and all of the intermediate results.

They describe how data moves between different operations.

And then, operations on the data, like convolution and pooling, are described with your filter nodes.

We support all of the nodes necessary to create commonly used neural networks.

And now, let's take a look at how easy it is to use the neural network graph API.

So, here's an example of how you can create an MPSImageNode using the neural network graph API.

And this is how you would create a convolution node using the graph API.

And now, for every forward node, we support a corresponding gradient node for training.

It takes a single line of code to create a gradient node from the forward node.

So, here is an example of how you can create a gradient convolution node from the convolution node.

And now, let's build an entire graph.

So, here we have a very small network.

We have a convolution node followed by a pooling node, followed by another convolution node.

So, how do we connect these nodes into a graph?

So, that's easy.

We take the result node of the result image of one node and pass it as a source image to the subsequent node.

And here we have an entire connected graph.

And now, let's build a training graph.

So first, we need to add a loss node to the graph.

And now, let's add some gradient nodes.

So, as I said, it takes a single line of code to create a gradient node from its corresponding forward node.

And then we connect them as previously, and now you have a complete training graph.

So, as you can see from this example, the graph API is very simple to use.

The graph does a lot for you automatically.

It manages all the intermediate results, and even the output image.

It minimizes the memory footprint of your networks, by aliasing memory for all your intermediate images, using Metal heaps.

It can also fuse graph nodes.

For example, it can fuse batch normalization and neural nodes.

And it can optimize away nodes.

For example, it optimizes the way you can cut nation nodes.

The graph also automatically handles padding and manages state objects for you, which we will discuss later in this session.

So, please take advantage of the graph API.

So, now that we know how to create a training graph, let's now take a look at the inputs we need to pass to the graph.

And for this, let's take a look at the encode call we will use to encode the graph to the GPU.

So, as I already mentioned, we're not going to send in one image at a time for training.

We're going to operate on groups or batches of images.

So, one of the inputs to the graph, is a batch of images.

And as you recall, for every batch of images, we also need a corresponding batch of labels for loss computation.

The labels for loss computation are passed to the graph as states.

So, the code call also takes a batch of states as input.

And now, let's talk about these batches and states.

What are they?

Let's start with batches.

So, batches are just arrays of images or states.

We've added them this year specifically to support training.

There are two new MPS types for you to use: MPSImageBatch and MPSStateBatch.

So, here's an example of how you can create a single image, using our API.

So, here we're creating one from an existing Metal texture.

And this is an example of how you can create a batch of images, using our API and append a new image to the batch, so you can pass it to the graph.

And now, what are state objects?

An MPS state is an opaque data container.

In training, it's frequently used to capture a state of a forward node, when it's called.

And so, it can later be used by the gradient node.

So, the graph manages all of the state objects.

So, as a developer, you generally don't need to worry about states.

But it's nice to know how they work.

So, let's use a specific example.

So, let's go back to our handwritten digit recognition network.

And take a look specifically at the drop-out and drop-out gradient nodes.

The forward drop-out node drops, or it zeroes out, values in its input, with a certain probability.

And then, the dropout state object captures information about the forward drop-out operation, so it can later be used by the drop-out gradient node because it used to zero out values in its input gradient at the exact same locations as was zeroed out by the forward operation.

So, as I said, you don't generally need to worry about states, because the graph manages them.

But because the labels for loss computation are passed as states to the graph, and because they require user input.

So, that's your ground truth or correct results.

You need to create a batch of labels for loss computation and pass this batch as input to the graph.

So, this is an example of how you would create a single label for loss computation.

You first need to create a loss data descriptor which describes how the label's data is laid out in memory.

And then you need to create an MPSCNNLossLabel object, with this descriptor.

And then you create a batch of these for training, and when the GPU's done running the graph, your batch of labels will contain the per image loss values for the batch.

And you can inspect these values, or you can compute a single value across the batch and inspect that value.

So, now that we have a training graph and we talked about how to provide inputs to our graph, let's talk about how to provide weights to the graph nodes that require weights.

The only way to provide weights to convolution fully connected, batch normalization and instance normalization nodes, is through data source provider protocols.

This is an example of how to create a convolution node, with a data source provider.

You need to implement a class that conforms to the protocol.

We call it MyWeights in this example.

Data source providers are very useful in many ways.

For example, if you have many convolution nodes in your network, the overall size of the weights for the network can be quite considerable.

And we do not want the weights for all of your convolution nodes to be in memory all at the same time.

We want to keep the memory footprints of your networks as low as possible.

And data source providers come into play here because they provide just in time loading and purging of weights data.

So, we load the weights for one convolution kernel, when we process it.

And then we purge them before moving on the next convolution.

So, here's an implementation of MyWeights.

You need to provide an initialization method that is responsible for pulling in memory and making it ready.

And then the graph will call the load function.

And then when the purge method is called, you can release the weights.

Data source providers are also essential for training, and we will discuss this later in this session.

So, now that we have a training graph and we've prepared our inputs and specified weights, we're ready to execute the graph on the GPU.

To change the [inaudible] graph on the GPU, we first need to do the usual Metal setup.

We need to initialize our training graph.

So, we have prepared our inputs.

And now, let's train a network on the GPU.

Training is an iterative process.

So, we want to set up a training loop.

And we usually want to execute our graph over a number of EPOCHS.

The number of EPOCHS is the total number is the number of times we want to iterate over our entire data set.

And we want there to be multiple iterations in each EPOCH.

So, the number of iterations is the total number of images in your data set divided by batch size, like 32 or 64.

And now, let's take a look at each training iteration.

In each training iteration, we encode a batch of images for training.

But we don't want the CPU to wait for the GPU to finish running one run of the graph, with one batch of images before the CPU can start encoding commands to the command buffer for the next run of the graph.

We want the CPU and the GPU to work concurrently.

And for this, we're going to use double buffering.

So, in this setup, we're going to create a counting semaphore with an initial value of 2.

It's because we want only two encodes to be in flight at the same time.

And then when we enter the training iteration function, we're going to call weight on the semaphore.

That's decrementing it.

So, if the value of the count has already been decremented to zero, we wait.

Otherwise, we continue.

And then we encode our graph, and the encode call returns immediately.

And a user specified callback is called, when the GPU is done running the graph.

So, now we know.

The GPU is done running the graph, and the CPU can continue encoding more work to the GPU, work that was previously waiting on the semaphore.

So, why are we using double buffering?

Why not encode more runs of the graph, to the GPU concurrently?

Well, it takes a lot less time to encode commands to the command buffer, than to run the graph.

So, we don't want to encode too many runs of the graph concurrently to minimize memory footprint.

Okay, we've talked about executing the graph.

When we execute the graph, we do the forward pass, we compute loss, we do the gradient pass, and the graph will also update weights.

So now, let's talk about weight updates.

As I mentioned, data source providers are essential for training.

All of your weight updates need to happen through an optional update method on a data source provider.

The graph will call the update method automatically.

So, what does the weight update step actually involve?

Let's take a look.

So, recall that we're computing gradients during the gradient pass that we can apply small deltas to the weights, in each situation of training.

How these deltas are applied to the weights, is described by an optimizer.

It's just a function that takes the old weights, the computed gradients as input, and produces updated weights as outputs.

You will use an optimizer in the update method of your data source provider.

And we support a number of different variants of the weight update step on the GPU, including Adam, Stochastic Gradient Descent, and RMSProp.

And you can even define your own custom update weight step if you prefer.

So now, let's take a look at how to use an optimizer in MPS.

So, recall that your data source provider has an init method.

This is where you want to create your optimizer because you only want to create it once.

And now, let's take a look at the implementation of our update method.

The update method receives the source state and gradient state as inputs.

So, the source state contains the old weights, the gradient state contains the computed gradients, and now we can encode our optimizer with this data, and the last step is to return the source state, which now has the update weights.

So, pretty simple.

And now we have just one more step to discuss.

So, as I said, training is an iterative process.

It can take many iterations to train a network.

And you will need to know when to stop training.

So, let's now discuss how can you make this decision in the context of your training loop?

So, here we again have our training loop, where we're training a neural network for a number of EPOCHS.

To check whether you can stop training, you need to have a test set of images.

So, a test set of images contains images that are not used for training.

They're only used to evaluate the accuracy of your network.

So, after each EPOCH, you can optionally wait for the graph to for the GPU to stop running the graph, and then you can use the current trained parameters to initialize an inference network.

And then you can run this inference network on your test set, and you can optionally stop training when the accuracy of your network on this test set, reaches a particular level.

So, now that we've discussed all of the steps that are necessary to train a neural network in MPS, it's time for a demo.

So, as was already mentioned in the platform State of the Union, the Metal Performance Shaders Framework powers Core ML, Create ML, and Turi Create.

To Turi Create is an easy to use, flexible, high-performance tool set for creating Core ML models for tasks such as image classification, object detection, recommendations, and more.

For more information on Turi Create, we want to refer you to the A Guide to Turi Create Session Video.

We've prepared a demo where we will be using an we'll be training an object detection network in Turi Create powered by MPS.

As was mentioned in the platform State of the Union, this is nine times faster than without MPS.

An object detection network draws bounding boxes around recognized objects.

So, in this demo, I will be using a MacBook Pro, with a connected external GPU.

I will be running Turi Create on the MacBook Pro and I will use an external GPU to train the network with MPS.

This is a great example of how you can use an external GPU to enhance the computational power of a MacBook Pro.

The external GPU we're using is an AMD Vega GPU.

So, in this demo setup, I've already imported Turi Create, and preloaded the object detection network, and a training data set.

So, now let's train this network for 10 iterations.

And now, the entire object detection network, all the primitives, the optimizer, the weight update step, everything is running on the external GPU.

Okay, so we're already done with 10 iterations of training, it would take more than 10 iterations to train this network, and we're not going to do this on stage.

But what I'm going to do right now, is to load a network that we pretrained in advance, run it on a test set of images and visualize some of the results.

So, let's take a look.

Okay, so here we have a banana, that's correctly classified as a banana.

And we have a bounding box, and now we have a perfect breakfast of a cup of coffee and a croissant, and very, mean-looking egg.

Okay, so that's it for the Turi Create demo.

[ Applause ]

Thank you very much.

And now, let's switch gears and talk about training recurrent neural networks.

But first, let's do a recap of what are the recurrent neural networks?

One of the disadvantages of convolutional neural networks is their inability to remember anything that happened previously.

They can take one input, such as an image, and generate a single output, such as a set of probabilities of what is depicted in the image.

RNNs on the other hand, have memory.

And they're good at operating on sequences of inputs and outputs.

For example, they can take one set of probabilities, so what is depicted in an image, which is an output of a CNN, and generate a sequence of outputs, which is a sequence of words that make up a caption for this image.

They can also take a sequence of inputs, such as a sequence of words that make up a sentence and generate a sequence of outputs which is a same sentence but translated to a different language.

For example, to ration or finish.

With support, a number of different variance of RNNs.

The most commonly used one is the Long Short-Term Memory RNN, or LSTM for short.

In our last year's WWDC Session, we talked extensively about the gates inside LSTM and walked through a LSTM inference example.

So, please refer to that session for more information on LSTM inference.

This year, we've added support for training, for all of these variants of RNNs.

And in this session, I'm going to talk about training LSTMs.

So, let's use a specific example.

So, here we have an activity classifier network which takes motion sensory data as input.

For example, reading some sensors like an accelerometer or a gyroscope.

And then the network uses this data to identify a physical activity performed by the user.

So, for example, we want to know if a user is cycling, skiing, or walking.

As you can see, this network is set up in an interesting way.

So, it contains a series of CNN primitives, followed by LSTM primitive, followed by more CNN primitives.

So, why is it set up this way?

Let's take a look.

So, even though our input is sensor data, it's represented by a batch of 1D images with six feature channels.

So, one feature channel for access in the accelerometer and gyroscope readings.

And each 1D image has 2,000 pixels.

And you can think of them as samples in time because the activity we're trying to identify, occurs over time.

And then we pass these images through a 1D convolution primitive which compresses these 2,000 samples, to just 20 samples.

But it expends a number of feature channels, because so, we're not losing any features in the data.

And then, this new representation of the data, is passed to LSTM primitive as a sequence of lengths 20.

And we ran LSTM for 20 iterations.

So, our LSTM is operating on a sequence of lengths 20 instead of 2,000, so it's operating on a higher-level feature representation of the data.

And then we have additional CNN primitives that we find high-level features in the data.

And the last primitive in this network is the SoftMax primitive which generates probabilities for the different activity classes, which is the output of the network.

And now, let's take a look at how to train this network.

So, we again need a loss primitive, which takes the output of the network and the labels as input.

And then we need the second half of the graph.

So, in the second half of the graph, we again have gradient primitives for the corresponding forward primitives, including the LSTM primitive.

And now, for training, we do the forward pass through the network, then we compute loss, and we do the gradient pass to compute gradients that will be used to update weights.

So, this is a very similar setup that we have for a CNN training.

And the last step is of course to update the weights and as you know, the LSTM also has weights, so they need to be updated as well.

And now, let's take a look at how to train this network in MPS.

But first, let's take a look at how we can create LSTM layer for training using our framework.

So, first, you need to create LSTM layer descriptor.

And we initialize the descriptor with initial training parameters using data source providers.

So, these initial training parameters are, use smaller random number or some checkpoint values.

The descriptor setup for training, is exactly the same as it is for inference.

And we discussed the layer descriptor setup in our last year WWDC session in a lot more detail.

So, I want to refer you to the session for more information on LSTM layer descriptor setup.

Once you have the descriptor, the next step is to create LSTM training layer with this descriptor.

MPS will populate training weights using the data sources specified in the descriptor.

And we also need to have some matrices to hold the computed gradients.

You will use the createWeightGradientMatrices API on the training layer to create these matrices.

And then, the training weights will be used in a forward and gradient passes and will be passed to an optimizer along with the computed gradients, job, to date, weights.

And now we need to prepare some inputs and outputs for training our LSTM.

So, here's an example of how you can create the matrices to hold the input and output sequences for both the forward and gradient passes.

You will need 20 matrices for each one of those.

And here is how you would initialize these matrices with data.

And now, we are ready to train our activity classifier network in MPS.

So, in this code example, I will be highlighting only the LSTM filter in the interest of time.

So, in the forward pass, we ran a sequence of 20 matrices forward through the LSTM training layer.

And then in the backward pass, we ran a sequence of 20 matrices though the LSTM layer to compute gradients.

And now, you have the training weights, and you have the computed gradients, and you can pass them to an optimizer to update weights.

So, there's just one more thing I'd like to mention.

[Inaudible] neural networks operate on images and LSTMs operate on matrices.

And we'll provide convenience kernels in the framework to make it easy to convert between images and matrices.

So, in order to copy an image to a matrix, you need to use the MPI's Image Copy to Matrix Kernel.

So, this is how you can create one, and this is how you can encode one on a batch of images.

Here, each row in a destination matrix, will contain one source image.

And to copy from a matrix to an image, you need to use the MPS Matrix Copy to Image Kernel.

This is how you can create one and this is how you encode one to the GPU.

So, we just showed you how to train CNNs and RNNs using MPS.

We also showed you a demo of Turi Create which is now powered by MPS.

And now it's time for one more demo.

We have been working with Google to add support to the Metal Performance Shaders Framework to TensorFlow to accelerate machine learning on macOS, and we would love to show you a demo of that in action.

Specifically, we want to show you a demo of training the InceptionV3 Object Classification Network, using TensorFlow, powered by MPS.

So, for this demo, I will again be using a MacBook Pro with an attached external GPU.

So, I will be running TensorFlow on this MacBook Pro, and I will use an external GPU to train a network using MPS.

So, in this demo setup, I've already imported TensorFlow and preloaded the InceptionV3 Network and a training data set.

So, now, let's train this network for 30 iterations.

So, you can see how fast this is going.

Again, the entire network, all of the primitives, the optimizer, and the weight update step, everything is running on the external GPU.

And we're already done.

And as you can see, the training rate is approximately 100 images per second.

So, as was stated in the platform State of the Union, training the InceptionV3 Network, in TensorFlow powered by MPS, is up to 20 times faster than without MPS.

So, this is it for the TensorFlow demo.

Thank you very much.

And now, let's summarize this session.

This year, we've added a FP16 accumulation for the convolution and convolution transpose primitives to improve the performance of CNN inference.

We've also added GPU accelerate primitives for training neural networks.

These primitives are optimized for both iOS and macOS.

We've also added the neural network graph API for training.

It makes it very easy to train neural networks on the GPU and enables us to provide the best performance across different GPUs.

For more information on this session and links to related resources, please go to our developer website.

We have the Metal for Machine Learning Lab tomorrow at 9 a.m.

So, we would love to talk to you.

So, please come talk to us.

And thank you for coming and have a great WWDC.