Using Metal 2 for Compute

Session 608 WWDC 2017

Metal Performance Shaders (MPS) provides a highly tuned library of functions that extend the power of the GPU for more than just graphics. With Metal 2, MPS comes to the Mac along with an expanded set of capabilities. Learn how to tap into the latest image processing operations, perform linear algebra operations, and accelerate machine learning algorithms via new primitives and a graph API to build and execute neural networks on the GPU.

Good afternoon everyone, welcome to our talk on using Metal 2 for Compute.

My name is Anna Tikhonova.

I'm an engineer on the GPU Software Team, so let's begin.

The Metal 2 echo system is so much more than the Metal API and the language.

We also have the GPU Tools and we have the MetalKit and Metal Performance Shaders frameworks.

You might know Metal as this great technology for developing high-end games and graphics.

But it can also be used for Compute processing.

In fact, the Compute side of Metal is so powerful and flexible that the Metal Performance Shaders framework is built completely on top of Compute.

And in this session, we'll talk about what's new in the Metal Performance Shaders framework.

We introduced the Metal Performers Shaders framework, or MPS in 2015.

And the videos of our past sessions are available on our developer website.

MPS uses the compute power of the GPU to bring GPU accelerated primitives.

For image processing, linear algebra and machine learning.

The framework is optimized for iOS and we're happy to announce that this year we're also bringing MPS to the Mac.

[ Applause ]

Thank you.

The entire feature set is available in both iOS and macOS.

So let's begin with a quick update on our image processing support.

So here's a list of all of the primitives for image processing that we had available in iOS 10.

So there's Convolution, Gaussian Blur, Lanczos Resampling, just to name a few.

They're all now available in macOS.

And this year we're bringing you four new image processing primitives.

The Image Keypoints primitive can be used is often used in computer vision algorithms such as image stabilization and Bilinear Rescale, Image Statistics, and Element-wise Arithmetic Operators, are commonly used to pre-process images.

For example, in machine learning.

And the arithmetic filters also support broadcasting operations.

Which, for example, allow you to add a 2D image or the 1D image.

So that's it for our very quick update on image processing.

And now let's talk about the new Linear Algebra operations.

Without support, Matrix Multiplication, Matrix Vector Multiplication, and Triangular Matrix Factorization and Linear Solvers.

To support Linear Algebra operations, we now have multiple new data representations.

First, we have the MPSVector object which interprets the data in a metal buffer as a one-dimensional array.

And we have an MPSMatrix object which interprets the data in a metal buffer as a rectangular array.

And MPS matrices are in role major order.

And you can think of both MPSVectors and MPSMatrices as wrappers around user data buffers.

And we also support a temporary variance of MPSMatrix.

MPS images temporary images and MPSTemporaryMatrices are allocated from a Metal heap associated with a command buffer.

And they are called temporary because their lifespan is limited to the lifetime of the command buffer.

And we recommend you use temporary images and matrices for most of your intermediate storage.

Both MPSVector and MPSMatrix support a number of input types.

We support single-precision and half-precision input types and a floating-point input types.

And 16-bits and 8-bit signed integer input types.

And now let's take a look at how we can create an MPSVector of size N.

So if you don't already have a Metal buffer, you need to create one.

And then you need to create a descriptor for your vector.

And note here that you specify the length of the vector.

That's because the vector can be made from a portion of the original Metal buffer.

And other related offsets can be set in a kernel that will use this vector.

And then the last step is creating a vector from the buffer with a descriptor.

And now let's take a look at how you can create an MPSMatrix with M rows and N columns.

So it's very similar to the way you would create MPSVector, but there's just a few things we want to mention.

We provide a convenient API that you can use to find the recommended bytes per row value for sizing your Metal buffers.

And if you choose to use the API, this is how you would create a metal buffer with this recommended value.

And using this API is completely optional, but recommended for better performance.

And then the rest is simple.

You create a descriptor for your matrix, and then you create a matrix with a descriptor.

And now that we talked about the data presentations, now let's talk about the primitives.

So for Matrix-Matrix and Matrix-Vector multiplication our API is modeled after the standard BLAS GEMM and GEMV interfaces.

And for triangular matrix vectorization and linear solvers, our API is modeled after standard LAPACK decomposition and solve interfaces.

So if you're familiar with those interfaces our API will look very familiar to you as well.

And now let's take a look at a very simple code example.

So we'll be doing matrix multiplication and computing just C = A times B.

So first we need to create our matrices A, B and C.

But I know you know how to do this, I showed you in a previous slide, so let's move on.

Now we want to run matrix multiplication on the GPU.

So first we do our usual Metal setup to getting a device, a command queue, and a command buffer.

And then we need to create our matrix multiplication kernel.

And note here that you specify the size of the result.

That's because this kernel can operate on subregions of matrices.

And then we encode this kernel to the GPU and tell it to start doing the work.

And we already have sample code from Matrix multiplication available on our developer website, and the sample code for triangular matrix vectorization and solving a system of linear equations is coming very soon.

So that's it for our for the linear algebra operations.

Let's now move on to the next topic, which is Accelerating Machine Learning Primitives on the GPU.

There are a number of machine-learning related talks at WWDC this year.

And we are a part of the machine-learning community.

And this slide shows the overall architecture.

So as an application developer, you can add machine learning functionality to your applications by using high-level domain-specific frameworks such as division framework and the natural language processing framework, which rely on the Core ML framework.

And the Core ML framework is powered by the accelerates framework BNNS primitives on the CPU.

And by the machine learning and by the Metal Performance Shaders framework on the GPU.

But if you're writing an application that uses Metal, then you can use the MPS framework directly and I will show you how in this session.

So let's start with what are we talking about here?

What is deep learning?

What is Machine learning?

So imagine that this is you.

And when you see an image, you know immediately what's depicted on it.

It's a panda.

But now think about all of the images on your iPhone.

Or all of those pictures in your family albums.

Or all of the images on the internet.

No human can possibly can classify these many images.

But deep-learning algorithms is designed specifically to do that.

They can be used for sifting through large amounts of data and answering questions such as what is in this image?

Deep learning algorithms have two phases.

Training and inference.

So let's talk about training first.

And let's actually use an example.

Let's train a system to classify images.

So the training system to classify images, for example, if you want to have it recognize animals.

Like, to have it recognize cats, you need to feed this system a large number of labeled images of cats and then rabbits, and all the other animals that you want your system to be able to recognize.

And this training step is a one-time computationally expensive and labor-intensive step.

And it's usually done offline.

But the results of this training phase is trained parameters which are required for the next phase, the inference phase.

This is when your system is presented with a new image that it has never seen before and it needs to classify, this is a cap.

We provide view acceleration for the second phase; the inference phase.

Specifically, last year we talked about the building blocks for building convolutional neural networks on the GPU for inference.

So before we move onto any of the new features for machine learning that we brought you this year, we are going to review some of the core information that was covered in our last year's presentation.

Such as, what are convolutional neural networks?

And once we do that then we can talk about the new primitives that we've added for convolutional neural networks this year, and then we'll introduce the new, easy-to-use neural network graph API.

And our last topic will be recurrent neural networks.

So let's go into our recap.

So what are convolutional neural networks?

Convolutional neural networks are biologically inspired and designed to resemble the visual cortex.

So let's think about how our brain processes visual inputs.

The first hierarchy of neurons that receive information in the visual cortex is sensitive to specific edges and blobs of color.

While the brain region's further down the visual pipeline respond to more complex structures such as faces of our friends or kinds of animals like cats.

So in a similar way, CNNs are organized into a hierarchy of layers where high-level features are derived from low-level features.

So the first few layers in your network respond to low-level features like edges and blobs of color.

While subsequent layers respond to progressively more complex features such as faces.

And I keep saying features.

So think of a feature as a filter that filters your input data; that particular feature.

And here's a list of all of the CNN primitives that we had available in iOS 10.

And in this recap I will be just talking about the core convolution layer.

The core building block of a CNN.

And the rest of these primitives are covered in great detail in our presentation in our documentation.

So Pooling, Fully-Connected and SoftMax.

You can find information on those.

So let's talk about the core building block.

So the function of this core convolution layer is to recognize features in the input data and it's called a convolution layer because it performs a convolution on its input.

So let's recall how regular convolution works.

You have your inputs, your outputs and the filter.

And to convole a filter with the input data you need to multiply each value in your filter with the value in the input data and combine that information to compute a single output value.

And you do the same for the rest of the output pixels.

And now the convolution layer is a generalization of regular convolution.

It allows you to have multiple filters.

So you have as many filters as you have output channels or 16 in this case.

And these are the filters which are going to be filtering input data for particular features.

Now imagine that you're working with RGB data.

So you actually have three channels in your input.

And just because how CNNs work, this means you need three sets of 16 filters.

One set for each input channel.

And then these filters are applied to the input data separately.

And then the final step combines all of this information to compute a single output pixel.

So that's it for our recap of the convolution layer.

Now let's talk about the new primitives we've added for convolutional neural networks.

So as you can see, we've added quite a few.

And I'll be talking about the ones highlighted in yellow, but the rest of them like L2Norm Pooling, Resampling, Up-sampling, they will all be covered in our documentation.

So let's talk about updates to our core convolution layer.

We used to support only single precision floating-point weight types.

And now to help you reduce the memory footprint and to prove the performance of your networks.

We also support half-precision floating points, 8-bit integer, and binary weight types.

We used to support only standard convolution and now we also support binary and XNOR convolution.

Dilated convolution.

Sub-pixel convolution and convolution transpose operations.

And many of these are orthogonal, so you can even have dilated sub-pixel convolution if you want.

So let's go through them one-by-one.

Binary and XNOR convolution perform the same exact operation as regular convolution but they do so with improved performance and great space savings.

So in regular convolution, you may have floating point inputs and floating point weights.

What binary convolution allow you to do is to use your full-sized input with binary weights.

And for XNOR convolution the first thing that happens is that your input is first converted to binary so that both your inputs and the weights are binary.

In regular convolution, the input has to be multiplied with the weights.

And for XNOR convolution the separation becomes a simple XNOR operation.

And now let's talk about dilated convolution.

So we already know how regular convolution works.

You need to apply a filter to the input data to compute a single output value.

But say you're working on an algorithm that requires global integration of a wider context of your input data.

So instead of a 3 by 3 kernel, you may be using a 5 by 5 kernel to look out further.

But that's a lot more computationally expensive.

What you can do instead is use dilated convolutions which allows you to which allows you to use dilation factors to introduce gaps into your convolution kernel so that you're still using just a 3 by 3 kernel, but you can look out further.

And now let's talk about subluxal convolution and convolution transpose primitive; very commonly used for image upscaling.

And let's think about how upscaling usually works.

So you have your input data and you want to upscale it by a factor of 2.

So you won't you have some missing pixels to compute.

And usually upscaling is the fixed operation with a constant filter.

So for example how would a box filter help you to upscale this image?

So the box filter, which is take the known pixels and copy the known data into the missing location to get you upscaled results.

For sub-pixel convolution, your filters are not constant.

Your filters are learned from the data.

They are your trained parameters that you get from the training step where the system was trained to do this task; to do image upscaling.

So for 2x upscaling you get 4 filters.

For 4x upscaling you get 16 filters and so on.

So for our 2x upscaling we get our 4 filters and we apply them to the input data.

And then the output of that operation is reshuffled to get your final full-resolution image.

And now let's talk about how the convolution transpose primitive can be used to upscale images.

So we have our inputs and we still have to compute our missing data.

So the way that this primitive computes the missing data is that it applies a kind of convolution pass to this intermediate result with gaps to compute each output pixel.

So that's how you get your upscaled output.

And now we're going to show you how you can use these new convolution primitives to implement a real-world network.

So we took this colorization network that takes black and white images as input and produces colorized images.

And this particular network uses the dilated convolution primitive to integrate wider global context quicker.

And it uses the convolution transpose primitive to upscale the results of the network.

And now let's look at this colorization network in action.

So in this demo we have a collection of black and white images like this image of a lion.

And as soon as I tap on this image, the colorization network will run right here live on the device, and we'll see a colorized image.

And let's try another example for this beautiful snowy mountain.

And now we see it in color.

And this beautiful lovely image of a dad and a daughter playing guitar.

And now you can see them playing guitar in color.

And I really like this one, the brown bear walking in the forest.

So I think this network does just a really wonderful job.

Okay. So that's it for the live demo.

[ Applause ]

Thank you so much.

So we've added all of these new convolution CNN primitives, but that's not all.

We also went back and improved the performance of some of the core CNN kernels that were available to you in iOS 10.

So this chart will show the performance of the Inception-v3 network, which is a commonly used network for image recognition.

So it shows the performance of this network in iOS 11.

And as you can see, we're bringing you at least 20 percent performance improvement across different iOS hardware.

And now let's talk about the new neural network graph API.

the neural networks are commonly described using a graph abstraction like this visualization of the Inception-v3 network.

And we're now allowing to do just this using the new graph API.

So let's zoom in on one of these inception modules.

You have filter nodes which describe the operations that you can perform on your data.

Such as convolution, pooling, etcetera.

And you have image nodes which describe how the data flows between these different operations.

So why did we add this new graph API?

Well because it's easy to use.

You get this compact representation of your entire network and you can save it to disk and restore it, and that works across platforms.

You only need to initialize the graph once and then you can reuse it for multiple input images.

And you can execute the entire graph on the GPU with a single call.

There are no intermediate images for you to manage, you just need to take care of your input and output.

Internally we use Metal heaps to make sure that the memory footprint of all your intermediate images is as small as possible.

For example, for the Inception-v3 network this means 5x memory savings and 10x viewer allocations, which I think is pretty impressive.

So as I said, the graph does all the groundwork for you.

It takes care of creating intermediate images.

It takes care of sizing them.

It also it even sizes your outputs.

It takes care of the padding policies.

It takes care of censoring.

So in short, it's a lot less code for you to write and a lot fewer bugs for you to write as well.

And when I say less code, I mean a lot less code.

So last year we released this Metal recognition sample that uses the Inception-v3 network for image recognition.

And we took that sample and converted it to use the new graph API and found that we had to write four times less code.

And that's pretty much the same number of lines as Python code you would have to write in the open-source sensor flow framework to implement the same network.

And we just want to mention that we will be releasing this updated sample code updated example as sample code.

And now having all this information about your entire network allows us to deliver the best performance across different views.

We make it easy for your to parallelize between the CPU and the GPU.

so as the graph is executing as the GPU is executing the graph of one input image, the CPU can already prepare to execute the graph for a different input image.

We can also fuse graph nodes together like the convolution and neuron nodes.

And we can execute graph nodes concurrently.

So if we look at this inception module again, you can see that there are multiple rows of these nodes that can be executed completely independently of each other.

And of course the output of these independent executions need to be concatenated via concatenation nodes.

And the graph is smart enough to optimize those away as well.

And now let's take a look at how you can use the new graph API.

So this is the code for creating a convolution node using the graph API.

So it takes an image as source and it also has weights.

So let's talk about weights for a minute.

Neural networks keep growing larger and larger in size.

And if you have many convolution nodes in your networks, that means that the overall size of the weights for your entire network could be quite considerable.

And to help with that we've added a convolution data source protocol that you can implement and it provides just in time loading and purging of weights data.

So the idea is that the weights for your entire network do not have to be loaded in memory all at the same time.

They also do not have to be loaded in advance.

To help minimize the memory footprint, when we initialize the graph and we process a particular convolution layer, we'll load the weights for that convolution layer and then we purge them before we move on to the next convolution layer.

What you have to do is to implement this initialization method which just knows where the data is but it doesn't actually load it.

And then when the graph calls the load function that alerts you that the weights need to be loaded.

And then when the purge function is called by the graph then you can release the weights.

And now let's build a graph.

So here we're implementing this makeGraph function.

And on the left you can see all the nodes that make up our network that we need to build.

So then we create the nodes.

So we create the convolution node.

The pooling node.

And then the rest of the nodes.

So we have the nodes.

How do we connect them into a graph?

So we just take the result image of one node and pass it as a source image to the next node.

And then we have our graph.

And now let's run it on the GPU.

So first we do our usual Metal setup.

We initialize the graph.

We take care of our input data and then we encode the graph to the GPU.

And the data in the output image will be the output image will be populated with data when the command buffer completes.

And then we have an option to wait for the GPU to finish.

But we don't want you to do that.

When this happens the CPU is waiting for the GPU to finish before it can start encoding the next run of the graph.

And this introduces bubbles into your pipeline, which can adversely affect performance.

So what we want you to do instead is to use the new asynchronous executeAsync API.

So with this API your Metal setup is even smaller.

So you just need to get the Metal device.

Then you still need to initialize your graph.

Prepare the input data and then you executeAsync call.

It returns immediately and then the output image will be ready when this code inside the closure executes.

But in the meantime, you don't have to wait for the GPU to finish, you can already proceed with a coding and new GPU task.

And this way the CPU and the GPU are executing concurrently.

There are no bubbles in your pipeline and they're both utilized to full capacity.

Okay. And now I will do a live demo that demonstrates the performance difference between the synchronous and asynchronous APIs.

And this demo will be using the Inception-v3 network for image recognition.

All right.

So I will be starting with synchronous API and here we're detecting a water bottle.

And we're getting about 50 milliseconds per second per image on average.

And now I will switch to the asynchronous API.

And now we're getting about 36 milliseconds per image on average.

So that's pretty good performance improvement.

All right.

So that's it for the live demo.

[ Applause ]

Thank you.

Okay. Now that we've talked about the new neural network graph API and I showed you how easy it is to use and what great performance you can achieve with it, let's now switch gears and talk about recurrent neural networks.

So what are recurrent neural networks?

So one disadvantage of CNNs is their inability to remember anything that happened in the past.

They can take one image as input and generate a single output such as the set of probabilities of what is depicted in the image.

RNNs on the other hand have memory.

And they're good at operating on sequences.

So they can take one input such as a set of probabilities of what is depicted in the image and generate a sequence of outputs.

So a sequence of words that make up a caption for this image.

They can also take a sequence of inputs such as a sentence in English and generate a sequence of outputs such as the same sentence translated to a different language like Russian or Finnish.

And we support a number of different of variants of RNNs.

The single gate RNN, the long short-term memory RNN or LSTM, and multiple variants of LSTMs.

The GRU and the MGU.

So let's talk about the simplest kind of RNN, the single gate RNN.

the single gate RNN has a recurrent unit which enables the previous output over RNN to affect the output of the subsequent iterations of the same RNN.

But the single gate RNNs are not powerful enough to carry on important information for many iterations.

Because the current output of an RNN of the single gate RNN is also its current state.

There's nothing else there.

The solution to this is the long short-term memory RNN or LSTM.

It's built from single gate RNNs and it has an internal memory cell.

And a certain combination of gates control how the information flows inside LSTM.

And what is stored and not stored in the memory cell.

So let's take a look at the architecture of LSTM in more detail.

As I said, the most important entity inside LSTM is the memory cell which is updated in every duration of LSTM.

So you can think of each iteration of LSTM is this transition between the old and new memory.

And now let's talk about the gates.

So first there is a forget gate which decides what to keep and what not to keep from old memory.

And then there are the inputs and the cell gates and their combined contribution determines what from the current input will affect the new memory.

And then the combination of all of these three gates is combined to update the memory cell.

And finally, there is the output gate which determines what from the previous inputs the the previous output, the current inputs and the new memory will affect the output of LSTM.

So now that you know what LSTM is made up of, let's take a look at how you can create one using our framework.

So first you create a descriptor for the LSTM.

And then you need to initialize the gates.

So what controls the gates?

So what controls the gates with what controls how they operate is the trained parameters.

The ones that come from the training step where you train a system to do a particular task.

And there are multiple gates for you to initialize as you can see, but we're only showing two initializations just to be brief.

And as you can see, we're also using a data source provider.

The same one I showed you before to initialize the weights.

And the next step is to create our LSTM layer and now we want to run it on the GPU.

So we need to create our arrays that will hold the input and output for the sequence of the LSTM executions.

And then we encode the sequence to the GPU.

And here we're showing you the a matrix-based RNN, but we just want to mention that we also support RNNs that operate on MPS images via convolutions.

And now let's take a look at an actual example.

So we'll use image captioning as an example of using LSTM.

So as you recall, I told you that deep learning algorithms have two phases.

The training phase and the inference phase.

So to train a system to caption images you need to feed it a large number of images with human-generated captions.

So what does this system have?

Like what is it made out of?

So this system has a CNN and a RNN working together to generate captions.

The CNN is used to figure out what's depicted in the image and then the RNN is used to generate the actual caption.

And the output of that process is the trained parameters which are required for the next step, the inference step.

So in the inference phase, the trained parameters control both the operation of the CNN layers and the operation of the RNN gates.

And then for each image it's processed by both the CNN and the RNN to generate a caption.

So we already know a good network for figuring out what's depicted in the image.

It's the Inception-v3 network, so we'll use that.

And we just talked about LSTMs, so let's use that to generate our caption.

And the caption generation phase the caption generation process also has two phases.

So first we have the LSTM initialization phase.

So we run our Inception-v3 network and we actually run all of the layers except the very last SoftMax layer.

And the output of that is a feature vector which has information about what is depicted in the image.

And then we take that feature vector and convert it to a compact representation that's required by LSTM.

And then run that through LSTM to initialize it.

And then once we have our initialized LSTM, then we're ready for the next phase.

Our actual caption generation phase.

And we start this process by passing in a special sentence start ID token to our LSTM.

And the output of that operations is a sequence of words which are, you know, the words that are connected to what is depicted in the image.

And then we pass those words to a SoftMax layer which computes probabilities for these words.

And we pick the three best ones.

And these three best words are also our one-word partial captions for a particular image.

So we take those words and pass them to the next situation of LSTM which function is to now come up with three best two-word captions for our image and so on.

We execute for N iterations until we reach a stopping condition, which is when we either reach the maximum number of words that we want to be in our caption, or when the probabilities evolving the newly generating captions drop to 0.

So I know this is still pretty abstract.

So let's look at the output of LSTM of an actual output of LSTM for several iterations for a particular image.

So in this image we have, you know, our surfer riding a wave.

And we want to compute the top three captions for this image.

And in the first iteration of LSTM we generate three best words.

So which are our best one-word captions for this image.

So "man", "a", and "the".

And the word "a" has the highest probability.

So we take these three words and we pass them to the next iteration of LSTM.

And in this iteration, for every one of these three starter words, LSTM generates three new words that have the highest probability of following each one of these starter words.

Right? So we have three new words to follow the word "man".

Three new words to follow the word "a", and three new words to follow the word "the".

And now as you can see, each one of these two-word captions also have a probability.

And because the word "a" had such a high probability in the first iteration, then the captions that start with the word "a" in the second iteration also end up having the highest probability.

Why? Because the probability of a two-word caption is just a product of the probabilities of the words that make up that caption.

So that's how we get these three best ones.

And then we take them and we move on to the next iteration.

And in the next iteration we just add one more word to our captions so that we have three-word captions.

And then we compute the probabilities of those captions and pick the best three.

And we move on to the next iteration where we just end up adding one more word to our caption.

So we have four-word captions.

And then we compute the probabilities of all these captions and pick the best three.

And so on I think you get the idea.

Let's just skip to the end.

So in the end, we get our three top captions for this particular image.

And the best one is a man riding a wave on top of a surfboard, which I think it's pretty close.

So [applause] and let's now do a demo.

[ Applause ]

So now we'll do a demo of this of the captioning network.

So we have a collection of images here, and as soon as I tap on an image then the CNN will run to determine what is depicted in the image.

And then the RNN will run to generate the actual caption.

So let's try it out.

A man riding a wave on top of a surfboard.

So we already know this.

Now let's try another image.

An old truck is parked in the field.

So the network actually knows that it's an old truck and that it's parked and not moving, which I think is pretty impressive.

Now let's try one more.

A black and white dog laying in the grass.

So the network knows that it's a black and white dog and that it's laying in the grass, not running.

Not walking.

Not sitting.

Laying in the grass.

So pretty cool.

[ Applause ]

Thank you.

And on this note, let's go to the summary.

So in this session we talked about all of the new primitives that we added to the MPS framework this year.

We've expanded our support for image processing primitives and for convolutional neural networks.

And we've added support for linear algebra and recurrent neural networks.

the framework is optimized for iOS, as I told you, and now these primitives are also all available on the Mac.

We also talked about the new neural network graph API and we showed you how easy it is to use, to build and execute your networks on the GPU.

And that it makes it possible for us to deliver the best performance for your networks across the different GPUs.

And we would love to see you use all of this new functionality to create a really great apps and tell us about it.

So please check out the related Metal 2 sessions and the sessions on the core ML and Accelerate and Vision Frameworks.

And for more information about this session and for links of sample code, please check out this link on our developer website and thank you so much for coming, and have a great WWDC.

[ Applause ]

Apple, Inc. AAPL
1 Infinite Loop Cupertino CA 95014 US