[ Music ]
[ Applause ]
My name's Justin.
I'm an engineer in GPU Software, and this is Metal for Machine Learning.
Today we'll be discussing the Metal Performance Shaders framework, and new machine learning features that we added this year.
Metal Performance Shaders, or MPS, is a collection of GPU-accelerated primitives, which allow you to leverage the high-performance capabilities of Metal on the GPU.
MPS provides kernels for image processing, linear algebra, ray tracing, and machine learning.
Now, the machine learning kernels support both inference and training, and they're optimized for iOS, macOS, and tvOS.
MPS also provides a convenient way of building neural networks through the graph API.
So, here we can see how MPS fits into the larger Apple ML ecosystem.
You have higher-level frameworks like Core ML and Create ML that give you a convenient way to implement many of your networks.
But if you want a little more flexibility and control over your program, you can use a lower-level framework like MPS.
And this year, we have expanded our machine learning support with several new features.
We've added kernels to support even more networks than before, we've improved performance on existing networks, and we've made MPS even easier to use.
Now, as we go over these new features, it's going to be helpful to know a few things about how inference and training work in machine learning.
So, let's briefly review these concepts.
So, inference is the process of applying a network on an input, in this case an image, and producing an output or a guess of what it is.
Now, the network is made up of a variety of functions such as convolutions and neuron activations, and these layers in turn depend upon a set of parameters.
During inference, these sets of parameters are fixed, but their values are determined during the training process.
So, what happens during training?
During training, we give the network many images of known objects.
The training process involves repeatedly classifying these images, and as we do so, we update our parameters, and each iteration of the network produces a better set of parameters until we finally reach a set of parameters that allows us to best classify the images.
Now, at that point we stop the training process, and our parameters are ready to be used in inference.
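So, to make that loop concrete, here is a minimal sketch in plain Python rather than MPS code. The one-parameter model, the learning rate, and the names `train` and `infer` are assumptions chosen for illustration. It only shows the shape of the process: parameters are updated repeatedly during training and then held fixed for inference.

```python
def train(samples, learning_rate=0.1, iterations=200):
    """Fit y = w * x by repeatedly updating the single parameter w."""
    w = 0.0  # initial parameter value
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad  # each iteration moves w toward a better value
    return w


# Known inputs paired with their known answers (here, y = 3x).
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
trained_w = train(samples)


def infer(x, w=trained_w):
    """Inference: the trained parameter is fixed and applied to new input."""
    return w * x
```

Once `train` returns, `infer` never changes `w`, which mirrors how the trained parameters are used as-is during inference.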
So, let's look at how we can use MPS to implement some of these ideas.
But I would like to mention that there's a lot more to inference and training than what we just covered here.
So if you want more details, please see some of our talks from the past couple years.
Now, we added several new features this year to better support a wide range of inference and training networks.
So, first we made creating graphs of your network simpler by supporting implicit creation of your training graphs from your inference graphs.
We've added kernels for separable loss layers and random number generation to enable a variety of new networks, and we've added support for things like predication and better control over how MPS commits its work to improve performance.
So, let's start with implicit graph creation.
With implicit graph creation, we can implicitly create our training graphs from our inference graphs.
So, let's first review how we create a graph for our network.
Here we have a simple inference network.
It's made up of some convolution layers, some pooling layers, and finally some fully connected layers.
So, we're going to create a graph for this network by creating nodes for each layer.
We're going to create a convolution node for each of the convolution layers, a pooling node for each of the pooling layers, and finally some fully connected nodes for the fully connected layers.
So now with our inference graph defined, we can extend it to a training graph.
We do this by first appending a loss node at the end of our inference graph, and then we add gradient nodes for each of our forward nodes, moving in the reverse order of our inference graph.
So, we can look at the code for this section.
As before, we start by adding the loss node, and then we add each gradient node, just moving in the same order we just mentioned.
So, here we can see that each gradient node is pretty easily created from the forward node, but with implicit graph creation, this is even simpler.
Now, once you've initialized your gradient image with the loss node, we can automatically create the entire training graph corresponding to the inference graph.
So, as before, we create our loss node.
Then with a single line of code, we can create our entire training graph.
Now, in this case, we're creating the training graph from the loss node.
We're going to use a nil argument for our source gradient, which tells the loss node to use its result to initialize the gradients.
But we could use another image if we wanted.
And we're also providing nil for the second argument.
This is called a node handler.
The node handler allows you to provide a block, which you can use to execute some custom code to configure your nodes after they're created.
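So, the idea behind implicit graph creation can be sketched in a few lines of plain Python. This is not the MPS API: the `Node` class, the `training_graph` function, and the `node_handler` callback signature are hypothetical stand-ins. The sketch walks back from the loss node and creates one gradient node per forward node, calling the optional handler on each new node.

```python
class Node:
    """Hypothetical stand-in for a forward graph node (not an MPS class)."""

    def __init__(self, name, source=None):
        self.name = name
        self.source = source  # the node whose result feeds this one


def training_graph(loss_node, node_handler=None):
    """Create gradient nodes in the reverse order of the inference graph."""
    gradients = []
    node = loss_node.source  # the loss node itself supplies the initial gradient
    while node is not None:
        gradient = Node(node.name + "_gradient")
        if node_handler is not None:
            node_handler(gradient, node)  # optional hook to configure the new node
        gradients.append(gradient)
        node = node.source
    return gradients


# A small inference graph: conv -> pool -> fc -> loss.
conv = Node("conv")
pool = Node("pool", source=conv)
fc = Node("fc", source=pool)
loss = Node("loss", source=fc)

configured = []
grads = training_graph(loss, node_handler=lambda g, f: configured.append(g.name))
names = [g.name for g in grads]
```

Passing `None` for the handler, as in the talk, simply skips the configuration step.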
I also want to mention another useful feature, the stop gradient property.
So, typically, when you generate your training sequence, all of your trainable layers will update their weights.
In this case, those are convolutions and the fully connected layers.
But in some cases, you may only want to update the weights for some of the layers of the network, as in transfer learning, for example.
Now, in transfer learning, we are going to use pretrained weights for many of the layers, and we only want to train the weight for some of the layers.
Let's say, for example, the final fully connected layers.
Implicit graph creation also supports creating graphs for these types of networks through the stop gradient property.
So, to do this, we're going to set the stop gradient property on the first layer whose weights we want to update.
In this case, the fully connected layer.
And then when the graph is generated, none of the subsequent gradient nodes will be created.
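So, here is a rough sketch of how a stop gradient flag could cut the reverse walk short. Again this is plain Python with hypothetical names (`Node`, `stop_gradient`, `training_graph`), not the MPS implementation. The flagged layer still gets its gradient node, but nothing earlier in the network does.

```python
class Node:
    """Hypothetical forward node with a stop-gradient flag (not an MPS class)."""

    def __init__(self, name, source=None, stop_gradient=False):
        self.name = name
        self.source = source
        self.stop_gradient = stop_gradient


def training_graph(loss_node):
    """Walk backward from the loss, stopping after a flagged node."""
    gradients = []
    node = loss_node.source
    while node is not None:
        gradients.append(node.name + "_gradient")
        if node.stop_gradient:
            break  # layers before this one keep their pretrained weights
        node = node.source
    return gradients


# Transfer learning: only the fully connected layers are trained.
conv = Node("conv")
pool = Node("pool", source=conv)
fc1 = Node("fc1", source=pool, stop_gradient=True)
fc2 = Node("fc2", source=fc1)
loss = Node("loss", source=fc2)

trained = training_graph(loss)
```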
So as you can see, using implicit graph creation is a very easy way of generating your training graphs from your inference graphs.
So now let's look at a feature we added to support some new networks.
Separable loss kernels.
So, earlier, we just saw how to use a loss node using MPSCNNLoss.
MPSCNNLoss consumes a final image, which is usually the result of something like a softmax layer, along with the ground truth data in order to compute gradient values to begin the back-propagation phase.
But there are some networks which use multiple intermediate loss values in order to produce a final loss.
So, to support this, we added separate forward and gradient loss kernels.
So, here we can see two loss values being computed using forward loss nodes, and then we take those results, add them together to produce a final loss.
Now, we need to initialize the gradient value to begin the back-propagation phase.
Before, this happened implicitly through the loss node. Now, we need to add an initial gradient kernel.
This is going to generate just a gradient image of ones, and it's going to be sized to the result of the final loss calculation.
So with the gradient values initialized, we can start the back-propagation phase.
We're going to use gradient kernels for each of our forward kernels with an addition gradient and gradients for each forward loss kernel.
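So, the flow just described can be sketched numerically in plain Python. The choice of mean squared error and the helper names `mse_forward` and `mse_gradient` are assumptions for illustration, not the MPS kernels. Two forward losses are added, an initial gradient of ones starts back-propagation, and the addition passes that gradient unchanged to each loss gradient.

```python
def mse_forward(source, labels):
    """Forward loss: mean squared error between source and labels."""
    return sum((s - l) ** 2 for s, l in zip(source, labels)) / len(source)


def mse_gradient(source, labels, upstream):
    """Gradient of the forward loss with respect to source, scaled by upstream."""
    n = len(source)
    return [upstream * 2.0 * (s - l) / n for s, l in zip(source, labels)]


prediction = [0.5, 1.5]
labels_a = [1.0, 1.0]
labels_b = [0.0, 1.0]

# Two separate forward losses, added together into the final loss.
loss_a = mse_forward(prediction, labels_a)
loss_b = mse_forward(prediction, labels_b)
final_loss = loss_a + loss_b

# The initial gradient is all ones, sized to the final loss (a scalar here).
initial_gradient = 1.0

# The gradient of an addition hands the incoming gradient to both inputs.
grad_a = mse_gradient(prediction, labels_a, initial_gradient)
grad_b = mse_gradient(prediction, labels_b, initial_gradient)
total_gradient = [a + b for a, b in zip(grad_a, grad_b)]
```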
Now, let's take a look at a network that uses separable losses.
Specifically, we're going to look at style transfer.
Now, the style transfer network produces images which are combinations of a style and an original image.
The model we'll be looking at is one that you can find in Turi Create, and it's implemented using MPS.
Now in inference, this network consists of a transformer node, which is made up of things like convolutions and instance normalization layers, and their weights make up the trained parameters.
This is where the style is incorporated.
It's learned into the parameters through the training process.
So let's look at how we do the training.
So here we have an overview of the network.
Now, as in inference, we're going to apply the transformer to produce a stylized image.
Now, in this case, this is going to be the network's current guess at the best styled image, which combines the style and the content.
And since the goal of the network is to match both the desired style and the content of the original image, we're going to need two loss values.
So, the first loss value is computed by this subnetwork that we're going to call the style loss network.
This loss value is going to help ensure that the network converges on a result, which closely matches our desired style.
And then we also want to make sure that the generated image also retains the features of the original.
So, for this we're going to use a second loss network.
This is the content loss.
And we can use our new forward loss kernels for each of these loss calculations.
But let's take a closer look at the style loss network.
So, in order to compute the style loss, we need a way of sort of measuring the style of an image.
So, to do this we're going to calculate what's called the Gram Matrix for several intermediate feature representations of the images.
This year, Gram Matrix calculations are natively supported in MPS with both forward and gradient kernels.
So let's take a quick look at the Gram Matrix and how it's computed.
So, the Gram Matrix represents uncentered cross-correlations between feature vectors.
Now, each feature vector results from spatially flattening the results from a single image in a single feature channel.
We compute dot products between each feature vector to produce a Gram Matrix.
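So, as a small worked example of that definition, here is a plain-Python sketch that does not use the MPS kernels. The function name `gram_matrix` and the optional `alpha` scale factor are assumptions for illustration. Each feature channel is flattened into a vector, and entry (i, j) of the result is the dot product of vectors i and j.

```python
def gram_matrix(features, alpha=1.0):
    """features[c] is one 2-D feature channel (a list of rows) of a single image."""
    # Spatially flatten each channel into one feature vector.
    vectors = [[value for row in channel for value in row] for channel in features]
    # Entry (i, j) is the (optionally scaled) dot product of vectors i and j.
    return [
        [alpha * sum(a * b for a, b in zip(vi, vj)) for vj in vectors]
        for vi in vectors
    ]


# Two feature channels, each 2 x 2.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
gram = gram_matrix(features)
```

For N feature channels the result is always N x N, regardless of the spatial size of the image, and it is symmetric.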
So let's take a look at how it's used.
So, before we get to the Gram Matrix, we're going to use the VGG image classification network to extract some features from both our style and our stylized input image.
Now, as we described before, the Gram Matrix gives us correlations between feature vectors.
Now, when we take these from the features extracted from the style, this gives us our sort of ground truth for the style that we want to apply, and we're also going to do the same thing for our current guess at the best stylized image.
Now, we take these two values together to form our style loss.
So now let's look at how we compute the second of our two losses, the content loss.
So, as before, we're going to extract features using VGG and then compute a loss using those features and our features from our stylized image.
And then the network's final loss is going to be the sum of the content loss and the style loss.
So, now let's look at how we can use MPS to compute these values and initialize gradients.
So first, let's assume we have our feature representations, as produced by VGG.
First, we're going to add our Gram Matrix calculation nodes to compute the Gram Matrix for both the style and our stylized image.
We're going to feed these results into a forward loss node to just compute the loss for our style.
The source image here is the result of the Gram Matrix calculation for our network stylized image.
The Gram Matrix for the reference style image is going to be used in the labels argument.
Now this shows an important feature of the new forward loss kernels.
Previously, you had to pass labels using MPS state objects.
But now, you can use MPS images.
So now we can add the loss node for the content loss using the features of the stylized image and the original image, and we can combine them to get our total loss value.
And now we need to initialize the final loss gradient to begin the back-propagation phase.
We do this using the initial gradient node we discussed before.
Now we saw before how the result of the loss node can be used to implicitly generate the training graph.
This is because it generates the initial gradient.
But now, as I mentioned before, with the separable loss kernels, we do this explicitly using the initial gradient node.
So, this is the node that we're going to use to generate our training graph.
So, with the graph generated, let's take a look at what this network does in action.
So, here we can see the style transfer network running on the GPU using MPS.
This was run on a MacBook Pro with an AMD Radeon Pro 560 graphics card.
Now this is showing the results of the style transfer training at each iteration as it progresses.
As you can see, the style is being applied progressively, but the content of the image is being retained.
I also want to mention that these iterations have been sped up for this video from real time to better illustrate the progression of the training network.
So now, I'd like to look at another feature that we added this year, random number generation.
This year we added support for two types of random number generators in MPS.
We have a variant of the Mersenne Twister called MTGP32 and a counter-based generator called Philox.
Now these generators were chosen because their algorithms are well suited to GPU architectures, and they still provide sequences of random numbers with pretty good statistical properties.
Now, you can use these kernels to generate large sequences of random numbers using buffers and GPU memory.
And since you have this result available in GPU memory, you can avoid having to synchronize large arrays of numbers from the CPU.
And generating random numbers like this is important for several machine learning applications.
They're required, for example, for initializing the weights of your networks for training and also for creating inputs when training generative adversarial networks, or GANs.
Now GANs are an especially important use case for random number generators.
You have to generate the random input at each iteration of your training.
If you had to synchronize an array of numbers from the CPU, every iteration, it could make training your network prohibitively expensive.
So, let's take a closer look at those networks and how we can use the new random number generators.
So, generative adversarial networks or GANs are built around two networks.
We have a generator network and a discriminator network.
Here we have an example of a generator, which just generates images of handwritten digits.
Now, similar to image classification during training, we're going to provide the network with many examples of handwritten digits.
However, instead of attempting to classify them, the network is going to attempt to generate new images from a random initial set of data to look similar to its training set.
So, in order to perform this training process, we need some way of determining how similar these images should be.
So, for this we use the second network, which we call the discriminator.
Now as this name suggests, it's designed to discriminate between training images and those images which are simulated by the generator.
So, in this case, it acts as an image classifier network but with only two possibilities.
The input is either real, from the training set, or it's a generated, or a fake image.
So you can see, here's the discriminator, looking at some numbers and coming up with whether they're real or fake.
Now, typically both the generator and the discriminator are trained together.
We train the generator to produce more realistic images, while we train the discriminator to better distinguish synthetic images from the training images.
So, here we have a high-level overview of the nodes for your training network.
So here's our discriminator training network.
It consists of two loss calculations.
So, this is an example of where you could use the separable loss nodes we just talked about.
We have one loss where we attempt to ensure the discriminator properly classifies the simulated images as fake, and we have a second loss where we train the discriminator to classify the real images from the training set as real.
After computing the separate loss values, we can use an initial gradient node to initialize your training graph.
And secondly, here we have the generator training network.
This one is a little simpler.
It just has a single loss value.
But in this case, we use a label value of real to ensure that our generator generates images, which the discriminator subsequently classifies as real.
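So, the label choices for the two training networks can be sketched in plain Python. The talk does not name the loss function, so binary cross-entropy is an assumption here, as are the names `bce`, `discriminator_loss`, and `generator_loss`. The inputs are the discriminator's outputs, read as the probability that an image is real.

```python
import math

REAL, FAKE = 1.0, 0.0


def bce(prediction, label):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    return -(label * math.log(prediction) + (1.0 - label) * math.log(1.0 - prediction))


def discriminator_loss(d_on_real, d_on_fake):
    """Two separate losses: generated images labeled fake, training images labeled real."""
    fake_loss = bce(d_on_fake, FAKE)
    real_loss = bce(d_on_real, REAL)
    return fake_loss + real_loss


def generator_loss(d_on_fake):
    """A single loss with the label 'real': the generator wants to fool the discriminator."""
    return bce(d_on_fake, REAL)
```

A discriminator that answers 0.9 for real images and 0.1 for generated ones gets a lower `discriminator_loss` than one that answers 0.5 for everything, while the generator's loss drops as the discriminator's output on generated images rises.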
Now, I mentioned earlier that the generator network begins with a random set of data that we're going to use our random number generator for.
So, let's take a closer look at random number generation.
Now random number generation kernels belong to the MPSMatrix subframework, and they're accessed through the MPSMatrixRandom classes.
So, they operate on MPSMatrix and MPSVector objects, which means they work with Metal buffers, and they support generating random integers with the underlying generator, or you can generate floating point values using a uniform distribution.
So, here, we're going to create a distribution descriptor for uniform distribution of values between 0 and 1.
Then we're going to create our generator, specifying the proper data types, and then we give it an initial seed.
Finally, we create a matrix to hold the result, and we encode the operation to the command buffer.
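So, to illustrate the counter-based idea behind these steps, here is a plain-Python sketch. It is not Philox or MTGP32 and it does not use the MPS classes. The mixing function is a SplitMix64-style stand-in, and the names `mix` and `uniform_matrix` are hypothetical. The point is that each element depends only on the seed and its own index, so every element could be computed independently, which suits a GPU.

```python
MASK64 = (1 << 64) - 1


def mix(counter, seed):
    """SplitMix64-style mixing of (seed, counter); a stand-in, not Philox."""
    z = (seed + (counter + 1) * 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def uniform_matrix(rows, columns, seed, minimum=0.0, maximum=1.0):
    """Fill a rows x columns matrix with uniform values in [minimum, maximum)."""

    def element(index):
        bits = mix(index, seed) >> 11       # keep the top 53 bits
        unit = bits / float(1 << 53)        # uniform value in [0, 1)
        return minimum + (maximum - minimum) * unit

    return [[element(r * columns + c) for c in range(columns)] for r in range(rows)]


# Uniform distribution between 0 and 1, with an explicit seed.
matrix = uniform_matrix(rows=3, columns=4, seed=42)
```

Using the same seed reproduces the same matrix, which is handy when debugging a training run.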
So, now let's go back to the network and see how we can use it.
So, here's a closer view of the generator network.
We have some convolution layers, some ReLU layers, and the hyperbolic tangent neuron.
Now the input image is going to be the output of our random number generator.
As we saw before the random number generator works with matrices, but the graph and all the neural network kernels require images.
So, we're going to use our MPS copy kernel to copy the data from the matrix into an image.
So, first we'll create a matrix to hold our random values.
Then we'll also create an image, which is going to serve as the input for our network.
And we're going to initialize a copy kernel to perform the copy.
Then we're going to encode our random number generator to generate the values.
We're going to encode the copy to copy them into the image, and now we're going to encode the network using the image.
Now, for more details on this network and using MPSMatrix random number generation kernels, please see the online documentation.
There's also some sample code.
Now we also added features to help improve the performance and efficiency of networks using MPS.
So let's take a look at one of them now, predication.
With predication, you can now conditionally execute MPS kernels.
The kernels' execution is predicated on values which exist in GPU memory, and they're referenced at the time of execution of the kernel.
So, let's take a look at a network, which illustrates how this can be used.
This is image captioning.
This is a network we showed a couple years ago, and it generates captions of images using a convolutional neural network and a recurrent neural network.
The convolutional network is a common classification network.
In this case, we're using Inception V3.
It's going to be used to extract features from the source image.
Then we take these feature maps, and we feed them into a small LSTM-based network where those captions are generated from the extracted features.
Now, then we iterate this network to produce the image caption.
In this case, we need to run the LSTM-based network for some number of iterations, which is going to be fixed, and we need to do it at least as many times as we believe will be needed to generate the captions for the image.
In this case, for example, we run the LSTM-based network 20 times.
Each iteration then computes the best captions by appending a new word to the captions produced in the prior iteration.
But if the caption were to only require five words, then we've had to run many more iterations than we need.
With predication, we can end the execution early.
In this case, after the five-word caption has been generated.
So let's look at how we can use this in MPS.
But to do so, we need to first discuss how we provide predicate values to MPS commands, and for this, we introduce the MPSCommandBuffer.
Now, MPSCommandBuffer is a class that conforms to the MTLCommandBuffer protocol, but it adds a little bit more flexibility.
It can be used anywhere you're currently using Metal command buffers, and like a MTLCommandBuffer, it's constructed from a MTLCommandQueue.
Now, it provides several important benefits.
It allows you to predicate execution of MPS kernels, and as we'll discuss later, it allows you to easily perform some intermediate commits as you encode your MPS work, using a method called commitAndContinue, but we'll get back to that later.
First, let's look at how we can use MPSCommandBuffers to supply predicates to MPS kernels.
So an MPSPredicate object contains a Metal buffer, which contains 32-bit integer predicate values, and an offset.
We take the value within the Metal buffer at the offset as the execution predicate.
Now, a value of 0 means we don't want the kernel to execute, and a nonzero value means to execute as normal.
So, in this diagram here, we've effectively bypassed the execution of this kernel by setting the value at the offset to 0.
And the offset is important.
It can allow you to share a single Metal buffer among multiple MPSPredicate objects so you can send a predicate to multiple kernels.
Each predicate value will be referenced with a different offset.
Now, in order to use a predicate value, we have to attach it to an MPSCommandBuffer.
This way, any MPS kernels that we encode on that command buffer will perceive the predicate values.
So, let's take a look at how we can create a predicate and set it on an MPSCommandBuffer.
So, first, we create an MPSPredicate object, and we attach the predicate to our MPSCommandBuffer.
Now, we'll encode an operation that modifies the predicate values.
Now, because the predicate values exist in Metal buffers, we need a kernel that produces its result in a Metal buffer.
You can use your own kernel, or you may be able to use one of the MPSMatrix kernels, which is what we're going to do here.
So, we're going to start by wrapping the predicate in an MPSMatrix object.
Then we're going to encode a kernel to modify the predicate value.
So, here, we're just using a linear neuron kernel, and we're going to use it to do something simple.
We're just going to decrement the value of the predicate.
And finally, we're going to encode a CNN kernel, which reads the value of the predicate prior to execution.
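So, the behavior of predication can be modeled in plain Python. This is a CPU-side simulation with hypothetical names (`Predicate`, `run`), not the MPS API. A predicate is a shared buffer of integers plus an offset, the value is read when a command executes rather than when it is encoded, and a decrement kernel counts the predicate down to zero. Here 20 caption iterations are encoded, but only five actually run.

```python
class Predicate:
    """Hypothetical stand-in: a shared integer buffer plus an offset into it."""

    def __init__(self, buffer, offset):
        self.buffer = buffer
        self.offset = offset

    def value(self):
        return self.buffer[self.offset]


def run(commands):
    """Simulated execution: each predicate is read at execution time."""
    executed = []
    for name, predicate, work in commands:
        if predicate is not None and predicate.value() == 0:
            continue  # a zero predicate bypasses the kernel
        work()
        executed.append(name)
    return executed


# One buffer shared by two predicates at different offsets.
predicate_buffer = [5, 1]
caption_predicate = Predicate(predicate_buffer, offset=0)
other_predicate = Predicate(predicate_buffer, offset=1)


def decrement():
    predicate_buffer[0] -= 1  # stands in for the kernel that modifies the predicate


# Encode all 20 iterations up front, as if the caption length were unknown.
commands = []
for i in range(20):
    commands.append((f"lstm_{i}", caption_predicate, lambda: None))
    commands.append((f"decrement_{i}", caption_predicate, decrement))
commands.append(("classifier", other_predicate, lambda: None))

executed = run(commands)
lstm_runs = [name for name in executed if name.startswith("lstm")]
```

Because the decrement is predicated as well, the value stops at zero, and the kernel guarded by the second offset is unaffected.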
So, using predication in MPSCommandBuffers is an easy way of eliminating unnecessary work in your networks.
If you have kernels, which can be bypassed, you can use predication to take advantage of the reduced workload.
And if there are multiple kernels for which this applies, you can use multiple predicates and use only a single Metal buffer by setting unique offset values.
So, now let's talk about the other feature of MPSCommandBuffers, commitAndContinue.
Now this is a method which allows you to easily get better GPU utilization when executing your work.
So, to see how it can benefit, let's first review how a typical workload is executed.
Now, the usual way of executing MPS kernels is to encode your work onto a command buffer and then commit it for execution.
So, here we have a case of a single command buffer, you encode some work, and then we execute it afterwards.
Now, in reality, the CPU's encoding time is going to be less than the GPU's execution time, but we want to avoid any idle time due to throttling and things like that.
So you can see we're going to get some stalling here between the CPU and the GPU.
Now, one way of solving this is to use double buffering.
With double buffering, we're going to keep around two command buffers, and we're going to encode work to one while executing the other.
Now, this should pretty well eliminate the idling that we saw before, but it has some limitations.
So, first off, as I mentioned, you're going to have to keep two sets of work, which means you're going to have to find a way to partition your work into two independent workloads.
And as a result, you can have substantially increased memory requirements.
However, with the commitAndContinue method, we can gain much of this performance benefit by dividing each workload into smaller portions.
So, here we're going to break down the work by utilizing independence of layers within each command buffer.
Then we're going to commit the smaller groups of work using double buffering.
Now, commitAndContinue is automatically going to handle this internal division of work while also ensuring that any temporary objects that you allocated on the command buffer will remain valid for subsequent work to be encoded.
As with double buffering, it allows you to execute work on the GPU while continuing to encode it on the CPU.
And by easily allowing you to partition your workload, you can avoid the increased memory requirement of double buffering while still getting much improved GPU utilization.
So let's see how you can take advantage of this in your own code.
So here we have four MPS kernels we're encoding to a MTLCommandBuffer.
And finally, we commit the work for execution.
As we showed earlier, this is going to give you the stalls that we saw.
However, by using MPSCommandBuffers and the new commitAndContinue method, we can easily improve this.
So, here we're going to create an MPSCommandBuffer.
We'll encode our first two kernels.
Then we'll call commitAndContinue.
This will commit the work that we've already encoded, move any allocations forward, and allow us to immediately continue encoding the other two kernels.
Finally, we can commit the remaining work using a regular commit.
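So, to see why the intermediate commit helps, here is a toy timeline model in plain Python. It does not call any Metal API, and the per-kernel encode and execute times are made-up numbers. The function `finish_time` is a hypothetical helper that tracks when the GPU finishes, given the kernel indices after which work is committed.

```python
def finish_time(kernels, commit_after):
    """Return when the GPU finishes.

    kernels: list of (encode_time, execute_time) pairs, in encoding order.
    commit_after: set of kernel indices after which the encoded work is
    committed. The last index must be included or that work never runs.
    """
    cpu_clock = 0.0   # time at which the CPU has finished encoding so far
    gpu_clock = 0.0   # time at which the GPU becomes free
    pending = 0.0     # execution time encoded but not yet committed
    for index, (encode, execute) in enumerate(kernels):
        cpu_clock += encode
        pending += execute
        if index in commit_after:
            # The batch starts once it is committed and the GPU is free.
            gpu_clock = max(gpu_clock, cpu_clock) + pending
            pending = 0.0
    return gpu_clock


# Four kernels, each taking 1 unit to encode and 2 units to execute.
kernels = [(1.0, 2.0)] * 4

single_commit = finish_time(kernels, commit_after={3})           # commit once at the end
commit_and_continue = finish_time(kernels, commit_after={1, 3})  # commit halfway, then at the end
```

In this model the single commit leaves the GPU idle for all four encode steps, while committing after the second kernel lets the GPU start two units earlier.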
So you can see, using commitAndContinue requires very few changes to your code, but if you're taking advantage of the graph, it's even easier.
When you encode an MPSNNGraph using MPSCommandBuffer, it will automatically use commitAndContinue to periodically submit work throughout the encoding process.
No further changes are needed.
Simply use an MPSCommandBuffer instead of a MTLCommandBuffer.
And finally, I want to point out that you can still combine commitAndContinue with double buffering and get even better performance.
So, as you can see here, it allows you to eliminate even the small stalls that we saw with commitAndContinue.
So, we now have a variety of options for committing our work for execution.
You can use a single command buffer, executing a single piece of work at a time.
For better performance, potentially with increased memory consumption, you can use double buffering.
And now, with MPSCommandBuffer, you can achieve nearly the same performance using commitAndContinue.
And if you still want even better performance, you can use commitAndContinue and double buffering.
So let's take a look at how these approaches perform on a real-world network.
So for this case, we're going to look at the ResNet 50 network running on the CIFAR-10 dataset.
Now this data was measured using an external AMD Radeon Pro Vega 64 GPU.
It's a common image classification network with many layers, so it's a good example of what we can see with commitAndContinue.
So we're going to start with our single buffering case as our baseline.
We have performance and memory consumption here on the vertical axis.
So, let's see how double buffering compares.
Now we've improved the performance quite a bit, but we've also increased our memory consumption by a similar amount.
That's because we achieve double buffering by maintaining twice as much work in flight at any given time.
So, let's look at using commitAndContinue.
We come very close on the performance and with significantly less memory overhead, and here we also see commitAndContinue along with double buffering.
We still get a little bit better performance, but we still use a lot more memory as well.
So, you can see, using commitAndContinue is a very easy way to achieve much better performance with minimal increase in memory pressure.
So now, let's put all of these approaches together by looking at another application of machine learning, denoising.
Now as this name suggests, denoising seeks to remove noise from a noisy image and produce a clean one.
Now, we're going to be looking at this in the context of ray tracing.
If you saw the earlier Metal for Ray Tracing session, you saw another example of denoising, one using image processing techniques.
Here, we're going to be looking at a solution based on machine learning.
So for this example, we'll look at three phases.
We're going to create an offline training process.
We're going to run the training network, and finally we're going to deploy the inference graph to filter new images.
So, first, we need to create the graph.
Let's take a closer look at the structure.
So here we're going to start with our input image, which is our noisy image, which came out of our ray tracer.
We're going to feed this image into encoder stages.
Now encoders are small subnetworks which extract higher-level feature representations while spatially compressing the image.
We're going to pass these results into our decoder stages.
Now these perform the reverse process.
They're going to reconstruct the image from the feature maps.
Now we're also going to use what are called skip connections.
These boost features from the encoded image into each decoder stage.
This is done by forwarding the result from each encoder to its decoder.
Finally, the denoised image is fully reconstructed.
So, let's take a closer look at the encoder stages.
The encoder stage compresses the images while trying to learn how to preserve its features. It consists of three pairs of convolution and ReLU layers and finally a max pooling layer.
Let's look at the code.
Now, as we saw before, we can construct each node in the sequence in the same order they appear in the network.
And we'll construct the decoders in the same way.
You start with an upsampling layer.
After this, we add the result of the corresponding encoder via the skip connection, and then finally we have two pairs of convolution and ReLU layers.
Again, as before, we're going to insert nodes corresponding to each layer in the network.
Now we can put our encoder and decoder stages together.
So, first we're going to connect our encoder nodes.
But before we move on and connect our decoder nodes, we need to put in one more encoder node, which we're going to call the bottleneck node.
It's identical to an encoder except it doesn't have the final max pooling layer.
And after the bottleneck nodes, we're going to connect our decoder nodes.
Now, by passing the result image from the corresponding encoder nodes, we're going to satisfy the skip connections.
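So, the wiring of that graph can be sketched in plain Python as a list of nodes and their inputs. The function `build_denoiser` and the node names are hypothetical, and pairing each decoder with its encoder in reverse order (deepest encoder first) is an assumption based on the usual arrangement of this kind of network. No MPS calls are made, so only the connections are shown.

```python
def build_denoiser(num_stages):
    """Return the graph as (node_name, input_names) pairs in encoding order."""
    graph = []
    previous = "noisy_image"
    encoder_results = []

    # Encoder stages, connected one after another.
    for i in range(num_stages):
        name = f"encoder_{i}"
        graph.append((name, [previous]))
        encoder_results.append(name)
        previous = name

    # The bottleneck: like an encoder, but without the final max pooling.
    graph.append(("bottleneck", [previous]))
    previous = "bottleneck"

    # Decoder stages; each also receives its encoder's result (the skip connection).
    for i in reversed(range(num_stages)):
        name = f"decoder_{i}"
        graph.append((name, [previous, encoder_results[i]]))
        previous = name

    return graph


wiring = build_denoiser(num_stages=2)
```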
So now we have the inference graph.
Let's look at the training phase.
To begin the training phase, we need to compute the loss value.
So we're going to start with the result of the inference graph, which for a training iteration is now our network's best guess at the current denoised image.
Now, we're going to take the clean RGB image for our ground truth, and we're going to use that to compute a loss value.
Now, we're also going to want to compute a second loss.
We're going to perform some edge detection.
We're going to do this using a Laplacian of Gaussian filter.
Now, we want to do this because we want our network to learn how to denoise the image, but at the same time we also want to make sure that it preserves the edges of the original image.
So, we're going to implement the Laplacian of Gaussian or the LoG filter using convolutions here.
Finally, we're going to combine these two losses.
The first loss we're going to call the RGB loss and the second the LoG loss, and we're going to combine these into the final loss.
So now let's take a closer look at how we do this.
So, we're going to create our RGB loss node using the result of the inference graph and the ground truth RGB images.
So, as we mentioned earlier, we can use separable loss kernels, and we're going to pass images for both our source and our labels.
For our LoG loss, we need to apply the LoG filter to the target RGB images as well as the result of the inference graph.
So, we're going to implement the LoG filter using convolution nodes.
We're going to compute the LoG loss using the results of the convolutions, and finally with both loss values computed, we can add them together to produce the final loss.
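To make the two-part loss concrete, here is a minimal CPU sketch in Python. It is not the MPS loss kernel: the 5x5 kernel is one common discrete approximation of the LoG filter, mean squared error stands in for whatever loss function the network actually uses, and the images are single-channel lists of rows to keep it short. All function names are hypothetical.

```python
# A common 5x5 discrete approximation of the Laplacian of Gaussian.
# Its entries sum to zero, so flat regions produce no response.
LOG_KERNEL = [
    [0, 0, -1, 0, 0],
    [0, -1, -2, -1, 0],
    [-1, -2, 16, -2, -1],
    [0, -1, -2, -1, 0],
    [0, 0, -1, 0, 0],
]


def convolve_valid(image, kernel):
    """Slide the kernel over the image without padding.

    The kernel is symmetric, so correlation and convolution agree here.
    """
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    acc += image[y + j][x + i] * kernel[j][i]
            row.append(acc)
        out.append(row)
    return out


def mean_squared_error(a, b):
    """Assumed loss function; the talk does not say which one is used."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            total += (va - vb) ** 2
            count += 1
    return total / count


def final_loss(prediction, target):
    # First loss: compare the denoised guess with the clean image directly.
    rgb_loss = mean_squared_error(prediction, target)
    # Second loss: compare the edge responses of both images.
    log_loss = mean_squared_error(
        convolve_valid(prediction, LOG_KERNEL),
        convolve_valid(target, LOG_KERNEL),
    )
    # The two losses are simply added together.
    return rgb_loss + log_loss
```

A prediction that gets the overall brightness right but blurs away an edge is penalized by the second term, which is the point of adding it.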
Now with the final loss value, we can begin the back-propagation phase and look at the training graph.
So, we're going to do this as before by computing the initial gradient.
With the initial gradient value, we can begin the training graph.
So, this involves several gradient nodes: first for the addition, followed by gradient nodes for each forward loss, and then for the encoder and decoder stages.
Now, implementing graph nodes for each of these layers would take a substantial amount of code and introduce plenty of opportunity for errors.
However, with implicit graph creation, we can have the graph do all of this work for us.
So, here's all we need to write to generate the training graph.
First, we add the initial gradient node using the result of the final loss.
Then using implicit graph creation, we generate all of the remaining gradient nodes.
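To give a feel for what the graph is doing on our behalf, here is a toy Python sketch of the idea: walk the forward graph backward from the final loss and emit one gradient node per forward node. This is only an illustration of the concept under simplifying assumptions, not how MPS implements it, and the `Node` class, the function, and the node names are hypothetical.

```python
class Node:
    """A forward graph node that knows which nodes feed into it."""

    def __init__(self, name, sources=()):
        self.name = name
        self.sources = list(sources)


def gradient_node_names(final_node):
    """Return gradient node names in the order they would be generated."""
    order = []
    seen = set()

    def visit(node):
        # Post-order depth-first walk: sources are listed before consumers.
        if id(node) in seen:
            return
        seen.add(id(node))
        for source in node.sources:
            visit(source)
        order.append(node)

    visit(final_node)
    # Reverse the forward order so each gradient node comes after the
    # gradient nodes of everything that consumed its forward result.
    # Nodes without sources are graph inputs and get no gradient node.
    return [f"{node.name}Gradient" for node in reversed(order) if node.sources]
```

For a forward graph of encoder, decoder, two losses, and an addition, this produces the addition gradient first, then the loss gradients, then the decoder and encoder gradients, matching the order described above.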
So now that we have our graph created, we can begin training it.
So first, let's discuss our input training data.
Now, the inputs are images for which we know the desired result.
In this case we have noisy images and we have the corresponding clean images.
Now both images were generated using a ray tracer built on top of MPS.
We generated the noisy images by only letting the ray tracer run for a short period of time.
And the clean images we obtained by running the ray tracer for an extended period of time.
Now, by training with these images, we hope our network will learn to approximate the clean ones from the noisy ones.
And further, we're going to augment our input data with a few other images, also produced by the ray tracer: surface normals and albedo.
The albedo image is a three-channel image containing values for the amount of reflected light.
The surface normals are a three-channel image where each channel contains a component of the surface normal vector.
Now, before we can begin training, we need to do a little bit of preprocessing.
So, as I mentioned, these all contain their data in three channels.
However, MPS networks and MPS CNN kernels use their images as four-channel textures.
So, we're going to have to concatenate these values together.
Now, because each image is three channels, we need to concatenate these into a single Metal texture array, and we can't necessarily use the MPS CNN concatenation because it requires feature channels in a multiple of four.
However, we can write a simple kernel to do this for us.
So here's a simple Metal compute shader to concatenate these images together.
We're going to start using a grid of threads mapped to each four-channel pixel of the result.
Our arguments are going to be a result to hold the concatenated image, the RGB input, the albedo input, and our normal image.
So we're going to start by having each thread read a pixel from each input at its location in the grid.
We're going to concatenate those values together, and we're going to fill the remaining unused channels with 0.
Finally, we're going to write out the result at its same location in the grid.
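The per-pixel logic of that kernel can be sketched on the CPU in Python. This is not the Metal shader itself: the function names are hypothetical, and the channel order (RGB, then albedo, then normal) is an assumption based on the order the arguments were listed. Nine input channels get packed into three four-channel slices, with the last three channels set to zero.

```python
def concatenate_pixel(rgb, albedo, normal):
    """Pack three 3-channel pixels into a list of 4-channel slices."""
    channels = list(rgb) + list(albedo) + list(normal)  # 9 values
    # Round the channel count up to the next multiple of four (9 -> 12).
    padded_length = -(-len(channels) // 4) * 4
    # Fill the remaining unused channels with zero.
    channels += [0.0] * (padded_length - len(channels))
    # Split into groups of four, one group per texture slice.
    return [channels[i:i + 4] for i in range(0, padded_length, 4)]


def concatenate_images(rgb, albedo, normal):
    """Apply concatenate_pixel at every (x, y), like one thread per pixel."""
    height, width = len(rgb), len(rgb[0])
    result = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Read from each input and write the result at the same location.
            result[y][x] = concatenate_pixel(rgb[y][x], albedo[y][x], normal[y][x])
    return result
```

Each result pixel holds three slices of four values, which is the layout a texture array with four-channel slices would need.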
So now that we have a shader which can concatenate these values together into a single MPS image, let's look at how we hand it to the graph.
Or rather, let's look at how we encode it first.
So here's an example of how we encode our kernel and wrap the result in an MPS image.
So our inputs are images containing the data, and we're going to want to use the result as an input to the graph.
So, we need to construct an MPS image.
We're going to use its texture to hold the result of our concatenation kernel.
Next, we're going to bind each argument at its appropriate location.
We'll dispatch our threads and then finally return the image ready to be passed into our network.
So, now that our inputs are prepared, let's look at executing the training graph.
Now during training, we'll be executing the graph for many iterations.
We're going to be executing multiple batches within each training set, and then we're going to be executing multiple batches over each epoch.
So, here we're going to run one iteration of the training graph.
We're going to concatenate our images together using the kernel we just showed, except we're going to do it for each image in the batch.
We're going to put these together into an array because the graph requires an array of images, one for the source images and one for our labels.
Now we're going to use MPSCommandBuffers here, because as we saw earlier, it's an easy way of getting improved GPU utilization.
So finally, we're going to encode the graph and then commit it for execution.
So, now let's look closer at each training epoch.
Now in this scheme, we're going to process the full training data set, each epoch, to allow for better convergence.
We're also going to update the training set once every fixed number of epochs, in this case every 100, and at that point, we're also going to perform our network validation.
Finally, at every thousandth epoch, we're going to decrease the learning rate of our optimizer.
This will also help improve convergence.
So let's look at the code for this.
So, we're going to begin by processing the entire training set once each epoch.
Here we see that every hundredth epoch, we're going to update our training data set and run the validation.
And finally, every thousandth epoch, we'll decay our learning rate by a factor of 2.
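That schedule can be sketched as a plain Python loop. The function and callback names here are hypothetical placeholders for the real training and validation work, and the actual code in the session is written against MPS rather than like this.

```python
def run_training_schedule(epoch_count, initial_learning_rate,
                          train_one_epoch, refresh_and_validate):
    """Run the epoch schedule and return the final learning rate."""
    learning_rate = initial_learning_rate
    for epoch in range(1, epoch_count + 1):
        # Process the entire training set once each epoch.
        train_one_epoch(learning_rate)

        # Every hundredth epoch: update the training set and run validation.
        if epoch % 100 == 0:
            refresh_and_validate(epoch)

        # Every thousandth epoch: decay the learning rate by a factor of 2.
        if epoch % 1000 == 0:
            learning_rate /= 2

    return learning_rate
```

Over 2,000 epochs, for example, this trains 2,000 times, refreshes and validates 20 times, and halves the learning rate twice.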
So, now that we've trained the graph, we can begin denoising new images.
Now, because MPS is available and optimized across multiple platforms, we can easily deploy the trained network on a different device.
For example, you may want to execute the computationally expensive task of training on a Mac and then use the trained network to filter images on an iPad.
So, first, let's take a look at serialization support in MPS.
Now all MPS kernels as well as the graph support NSSecureCoding.
This allows you to easily save and restore your networks to and from disk.
And for networks which load their weights from a data source, you're going to have to implement secure coding support on your data source yourself.
Now this requires the supportsSecureCoding property and the init(coder:) and encode(with:) methods.
Now, once your data source conforms to secure coding, it's easy to serialize and save the graph.
So, first we're going to create a coder in which to encode the graph.
Then we're going to call encode(with:) on the graph.
Now, when this happens, it's going to serialize each of the individual kernels, and if those kernels have data sources, it will serialize those as well.
That way, the resulting archive contains all of the information necessary to restore and initialize the graph.
Finally, we can save the data to a file.
Now let's look at loading it.
So, in order to ensure that the unarchived kernels initialize on the proper Metal device, we provide you with the MPSKeyedUnarchiver.
It's like a regular unarchiver except you initialize it with a Metal device, and then it will provide this device to all the kernels as they're initialized.
So, after we load our data, we'll create an unarchiver with the device.
We'll restore the graph on the new device, and with the trained network now initialized, the graph is ready to be used to denoise new images.
So, let's take a look at this network in action.
So, here we applied our denoiser to a scene.
The top region shows how the scene looks in our input noisy image.
The center region shows the result of our denoiser, and the bottom region shows the ground truth clean image.
As you can see, the denoised region looks nearly as good as the clean target, except we're achieving this with significantly less work because we're not running the full ray tracer.
So as you saw, using MPS, we can easily implement complex networks like denoising and style transfer.
This year we've expanded support for inference and training to a new class of networks with features like separable loss and random number generation.
And with MPSCommandBuffer, we now support improved performance and better utilization through things like predication and commitAndContinue, and we made all of these features easier to use through implicit graph creation.
So, for more information about MPS and Metal, please see the online documentation and our sample code, and for more information about MPS and ray tracing, please see the Metal for Ray Tracing session earlier.
[ Applause ]