Metal Performance Optimization Techniques

Session 610 WWDC 2015

Learn best practices to maximize the efficiency of your Metal based apps and attain high frame rates. Gain insight into powerful tools for analyzing and optimizing performance for both the CPU and GPU. Discover how to identify bottlenecks, tune performance hot-spots, and overcome any hurdles that could keep your app from reaching its potential.

[ Applause ]

PHILIP BENNETT: Good morning, and welcome to Metal Performance Optimization Techniques.

I'm Phil Bennett of the GPU Software Performance Group, and I will be joined shortly by our special guest Serhat Tekin from the GPU Software Developer Technologies Group and he will be giving a demo of a great new tool you can use to profile your Metal apps.

I'm sure you're going to love it.

So, Metal at the WWDC, the story so far.

In What's New in Metal Part 1, we covered great new features that have been added to Metal as of iOS 9 and OS X El Capitan.

In What's New in Metal Part 2, we introduced two new frameworks, MetalKit and Metal Performance Shaders.

These make developing Metal apps even easier.

In this, our final session, we will be reviewing what tools are available for debugging and profiling your Metal apps, and we're going to explore some best practices for getting optimal performance from your Metal apps.

So let's take a look at the tools.

Now, if you have been doing any Metal app development in iOS, you are likely to be familiar with Xcode and its suite of Metal tools.

Now, we are going to take a quick look at the frame debugger.

So what we have here is a capture of a single frame from a Metal app, and on the left, we have the frame navigator which shows all of the states and Draw calls present in the frame.

These are grouped by render encoder, command buffer, and if you have been using debug labels, they will be grouped by debug groups also.

Next we have the render attachment viewer, which shows all of the color attachments associated with the current render pass in addition to any depth and stencil attachments, and it shows this wire frame highlight of the current Draw call, which makes navigating your frame very convenient.

Next we have the resource inspector where you can inspect all of the resources used by your app, from buffers to textures and render attachments.

You can view all the different formats, and you can view individual mipmap levels, cube maps, 2D arrays; it's fully featured.

And then we have the state inspector, which allows you to inspect properties of all of the Metal objects in your app.

Moving on, we have the GPU report, which gives you a frames per second measurement of the current frame and gives you timings for CPU and GPU.

In addition, it also shows the most expensive render and compute encoders in your frame so you can help narrow down which shaders and which Draw calls are the most expensive.

And finally, we have the shader profiler and editor.

And this is a wonderful tool for both debugging and profiling your shaders as it allows you to tweak your shaders and recompile them on the fly, thus saving you having to recompile your app.

It's really useful.

And as you probably are aware now, all of these great tools are now available for debugging your Metal apps on OS X El Capitan.

So Instruments is a great companion to Xcode as it allows you to profile your app's performance across the entire system, and now we are enabling you to profile Metal performance in a similar manner with this, the Metal System Trace instrument.

It's a brand-new tool for iOS 9.

It allows you to profile your Metal apps across your CPU and GPU.

Let's take a look here.

We can start by profiling Metal API usage in the application, down to the driver, right onto the GPU where we can see the individual processing phases, vertex, fragment, and optionally compute, and then onto the actual display hardware.

Now, here to give us a demonstration of this great new tool, please welcome Serhat Tekin to the stage.

[ Applause ]

SERHAT TEKIN: Thank you, Philip, and hello, everyone.

I have something really cool to show you today, and it's brand new, it's our latest addition to our Metal development tools, Metal System Trace.

Metal System Trace is a performance analysis and tracing tool for your Metal iOS apps and is available as part of Instruments.

It lets you get a system-wide overview of your application over time also giving you an in-depth look at the graphics down to the microsecond level.

It's important that I should stress this.

This is available for the first time ever on our platform.

This is all thanks to Xcode 7 and iOS 9.

So without further ado, let's go ahead and give it a shot.

So I'm going to launch Instruments, and we are at the template chooser.

You can notice that we have a new template icon here, Metal icon for Metal System Trace.

I will go ahead and choose that.

Those of you familiar with Instruments will realize I just created a new document with four instruments in it, as you can see on the left-hand side of the timeline here.

I will give you a quick tour of these instruments and the data that they present on the timeline.

So let's go ahead and select my Metal app on the iPad as my target app and start recording.

All right.

Now, Metal System Trace is set to record in what Instruments calls Windowed Mode.

It's essentially capturing the trace into a ring buffer.

This lets you record indefinitely.

And the important point here is that when you see a problem that you want to investigate, you can stop recording.

At that point, Instruments will gather all of the trace data collected, process it for a while, and you will end up with a timeline that looks like this.

So there is quite a lot of stuff going on here, so I will zoom in to get a better look.

I can do that by holding down the Option key and selecting an area of interest in the timeline that I want to zoom in.

I can navigate the timeline using the trackpad gestures: two-finger swipe to scroll and pinch to zoom.

And you can see that I get more detail on the timeline as I zoom further in.

So what are we looking at here?

Essentially what we have here is an in-depth look of your Metal application's graphics workload over time across all of the layers of the graphics stack.

The different colors that we go through in the timeline represent different workloads for individual frames.

And the tracks themselves are fairly intuitive.

Each box you see here represents an item's trace relative start time, end time, and how long it took.

Starting from the top and working our way down, we have your application's usage of the Metal framework.

Next, we have the graphics driver processing your command buffers, and if you have any shader compilation activity midframe, it also shows up in the track.

This is followed by the GPU hardware track, which shows your Render and Compute commands executing on the GPU.

And finally we have the display surfaces track.

Essentially, this is your frame getting displayed on the device.

All right.

So another thing you can see here is these labels.

Now, note that these two labels here, shadow buffer and G-buffer and lighting, are labels I assigned myself to my encoders in my Metal code using the encoder's Label property.

These labels propagate their way down the pipeline along with the workload they are associated with, which makes it very easy to track your scene's rendering passes here in Metal System Trace.

I highly recommend taking advantage of this.

And if anything is too small to fit its label, you can always go hover over the ruler and see a tool tip that displays both the label and the duration at the same time.

The order of the tracks here basically maps to the same order in which your Metal commands work their way down the graphics pipeline.

So let us go ahead and follow this command buffer down the pipe.

So at the top track I can see my application's use of Metal command buffers and encoders. Specifically, what I see here is the creation time and submission time for both my command buffers and my render and compute encoders.

At the top I have my command buffer, and at the bottom I have my relevant encoders created by this command buffer directly nested underneath.

Now, note this arrow here at the submission time of the command buffer going to the next track.

Dependencies between different levels of the pipeline are represented by these arrows in Metal System Trace.

So, for instance, when this command buffer is submitted, its next stop is going to be the graphics display driver, if I can zoom in there and get a better look.

Look at how much time we are taking here.

It's really, really fast, and they are still on the CPU side barely consuming anything.

Similarly, I can go and follow the arrows once the encoders are done processing.

The encoders are going to get submitted to the GPU track.

Following the arrows the same way, I can see my encoders getting processed on my GPU.

This GPU track is separated into three different lanes, one for vertex processing, one for fragment, and one for compute.

So, for instance, here I can see my shadow buffer rendering code for my shadow buffer pass going through its vertex processing phase and moving on to the fragment phase, which happens to overlap with my G-buffer and lighting phase as well.

Something that is desirable.

A quick note here is that the vertex, fragment, and compute processing costs include more than just the shader processing time.

For instance, we are running on iOS, and it's a tile-based deferred architecture, so the vertex processing cost is going to include the tiling cost as well.

It's something to keep in mind.

Finally, once my frame is done rendering, the surface is going to end up on the display, which is shown in the track at the bottom.

Essentially, it's showing me what time my frame was swapped onto the display and how long it stayed there.

Underneath that, we have the VSync track, which shows us the VSync intervals separated by these spikes that correspond to individual VSync events.

Finally, at the bottom, we have our detail view.

The detail view is similar to what you would see in other instruments.

It offers contextual detail based on the instrument you selected.

For instance right now, I have the Metal application instrument selected, so I can go ahead and expand this to see all of my frames and all of the command buffers and encoders along with the hierarchy involved.

This is useful if you want to see, say, precise timings.

If I go to the encoder list, I can see precise creation and submission timings, or what process something originated from.

It's very useful.

Cool! So this timeline look at the graphics pipeline is an incredibly powerful tool.

It's available for the first time with iOS 9 and Metal.

So how do you use this to help you solve your problems?

Or how does a problem app look?

Let me go ahead and open a different trace to show you that.

In a couple of minutes, Philip will go into a lot more detail than I will about Metal performance and how you can use this tool for that purpose.

But I'm going to give you a quick overview of the tool's workflow and a quick couple of tips.

First and foremost, you need to be concerned about your CPU and GPU parallelism.

You can see that this trace that I opened, labeled Problem Run appropriately, is already sparser than the last trace we took.

This is because we have a number of sync points where the CPU is actually waiting on the GPU.

You need to make sure you eliminate these.

Also, another useful thing to look for is the pattern that you see on the timeline.

These frames are all part of the same scene, so they are going to have really high temporal locality.

Any divergence you see might point at a problem you should investigate.

Another important thing is the display surfaces track.

So ideally, if your frame rate target is 60 frames per second, these surfaces should be staying on display for a single VSync interval.

So we should be seeing surfaces getting swapped at every VSync interval.

This particular frame, for instance, stayed on for three, so we are running at 20 fps.
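The arithmetic behind that 20 fps figure, and behind the 16 millisecond frame budget that comes up later in the session, can be sketched as follows. This is only an illustrative Python snippet, not part of Instruments or Metal; the function names are hypothetical and a 60 Hz display is assumed.

```python
def effective_fps(refresh_hz: float, vsync_intervals_on_screen: int) -> float:
    """Frame rate when each surface stays on screen for N VSync intervals."""
    if vsync_intervals_on_screen < 1:
        raise ValueError("a surface is displayed for at least one VSync interval")
    return refresh_hz / vsync_intervals_on_screen


def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds, to hit a target frame rate."""
    return 1000.0 / target_fps


print(effective_fps(60, 1))            # one interval per surface: 60.0
print(effective_fps(60, 3))            # three intervals per surface: 20.0
print(round(frame_budget_ms(60), 2))   # 16.67
```

So a surface that stays on screen for three intervals of a 60 Hz display means the app is presenting only a third of the frames it was aiming for.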

Another thing that is pretty useful is that the shader compilation track directly shows you if the shader compiler is kicking in at any time during your trace.

One thing that you want to particularly avoid is submitting work to the shader compiler midframe because it's going to waste CPU cycles you can use on other things.

Phil will explain this in a couple more minutes in detail.

Finally, you should aim to profile early and often.

A workflow like this will help you figure out problems as they occur and make it easier to fix them.

And Xcode helps you with that by offering a profile launch option for your build products.

It's going to automatically build a release version of your app, install it on the device, and start an Instruments run with a template of your choice.

All right.

So that was our first look at Metal System Trace.

It's available for all of your Metal-capable iOS devices out there.

Please give it a try.

We are looking forward to your feedback and suggestions.

Now, I will leave the stage back to Phil, who will demonstrate a couple of key Metal performance issues and how you can use our tools to identify these.

Thank you.

[ Applause ]

PHILIP BENNETT: Thank you, Serhat, that was very informative.

Now, we are going to cover the aforementioned Metal performance best practices, and we are going to use the tools to see how we can diagnose and hopefully follow these best practices.

So let me introduce our sample app, or rather a system trace of our sample app, and immediately we can see that there are several performance issues.

To begin with, there is no parallelism between the CPU and the GPU.

These are incredibly powerful devices, and the only way you are going to obtain the maximum performance is by having them run independently, whereas here each seems to be waiting on the other.

So we can see there is a massive stall between processing frames on the CPU.

There is a whopping 22 milliseconds.

We shouldn't have any stalls.

What's going on there?

And if we look at the actual active period of the CPU, it exceeds our frame deadline.

We were hoping for 60 frames per second.

So we had to get everything done within 16 milliseconds.

And we have blown past that.

And things don't look much better on the GPU side, either.

There is a lengthy stall in proportion to what is on the CPU because the CPU has been spending all its time doing nothing of note and hasn't been able to queue up work for the next frame.

Furthermore, the active GPU period overshoots the frame deadline, and we are shooting for 60 frames per second, but it looks like we are only getting 20.

So what can we do about this?

Well, let's go back to basics.

Let's first examine one of the key principles of Metal design and performance.

And that's creating your expensive objects and state up front.

Now, in a legacy app, typically what would happen would be during content loading, the app would compile all of its shaders from source, and that could be dozens or even hundreds of them, and this is a rather time-consuming operation.

Now, this is only half of the shader compilation story because the shaders themselves need to be compiled into a GPU pipeline state in combination with the various state used.

So what some apps might attempt to do is to do something known as prewarming.

Now, normally the device compilation would occur when the shaders and states were first used in a Draw call.

That's bad news.

Imagine you have a racing game and suddenly you turn a corner and it draws in a lot of new objects and the frame rate drops.

That's really bad.

So what prewarming does is you issue a load of dummy Draw calls with various combinations of graphics states and shaders in the hope that the driver will compile the relevant GPU pipeline state.

So when the time comes to actually draw using this combination state and shaders, everything is ready to go and you don't get a frame rate drop.

Now, in the actual rendering loop, there would typically be your setting of states, and if you actually get around to any, maybe you will do some Draw calls as well.

So the Metal approach is to move the expensive stuff ahead of time.

Shaders can be compiled from source offline.

That's already saving a chunk of work.

We move state definition ahead of time.

You define your state.

The GPU pipeline state is compiled into these state objects.

So when you come to actually do the Draw calls, there is none of that device compilation nonsense, so there is no need for shader warming anymore.

It's a thing of the past.

That leaves the rendering loop free for Draw calls.

Loads of Draw calls.

So fundamentally, Metal facilitates upfront state definition by decoupling expensive state validation and compilation from the Draw commands, thus allowing you to pull this out of the rendering loop and keep the rendering loop for actual Draw calls.

Now, the expensive-to-create state is encapsulated in these immutable state objects, and the intention is that you will create these once and reuse them many times.
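The create-once, reuse-many-times pattern described here can be sketched in a few lines. This is a toy Python model rather than Metal code: `PipelineDescriptor`, `PipelineCache`, and the shader names are hypothetical stand-ins, and the `_compile` method is only a placeholder for the expensive validation and compilation step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineDescriptor:
    """Hypothetical stand-in for the inputs that define a pipeline state."""
    vertex_function: str
    fragment_function: str
    color_format: str


class PipelineCache:
    """Builds each distinct pipeline state once and returns the same object afterwards."""

    def __init__(self):
        self._states = {}
        self.compile_count = 0  # how many expensive builds actually happened

    def _compile(self, descriptor):
        # Placeholder for the expensive validation and compilation step.
        self.compile_count += 1
        return ("compiled", descriptor)

    def get(self, descriptor):
        if descriptor not in self._states:
            self._states[descriptor] = self._compile(descriptor)
        return self._states[descriptor]


cache = PipelineCache()
gbuffer = PipelineDescriptor("gbuffer_vertex", "gbuffer_fragment", "rgba8")

# Content loading: pay the compilation cost once.
state = cache.get(gbuffer)

# Rendering loop: every frame reuses the object created at load time.
for _ in range(1000):
    assert cache.get(gbuffer) is state

print(cache.compile_count)  # 1
```

The point of the sketch is the shape of the code: all of the expensive work sits in the loading path, and the per-frame path only looks up objects that already exist.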

Now, getting back to our sample app, here we see there is some shader compilation going on midframe, and we are wasting about a millisecond here.

That's no good at all.

And if we look at the Xcode's frame debugger, look at all of this happening in a single frame.

Look at all of these objects.

We don't want any of this.

All that you should be seeing is this, the creation of the command buffer for the frame and the acquisition of the drawable and its texture.

All of the rest is completely superfluous.

So let's cover these expensive objects and when you should create them.

And we are going to begin with shader libraries.

These are your library of compiled shaders.

Now, what you really want to do is compile all of them offline.

You can use Xcode: any Metal source files in your project will automatically be compiled into the default library.

Now, your app may have its own custom content pipeline, and you might not necessarily want to use this approach.

So for that, we provide command-line tools, which you can integrate into your pipeline.

If you absolutely cannot avoid compiling your shaders from source in runtime, the best you can do is create them asynchronously.

So you create the library, and in the meantime, your app, or rather, the calling thread, can get on with doing something else, and once the shader library has been created, your app will be asynchronously notified.

Now, one of the first objects you will be creating in your app will be the device and command queue.

And these represent the GPU you will be using and its queue of ordered command buffers.

Now, as we said, you want to create these during app initialization and because they are expensive to create, you want to reuse them throughout the lifetime of your app.

And, of course, you want to create one per GPU used.

Now, next is the interesting stuff, the render and compute pipeline state, which encapsulates all of the programmable GPU pipeline states, so it takes all the descriptors, your vertex format descriptors, render buffer formats, and compiles it down to the actual raw pipeline state.

Now, as this is an expensive operation, you should be creating these pipeline objects when you load your content, and you should aim to reuse them as often as you can.

Now, as with the libraries, you can also create these asynchronously using these methods.

So once created, your app will be notified by a completion handler.

One point to mention is that unless you actually need it, you shouldn't obtain the reflection data as this is an expensive operation.

So next we have the depth stencil and sampler states.

These are the fixed-function GPU pipeline states, and you should be creating these when you load your content along with the other pipeline states.

Now, you may end up with many, many depth stencil and sampler states, but you needn't worry about this because some Metal implementations will internally hash the states and avoid creating loads of duplicates.

Now, next we have the actual data consumed by the GPU.

You have got your textures and your buffers.

And you should, once again, be creating these when you load your content, and reuse them as often as possible, because there is an overhead associated with both allocating and deallocating these resources.

And even dynamic resources, you might not be able to fully initialize them ahead of time, but you should at least create the underlying storage.

And we are going to be covering more on that very soon.

So to briefly recap.

So the most expensive states obviously should be created ahead of time, so these are the shader libraries that you aim to build offline.

The device and the command queue, which are created when you initialize your app, the render and compute pipeline states, created when you load your content, as are the fixed function pipeline state, the depth stencil and sampler states, and then finally the textures and buffers that are used by your app.

So we went ahead and we applied this best practice to our example app, which you may remember looked like this.

We had some shader compilation occurring midframe every frame, and now we have got none.

So already we have saved about a millisecond of CPU time.

This is a good start, but we will see if we can do better soon.

So in summary, create your expensive state and objects up front and aim to reuse them.

Especially, compile your shader source offline, and keep the rendering loop for what it's intended for.

It's for Draw calls.

Get rid of all of the object creation.

Now, what about the resources you can't entirely create up front?

We are talking about these dynamic resources, so what do we do about them?

How can we efficiently create and manage them?

Now, by dynamic resources, we are talking about resources which, once created, may be modified many, many times by the CPU.

And a good example of this is buffers of shader constants, and also any dynamic vertex and index buffers you might have for things like particles generated on the CPU, in addition to dynamic textures; perhaps your app has some textures which it modifies on the CPU between frames.

So ideally given the choice, you would put these resources somewhere which is efficient for both the CPU and the GPU to access.

And you do this with the shared storage mode option when you create your resource.

And this creates resources in memory shared by both the CPU and the GPU.

Now, this is actually the default storage mode on iOS, iOS devices being unified memory architecture, so the same memory is shared between the CPU and GPU.

Now, the thing about these shared resources is the CPU has completely unsynchronized access to them.

It can modify the data as freely as it wants through a pointer.

And in fact, it's quite easy for the CPU to stomp all over the data which is in use by the GPU, which tends to be pretty catastrophic.

So we want to avoid that.

But how can we achieve this?

Well, the brute force approach would be to have a single buffer for the resource, where we have, say, a buffer of constants which are updated on the CPU and consumed later by the GPU.

Now, if the CPU wants to modify any of the data in the constants buffer, it has to wait until the GPU is finished with it.

And the only way it can know that is if it waits for the command buffer in which the resource is referenced to finish processing on the GPU.

And for that, in this case we use waitUntilCompleted.

So we wait around, rather the CPU waits around, until the GPU is finished processing and then it can go ahead and modify the buffer, which is consumed by the GPU in the next frame.

Now, this is really bad because not only is the CPU stalled but the GPU is stalled as well because the CPU hasn't had time to queue up work for the next frame.

This is what is happening in the example app.

The CPU is waiting around for the GPU to finish on each frame.

You are introducing a massive stall period, and, yes, there is no parallelism between the CPU and the GPU.

So we need a better approach clearly, and you might be tempted to just create new buffers every frame as you need them.

But as we learned in the previous section, that's not a particularly good idea because there is an overhead associated with creating each buffer.

And if you have many buffers, large buffers, this will add up, so you really don't want to be doing this.

What you should do instead is employ a buffering scheme.

Here we have a triple buffering scheme, where we have three buffers, which are updated on the CPU and then consumed by the GPU.

Why three?

Typically we suggest that you limit the number of command buffers in flight to three, and effectively, you have one buffer per command buffer.

And by employing a semaphore to prevent the CPU from getting too far ahead of the GPU, we can ensure that it's safe to update the buffers on the CPU when the GPU wraps around, when it goes back to reading the first buffer.
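The scheme can be modeled without any GPU at all. The following is a minimal Python sketch under stated assumptions, not the Metal Uniform Streaming sample: a worker thread stands in for the GPU, a queue stands in for the command queue, plain lists stand in for the shared buffers, and releasing the semaphore plays the role of a command buffer completion notification. All names are hypothetical.

```python
import queue
import threading

MAX_FRAMES_IN_FLIGHT = 3

# One constants buffer per in-flight frame (plain lists stand in for shared memory).
buffers = [[0] for _ in range(MAX_FRAMES_IN_FLIGHT)]
in_flight = threading.Semaphore(MAX_FRAMES_IN_FLIGHT)
submitted = queue.Queue()   # stands in for the command queue (processed in order)
consumed = []               # what the "GPU" actually read, in order


def gpu_worker():
    while True:
        item = submitted.get()
        if item is None:            # sentinel: no more frames
            break
        frame, index = item
        consumed.append((frame, buffers[index][0]))
        in_flight.release()         # plays the role of a completion notification


gpu = threading.Thread(target=gpu_worker)
gpu.start()

for frame in range(10):
    in_flight.acquire()                       # block if 3 frames are already queued
    index = frame % MAX_FRAMES_IN_FLIGHT      # next buffer in the ring
    buffers[index][0] = frame * 100           # safe: the GPU is done with this slot
    submitted.put((frame, index))             # "commit" the frame

submitted.put(None)
gpu.join()
print(consumed[:4])  # [(0, 0), (1, 100), (2, 200), (3, 300)]
```

Because frames complete in submission order and at most three are outstanding, by the time the CPU wraps around to a slot the frame that last used it has already been consumed, so every frame reads exactly the value written for it.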

Rather than bore you with a lot of sample code, I will point you straight at a great example we already have.

That is the Metal Uniform Streaming example, which shows you exactly how to do this.

So I recommend you check it out afterward if you are interested.

Getting back to our example app, you may remember we had these very performance-crippling waits between each frame on the CPU.

Now, after employing a buffering scheme to update dynamic data, we managed to greatly reduce the gap between processing on both the CPU and the GPU.

We still have some sort of synchronization issue, but we are going to look into that very shortly.

So we are making good progress already.

And in summary, you want to buffer up your dynamic shared resources because it's the most efficient way of updating these between frames, and you enforce safety by limiting the number of command buffers in flight, as I mentioned.

Now, I'm going to talk about something, or rather the one thing you don't actually want to do up front, and that relates to when you acquire your app's drawable surface.

Now, the drawable surface is your app's window on the world, it's what your app renders its visible content into, which is either displayed directly on the display or it may be part of a composition pipeline.

Now, you retrieve the drawables from the Metal layer of Core Animation, but there is only a limited number of these drawables because they are actually quite big, and we don't want to keep loads of them around nor do we want to be allocating them whenever we need them.

So the pool of drawables is kept very limited, and drawables are relinquished at display intervals once they have been displayed by the hardware.

And each stage of the display pipeline may actually be holding onto a drawable at any point from your app, to a GPU, to Core Animation if you have any compositing, to the actual display hardware.

Now, your app grabs a drawable surface typically by calling the nextDrawable method.

If you are using MetalKit, this will be performed when you call currentRenderPassDescriptor.

Now, the method will only return once a drawable is available, and if there happens to be a drawable available at the time, it will return immediately.

Great, you can go on and continue with the frame.

However, if there are none available, your app, or rather the calling thread, will be blocked until at least the next display interval waiting for a drawable.

This can be a long time.

At 60 frames per second, we are talking 16 milliseconds.

So that's very bad news.

So is this what our example app was doing?

Is this the explanation for these huge gaps in execution?

Well, let's see what Xcode says.

So we go to the frame navigator, and we take a look at the frame navigator here.

And Xcode seems to have a problem with our shadow buffer encoder.

See a little warning there.

So if we take a closer look, we see that indeed we are actually calling the nextDrawable method earlier than we should.

Xcode offers some very sage advice that we should only call it when we actually need the drawable.

So how does this fit in with our example app?

Well, we have several passes here in our example app, and we were acquiring the drawable right at the start of each frame before the shadow pass.

This is far too early, because right up until the last pass, we are drawing everything off screen, and we don't need a drawable right up until we come to render the UI pass.

So the best place to acquire the next drawable is naturally right before the UI pass.

So we went ahead and we made the change, we moved our call to next drawable later, and let's see if that solved our problem.

Well, as you can already see, yes, it did!

We removed our second synchronization point, and now we don't have any stalls between frame processing on the CPU.

That's a massive improvement.

So the advice is very simple: only acquire the drawable when you actually need it.

This is before the render pass in which it's actually used.

This will ensure that you hide any long latency that would occur if there weren't any drawables available.

So your app can continue to do useful work, and by the time it actually needs a drawable, one is likely to be available.
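Why acquiring late hides the latency can be seen with a toy timeline. This is a Python sketch with made-up numbers, not a measurement of the sample app: it assumes a drawable becomes free at some point after the frame starts, and compares asking for it before any encoding with asking for it just before the final pass. The function and parameter names are hypothetical.

```python
def cpu_frame_time_ms(drawable_free_at_ms, offscreen_encode_ms, ui_encode_ms, acquire_early):
    """Return (total CPU time for the frame, time spent blocked waiting for a drawable)."""
    now = 0.0
    if acquire_early:
        blocked = max(0.0, drawable_free_at_ms - now)   # wait before doing any work
        now += blocked
        now += offscreen_encode_ms
    else:
        now += offscreen_encode_ms                       # useful work hides the latency
        blocked = max(0.0, drawable_free_at_ms - now)
        now += blocked
    now += ui_encode_ms
    return now, blocked


# Hypothetical numbers: the drawable frees up 10 ms into the frame,
# off-screen passes take 12 ms to encode, the UI pass takes 1 ms.
print(cpu_frame_time_ms(10.0, 12.0, 1.0, acquire_early=True))    # (23.0, 10.0)
print(cpu_frame_time_ms(10.0, 12.0, 1.0, acquire_early=False))   # (13.0, 0.0)
```

In the late case the wait and the off-screen encoding overlap, so the thread never blocks as long as the off-screen work takes at least as long as the drawable takes to become available.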

So at this point we are doing pretty well so far.

But there is still room for improvement.

So why don't we look at the efficiency of the GPU side? Rather than diving to a very low level, say, trying to optimize our shaders or change texture formats, why don't we see if there is any general advice we can apply?

As it so happens, there is.

That relates to how we use Render Command Encoders.

Now, a Render Command Encoder is what is used to generate Draw commands for a single rendering pass.

And a single rendering pass operates on a fixed set of color attachments, and depth and stencil attachments.

Once you begin the pass, you cannot change these attachments.

However, you can change the actions acting on them, such as the depth stencil state, color masking and blending, for instance.

And this is valuable to remember.

Now, the way in which we use our render encoders is particularly important on the iOS device GPUs due to the interesting way in which they are architected.

They are tile-based deferred renderers.

So each Render Command Encoder results in two GPU passes.

First you have the vertex phase, which transforms all of the geometry in your encoder, and then performs clipping, culling, and then bins all of the geometry into screen space tiles.

This is followed by the fragment phase, which processes all of the objects tile by tile to determine which objects are visible, and then only the visible pixels are actually processed.

And all of the fragment processing occurs in these fast on-chip tile buffers.

Now, typically at the end of a render you only need to store out the color buffer.

You would just discard the depth buffer.

And even sometimes you may have, say, multiple color attachments, but you only need to store one of them.

By not storing the tile data in each pass, you are saving quite a bit of bandwidth.

You are avoiding writing out entire frame buffers.

This is important for performance, as is not having to load in data for each tile.
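To get a feel for how much bandwidth is at stake, here is a back-of-the-envelope calculation in Python. The render target size and pixel sizes are assumed, illustrative values, not figures from the session, and the helper name is hypothetical.

```python
def attachment_bytes(width, height, bytes_per_pixel):
    """Size of one full-screen attachment in bytes."""
    return width * height * bytes_per_pixel


# Assumed, illustrative numbers: a 2048 x 1536 target, 4-byte color, 4-byte depth.
WIDTH, HEIGHT = 2048, 1536
color = attachment_bytes(WIDTH, HEIGHT, 4)
depth = attachment_bytes(WIDTH, HEIGHT, 4)

store_everything = color + depth     # write both attachments out to memory
store_color_only = color             # discard depth at the end of the pass
saved_per_frame = store_everything - store_color_only
saved_per_second_mib = saved_per_frame * 60 / (1024 * 1024)

print(saved_per_frame)        # 12582912 bytes (12 MiB) per frame
print(saved_per_second_mib)   # 720.0 MiB per second at 60 fps
```

Under these assumptions, discarding a single depth attachment in one pass avoids writing about 12 MiB per frame, and the same reasoning applies to every attachment that does not need to be loaded at the start of a pass.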

So what can Xcode tell us?

As I mentioned, each encoder corresponds to a vertex pass and a fragment pass.

And this applies even for empty encoders, and this is quite important.

Here we have actually two G-buffer encoders, and the first one doesn't seem to be drawing anything.

I guess that just slipped in there by mistake, but this actually has quite an impact on performance if we look at the system trace of the app.

Just that empty encoder consumed 2.8 milliseconds on the GPU, and presumably it was just writing a clear color out to however many attachments we had, three color and two depth and stencil.

And our total GPU processing time for this particular frame is 22 milliseconds.

Now, if we remove the empty encoder, which is done very easily because it shouldn't be there in the first place, we go down to 19, so that's a very nice win for doing very little at all.

So watch out for these empty encoders.

If you are not going to do any drawing in a pass, don't start encoding.
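One way to follow that advice is to defer creating the encoder until you know the pass has work in it. The following is only a minimal sketch: `encodePassIfNeeded`, `drawCount`, and the two closures are hypothetical names for illustration, not part of Metal or of the app shown in this session.

```swift
import Metal

/// Creates a render encoder only when the pass actually has draws.
/// Returns true if a pass was encoded, false if it was skipped.
/// `makeEncoder` and `encode` are hypothetical hooks supplied by the caller.
func encodePassIfNeeded(drawCount: Int,
                        makeEncoder: () -> MTLRenderCommandEncoder?,
                        encode: (MTLRenderCommandEncoder) -> Void) -> Bool {
    // No draws: never begin the pass, so the GPU runs no vertex or
    // fragment phase and no attachments are cleared, loaded, or stored.
    guard drawCount > 0 else { return false }

    // Only now ask the command buffer for an encoder.
    guard let encoder = makeEncoder() else { return false }

    // The caller binds pipeline state and resources and issues its draws.
    encode(encoder)
    encoder.endEncoding()
    return true
}
```

A caller would typically pass something like `{ commandBuffer.makeRenderCommandEncoder(descriptor: passDescriptor) }` as `makeEncoder`, so the encoder is never created for a pass with nothing to draw.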

So let's look a bit deeper now.

Let's have a look at the render passes in our example app and see what we have got.

So we have got a shadow pass, which renders into a depth buffer.

We have a G-buffer pass, which renders into three color attachments and a depth and stencil attachment, and then we have these three lighting passes, which use the render attachment data from the G-buffer pass, either sampling through the texture units or loading to the frame buffer content.

The lighting passes use this data to perform lighting, and they output to a single accumulation target which is used several times over.

And finally you have a user interface pass onto which user interface elements are drawn and presented to the screen.

So is this the most efficient setup of encoders?

Once again we summon Xcode's frame debugger to see if it has anything to say.

And once again, yes it does.

It has taken issue with our sunlight encoder.

So let's take a closer look.

We are inefficiently using our command encoders.

And Xcode is kind enough to tell us which ones we could actually combine.

So let's go ahead and merge a couple of passes.

Rather than merge just two, we can actually merge three, which all operate on the same color attachment.

So let's go ahead and do that.

So we have six passes here, and now we are going to merge them down to four.

So what impact did that have on performance, GPU side?

Let's go back to the GPU, the system trace.

Here we can see we have gone from 21 milliseconds, six passes, down to 18 by not having to load and store all of that attachment data.

So that's quite a nice win.

But could we go any further?

Let's return to our app.

So we have four passes, and is it actually possible to combine both the G-buffer and the lighting pass to avoid having to store out five attachments and keep everything on chip?

Well, it in fact is.

We can do that with clever use of programmable blending.

So I'm not going to go into too much detail there, but what we did was we combined these two encoders down to one.

So now we are left with three render encoders and we are having to load and store far, far less attachment data, and that's a massive win in terms of bandwidth.

So let's see what impact that had.

Actually not a lot.

That was very unexpected.

We have only chopped off about a millisecond.

That's not great.

I was hoping for more than that.

So once again, can Xcode save us?

We turn to Xcode's frame debugger.

And we take a closer look at the load and store bandwidth for the G-buffer encoder.

Now, it turns out that we are actually still loading and storing quite a lot of data, and the reason for that is quite simple.

It looks like here we have set the load and store actions for each attachment incorrectly.

We only wanted to be storing the first color attachment, and we want to discard the remaining color attachments in addition to the depth and stencil attachments, and we certainly don't want to be loading them in.

So if we make the very simple change, we change our load and store actions to something more appropriate, we have reduced our load bandwidth down to zero and we have massively reduced the amount of attachment data we're storing.
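As a rough illustration of that change, the load and store actions live on the render pass descriptor. This is a sketch under assumptions, not the demo app's code: `makeCombinedPassDescriptor` is a hypothetical helper, and it assumes color attachment 0 holds the final lit result while the other color attachments, depth, and stencil are only needed on chip during the pass.

```swift
import Metal

/// Builds a pass descriptor that stores only the first color attachment.
/// Assumption: attachment 0 is the final color; every other attachment is
/// intermediate data that never needs to reach memory.
func makeCombinedPassDescriptor(colorTextures: [MTLTexture?],
                                depthTexture: MTLTexture?,
                                stencilTexture: MTLTexture?) -> MTLRenderPassDescriptor {
    let pass = MTLRenderPassDescriptor()

    for (index, texture) in colorTextures.enumerated() {
        pass.colorAttachments[index].texture = texture
        // Clear rather than load: nothing is read in from memory per tile.
        pass.colorAttachments[index].loadAction = .clear
        // Write out only the attachment we actually need after the pass.
        pass.colorAttachments[index].storeAction = (index == 0) ? .store : .dontCare
    }

    // Depth and stencil are used during the pass, then discarded.
    pass.depthAttachment.texture = depthTexture
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.storeAction = .dontCare

    pass.stencilAttachment.texture = stencilTexture
    pass.stencilAttachment.loadAction = .clear
    pass.stencilAttachment.storeAction = .dontCare

    return pass
}
```

With no attachment using a load action of `.load`, nothing is loaded into the tile buffers, and only one attachment is written back out at the end of the pass.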

So now, what impact did that have?

So before, with our three passes, we were taking 17 milliseconds on the GPU.

Now, we are down to 14.

That's more like it.

So to summarize, don't waste your render encoders.

Try to do as much useful work as possible in them, and definitely do not start encoding if you are not going to draw anything.

And if you can, and with the help of Xcode, merge encoders which are rendering to the same attachments.

This will get you big wins.

Now, we are doing pretty well on the GPU side now.

In fact, we are actually within our frame budget.

But is there anything we can do on the CPU side?

If you remember, I think we were actually still slightly beyond our frame budget.

What about multithreading?

How could multithreading help us?

What does Metal allow us to do in terms of multithreading?

Fortunately for us, Metal was designed with multithreading in mind and has a very efficient threadsafe and scalable means of multithreading your rendering.

It allows you to encode multiple command buffers simultaneously on different threads, and your app has control over the order in which these are executed.

Let's take a look at a possible scenario where we might attempt some multithreading.

But before that, I would like to stress that before you even go ahead and try to multithread your rendering, you should actively pursue the best possible single-threaded performance.

So make sure there is nothing terribly inefficient in there before you start trying to multithread things.

Okay. So we have an example here where we have two render passes, and we are actually taking so long to encode these two passes on the CPU that we are actually missing our frame deadline.

So how can we improve this?

Well, we can go ahead and we can encode the two passes in parallel.

And not only have we managed to reduce the CPU time per frame, the side effect is that the first render pass can be submitted to the GPU quicker.

So how would this look in terms of Metal objects?

How does it come together?

Well, we start with our Metal device and the command queue as usual, and now for this example we are going to have three threads.

And for each thread, you need a command buffer.

Now, for two of the threads, each has a Render Command Encoder which is operating on a separate pass, and on our third thread we might have multiple encoders executing serially.

So it goes to show the approaches to multithreading can be quite flexible, and once they have all finished their encoding, the command buffers are submitted to the command queue.

So how would you set this up?

It's quite simple.

You create one command buffer per thread and you go ahead and initialize render passes as usual, and now the important point here is the order in which the command buffers will be submitted to the GPU.

Chances are this is important to you.

So you enforce it by calling the Enqueue method on the command buffers, and that reserves a place in the command queue so when the buffers are eventually committed, they will be executed in the order that they were enqueued.

This is an important point to remember.

Then we create the render encoders for each thread, and we go ahead and encode our draws on the separate threads and then commit the command buffers.

It's really very simple to do.
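The steps just described can be sketched roughly as follows. This is only an illustration under assumptions: `encodePassesInParallel` and the `encodePass` callback are hypothetical names, and Grand Central Dispatch is just one possible way to get the work onto separate threads.

```swift
import Metal
import Dispatch

/// Encodes each pass on its own thread, using one command buffer per pass.
/// `encodePass` is a hypothetical callback that creates the encoder(s) for
/// pass `index`, encodes the draws, and ends encoding.
func encodePassesInParallel(queue: MTLCommandQueue,
                            passCount: Int,
                            encodePass: (Int, MTLCommandBuffer) -> Void) -> [MTLCommandBuffer] {
    // Create and enqueue the command buffers on a single thread, in the
    // order the GPU should execute them. enqueue() reserves that place.
    var created: [MTLCommandBuffer] = []
    for _ in 0..<passCount {
        guard let commandBuffer = queue.makeCommandBuffer() else { break }
        commandBuffer.enqueue()
        created.append(commandBuffer)
    }
    let commandBuffers = created

    // Encode and commit on separate threads. Commits may happen in any
    // order; execution still follows the enqueue order above.
    DispatchQueue.concurrentPerform(iterations: commandBuffers.count) { index in
        encodePass(index, commandBuffers[index])
        commandBuffers[index].commit()
    }
    return commandBuffers
}
```

Because the place in the queue is reserved by `enqueue()`, a thread that finishes early can commit straight away without waiting for the others.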

Now, what about another scenario which could potentially benefit from multithreading?

So here again we have two passes, but one of them is significantly longer than the other.

Could we split that up somehow?

Yes, we can.

Here, we will break it up into two separate passes.

We have three threads here.

One is working on the first render pass, and we have two dedicated to working on chunks of the second.

And, again, here by employing multithreading we are within our frame deadline, and we have got a bit of time to spare on the CPU as well for doing whatever else we fancy doing.

It need not necessarily be more Metal work.

So how would we, or rather what would this look like?

So once again, we have the device and the command queue.

And for this example, we are going to be using three threads.

But here we only want one command buffer.

Next, we have the special form of the Render Command Encoder, the Parallel Render Command Encoder.

Now, this allows you to split work for a single encoder over multiple threads, and this is particularly important to use on iOS because it ensures that the threaded workloads are later combined into a single pass on the GPU.

So there is no loading and storing between passes.

This is very important that you use this if you are going to split up a single pass across multiple threads.

So from the Parallel Render Command Encoder, we create our three subordinate command encoders, and each will encode to the command buffer. Now, because we are multithreading, they may finish encoding at indeterminate times, not necessarily in any particular order.

Then the command buffer is submitted to the queue.

Now, it's entirely feasible that you could even have parallel Parallel Render Command Encoders.

The multithreading possibilities are not quite endless, but very flexible.

Or, like we saw earlier, you could have a fourth thread which is executing encoders serially.

So how do we set this up?

Well, we begin by creating one command buffer per Parallel Render Command Encoder.

So no matter how many threads you are using, you only want one command buffer.

We then proceed to initialize the render pass as usual, and then we create our actual parallel encoder.

Now, here is the important bit.

When we create our subordinate encoders, the order in which they are created determines the order in which they will be submitted to the GPU.

This is something to bear in mind when you split up your workload for encoding over multiple threads.

Then we go ahead and we encode our draws on separate threads, and then finish encoding for each subordinate encoder.

Now, the second important point is all of the subordinate encoders must have finished encoding before we end encoding on the parallel encoder.

And how you implement this is up to you.

Then finally, the command buffer is committed to the queue.
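Put together, that sequence might look something like the sketch below. It is an illustration under assumptions rather than the session's code: `encodePassAcrossThreads`, `chunkCount`, and `encodeChunk` are hypothetical names, and `DispatchQueue.concurrentPerform` is just one way to make sure every subordinate encoder has finished before the parallel encoder ends.

```swift
import Metal
import Dispatch

/// Splits a single render pass across several threads.
/// `encodeChunk` is a hypothetical callback that binds state and encodes
/// the draws for chunk `index` on the given subordinate encoder.
func encodePassAcrossThreads(commandBuffer: MTLCommandBuffer,
                             descriptor: MTLRenderPassDescriptor,
                             chunkCount: Int,
                             encodeChunk: (Int, MTLRenderCommandEncoder) -> Void) -> Bool {
    // One command buffer, one parallel encoder for the whole pass.
    guard let parallelEncoder =
            commandBuffer.makeParallelRenderCommandEncoder(descriptor: descriptor) else {
        return false
    }

    // Create the subordinate encoders on one thread: creation order is
    // the order in which their commands are submitted to the GPU.
    var created: [MTLRenderCommandEncoder] = []
    for _ in 0..<chunkCount {
        if let encoder = parallelEncoder.makeRenderCommandEncoder() {
            created.append(encoder)
        }
    }
    let subordinates = created

    // Encode on separate threads. concurrentPerform returns only once
    // every iteration has finished, so all subordinates have ended.
    DispatchQueue.concurrentPerform(iterations: subordinates.count) { index in
        encodeChunk(index, subordinates[index])
        subordinates[index].endEncoding()
    }

    // Safe to end the parallel encoder now; the caller commits the buffer.
    parallelEncoder.endEncoding()
    return true
}
```

The caller still commits the command buffer afterwards, and the chunks execute as a single pass on the GPU in the order the subordinate encoders were created.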

So we went ahead and we decided to multithread our app.

Look what turned up.

So previously, we had serial encoding of our passes.

This was taking 25 milliseconds of CPU time.

Now, we pursued an approach where we encode the shadow pass on one thread, and the G-buffer pass and UI pass on another, and now we are down to 15 milliseconds.

That's quite a nifty improvement, and we have got a bit of time left over on the CPU as well.

So as far as multithreading goes, if you find that you are still CPU bound, and you have done all of the investigation you can and determined that you haven't got anything silly going on in your app, and that you could actually benefit from multithreading, you can encode render passes simultaneously on multiple threads.

But should you decide to split up a single pass across multiple threads, you want to use the Parallel Render Command Encoder to do so.

Now, what did we learn in this session?

Well, we introduced the Metal System Trace tool, and it was great.

It offers new insight into your app's Metal performance.

And you want to use this in conjunction with Xcode to profile early and often.

And as we have seen, you should also try to follow the best practices set out, so you want to create the expensive state up front and reuse it as often as possible.

We want to buffer dynamic resources so we can efficiently modify them between frames without causing stalls.

We want to make sure we are acquiring our drawable at the correct point in time.

Usually at the last possible moment.

We want to make sure we are efficiently using our Render Command Encoders.

We don't have any empty encoders, and we have coalesced any encoders which are writing to the same attachment down to one.

And then if we find we are still CPU bound as we were in this case, we might consider the approaches Metal offers for multithreading our rendering.

So how did we do?

Well, now look at our app!

We don't have any runtime shader compilation.

Furthermore, our GPU workload is within the frame deadline.

It's great.

As is the CPU workload.

And there are no gaps between processing of frames on the CPU.

And we even got quite fancy and decided to do multithreading.

We have a lot of time left over there to do other things.

And we managed to meet our target, which in this case was 60 frames per second.

So well done us!

So now, the talk is over, and if you would like any more information on anything mentioned in this session, you can visit our developer portal, you can also sign up for the developer forums, and should you have any detailed questions or general inquiries, you can direct them to Allan Schaffer, who is our Graphics and Games Technologies Evangelist.

So thank you very much for attending this talk.

And we hope you found it interesting, and enjoy the rest of WWDC!

Thank you very much!

[ Applause ]
