Metal 2 Optimization and Debugging

Session 607 WWDC 2017

Developing Metal 2-based apps is even easier with the redesigned tools for debugging and profiling in Xcode. Dive into the enhanced Metal Frame Debugger and explore techniques for fine-tuning graphics and compute workloads. Learn about accessing detailed GPU performance counters, and check out new support in Metal System Trace for optimizing VR apps.

[ Background Conversations ]

[ Applause ]

Good afternoon, everyone.

And welcome to Metal 2 Optimization and Debugging.

As you know, we're talking a lot about Metal 2 this year, with some great new enhancements to the platform including GP driven rendering, machine learning acceleration, and a macOS VR, and external GPU support.

And not forgetting Advanced Optimization Tools.

So, this afternoon, we're going to talk about the current Metal tools, give you a recap of those, talk about some great new enhancements to the Metal frame debugger, and then finally cover some major enhancements in terms of GPU profiling.

But first, the frame debugger.

So hopefully you're all familiar with this tool.

It is our fully-featured frame debugger integrated into Xcode, that lets you capture your Metal 2 work, be it computer graphics, and then step through it into the debugger, to inspect state and resources, letting you debug and optimize.

One of our focuses this year has been on improving the performance to the frame debugger, and in particular, we paid particular regard to improving the speed with which captures happen.

And I'm happy to say that compared to Xcode 8, Xcode 9 now captures up to 10 times as fast, getting you from clicking the Capture button to into the debugger much, much more quickly.

As you'd expect, we have full support for all the new Metal 2 API, including Raster order groups, sampler arrays, viewport arrays, and the new pixel and vertex array formats.

One particular part of Metal 2 that we've paid a lot of attention to is the support for the new argument buffers.

With this, in the buffer viewer, you can now see all the argument, buffer arguments, displayed in line, and you can click through whether it be to attach your sampler, buffer, or another argument buffer itself.

We've also added support for VR captures this year, with automatic support for SteamVR.

And we've added support for you to view your submitted surfaces in stereo.

So when you get to the Submit call, you're sending your surfaces to the VR compositer, you will see left eye and right eye alongside each other, to quickly spot any discrepancies.

Another area of focus this year has been on improving the workflow for capturing more complex workloads.

So now if you're doing compute-only work in Metal, or perhaps you're using multiple Metal cues, it's much easier to capture exactly the work that you want.

We've added the lightweight capture API, with some new Metal capture scope objects, that you create at startup and then reuse every frame to surround the work that you want to capture.

You'll see a demo of this later, but it's great for being able to say, group all your regular rendering work in one scope, and some asynchronous work like regenerating your tessellation factor buffers, in another scope, and when you come to capture, you get exactly what you want.

We also have support for triggering captures programmatically from your app.

We use this a lot within our test apps ourselves, so that we can quickly do a gesture on the device, and trigger a frame capture, without having to switch our focus back to Xcode.

Another new feature this year is support for Xcode's Quick Looks support.

So now, you have lightweight Metal debugging in the CPU debugger.

So if you hit a breakpoint, and there is a Metal texture there, we'll pull back the data from the metal texture on the GPU and let you view it there and then.

Similarly, with buffers and samplers.

This is great for those cases where a full frame capture might be too invasive, for instance, if you're debugging your resource loading or some of the setup code.

It is also great in cases where you're debugging some compute workloads as well.

So last year we introduced support for rich filtering throughout the frame debugger, so that you could filter both on things like resource properties, but also within the frame navigator, you can filter based on contextual data, so what resources you're using at a given draw call will let that draw call show up.

Well we've taken this to the next level this year, with support for data mining throughout your capture tray.

So now, when you type, we'll give you context aware, or complete suggestions.

And we now allow compound terms.

So now if you search for a given encoder, and then you search for a texture, we'll only show you auto-complete suggestions for the textures that are actually used within that encoder.

One of our most requested features over the years has been support for pixel inspection.

And we've finally caught up with that.

So now you can do detailed inspection of individual pixels within your textures and your render targets.

And if you have multiple attachments to your render targets, we'll show you the pixel values for the same location in each attachment at the same time.

So it's really good if you're trying to debug what the color value is alongside that and stencil and such like.

It's also very valuable for debugging compute workloads if you're working with images there and you are, for instance, halfway through your CNN and you want to test watch to see what the exact values in the buffers are.

[ Applause ]

Another new feature we introduced last year was our vertex attribute viewer, where you can see all the vertex data as it goes into your vertex shader, you know, shown on a per vertex basis.

Well, this year, we've added support for viewing the outputs from your vertex shader as well, and we will display this inline with all the other input data, so in this case, you can see your position inputs, and your position outputs, at the same time.

Well, to show you all these great new features in action, I'd like to invite my colleague, Max, to the stage, who is going to give you a demo of all this new stuff.

[ Applause ]

Hello. Great to see you here today.

I hope you are doing fine and you are as hyped about Metal as we are.

Xcode GPU debugger helps you debugging your GPU and the Metal usage.

I am Max, and I am going to maximize your debugging experience, showing our new features.

[ Applause ]

Yeah, let me run my demo app.

It is rendering a beautiful scenery, reflects snowy mountains, grass that is waving in the wind, and to make it even look nicer, I added some particles that are glowing in the air.

But as you can see, the particles of the grass, there is some kind of a problem.

So let's figure this out.

As a first step, let's check if the texture is correct.

Let me set a breakpoint in the rendering loop, where this texture is being used.

Hovering over a variable gives you access to Xcode's data tips, and you can quick look into the texture data.

This data is fetched live from the GPU, and it helps you to verify the resources you are binding, and, of course, it works with all Metal resources.

The texture in this case looks correct.

So what else can we check?

Our next step is to capture a frame.

Using the little camera icon in the debug bar, let's you capture a frame, but using a long press gives you access to capture scopes and command cues.

A capture scope is one path through your rendering pipeline.

Like my environment map, I'm only updating every couple of frames.

In this case, however, we want to capture rendering, this is where the particles are being drawn.

So let's capture this.

And already done.

For those who are not familiar with our tool, I will give you a quick run through of all the views you are seeing here.

On the left side, we have the debug navigator.

It is in [inaudible] presentation of your frame, and to help you, we automatically group by command buffers and command encoders.

But also your debugging groups are shown here, giving you fine grain control over the grouping.

You can select the draw call or any other Metal call to inspect further details.

The editor in the center is showing the bound resources.

All the Metal objects you are using in the selected API call.

Again, you can see labeling your objects will greatly increase readability.

So I suggest to do that.

The editor on the right side is showing the attachments, the output of the last issued draw call.

So whenever you are, like, navigating through your frame, you instantly see where you are.

On the bottom, we have our variables view, where you can access all the states of each Metal object.

Back to our problem with the particles.

Here, we can make use of our new super powerful filtering.

So I know the particles are drawn somewhere in my forward rendering.

So let me filter for this.

Filtering for command encoder will only show API calls inside this command encoder.

Like this.

But it is still a lot.

So let me add an additional filter.

We know it is using our particle texture.

Filtering for texture will only show draw calls using this texture.

And boom, this combination of filters results in a single API call we want to inspect further.

So let's go here.

Let's take a look at the bound resources.

The vertex attributes combines the data that is going into your vertex function and leaving it.

Maybe we are doing something wrong with our geometry here, so let's open this by double clicking.

Let me also hide the attachments for a moment.

Last year, we started to show you a nice layout for all the buffers.

And this year we added something more.

In the header, you can see the direction where the data is flowing.

And if we take a look, this is the output data, the data that is leaving the vertex function.

This is the output position of every particle vertex, and as we can see here, there is no obvious error, like big numbers or something like this, so I assume this data is correct.

So what else can we check?

The debug navigator now gives you quick access to all the views related to this draw call.

Let's switch back to the attachments again.

We are using two render targets here.

Color, and depth.

Let's inspect some more pixel values.

Using the inspect pixels button in the lower right corner, we will present a new tool.

A loop. This loop displays the value like they are outputted by the fragment function.

And you can move the loop around all the render targets.

But you can also use your arrow keys for pixel precise control, even if you are not zoomed in.

Also, you notice, all the loops are being synchronized between all the render targets.

That helps you to relate values.

Let me find an interesting pixel now.

Using a long press will instantly move the cursor.

And here we can see something strange.

The depth value inside and outside a particle is different, and let's opt, our particles shouldn't write into the depths buffer, of course.

That will be an easy fix.

And I'm also sure our new GPU debugger will help you fixing your issues with the GPU.

I hope we see each other in the labs tomorrow morning, or at the [inaudible] later today.

Back to my colleague, Seth.

[ Applause ]

So now, onto GPU profiling.

As you know, performance is crucial to games and other graphical applications, and achieving a consistent, fast framework is always necessary.

But on the flip side, you want to get the most of the GPU for the best looking game as well, and at the same time, increase efficiency for a longer game experience.

Well, for all this, you need to use the GPU Profiling tools.

The first tool I want to talk about is Metal System Trace.

This is our tool for investigating timing issues, by which I mean investigating cases where the CPU and the GPU might not be running in parallel, because you have some synching operations by mistake, and you're forcing them to work in serial.

It's also great for investigating those cases where you're mostly achieving the framework you want, but occasionally you get a stutter, and you need to figure out, okay, what is going wrong in that particular frame?

It lets you trace your Metal workloads through the system, from CPU to GPU to display.

This year, we've added support for VR applications, with specific VR trace points for activities like when you query the head set for post-data, when you submit your surfaces to the VR compositor, when it does its work to do the compositing, and finally, when it hits the glass on the headset.

In effect, it lets you trace from motion to photon.

We've also added support this year for the new ProMotion displays, as you'll find in the new iPads, iPad Pros released early this week, and also support for external GPUs on macOS.

It is also worth noting there are some great improvements in the instruments, to make it much easier to view other instruments alongside Metal System Trace in a more integrated fashion.

Our next profiling tool is the GPU Shader Profiler.

The tool for probing shader performance.

It is integrated into the frame debugger, and lets you view shader time on a per draw call and per pipeline basis.

And if you're on iOS or tvOS, it also lets you view it on a per line basis.

Well, our first new tool this year is designed to work hand in hand with the GPU Shader Profiler.

We call that Metal Pipeline Statistics.

Metal Pipeline Statistics gives you a direct line to the GPU compiler to find out about the quality of the machine code the compiler is generating from your shader.

It gives you a rich set of statistics with things such as instruction count, instruction mix, by which I mean the relative ratio of operations such as ALU or memory or control flow, and on GPUs where it is relevant, it will also show you register usage and occupancy.

For GPUs such as that, these measures are crucial in understanding what is the limitations on how many shaders can be scheduled simultaneously, by which shader instances can be scheduled simultaneously.

But even better are the new compiler remarks.

With this, the GPU compiler will give you direct actual guidance on the performance of your shader, and things you can do to avoid performance hits, from things such as slow math usage, register spills, and stack usage.

It's like having a GPU compiler engineer built into every Xcode.

For each remark, it will explain what it means, what you can do to reduce it, and give you a link to where you need to go to fix it.

Well, to demo this new feature, I'd like to invite my colleague Jose to the stage, to give you a tour of Metal Pipeline Statistics.

[ Applause ]

Hello everyone.

My name is Jose Enrique [inaudible].

I am going to present you a new feature of our GPU friendly debugger that will have you produce good quality.

As you can see, we are replaying a capture of Metal [inaudible] demo for iOS.

The first thing I'm going to do, I'm going to change my debug navigator view from view frame by call to view frame by performance.

What this gives, what this view gives is all these pipelines you capture, sorted by time.

Remember, in Metal, a shader is always linked to a pipeline, therefore, this is a list of all initiator combinations that are available in [inaudible] capture.

I am going to go to a [inaudible] view of the most expensive pipeline, to see if we can improve the shaders.

As you can see, this view has three sections.

On top, we have remarks.

Remarks are a unique approach to the compiler shader quality.

As the report, final compiler co-generation issues.

Remember, the GPU will execute your shader millions of times per frame, therefore, the [inaudible] you can get, the better the performance it will have.

Remarks also are sorted by relevance, and if expanded, they offer you reason on why we're reporting it, and our recommendation on how to prevent it.

Below remarks, we have an overview for each shader, where you can see how the compiler final assemblies compose [inaudible] of instruction ratios.

And finally, we have a list of all the recalls using this pipeline.

This will become very handy whenever we are iterating over our shaders.

So let me showcase an example of workflow profiler statistics.

We go to our top remark, Register Spill, we can see the compiler is reporting a big spill, 1,040 bytes.

Spills will cause the GPU to access memory, which can stall your shade execution.

Knowing that a compiler is spilling and fixing it can drastically improve your shader performance, yet finding where and why the compiler is spilling tends to be a time consuming manual [inaudible].

But notice the second and fourth remarks.

Dynamic Stack Store, and Dynamic Stack Load.

If expanded, they offer a reason an expensive stack load is emitted to a dynamic offset in a local array.

As well as a recommendation, to reduce the stack access, eliminate dynamic access to local arrays.

This is basically saying that we have an array variable in our shader code that is storing the stack and where accessing it using some other index variable.

This is a very common pattern when providing for the CPU, but GPUs are different, they suffer when we rely on stack usage.

But note the languages below the recommendation.

It has an exact line number.

This means that we option click to it, we'll jump directly to the [inaudible] line where the compiler is actually loading data from the stack array.

We just found the compiler spill.

Also these align very well with our shared performing data, informing us of the high cost of this line, which now we know exactly why.

The shader is doing two passes.

The first pass, doing some light computation.

And a second pass, doing the light accumulation.

This speaks to the issue by working with the compiler but from a GPU perspective.

The first thing I'm going to do, I'm going to remove the stack array, I am going to remove it in relation, and then I'm going to compute the light accommodation directly into the first loop.

Then I am going to remove the second loop, and just not do that anymore.

Now, I click my update shader button, and wait for the results.

What this is going to do, it's going to have the compiler perform a whole loop optimization and reuse the same reducer over and over, instead of relying on the stack.

Once the results are back, we can see the instruction ratio between the previous and current [inaudible] has been reduced, as well as the impact of this change on every single draw call used in the pipeline, giving us also whole space performance improvements.

And with this, I conclude my example of [inaudible] statistics.

I hand it back to my colleague, Seth.

[ Applause ]

Thank you, Jose.

And finally, onto our last new tool today.

GPU Counter Profiling.

As you know, GPU architecture is complex, with a pipeline consisting of multiple programmable and fixed function blocks, bottlenecks can occur at any point within the pipeline, and often, at multiple simultaneous points.

Your job as Metal programmers is to minimize fixed function bottlenecks while efficiently harnessing programmable blocks.

Well, to do that, our new GPU Counter Profiling is the tool.

Instead of going directly into the GPU Frame Debugger, it gives you detailed GPU hardware performance statistics on a per draw call, on macOS and per encoder on iOS and tvOS basis.

And instead of giving you an arcane list of counters that change for every GPU and are hard to understand, and often don't tell you what you need to know, we've defined a high level set of characters that mean the same across each GPU.

So you don't have a per GPU learning curve.

So here is the Counter Profiling.

On the left, we have a graph view, showing you detailed GPU counter graphs, and on the right, the detail window.

Let's talk about those in order.

In the graph view, we show you counters across your frame, where the x axis represents draw calls, or encoders across time.

At the top, we show you GPU time.

As that is inherent to all GPU counter profiling.

And then below that, a set of top level counters that correspond to the stages in the GPU pipeline, along with some other top level counters that correspond to shared execution units, such as the shader core and test units.

For each group, you can drill down to more detailed counters, exploring a lot more data for each stage, and this is great for work flow where you identify it as your first cut, where you think the performance issue is, and then drill down to see why it's going on.

In the detail view, we'll show you all the same counters from the counter graph view, but displayed in full detail numerically.

And to give it context, we will show it alongside the median, max, and total values for the frame as well.

Now, both the graph view and the detail views support the same full, rich filtering options that we support elsewhere in the frame debugger, so if you want to just view certain pixel stats and certain memory stats at the same time, you can put together the search term, and view everything you want alongside each other.

But I'll highlight the GPU counter profiling is our advance to bottleneck analysis.

With this, we take all the counters that have been pulled for each draw call, or each encoder, and perform rich analysis on it, both on a cross platform basis, and on a per GPU specific basis to identify potential bottlenecks at each call.

Alongside this, we will give you lots of data about, okay, what is going on here?

What could cause it?

And then intuitive workflow to navigate direct to the affected area.

Now, both the, all the bottlenecks and all the counters will have rich detailed documentation within Xcode docs, explaining what each counter means in detail, why it might be particularly high or particularly low, and what you can do about it.

To give a demo of this great new GPU Counter Profiling feature, I'd like to invite my colleague, Jose, back to the stage to give you a demo of it in action.

[ Applause ]

Thank you, Seth.

And hello again everyone.

This time, I will demonstrate GPU counters, a tool that will help you analyze your GPU performance.

First, I'm going to replay the same demo that [inaudible] was on, but this time, we'll focus from a performance perspective.

The first thing to note is, the new GPU gauge, just under the FPS gauge.

By clicking on it, we'll navigate to our GP counter view.

As you can see, there is a wealth of data here.

Available for the first time.

With this view, you can now deliver file any GPU performance issue that you have in any of your capture frames.

So let me demonstrate how to find performance issues.

First, let's focus on the graph view.

' As you can see, there is a main spike in GPU time at the very beginning of a capture.

The first thing you want to do is to zoom in to see a single recall, there are more offenders.

In order to do that, I can simply pinch and zoom, just like that.

Any default system behaviors will work just as you expect them to.

Now I will see that there is a main spike.

You can click on this draw call to highlight this impact across all the pipeline, and hovering over each row will give us detailed information on how relevant that stage is for this particular draw call.

In this case, Vertex Omission, Vertex Shader, and Primitives did not seem to have relevant impact.

On the contrary, Fragment Shader and Pixels [inaudible] seemed to be quite high.

Let's focus on the Fragment Shader first.

If we expand this group we now get access to a massive amount of counter data that gives us detailed information what is going on with the Shader stage.

The last thing that this counted, we can quickly see that the stall time is unusually high, over 76%.

This means that most of the time we are spending on the Fragment Shader is actually waiting for some memory or data to be available.

This is caused because you are fetching from a buffer or from a texture, but texture captures should be here in this latency, so let's go down to our Texture Unit, to see what is the cache rate.

And we can immediately see that the textures cache rate is also unusually high, almost at 60%.

This means that more than half of the texture samples we are doing are coming from video memory and not from the texture caches.

Now that we have a better understanding of the issue at hand, let's focus on the assistant editor.

As you can see, the assistant editors offer the same graph inform counter information as the graph view was offering, but this time, displayed as a table view.

But more important, look at the top.

This is our bottleneck access tool.

It will point out two relevant issues that we consider when we analyze all the counters within the selected draw call, and point out any relevant issues that we cconsider important for you to pay attention to.

In this case, highlighting the same as we just found manually by checking the graph, that the texture cache miss rate is high.

When expanded, it also offers recommendations on what to check.

In this case, check if sampled textures have [inaudible], and a quick navigational name to relevant views for this issue.

For example, boundary sources, where we can immediately see what's the issue at hand, we're fetching a 4K by 4K RGBA32 Floating Point Texture with [inaudible] in both our vertex and our Fragment Shader.

This is a 256-megabyte texture that is fetched all over the pipeline.

No wonder we are trashing our cache.

Just think for a moment what we just did.

This was an incredibly detailed view of how GPU internals work.

You finally have the data to prove what [inaudible] telling you, that fetching from the textures is expensive, but now you know exactly why.

Accessing this texture was a star on the Fragment Shader, because it had to fetch some data from [inaudible] memory that was not available in the caches.

This level of detail is typically not seen outside consult tools.

Solving this issue now is a matter of balancing performance, quality, and correctness, but you have demonstrated how you can use GPU counters [inaudible] the GPU Frame Debugger, to help you investigate, analyze, and verify any capture information any performance information that you have in your captures.

And now, I hand it back to my colleague, Seth.

[ Applause ]

Thank you, Jose.

So that is GPU Counter Profiling.

Like all the new features we talked about today, it's the ultimate joy in Xcode Beta 9, and it's available for all Metal capable GPUs.

You will find that more recent GPUs have more counters available due to the more modern nature of the GPU, but there's still a rich and very usable set available for all GPUs.

However, we would love to hear feedback from you if you feel there's particular counters that were unexposed that would be particularly valuable, you know, please drop by the lab, or [inaudible] and we'll be happy to investigate.

So what have we talked about today?

We've talked about some great enhancements to the Metal Frame Debugger, with support for pixel inspection, inspecting Vertex Shader outputs, rich filtering, better capture support, better capture performance, and Xcode Metal Quick Looks.

We've talked about support for debugging and profiling VR applications in Metal Tray Debugger, and Metal System Trays.

We've talked about Metal Pipeline Statistics, giving you a direct line to the GPU compiler for performance information.

And we've introduced GPU Counter Profiling, giving you unheralded access to GPU Performance Counter Data in Metal.

For more information, check out the website.

Code is 607.

I did want to call out a couple of other sessions.

If you didn't catch either the Introducing Metal 2, or VR With Metal 2 sessions earlier on this week, it's highly worth going and checking out the video, even if you did see them, it's still worth checking out the videos.

And coming up later this afternoon, there is a great session on using Metal 2 for Compute, in Grand Ballroom A, at 10 past 4.

And that's it.

Thanks for coming.

Have a great remainder of your WWDC 17 and enjoy the bash.

Thank you.

[ Applause ]

Apple, Inc. AAPL
1 Infinite Loop Cupertino CA 95014 US