Metal Game Performance Optimization

Session 612 WWDC 2018

Realize the full potential of your Metal-based games by tackling common issues that cause frame rate slowdowns, stutters, and stalls. Discover how to clear up jitter and maintain a silky-smooth frame rate with simple changes in frame pacing. Get introduced to new tools for analyzing rendering passes and pinpoint expensive or unexpected work. Learn how to avoid thread stalls and get specific advice about handling thermal notifications.

[ Music ]

[ Applause ]

Good morning, and welcome to this talk.

My name is Guillem Vinals Gangollels.

And I work at the GPU Software Performance Team here at Apple.

Good developers like you make iOS an excellent gaming platform.

And we at Apple obviously want to help.

So this year we reviewed some of the top iOS games and found some common performance issues.

We analyzed a lot of data, and as a result of that investigation, we decided to put this talk together.

So this is going to be the main topic today.

Develop Awesome Games.

But I will only be providing technical directions here.

So we'll [inaudible].

Before we begin, please let me thank our friends at Croteam.

They are the developers behind The Talos Principle, which is a really awesome game.

You will see it featured in these slides and in two of the demos.

Notice that it has stunning visuals, but it really does not compromise on performance.

And that's what this is all about.

So let's do a quick run through of the agenda.

I'll start with an introduction to the tools.

This is a very good place to start.

And then we'll talk about the actual performance issues.

Around frame pacing, thread priorities, thermal states, and unnecessary GPU work.

Even though all these issues seem unrelated, they will compound and aggravate each other.

So it's important to tackle them all.

Let's start with the tools.

This is the most important message.

You should profile early and do it often.

Do not ship your game unless you've profiled it.

And for that you will need to know your tools.

Today, I will focus on two of them.

First, we have Instruments, which is our main profiling tool.

You will want to use it to understand performance, latency, and overall timing.

Second, we have the Metal Frame Debugger, which is also a very powerful tool, and which you will want to use to debug your GPU workload.

So where do we start?

This is a question we often get.

Well, this year we are making it easier for you.

We are introducing a new Instruments template, which will be a great starting point.

The Game Performance Template.

It is the combination of already existing instruments such as System Trace, Time Profiler, and Metal System Trace.

We configured it for you so it records all the CPU and GPU data that is relevant for your game.

So you can make it smooth.

So how do we launch it?

How do we get there?

Well, just open Instruments and you will see it right there in the center.

After you choose it, you will be able to configure it the same as any other template.

Once you start recording, you will do so in windowed mode, which will allow you to play your game for as long as you like, and only the last few seconds of data will be recorded.

And this is what those last few seconds of data will look like.

There's a lot of information so let's have a quick high-level overview.

First, we have System Trace and Time Profiler, which will give you an overview of the system load as well as your application CPU usage.

For example, User Interactive Load will record all the active threads at a given time.

In this case, the orange color you can see means that there are more runnable threads available than CPU cores.

So there is some contention.

This will offer a great view of the system.

There's a couple of great talks that talk about this instrument in more depth.

Please follow up on them.

Next on our list is Metal System Trace, our GPU profiling tool.

It offers a great view of the graphics stack.

All the way from the Metal Framework down to the display.

In particular, we will want to pay close attention to the GPU [inaudible], which is split into vertex, fragment, and compute if your game uses it.

Notice as well that the display track will be the starting point of many of our investigations.

We will identify a long frame or a stutter and we will work all the way up from there.

So it's a very natural place to start.

There is a lot of information about the tool because it really is a very powerful tool.

And I encourage you all to catch up on it.

These are a couple sessions that will provide you a great starting point.

Okay.

So next on our list we'll have a thread states view which we introduced this year.

This view will show you the state of every thread in your game.

In this case, each color represents a possible thread state, such as preempted which is represented in orange.

Or blocked which is represented in red.

We designed this view specifically with you, game developers, in mind.

Because we know the threading systems in modern games are very complex.

And we hope this really will help you.

Also we have a track for each CPU core.

It will show the thread running on that core, as well as the priority of that thread, which is color coded.

By using this, you will be able to see at a glance how busy the system really is.

That was a short but quite wide introduction to the tools.

So it's about time we move to the actual performance issues.

The first one will be around frame pacing.

And let's visualize it first.

For this we used a modified version of the Fox II demo.

That will help us illustrate the issue better.

Can you guess which game renders faster?

Well, some of you may not have guessed it.

The game on the left is trying to render at 60 frames per second.

But it can only achieve 40, so it's inconsistent, and it seems jittery.

The game on the right on the other hand is targeting 30 frames per second, which can consistently be achieved.

That's why it looks smoother.

But that's a bit counterintuitive.

How come the game that renders faster doesn't look smoother?

Well, this issue's known as micro stuttering or inconsistent frame pace.

It occurs when the frame time is higher than the display refresh interval.

For example, our game may take 25 milliseconds to render or 40 frames per second.

And the display may refresh every 16.6 milliseconds or 60 frames per second.

Same as the video we've just seen.

This will create some visual inconsistencies.

So how did we get there?

What have we done to be in this situation?

Well, we didn't do much really, and that's kind of the whole point of this.

After rendering the frame, we requested the next drawable from the display link.

And as soon as we got the drawable, we finished the final pass and presented it right away.

We explicitly told the system to present that drawable as soon as possible, at the next refresh interval.

After all, we are targeting 60 frames per second, right?

There's also another class of problems that will cause micro stuttering.

Some games are already targeting a lower frame rate.

But we have also identified that many of those games are using usleep on their main or render thread.

This is a very bad practice on iOS, so please don't do that, and just hang on here for the next few minutes.

And I'll tell you the actual correct way of doing this in iOS.

Now, let's have a deeper look into what happens in the system for micro stuttering to be visible.

In this case, we see here a timeline of all the components involved in rendering.

And we'll start rendering our game normally.

Notice this is a triple-buffered case, which is quite common on iOS.

In this case, every drawable is represented by a letter and a color.

And also notice the premise here.

Rendering to drawable B takes longer than one display refresh interval, which is the time between vsyncs.

In this case, it could be 25 milliseconds to render to B and 16.6 milliseconds between display refreshes.

So since that is the premise, this means that we will need to [inaudible] on the display for the next interval to give time so we can finish.

And we will do so.

And during that interval, B will actually finish.

And we will be ready to present it, but notice that we have just hit the issue here.

During this interval, we have also finished rendering to C.

And we are ready to present it right away.

So we will [inaudible] an inconsistent frame pacing from that moment onward.

We are stuck in this pattern.

Every other frame will be inconsistent.

And the user will see micro stuttering.
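The pattern on this timeline can be reproduced with a little arithmetic. The C sketch below is not from the session. It is a simplified model that assumes frames render back to back, that each finished frame is presented at the first vsync at or after its completion, and that rendering never waits for a free drawable. The function names and the microsecond units are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first vsync at or after time t, where vsync n occurs at n * refresh. */
static long first_vsync_at_or_after(long t, long refresh) {
    return (t + refresh - 1) / refresh;
}

/*
 * Simplified "present as soon as possible" model.
 * Frame i finishes rendering at (i + 1) * render_us and is shown at the next vsync.
 * on_screen_intervals[i] receives the number of refresh intervals frame i stays
 * on the display before frame i + 1 replaces it.
 */
void simulate_present_asap(long refresh_us, long render_us,
                           size_t frame_count, long *on_screen_intervals) {
    assert(refresh_us > 0 && render_us > 0 && on_screen_intervals != NULL);

    long shown_at = first_vsync_at_or_after(render_us, refresh_us);
    for (size_t i = 0; i < frame_count; ++i) {
        long next_done = (long)(i + 2) * render_us;
        long next_shown = first_vsync_at_or_after(next_done, refresh_us);
        if (next_shown <= shown_at) {
            /* The display shows at most one new frame per vsync. */
            next_shown = shown_at + 1;
        }
        on_screen_intervals[i] = next_shown - shown_at;
        shown_at = next_shown;
    }
}
```

With a 16,667 microsecond refresh interval (60 Hz, rounded up) and a 25,000 microsecond render time, this model alternates between one and two refresh intervals per frame. That averages out to 40 frames per second, but the frames are delivered unevenly.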

Now this may appear in different shapes and forms in the real world.

So what we'll do now is a quick demo, and I'll show you an Instruments trace of The Talos Principle.

And we will use it to see if we can identify micro stuttering in a real-world case.

Okay.

So what we see here is the same lot of information I've shown you before.

This has been captured with the Game Performance Template by default.

Notice all the same instruments I talked about here displayed on the left.

And all the game threads here in the middle.

In particular though, we are looking now at micro stuttering.

So this quite intuitively will bring us to look at the display track because micro stuttering by definition is frames presented inconsistently.

In this case, we have the display track here.

Notice as well that there are some hints in the display track.

We [inaudible] and these are the hints here.

They will show you when a surface has been displayed for longer than we would expect on a normal rendering.

So maybe this is a great place to start looking at it.

There's some clusters of them.

So let's zoom into one.

To zoom, we will hold the option key and just drag the pointer to the region of interest.

And in this case, if we keep looking at the display track, it's kind of evident already that we are micro stuttering.

We can see that every display has a different timing.

So in this case for example, we have 50, 33, 16, back to 50, and back to 33.

So when we see this pattern in an Instruments capture, it means that we are micro stuttering and we should correct it.

So let's just do that.

Back to the slides.

Okay.

We've just seen the problem, how it occurs in the real world.

The pattern is basically the same.

So how do we go about fixing it?

The best practice here is to target the frame rate your game can achieve.

That is, choose a minimum frame duration that is longer than the time it takes to render.

For that, there's a bunch of APIs that can help you.

For example, MTLDrawable addPresentedHandler will give you a callback once that drawable is presented.

So you can identify micro stuttering as it is happening.

The other two APIs will help you to actually fix the problem.

They will allow you to explicitly control the frame pacing.

In this case we have present afterMinimumDuration and present atTime.

What do we want to do here?

We set the minimum duration for our frame longer than it takes to render.

And we'll do just that.

Let's see how that looks.

Notice that when we start rendering, we are already consistent from the get-go.

Each frame spends more time on display than it takes to render.

Every frame will be consistent.

What the user sees will also be consistent.

And that's great.

Also notice that there's a side effect.

The frame rate will be lowered.

We went from 40 frames per second to 30 frames per second.

So that also gave us some extra frame time to play with.

So how did we do this?

How did we fix the frame pacing?

Well, really it's just a couple of lines of code.

We have the same pattern as before.

We rendered the scene.

We get the next drawable.

We do the final pass.

The only difference here is that we specify a minimum duration for our frame.

And present it with that minimum duration.

That's all it takes.

That will allow us to set the minimum duration for our frames.

And they will all be consistent.
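One way to pick that minimum duration is to round the render time up to a whole number of refresh intervals. The C sketch below assumes a fixed display refresh rate and a render time known from profiling. The helper names are hypothetical and not part of Metal. The resulting duration in seconds is the kind of value that would be passed to presentDrawable:afterMinimumDuration: on the command buffer.

```c
#include <assert.h>

/* Smallest whole number of refresh intervals that covers one frame's render time. */
long min_frame_intervals(long render_us, long refresh_us) {
    assert(render_us > 0 && refresh_us > 0);
    return (render_us + refresh_us - 1) / refresh_us;
}

/* Minimum on-screen duration, in seconds, for a frame held for `intervals` refreshes. */
double min_frame_duration_seconds(long intervals, double display_hz) {
    assert(intervals > 0 && display_hz > 0.0);
    return (double)intervals / display_hz;
}

/* Frame rate that results from holding every frame for `intervals` refreshes. */
double paced_frames_per_second(long intervals, double display_hz) {
    assert(intervals > 0 && display_hz > 0.0);
    return display_hz / (double)intervals;
}
```

For the 25 millisecond frame from the earlier example on a 60 Hz display, this gives two intervals, so every frame is held for about 33.3 milliseconds and the game runs at a steady 30 frames per second.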

And after doing so, you may be thinking well, what about maximum duration?

What about the concept of priority of our work?

Or how long a thing could take?

Well, that's actually the next issue on our list: thread priorities.

Let's visualize it first, same as we did before.

Again, with the modified version of the Fox II demo.

You may be thinking and you would be right that there are many things that could cause stuttering such as this.

Maybe you are doing some resource loading or [inaudible] compilation.

Today we will focus on the more fundamental but also incredibly common type of stutter.

The one caused by thread stalling.

If the work priority is not well communicated to the system, your game may have unexpected stalls.

iOS does plenty of stuff besides rendering your game.

Thread priorities are used to guarantee the quality of service in the whole system.

So if a thread does a lot of work, its priority will be lowered over time so other threads can run instead.

That's the concept known as priority decay.

Also you see on the slide behind me priority inversion.

This is another class of problems that manifests in a very similar way.

In this case, priority inversion occurs when the render thread depends on a lower-priority worker thread from your own engine in order to complete the work.

Let's see what that looks like in the same timeline we've seen before.

In this case, we start rendering at 30 frames per second so we are cool.

But then there is some background work.

iOS does lots of stuff.

Maybe now it's checking the email.

And the problem here is that the [inaudible] thread is not well configured.

You may get preempted by that background work.

You may not finish scheduling all the work onto the GPU.

And there is no such thing as maximum duration for a frame.

So that could potentially go on for hundreds of milliseconds.

The user will see a stutter.

So this is the theory behind it.

And in practice it shows in different ways that follow the same pattern.

So let's do another demo.

I'll show you another Instruments capture of The Talos Principle.

That will show you how to identify this problem.

So in this case, what you see here is again a capture taken with the Game Performance Template.

But this time we have already zoomed into the frame we are interested in, which is this very long frame.

It has a duration of 233 milliseconds.

So that's likely a very good stutter that we should investigate.

By looking at it at a glance, we can already tell that the GPU does not seem to be doing much.

It's idle during this time, so this means that we are not feeding it.

Now we can look at the CPU, of course, and they seem to be fairly busy down here.

Right?

Really, all of it seems quite solid.

But notice what you see here is the time profiler view of our application.

And it does not seem to be running.

Why is our game not running and how come that causes a stutter?

Why?

Well, we can switch to the new view I talked to you about, the new thread states view.

To do so you will go into the icon of your application and click on that button here and that would pull out the track display.

And in this case, you can switch to thread states.

And that will hopefully already help you to see there is something wrong here.

It is highlighted in orange, and it's already telling us that the thread has been preempted for 192 milliseconds.

So that's the actual problem here.

A render thread is not running.

Something preempted it.

If you want to know more, you can expand information at the bottom, which will contain also the thread narrative.

And by clicking on the preempted thread, you will see here an explanation of what's going on.

In this case, your render thread was preempted at priority 26, which is very low.

It's below background priority because the App Store was updating.

So that's something we do not want.

We want to tell the system that to our user, our game is more important than an App Store update at that particular moment.

So let's go back to the slides and see how we can do that.

So the best practice here is to configure your render thread.

We recommend that the render thread priority be fixed to 45.

Notice that iOS and macOS priorities have ascending values.

So priority 31 has higher priority than priority four.

Also, we need to opt out of the scheduler's quality of service in order to prevent priority decay which could lower our priority as well.

Let's see what a well-configured render thread looks like.

In this case, we configure just how I told you.

We start rendering normally.

We also have some background work going on.

Otherwise it wouldn't be fair.

And this background work could be updating the App Store just as we've seen in the demo.

But notice that vsync after vsync, our render occurs normally.

We are preempting the background work of the CPUs so we can run instead.

The user does not see the stutter.

Your game can run at 30 solid frames per second, even though the system is under heavy load.

That is technically awesome, and that's what this is all about.

So let's see how we make this happen with a little bit of code.

And it literally is a little bit of code.

It is only like a couple lines.

In this case, it's just about configuring the pthread attributes before we can create the pthread.

We need to opt out of quality of service, set the priority to 45.

And that's it.

We can create the pthread with those attributes, and it will work just fine.
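A minimal C sketch of those steps follows. It assumes that selecting the fixed-priority SCHED_RR policy on the thread attributes is what opts the thread out of quality-of-service scheduling and priority decay on Apple platforms. The function names and the empty render entry point are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define RENDER_THREAD_PRIORITY 45 /* value recommended in the session */

/* Hypothetical entry point; an engine's render loop would live here. */
static void *render_thread_main(void *context) {
    (void)context;
    return NULL;
}

/* Initializes attr with a round-robin policy and a fixed priority.
 * Returns 0 on success or a pthread error code; on failure attr is destroyed. */
int configure_render_thread_attr(pthread_attr_t *attr) {
    assert(attr != NULL);
    int err = pthread_attr_init(attr);
    if (err != 0) return err;

    /* Assumption: an explicit fixed-priority policy avoids priority decay. */
    err = pthread_attr_setschedpolicy(attr, SCHED_RR);
    if (err == 0) {
        struct sched_param param;
        err = pthread_attr_getschedparam(attr, &param);
        if (err == 0) {
            param.sched_priority = RENDER_THREAD_PRIORITY;
            err = pthread_attr_setschedparam(attr, &param);
        }
    }
    if (err != 0) pthread_attr_destroy(attr);
    return err;
}

/* Creates the render thread with the attributes configured above. */
int start_render_thread(pthread_t *thread) {
    assert(thread != NULL);
    pthread_attr_t attr;
    int err = configure_render_thread_attr(&attr);
    if (err != 0) return err;
    err = pthread_create(thread, &attr, render_thread_main, NULL);
    pthread_attr_destroy(&attr);
    return err;
}
```

On platforms where new threads inherit the creator's scheduling by default, such as Linux, these attributes only take effect if pthread_attr_setinheritsched is also called with PTHREAD_EXPLICIT_SCHED. This sketch leaves that call out.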

It is simple and technically awesome.

What's not so simple though is the next issue on our list.

The one about dealing with multiple thermal states.

The message is very clear.

Design for sustained performance and deal with the occasional thermal issues.

So let's see how we go about that.

iOS devices give you access to an unprecedented amount of power.

But [inaudible] in a very small form factor.

So as more apps use more resources on the device, the system may begin enacting measures in order to stay cool and responsive.

Also the user may have enabled a low power mode condition, which will have a very similar effect.

Okay, so the best practice really is just to adjust your workload to the system state.

You should monitor the system and tune the workload accordingly.

iOS has many APIs to help you with that.

For example, use NSProcessInfo thermalState to either query or register for notification when the device thermal state changes.

You should also check for the low power mode condition in a similar fashion.

Also consider querying GPUStartTime and GPUEndTime from the MTLCommandBuffer in order to understand how system load may impact the GPU time.
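As a sketch of that last suggestion: once a command buffer has completed, the difference between its GPU end and start times, both reported in seconds, is the GPU time spent on that buffer. The C helper below is illustrative and not from the session. It keeps a running average in milliseconds and compares it against a frame budget. Reading the two timestamps from Metal is left out, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double average_ms; /* running mean of GPU time per command buffer */
    long   samples;    /* number of command buffers recorded so far */
} gpu_time_tracker_t;

/* Records one command buffer given its GPU start and end timestamps in seconds. */
void gpu_time_tracker_add(gpu_time_tracker_t *tracker,
                          double gpu_start_s, double gpu_end_s) {
    assert(tracker != NULL && gpu_end_s >= gpu_start_s);
    double elapsed_ms = (gpu_end_s - gpu_start_s) * 1000.0;
    tracker->samples += 1;
    tracker->average_ms += (elapsed_ms - tracker->average_ms) / (double)tracker->samples;
}

/* True when the average GPU time so far exceeds the given budget in milliseconds. */
bool gpu_time_over_budget(const gpu_time_tracker_t *tracker, double budget_ms) {
    assert(tracker != NULL);
    return tracker->samples > 0 && tracker->average_ms > budget_ms;
}
```

A game targeting 30 frames per second could, for example, compare the average against a budget of roughly 33 milliseconds and reduce its workload when the average climbs above it.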

Let's see how we do that with a simple code example.

This comes straight from our best practices.

At its core, it is a very simple switch statement where every case corresponds to a thermal state.

We have nominal, fair, serious, and critical.
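A C sketch of that shape follows. The enum stands in for the four states reported by NSProcessInfo's thermalState and is declared locally so the example is self-contained. The settings struct, the values chosen for each state, and the handling of Low Power Mode are illustrative assumptions, not recommendations from the session.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the four thermal states; not the Foundation type. */
typedef enum {
    THERMAL_NOMINAL,
    THERMAL_FAIR,
    THERMAL_SERIOUS,
    THERMAL_CRITICAL
} thermal_state_t;

/* Hypothetical knobs a game might expose. */
typedef struct {
    int   target_fps;
    float render_scale;    /* scale of intermediate render targets */
    int   shadow_cascades;
    bool  post_processing;
} quality_settings_t;

quality_settings_t quality_for_thermal_state(thermal_state_t state, bool low_power_mode) {
    assert(state >= THERMAL_NOMINAL && state <= THERMAL_CRITICAL);

    /* Baseline: a frame rate assumed to be sustainable for a whole session. */
    quality_settings_t q = { 30, 1.0f, 3, true };

    switch (state) {
    case THERMAL_NOMINAL:
        break; /* no corrective action needed */
    case THERMAL_FAIR:
        q.shadow_cascades = 2; /* start trimming optional work */
        break;
    case THERMAL_SERIOUS:
        q.render_scale = 0.75f;
        q.shadow_cascades = 1;
        break;
    case THERMAL_CRITICAL:
        q.render_scale = 0.5f;
        q.shadow_cascades = 1;
        q.post_processing = false;
        break;
    }

    /* Assumption: Low Power Mode caps the render scale regardless of thermal state. */
    if (low_power_mode && q.render_scale > 0.75f) {
        q.render_scale = 0.75f;
    }
    return q;
}
```

The game would call a function like this whenever the thermal state or the Low Power Mode setting changes, then apply the returned settings to its renderer.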

And that is all very good.

So now we know which thermal state we are in, and that we should do something about it.

So how can we actually help the system stay cool?

Well, I can give you some suggestions, but it's up to you game developers to decide what compromises to make in order to help the system.

You know what's best for your game to keep being awesome under stress.

Some recommendations I'll give you though are to target the frame rate that can be sustained for the entire game session.

For example, stay at 30 frames per second if you cannot sustain 60 for ten minutes or more.

Tuning the GPU work is also super helpful.

For example, consider lowering the resolution of intermediate render targets, simplifying the shadow maps, loading simpler assets, and even removing some of the post-processes altogether.

Whatever fits your game best.

You should decide that one.

And this will bring us to the next issue on our list.

The one about dealing with unnecessary GPU work.

For that, please welcome my colleague Ohad on stage.

He's going to tell you all about it.

[ Applause ]

Thank you, Guillem.

[ Applause ]

Hey, everyone.

My name is Ohad, and I'm a member of the Game Technologies Team here at Apple.

In the previous slides, Guillem showed you how important it is to adapt to the system.

Responding to states like low power mode or the varying thermal states will require you to tune your GPU workload in order to maintain consistent frame rates throughout an entire game session.

However, for many developers, the GPU is a bit of a black box hidden behind the curtains of a game engine.

Today, we'll pull back those curtains.

Wasted GPU time is a very common problem and it's one that often goes unnoticed.

But I want you to remember this.

Technically awesome games don't only hit their GPU budget.

They're also good citizens to the system, helping it to stay cool and save power.

All the popular game engines provide a great list of best practices to follow.

We won't cover those.

Instead we'll focus on how to tell if something is expensive to render.

And as we've done with the CPU several times today, the best practice here is profile your GPU as well.

The power of our GPUs can hide many inefficiencies in either content or algorithms.

You will want to time your workload, but also understand each rendering technique that you enable.

And only keep those that add noticeably to the visual quality of your games.

But how do you find these inefficiencies?

How do you determine which parts of your pipeline are flat-out excessive?

This of course brings us back to tools.

As always, your first stop should be Instruments.

Here we're looking at Metal System Trace.

It'll provide you accurate timings for vertex, fragment, and compute work being done.

But by measuring your GPU time, you're only halfway there.

Next you want to really understand what each of your passes is doing.

And for this, we've added a new tool to the Metal Frame Debugger this year.

It's the Dependency graph.

The Dependency graph is a story of a single frame.

It's made up of nodes and edges, and each one of these tells a different part of the story.

Edges represent dependencies between passes.

As you follow them from top to bottom, you'll see where each pass fits into your rendering pipeline.

And how they work together to create your frame.

Nodes on the other hand are the story of a single pass.

They're made up of three main components.

First, the title element will give you the name of the pass.

Now I really want to emphasize this.

Label everything.

It'll help you not only in the Dependency viewer, but throughout our entire suite of tools.

Secondly, it'll allow you to quickly tell what type of pass you're looking at.

Render, blit, or compute.

Here from the icon we can see that it's a render pass.

Next, you have a list of statistics describing the work being done in this pass.

And finally, at the bottom, there is a list of all the resources that are written to during this pass. Each of these also comes with a label, a thumbnail that lets you preview your work, and a list of information describing that specific resource.

And all that together allows you to really understand each of your passes.

Okay, so now we know how to read the graph.

Let's jump into a demo and see how it all fits together.

Okay.

So I have the Fox II demo running on my machine here.

It was built in SceneKit, which allowed me to add all sorts of great effects.

As you can see, I have cascaded shadow maps, bloom, depth of field, and all of it comes together to create a beautifully rendered scene.

Let's use the dependency viewer to see how it all works.

First, we'll go to Xcode and we'll capture a frame using the capture GPU frame button in the bottom.

And we'll select the main pass on the left.

[Applause] And we'll also switch to automatic mode, which will give us our assistant on the right.

Now notice that the same pass that I selected in the debug navigator is also the one that's shown as selected and focused in the main view.

And this is a two-way street.

So as we interact with the graph, selecting different passes or textures or even buffers, both the navigator on the left and the assistant on the right will update to show your selection.

So this is a really fantastic way to navigate your frame.

Now as I zoom out, the first thing you'll notice is that the statistics hide and the focus moves away from the individual passes and onto the frame as a whole.

And I can zoom out even more to see a great bird's-eye view of my entire frame.

Now the really cool thing to notice here is that since dependencies drive the connectivity of the graph, each logical piece of work is grouped together in space.

So let's zoom in and see what I mean.

Here I have a branch of work that's creating my shadow maps.

On the left, I can see three passes that are rendering the shadows.

So this is really fantastic because I'm not just getting the story of my entire frame.

But there's another story in between these two layers.

One of how each rendering technique is built up.

And this is something that isn't always entirely obvious when you're using a game engine to turn these on.

For instance, with my shadow maps, I may not have known that each cascade would require its own pass.

If I considered each one of these individually, they wouldn't really stand out.

But now I see that I have to consider them as a group.

And that gives me the insights that I need to make informed decisions on any compromises that I make while tuning my GPU workload.
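The idea of judging a group of passes together can be sketched as a small graph traversal. The C example below is not how the Dependency viewer works internally. It models each pass as a node with a GPU time and a list of the passes it depends on, and sums the cost of one pass plus everything upstream of it, counting each pass once. The type names, field names, and fixed capacity are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PASSES 16

typedef struct {
    const char *label;         /* pass label, as it would appear in the tools */
    double gpu_ms;             /* measured GPU time for this pass */
    size_t input_count;        /* number of passes this pass depends on */
    size_t inputs[MAX_PASSES]; /* indices of those passes */
} pass_t;

/* Depth-first walk over dependencies that adds each unvisited pass's time to total. */
static void visit(const pass_t *passes, size_t pass_count, size_t index,
                  bool *seen, double *total) {
    assert(index < pass_count);
    if (seen[index]) return;
    seen[index] = true;
    *total += passes[index].gpu_ms;
    for (size_t i = 0; i < passes[index].input_count; ++i) {
        visit(passes, pass_count, passes[index].inputs[i], seen, total);
    }
}

/* GPU time of pass `root` plus every pass it depends on, directly or indirectly. */
double branch_gpu_ms(const pass_t *passes, size_t pass_count, size_t root) {
    assert(passes != NULL && pass_count <= MAX_PASSES && root < pass_count);
    bool seen[MAX_PASSES] = { false };
    double total = 0.0;
    visit(passes, pass_count, root, seen, &total);
    return total;
}
```

For a hypothetical frame in which three 0.5 millisecond cascade passes feed a 0.25 millisecond pass that combines them, the branch costs 1.75 milliseconds, even though no single pass in it looks expensive on its own.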

So that's the Dependency viewer.

I'll switch back to the slides.

And please help me welcome Guillem back onto the stage for his final thoughts.

Thank you.

[ Applause ]

Thank you.

That was an awesome demo [inaudible].

Cool.

So Ohad has just shown us what a frame looks like through the Dependency viewer.

And that is great for you to inspect your GPU workload.

For example, oftentimes we may go from a very small and simple pipeline such as this one to a very complex one with post-processing, multiple shadow maps, and HDR.

And all of this can be done by adding, you know, a couple of properties to the camera object of your favorite game engine.

You see that the code complexity of those changes is minimal.

But the rendering complexity may have increased tenfold, which really brings us back to the beginning, right where we started.

Profile.

It is very important that you understand what your game does.

If you spend tens of thousands of hours developing a game, you should consider spending some of that time profiling as well.

Everything we have seen today can be found within minutes.

The best part?

You don't need to know what you're looking for.

Just record the stutter, get the long frame, and work all the way up from there.

It's that simple.

The tool will give you all the information you need to identify the problems.

But you will need to use the tool.

And that is really the takeaway.

So we have seen a bunch of common pitfalls followed by some best practices.

All of these issues can be found through profiling.

That's how we found them.

We analyzed a ton of games, found the common issues, and decided to put a talk together.

Now, if you have access to the engine source code, make sure that both frame pacing and thread priorities are well configured.

It's just a couple lines of code really.

But regardless, your game should always adapt to thermals and should not submit unnecessary GPU work.

By making sure to follow all these best practices, you too will be developing technically awesome games.

And that's what this is all about.

For more information, there is a lab coming up at 12 PM.

We will be there.

I'll be there, and we'll be more than happy to answer any questions you may have after this session.

Or maybe you just want to sit down and let us profile your game.

Also, there were two great talks [inaudible] about Metal for game developers and our profiling tools.

Thank you very much, and enjoy the rest of the day.

And have a great one.

[ Applause ]
