Metal for Game Developers

Session 607 WWDC 2018

Metal 2 is Apple-designed graphics software that lets developers build console‑style games. Learn key aspects of the Metal architecture that support the techniques for modern high-performance game rendering. See how Metal now enables the GPU to schedule work for itself, allowing complete scenes and compute workloads to be built and executed with little to no CPU interaction. Understand how the seamless integration of Metal 2 with the A11 Bionic chip lets your apps and games realize entirely new levels of performance and capability.

[ Music ]

[ Applause ]

Welcome.

Last year, we introduced Metal 2, which includes new ways for the GPU to drive the rendering pipeline.

This year, we're introducing even more new and exciting features to solve common game development challenges.

My name is Brian Ross, and together with my colleague, Michael Imbrogno, we'll explore new ways to make your applications better, faster, and more efficient.

But first, I want to talk about some of the challenges that I'm trying to help you solve.

Your games are using an ever-increasing number of objects, materials, and lights.

Games like Inside, for example, use a great deal of special effects to capture and support the mood of the game.

Making games like this that truly draw you in is challenging because it can require efficient GPU utilization.

At the same time, games are requiring more and more CPU cycles for exciting gameplay.

For example, games like Tomb Raider feature breathtaking vistas and highly detailed terrain, but at the same time, they're also managing complex physics simulations and AI.

This is challenging because it leaves less CPU time for rendering.

And finally, developers are taking AAA titles like Fortnite from Epic Games and porting them to iOS so you can run a console-level game in the palm of your hand.

This is a truly amazing feat, but this also leaves us with even more challenges, like how to balance battery life with a great frame rate.

So now, let's look at how Metal can help you solve these challenges.

Today, I'm going to show you how to harness parallelism on both the CPU and the GPU to draw more complex scenes.

We'll also talk about ways to maximize performance using more explicit control with heaps, fences, and events.

And then, I'm going to show you how to build GPU-driven pipelines using our latest features, argument buffers and indirect command buffers.

Now, while all these API improvements are key, it's equally important to understand the underlying hardware they run on.

So in the next section, my colleague Michael is going to show you how to optimize for the A11 to improve performance and extend playtime.

And finally, I'm really excited that we're going to be joined by Nick Penwarden from Epic Games.

He is going to show us how they've used Metal to bring console-level games to our devices.

So let's get started.

Harnessing both CPU and GPU parallelism is probably the most important and easiest optimization you can make.

Building a command stream on a single thread is not sufficient anymore.

The latest iPhone has 6 cores, and the iMac Pro can have up to 18.

So scalable, multithreaded architecture is key to great performance on all of our devices.

Metal is designed for multithreading.

I'm going to show you 2 ways to parallelize on the CPU, and then I'm going to close this section by showing you how Metal can automatically parallelize for you on the GPU.

So let's set up an example of a typical game frame.

With a classic, single-threaded rendering, you'd [inaudible] build GPU commands and GPU execution order into a single command buffer.

Typically, you're then having to fit this into some small fraction of your frame time.

And, of course, you're going to have maximum latency because the entire command buffer must be encoded before the GPU can consume it.

Obviously, there's a better way to do this, so what we're going to do is we're going to start by building in parallelism with the CPU.

Render and compute passes are the basic granularity of multithreading in Metal.

All you need to do is create multiple command buffers and start encoding passes into each on a separate thread.

You can encode them in any order you wish.

The final order of execution is determined by the order they're added to the command queue.

So now, let's take a look at how easy this is to do in your code.

So you can see this is not a lot of code.

The first thing that you're going to do is create any number of command buffers from the queue.

Next, we're going to define the GPU execution order upfront by using the enqueue interface.

This is great because you can do all this without waiting for the command buffer to be encoded first.

And finally, we're going to create separate threads and call our encoding functions for each.

And that's it.

That's all you have to do.

It's really fast, it's really efficient, and it's really simple.
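
The ordering rule described above can be modeled on the CPU with a short C++ sketch. This is not Metal API code: `CommandQueueModel`, `enqueue`, `commit`, and `executionOrder` are hypothetical names, and committing slots in reverse order stands in for worker threads that finish encoding at different times.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a command queue: a slot is reserved at enqueue time,
// and the slot index (not the time of commit) decides execution order.
class CommandQueueModel {
public:
    // Reserve the next execution slot before any encoding has happened.
    std::size_t enqueue() {
        slots_.emplace_back();
        return slots_.size() - 1;
    }

    // Called once the work for a slot has been encoded, in any order.
    void commit(std::size_t slot, std::string encodedPass) {
        assert(slot < slots_.size());
        slots_[slot] = std::move(encodedPass);
    }

    // Execution walks the slots in enqueue order.
    const std::vector<std::string>& executionOrder() const { return slots_; }

private:
    std::vector<std::string> slots_;
};

// Reserve three slots up front, then "finish encoding" them back to front.
inline std::vector<std::string> encodeFrameOutOfOrder() {
    CommandQueueModel queue;
    const std::vector<std::string> passes = {"shadow", "g-buffer", "lighting"};

    std::vector<std::size_t> slots;
    for (std::size_t i = 0; i < passes.size(); ++i) slots.push_back(queue.enqueue());

    for (std::size_t i = passes.size(); i-- > 0;) queue.commit(slots[i], passes[i]);

    return queue.executionOrder();
}
```

In this model, `encodeFrameOutOfOrder()` returns the passes in the order their slots were reserved, even though the last pass was committed first.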

So now, let's go back to the previous diagram and look at another example.

So as you can see, we did a pretty good job parallelizing these on the CPU, but what if you have 1 really long rendering pass?

So in cases like this, Metal has a dedicated parallel encoder that allows you to encode on multiple threads without explicitly dividing up the render pass or the command buffer.

So now, let's look at how simple this is in your code.

It looks a lot like the previous example.

The first thing you're going to do is create a parallel encoder.

And from that, you create any number of subordinate encoders.

And it's important to realize that this is actually where you define the GPU execution order.

Next, we're going to create separate threads and encode each of our G-buffer functions separately.

And finally, we set up a notification so that when the threads are complete, we call end encoding on the parallel encoder.

And that is it.

That's all you have to do to parallelize a render pass.

It's really fast, and it's really easy.

So now that I've shown you 2 ways to parallelize on the CPU, now let's see how Metal can parallelize for you automatically on the GPU.

So let's look at the frame example from the beginning and see how the GPU executes the frame.

Based on the capabilities of your platform, Metal can extract parallelism automatically by analyzing your data dependencies.

Let's look at just 2 of these dependencies.

So in this example, the particle simulation writes data, which is later used by the effects pass to render the particles.

Similarly, the G-buffer pass generates geometry, which is later used by the deferred shading pass to compute material lighting.

All this information allows Metal to automatically and cheaply identify entire passes that can run in parallel, such as using async compute.

So you can achieve parallelism and async compute for free on the GPU.

It's free because Metal doesn't require you to do anything special on your part.
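
The kind of analysis described above can be sketched in C++ as a comparison of read and write sets. This is only a conceptual model, not how Metal is implemented: `PassModel`, `dependsOn`, `canRunInParallel`, and the resource names are all hypothetical, and only direct dependencies between two passes are checked.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a pass: the resources it reads and writes.
struct PassModel {
    std::string name;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

static bool contains(const std::vector<std::string>& list, const std::string& item) {
    return std::find(list.begin(), list.end(), item) != list.end();
}

// True if `later` must wait for `earlier`: `later` touches something that
// `earlier` writes, or `later` overwrites something that `earlier` reads.
bool dependsOn(const PassModel& later, const PassModel& earlier) {
    for (const auto& written : earlier.writes) {
        if (contains(later.reads, written) || contains(later.writes, written)) return true;
    }
    for (const auto& read : earlier.reads) {
        if (contains(later.writes, read)) return true;
    }
    return false;
}

// Two passes may overlap on the GPU only if neither depends on the other.
bool canRunInParallel(const PassModel& a, const PassModel& b) {
    return !dependsOn(a, b) && !dependsOn(b, a);
}
```

With the frame from the talk, the effects pass depends on the particle simulation and deferred shading depends on the G-buffer pass, while the particle simulation and the G-buffer pass share no resources and can overlap.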

So I think we all love getting free optimizations on the GPU, but sometimes you as a developer, you may need to dive a little bit deeper.

For the most critical parts of your code, Metal allows you to incrementally dive deeper with more control.

For example, you could disable automatic reference counting and do it yourself to save on CPU time.

You could also use Metal heaps to tightly control allocations really cheaply.

And Metal heaps are complemented by fences and events, which allow you to explicitly control the GPU parallelism.

Many of your games are using a lot of resources, which can be costly.

Allocations require a round trip to the OS, which has to map and initialize memory on each request.

If your game uses temporary render targets, these allocations can happen in the middle of your frame, causing stutters.

Resource heaps are a great solution to this problem.

Heaps also let you allocate large slabs of memory from the system upfront.

And you can later add or remove textures and buffers from those slabs without any costly round trip.

So starting from a case where you allocate 3 normal textures, Metal typically places these in 3 separate allocations, but putting these all instead into a single heap lets you perform all memory allocation upfront at heap creation time.

So then, the act of creating textures becomes extremely cheap.

Also, heaps can sometimes let us use the space more efficiently by packing allocations closer together.

So with a traditional model, you would deallocate textures, releasing pages back to the system, and then reallocate, which will allocate a new set of textures all over again.

With heaps, you deallocate and reallocate without any costly round trip to the OS.

Finally, heaps also let you alias different memory resources with each other.

This is really helpful if your game frame has a lot of temporary render targets.

There's no reason for these to occupy different memory all the time, so you could alias and save hundreds of megabytes.
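
As a rough illustration of sub-allocation and aliasing, here is a small C++ model of a heap. It is a sketch under simplifying assumptions rather than Metal's allocator: `HeapModel` and its methods are hypothetical, alignment is ignored, and a resource is identified by its offset (in Metal you mark the resource itself as aliasable).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a resource heap: one slab reserved up front,
// sub-allocated without further trips to the operating system.
class HeapModel {
public:
    static constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

    explicit HeapModel(std::size_t sizeInBytes) : size_(sizeInBytes) {}

    // Returns the offset of the new resource, or kNoSpace if it does not fit.
    // Reuses the first aliasable block that is large enough; otherwise the
    // allocation is placed at the current high-water mark.
    std::size_t allocate(std::size_t bytes) {
        for (auto& block : blocks_) {
            if (block.aliasable && block.size >= bytes) {
                block.aliasable = false;  // memory now backs the new resource
                return block.offset;
            }
        }
        if (bytes > size_ - used_) return kNoSpace;
        blocks_.push_back({used_, bytes, false});
        used_ += bytes;
        return blocks_.back().offset;
    }

    // Mark the resource at `offset` as reusable by later allocations.
    void makeAliasable(std::size_t offset) {
        for (auto& block : blocks_) {
            if (block.offset == offset) block.aliasable = true;
        }
    }

    std::size_t usedBytes() const { return used_; }

private:
    struct Block { std::size_t offset; std::size_t size; bool aliasable; };
    std::size_t size_;
    std::size_t used_ = 0;
    std::vector<Block> blocks_;
};
```

In this model, a second temporary render target allocated after the first has been made aliasable lands at the same offset, so the heap's footprint does not grow.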

Now, the faster allocations and aliasing are great, but it's not entirely free when it comes to dependency tracking.

Let's return to our frame example for a better explanation.

With heaps, Metal no longer sees individual resources, so therefore, it can't automatically identify the read and write dependencies between passes, such as the G-buffer and deferred shading pass in our example.

So you have to use fences to explicitly signal which pass produces data and which pass consumes the data.

So in this example, the G-buffer updates the fence, and the deferred shading waits for it.

So now, let's take a look at how we could apply these basic concepts in your code.

So the first thing that we're going to do is we're going to apply this to our G-buffer and deferred shading example.

First, we're going to allocate our temporary render target from the heap.

This looks just like what you're probably already doing today when you allocate a texture.

Next, we're going to render into that temporary render target.

And finally, update the fence after the fragment stage completes.

This will ensure that all the data is produced before the next pass consumes it.

So now, let's switch gears over to the deferred shading pass.

Now, we're going to use this temporary render target to compute material lighting.

Then, we're going to wait for the fence to make sure that it's been produced before we consume it.

And finally, mark it as aliasable so that we can reuse this for other operations, saving hundreds of megabytes.

So now that we've talked about how to parallelize and optimize performance with explicit control, this is great, but what if you want to put the GPU more into the driving seat?

So let's talk about GPU-driven pipelines.

Your games are moving more and more of the decision logic onto the GPU, especially when it comes to processing extremely large data sets or scene graphs with thousands of objects.

With Metal 2, we've made another really important step forward in our focus on GPU-driven pipelines.

Last year, we introduced indirect argument buffers, allowing you to further decrease CPU usage and move a large portion of the workload to the GPU.

This year, we're also introducing indirect command buffers, and this will allow you to move entire rendering loops onto the GPU.

So first, let's briefly recap the argument buffer feature.

An argument buffer is simply a structure represented like this.

Previously, these would have only constants, but with argument buffers, we can have textures and samplers.

Before, these would have to have separate shader bind points.

So since this is a structure, you have all the features of the Metal shading language at your disposal, so it's really flexible and really easy.

You could do things like add substructures, or arrays, or even pointers to other argument buffers.

You could modify textures and samplers, creating new materials on a GPU without any CPU involvement.

Or you can make giant arrays of materials and use a single-instance draw call to render many objects with unique properties.

So argument buffers allow you to offload the material management onto the GPU and save valuable CPU resources.

But this year, we're extending it a little bit more.

We started by adding 2 new argument types.

This includes pipeline states and command buffers.

Now, these are used to support our brand-new indirect command buffer feature.

With indirect command buffers, you could encode entire scenes on the GPU.

On the CPU, you only have a few threads available for rendering, but on the GPU, you have hundreds or even thousands of threads all running at the same time.

With indirect command buffers, you can fully utilize this massively parallel nature.

Also, indirect command buffers are completely reusable, so you could spend the encoding cost once and reuse it again and again.

And since an ICB is a directly accessible buffer, you can modify its contents at any time, like change the shader type, or the camera matrix, or anything else that you might need to change.

And of course, by moving your rendering to the GPU, you remove expensive CPU and GPU synchronization points that are normally required to hand over the data.

So let's take a look at an example.

Here is a typical game frame.

The usual rendering loop has a few common stages.

First, you walk your scene graph to determine which objects you need to render.

You probably use frustum culling to determine what objects are within the view frustum.

Some of you might use a more complex solution that accounts for occlusion.

Also, level of detail selection naturally occurs at this stage.

Only once you encode and submit your command buffer will the GPU start to consume it.

More and more games are moving the process of determining visible objects onto the GPU.

GPUs are just better at handling the growing scene complexity of the latest games.

Unfortunately, this creates a sync point in your frame.

It makes it so that the CPU cannot encode draw calls until the GPU produces the data.

It's extremely difficult to get this right without wasting valuable CPU and GPU time on synchronization.

With ICBs, the benefits are immense.

Not only can you move the final bits of processing to the GPU, you naturally remove any sync points required to hand over the data and you improve your CPU and GPU utilization.

At the same time, you reduce your CPU overhead to a constant.

So let's look at the encoding in a little bit more detail.

I'm going to start by expanding on our previous example and look at the massively parallel nature that only the GPU can provide.

We could begin with the list of visible objects and LODs coming from our culling dispatch.

Also, keep in mind that we're utilizing the power of argument buffers here.

So in this case, each element has a pointer to the actual properties, so we don't need to store everything in the same buffer.

This solution saves us a lot of memory and performance, and it's because we only build a very small list of information.

The actual argument buffer contains several levels of detail for geometry.

This includes position, vertex buffer, index buffer, and a material argument buffer.

For rendering, we only select 1 of these LODs per object.

The actual encoding happens in a compute kernel, and we encode into an indirect command buffer.

Each thread of the compute kernel encodes a single draw call.

So we read the object with all of its properties, and we encode these into the ICB.

There's a couple of details worth noting.

You can think of an ICB as an array of render commands.

A render command consists of a pipeline object with shaders, any number of buffers, and a draw call.

Next, an ICB is built for parallelism, so you could encode concurrently and out of order.

And lastly, we kept the API very simple, so it's just like what you might be doing today on the CPU.

Another thing: each command could have different properties and even draw types.

So this is a really, really significant step forward from all the flavors of indirect rendering that many of you may have seen elsewhere.

Now, let's take a look at how we can do this in your code.

So this is how easy it is to encode a draw call.

The first thing you're going to do is select the render command by index using your thread ID.

Then, we're going to set the properties.

So in this example, we're setting a shader with a pipeline state and then a separate buffer for the geometry and material.

And finally, this is how you encode a draw call.

Thanks to the Metal shading language, encoding on the GPU is really, really simple.

Even though this is in a compute shader, this looks just like what you're already doing on the CPU today.
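
The per-thread encoding steps above can be modeled in plain C++. This sketch is not Metal shading language: every type and field name here (`ObjectModel`, `LodModel`, `VisibleObject`, `RenderCommandModel`, `encodeDraw`) is a hypothetical stand-in, and integer handles take the place of real pipeline states and buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the objects named in the talk.
struct LodModel { int vertexBuffer; int indexBuffer; int indexCount; };

struct ObjectModel {
    int pipelineState;           // which shaders to use
    int materialBuffer;          // handle of the material argument buffer
    std::vector<LodModel> lods;  // lods[0] is the most detailed
};

// Output of the culling step: which object to draw and at which LOD.
struct VisibleObject { std::size_t objectIndex; std::size_t lodIndex; };

// One slot of the indirect command buffer: state plus a draw call.
struct RenderCommandModel {
    int pipelineState = -1;
    int vertexBuffer = -1;
    int indexBuffer = -1;
    int materialBuffer = -1;
    int indexCount = 0;
};

// Body of the "kernel": thread `threadId` fills only command `threadId`,
// so threads can run concurrently and in any order.
void encodeDraw(std::size_t threadId,
                const std::vector<VisibleObject>& visible,
                const std::vector<ObjectModel>& objects,
                std::vector<RenderCommandModel>& icb) {
    assert(threadId < visible.size() && threadId < icb.size());
    const VisibleObject& item = visible[threadId];
    const ObjectModel& object = objects[item.objectIndex];
    const LodModel& lod = object.lods[item.lodIndex];

    RenderCommandModel& command = icb[threadId];     // select the command by index
    command.pipelineState = object.pipelineState;    // set the shader
    command.vertexBuffer = lod.vertexBuffer;         // set the geometry buffers
    command.indexBuffer = lod.indexBuffer;
    command.materialBuffer = object.materialBuffer;  // set the material
    command.indexCount = lod.indexCount;             // encode the draw
}
```

Because each invocation writes only its own slot, calling `encodeDraw` for thread 1 before thread 0 produces the same command array as the reverse order.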

Now, let's look at 1 more sample.

Here are some of the basic things you need to do to create, encode, and execute an ICB.

To create it, you first fill out a descriptor.

The descriptor contains things like draw types, and inheritance properties, and per-stage bind counts.

This describes the way that the indirect buffer will behave.

When it's time to encode the ICB, you simply create a compute encoder and call dispatch just like what you've been doing already.

Once the ICB is encoded, you can optionally decide if you want to optimize it.

When you optimize it, you remove all the redundant state, and the end result is a lean and highly-efficient set of GPU commands.

Now, once the ICB is encoded and optimized, it's time to schedule it for execution.

You notice here that you could actually specify the exact range of commands that you execute.

Also in this example, we use an indirect buffer, which itself can be encoded by the GPU.

So once the ICB is encoded, it could be reused again and again, and the overhead is completely negligible.

So I'm really excited, but we actually went ahead and we put together a sample so you could take a look.

So here you could see a number of school buses in the middle of a city.

Each bus is composed of 500,000 polygons and 2000 individual parts.

Each part requires a separate draw call, its own material argument buffer, index buffer, and vertex buffer.

As you could imagine, this would be a lot of API calls on the CPU, but we are using indirect command buffers here, so everything is being encoded on the GPU.

We're also selecting the appropriate level of detail, and therefore, we're able to render multiple objects without increasing the CPU or GPU cost.

So on the left, you could see a view of the regular camera.

And on the right, we've zoomed in to a single bus, so you could see the level of detail actually changing.

ICBs enabled us to introduce another really incredible optimization.

We're able to split the geometry into chunks of a few hundred triangles and analyze those chunks in a separate compute kernel.

You could see the chunks in different colors on the screen.

Each thread of the kernel determines whether triangles are facing away from the camera or if they're obscured by other objects or geometry in the scene.

This is all really, really fast because we've performed the calculation for a chunk only and not on each individual triangle.

We then tell the GPU to only render the chunks that are actually visible.
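
The talk does not say how the sample decides that a whole chunk faces away from the camera. One common approach, assumed here purely for illustration, is to store a cone that bounds all triangle normals in the chunk and test that cone once. The C++ sketch below models only this back-facing test on the CPU; the occlusion test is omitted, all names are hypothetical, and the view direction is approximated from the chunk center (the camera is assumed not to sit exactly on it).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 subtract(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 normalize(Vec3 v) {
    const float length = std::sqrt(dot(v, v));
    return {v.x / length, v.y / length, v.z / length};
}

// Hypothetical per-chunk data: a center point plus a cone that bounds
// every triangle normal in the chunk.
struct ChunkModel {
    Vec3 center;
    Vec3 coneAxis;        // unit length
    float coneHalfAngle;  // radians; all normals lie within this angle of the axis
};

// A chunk is back-facing when every normal in its cone points away from
// the camera. With v = direction from camera to chunk, that holds when the
// angle between v and the axis is at most 90 degrees minus the half angle,
// which is the same as dot(v, axis) >= sin(halfAngle).
bool chunkFacesAway(const ChunkModel& chunk, Vec3 cameraPosition) {
    const Vec3 view = normalize(subtract(chunk.center, cameraPosition));
    return dot(view, chunk.coneAxis) >= std::sin(chunk.coneHalfAngle);
}

// Returns the indices of the chunks that should still be drawn.
std::vector<std::size_t> visibleChunks(const std::vector<ChunkModel>& chunks,
                                       Vec3 cameraPosition) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (!chunkFacesAway(chunks[i], cameraPosition)) result.push_back(i);
    }
    return result;
}
```

The point of the design is the same as in the talk: one test per chunk replaces a test per triangle, and only the surviving chunk indices are turned into draw commands.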

And again, let's see the side-by-side view.

The left side is your camera view, and the right side is another view of the bus.

You could see the red and pinkish tint there.

That is what our compute shaders determined is invisible.

We never actually send this work to the GPU, so it saves us 50% or more of the geometry rendering cost.

Here's 1 last view showing how much this technique could save you.

So notice on the right, many of the buses and ambulances are actually invisible.

This is really amazing.

I love this.

So please take a chance to explore the code, and I hope I'll see this technology in some of your games in the future.

I think if utilized, ICBs can really push your games to the next level.

So now, I'm pleased to introduce Michael, who will show you how to optimize for the A11, improve performance, and extend playtime.

Thank you very much.

[ Applause ]

Thanks, Brian.

So everything Brian's just showed you is available for iOS, tvOS, and macOS.

Next, I'm going to dive into some of the new Metal 2 features for Apple's latest GPU, the A11 Bionic, designed to help you maximize your game's performance and extend your playtime by reducing system memory bandwidth and reducing power consumption.

So Apple-designed GPUs have a tile-based deferred rendering architecture designed for both high performance and low power.

This architecture takes advantage of a high bandwidth, low-latency tile memory that eliminates overdraw and unnecessary memory traffic.

Now, Metal is designed to take advantage of the TBDR architecture automatically. Within each render pass, load and store actions make explicit how render pass attachments move in and out of tile memory.

But the A11 GPU takes the TBDR architecture even further.

We added new capabilities to our tile memory and added an entirely new programmable stage.

This opens up new optimization opportunities critical to advanced rendering techniques, such as deferred shading, order-independent transparency, tiled forward shading, and particle rendering.

So let's start by taking a look at the architecture of the A11 GPU.

All right.

So on the left, we have a block representation of the A11 GPU.

And on the right, we have system memory.

Now, the A11 GPU first processes all the geometry of a render pass in the vertex stage.

It transforms and bins your geometry into screen-aligned, tiled vertex buffers.

These tiled vertex buffers are then stored in the system memory.
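
To make the binning step concrete, here is a CPU-side C++ sketch that sorts screen-space triangles into per-tile lists. It is a conceptual model only and does not describe the A11 hardware: the function and type names are hypothetical, the tile size is a free parameter, and a triangle is conservatively added to every tile its bounding box overlaps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2D { float x, y; };
struct Triangle2D { Point2D a, b, c; };  // already transformed to screen space

// Returns one list of triangle indices per tile, in row-major tile order.
std::vector<std::vector<std::size_t>> binTriangles(
        const std::vector<Triangle2D>& triangles,
        int screenWidth, int screenHeight, int tileSize) {
    assert(screenWidth > 0 && screenHeight > 0 && tileSize > 0);
    const int tilesX = (screenWidth + tileSize - 1) / tileSize;
    const int tilesY = (screenHeight + tileSize - 1) / tileSize;
    std::vector<std::vector<std::size_t>> bins(
        static_cast<std::size_t>(tilesX) * static_cast<std::size_t>(tilesY));

    for (std::size_t i = 0; i < triangles.size(); ++i) {
        const Triangle2D& t = triangles[i];
        const float minX = std::min({t.a.x, t.b.x, t.c.x});
        const float maxX = std::max({t.a.x, t.b.x, t.c.x});
        const float minY = std::min({t.a.y, t.b.y, t.c.y});
        const float maxY = std::max({t.a.y, t.b.y, t.c.y});

        // Skip triangles that are entirely off screen.
        if (maxX < 0.0f || maxY < 0.0f ||
            minX >= static_cast<float>(screenWidth) ||
            minY >= static_cast<float>(screenHeight)) continue;

        // Clamp the covered tile range to the screen.
        const int firstX = std::max(0, static_cast<int>(minX) / tileSize);
        const int lastX = std::min(tilesX - 1, static_cast<int>(maxX) / tileSize);
        const int firstY = std::max(0, static_cast<int>(minY) / tileSize);
        const int lastY = std::min(tilesY - 1, static_cast<int>(maxY) / tileSize);

        for (int ty = firstY; ty <= lastY; ++ty) {
            for (int tx = firstX; tx <= lastX; ++tx) {
                bins[static_cast<std::size_t>(ty * tilesX + tx)].push_back(i);
            }
        }
    }
    return bins;
}
```

Each resulting list plays the role of one tiled vertex buffer: everything a tile needs is gathered in one place, so the tile can then be shaded on its own.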

Now, each tiled vertex buffer is then processed entirely on chip as part of the fragment stage.

This tiled architecture enables 2 major optimizations that your games get for free.

First, the GPU rasterizes all primitives in a tile before shading any pixels using fast, on-chip memory.

This eliminates overdraw, which improves performance and reduces power.

Second, a larger, more flexible tile memory is used to store the shaded fragments.

Blending operations are fast because all the data is stored on chip next to the shading cores.

Now, tile memory is written to system memory only once for each tile after all fragments have been shaded.

This reduces bandwidth, which also improves your performance and reduces your power.

Now, these optimizations happen underneath the hood.

You get them just by using Metal on iOS.

But Metal also lets you optimize rendering techniques by taking explicit control of the A11's tile memory.

Now, during the development of the A11 GPU, the hardware and software teams at Apple analyzed a number of important modern rendering techniques.

We noticed many common themes, and we found that explicit control of our tile memory accelerated all of them.

We then developed the hardware and software features together around this idea of explicit control.

So let's talk about these features.

So programmable blending lets you write custom blend operations in your shaders.

It's also a powerful tool you can use to merge render passes, and it's actually made available across all iOS GPUs.

Imageblocks are new for A11.

They let you maximize your use of tile memory by controlling pixel layouts directly in the shading language.

And tile shading is our brand-new programmable stage designed for techniques that require mixing graphics and compute processing.

Persistent threadgroup memory is an important tool for combining render and compute that allows you to communicate across both draws and dispatches.

And multi-sample color coverage control lets you perform resolve operations directly in tile memory using tile shaders.

So I'm going to talk to you about all these features, so let's start with programmable blending.

With programmable blending, your fragment shader has read and write access to pixels in tile memory.

This lets you write custom blending operations.

But programmable blending also lets you eliminate system memory bandwidth by combining multiple render passes that read and write the same attachments.

Now, deferred shading is a particularly good fit for programmable blending, so let's take a closer look at that.

So deferred shading is a many-light technique traditionally implemented using 2 passes.

In the first pass, multiple attachments are filled with geometry attributes visible at each pixel, such as normal, albedo, and roughness.

And in the second pass, fragments are shaded by sampling those G-buffer attachments.

Now, the G-buffers are stored in the system memory before being read again in the lighting pass, and this round trip from tile memory to system memory and back again can really bottleneck your game because the G-buffer traffic consumes a large amount of bandwidth.

Now, programmable blending instead lets you skip that round trip to memory by reading the current pixel's data directly from tile memory.

This also means that we no longer need 2 passes.

Our G-buffer fill and lighting steps are now encoded and executed in a single render pass.

It also means that we no longer need a copy of the G-buffer attachments in system memory.

And with Metal's memoryless render target feature, saving that memory is really, really simple.

You just create a texture with a memoryless flag set, and Metal's only going to let you use it as an attachment without load or store actions.

So now, let's take a look at how easy it is to adopt programmable blending in your shaders.

Okay, so here's what the fragment shader of your lighting pass would look like with programmable blending.

Programmable blending is enabled when you both read and write your attachments.

And in this example, we see that the G-buffer attachments are both inputs and outputs to our functions.

We first calculate our lighting using our G-buffer properties.

As you can see here, we're reading our attachments and we're not sampling them as textures.

We then accumulate our lighting result back into the G-buffer, and, in this step, we're both reading and writing our accumulation attachments.

So that's it.

Programmable blending is really that easy, and you should use it whenever you have multiple render passes that read and write the same attachments.

So now, let's talk about imageblocks, which allow you to merge render passes in even more circumstances.

Imageblocks give you full control of your data in tile memory.

Instead of describing pixels as arrays of render pass attachments in the Metal API, imageblocks let you declare your pixel layouts directly in the shading language as structs.

It adds new packed data types to the shading language that match the texture formats you already use, and these types are transparently packed and unpacked when accessed in the shader.

In fact, you can also use these new packed data types in your vertex buffers and constant buffers to more tightly pack all of your data.

Imageblocks also let you describe more complex per-pixel data structures.

You can use arrays, nested structs, or combinations thereof.

It all just works.

Now, direct control of your pixel layout means that you can now change the layout within a pass.

This lets you combine render passes to eliminate system memory bandwidth in ways that just weren't possible with programmable blending alone.

Let's take a look at an example.

So in our previous example, we used programmable blending to implement single-pass deferred shading.

You can also implement single-pass deferred shading using imageblocks.

Imageblocks only exist in tile memory, so there's no render pass attachments to deal with.

Not only is this a more natural way to express the algorithm, but now you're free to reuse the tile memory once you're finished reading the G-buffer after your lighting.

So let's go ahead and do that.

Let's reuse the tile memory to add an order-independent transparency technique called multi-layer alpha blending.

So multi-layer alpha blending, or MLAB, maintains a per-pixel, fixed-size array of translucent fragments.

Each incoming fragment is sorted by depth into the array.

If a fragment's depth lies beyond the last element of the array, then those elements are merged, so it's really an approximate technique.

Now, sorting the MLAB array is really fast because it lives in tile memory.

Doing the same off chip would be really expensive because of the extra bandwidth and synchronization overhead.
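
The insert-and-merge behavior of MLAB can be modeled with a short C++ sketch. It makes several assumptions that are not from the talk: the array holds 2 layers, color is a single premultiplied channel, and whenever an insertion overflows the array, the two farthest entries are merged with a standard "over" composite. All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One translucent layer. Color is a single premultiplied channel to keep
// the sketch short; transmittance is 1 - alpha.
struct Layer {
    float depth;
    float color;
    float transmittance;
};

constexpr std::size_t kLayerCount = 2;  // hypothetical fixed array size

struct PixelLayers {
    std::array<Layer, kLayerCount> layers;
    std::size_t count = 0;
};

// Composite `back` behind `front` into a single layer.
inline Layer mergeLayers(const Layer& front, const Layer& back) {
    return {front.depth,
            front.color + back.color * front.transmittance,
            front.transmittance * back.transmittance};
}

// Insert a fragment in depth order (nearest first). When the array would
// overflow, the two farthest entries are merged so storage stays fixed.
inline void insertFragment(PixelLayers& pixel, Layer fragment) {
    std::array<Layer, kLayerCount + 1> sorted;
    std::size_t n = 0;
    bool placed = false;
    for (std::size_t i = 0; i < pixel.count; ++i) {
        if (!placed && fragment.depth < pixel.layers[i].depth) {
            sorted[n++] = fragment;
            placed = true;
        }
        sorted[n++] = pixel.layers[i];
    }
    if (!placed) sorted[n++] = fragment;

    if (n > kLayerCount) {
        sorted[kLayerCount - 1] =
            mergeLayers(sorted[kLayerCount - 1], sorted[kLayerCount]);
        n = kLayerCount;
    }
    for (std::size_t i = 0; i < n; ++i) pixel.layers[i] = sorted[i];
    pixel.count = n;
}
```

The merge is where the approximation comes from: once two layers are combined, a later fragment that falls between them in depth can no longer be blended in exactly the right order.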

Now, the A11 actually doubles the maximum supported pixel size over our previous generation, but that's still not going to be enough to contain both the G-buffer and MLAB data structures simultaneously.

Fortunately, you don't need both at the same time.

Imageblocks let you change your pixel layouts inside the render pass to match your current needs.

So changing pixel layouts actually requires tile shading, so let's talk about that next.

So tile shading is the new programmable stage that provides compute capabilities directly in the render pass.

This stage is going to execute a configurable threadgroup for each tile.

For example, you can launch a single thread per tile, or you can launch a thread per pixel.

Now, tile shading lets you interleave draw calls and threadgroup dispatches that operate on the same data.

Tile shaders have access to all of tile memory, so they can read and write any pixel of the imageblock.

So let's look at how tile shading can optimize techniques such as tiled forward shading.

So like deferred shading, tiled forward shading is a many-light technique.

It's often used when MSAA is important or when a variety of materials are needed and works equally well for both opaque and translucent geometry.

Now, tiled forward shading traditionally consists of 3 passes.

First, a render pass generates a scene depth buffer.

Second, a compute pass calculates per-tile depth bounds and per-tile light lists using that scene depth buffer.

And finally, another render pass is going to shade the pixels in each tile using the corresponding light list.

Now, this pattern of mixing render with compute occurs frequently.

And prior to A11, communicating across these passes required system memory.

But with tile shading, we can inline the compute so that the render passes can be merged.

Here the depth bounds and light culling steps are now implemented as tile shaders and inlined into a single render pass.

Depth is now only stored in the imageblock but is accessible across the entire pass.

So, now, tile shading is going to help you eliminate a lot of bandwidth, but these tile shader outputs are still being stored to system memory.

Tile shader dispatches are synchronized with draws, so that's completely safe to do, but I think we could still do better using our next feature, persistent threadgroup memory.

Okay, so threadgroup memory is a well-known feature of Metal compute.

It lets threads within a threadgroup share data using fast, on-chip memory.

Now, thanks to tile shading, threadgroup memory is now also available in the render pass.

But threadgroup memory in the render pass has 2 new capabilities not traditionally available to compute.

First, a fragment shader now also has access to the same threadgroup memory.

And second, the contents of threadgroup memory persist across the entire life of a tile.

Taken together, this makes a powerful tool for sharing data across both draws and dispatches.

In fact, we believe it's so useful that we've actually doubled the maximum size of threadgroup memory over our previous generation so that you can store more of your intermediate data on chip.

Okay, so now, let's use threadgroup persistence to further optimize our tiled forward shading example.

So with persistence, the tile shading stage can now write both the depth bounds and the culled light lists into threadgroup memory for later draws to use.

This means that now all our intermediate data stays on chip and never leaves the GPU.

Only the final image is stored to system memory.

Minimizing bandwidth to system memory is, again, very important for your game's performance and playtime.

Now, let's take a look at how easy it is to make use of persistence in the shading language.

Okay, so the top function here is our tile shader, and it's going to perform our light culling.

It intersects each light with a per-tile frustum to compute an active light mask.

The bottom function is our fragment shader that performs our forward shading.

It shades only the lights intersecting the tile using that active light mask.

Now, sharing threadgroup memory across these functions is achieved by using the same type and bind point across both shaders.

That's how easy it is to take advantage of threadgroup persistence.
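
As a rough illustration of the two steps just described, here is a CPU-side C++ model, not the Metal shader from the session. The type and function names are hypothetical, and the culling test is simplified to a depth-range overlap rather than a full per-tile frustum test. The mask returned by the first function plays the role of the data that persists in threadgroup memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side stand-ins; none of these names are Metal API.
struct Light { float minDepth; float maxDepth; };      // depth extent of a light
struct TileBounds { float minDepth; float maxDepth; }; // per-tile depth bounds

// "Tile shader" step: build a bitmask of the lights that overlap the tile.
// A real implementation would also test the tile frustum's side planes.
uint32_t cullLights(const TileBounds& tile, const std::vector<Light>& lights) {
    assert(lights.size() <= 32); // one bit per light in this sketch
    uint32_t mask = 0;
    for (std::size_t i = 0; i < lights.size(); ++i) {
        bool overlaps = lights[i].maxDepth >= tile.minDepth &&
                        lights[i].minDepth <= tile.maxDepth;
        if (overlaps) mask |= (1u << i);
    }
    return mask;
}

// "Fragment shader" step: visit only the lights whose bit is set.
int countShadedLights(uint32_t mask) {
    int shaded = 0;
    while (mask != 0) {
        mask &= mask - 1; // clear the lowest set bit
        ++shaded;         // a real shader would accumulate lighting here
    }
    return shaded;
}
```

With four lights of which two overlap the tile's depth range, the second function does two iterations instead of four.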

Okay, so now that you've seen tile shading and threadgroup persistence, let's revisit our order-independent transparency example.

Okay, so remember how I said that changing imageblock layouts requires tile shading?

That's because tile shading provides the synchronization we need to safely change layouts.

This means we actually have to insert a tile shader between the lighting and the MLAB steps.

So tile shading is going to wait for the lighting stage to complete before transitioning from G-buffer layout to MLAB layout, and it's also going to carry forward the accumulated lighting value from the lighting step into the MLAB step for final blending.

Okay, so now that we've covered imageblocks, tile shading, and threadgroup persistence, it's time to move on to our final topic, multi-sample anti-aliasing and sample coverage control.

So multi-sample anti-aliasing improves image quality by supersampling depth, stencil, and blending, but shades only once per pixel.

Multiple samples are later resolved into a final image using simple averaging.

Now, multi-sampling is efficient on all A series GPUs because samples are stored in tile memory, where blending and resolve operations have fast access to the samples.

The A11 GPU optimizes multi-sampling even further by tracking the unique colors within each pixel.

So blending operations that previously operated on each sample now only operate on each color.

This could be a significant savings because the interior of every triangle only contains 1 unique color.

Now, this mapping of unique color to samples is called color coverage control, and it's managed by the GPU.

But tile shaders can also read and modify this color coverage.

And we can use this to perform custom resolves in place and in fast tile memory.

Now, to see why this is useful, let's take a look at a multi-sampled scene that also renders particles.

Now, particles are transparent, so we blend them after rendering our opaque scene geometry.

But particle rendering doesn't benefit from multi-sampling because it doesn't really have any visible edges.

So to avoid the extra cost of blending per sample for no good reason, a game would render using 2 passes.

In the first pass, your opaque scene geometry is rendered using multi-sampling to reduce aliasing.

And then, you're going to resolve your color and depth to system memory, and we're resolving depth because particles can later be occluded.

Then in the second pass, the resolved color and depth are used in rendering the particles without multi-sampling.

Now, as you probably guessed by now, our goal is to eliminate the intermediate system memory traffic using tile shading to combine these 2 passes.

But tile shading alone isn't enough.

We need color coverage control to change the multi-sampling rate in place.

Using color coverage control is really powerful and really easy.

Let's take a look at the shader.

Okay, so remember that our goal here is to average the samples of each pixel and then store that result back into the imageblock as the overall pixel value.

Now, instead of looping through each sample, we're going to take advantage of the color rate capabilities of the A11 and only loop through unique colors.

To properly average across all samples, we need to weight each color by the number of samples associated with it, and we do this by counting the bits set in the color coverage mask.

We then complete our averaging by dividing by the total number of samples and, finally, write the result back into the imageblock.

The output sample mask tells Metal to apply the results to all samples of the pixel.

And since all samples now share the same value, the later particle draws are going to blend per pixel rather than per sample.
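
The weighting step can be sketched on the CPU in C++ as follows. This is a model of the arithmetic only, not the tile shader shown in the session. The `CoveredColor` layout and the function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Color { float r, g, b; };

// One unique color plus the mask of samples it covers (hypothetical layout).
struct CoveredColor { Color color; uint32_t sampleMask; };

// Count the bits set in a coverage mask.
int countBits(uint32_t mask) {
    int n = 0;
    for (; mask != 0; mask &= mask - 1) ++n;
    return n;
}

// Average over all samples while looping over unique colors only.
Color resolvePixel(const std::vector<CoveredColor>& colors, int sampleCount) {
    assert(sampleCount > 0);
    Color sum{0.0f, 0.0f, 0.0f};
    for (const CoveredColor& c : colors) {
        // Weight each color by how many samples it is associated with.
        float weight = static_cast<float>(countBits(c.sampleMask));
        sum.r += c.color.r * weight;
        sum.g += c.color.g * weight;
        sum.b += c.color.b * weight;
    }
    float inv = 1.0f / static_cast<float>(sampleCount);
    return Color{sum.r * inv, sum.g * inv, sum.b * inv};
}
```

For a 4-sample pixel where one color covers 3 samples and another covers 1, the loop runs twice rather than four times and produces the same average.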

So that's it for sample coverage control.

Now, optimizing for Apple GPUs is really important for maximizing your game's performance and extending its playtime, but there's a lot more work that goes into shipping a title on iOS, especially one that's originally designed for desktops and consoles.

To talk about that now and to put into practice what we just discussed, I'd like to bring on Nick Penwarden from Epic Games.

Nick?

[ Applause ]

Thank you, Michael.

So, yeah. I'd like to talk a little bit about how we took a game that was originally made for desktop and console platforms and brought it to iOS using Metal.

So some of the technical challenges we faced.

The Battle Royale map is 1 map.

It's larger than 6 kilometers squared.

That means that it will not all fit into memory.

We also have dynamic time of day, destruction.

Players can destroy just about any object in the scene.

Players can also build their own structures.

So the map is very dynamic, meaning we can't do a lot of precomputation.

We have 100 players in the map, and the map has over 50,000 replicating actors that are simulated on the server and replicated down to the client.

Finally, we wanted to support crossplay between console and desktop players along with mobile.

And that's actually a really important point because it limited the amount that we could scale back the game in order to fit into the performance constraints of the device.

Basically, if something affected gameplay, we couldn't change it.

So if there's an object and it's really small, it's really far away, maybe normally you would cull it, but in this case, we can't because if a player can hide behind it, we need to render it.

So want to talk a little bit about Metal.

Metal is really important in terms of allowing us to ship the game as fast as we did and at the quality that we were able to achieve.

Draw call performance was key to this because, again, we have a really complicated scene and we need the performance to render it, and Metal gave us that performance.

Metal also gave us access to a number of hardware features, such as programmable blending, that we used to get important GPU performance back.

It also has a feature set that allowed us to use all of the rendering techniques we need to bring Fortnite to iOS.

In terms of rendering features, we use a movable directional light for the sun with cascaded shadow maps.

We have a movable skylight because the sky changes throughout the day.

We use physically-based materials.

We render in HDR and have a tone-mapping pass at the end.

We allow particle simulation on the GPU.

And we also support all of our artist-authored materials.

It's actually a pretty important point because some of our materials are actually very complicated.

For instance, the imposters that we use to render trees in the distance efficiently were entirely created by a technical artist at Epic using a combination of blueprints and the material shader graph.

So in terms of where we ended up, here is an image of Fortnite running on a Mac at high scalability settings.

Here it is running on a Mac at medium scalability settings.

And here it is on an iPhone 8 Plus.

So we were able to faithfully represent the game on an iPhone about at the quality that we achieve on a mid-range Mac.

So let's talk a little bit about scalability.

We deal with scalability both across platforms as well as within the iOS ecosystem.

So across platform, this is stuff that we need to fit on the platform at all, like removing LODs from meshes that will never display so we can fit in memory or changing the number of characters that we animate at a particular quality level in order to reduce CPU costs.

Within iOS, we also defined 3 buckets for scalability (low, mid, and high), and these were generally correlated with the different generations of iPhones: iPhone 6s on the low end, iPhone 7 as our mid-range target, and the iPhone 8 and iPhone X on the high end.

Resolution was obviously the simplest and best scalability option that we had.

We ended up tuning this per device.

We preferred to use backbuffer resolution where possible (this is what the UI renders at) because if we do this, then we don't have to pay a separate upsampling cost.

However, we do support rendering the 3D scene at a lower resolution, and we do so in some cases where we needed a crisp UI but had to reduce the 3D render resolution below that in order to meet our performance goals, on the iPhone 6s, for example.

Shadows were another axis of scalability and actually really important because they impact both the CPU and the GPU.

On low-end devices, we don't render any shadows at all.

On our mid-range target, we have 1 cascade, 1024 by 1024.

We set the distance to be about the size of a building, so if you're inside of a structure, you're not going to see light leaking on the other side.

High-end phones add a second cascade, which gives crisper character shadows as well as lets us push out the shadowing distance a little further.

Foliage was another axis of scalability.

On low-end devices, we simply don't render foliage.

On the mid range, we render about 30% of the density we support on console.

And on high-end devices, we actually render 100% of the density that we support on console.
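
The shadow and foliage settings per bucket can be summarized as a small table in C++. This is a sketch only: the enum, struct, and function names are hypothetical and are not Unreal's actual scalability settings. The values are the ones quoted above, with the mid-range foliage figure being approximate.

```cpp
#include <cassert>

enum class QualityBucket { Low, Mid, High };

struct RenderSettings {
    int shadowCascades;        // number of shadow cascades rendered
    int foliageDensityPercent; // relative to the density supported on console
};

// Map a device bucket to the settings described in the talk.
RenderSettings settingsFor(QualityBucket bucket) {
    switch (bucket) {
        case QualityBucket::Low:  return {0, 0};   // no shadows, no foliage
        case QualityBucket::Mid:  return {1, 30};  // 1 cascade, about 30% density
        case QualityBucket::High: return {2, 100}; // second cascade, full density
    }
    assert(false && "unknown bucket");
    return {0, 0};
}
```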

Memory is interesting in terms of scalability because it doesn't always correlate with performance.

For instance, an iPhone 8 is faster than an iPhone 7 Plus, but it has less physical memory.

This means when you're taking into account scalability, you need to treat memory differently.

We ended up treating it as an orthogonal axis of scalability and just had 2 buckets, low memory and high memory.

For low-memory devices, we disabled foliage and shadows.

We also reduced some of the memory pool.

So for instance, we limited GPU particles to a total of 16,000 and reduced the pools used for cosmetics and texture memory.

We still needed to do quite a bit of memory optimization in order to get the game to fit on the device.

The most important was level streaming: basically, just making sure that nothing is in memory that is not around the player.

We also used ASTC texture compression and tend to prefer compressing for size rather than quality in most cases.

And we also gave our artists a lot of tools for being able to cook out different LODs that aren't needed or reduce audio variations on a per-platform basis.

Want to talk a little bit about frame rate targets.

So on iOS, we wanted to target 30 fps at the highest visual fidelity possible.

However, you can't just max out the device.

If we were maxing out the CPU and the GPU the entire time, the operating system would end up downclocking us, then we'd no longer hit our frame rates.

We also want to conserve battery life.

If players are playing several games in a row during their commute, we want to support that rather than their device dying before they even make it to work.

So for this, what we decided to do was to target 60 frames per second for the environment, but vsync at 30, which means most of the time when you're exploring the map in Fortnite, your phone is idle about 50% of the time.

We use that time to conserve battery life and keep cool.
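
The 50% figure follows from the two frame rates. Here is that arithmetic as a small C++ helper. The function name is made up for illustration, and it assumes each frame takes exactly the tuned budget to produce.

```cpp
#include <cassert>

// Fraction of each presented frame interval that is left idle when content
// tuned for one frame rate is presented at a lower, vsynced rate.
double idleFraction(double tunedFps, double presentedFps) {
    assert(presentedFps > 0.0 && tunedFps >= presentedFps);
    double workMs = 1000.0 / tunedFps;         // time to produce one frame
    double intervalMs = 1000.0 / presentedFps; // time between presented frames
    return 1.0 - workMs / intervalMs;
}
```

Calling `idleFraction(60.0, 30.0)` compares roughly 16.7 ms of work against a 33.3 ms interval, which leaves about half of each interval idle.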

To make sure that we hit those targets, we track performance every day.

So every day, we have an automation pass that goes through.

We look at key locations in the map, and we capture performance.

So for instance, Tilted Towers, and Shifty Shafts, and all of the usual POIs that you're familiar with in Battle Royale.

When one goes over budget, we know we need to dive in, figure out where performance is going, and optimize.

We also have daily 100-player playtests where we capture the dynamic performance that you'll only see during a game.

We track key performance over time for the match, and then we can take a look at this afterwards and see how it performed, look for hitches, stuff like that.

And if something looks off, we can pull an instrumented profile off of the device, take a look at where time was going, and figure out where we need to optimize.

We also support replays.

This is a feature in Unreal that allows us to go and replay that match from a client perspective.

So we can play it over and over, analyze it, profile it, and even see how optimizations would have affected the client in that play session.

I'm going to talk a little bit about Metal specifically.

So on most devices, we have 2 cores, and the way we utilize that is we have a traditional game thread/rendering thread split.

On the game thread, we're doing networking, simulation, animation, physics, and so on.

The rendering thread does all of scene traversal, culling, and issues all of the Metal API calls.

We also have an async thread.

Mostly, it's handling streaming tasks: texture streaming as well as level streaming.

On newer devices where we have 2 fast and 4 efficient cores, we add 3 more task threads and enable some of the parallel algorithms available in Unreal.

So we take animation, put it, simulate it over on multiple frames, CPU particles, physics, and so on, scene culling, a couple other tasks.

I mentioned draw calls earlier.

Draw calls were our main performance bottleneck, and this is actually where Metal really helped us out.

We found Metal to be somewhere on the order of 3 to 4 times faster than OpenGL for what we were doing, and that allowed us to ship without doing a lot of aggressive work trying to reduce draw calls.

We did stuff to reduce draw calls, mostly pulling in cull distance on decorative objects as well as leveraging the hierarchical level of detail system.

So here's an example.

This is one of those POIs that we tracked over time.

If you're familiar with the game, this is looking down on Tilted Towers from a cliff and was kind of our draw call hot spot in the map.

As you can see, it takes about 1300 draw calls to render this.

This is just for the main pass.

It doesn't include shadows, UI, anything else that consumed draw call time.

But Metal's really fast here.

On an iPhone 8 Plus, we were able to chew through that in under 5 milliseconds.

I mentioned hierarchical LOD.

This is a feature we have in Unreal where we can take multiple draw calls and generate a simplified version, a simplified mesh, as well as a material so that we can basically render a representation of that area in a single draw call.

We use this for taking POIs and generating the simplified versions for rendering very, very far away.

For instance, during the skydive, you can see the entire map.

In fact, when you're on the map, you can get on a cliff or just build a very high tower of your own and see points of interest from up to 2 kilometers away.

Digging into some of the other details on the Metal side, I want to talk a little bit about pipeline state objects.

This was something that took us a little bit of time to get into a shippable state for Fortnite.

You really want to minimize how many PSOs you're creating while you're simulating the game during the frame.

If you create too many, it's very easy to hitch and create a poor player experience.

So first of all, follow best practices, right.

Compile your functions offline, build your library offline, and pull all of your functions into a single library.

But you really want to make sure you create all of your PSOs at load time.

But what do you do if you can't do that?

So for us, the permutation matrix is just crazy.

There's way too many for us to realistically create at load time.

We have multiple artist-authored shaders (thousands of them), multiple lighting scenarios based on number of shadow cascades and so on, different render target formats, MSAA.

The list goes on.

We tried to minimize permutations where we could, and this does help.

Sometimes a dynamic branch is good enough and better than creating a static permutation, but sometimes not.

What we had to do is we decided to identify the most common subset that we're likely to need, and we create those at load.

We don't try to create everything.

The way we achieved this is we created an automation pass where we basically flew a camera through the environment and recorded all of the PSOs that we actually needed to render the environment.

Then, during our daily playtests, we harvested any PSOs that were created that were not caught by that automation pass.

The automation pass also catches, like, cosmetics, and effects from firing different weapons, and so on.

We take all of that information from automation and from the playtest, combine it into a list.

That's what we create at load time, and that's what we ship with the game.

It's not perfect, but we find that the number of PSOs we create at runtime is in the single digits for any given play session, on average.

And so players don't experience any hitching from PSO creation.
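
The bookkeeping behind that approach might look something like the following C++ sketch. It is not Epic's implementation. `PsoKey` is a hypothetical identifier for one pipeline state permutation, for example a hash of the shader, render target format, and MSAA setting.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical key identifying one pipeline state permutation.
using PsoKey = uint64_t;

// Combine the PSOs recorded by the automation fly-through with the extra
// ones harvested from playtests into the list that is created at load time.
std::set<PsoKey> buildLoadTimeList(const std::vector<PsoKey>& fromAutomation,
                                   const std::vector<PsoKey>& fromPlaytests) {
    std::set<PsoKey> list(fromAutomation.begin(), fromAutomation.end());
    list.insert(fromPlaytests.begin(), fromPlaytests.end());
    return list;
}

// Count how many distinct PSOs a session would still create at runtime.
int countRuntimeCreations(const std::set<PsoKey>& preloaded,
                          const std::vector<PsoKey>& requestedDuringSession) {
    std::set<PsoKey> created;
    for (PsoKey key : requestedDuringSession) {
        if (preloaded.count(key) == 0) created.insert(key);
    }
    return static_cast<int>(created.size());
}
```

Tracking the second number per play session is one way to check that runtime creation stays rare enough to avoid hitches.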

Resource allocation.

So basically, creating and deleting resources is expensive or can be expensive.

It's kind of like, think of [inaudible].

You really want to minimize the number of [inaudible] you're making per frame.

You really don't want to be creating and destroying a lot of resources on the fly, but when you're streaming in content dynamically, when you have a lot of movable objects, some of this just isn't possible to avoid.

So what we did for buffers is we just used buffer suballocation: basically, a bin allocation strategy.

Upfront, we allocate a big buffer, and then we suballocate small chunks back to the engine to avoid asking Metal for new buffers all the time.

And this ended up helping a lot.
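
A minimal C++ sketch of a bin-style suballocator is shown below. It assumes power-of-two bins with a 256-byte minimum, neither of which comes from the talk. Offsets stand in for ranges of the one big buffer, and the class name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical suballocator over one large buffer allocated up front.
class BinAllocator {
public:
    explicit BinAllocator(std::size_t capacity) : capacity_(capacity) {}

    // Returns an offset into the big buffer, or capacity() if it is full.
    std::size_t allocate(std::size_t size) {
        std::size_t bin = roundUpToBin(size);
        std::vector<std::size_t>& freeList = freeLists_[bin];
        if (!freeList.empty()) { // reuse a freed chunk from this bin
            std::size_t offset = freeList.back();
            freeList.pop_back();
            return offset;
        }
        if (next_ + bin > capacity_) return capacity_; // out of space
        std::size_t offset = next_;
        next_ += bin;
        return offset;
    }

    // Return a chunk to its bin so a later allocation can reuse it.
    void release(std::size_t offset, std::size_t size) {
        assert(offset < capacity_);
        freeLists_[roundUpToBin(size)].push_back(offset);
    }

    std::size_t capacity() const { return capacity_; }

private:
    // Round up to the next power of two, with an assumed 256-byte minimum.
    static std::size_t roundUpToBin(std::size_t size) {
        std::size_t bin = 256;
        while (bin < size) bin *= 2;
        return bin;
    }

    std::size_t capacity_;
    std::size_t next_ = 0;
    std::map<std::size_t, std::vector<std::size_t>> freeLists_;
};
```

Releasing a chunk and then asking for a similar size hands back the same offset, so no new buffer has to be requested from the API.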

We also leveraged programmable blending to reduce the number of resolves and restores and the amount of memory bandwidth we use.

Specifically, the main use case we have for this is anywhere we need access to scene depth, so things like soft particle blending or projected decals.

What we do is during the forward pass, we write our linear depth to the alpha channel.

And then, during our decal and translucent passes, all we need to do is use programmable blending to read that alpha channel back, and we can use depth without having ever had to resolve the depth buffer to main memory.
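
To illustrate what reading that alpha channel enables, here is a C++ model of a soft particle fade. The fade formula and the `fadeDistance` parameter are assumptions rather than anything stated in the talk, and `FramebufferPixel` is a hypothetical stand-in for the value programmable blending reads back.

```cpp
#include <algorithm>
#include <cassert>

// What the forward pass leaves in the framebuffer: color plus linear scene
// depth stored in the alpha channel (hypothetical CPU-side stand-in).
struct FramebufferPixel { float r, g, b, linearDepth; };

// Soft particle fade: the particle fades out as it approaches the opaque
// surface behind it. fadeDistance is an assumed tuning parameter.
float softParticleFade(const FramebufferPixel& dst, float particleDepth,
                       float fadeDistance) {
    assert(fadeDistance > 0.0f);
    // Scene depth comes from the alpha channel, so no depth resolve is needed.
    float gap = dst.linearDepth - particleDepth;
    return std::min(std::max(gap / fadeDistance, 0.0f), 1.0f);
}
```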

We also use it to improve the quality of MSAA.

As I mentioned, we do HDR rendering, and just an MSAA resolve of HDR can still lead to very aliased edges.

Think of cases where you have a very, very bright sky and a very, very dark foreground.

If you just do a box filter over that and 1 of those subsamples is incredibly bright and the others are incredibly dark, the result is going to be an incredibly bright pixel.

And when tone mapped, it'll be something close to white.

You end up with edges that don't look anti-aliased at all.

So our solution to this was to do a pre tone map over all of the MSAA samples, then perform the normal MSAA resolve, and then the first postprocessing pass just reverses that pre tone map.

We use programmable blending for the pre tone map pass.

Otherwise, we'd have to resolve the entire MSAA color buffer to memory and read it back in, which would be unaffordable.
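
The effect is easy to see with numbers. The C++ sketch below compares a plain box-filter resolve with a pre tone mapped one on a single channel. The talk does not say which curve is used, so the reversible curve x / (1 + x) here is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Assumed reversible tone curve for illustration, and its inverse.
float preToneMap(float x) { return x / (1.0f + x); }
float inversePreToneMap(float y) { return y / (1.0f - y); }

// Plain box-filter resolve in linear HDR space.
float resolveLinear(const std::vector<float>& samples) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) sum += s;
    return sum / static_cast<float>(samples.size());
}

// Pre tone map each sample, box filter, then reverse the curve afterwards.
float resolvePreToneMapped(const std::vector<float>& samples) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) sum += preToneMap(s);
    return inversePreToneMap(sum / static_cast<float>(samples.size()));
}
```

With one sample at 1000 and three at 0.01, the linear resolve stays in the hundreds, while the pre tone mapped resolve lands well below 1, so the edge pixel ends up much closer to the dark foreground.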

Looking forward to some of the work we'd like to do in the future with Metal: parallel rendering.

So on macOS, we do support creating command buffers in parallel.

On iOS, we'd really need to support parallel command encoders for this to be practical.

A lot of our drawing ends up happening in the main forward pass, and so it's important to parallelize that.

I think it would be very interesting to sort of see the effects of parallel rendering on a monolithic, fast core versus what we had for parallel command encoders on the efficient cores on higher-end devices.

Could be some interesting results in terms of battery usage.

Metal heaps.

So we'd like to replace our buffer suballocation with Metal heaps: first, because it'll just simplify our code, but second, because we can also use it for textures.

We still see an occasional hitch due to texture streaming because we're obviously creating and destroying textures on the fly as we bring textures in and out of memory.

Being able to use heaps for this will get rid of those hitches.

For us, the work we have in front of us to make that possible is setting up precise fencing between the different passes.

So we need to know explicitly if a resource is being read or written by a vertex or pixel shader across different passes, and it requires some reworking of some of our renderer to make that happen.

And of course, continue to push the high end of graphics on iOS.

Last year at WWDC, we showed what was possible by bringing our desktop-class forward renderer to high-end iOS devices, and we want to continue pushing that bar on iOS, continuing to bring desktop-class features to iOS and looking for opportunities to unify our desktop renderer with the iOS renderer.

And with that, I'll hand it back to Michael.

[ Applause ]

So Metal is low overhead out of the box, but rendering many objects efficiently can require multithreading.

Metal is built to take advantage of all the CPUs in our systems.

Metal is also really accessible, but advanced rendering sometimes requires explicit control.

Metal provides this control when you need it for memory management and GPU parallelism.

We also introduced indirect command buffers, our brand-new feature that lets you move command generation entirely to the GPU, freeing the CPU for other tasks.

Together with argument buffers, these features provide a complete solution to GPU-driven pipelines.

And finally, Metal lets you leverage the advanced architecture of the A11 GPU to optimize many rendering techniques for both maximum performance and extended playtime.

For more information, please visit our website, and be sure to visit us in tomorrow's lab.

Thank you.

[ Applause ]
