Metal for VR 

Session 611 WWDC 2018

On macOS, Metal 2 adds specialized support for virtual reality (VR) rendering and external GPUs. Learn about new features and optimizations to take advantage of these technologies within your Metal 2-based apps and games. Understand best practices for scheduling workloads across multiple GPUs and techniques for frame pacing while multi-threading.

[ Music ]

[ Applause ]

Good morning everyone.

My name is Karol Gasinski, and I am a member of GPU Software Architecture Team at Apple.

We will start this session with a brief summary of what is new in macOS in terms of VR adoption.

Then, we will take a deep dive into new Metal 2 features designed specifically for VR this year.

And finally, we will end this session with advanced techniques for developing VR applications.

Recently, we introduced the new iMac and iMac Pro, which have great GPUs on board. The iMac is now equipped with Polaris-based GPUs and has up to 8 gigabytes of video memory on board.

The iMac Pro is equipped with even more advanced and even bigger Vega-based GPUs, with up to 16 gigabytes of video memory.

That’s a lot of power that is now in your hands.

But we are not limiting ourselves to iMacs alone.

With the recent announcement of external GPU support, you can now turn any Mac into a powerful workstation that gives you more than 10 teraflops of processing power.

And that’s not all.

Today we are introducing plug and play support for the HTC Vive Pro head-mounted display.

It has two panels with 1,440 by 1,600 pixels and 615 pixels per inch.

That’s a 78% increase in resolution and a 37% increase in pixel density compared to Vive.

And with support for better panels comes support for its new dual-camera front-facing system, so developers will now be able to use those cameras to experiment with pass-through video on Mac.

And together with Vive Pro support comes an improved tracking system.

So, you might be wondering how you can start developing VR applications on macOS.

Both HTC Vive and Vive Pro work in conjunction with Valve’s SteamVR runtime, which provides a number of services, including the VR compositor.

Valve also makes the OpenVR framework available on macOS, so that you can build apps that work with SteamVR.

We’ve been working very closely with both Valve and HTC to make sure that Vive Pro is supported in the SteamVR runtime on macOS.

So, now let’s see how the new Metal features that we’re introducing in macOS Mojave can be used to further optimize your VR application.

As a quick refresher, let’s review current interaction between application and VR compositor.

The application will start by rendering the image for the left and right eye into 2D multi-sample textures.

Then it will resolve those images into IOSurface-backed textures that can be passed on to the VR compositor.

The VR compositor will perform a final processing step that includes lens distortion correction, chromatic aberration correction, and other operations.

We can call it warp for short.

Once the final image is produced, it can be sent to the headset for presentation.

There is a lot of work here that is happening twice, so let’s see if we can do something about that.

Until now, if a VR application wanted to benefit from multi-sample anti-aliasing (MSAA), it needed to use either dedicated textures per eye or a single shared one for both.

But none of those layouts is perfect.

The dedicated textures require separate draw calls and passes, as we just saw.

While shared textures enable rendering of both eyes in a single render and resolve pass, they are problematic when it comes to post-processing effects.

Array textures have all the benefits of both dedicated and shared layouts, but until now they couldn’t be used with MSAA.

This was forcing app developers to use different rendering path layouts, depending on whether they wanted to use MSAA or not.

Or use different tricks to work around it.

So let’s see how we can optimize this rendering.

Today, we introduce a new texture type: the 2D multi-sample array texture.

This texture type has all the benefits of previously mentioned types without any of the drawbacks.

Thanks to that, it is now possible to separate from each other the rendering space, which simplifies post-processing effects; the view count, so that the application can easily fall back to monoscopic rendering; and control over the anti-aliasing mode.

As a result, the application can now have a single rendering path that can be easily adapted to any situation and, most importantly, that can be rendered with a single draw call and render pass in each case.

So here we see a code snippet for the creation of the mentioned 2D multi-sample array texture.

We set the sample count to 4, as it’s an optimal tradeoff between quality and performance, and at the same time, we set the array length to 2, as we want to store the image for each eye in a separate slice.
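The slide code is not part of this transcript. As a rough sketch of what such a descriptor could look like in Swift, assuming the function name, eye dimensions, and pixel format below (none of which come from the session), the setup might be:

```swift
import Metal

// Sketch: describe a 2D multisample array texture that holds both eye images.
// The function name, dimensions, and pixel format are illustrative assumptions.
func makeStereoMSAADescriptor(eyeWidth: Int, eyeHeight: Int) -> MTLTextureDescriptor {
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = .type2DMultisampleArray // texture type added in macOS 10.14
    descriptor.pixelFormat = .bgra8Unorm_srgb        // assumed format, not from the talk
    descriptor.width = eyeWidth
    descriptor.height = eyeHeight
    descriptor.sampleCount = 4                       // 4x MSAA, as described in the talk
    descriptor.arrayLength = 2                       // one slice per eye
    descriptor.storageMode = .private                // GPU-only render target
    descriptor.usage = [.renderTarget]
    return descriptor
}
```

The texture itself would then be created with `device.makeTexture(descriptor:)`, which returns an optional, so a real application should handle a `nil` result.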

So let’s see how our pipeline will change.

We can now replace those 2D multi-sample textures with a single 2D multi-sample array one.

So now the application can render both eyes in a single render pass, and if it’s using instancing, it can even do that in a single draw call.

So that already looks great, but we still need to resolve those 2D multi-sample array texture slices into separate IOSurfaces before we pass them to the compositor.

So let’s focus on the way the application shares textures with the compositor.

Right now, for sharing textures, we use IOSurfaces.

They allow sharing textures between different process spaces and different GPUs, but with that [inaudible] comes a price.

IOSurfaces can only be used to share simple 2D textures, so if you have a multi-sampled one, one storing [inaudible], or one having [inaudible], it can’t be shared.

That’s why today we introduce shareable Metal textures that allow your applications to share any type of Metal texture between process spaces, as long as these textures stay in scope of single GPU.

This new feature enables more advanced use cases.

For example, sharing depth of your scene with VR compositor.

But, of course, it’s not limited just to that.

Now, let’s look how those textures can be created.

Because shareable textures now allow us to pass complex textures between processes, we will create a 2D array texture that we will pass to the VR compositor.

As you can see, to do that, we use a new method, newSharedTextureWithDescriptor.

And while doing that, you need to remember to use private storage mode, as this texture can only be accessed by the GPU on which it was created.
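In Swift, that method is exposed as `makeSharedTexture(descriptor:)`. The following is a minimal sketch rather than the code shown on the slide; the function name, dimensions, pixel format, and usage flags are assumptions made for illustration:

```swift
import Metal

// Sketch: create a shareable 2D array texture with one slice per eye.
// Requires macOS 10.14 or later. Name, size, format, and usage are assumptions.
func makeShareableEyeTexture(device: MTLDevice, eyeWidth: Int, eyeHeight: Int) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = .type2DArray
    descriptor.pixelFormat = .bgra8Unorm_srgb        // assumed format
    descriptor.width = eyeWidth
    descriptor.height = eyeHeight
    descriptor.arrayLength = 2                       // slice 0: left eye, slice 1: right eye
    descriptor.storageMode = .private                // the texture stays on the creating GPU
    descriptor.usage = [.renderTarget, .shaderRead]  // assumed: resolve target, then sampled
    return device.makeSharedTexture(descriptor: descriptor)
}
```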

Now, we see a code snippet showing how our VR application would send IOSurfaces to the VR compositor in the past.

We will now go through this code snippet and see what changes need to be applied to switch from using IOSurfaces to shared Metal textures.

So we don’t need those two IOSurfaces anymore, and the two textures that were backed by them can now be replaced with a single shareable Metal texture of 2D array type.

We will then assign this texture to both texture descriptors from the OpenVR SDK, and change their type from IOSurface to Metal.

After making these few changes, we can submit the image for the left and right eye to the compositor.

The compositor will now know that we’ve passed a shared Metal texture with an advanced layout instead of an IOSurface, and it will check whether its type is 2D array or 2D multi-sample array.

If it is, then the compositor will automatically assume that the image for the left eye is stored in slice 0, and the image for the right eye is stored in slice 1.

So your application doesn’t need to do anything more about that.

And of course, sharing Metal textures between application and compositor is not the only use case for shareable Metal textures.

So here we have simple example of how you can pass Metal texture between any two processes.

So we start exactly in the same way.

We create our shareable Metal texture, but now from this texture, we create special shared texture handle that can be passed between process spaces using cross-process communication connection.

Once this handle is passed to other process, it can be used to recreate texture object.

But while doing that, you need to remember to recreate your texture object on exactly the same device as it was originally created in other process space, as this texture cannot leave scope of GPU.
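The two halves of that exchange might look roughly like this in Swift. This is a sketch under assumptions: the function names are hypothetical, and the transport that carries the handle between processes (for example an XPC connection) is left out entirely:

```swift
import Metal

// Producer process: wrap a shareable texture in a handle that can be sent
// to another process. The texture must have been created as shareable.
func makeHandle(for sharedTexture: MTLTexture) -> MTLSharedTextureHandle? {
    return sharedTexture.makeSharedTextureHandle()
}

// Consumer process: recreate the texture object from the received handle.
// Using the handle's own device keeps the texture on the GPU that created it.
func openTexture(from handle: MTLSharedTextureHandle) -> MTLTexture? {
    return handle.device.makeSharedTexture(handle: handle)
}
```

Asking the handle for its device, instead of picking a device independently, is one way to satisfy the same-GPU requirement described above.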

So now let’s get back to our pipeline and see what will change.

The application can now replace those separate IOSurfaces with one 2D array texture, storing the image for both eyes.

This allows further optimization, as the original 2D multi-sample array texture can now be resolved in one pass as well, into the just-created shareable 2D array texture.

But that’s not everything.

Let’s look at the compositor.

Once we have simplified the rendering path on the application side, there is nothing preventing the compositor from benefiting from those new features as well.

So the compositor can now use those incoming 2D array textures and perform the work for both eyes in a single render pass as well.

And as you can see, we’ve just simplified the whole pipeline.

So let’s do recap of what we’ve just learned.

We’ve just described two new Metal features.

Shareable Metal textures, and 2D multi-sample array texture type.

And the way they can be used to further optimize your rendering pipeline.

Both features will soon be supported in upcoming SteamVR runtime updates.

So now, let’s focus on techniques that will allow your application to maximize its CPU and GPU utilization.

We will divide this section into two subsections: advanced frame pacing and reducing fill rate.

We will start with frame pacing.

And in this section, we will analyze application frame pacing and how it can be optimized for VR.

So let’s start with a simple, single-threaded application that executes everything in a serial manner.

Such an application will start its frame by calling WaitGetPoses to receive poses and synchronize its execution to the frame rate of the headset.

Both Vive and Vive Pro have a refresh rate of 90 frames per second, which means the application has only 11.1 milliseconds to process the whole frame.

For comparison, the blink of an eye takes about 300 milliseconds.

So in this time, the application should render 27 frames.
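The arithmetic behind these figures can be checked with a few lines of Swift. The function names are made up for this sketch; the inputs are the 90 frames per second and 300 milliseconds quoted above:

```swift
// Time available per frame at a given refresh rate, in milliseconds.
func frameBudgetMilliseconds(refreshRateHz: Double) -> Double {
    return 1000.0 / refreshRateHz
}

// Number of whole frames that fit into a time span at a given refresh rate.
func framesRendered(inMilliseconds duration: Double, refreshRateHz: Double) -> Int {
    return Int((duration * refreshRateHz / 1000.0).rounded(.down))
}

let budget = frameBudgetMilliseconds(refreshRateHz: 90)                   // about 11.1 ms
let blinkFrames = framesRendered(inMilliseconds: 300, refreshRateHz: 90)  // 27 frames
```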

So once our application receives poses from WaitGetPoses, it can start the simulation of your virtual [inaudible].

When this simulation is complete, and the state of all objects is known, the application can continue with encoding the command buffer that will then be sent to the GPU for execution.

Once the GPU is done and the image for both eyes is rendered, it can be sent to the VR compositor for final post-processing, as we talked about a few slides before.

After that, the frame is scanned out from memory to the panels in the headset.

This transfer takes an additional frame, as all pixels need to be updated before the image can be presented.

Once all pixels are updated, photons are emitted and the user can see the frame.

So as you can see, from the moment the application receives poses to the moment the image is really projected, it takes about 25 milliseconds.

That is why the application receives poses that are already predicted into the future, to the moment when photons will be emitted, so that the rendered image matches the user’s pose.

And this cascade of events, overlapping with the previous and next frame, creates our frame pacing diagram.

As you can see, in case of the single-threaded application, GPU is idle most of the time.

So let’s see if we can do anything about that.

We are now switching to multi-threaded application, which separates simulation of its visual environment from encoding operations to the GPU.

Encoding of those operations will now happen on separate rendering threads.

Because we’ve separated simulation from encoding, simulation for our frame can happen in parallel to previous frame encoding of GPU operations.

This means that encoding is now shifted earlier in time, and starts immediately after we receive predicted poses.

This means that your application will now have more time to encode the GPU work, and the GPU will have more time to process it.

So, as a result, your application can have better visuals.

But there is one trick.

Because simulation is now happening one frame in advance, it requires a separate set of predicted poses.

This set is predicted 36 milliseconds into the future, one frame further than the 25 milliseconds used for rendering, so that it will match the set predicted for the rendering thread, and both will match the moment when photons are emitted.

This diagram already looks good from the CPU side, as we can see the application is nicely distributing its work across CPU cores, but let’s focus on the GPU.

As you can see, our example application is now encoding all of the GPU work for the whole frame into a single command buffer, so until this command buffer is complete, the GPU is waiting idle.

But it’s important to notice that encoding GPU operations on the CPU takes much less time than processing those operations on the GPU.

So we can benefit from this fact and split our encoding into a few command buffers, where the first command buffer is encoded very fast, with just a few operations, and submitted to the GPU as soon as possible.

This way, our encoding now proceeds in parallel with the GPU already processing our frame, and as you can see, we’ve just extended the time when the GPU is doing its work and, as a result, further increased the amount of work that you can submit in a frame.
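One way to picture this splitting in Swift is shown below. It is a sketch, not the session’s code: `encodeFrame` and the `passes` closures are hypothetical stand-ins for an application’s own encoding routines.

```swift
import Metal

// Sketch: encode a frame as several small command buffers and commit each one
// as soon as it is encoded, so the GPU can start before the whole frame is encoded.
// `passes` is a hypothetical list of closures, each encoding one chunk of the frame.
func encodeFrame(queue: MTLCommandQueue,
                 passes: [(MTLCommandBuffer) -> Void]) -> [MTLCommandBuffer] {
    var committed: [MTLCommandBuffer] = []
    for encodePass in passes {
        guard let commandBuffer = queue.makeCommandBuffer() else { continue }
        encodePass(commandBuffer)   // encode only a small part of the frame
        commandBuffer.commit()      // hand it to the GPU right away
        committed.append(commandBuffer)
    }
    return committed
}
```

Command buffers committed to the same queue run in the order they were committed, so the chunks still execute in frame order.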

Now, let’s get back to our diagram, and see how it all looks together.

So as you can see, now both CPU and GPU are fully utilized.

So such an application is already a very good example of a VR application, but there are still a few things we can do.

If you notice, the rendering thread still waits to receive predicted poses before encoding any type of GPU work.

But not all workloads in the frame require those poses.

So let’s analyze typical frame workloads in more detail.

Here, you can see a list of workloads that may be executed in each frame.

Some of them happen in screen space or require general knowledge about the pose for which the frame is rendered.

We call such workloads pose-dependent ones.

At the same time, there are workloads that are generic and can be executed immediately, without knowledge about poses.

We call those workloads pose-independent ones.

So far, our application was waiting for poses before encoding any type of work to the GPU.

But if we split those workloads in half, we can encode pose-independent workloads immediately and then wait for poses to continue with encoding pose-dependent ones.

In this slide, we’ve already separated pose-independent workloads from pose-dependent ones.

The pose-independent workload is now encoded in [inaudible] command buffer, and is marked with a slightly darker shade than the pose-dependent workload following it.

Because the pose-independent workload can be encoded immediately, we will do exactly that.

We will encode it as soon as the previous frame workload is encoded.

This gives the CPU more time to encode the GPU work and, what is even more important, it ensures that this GPU work is already waiting to be executed on the GPU, so there will be no idle time on the GPU.

As soon as previous frame is finished, GPU can start with the next one.
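The ordering described here could be sketched as follows. All names are hypothetical: `waitForPoses` stands in for the runtime’s blocking pose call (such as WaitGetPoses in OpenVR), and the two encode closures stand in for the application’s own workloads.

```swift
import Metal

// Sketch: submit pose-independent work before blocking on poses, then encode
// the pose-dependent work once poses are available.
func renderFrame(queue: MTLCommandQueue,
                 encodePoseIndependent: (MTLCommandBuffer) -> Void,
                 waitForPoses: () -> Void,
                 encodePoseDependent: (MTLCommandBuffer) -> Void) {
    // 1. Work that does not need poses is encoded and committed immediately,
    //    so it is already queued when the GPU finishes the previous frame.
    if let early = queue.makeCommandBuffer() {
        encodePoseIndependent(early)
        early.commit()
    }
    // 2. Only now block until the predicted poses arrive.
    waitForPoses()
    // 3. Work that needs poses follows in its own command buffer.
    if let late = queue.makeCommandBuffer() {
        encodePoseDependent(late)
        late.commit()
    }
}
```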

The last subsection is multi-GPU workload distribution.

We can scale our workload across multiple GPUs.

The current MacBook Pro has two GPUs on board, and while they have different performance characteristics, there is nothing preventing us from using them.

Similarly, if an eGPU is connected, the application can use it for rendering to the headset while using the Mac’s primary GPU to offload some work.

So we’ve just separated pose-independent work and moved it to a secondary GPU.

We could do that because it was already encoded much earlier in our frame, and now this pose-independent workload is executing in parallel to the pose-dependent workload of the previous frame.

As a result, we further increased the amount of GPU time that you have for your frame.

But by splitting this work across multiple GPUs, we now get to the point where we need a way to synchronize those workloads with each other.

So today we introduce new synchronization primitives to deal with exactly such a situation.

MTLEvent can now be used to synchronize GPU work in the scope of a single GPU across different Metal queues, and MTLSharedEvent extends this functionality by allowing it to synchronize workloads across different GPUs and even across different processes.
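For the single-GPU case, a minimal sketch of ordering two queues with an MTLEvent might look like this. The function name and the producer and consumer roles are assumptions for illustration; the actual work to encode is omitted.

```swift
import Metal

// Sketch: make work on `consumerQueue` wait for work on `producerQueue`,
// both on the same device, using an MTLEvent.
func encodeOrderedWork(device: MTLDevice,
                       producerQueue: MTLCommandQueue,
                       consumerQueue: MTLCommandQueue) -> Bool {
    guard let event = device.makeEvent(),
          let producer = producerQueue.makeCommandBuffer(),
          let consumer = consumerQueue.makeCommandBuffer() else { return false }

    // ... encode the producer's GPU work here ...
    producer.encodeSignalEvent(event, value: 1)   // signal after the producer's work

    consumer.encodeWaitForEvent(event, value: 1)  // stall until the event reaches 1
    // ... encode the consumer's GPU work here ...

    producer.commit()
    consumer.commit()
    return true
}
```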

So here we will go through the simple code example.

We have our Mac, with an eGPU attached through a Thunderbolt 3 connection.

This eGPU will be our primary GPU driving the headset, so we can use the GPU that is already in our Mac as a secondary, supporting GPU.

And we will use a shared event to synchronize the workloads of both GPUs.

The event’s initial value is zero, so it’s important to start the synchronization counter from 1.

That’s because if we waited on a just-initialized event for a value of zero, the wait would return immediately, so there would be no synchronization.

So our rendering thread now starts encoding work for our supporting GPU immediately.

It will encode pose-independent work that will happen on our supporting GPU, and once this work is complete, its results will be stored in local memory.

That’s why we follow with encoding a blit operation that will transfer those results to system memory that is visible to both GPUs.

And once this transfer is complete, our supporting GPU can safely signal our shared event.

This signal will tell the eGPU that it’s now safe to take those results.

So our rendering thread has committed this command buffer, and the supporting GPU is already processing its work.

At the same time, we can start encoding a command buffer for the primary GPU that is driving the headset.

In this command buffer, we will start by waiting for our shared event, to be sure that the data is in system memory. Once it’s there and the shared event is signaled, we can perform a blit operation that will transfer this data through the Thunderbolt 3 connection to our [inaudible] GPU. Once this transfer is complete, it’s safe to perform pose-dependent work, so this second command buffer will signal a local event to let the pose-dependent work know that it can start executing.

After encoding and submitting those two command buffers, the rendering thread can continue as usual, with waiting for poses and later encoding pose-dependent work.
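The handshake between the two GPUs could be sketched in Swift as below. This is an illustration under assumptions, not the code from the slides: the function name and both closures are hypothetical, the actual blit encoding is left to the closures, and the final local-event signal is omitted for brevity.

```swift
import Metal

// Sketch: the supporting GPU signals a shared event once its results are in
// system memory; the primary GPU waits on that event before copying them in.
func encodeTwoGPUHandshake(supportingQueue: MTLCommandQueue,
                           primaryQueue: MTLCommandQueue,
                           sharedEvent: MTLSharedEvent,
                           signalValue: UInt64,
                           encodeSupportingWork: (MTLCommandBuffer) -> Void,
                           encodeCopyToPrimary: (MTLCommandBuffer) -> Void) -> Bool {
    // A value of 0 would not synchronize anything: a new event already reads 0.
    guard signalValue >= 1,
          let supporting = supportingQueue.makeCommandBuffer(),
          let primary = primaryQueue.makeCommandBuffer() else { return false }

    // Supporting GPU: pose-independent work plus the copy to system memory,
    // followed by the signal.
    encodeSupportingWork(supporting)
    supporting.encodeSignalEvent(sharedEvent, value: signalValue)
    supporting.commit()

    // Primary GPU: wait for the signal, then copy the results to local memory.
    primary.encodeWaitForEvent(sharedEvent, value: signalValue)
    encodeCopyToPrimary(primary)
    primary.commit()
    return true
}
```

A caller would typically create the event once with `makeSharedEvent()` and pass an increasing `signalValue` each frame, starting at 1.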

So now we have a mechanism to synchronize different workloads between different GPUs.

But as you can see, our secondary GPU is still a little bit idle.

That’s because in this example we decided to push to it pose-independent workloads that have a dependency with pose-dependent ones.

Excuse me.

But of course there are types of workloads that have no dependencies, and they can happen at a lower frequency than the frequency of the headset.

One example of such workloads is the simulation of physically based accurate [inaudible], or anything else that requires a lot of time to be updated.

Such workload can happen in the background, completely asynchronously from rendering frames, and each time it’s ready, its results will be sent to primary GPU.

It’s marked here with gray color to indicate that it’s not related to any particular frame.

So, of course there are different GPUs with different performance characteristics, and they will have different bandwidth connections.

And your application will have different workloads in a single frame with different relations between them.

So you will need to design a way to distribute this workload on your own, but having said all that, it’s important to start thinking about GPU workload distribution, as multi-GPU configurations are becoming common on Apple platforms.

So let’s summarize everything that we’ve learned in this section.

You should multi-thread your application to take full benefit of all CPU cores.

And split your command buffers, to ensure that the GPU is not idle.

When doing that, if possible, try to separate pose-independent from pose-dependent workloads, to be able to encode this work as soon as possible, and go even further by splitting workloads by frequency of update, so that if your application executes on a multi-GPU configuration, you can easily distribute it across those GPUs.

And while doing that, ensure that you drive each GPU with separate rendering threads to ensure that they all execute asynchronously.

Now, we switch to reducing fill rate.

Vive Pro introduces new challenges for VR application developers.

To better understand the scale of the problem, we will compare fill rates of different media.

So, for example, an application rendering at the default scaling rate to the Vive headset produces 456 megapixels per second.

And most advanced [inaudible] against that [inaudible] HD TVs have fill rate of 475 megapixels per second.

Those numbers are already so big that game developers use different tweaks to reduce this fill rate.

Now, let’s see how Vive Pro compares to those numbers.

Vive Pro has a normal fill rate of 775 megapixels per second, and if you add to that four times multi-sample anti-aliasing or a bigger scaling rate, this number will grow even more.

That is why reducing fill rate is so important.

There are multiple techniques out there, and new ones are invented every day.

So I encourage you to try them all, but today we will focus on only a few, as they are the simplest to implement and bring nice performance gains.

So we will start with clipping invisible pixels.

Here, you can see image rendered for left eye.

But due to the nature of the lens warp, about 20% of those pixels are lost after the compositor performs its distortion correction.

So on the right, you can see image that will be displayed on a panel in a headset before it goes through the lens.

So, the simplest way to reduce our fill rate is to prevent our application from rendering those pixels that won’t be visible anyway, and you can do that easily by using SteamVR Stencil Mask.

So we’ve just saved 20% of our fill rate by applying this simple mask, and reduced our Vive Pro fill rate to 620 megapixels per second.

Now, we will analyze implication of this lens distortion correction in more detail.

We will divide our field of view into nine sections.

Central section has field of view of 80 degrees horizontally by 80 degrees vertically, and we have surrounding sections on the edges and corners.

We’ve color tinted them to better visualize the contribution to final image.

So as you can see, corners are almost completely invisible and edges make a much smaller contribution to the image than in the original one.

In fact, if you see this image in the headset, you wouldn’t be able to look directly at the red sections.

The only way to see them would be with your peripheral vision.

So this gives us great hint.

We can render those edge and corner sections at a reduced fill rate, as they are mostly invisible anyway.

We render the central section as we did before.

But then we will render vertical edges with half of the width and horizontal edges with half of the height.

And finally, we will render corner sections at one-fourth of the resolution.

Once our expensive rendering pass is complete, we will perform cheap upscaling pass that will stretch those regions back to the resolution at which they need to be submitted to compositor.
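To get a feel for the savings, the shaded pixel count of such a 3x3 layout can be estimated with a small calculation. This is a sketch under stated assumptions: the function name is made up, the central region is given as a fraction of the render target’s width and height rather than in degrees (the mapping from degrees to pixels depends on the projection), and "one-fourth of the resolution" is read as half the width and half the height.

```swift
// Fraction of the original pixel count that is shaded when the frame is split
// into a 3x3 grid: full-resolution center, half-resolution edges, and
// quarter-resolution corners.
func multiResolutionPixelFraction(centralFractionX cx: Double,
                                  centralFractionY cy: Double) -> Double {
    let center = cx * cy
    let verticalEdges = (1.0 - cx) * cy * 0.5         // left and right columns, half width
    let horizontalEdges = cx * (1.0 - cy) * 0.5       // top and bottom rows, half height
    let corners = (1.0 - cx) * (1.0 - cy) * 0.25      // half width and half height
    return center + verticalEdges + horizontalEdges + corners
}

// Example: a central region covering half of each axis.
let fraction = multiResolutionPixelFraction(centralFractionX: 0.5, centralFractionY: 0.5) // 0.5625
```

Multiplying the unreduced fill rate by this fraction gives an estimate of the reduced one; the exact figures quoted in the session depend on the headset’s projection and are not reproduced by this simplified model.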

So you may be wondering how much we’ve gained by doing that.

In case of 80 by 80 degree central region, we reduced our fill rate all the way down to 491 megapixels per second.

But you remember that we just talked about clipping invisible pixels, so let’s combine those two techniques together.

By clipping pixels combined with multi-resolution shading, you can reduce your fill rate even further to 456 megapixels per second, and that is not a random number.

In fact, that’s a default fill rate of Vive headset, so by just using those two optimization techniques, your application can render to Vive Pro with much higher resolution using exactly the same GPU as it did when rendering to Vive headset.

Of course, you can use those techniques when rendering to Vive as well, which will allow you to push the visuals of your application even further and make it prettier.

There is one caveat here.

Multi-resolution shading requires a few render passes, so it will increase your workload on geometric [inaudible], but you can easily mitigate that by just reducing your central region by a few degrees.

Here, by just reducing our central region by 10 degrees, we’ve reduced the fill rate all the way to 382 megapixels per second.

And if your geometry workload is really high, you can go further and experiment with even smaller central regions, which will reduce fill rate even more.

In the case of a 55 by 55 degree central region, 80% of your [inaudible] eye movement will still be inside this region, but we’ve reduced our fill rate by more than half, to 360 megapixels per second.

So of course there are different ways to implement multi-resolution shading.

And you will get different performance gains from that.

So I encourage you to experiment with this technique and try what will work for you best.

So let’s summarize everything that we’ve learned during this session.

We’ve just announced plug and play support for Vive Pro Headsets, and introduced new Metal 2 features that allow you now to develop even more advanced VR applications.

And I encourage you to take advantage of multi-GPU configurations, as they are becoming common on Apple platforms.

You can learn more about this session from this link, and I would like to invite all of you to meet with me and my colleagues during the Metal for VR Lab, which will take place today at 12:00 p.m. in Technology Lab 6.

Thank you very much.

[ Applause ]
