Introducing Metal 2

Session 601 WWDC 2017

Metal 2 provides near-direct access to the graphics processor (GPU), enabling your apps and games to realize their full graphics and compute potential. Dive into the breakthrough features of Metal 2 that empower the GPU to take control over key aspects of the rendering pipeline. Check out how Metal 2 enables essential tasks to be specified on-the-fly by the GPU, opening up new efficiencies for advanced rendering.

[ Applause ]

Welcome.

We introduced a host of new technologies with Metal 2 to allow you to make better, faster, and more efficient applications.

My name is Michal and together with my colleague Richard we'll explore three main themes today.

With Metal 2 we are continuing our direction of moving the expensive things to happen less frequently and making sure that the frequent things are really, really cheap.

Over the years we introduced precompiled shaders, render state objects, and, last year, Metal Heaps, all to make sure that you can move the costly operations outside of your main application loop.

We gave you 10 times more draw calls by switching from OpenGL to Metal.

And this year we are introducing our new binding API that gives you some more.

And so we will talk about it a bit further.

We are also putting the GPU more in the driver's seat with GPU-driven pipelines.

And you will be able to create new, novel algorithms, new rendering techniques, and whole unique experiences utilizing Metal 2 on modern GPUs.

Well, speaking of the experiences, we have a lot of new features in Metal and we have three other sessions that I would love you to attend.

VR is coming to Mac this year and with the new iMacs we are giving you really powerful GPUs.

The external GPU is coming to MacBook Pro to give you the same power.

And this all enables your users and your content creators to experience VR in ways not possible before.

Tomorrow's session will show you how to use our direct-to-display technology to get your content to the HMD quickly and with low latency.

You'll learn about the new Metal API additions for VR and our new tools additions.

Machine learning is quickly becoming a key feature of our devices in many, many applications.

And with Metal 2 you can use Metal Performance Shaders to utilize the power of the GPU for machine learning on both desktop and mobile devices.

And you're probably staring at that picture behind me and thinking, "How's that done?"

Well, we have a session for you on Thursday where you will learn about this and about the machine learning primitives and the image processing primitives we have in our Metal Performance Shaders.

Lastly, our tools have seen the biggest advancement yet with Metal 2.

You'll be able to debug your applications quicker.

You can drill down to problems more easily, and we are exposing, for example, GPU performance counters to make sure that you can find your hotspots and your application's fast paths more quickly.

So I hope I got you excited about the few days ahead and let's get back to the present with the content of today's session.

So we'll start with argument buffers, probably our biggest core framework addition this year.

Argument buffers provide an efficient new way of configuring which buffers, textures, and samplers your application can use, freeing up a considerable amount of CPU resources and actually enabling completely new use cases for the GPU at the same time.

Then we'll talk about Raster Order Groups, a new fragment shader synchronization primitive that allows you to precisely control the order in which fragment shaders access common memory, enabling new use cases such as programmable blending on macOS, voxelization, or order-independent transparency.

And then we'll switch to the topic of display and we'll talk about the new ProMotion displays on iPads and how to best drive them using Metal.

And we'll also give you a recap of our best practices of getting your content from your render targets to the glass as quickly as possible and with the least amount of latency.

And finally we'll finish with a survey of all the other Metal features that we added to align iOS and macOS platforms into one big, common Metal ecosystem.

So the argument buffers.

Let's look at what they are and how they work.

And I will need an example for that so let's think of a simple material that those who actually wrote any sort of 3D render program would know.

In your material you have a bunch of numerical constants, a bunch of textures (probably more than two nowadays), and a sampler.

And this is what you need to send to the GPU to be able to render your primitive.

Now the texture objects are interesting because they contain both texture properties such as width, height, pixel format perhaps, and then a pointer to a blob of memory which contains all the pretty pixels.

Well, unfortunately we are not really interested in those pixels in this presentation.

So off it goes and we'll only be talking about boring texture states.

So with the traditional argument model we allow you to put all the constants into a Metal buffer, and we created this indirection so that it's easy for you to use and also gives the GPU unfiltered, direct access to all the data.

However, when it comes to things like textures or samplers you still need to go through quite a bit of API, and in your rendering loop you'll set the buffer, set all the textures and samplers, and only after that can you finally draw.

And even though Metal is really optimized, this is quite a few API calls, and if you multiply that by the number of objects you need to render, and consider that you need to do all this work every frame, at some point it limits the number of objects that you can put on the screen.

With argument buffers we decided that we would like to extend this very convenient indirection that we have for constants to everything.

So you can actually put texture state, samplers, pointers to another buffer into an argument buffer and this really simplifies your rendering pipeline because well, suddenly the only thing you need to do is set the buffer and draw.

And you probably figured out that with this few API calls you can put more objects on the screen, and as you'll see later, you can do actually even better with argument buffers.
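To make the comparison concrete, here is a rough Python sketch that only counts API calls per frame under the two models described above. It is a simplified assumption, not a measurement: the function names are hypothetical, and real CPU cost depends on far more than the call count.

```python
def calls_traditional(objects, buffers, textures, samplers):
    """Traditional model: one set call per bound resource, then one draw, per object."""
    return objects * (buffers + textures + samplers + 1)


def calls_argument_buffer(objects):
    """Argument buffer model: set one buffer, then draw, per object."""
    return objects * 2


# Hypothetical scene: 1,000 objects, each with 1 constant buffer,
# 4 textures, and 1 sampler.
traditional = calls_traditional(1000, buffers=1, textures=4, samplers=1)
with_argument_buffers = calls_argument_buffer(1000)
```

In this hypothetical scene the traditional model issues 7,000 calls per frame and the argument buffer model 2,000, and the gap widens as each object uses more resources.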

So we've done a bunch of benchmarks and run argument buffers on our devices.

And this is for example what you get on iPhone 7.

With the traditional model, quite unsurprisingly, the cost of your draw call scales with the number of resources you use in a draw call, while with argument buffers the cost stays pretty low and almost flat.

So this already shows that, for example, with a very simple shader with just two resources, a texture and a buffer or two textures, you're getting a sevenfold performance improvement.

With eight textures or eight resources, however you want to mix it up, you are getting an 18-fold performance improvement on iPhone 7, and it gets even better with 16 resources, obviously.

So I already talked about the performance.

I hinted toward new use cases.

And we'll talk about this in a minute.

And the last benefit of argument buffers I would like to bring up is the ease of use.

And it comes from the fact that argument buffers are ultimately an extension of buffers.

So you can, for example go ahead and prepare them ahead of the time, let's say when your game is loading, and then don't have to worry about it anymore during your rendering loop, further improving your performance.

Or you can mix them with the traditional binding model, even within a single draw call, which means that your adoption can be as simple as using our new tools to figure out what the most expensive loop in your application is, optimizing that, and then maybe returning to the rest in a year when you have time.

And lastly, the argument buffers are supported across all Metal devices.

So once you take this adoption step and you get all the performance you can keep using it on all Metal devices.

The ease of use actually translates really well to the shaders.

And since we will be looking at the shaders quite a bit during this section this is an example of the material I gave you in the beginning.

And as you can see, the textures and the sampler are part of the structure, and the main thing to take away from this is that your argument buffer is just a structure in a shader, and you can use all the language features that you have at your disposal to make embedded structures, to organize your data, or to use arrays or pointers.

It just really works.

So let's now look at the three main new features of argument buffers, the first one being dynamic indexing.

And a great example of it is crowd rendering.

If you played some of the recent Open World games you've seen that games try to render large crowds full of unique, varying characters in order to make these beautiful, immersive worlds.

Well, actually that's quite a costly thing to do if you need to create so many draw calls.

With argument buffers we already said that we could put all the properties required for let's say a character into a single argument buffer, bind it, and save all that performance on the CPU, but actually we can do better.

We can, for example, create an array of argument buffers where each element represents a single character.

And then it suddenly becomes very, very simple, because what you need to do is set this big buffer (that's one API call) and issue a single instanced draw call, let's say with 1,000 instances because I would like 1,000 characters on screen.

That's second API call.

And after that it's all on the GPU.

In a vertex shader you use instance ID to pick the right element from the array, get the character, put it somewhere where it needs to be in the world, give it the right pose, if it's for example mid-walk cycle, and then in the fragment shader again you use the instance ID and pick the right materials, the right hair color to finalize the look.

So we are suddenly getting from tens, hundreds, maybe thousands of draw calls to a single one.

And it's faster on the CPU.

It's faster on the GPU.

And this is how simple it looks in a shader.

Pretty much your argument buffer becomes an array of structures.

You pick the right element using the instance ID, and you can, for example, take a pointer to it and pass it to your helper methods or whatever you need to do to process the data.
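The indexing pattern described here can be modeled on the CPU. The following Python sketch is not shader code; it only mimics how each instance of a single instanced draw picks its own element from one array. The `Character` fields and the stage function names are hypothetical stand-ins for whatever per-character data a real renderer would store.

```python
from dataclasses import dataclass


@dataclass
class Character:
    position: tuple   # where the character stands in the world
    pose: int         # hypothetical pose index, e.g. a frame of a walk cycle
    hair_color: str   # stands in for per-character material data


def vertex_stage(characters, instance_id):
    """Pick this instance's element and return its placement and pose."""
    character = characters[instance_id]
    return character.position, character.pose


def fragment_stage(characters, instance_id):
    """Pick the same element again to finalize the look."""
    return characters[instance_id].hair_color


def instanced_draw(characters, instance_count):
    """One 'draw call': every instance indexes the same array by its own ID."""
    return [
        (vertex_stage(characters, i), fragment_stage(characters, i))
        for i in range(instance_count)
    ]


crowd = [
    Character((i, 0, 0), i % 4, "red" if i % 2 else "black")
    for i in range(1000)
]
results = instanced_draw(crowd, 1000)
```

The point of the model is that the per-character selection happens inside the draw, so the calling side does the same two steps regardless of how many characters there are.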

The second great feature of argument buffers is the ability of the GPU to set resources.

And we actually created an example for this.

We created a particle simulation running completely on the GPU.

And I'll tell you how we did that, and we'll see it in action later.

So we created an array of argument buffers where each element is a single particle, and I guess you already spotted a trend here.

Our simulation kernel then treats and simulates one particle per thread, but we want to actually go further and we want it to be able to create the particles in the kernel as well, on the GPU.

So in order to do that, and to give it the right materials, we also have an argument buffer with all the different materials that we would like our particles to have.

And our simulation kernel then, every time you do an action in our little demo, the simulation kernel looks into the environment and sees what's the correct, most appropriate material.

And let's say if you are in the forest, we pick moss as the right, appropriate material for a rock and copy it to the particle itself.

If you're on the rocks we pick the rock material.

On the hill we pick grass.

So this way everything stays on the GPU and it actually looks in the shader just as simple as I describe it.

If you want to modify data on your GPU you bind it as a device buffer and start assigning values as you are used to, but also this time around you can copy textures or copy the whole structure and it's really this simple.
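As a rough CPU-side model of that kernel, the Python sketch below has each simulated thread look up the material for its environment and copy it into its own particle slot. The dictionary layout and function names are assumptions for illustration; only the forest, rocks, and hill mapping comes from the description above.

```python
# Material table, standing in for the argument buffer that holds all materials.
MATERIALS = {"forest": "moss", "rocks": "rock", "hill": "grass"}


def spawn_particle(particles, thread_id, environment):
    """One simulated kernel thread: choose a material and write it into the particle."""
    particles[thread_id] = {"material": MATERIALS[environment], "alive": True}


def run_kernel(particles, environments):
    """Run one thread per particle slot; no per-particle work happens outside the kernel."""
    for thread_id, environment in enumerate(environments):
        spawn_particle(particles, thread_id, environment)


particles = [None] * 3
run_kernel(particles, ["forest", "rocks", "hill"])
```

After the run, the three particle slots hold moss, rock, and grass respectively, all assigned inside the simulated kernel.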

And the last great feature I would like to mention is the ability of argument buffers to reference another argument buffer.

So this way you can actually go ahead and create a reusable and complex object hierarchy just as you are used to from C++, Swift, or Objective-C.

Let's say, in the example of our renderer, you have a ton of objects but probably very few materials, so what you can do is reference the material from each object and save some memory, or you can build your scene graph as a binary tree where you point to the objects and the tree nodes as you need them, as you would be used to from the CPU.

And you can share this data with the CPU as well.
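The memory saving from referencing shared materials can be illustrated with ordinary object references. This Python sketch is an analogy only; the class names and the `roughness` property are hypothetical.

```python
class Material:
    def __init__(self, name, roughness):
        self.name = name
        self.roughness = roughness


class SceneObject:
    def __init__(self, name, material):
        self.name = name
        self.material = material  # a reference to shared data, not a copy


def count_unique_materials(objects):
    """Count how many distinct material instances the objects point to."""
    return len({id(obj.material) for obj in objects})


stone = Material("stone", 0.8)
wood = Material("wood", 0.5)
scene = [SceneObject(f"object{i}", stone if i % 2 == 0 else wood) for i in range(1000)]

# Editing the shared material is visible through every object that references it.
stone.roughness = 0.9
```

Here 1,000 objects share just two material instances, so a change to one material needs to be written only once.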

So these are the main features.

And let's look at the support matrix.

We have two tiers.

The tier one is supported across all Metal devices and you get the CPU performance improvements.

You get the new shading language features.

But because of the limitations of the GPUs, this tier is not able to support the GPU-driven use cases that I mentioned earlier.

With tier two however you are getting all of this so you get all the new use cases and we are also really increasing the amount of resources you can access.

Your shaders can access half a million textures and buffers for you to build these new algorithms.

While tier one is supported on all Metal devices, tier two is something you need to query for.

But don't worry, the support is really wide.

All the Macs with these three GPUs are tier two.

All the new MacBook Pros, the latest MacBook, and last year's MacBook Pros are tier two.

So you can go ahead and have fun.

Now let's look at the demo I promised you.

We will be showing three videos with three different features.

The real time rendered terrain, with material that changes dynamically, we place some vegetation by the GPU on the terrain to make it interesting, and we have all these nice particles that I mentioned before.

So, as you see, we are painting height on the terrain.

We are sculpting the terrain, and the material actually follows.

And this is a great thing about argument buffers, because they allowed us to create one big argument buffer with all the possible materials as layers in there, and when we are rendering the terrain in a pixel shader we are looking at things like terrain height, slope, and the amount of sun that reaches certain pixels, and based on these properties and some others we decide what the best and most appropriate materials are for that given pixel.

And this is all happening in real time, whereas previously we would have to go ahead and split the terrain in small pieces offline, analyze which pieces need which textures in order to make it as optimal as possible, and only then render it.

So we are going from a preprocessing step, which is heavy and prevents real-time modification, to something that is real time, without preprocessing, and completely dynamic.

And we added vegetation on it and as you see the vegetation is also context sensitive.

You see the palm trees on the sand.

You see the little tiny apple trees on the hills.

And while the vegetation itself is fairly traditional instance rendering, the power of the argument buffers here is that it allows us to share the same terrain material with all the same properties and the same terrain analysis function between two completely separate pieces of code.

While terrain rendering uses all this data to render pixels, the compute kernel that places the geometry, the vegetation, actually analyzes the same materials to figure out what is the best type of tree to place in the given spot.

And this is very easy, because every time we make a change nothing actually changes in our code: we just add new layers or change our analysis function, whereas previously we would have to maybe juggle 70 textures between two completely separate codebases in order to make them run in sync.

Lastly, we have the particles.

I hope you can see that they nicely get the material of the terrain there.

Now what I did not mention is that this all is rendered with again a single draw call.

We are rendering 16,000 particles here with a single draw call, with absolutely no involvement on the CPU.

And not only do particles have unique materials, they actually have unique shapes, because argument buffers allow you to change your vertex buffer per draw call.

This is something where, if you tried to do it without argument buffers, you would have to create a complicated control handover between the GPU that simulates and the CPU that tries to come up with the best set of draw calls to represent all this variety.

So with argument buffers this became just very, very simple.

Okay, so enough pretty pictures.

And let's wrap my portion of the session with a look at some APIs and some best practices.

As I mentioned before, argument buffers are an extension of Metal buffers and that means all of our API related to buffers just works.

You can go ahead and take an argument buffer and copy it somewhere else; you can blit it between the CPU and the GPU.

And while argument buffers look like structures on the GPU for shaders, on the CPU you will use MTLArgumentEncoder objects to fill up the content.

This abstraction allows Metal to create the most optimal memory representation for any given argument buffer on that specific GPU that you are actually running.

So you get the best performance.

It also frees you, as the developer, from all these details and worries about, for example how each GPU represents what the texture is.

Where does it live in memory?

All of this changes from platform to platform, and we hide it behind a simple interface so that you can write very simple and effective applications.

So I hope you're not worried about the encoder that I mentioned.

It's really, really simple to use.

For example, if you want to create an argument encoder for this argument buffer, all you need to do is get your Metal function that uses the argument buffer and ask the Metal function for the encoder, and that's about it.

This is all you need to do.

You get an object and you start using a familiar API for setting textures or filling constants that is very, very similar to how you've been using Metal with a command encoder.

So this also plays into what I said about ease of use and transition.

There are multiple other ways of creating the encoder.

You can go more explicit with the descriptor, but that's something you should look into in documentation if you need such thing.

We advise you to actually go and get argument encoders from the shaders.

Now with all those interactions, the GPU being able to step in and modify the argument buffers, dynamic indexing, and half a million textures, all in the mix, it's not really possible for Metal to figure out which textures or buffers you actually intend to use in your rendering, but luckily you as a developer have a pretty good idea about that.

So we ask you with argument buffers to be quite explicit about it.

If you are using Heaps, and absolutely you should use Heaps to get the best performance out of your platform and the best way of organizing your data, the only thing you need to do is tell Metal that you intend to use a Heap, or multiple Heaps; it's up to you.

And this makes sure that the textures are available for you in the rendering loop.

If you want to do something more specific, let's say you would like to write to a render target from inside a shader, or you would like to read from a depth buffer, you use a more specific API and tell Metal that you intend to use a resource in a specific way.

And again, it's as simple as this.

You don't need to do anything else.

So let's start out with a couple of best practices.

I think if you know Metal they are very, very similar to what we are telling you about using Metal buffers.

The best way to organize your data is by usage pattern.

And you probably have a ton of properties that do not change per frame.

So put them into an argument buffer and share it with all the objects so you will save memory this way.

On the other hand, you will probably have a lot of properties that actually do change for every object, and you need to manage them every frame.

And for these I think the best way is to put those into separate argument buffers so that you can double buffer it or whatever is your management scheme and you don't need to do all the other copies to keep all the data in there.

And then you will likely have a ton of argument buffers that just don't change at all.

Let's say the materials, or maybe some other properties, and for these just create them at the initialization of your application and keep using them.

Similar to Metal buffers, think about your data locality and how you actually use your argument buffers.

If, for example you have three textures that are accessed in a shader, one after another, then the best thing you can do is actually put those textures close to each other in argument buffers so that you maximize the use of GPU caches.

And as I mentioned at the beginning, traditional argument model is not going anywhere and you should take advantage of it and mix it with the argument buffers whenever it's more convenient.

So let's say if you need to change a single texture for every object, for example a cube reflection, it probably would be an overhead to create argument buffer just for that and upload it every frame.

So just use the traditional model for this.

That's it about argument buffers.

I really hope you will adopt our new API and get some creative use cases out of it.

And please welcome Richard, who will talk about the Raster Order Groups.

[ Applause ]

Thank you.

Hello. So thank you Michal.

So I'm going to take you through the rest of the day's presentation, starting with Raster Order Groups.

So this is a new feature that gives you control over the GPU's thread scheduling to run fragment shader threads in order.

This allows overlapping fragment shader threads to communicate through memory, which before wasn't really possible to do in most cases.

So this opens up a whole new set of graphics algorithms that were not practically achievable with just write-only access to your frame buffers or unordered access to device memory.

For example, one of the key applications for this is order-independent transparency.

We've already talked a lot today about how to reduce the CPU usage of your Metal application, and this feature lets you build an algorithm that blends back to front without having to pay the CPU cost of triangle-level sorting.

There's also been lots of investigations into advanced techniques such as dual layer G-buffers, which can substantially improve post processing results, or using the GPU rasterizer to sort of voxelize triangle meshes.

For both of these, unordered access to memory has been a really large barrier to efficient implementations.

But probably the simplest and most common application for this feature is just implementing custom blend equations.

iOS hardware could always do this pretty natively, but this is not something that desktop hardware has traditionally been able to do.

So I'm going to use custom blending as an example application to introduce this feature.

Okay, so pretty typical case of triangle blending; one triangle over another.

Pretty much all modern GPU APIs guarantee that blending happens in draw call order.

It provides this nice, convenient illusion of serial execution.

But of course what's really going on behind the scenes is that the GPU hardware is highly parallel.

It's going to be running multiple threads concurrently.

And only this fixed-function blend step at the end is going to be delayed until everything gets put back in order again.

There's this implicit wait that happens before that blend step.

Things change, however, if we need to put things in order not at the end of our fragment shader but right in the middle, because in this case triangle one wants to write something to memory that triangle two's threads want to read from.

If we want triangle two to be able to build upon and consume triangle one's data we need to get that ordering back.

And so that's pretty much what Raster Order Groups provides.

So I'm going to jump over to a shader code example.

So if I want to implement custom blending, an initial attempt that does not work is going to be to replace my classic graphics frame buffer with a read/write texture and perform all of my rendering and blending directly to this texture.

But of course, if the threads that I'm blending over have yet to execute, or are concurrently executing, this whole read-modify-write sequence is going to create a race condition.

So how do we use Raster Order Groups to fix this?

It's really, really easy.

All I have to do is add a new attribute to the memory that has conflicting accesses.

At this point the compiler and the hardware are going to cooperate to be able to implicitly take the entire range of [inaudible] shader that accesses that memory from the very first to the very last access and turn it into a critical section behind the scenes.

You can also apply this attribute to normal device memory pointers, not just textures.

So with that we get the thread schedule that we want.

Thread one will proceed and write to memory, and thread two is going to stop and wait until thread one's write is complete, giving us race-free access to this memory.
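The difference between the racy and the ordered schedule can be modeled sequentially in Python. This sketch is an illustration under stated assumptions, not GPU code: `blend` is a hypothetical custom blend equation on scalar colors, and `racy_blend` models only one possible bad interleaving, the one where every thread reads the pixel before any thread writes it.

```python
def blend(dst, src, alpha):
    """Hypothetical custom blend equation: a scalar 'over' operation."""
    return src * alpha + dst * (1.0 - alpha)


def racy_blend(dst, fragments):
    """All threads read the old pixel value first, so the last writer wins."""
    stale_reads = [dst for _ in fragments]
    for stale, (src, alpha) in zip(stale_reads, fragments):
        dst = blend(stale, src, alpha)
    return dst


def ordered_blend(dst, fragments):
    """Each thread waits for the previous overlapping thread's write before reading."""
    for src, alpha in fragments:
        dst = blend(dst, src, alpha)
    return dst


# Two overlapping fragments on one pixel: (source color, alpha), in draw order.
fragments = [(1.0, 0.5), (0.0, 0.5)]
racy_result = racy_blend(0.0, fragments)
ordered_result = ordered_blend(0.0, fragments)
```

With these inputs the ordered schedule yields 0.25, because the second fragment blends over the first fragment's result, while the racy interleaving yields 0.0, because the first fragment's contribution is lost.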

Oh, there's one other really important topic and that's talking about which threads are synchronizing with each other.

So of course GPU hardware's going to be running not just two, but tens of thousands of threads at the same time and in fact it's probably executing every single thread from both of these triangles simultaneously.

So of all of these tens of thousands of threads, which ones synchronize with each other?

So I've highlighted one pixel here because that's the answer to this question.

This feature only synchronizes against other threads that your current fragment shader thread overlaps with: those other threads that are targeting the same frame buffer xy location, the same multi-sample location, and the same render target index.

And it specifically does not provide any guarantee that you can safely access memory written by any neighboring pixels.

If you do need these kinds of area or region-of-influence algorithms, then you will need to go back to using full API barriers between draw calls or render passes.
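The overlap rule can be written down as a small predicate. This Python sketch only restates the three conditions listed above; the `FragmentThread` type and its field names are hypothetical.

```python
from collections import namedtuple

# Hypothetical description of what a fragment shader thread is targeting.
FragmentThread = namedtuple("FragmentThread", "x y sample render_target")


def synchronizes_with(a, b):
    """Two threads overlap only if pixel, sample, and render target index all match."""
    return (
        (a.x, a.y) == (b.x, b.y)
        and a.sample == b.sample
        and a.render_target == b.render_target
    )


first = FragmentThread(x=10, y=20, sample=0, render_target=0)
same_pixel = FragmentThread(x=10, y=20, sample=0, render_target=0)
neighbor = FragmentThread(x=11, y=20, sample=0, render_target=0)
```

In this model `first` and `same_pixel` are ordered against each other, while `first` and `neighbor` are not, which is why reading a neighboring pixel's data gets no ordering guarantee from this feature.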

But this comes at a much higher performance cost and it does not work in the case where you have triangle overlap within a single draw call.

But for these common algorithms that need only overlap-only synchronization, Raster Order Groups can get the job done at a substantially lower performance cost.

So this is a pretty actually easy one and that's really all I've got to say about it.

Raster Order Groups lets you efficiently wait for overlapping, and only overlapping, threads to finish their access to memory, which enables a collection of GPU algorithms that were previously just too inefficient to use practically on GPU hardware.

This mid-shader thread synchronization is a feature of the latest GPU hardware, so it is something you do need to check for at run time.

In particular it's supported on the newest AMD Vega GPUs announced this week as well as the past couple years' worth of Intel GPUs.

And that brings us on to our second feature and that is the new iPad Pro's ProMotion Display.

So ProMotion, this is a particularly great feature for graphics and game developers and so I really want to show you what you can do with it.

This is the first of a sequence of timeline diagrams I'm going to show you, showing us when the GPU starts and finishes producing a frame, and then when that same frame finally gets onto the glass for the user to see.

The first and most obvious thing that ProMotion does is we can now render at 120 frames per second.

This feels absolutely fantastic for anything that has really high speed animations, for anything that's latency critical such as tracking user touch or pencil input.

And it does have some catches.

You of course only get half as much CPU and GPU time available per frame so you really have to pay a lot of attention to optimization and it does increase overall system power consumption.

But if you've got the right content, where this matters, it's a real payoff for the user experience.

But ProMotion goes a lot farther than 120 frames per second rendering.

It also provides much more flexibility regarding when to swap the next image onto the glass.

We're not limited to just 120 or 30 or 60 frames per second.

ProMotion behaves much more gracefully as your application's performance moves up and down compared to a fixed frame rate display.

For example, here I have a timeline diagram of a title that is just doing too much GPU work to target 60 frames per second.

They're producing frames about every 21 milliseconds, or about 48 frames per second.

The GPU is perfectly happy to do that, but on the display side we can only refresh once every 16 milliseconds and so we end up with this beating pattern.

There's this stuttering that the user feels where some frames are on the glass a lot longer than others.

And it's not nice at all.

And so pretty much universally what applications do in this case is they all have to artificially constrain the frame rate all the way down to 30 frames per second.

They're basically trading away their peak frame rate in order to get some level of consistency.

ProMotion does much better here.

So if I just take the same application, move it to a ProMotion display, it does this to our timeline.

We now have a refresh point every four milliseconds rather than every 16.

Our timeline gets pulled in, even with the GPU doing exactly the same work as before.

The display can now present at an entirely consistent 48 frames per second.
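The beating pattern and its disappearance can be reproduced with a little integer arithmetic. The Python sketch below rests on several assumptions that are not stated in the talk: time is measured in ticks of 1/240 s (about 4.17 ms, matching the "every four milliseconds" refresh point), a 60 Hz display refreshes every 4 ticks, the GPU finishes a frame exactly every 5 ticks (1/48 s, about 20.8 ms), and each frame is shown at the first refresh point at or after the GPU finishes.

```python
def present_ticks(frame_ticks, refresh_ticks, frames):
    """Tick at which each frame reaches the glass."""
    ticks = []
    for n in range(1, frames + 1):
        gpu_done = n * frame_ticks
        # Ceiling division: round up to the next available refresh point.
        ticks.append(-(-gpu_done // refresh_ticks) * refresh_ticks)
    return ticks


def on_glass_durations(ticks):
    """How long each frame stays on the glass before the next one replaces it."""
    return [later - earlier for earlier, later in zip(ticks, ticks[1:])]


# 48 fps content on a fixed 60 Hz display (refresh every 4 ticks).
fixed = on_glass_durations(present_ticks(5, 4, 9))

# The same content with a refresh point every tick.
flexible = on_glass_durations(present_ticks(5, 1, 9))
```

Under these assumptions the fixed display shows frames for 4, 4, 4, 8 ticks in a repeating pattern (about 16.7 ms three times, then 33.3 ms), which is the stutter, while the flexible display shows every frame for exactly 5 ticks (about 20.8 ms).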

The user is now getting both the best possible frame rate and perfect consistency from frame to frame.

This tradeoff that we had to make is completely gone.

A second example is an application that wanted to make 60 frames per second, but one frame just ran a bit long and missed its deadline.

On a fixed frame rate display we end up on the display side with a pattern that looks very similar to what we saw before.

ProMotion can fix this too.

So frame one's time on the glass, rather than it being extended by 16 milliseconds, is now only extended by four.

The degree of stutter that the user experiences is tremendously reduced and then frame two and three, their latency gets pulled right back into where they were before.

The system recovers right back onto the timeline right away, latency is improved, and your application can proceed on.

We've just gotten right back to where we wanted to be.

So put it all together, it just makes animation just feel that much more robust and solid no matter what's going on.

So how do you actually go about taking advantage of this?

For normal UIKit animation, such as scrolling through lists or views, iOS will do this entirely for you out of the box.

It will render at 120 frames per second when appropriate.

It will use the flexible display times when appropriate.

Metal applications though tend to be much more aware of their timing and so for those we've made this an opt in feature.

Opting in is done really easily, just by adding a new entry to your application bundle's Info.plist.

Once you do this, the timing behavior of our three Metal presentation APIs changes a little bit.

And so I'm going to walk you through those three APIs and how they change now.

So the first of our Metal presentation APIs is just present.

It says present immediately; schedule my image to be put on the glass at the very next available refresh point after the GPU finishes.

On fixed frame rate hardware that's 16 milliseconds and on iPad Pro that's now four milliseconds.

This is the easiest API to use because it takes no arguments.

So it's the API that most of the people in this room are already using.

It's also the API that gives you the lowest latency access to the display.

It works identically on both our fixed frame rate and ProMotion hardware, but once you opt in it starts working with much, much better granularity.

The second of our Metal presentation APIs is present with minimum duration.

So this one says, whenever this image lands on the glass, keep it there for a certain fixed amount of time.

So if my image lands on the glass here, it's going to stay for 33 milliseconds.

And if my start time shifts so does the end time.

This is the API you'd use if you want perfect consistency in frame rate from frame to frame.

This is particularly useful for 30 frames per second content on 60 frames per second displays, although it's also sometimes useful on ProMotion as well.

But our third presentation variant is the most interesting by far.

It's present at a specific time and it does exactly what it sounds like.

If the GPU's done well before the designated time, the display will wait.

If the GPU runs over your deadline the display will pick it up at the very next available point afterwards.

This is the key API to use if you want to build fully custom animation and timing loops.

This API, present at time, combined with a ProMotion display, basically lets you leave behind the concept of a fixed frame rate entirely and render your content exactly for the time the user is going to see it.

If you want to keep your Metal view perfectly in sync with something else happening on the system, such as audio, or if you want to provide the appearance of zero latency and forward project your animation for exactly when the user is going to see your content, this is what lets you do that.
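
The three presentation behaviors described above can be sketched as a small timing model. This is Python, not the Metal API: the function names and the millisecond values are assumptions chosen for illustration, and the minimum-duration variant is modeled as "the next image may not replace the previous one until that duration has passed".

```python
def next_refresh(t_ms, interval_ms):
    """First display refresh point at or after t_ms."""
    return -(-t_ms // interval_ms) * interval_ms


def present_now(gpu_done_ms, interval_ms):
    """Present immediately: the next refresh after the GPU finishes."""
    return next_refresh(gpu_done_ms, interval_ms)


def present_after_minimum_duration(gpu_done_ms, previous_on_glass_ms,
                                   duration_ms, interval_ms):
    """Keep the previous image on the glass for at least duration_ms."""
    earliest = max(gpu_done_ms, previous_on_glass_ms + duration_ms)
    return next_refresh(earliest, interval_ms)


def present_at_time(gpu_done_ms, target_ms, interval_ms):
    """Wait for target_ms if the GPU is early; otherwise take the next refresh."""
    return next_refresh(max(gpu_done_ms, target_ms), interval_ms)
```

For example, with the GPU finishing at 21 ms, `present_now(21, 16)` lands at 32 ms while `present_now(21, 4)` lands at 24 ms. `present_at_time(30, 100, 4)` waits until 100 ms, and `present_at_time(103, 100, 4)` misses the deadline and is picked up at 104 ms.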

Now of course the trick is implementing that "project next display time" step.

That's your function.

To make that work you do need some feedback from the system to help you determine what your actual performance is.

And so we've added that as well.

So a Metal drawable object is a transient object that tracks the lifetime of one image you've rendered all the way through the display system.

It can now be queried for the specific time that frame lands on the glass and you can also get a call back when that happens.

So now you can know when your images are landing on the glass and when they're being removed. That is the key signal for knowing whether you are or are not making the timing you intended, and for adjusting future frames.
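
The talk leaves the projection function to the application. One possible shape for it, as a hedged sketch: average the spacing of the on-glass times the system reported for recent frames and extrapolate one step forward. The function name, the averaging strategy, and the use of milliseconds are all assumptions, not something shown in the session.

```python
def project_next_display_time(presented_times_ms, fallback_interval_ms):
    """Estimate when the next frame will reach the glass.

    presented_times_ms: reported on-glass times for recent frames, oldest first.
    fallback_interval_ms: spacing to assume until two samples are available.
    """
    if not presented_times_ms:
        raise ValueError("need at least one presented time")
    if len(presented_times_ms) < 2:
        return presented_times_ms[-1] + fallback_interval_ms
    intervals = [later - earlier
                 for earlier, later in zip(presented_times_ms,
                                           presented_times_ms[1:])]
    average = sum(intervals) / len(intervals)
    return presented_times_ms[-1] + average
```

A real timing loop would likely also cap the history length and react to missed deadlines; this sketch only shows where the presented-time feedback plugs in.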

So that's the story of ProMotion and what you need to do to make use of it on these new iPad Pros.

It's incredibly easy to get more consistent and higher frame rates with almost no code changes at all in most applications.

From there it gives you a menu of options to decide what display time model is going to best benefit your particular app.

A really, really fast-paced twitch arcade game, or something tracking touch or Pencil input, probably wants to go for 120 frames per second.

A really high end rendering title might want to stick with 30 or 60 frames per second or somewhere in between and just enjoy the consistency benefits.

And applications that want to really take control of their timing loop have entirely new capabilities here as well.

But regardless of what your app actually is, ProMotion gives you this powerful new tool to support its specific animation needs.

So that's ProMotion.

So moving on, I have a different display topic to talk about and that is a feature we're calling Direct to Display.

So the story of what happens between your GPU finishing rendering your content and that content reaching the display is actually a little bit more complicated.

Your image can take two paths to the display: GPU composition and direct to display.

The first of those is your typical user interface scenario where I've got a collection of views or layers or windows and the like, and at this point the system is going to take all of these and composite them together.

It's going to scale any content to fit the display.

It's going to perform color space conversion.

It's going to apply any Core Image filters or blending and it's going to produce the one, final combined image that the user sees.

This is a really, really critical abstraction for full-featured user interfaces.

But it's also all done on the GPU and it takes some time and memory there.

And if we're basically building a full-screen application, it's a little bit overkill for that.

And so that's where direct to display mode comes in.

If none of these operations are actually required, we can point the display hardware directly at the memory you just rendered to, without any middleman at all.

So how do you enable this?

It turns out there is no single turn it on API for direct to display.

This mode is really an omission of anything that requires the GPU compositor to intervene.

When the compositor takes a look at the set-up of your scene and says there's nothing it needs to do here, it will just step out of the way.

So how can you set up your scene to get the compositor to step out of the way?

So this is pretty straightforward. Asking "does my content need any kind of nontrivial processing?" is a pretty good intuitive start.

But more specifically you do want your layer to be opaque.

We don't want to be blending over anything.

We don't want to apply anything that requires that Core Animation or the window server modify our pixels.

We don't want to put on rounded corners in our view or masking or filters or the like.

We do want to be full-screen.

If your content does not actually match the aspect ratio of the display it is okay to put a full-screen, opaque, black background layer to sort of give a black bar kind of effect.

But in the end we want to basically obscure everything.

We do want to pick render resolutions that match the native panel.

So this is actually a little bit tricky because on both macOS and iOS we ship hardware that has virtual desktop modes or resolution modes that are larger than the actual physical panel.

And the last thing we want to do is spend time rendering too many pixels only to have to spend time on the GPU to scale it all back down again.

And finally, you want to pick a color space and pixel format that the display hardware is happy to read from directly.

And so for this one, there's an infinite number of combinations here, so I want to help out by giving you a little bit of a white list of some particularly common and efficient combinations.

So right on the top is our good old friend: sRGB 8888.

This is pretty much the universal pixel format that most applications use and all hardware is happy to read.

And so for most people that's all they need.

But we've been shipping wide color gamut P3 displays on both our macOS and iOS hardware and if your application does want to start making use of this ability to represent more colors, you need to pay a bit more attention.

The concepts are the same between iOS and macOS, although the details differ a little bit.

In both cases we do want to render to a 10-bit pixel format. Note that if you render P3 content onto a P3 display that's fine, but if you render P3 content onto an sRGB display, the GPU compositor might have to get involved to crush the color space back down to fit the display.

And so P3 is not something you want to do universally, all the time.

You do want to take a look at the current display and make this a conditional thing.

So finally, for completeness I'm also going to list RGBA float 16, which is sort of the universal, wide gamut, high dynamic range pixel format.

It's also necessary for macOS's extended dynamic range feature.

It is worth noting, though, that it does require GPU compositing in all cases.

So I mentioned, you do want to be a little bit conditional if you write an application that's wide color aware.

Fortunately, both UIKit and AppKit provide really convenient APIs to check that.
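
The checklist above can be restated as a small self-audit. This is a hedged Python sketch of the rules of thumb from the talk, not the compositor's actual decision logic: the `LayerSetup` fields, the format labels, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerSetup:
    opaque: bool
    has_effects: bool          # masks, rounded corners, filters, blending
    covers_full_screen: bool
    render_size: tuple         # (width, height) in pixels
    native_panel_size: tuple   # (width, height) in pixels
    pixel_format: str          # "srgb8", "p3_10bit" or "rgba16float"
    display_is_p3: bool


def composition_reasons(layer):
    """List the reasons this setup likely falls back to GPU composition."""
    reasons = []
    if not layer.opaque:
        reasons.append("layer is not opaque")
    if layer.has_effects:
        reasons.append("pixels must be modified (mask, corners, filter)")
    if not layer.covers_full_screen:
        reasons.append("content does not cover the screen")
    if layer.render_size != layer.native_panel_size:
        reasons.append("render size needs scaling to the panel")
    if layer.pixel_format == "rgba16float":
        reasons.append("RGBA float 16 always requires composition")
    elif layer.pixel_format == "p3_10bit" and not layer.display_is_p3:
        reasons.append("P3 content on an sRGB display needs conversion")
    elif layer.pixel_format not in ("srgb8", "p3_10bit"):
        reasons.append("pixel format is not on the short list")
    return reasons
```

An empty list means nothing in the setup obviously asks the compositor to intervene. Whether the display is P3 is passed in as a plain flag here; in an app that value would come from the UIKit or AppKit query mentioned above.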

So the last step is, how do you know if you're actually on the direct to display path?

So this is a screenshot of our Metal System Trace tool in Instruments.

Metal System Trace is a developer tool that will give you a live timeline of the CPU, the GPU, and the display.

Pretty much a real-world version of the diagrams I've been showing you in this presentation.

So in this case, I want to highlight my three frames that I've rendered.

The colored time intervals are my own application's rendering.

And the gray time intervals are some other process's work on the GPU.

I can get more details down at the bottom of the window, where I can see it's coming from backboardd, our iOS composition process.

So this is the case where my application is going down the GPU compositing path.

Going back and revisiting some of our best practices can remove that from the picture and now I can rerun my Metal system trace and see that I have a timeline where, you know I've got the GPU completely and entirely to myself.

So that's it for direct to display.

Our system compositors can make a lot of magic happen behind the scenes to make full-featured user interfaces possible, but that can come at a performance cost because they use the GPU to do it.

By being a little bit aware of what you're asking the compositor to do, or more importantly what you're not asking the compositor to do, it can get out of the way without using the GPU, returning some of that time to you.

Direct to display is supported on iOS and tvOS and always has been, and its support is new to macOS High Sierra for Metal applications.

So with that I want to touch on our last topic of the day and that's everything else.

There's a lot more that we've added to the core frameworks and shading language for Metal 2.

And so I'm not going to dive deep into any of these things, but I do want to give you a survey.

So right off the bat we've added some new APIs to be able to query how much GPU memory's being allocated for each buffer, for each texture, for each Heap.

This actually takes into account things that just generally happen behind the scenes, like alignment and various padding.

So this can give you a more accurate view of how much GPU memory you're actually using.

We also have a roll-up query on the Metal device, which is the entire GPU memory usage for your entire process.

And this is particularly notable because that also counts all of the memory that the driver needs to allocate that's not otherwise visible to you; things like memory to put shader code in or command buffers or anything else.

So this can tell you where you're at, everything all in, compared to your memory usage target.
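
As a rough illustration of why the allocated size can differ from what you asked for, here is a sketch of alignment padding in Python. The 4096-byte alignment is purely an assumption for the example; the talk does not give the real alignment rules, which depend on the device and resource type.

```python
def allocated_size(requested_bytes, alignment):
    """Round a requested size up to the next multiple of the alignment."""
    return -(-requested_bytes // alignment) * alignment


ASSUMED_ALIGNMENT = 4096  # hypothetical value, for illustration only

requested = [1000, 5000, 70000]
allocated = [allocated_size(size, ASSUMED_ALIGNMENT) for size in requested]

print(sum(requested))  # 76000
print(allocated)       # [4096, 8192, 73728]
print(sum(allocated))  # 86016
```

Under that assumption the three resources occupy about 10 KB more than the sum of the requested sizes, which is the kind of gap the per-resource queries are meant to expose.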

We have a couple compute oriented additions.

The first of those is that we've added a set of shading language functions to allow you to transfer data directly between threads in a SIMD group.

If you're not familiar, GPU hardware typically gangs individual vertex, fragment, and compute shader threads into SIMD groups and executes them together for greater efficiency.

These are also called wavefronts or warps.

Within a group these threads do have some ability to directly communicate without having to load and store through memory.

They can read values directly out of one thread's register and write them to another thread's register.

And that's what these new standard library functions allow.

So in this case broadcast means I can read a field directly out of thread zero's registers and write it directly into the registers of 16 other threads that happen to be part of this group.
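
A broadcast of this kind can be modeled in a few lines. This is a Python simulation of the data movement only, with a hypothetical function name and a list standing in for one register across the lanes of a SIMD group; it is not shading language code.

```python
def simd_broadcast_model(lane_values, source_lane):
    """Every lane in the group receives the value held by source_lane."""
    value = lane_values[source_lane]
    return [value] * len(lane_values)


# One register's contents across a 16-lane group (the group size is assumed).
register = [lane * 10 + 3 for lane in range(16)]

after = simd_broadcast_model(register, 0)
print(after[:4])  # [3, 3, 3, 3]
```

The point of the hardware feature is that this exchange happens register to register, with no round trip through memory.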

Our second compute addition is to give you more flexibility in how big your thread groups are.

So for example, say I have a pixel grid here that I want to run some pretty classic image processing kernel over, and I've written my compute kernel such that I'm using four by four thread groups everywhere.

Well, this leads to some problems because if my image is not a nice multiple of my thread group size, I've got a bunch of stray threads on the side.

This means that when I actually write my code I have to be defensive.

Am I out of bounds?

I have to handle it in some special way.

It's doable but annoying.

It also means that we're just wasting GPU cycles.

So non-uniform thread group sizes let you declare what dimensions you want to run your kernel over, without them being a multiple of the thread group size.

The hardware will generate smaller thread groups along the edges of the grid in order to shave off that unnecessary work. It both improves GPU performance and makes your kernels easier to write.
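
The saving is easy to quantify. The following Python sketch, with hypothetical function names and an assumed 10 by 10 image, compares a uniform dispatch that rounds the grid up to whole four by four thread groups against a non-uniform dispatch that trims the groups along the edges.

```python
def uniform_dispatch(width, height, group_w, group_h):
    """Threads launched, and threads wasted, when the grid is rounded up."""
    groups_x = -(-width // group_w)   # ceiling division
    groups_y = -(-height // group_h)
    launched = groups_x * group_w * groups_y * group_h
    return launched, launched - width * height


def non_uniform_group_sizes(width, height, group_w, group_h):
    """Size of every thread group when edge groups are allowed to shrink."""
    widths = [min(group_w, width - x) for x in range(0, width, group_w)]
    heights = [min(group_h, height - y) for y in range(0, height, group_h)]
    return [(w, h) for h in heights for w in widths]


launched, wasted = uniform_dispatch(10, 10, 4, 4)
print(launched, wasted)  # 144 44

sizes = non_uniform_group_sizes(10, 10, 4, 4)
print(sum(w * h for w, h in sizes))  # 100
```

In the uniform case the 44 stray threads are the ones the kernel has to guard against with a bounds check. In the non-uniform case exactly 100 threads run, one per pixel.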

We've added support for viewport arrays.

You can now configure up to 16 simultaneous viewports and your vertex shader can select, per triangle, which viewport that triangle gets presented into.

I'm not going to go further into this because it will be discussed in detail tomorrow in the VR with Metal 2 session.

It is particularly valuable for efficiently rendering to the left and right eyes.

We've added the ability to choose where in each pixel your multisample locations are positioned.

This lets you do a few interesting things, including maybe toggling your sample positions every other frame and giving you some valuable new input into some temporal anti-aliasing algorithms.

In the vein of working to bring our platforms up to date and have them share the same feature set wherever possible, we've brought resource Heaps, shipped last year in iOS 10, to macOS High Sierra this year.

So I'm going to actually do a little bit of a refresher on this because good use of your Heaps is really important to getting the most out of argument buffers.

So Heaps are of course where I can allocate a big slab of memory up front, rather than going to the kernel to say I want memory for texture A, and I want memory for texture B, and so forth.

I can go to the kernel and get memory right up front, and then add and remove textures and buffers without having to go back to the system.

This has a few advantages.

It means that I can bind everything in that Heap much more efficiently.

There's much less software overhead.

It means that we can oftentimes pack that memory a little bit closer together.

We can save some padding and alignment, save you a little bit of memory.

It means when we delete memory we don't give memory back to the system.

That could be good or bad.

It means when we allocate new memory when we allocate a new texture it means we don't have to go back to the system and get new memory.

It also means that you can choose to alias these textures with each other.

This is typically for render targets, or intermediate render targets between different passes in my render graph.

It means that if I have two different intermediates that just don't have to exist at the same point in time I can alias them over each other and I can save tons of memory like this.
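
As a back-of-the-envelope sketch of that saving, the Python below compares the memory needed when every intermediate target has its own allocation against the peak needed when targets whose lifetimes do not overlap share memory. The target list, sizes, and pass numbers are invented for illustration, and the aliased figure is an idealized lower bound that ignores alignment and fragmentation inside the Heap.

```python
def memory_without_aliasing(targets):
    """Every target keeps its own memory for the whole frame."""
    return sum(size for size, _first, _last in targets)


def peak_memory_with_aliasing(targets):
    """Largest total size of targets that are alive during the same pass."""
    first_pass = min(first for _size, first, _last in targets)
    last_pass = max(last for _size, _first, last in targets)
    return max(
        sum(size for size, first, last in targets if first <= p <= last)
        for p in range(first_pass, last_pass + 1)
    )


# (size in MB, first pass that uses it, last pass that uses it): all assumed
targets = [
    (64, 0, 1),   # shadow map
    (128, 1, 2),  # G-buffer
    (32, 3, 4),   # bloom
    (32, 4, 5),   # blur
]

print(memory_without_aliasing(targets))    # 256
print(peak_memory_with_aliasing(targets))  # 192
```

In this made-up frame the bloom and blur targets can reuse memory that the shadow map and G-buffer no longer need, so the peak is set by the one pass where the two largest targets overlap.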

So that's it for a quick survey of Heaps.

We've brought linear textures from iOS to macOS.

Linear textures allow you to create a texture directly from a Metal buffer without any copies at all.

We've extended our function constant feature a little bit.

A quick refresher: function constants allow you to specialize bytecode.

When you've done all your front end compilation offline, you can then tweak and customize your uber shader bytecode a little bit before actually generating final machine code.

If you have a classic uber shader this can save you the cost of having to re-run the compiler front end for every single permutation.

So we've made this a bit more flexible and added a few more cases where you can use these specialized arguments.

We've added some extra vertex array formats.

We had some missing one and two component vertex formats.

And we've also added BGRA vertex formats.

We've brought IOSurface texture support from macOS to iOS.

And we've also brought dual-source blending to iOS as well, which is particularly useful in some deferred shading scenarios.

So that brings us to the end of Introducing Metal 2.

My colleague, Michal, started with giving you a little bit of an overview of the overall scope of Metal 2.

From VR to external GPUs, to machine learning, and to new developer tools and performance analysis.

Of that, the pieces that we really covered today are our next big push toward reducing CPU overhead using argument buffers.

Argument buffers also unlock the ability for the GPU to start taking control of a little bit of its own destiny when it comes to configuring shader arguments, which is one less reason to go back to the CPU.

Raster Order Groups let us start using the rasterizer for things beyond basic in-order blending.

We can now start taking advantage of the latest hardware capabilities to do things like voxelize triangle meshes or do transparency blending, either in-order or order-independent.

It makes them both possible.

For the new iPad Pros, ProMotion gives you very fine grained control over exactly how your animations are presented to the user, giving you the ability to get both peak frame rates and the lowest possible latency.

Direct to display provides you a path to reclaim a little bit of GPU performance from the system by being aware of what our compositors do on your behalf.

So you'll be able to find the video and the slides for this session on the WWDC2017 website.

We have three other sessions on Metal 2 this year.

In particular, tomorrow afternoon we're going to have a session dedicated to VR and Metal 2.

This is going to give a conceptual overview of how to do VR rendering and what your application needs to do, and dive into specifically how to do VR with the combination of Metal 2 and the SteamVR toolkit.

It's also going to go into using Metal with external GPU hardware.

On Thursday we have a doubleheader starting with Metal 2 optimization and debugging.

This is going to go into what's new in our developer and performance tools and all the new workflows that enables to help you build the best applications possible.

And it's going to be followed up right after that with using Metal 2 for compute.

And that's going to really have a big focus this year on using the GPU for machine learning applications.

We've added a whole lot this year and we want to show you everything we've done.

I want to point you to a couple of last year's WWDC sessions.

The first, What's New in Metal Part One is where we did a deep dive on resource Heaps.

If you're looking to get the best performance out of argument buffers, argument buffers and Heaps were built to go together, and so I highly encourage you to go check out the video and basically plan your application around both of those together.

They cover that in a lot more detail than we did here today.

Second, if the conversation about direct to display and wide gamut and wide color interested you, we have a whole session from last year that really goes deep into the concepts and the specifics behind that.

With that I think we'll wrap it up.

Thank you all for attending and I hope you enjoy the remainder of your week.

So thank you.

[ Applause ]
