Advanced Metal Shader Optimization

Session 606 WWDC 2016

The Metal shading language is an easy-to-use programming language for writing graphics and compute functions which execute on the GPU. Dive deeper into understanding the design patterns, memory access models, and detailed shader coding best practices which reduce bottlenecks and hide latency. Intended for experienced shader authors with a solid understanding of GPU architecture and hoping to extract every possible cycle.

[ Music ]

[ Applause ]

So, hello everyone.

My name is Fiona and this is my colleague Alex.

And I work on the iOS GPU compiler team, and our job is to make your shaders run on the latest iOS devices, and to make them run as efficiently as possible.

And I'm here to talk about our presentation, Advanced Metal Shader Optimization, that is, Forging and Polishing your Metal shaders.

Our compiler is based on LLVM.

And we work with the open source community to make LLVM more suitable for use on GPUs by everyone.

Here's a quick overview of the other Metal sessions, in case you missed them, and don't worry, you can watch the recordings online.

Yesterday we had part one and two of adopting Metal and earlier today we had part one and two of what's new in Metal, because there's quite a lot that's new in Metal.

And of course here's the last one, the one you're watching right now.

So in this presentation we're going to be going over a number of things you can do to work with the compiler to make your code faster.

And some of this stuff is going to be specific to A8 and later GPUs including some information that has never been made public before.

And some of it will also be more general.

And we'll be noting that with the A8 icon you can see there for slides that are more A8 specific.

And additionally, we'll be noting some potential pitfalls.

That is, things that may not come up as often as the kind of micro-optimizations you're used to looking for, but if you run into these, you're likely to lose so much performance that nothing else is going to matter by comparison.

So it's always worth making sure you don't run into those.

And those will be marked with the triangle icon, as you can see there.

Before we go on, this is not the first step.

This is the last step.

There's no point in doing low-level shader optimization until you've done the high-level optimizations first, like watching the other Metal talks on optimizing your draw calls, the structure of your engine, and so forth.

Optimizing your shaders should be roughly the last thing you do.

And, this presentation is primarily for experienced shader authors.

Perhaps you've worked on Metal a whole lot and you're looking to get more into optimizing your shaders, or perhaps you're new to Metal but you've done a lot of shader optimization on other platforms and you'd like to know how to optimize better for A8 and later GPUs. Either way, this is the presentation for you.

So you may have seen this pipeline if you watched any of the previous Metal talks.

And we will be focusing, of course, on the programmable stages of this pipeline, as you can see there, the shader cores.

So first, Alex is going to go over some shader performance fundamentals and higher level issues.

After which, I'll return for some low-level, down and dirty shader optimizations.

[ Applause ]

Thanks, Fiona.

Let me start by explaining the idea of shader performance fundamentals.

These are the things that you want to make sure that you have right before you start digging into source level optimizations.

Usually the impact of the kind of changes you'll make here can dwarf or potentially hide other more targeted changes that you make elsewhere.

So I'm going to talk about four of these today.

Address space selection for buffer arguments, buffer preloading, dealing with fragment function resource writes, and how to optimize your compute kernels.

So, let's start with address spaces.

So since this functionality doesn't exist in all shading languages, I'll give a quick primer.

So, GPUs have multiple paths for getting data from memory.

And these paths are optimized for different use cases, and they have different performance characteristics.

In Metal, we expose control over which path is used to the developer by requiring that they qualify all buffer arguments and pointers in the shading language with the address space they want to use.

So a couple of the address spaces specifically apply to getting information from memory.

The first of which is the device address space.

This is an address space with relatively few restrictions.

You can read and write data through this address space, you can pass as much data as you want, and the buffer offsets that you specify at the API level have relatively flexible alignment requirements.

On the other end of things, you have the constant address space.

As the name implies, this is a read only address space, but there are a couple of additional restrictions.

There are limits on how much data you can pass through this address space, and additionally the buffer offsets that you specify at the API level have more stringent alignment requirements.

However, this is the address space that's optimized for cases with a lot of data reuse.

So you want to take advantage of this address space whenever it makes sense.

Figuring out whether or not the constant address space makes sense for your buffer argument is typically a matter of asking yourself two questions.

The first question is, do I know how much data I have.

And if you have a potentially variable amount of data, this is usually a sign that you need to be using the device address space.

Additionally, you want to look at how much each item in your buffer is being read.

And if these items can potentially be read many times, this is usually a sign that you want to put them into the constant address space.

So let's put this into practice with a couple of examples from some vertex shaders.

First, you have regular, old vertex data.

So as you can see, each vertex has its own piece of data.

And each vertex is the only one that reads that piece of data.

So there's essentially no reuse here.

This is the kind of thing that really needs to be in the device address space.

Next, you have projection matrices and other matrices.

Now, typically what you have here is that you have one of these objects, and they're read by every single vertex.

So with this kind of complete data reuse, you really want this to be in the constant address space.

Let's mix things up a little bit and take a look at skinning matrices.

So hopefully in this case you have some maximum number of bones that you're handling.

But if you look at each bone, its matrix may be read by every vertex that references that bone, and that is also a potential for a large amount of reuse.

And so this really ought to be in the constant address space as well.

Finally, let's look at per instance data.

As you can see all vertices in the instance will read this particular piece of data, but on the other hand you have a potentially variable number of instances, so this actually needs to be in the device address space as well.

For an example of why address space selection matters for performance, let's move on to our next topic, buffer preloading.

So Fiona will spend some time talking about how to actually optimize loads and stores within your shaders, but for many cases the best thing that you can do is to actually off load this work to dedicated hardware.

So we can do this for you in two cases, constant buffers and vertex buffers.

But this relies on knowing things about the access patterns in your shaders and what address space you've placed them into.

So let's start with constant buffer preloading.

So the idea here is that rather than loading through the constant address space, what we can actually do is take your data and put it into special constant registers that are even faster for the ALU to access.

So we can do this as long as we know exactly what data will be read.

If your offsets are known at compile time, this is straightforward.

But if your offsets aren't known until run time then we need a little bit of extra information about how much data that you're reading.

So indicating this to the compiler is usually a matter of two steps.

First, you need to make sure that this data is in the constant address space.

And additionally you need to indicate that your accesses are statically bounded.

The best way to do this is to pass your arguments by reference rather than pointer where possible.

If you're passing only a single item or a single struct, this is straightforward, you can just change your pointers to references and change your accesses accordingly.

This is a little different if you're passing an array that you know is bounded.

So what you do in this case is embed that sized array in a struct and pass that struct by reference rather than passing the original pointer.

So we can put this into practice with an example of a forward lighting fragment shader.

So as you can see in sort of the original version what we have are a bunch of arguments that are passed as regular device pointers.

And this doesn't expose the information that we want.

So we can do better than this.

Instead, if we note that the number of lights is bounded, what we can do is put the light data and the count together into a single struct like this.

And we can pass that struct in the constant address space as a reference like this.

And so that gets us constant buffer preloading.
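
The slide code is not part of this transcript, so here is a minimal sketch of the "pointer plus count" versus "bounded struct by reference" shapes, written as plain C++ on the CPU rather than as Metal. The names `Light`, `LightData`, `kMaxLights`, and both functions are hypothetical, and the bound of 8 is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical upper bound on the number of lights; not a value from the talk.
constexpr std::size_t kMaxLights = 8;

struct Light {
    float intensity;
};

// Unbounded form: a pointer plus a separate count. Nothing in the signature
// says how many elements may be read.
float accumulateUnbounded(const Light* lights, std::size_t count) {
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += lights[i].intensity;
    }
    return total;
}

// Bounded form: the count and a fixed-size array live in one struct, so the
// type itself states the maximum amount of data that can be read.
struct LightData {
    std::size_t count;
    Light lights[kMaxLights];
};

float accumulateBounded(const LightData& data) {
    assert(data.count <= kMaxLights);  // accesses stay inside the struct
    float total = 0.0f;
    for (std::size_t i = 0; i < data.count; ++i) {
        total += data.lights[i].intensity;
    }
    return total;
}
```

In a Metal fragment function, the bounded struct would be taken as a reference in the constant address space rather than as a device pointer, which is what exposes the static bound to the compiler.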

Let's look at another example of how this can affect you in practice.

So, there are many ways to implement a deferred renderer, but what we find is that the actual implementation choices that you make can have a big impact on the performance that you achieve in practice.

One pattern that's common now is to use a single shader to accumulate the results of all lights.

And what you can see from the declaration of this function is that it can potentially read any or all lights in the scene, and that means that your input size is unbounded.

Now, on the other hand, if you're able to structure your rendering such that each light is handled in its own draw call, then each light only needs to read that light's data in its shader, and that means that you can pass it in the constant address space and take advantage of buffer preloading.

In practice we see that on A8 and later GPUs this is a significant performance win.

Now let's talk about vertex buffer preloading.

The idea of vertex buffer preloading is to reuse the same dedicated hardware that we would use for fixed-function vertex fetching.

And we can do this for regular buffer loads as long as the way that you access your buffer looks just like fixed-function vertex fetching.

So what that means is that you need to be indexing using the vertex or instance ID.

Now we can handle a couple of additional modifications to the vertex or instance IDs, such as applying a divisor, and that's with or without any base vertex or instance offsets you might have applied at the API level.

Of course the easiest way to take advantage of this is just to use the Metal vertex descriptor functionality wherever possible.

But if you are writing your own indexing code, we strongly suggest that you lay out your data so that vertices fetch linearly, to simplify buffer indexing.

Note that this doesn't preclude you from doing fancier things. For example, if you were rendering quads and you wanted to pass one value to all vertices in the quad, you can still do things like indexing by vertex ID divided by four, because this just looks like a divisor.

So now let's move on to a couple shader stage specific concerns.

In iOS 10 we introduced the ability to do resource writes from within your fragment functions.

And this has interesting implications for hidden surface removal.

So prior to this you might have been accustomed to the behavior that a fragment wouldn't need to be shaded as long as an opaque fragment came in and occluded it.

So this is no longer true specifically if your fragment function is doing resource writes, because those resource writes still need to happen.

So instead your behavior really only depends on what's come before.

And specifically what happens depends on whether or not you've enabled early fragment tests on your fragment function.

If you have enabled early fragment tests, your fragment will be shaded once it's rasterized, as long as it also passes the early depth and stencil tests.

If you haven't specified early fragment tests, then your fragment will be shaded as long as it's rasterized.

So from a perspective of minimizing your shading, what you want to do is use early fragment tests wherever possible.

But there are a couple additional things that you can do to improve the rejection that you get.

And most of these boil down to draw order.

You want to draw these objects, the objects where the fragment functions do resource writes, after opaque objects.

And if you're using these objects to update your depth and stencil buffers, we strongly suggest that you sort these objects from front to back.

Note that this guidance should sound fairly familiar if you've been dealing with fragment functions that do discard or modify your depth per pixel.

Now let's talk about compute kernels.

Since the defining characteristic of a compute kernel is that you can structure your computation however you want, let's talk about what factors influence how you do this on iOS.

First we have compute thread launch overhead.

So on A8 and later GPUs there's a certain amount of time that it takes to launch a group of compute threads.

So if you don't do enough work from within a single compute thread, you can potentially leave the hardware underutilized and leave performance on the table.

And a good way to deal with this, and actually a good pattern for writing compute kernels on iOS in general, is to process multiple conceptual work items in a single compute thread.

And in particular a pattern that we find works well is to reuse values not by passing them through thread group memory, but rather by reusing values loaded for one work item when you're processing the next work item in the same compute thread.

And it's best to illustrate this with an example.

So this is a Sobel filter kernel. This is sort of the most straightforward version of it: as you can see, it reads a 3 by 3 region of its source and produces one output pixel.

So if instead we apply the pattern of processing multiple work items in a single compute thread, we get something that looks like this.

Notice now that we're striding by two pixels at a time.

So processing the first pixel looks much as it did before.

We read the 3 by 3 region.

We apply the filter and we write out the value.

But now let's look at how pixel 2 is handled.

So since we're striding by two pixels at a time, we need to make sure that there is a second pixel to process.

And now we read its data.

Note here that a 2 by 3 region of what this pixel wants was already loaded by the previous pixel.

So we don't need to load it again, we can reuse those old values.

All we need to load now is the 1 by 3 region that's new to this pixel.

After which, we can apply the filter and we're done.

Note that as a result we're now doing 12 texture reads, instead of the old 9, but we're producing 2 pixels.

So this is a significant reduction in the amount of texture reads per pixel.
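
To make the read counts concrete, here is a CPU-side sketch of the two-pixels-per-thread pattern, written in plain C++ rather than as a Metal kernel. `CountingImage`, the clamp-to-edge behavior, the test pattern, and the Sobel-style `sobel` helper are all assumptions made for illustration; each loop iteration stands in for one compute thread.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a source texture that counts every read.
struct CountingImage {
    int width;
    int height;
    std::vector<float> texels;
    std::size_t reads;

    CountingImage(int w, int h)
        : width(w), height(h), texels(static_cast<std::size_t>(w) * h), reads(0) {
        for (std::size_t i = 0; i < texels.size(); ++i) {
            texels[i] = static_cast<float>((i * 7) % 5);  // arbitrary test pattern
        }
    }

    // Reads one texel, clamping coordinates to the image edge.
    float read(int x, int y) {
        ++reads;
        x = x < 0 ? 0 : (x >= width ? width - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Loads the three texels of column x centered on row y.
void loadColumn(CountingImage& src, int x, int y, float column[3]) {
    for (int dy = -1; dy <= 1; ++dy) {
        column[dy + 1] = src.read(x, y + dy);
    }
}

// Sobel-style gradient magnitude (|gx| + |gy|) from left, center, right columns.
float sobel(const float l[3], const float c[3], const float r[3]) {
    float gx = (r[0] + 2.0f * r[1] + r[2]) - (l[0] + 2.0f * l[1] + l[2]);
    float gy = (l[2] + 2.0f * c[2] + r[2]) - (l[0] + 2.0f * c[0] + r[0]);
    return std::fabs(gx) + std::fabs(gy);
}

// One output pixel per "thread": 9 reads for every pixel.
std::vector<float> runOnePerThread(CountingImage& src) {
    std::vector<float> dst(src.texels.size());
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float l[3], c[3], r[3];
            loadColumn(src, x - 1, y, l);
            loadColumn(src, x, y, c);
            loadColumn(src, x + 1, y, r);
            dst[static_cast<std::size_t>(y) * src.width + x] = sobel(l, c, r);
        }
    }
    return dst;
}

// Two output pixels per "thread": 9 reads for the first pixel, then only the
// one new column (3 reads) for the second, reusing the two shared columns.
std::vector<float> runTwoPerThread(CountingImage& src) {
    std::vector<float> dst(src.texels.size());
    for (int y = 0; y < src.height; ++y) {
        const std::size_t row = static_cast<std::size_t>(y) * src.width;
        for (int x = 0; x < src.width; x += 2) {  // stride by two pixels
            float l[3], c[3], r[3], next[3];
            loadColumn(src, x - 1, y, l);
            loadColumn(src, x, y, c);
            loadColumn(src, x + 1, y, r);
            dst[row + x] = sobel(l, c, r);
            if (x + 1 < src.width) {  // make sure there is a second pixel
                loadColumn(src, x + 2, y, next);
                dst[row + x + 1] = sobel(c, r, next);
            }
        }
    }
    return dst;
}
```

On a 4 by 4 image this works out to 144 reads for the one-pixel version and 96 for the two-pixel version, that is, 9 versus 6 reads per output pixel, with identical results.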

Of course this pattern doesn't work for all compute use cases.

Sometimes you do still need to pass data through thread group memory.

And in that case, when you're synchronizing between threads in a thread group, an important thing to keep in mind is that you want to use the barrier with the smallest possible scope for the threads that you need to synchronize.

In particular, if your thread group fits within a single SIMD, the regular thread group barrier function in Metal is unnecessary.

What you can use instead is the new SIMD group barrier function introduced in iOS 10.

And what we find is that targeting your thread group to fit within a single SIMD and using SIMD group barrier is often faster than trying to use a larger thread group in order to squeeze out that additional reuse, but having to use thread group barrier as a result.

So that wraps things up for me, in conclusion, make sure you're using the appropriate address space for each of your buffer arguments according to the guidelines that we described.

Structure your data and rendering to take maximal advantage of constant and vertex buffer preloading.

Make sure you're using early fragment tests to reject as many fragments as possible when you're doing resource writes.

Put enough work in each compute thread so you're not being limited by your compute thread launch overhead.

And use the smallest barrier for the job when you need to synchronize between threads in a thread group.

And with that I'd like to pass it back to Fiona to dive deeper into tuning shader code.

[ Applause ]

Thank you, Alex.

So, before jumping into the specifics here, I want to go over some general characteristics of GPUs and the bottlenecks you can encounter.

And all of you may be familiar with this, but I figure I should just do a quick review.

So with GPUs typically you have a set of resources.

And it's fairly common for a shader to be bottlenecked by one of those resources.

And so for example if you're bottlenecked by memory bandwidth, improving other things in your shader will often not give any apparent performance improvement.

And while it is important to identify these bottlenecks and focus on them to improve performance, there is actually still benefit to improving things that aren't bottlenecks.

In that example, if you are bottlenecked on memory bandwidth but then you improve your arithmetic to be more efficient, you will still save power even if you are not improving your frame rate.

And of course being on mobile, saving power is always important.

So it's not something to ignore, just because your frame rate doesn't go up in that case.

So there's four typical bottlenecks to keep in mind in shaders here.

The first is fairly straightforward, ALU bandwidth.

The amount of math that the GPU can do.

The second is memory bandwidth, again, fairly straightforward, the amount of data that the GPU can load from system memory.

The other two are little more subtle.

The first one is memory issue rate.

Which represents the number of memory operations that can be performed.

And this can come up in the case where you have smaller memory operations, or you're using a lot of thread group memory and so forth.

And the last one, which I'll go into in a bit more detail later, is latency, occupancy, and register usage.

You may have heard about that, but I will save that until the end.

So to try to alleviate some of these bottlenecks, and improve overall shader performance and efficiency, we're going to look at four categories of optimization opportunity here.

And the first one is data types.

And the first thing to consider when optimizing your shader is choosing your data types.

And the most important thing to remember when you're choosing data types is that A8 and later GPUs have 16-bit register units, which means that for example if you're using a 32-bit data type, that's twice the register space, twice the bandwidth, potentially twice the power and so-forth, it's just twice as much stuff.

So, accordingly you will save registers, you will get faster performance, you'll get lower power by using smaller data types.

Use half and short for arithmetic wherever you can.

Energy wise, half is cheaper than float.

And float is cheaper than integer, but even among integers, smaller integers are cheaper than bigger ones.

And the most effective thing you can do to save registers is to use half for texture reads and interpolates because most of the time you really do not need float for these.

And note I do not mean your texture formats.

I mean the data types you're using to store the results of a texture sample or an interpolate.

And one aspect of A8 and later GPUs that is fairly convenient, and makes using smaller data types easier than on some other GPUs, is that data type conversions are typically free, even between float and half, which means that you don't have to worry: am I introducing too many conversions by trying to use half here?

Is this going to cost too much?

Is it worth it or not?

No it's probably fast because the conversions are free, so you can use half wherever you want and not worry about that part of it.

The one thing to keep in mind here though is that half-precision numerics and limitations are different from float.

And a common bug that can come up here for example is people will write 65,535 as a half, but that is actually infinity.

Because that's bigger than the maximum half.

And so by being aware of what these limitations are, you'll better be able to know where you perhaps should and shouldn't use half.

And less likely to encounter unexpected bugs in your shaders.
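
As a worked version of that limit, the sketch below derives the largest finite half value from the IEEE 754 binary16 layout (5 exponent bits, 10 stored mantissa bits) and checks whether a value would overflow when converted. It is plain C++ on the CPU, it assumes round-to-nearest conversion, and the names `kHalfMax`, `kHalfOverflowThreshold`, and `roundsToInfinityAsHalf` are hypothetical.

```cpp
#include <cmath>

// Largest finite binary16 value: (2 - 2^-10) * 2^15 = 65504.
const float kHalfMax = (2.0f - std::ldexp(1.0f, -10)) * std::ldexp(1.0f, 15);

// At the top of the range adjacent halves are 2^5 = 32 apart, so under
// round-to-nearest anything at or beyond 65504 + 16 overflows to infinity.
const float kHalfOverflowThreshold = kHalfMax + 16.0f;

// Returns true if converting `value` to half (round-to-nearest assumed)
// would produce positive or negative infinity.
bool roundsToInfinityAsHalf(float value) {
    return std::fabs(value) >= kHalfOverflowThreshold;
}
```

So 65,535 lies past the threshold and cannot be stored in a half, while 65,504 is the largest value that can.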

So one example application for using smaller integer data types is thread IDs.

And as those of you who have worked on compute kernels will know, thread IDs are used all over your programs.

And so making them smaller can significantly increase the performance of arithmetic, and can save registers and so forth.

And so for local thread IDs, there's no reason to ever use uint for them as in this case, because a thread group can't have that many threads.

For global thread IDs, usually you can get away with a ushort, because most of the time you don't have that many global thread IDs.

Of course it depends on your program.

But in most cases, you won't go over 2 to the 16 minus 1, so instead you can do this.

And this is going to be lower power, it's going to be faster because all of the arithmetic involving your thread ID is now going to be faster.

So I highly recommend this wherever possible.

Additionally, keep in mind that in C like languages, which of course includes Metal, the precision of an operation is defined by the larger of the input types.

For example, if you're multiplying a float by a half, that's a float operation not a half operation, it's promoted.

So accordingly, make sure not to use float literals when not necessary, because that will turn here what appears to be a half operation, it takes a half and returns a half, into a float operation.

Because by the language semantics, that's actually a float operation since at least one of the inputs is float.

And so you probably want to do this.

This will actually be a half operation.

This will actually be faster.

This is probably what you mean.

So be careful not to inadvertently introduce float precision arithmetic into your code when that's not what you meant.
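
The same promotion rule can be seen on the CPU one step up the type ladder. In this plain C++ analogy, float stands in for half and double stands in for float; the function names are hypothetical. In Metal itself, the half-precision literal suffix is `h` (for example `2.0h`).

```cpp
#include <type_traits>

// An unsuffixed literal is a double, so the multiply is promoted to double
// and only converted back to float on return.
float scaleWide(float x) { return x * 2.0; }

// A suffixed literal keeps the whole multiply in float.
float scaleNarrow(float x) { return x * 2.0f; }

// The result type of each expression, checked at compile time.
static_assert(std::is_same<decltype(1.0f * 2.0), double>::value,
              "float * double literal is a double operation");
static_assert(std::is_same<decltype(1.0f * 2.0f), float>::value,
              "float * float literal stays a float operation");
```

Both functions return the same value here; the difference is only in the precision the operation is carried out at, which is exactly the cost the talk is warning about.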

And while I did mention that smaller data types are better, there's one exception to this rule and that is char.

Remember as I said that native data type size on A8 and later GPUs is 16-bit, not 8-bit.

And so char is not going to save you any space or power or anything like that and furthermore there's no native 8-bit arithmetic.

So it sort of has to be emulated.

It's not overly expensive if you need it, feel free to use it.

But it may result in extra instructions.

So don't unnecessarily shrink things to char that don't actually need it.

So next we have arithmetic optimizations, and pretty much everything in this category affects ALU bandwidth.

The first thing you can do is always use Metal built-ins whenever possible.

They're optimized implementations for a variety of functions.

They're already optimized for the hardware.

It's generally better than implementing them yourself.

And in particular, there are some of these that are usually free in practice.

And this is because GPUs typically have modifiers.

Operations that can be performed for free on the input and output of instructions.

And for A8 and later GPUs these typically include negate, absolute value, and saturate as you can see here, these three operations in green.

So, there's no point to trying to "be clever" and speed up your code by avoiding those, because again, they're almost always free.

And because they're free, you can't do better than free.

There's no way to optimize better than free.

A8 and later GPUs, like a lot of others nowadays, are scalar machines.

And while shaders are typically written with vectors, the compiler is going to split them all apart internally.

Of course, there's no downside to writing vector code. Often it's clearer, often it's more maintainable, often it fits what you're trying to do, but it's also no better than writing scalar code from a compiler perspective, in terms of the code you're going to get.

So there's no point in trying to vectorize code that doesn't really fit a vector format, because it's just going to end up the same thing in the end, and you're kind of wasting your time.

However, as a side note, which I'll go into in more detail later, A8 and later GPUs do have vector load and store even though they do not have vector arithmetic.

So this only applies to arithmetic here.

Instruction Level Parallelism is something that some of you may be used to optimizing for, especially if you've done work on CPUs.

But on A8 and later GPUs this is generally not a good thing to try to optimize for, because it typically works against register usage, and register usage typically matters more.

So a common pattern you may have seen is a kind of loop where you have multiple accumulators in order to better deal with latency on a CPU.

But on A8 and later GPUs this is probably counterproductive.

You'd be better off just using one accumulator.

Of course this applies to much more complex examples than the artificial simple ones here.

Just write what you mean, don't try to restructure your code to get more ILP out of it.

It's probably not going to help you at best, and at worst, you just might get worse code.
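
For readers who have not seen the multiple-accumulator pattern, here is roughly what the two loop shapes look like. This is a plain C++ sketch with hypothetical function names; it only illustrates the shapes being compared and makes no performance claim of its own.

```cpp
#include <cstddef>
#include <vector>

// CPU-style loop with two independent accumulators to expose
// instruction-level parallelism. Each extra accumulator is an extra
// live value that has to be kept in a register.
float sumTwoAccumulators(const std::vector<float>& values) {
    float a = 0.0f;
    float b = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < values.size(); i += 2) {
        a += values[i];
        b += values[i + 1];
    }
    if (i < values.size()) {
        a += values[i];  // leftover element when the count is odd
    }
    return a + b;
}

// The straightforward form the talk recommends: one accumulator.
float sumOneAccumulator(const std::vector<float>& values) {
    float total = 0.0f;
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}
```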

So one fairly nice feature of A8 and later GPUs is that they have very fast select instructions that is the ternary operator.

And historically it's been fairly common to use clever tricks like this to perform select operations without ternaries, to avoid branches or whatever.

But on modern GPUs this is usually counterproductive, and especially on A8 and later GPUs, because the compiler can't see through this cleverness.

It's not going to figure out what you actually mean.

And really, this is really ugly.

You could just have written this.

And this is going to be faster, shorter, and it's actually going to show what you mean.

Like before, being overly clever will often obfuscate what you're trying to do and confuse the compiler.
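
The slide is not reproduced here, so the trick below is only one assumed example of the kind of arithmetic select being described, next to the plain ternary. It is written in plain C++ with hypothetical function names.

```cpp
// "Clever" branch-free select: build a 0/1 mask and blend arithmetically.
// This costs a compare, a conversion, a subtract, two multiplies, and an add.
float selectWithMath(float a, float b, float x) {
    float mask = static_cast<float>(x > 0.0f);  // 1.0 when x > 0, else 0.0
    return a * mask + b * (1.0f - mask);
}

// Plain select: says what is meant and maps directly to a select instruction
// on hardware that has one.
float selectWithTernary(float a, float b, float x) {
    return x > 0.0f ? a : b;
}
```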

Now, this is a potential major pitfall, hopefully this won't come up too much.

On modern GPUs most of them do not have integer division or modulus instructions, integer not float.

So avoid division or modulus by denominators that are not literals or function constants, the new feature mentioned in some of the earlier talks.

So in this example, what we have over here, this first one where the denominator is a variable, that will be very, very slow.

Think hundreds of clock cycles.

But these other two examples, those will be very fast.

Those are fine.

So don't feel like you have to avoid that.
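
One reason a known denominator is cheap is that the division can be replaced by a multiply and a shift. The plain C++ sketch below shows that rewrite by hand for division by 3; it illustrates the general technique and is not a statement about what the Metal compiler actually emits. The function names are hypothetical.

```cpp
#include <cstdint>

// Denominator only known at run time: a general division sequence is needed.
uint32_t divideByVariable(uint32_t x, uint32_t d) {
    return x / d;
}

// Denominator is a literal: the compiler is free to strength-reduce this.
uint32_t divideByThree(uint32_t x) {
    return x / 3u;
}

// The strength-reduced form written out by hand. 0xAAAAAAAB is
// (2^33 + 1) / 3, and the product is taken in 64 bits before shifting.
uint32_t divideByThreeMulShift(uint32_t x) {
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * 0xAAAAAAABull) >> 33);
}

// Power-of-two denominators reduce to a single mask or shift.
uint32_t moduloEight(uint32_t x) {
    return x & 7u;
}
```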

So, finally the topic of fast-math.

So in Metal, fast-math is on by default.

And this is because compiler fast-math optimizations are critical to the performance of Metal shaders.

They can often give a 50% performance gain or more over having fast-math off.

So it's no wonder it's on by default.

And so what exactly do we do in fast-math mode?

Well, the first is that some of the Metal built-in functions have different precision guarantees between fast-math and non fast-math.

And so in some of them they will have slightly lower precision in fast-math mode to get better performance.

The compiler may increase the intermediate precision of your operations, for example by forming fused multiply-add instructions.

It will not decrease the intermediate precision.

So for example if you write a float operation you will get an operation that is at least a float operation.

Not a half operation.

So if you want half operations you had better write them yourself. The compiler will not do that for you, because it's not allowed to.

It can't reduce your precision like that.

We do ignore strict NaN (not a number), infinity, and signed zero semantics, which is fairly important, because without that you can't actually prove that x times zero is equal to zero.

But we will not introduce new NaNs, because in practice that's a really nice way to annoy developers and break their code, and we don't want to do that.
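
To see why x times zero cannot be folded to zero under strict rules, the plain C++ sketch below evaluates it for the three awkward inputs. It assumes the code is compiled without any fast-math flags, so that IEEE 754 behavior is preserved; `timesZero` is a hypothetical name.

```cpp
#include <cmath>
#include <limits>

// Under strict IEEE 754 rules this is not always +0:
//   infinity * 0 -> NaN
//   NaN * 0      -> NaN
//   -1 * 0       -> -0 (the sign bit is set)
// A compiler may only replace it with a constant zero when it is allowed to
// ignore those cases.
float timesZero(float x) {
    return x * 0.0f;
}
```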

And the compiler will perform arithmetic re-association, but it will not do arithmetic distribution.

And really this just comes down to what doesn't break code and makes it faster versus what does break code.

And we don't want to break code.

So if you absolutely cannot use fast-math for whatever reason, there are some ways to recover some of that performance.

Metal has a fused multiply-add built-in, which you can see here.

It allows you to directly request a fused multiply-add instruction.

And of course if fast-math is off, the compiler is not even allowed to make those, it cannot change one bit of your rounding, it is prohibited.

So if you want to use fused multiply-add and fast-math is off, you're going to have to use the built-in.

And that will regain some of the performance, not all of it, but at least some.
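
The rounding difference that makes fusing off-limits without fast-math can be shown on the CPU with the C++ standard library's `std::fma`, used here as a stand-in for the Metal built-in. The function names are hypothetical, and the `volatile` is only there to force the product to be rounded to float before the add.

```cpp
#include <cmath>

// Multiply, round to float, then add and round again: two roundings.
float unfusedMultiplyAdd(float a, float b, float c) {
    volatile float product = a * b;  // forces the intermediate rounding
    return product + c;
}

// Fused multiply-add: the exact product feeds the add, so there is a
// single rounding at the end.
float fusedMultiplyAdd(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-12 and c = -(1 + 2^-11), the exact product a * a is 1 + 2^-11 + 2^-24. The unfused version rounds the trailing 2^-24 away and returns 0, while the fused version returns 2^-24. That change in result bits is why the compiler may not form the instruction on its own when fast-math is off.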

So, on our third topic, control flow.

Predicated GPU control flow is not a new topic and some of you may already be familiar with it.

But here's a quick review of what it means for you.

Control flow that is uniform across the SIMD, that is every thread is doing the same thing, is generally fast.

And this is true even if the compiler can't see that.

So if your program doesn't appear uniform, but just happens to be uniform when it runs, that's still just as fast.

And similarly, the opposite of this divergence, different lanes doing different things, well in that case, it potentially may have to run all of the different paths simultaneously unlike a CPU which only takes one path at a time.

And as a result it does more work, which of course means that inefficient control flow can affect any of the bottlenecks, because it just outright means the GPU is doing more stuff, whatever that stuff happens to be.

So, the one suggestion I'll make on the topic of control flow is to avoid switch fall-throughs.

And these are fairly common in CPU code.

But on GPUs they can potentially be somewhat inefficient, because the compiler has to do fairly nasty transformations to make them fit within the control flow model of GPUs.

And often this will involve duplicating code and all sort of nasty things you probably would rather not be happening.

So if you can find a nice way to avoid these switch fall-throughs in your code, you'll probably be better off.
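
As a small illustration, here is a switch with a fall-through next to one possible rewrite where every case is self-contained. This is plain C++ with hypothetical function names and made-up case bodies; whether a rewrite like this is the nicest option depends on the real code.

```cpp
// With fall-through: case 2 runs its own statement and then case 1's.
int applyWithFallThrough(int mode, int value) {
    switch (mode) {
        case 2:
            value *= 2;
            // falls through into case 1
        case 1:
            value += 1;
            break;
        default:
            break;
    }
    return value;
}

// Without fall-through: each case ends in a break, so control flow is a
// simple set of independent branches.
int applyWithoutFallThrough(int mode, int value) {
    switch (mode) {
        case 2:
            value = value * 2 + 1;
            break;
        case 1:
            value += 1;
            break;
        default:
            break;
    }
    return value;
}
```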

So now we're on to our final topic.

Memory access.

And we'll start with the biggest pitfall that people most commonly run into and that is dynamically indexed non-constant stack arrays.

Now that's quite a mouthful, but a lot of you probably are familiar with code that looks vaguely like this.

You have an array that consist of values that are defined in runtime and vary between each thread or each function call.

And you index into the array with another value that is also a variable.

That is a dynamically indexed non-constant stack array.

Now before we go on, I'm not going to ask you to take for granted the idea that stacks are slow on GPUs.

I'm going to explain why.

So, on CPUs typically you have like a couple threads, maybe a dozen threads, and you have megabytes of cache split between those threads.

So every thread can have hundreds of kilobytes of stack space before they get really slow and have to head off to main memory.

On a GPU you often have tens of thousands of threads running.

And they're all sharing a much smaller cache too.

So when it comes down to it each thread has very, very little space for data for a stack.

It's just not meant for that, it's not efficient and so as a general rule, for most GPU programs, if you're using the stack, you've already lost.

It's so slow that almost anything else would have been better.

And an example from a real-world app: at the start of the program it needed to select one of two float4 vectors, so it used a 32-byte array, an array of two float4s, and tried to select between them using this stack array.

And that caused a 30% performance loss in this program even though it's only done once at the start.

It can be pretty significant.

And of course every time we improve the compiler we are going to try harder and harder to do anything we can to avoid generating these stack accesses, because it is that bad.
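
The slide code is not part of this transcript, so here is a minimal sketch of the pattern being described, written in plain C++ with a stand-in float4 struct. The function names are hypothetical, and the second function is just one possible way to avoid the array, not necessarily what the app in question ended up doing.

```cpp
#include <cassert>

// Stand-in for Metal's float4 (an assumption for this sketch): 16 bytes.
struct float4 { float x, y, z, w; };

// The pattern to avoid: the array holds run-time values (non-constant)
// and is indexed by a run-time value (dynamically indexed).
inline float4 pick_with_stack_array(float4 a, float4 b, int index) {
    assert(index == 0 || index == 1);
    float4 options[2] = { a, b };  // 32-byte array of two float4s
    return options[index];
}

// One alternative: select between the two values directly, with no array.
inline float4 pick_with_select(float4 a, float4 b, int index) {
    assert(index == 0 || index == 1);
    return (index == 0) ? a : b;
}
```

On a CPU these two functions will likely compile to much the same thing; the point is only to show the shape of code that can end up on the stack in a shader.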

Now I'll show you two examples here that are okay.

This other one, you can see those are constants, not variables.

It's not a non-constant stack array and that's fine because the values don't vary per thread, they don't need to be duplicated per thread.

So that's okay.

And this one is also okay.

Wait, why?

It's still a dynamically indexed non-constant stack array.

But it's only dynamically indexed because of this loop.

And the compiler is going to unroll that loop.

In fact, our compiler aggressively unrolls any loop that is accessing the stack to try to make it stop doing that.

So in this case after it's unrolled it will no longer be dynamically indexed, so it will be fast.

And this is worth mentioning, because this is a fairly common pattern in a lot of graphics code and I don't want to scare you into not doing that when it's probably fine.
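
Since the two slides referred to above are not in the transcript, here is a hedged reconstruction of what such "okay" cases might look like, in plain C++. The array contents, sizes, and function names are all assumptions made for illustration.

```cpp
#include <cassert>

// Okay case 1: the array contents are compile-time constants, so they are
// the same for every thread and need no per-thread copy.
constexpr float kWeights[4] = { 0.1f, 0.2f, 0.3f, 0.4f };

inline float weight_for(int i) {
    assert(i >= 0 && i < 4);
    return kWeights[i];  // dynamic index, but into constant data
}

// Okay case 2: the array holds run-time values, but it is only indexed by
// the counter of a loop with a fixed trip count. Once that loop is unrolled,
// every index is a constant (values[0], values[1], ...).
inline float sum4(float a, float b, float c, float d) {
    float values[4] = { a, b, c, d };
    float total = 0.0f;
    for (int i = 0; i < 4; ++i) {
        total += values[i];
    }
    return total;
}
```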

So now that we've gone over the topic of how to not do certain types of loads and stores, let's go on to making the loads and stores that we do actually fast.

Now while A8 and later GPUs use scalar arithmetic, as I went over earlier, they do have vector memory units.

And one big vector load or store is of course faster than multiple smaller ones that sum up to the same size.

And this typically affects the memory issue rate bottleneck, because using vector loads means fewer loads to issue.

And, so as of iOS 10, one of our new compiler optimizations, is we will try to vectorize some loads and stores that go to neighboring memory locations wherever we can, because again it can give good performance improvements.

But nevertheless, this is one of the cases where working with the compiler can be very helpful, and I'll give an example.

So as you can see here, here's a simple loop that does some arithmetic and reads from an array of structures, and on each iteration it does two loads.

Now we would want that to be one if we could, because one is better than two.

And the compiler wants that too.

It wants to try to vectorize this but it can't, because A and C aren't next to each other in memory so there's nothing it can do.

The compiler's not allowed to rearrange your structs, so we've got two loads.

There's two solutions to this.

Number one, of course, just make it a float2; now it's a vector load, you're done.

One load instead of two, we're all good.

Also, as of iOS 10, this should also be equally fast, because here, we've reordered our struct to put the values next to each other, so the compiler can now vectorize the loads when it's doing it.

And this is an example again of working with the compiler, you've allowed the compiler to do something it couldn't before, because you understand what's going on.

You understand how the patterns need to be to make the compiler happy and make it able to do a [inaudible].
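
To make the struct example concrete, here is a minimal sketch in plain C++. The field names a and c come from the talk; the other fields, the struct names, the float2 stand-in, and the arithmetic in the loops are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Original layout (assumed): a and c are separated by b, so reading both
// takes two separate loads.
struct Before { float a, b, c, d; };

// Solution two: reorder the struct so a and c are neighbors in memory.
struct Reordered { float a, c, b, d; };

// Solution one: store the pair as a two-component vector.
struct float2 { float x, y; };  // stand-in for Metal's float2
struct WithVector { float2 ac; float b; float d; };

static_assert(offsetof(Before, c) - offsetof(Before, a) == 2 * sizeof(float),
              "a and c are not adjacent in the original layout");
static_assert(offsetof(Reordered, c) - offsetof(Reordered, a) == sizeof(float),
              "a and c are adjacent after reordering");

inline float sum_before(const Before* items, int count) {
    assert(count >= 0);
    float total = 0.0f;
    for (int i = 0; i < count; ++i) total += items[i].a * items[i].c;
    return total;
}

inline float sum_reordered(const Reordered* items, int count) {
    assert(count >= 0);
    float total = 0.0f;
    for (int i = 0; i < count; ++i) total += items[i].a * items[i].c;
    return total;
}

inline float sum_with_vector(const WithVector* items, int count) {
    assert(count >= 0);
    float total = 0.0f;
    for (int i = 0; i < count; ++i) {
        float2 ac = items[i].ac;  // one two-wide read
        total += ac.x * ac.y;
    }
    return total;
}
```

All three loops compute the same result; only the memory layout, and therefore what the compiler is able to vectorize, differs.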

So, another thing to keep in mind with loads and stores is that A8 and later GPUs have dedicated hardware for device memory addressing, but this hardware has limits.

The offset for accessing device memory must fit within a signed integer.

Smaller types like short and ushort are also okay, in fact they're highly encouraged, because those do also fit within a signed integer.

However, of course uint does not because it can have values out of range of signed integer.

And so if the compiler runs into a situation where the offset is a uint and it cannot prove that it will safely fit within a signed integer, it has to manually calculate the address, rather than letting the dedicated hardware do it.

And that can waste power, it can waste ALU performance and so forth.

It's not good.

So, change your offset to int, now the problem's solved.

And of course taking advantage of this will typically save you ALU bandwidth.
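
A minimal sketch of that change, in plain C++ with ordinary pointers standing in for device memory. The function names are hypothetical; the only point is the type of the offset parameter.

```cpp
#include <cassert>

// Offset typed as unsigned int: per the talk, if the compiler cannot prove
// the value fits in a signed integer, it has to compute the address itself.
inline float load_with_uint_offset(const float* base, unsigned int offset) {
    assert(base != nullptr);
    return base[offset];
}

// Offset typed as int: fits the addressing hardware's signed-offset limit.
inline float load_with_int_offset(const float* base, int offset) {
    assert(base != nullptr);
    return base[offset];
}

// Smaller types such as short (or unsigned short) also fit in a signed int.
inline float load_with_short_offset(const float* base, short offset) {
    assert(base != nullptr);
    return base[offset];
}
```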

So now on to our final topic that I sort of glossed over earlier, latency and occupancy.

So one of the core design tenets of modern GPUs is they hide latency by using large scale multithreading.

So when they're waiting for something slow to finish, like a texture read, they just go and run another thread instead of sitting there doing nothing while waiting.

And this is fairly important because texture reads typically take a couple hundred cycles to complete on average.

And so the more latency you have in a shader, the more threads you need to hide that latency, and how many threads can you have?

Well it's limited by the fact that you have a fixed set of resources that are shared between threads in a thread group.

So clearly depending on how much each thread uses, you have a limitation on the number of threads.

And the two things that are split are the number of registers and thread group memory.

So if you use more registers per thread, now you can't have as many threads.

Simple enough.

And if you use more thread group memory per thread, again you run into the same problem: more thread group memory per thread means fewer threads.

And you can actually check out the occupancy of your shader by using MTLComputePipelineState and querying maxTotalThreadsPerThreadgroup, which will tell you what the actual occupancy of your shader is based on the register usage and the thread group memory usage.
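
The relationship described here, a fixed pool of resources divided by per-thread usage, can be sketched as a toy model in plain C++. This is not the Metal API and it is not how any real GPU computes occupancy; the function name, its parameters, and any numbers you pass in are purely hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Toy model only: how many threads fit when a fixed number of registers and
// a fixed amount of thread group memory are shared between them.
inline int max_resident_threads(int total_registers,
                                int registers_per_thread,
                                int total_threadgroup_bytes,
                                int threadgroup_bytes_per_thread) {
    assert(total_registers > 0 && registers_per_thread > 0);
    assert(total_threadgroup_bytes >= 0 && threadgroup_bytes_per_thread >= 0);
    int by_registers = total_registers / registers_per_thread;
    if (threadgroup_bytes_per_thread == 0) {
        return by_registers;  // thread group memory is not a limit
    }
    int by_memory = total_threadgroup_bytes / threadgroup_bytes_per_thread;
    return std::min(by_registers, by_memory);
}
```

In this model, doubling the registers used per thread halves the number of threads that fit, which is the effect the talk describes.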

And so when we say a shader is latency limited, it means you have too few threads to hide the latency of a shader.

And there's two things you can do there: you can either reduce the latency of your shader, or save registers or whatever else it is that is preventing you from having more threads.

So, since it's kind of hard to go over latency in a very large complex shader, I'll go over a little bit of a pseudocode example that will hopefully give you a bit of an intuition of how to think about latency and how to sort of mentally model it in your shaders.

So, here's an example of a real dependency.

We have a texture sample, and then we use the result of that texture sample to run an if statement, and then we do another texture sample inside that if statement.

We have to wait twice.

Because we have to wait once before doing the if statement.

And we have to wait again before using the value from the second texture sample.

So that's two serial texture accesses for a total of twice the latency.

Now here's an example of a false dependency.

It looks a lot like the other, except we're not using a in the if statement.

But typically, we can't wait across control flow.

The if statement acts as an effective barrier in this case.

So, we automatically have to wait here anyways even though there's no data dependency.

So we still get twice the latency.

As you noticed the GPU does not actually care about your data dependencies.

It only cares about what the dependencies appear to be and so the second one will be just as long latency as the first one, even though there isn't a data dependency there.

And then finally here's a simple one where you just have two texture reads at the top, and they can both be done in parallel and then we can have a single wait.

So it's 1x instead of 2x for latency.
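
The pseudocode slides are not in the transcript, so here is a hedged reconstruction of the three cases in plain C++. FakeTexture and sample are stand-ins invented for this sketch (a real shader would sample a Metal texture), and the comments mark where, per the talk, the GPU would have to wait.

```cpp
#include <cassert>

// Stand-in for a texture: sampling just returns a stored value.
struct FakeTexture { float value; };
inline float sample(const FakeTexture& t) { return t.value; }

// Real dependency: the branch condition uses the first sample.
inline float real_dependency(const FakeTexture& t0, const FakeTexture& t1) {
    float a = sample(t0);
    float b = 0.0f;
    if (a > 0.5f) {        // wait 1: a is needed to evaluate the condition
        b = sample(t1);
    }
    return a + b;          // wait 2: b is needed for the result
}

// False dependency: the condition does not use a, but the if statement
// still acts as a barrier, so there are still two waits.
inline float false_dependency(const FakeTexture& t0, const FakeTexture& t1,
                              bool flag) {
    float a = sample(t0);
    float b = 0.0f;
    if (flag) {            // wait 1: cannot wait across control flow
        b = sample(t1);
    }
    return a + b;          // wait 2
}

// No dependency: both samples are issued up front, then a single wait.
inline float no_dependency(const FakeTexture& t0, const FakeTexture& t1) {
    float a = sample(t0);
    float b = sample(t1);
    return a + b;          // single wait covers both samples
}
```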

So, what are you going to do with this knowledge?

So in many real world shaders you have opportunities to trade off between latency and throughput.

And a common example of this might be that you have some code where based on one texture read you can decide, oh we don't need to do anything in this shader, we're going to quit early.

And that can be very useful.

Because now all that work that's being done in the cases where you don't need it to be done, you're saving all that work.

That's great.

So now you're increasing your throughput by reducing the amount of work you need to do.

But you're also increasing your latency because now it has to do the first texture read, then wait for that texture read, then do your early termination check, and then do whatever other texture reads you have.

And well is it faster?

Is it not?

Often you just have to test.

Because which is faster is really going to depend on your shader, but it's a thing worth being aware of that often is a real tradeoff and you often have to experiment to see what's right.
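
As a sketch of the two alternatives you would be comparing, here they are in plain C++. FakeTexture, sample, and the mask/color/light names are hypothetical stand-ins, not code from the session; which variant wins in a real shader is something only measurement can tell you.

```cpp
#include <cassert>

// Stand-in for a texture: sampling just returns a stored value.
struct FakeTexture { float value; };
inline float sample(const FakeTexture& t) { return t.value; }

// Variant A: early out. Less work when the mask is zero, but the remaining
// reads cannot start until the first one has come back and been tested.
inline float shade_early_out(const FakeTexture& mask,
                             const FakeTexture& color,
                             const FakeTexture& light) {
    float m = sample(mask);
    if (m == 0.0f) {
        return 0.0f;  // quit early, skipping the other reads
    }
    float c = sample(color);
    float l = sample(light);
    return m * c * l;
}

// Variant B: issue all reads up front. More work in the masked-out case,
// but all three reads can be in flight at the same time.
inline float shade_all_up_front(const FakeTexture& mask,
                                const FakeTexture& color,
                                const FakeTexture& light) {
    float m = sample(mask);
    float c = sample(color);
    float l = sample(light);
    return m * c * l;
}
```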

Now, while there isn't a universal rule, there is one particular guideline I can give for A8 and later GPUs and that is typically the hardware needs at least two texture reads at a time to get full ability to hide latency.

One is not enough.

If you have to do one, no problem.

But if you have some choice in how you arrange your texture reads in your shader, if you allow it to do at least two at a time, you may get better performance.

So, in summary.

Make sure you pick the correct address spaces, data structures, layouts and so forth, because getting this wrong is going to hurt so much that often none of the other stuff in the presentation will matter.

Work with the compiler.

Write what you mean.

Don't try to be too clever, or the compiler won't know what you mean and will get lost, and won't be able to do its job.

Plus, it's easier to write what you mean.

Keep an eye out for the big pitfalls, not just the micro-optimizations.

They're often not as obvious, and they often don't come up as often, but when they do, they hurt.

And they will hurt so much that no number of micro-optimizations will save you.

And feel free to experiment.

There's a number of real tradeoffs that happen, where there's simply no single rule.

And try them both, see what's faster.

So, if you want more information, go online.

The video of the talk will be up there.

Here are the other sessions if you missed them earlier, again, the videos will be online.

Thank you.
