Advances in OpenGL ES

Session 505 WWDC 2013

OpenGL ES provides access to the exceptional graphics power of iOS devices. See how the innovations in iOS 7 deliver incredible graphics in games and other mobile 3D applications. Learn about advanced effects enabled by the latest extensions, and get specific tips and best practices to follow in your apps.

[ Applause ]

Good morning and welcome.

My name is Dan Omachi.

I work in Apple's GPU software group on the OpenGL ES framework.

I also work very closely with our GPU driver engineers on improving performance and implementing features on our graphics hardware.

And today I'm going to talk to you about Advances in OpenGL ES on iOS 7.

Apple offers a number of rendering APIs that are highly optimized for a variety of specific rendering scenarios: Core Graphics, Core Animation, and now Sprite Kit are among those.

They do a ton for you, and they do it very well.

OpenGL ES, however, offers the most direct access to graphics hardware.

This enables a lot of flexibility to create custom effects and bring something new and innovative into your rendering.

Now, this flexibility can be a challenge to master.

It's a low-level library, and there can be some stumbling blocks, but if you can utilize the API to its fullest, you can bring some really wild custom effects that people are amazed by and love.

This can make the difference between shipping a good application that a few people download and maybe play with for a few days, and something great that people talk about, use day to day, and download in droves.

[ Pause ]

So what am I going to be talking about today?

First, there are a number of new features in the OpenGL ES API on iOS 7.

The first feature I'll talk about is instancing, and we support two new extensions to implement that feature.

We're also now supporting texturing in the vertex shader.

I'll talk about why you might want to do that and how it can be done.

We're also now supporting sRGB texture formats, an alternate color space that you can use.

I'll also talk in detail about how you can utilize the API and really optimize it for your needs.

I'll give you an in-depth understanding of the GPU pipeline, which should give you some insight into the feedback that our GPU tools provide.

[ Pause ]

But before I get into any of that, I just want to touch briefly on a very important topic: power efficiency.

So rendering requires power.

All the GPUs on iOS are power efficient.

However, there's still considerable power needed to put vertices into the pipe and spit out pixels.

The easiest thing that your application can do to conserve power is to manage your frame rate appropriately.

You can use the CADisplayLink API to sync to the display.

The display refreshes 60 times a second.

So that's really the maximum frame rate you could possibly achieve, but in many cases, it really makes sense to just limit your frame rate to a steady 30 frames per second.

You can achieve some smooth animations, and you're conserving way more power than rendering at 60.

Additionally, it's not necessary to render at all if there's no animation or movement in your scene.

You don't have to submit vertices to the pipe and have pixels produced if you're just going to show the same thing you showed a sixtieth of a second ago or a thirtieth of a second ago.

Just blit what's already in your buffers to the front or don't even blit at all because nothing's going to change.

This is particularly important with the multi-layered iOS 7 UI where there's a lot of compositing going on.

The UI can skip this compositing if nothing has changed in the layer, thereby saving some power in the compositing operation.
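The frame-rate advice above can be sketched as a small decision function in plain C. This is only an illustration of the logic: on iOS the callback would be driven by CADisplayLink, and `should_render` and its parameters are hypothetical names, not part of any Apple API.

```c
#include <stdbool.h>

/* Hypothetical helper: decide whether to draw on a given display refresh.
 * vsync_index    - count of display refreshes so far (60 per second)
 * frame_interval - draw on every Nth refresh (2 gives 30 frames per second)
 * scene_changed  - false when nothing has animated or moved since the last draw */
bool should_render(unsigned vsync_index, unsigned frame_interval, bool scene_changed)
{
    if (frame_interval == 0) {
        frame_interval = 1;            /* guard against division by zero */
    }
    if (!scene_changed) {
        return false;                  /* same image as last time: skip the work */
    }
    return (vsync_index % frame_interval) == 0;
}
```

In this sketch the caller is expected to keep `scene_changed` set until a frame has actually been drawn, so a change that arrives on a skipped refresh is not lost.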

[Pause] Alright.

I just wanted to touch on that briefly.

Now I would really like to get onto the meat of our talk and some of the new features.

The first of which is instancing.

If you're familiar at all with the types of games that are on the App Store, you'll know that the Tower Defense genre is quite popular.

In these games, you've got hundreds of enemies trying to storm your fortress.

The interesting thing about this rendering is these enemies often share the same vertex data and use the same models.

They may be doing something different.

Some may be running.

Some may be attacking you, but it's still the same base vertex data.

Also, maybe you've seen an adventure game where your hero's running through a forest that's densely populated.

It's got trees all about.

You've got trees in different orientations with branches in different configurations, but, again, all using the same vertex data.

They look distinct, however.

This type of rendering is a prime candidate for optimization with instancing.

[Pause] Let me start with a simple example.

I've got a gear model, and I'd like to render it 100 times on the screen as you see here.

Without instancing, what I would do before iOS 7 is, I would create a for loop, and in this case, I'm going down the width of the screen via the X axis, and then within that loop, I'm going up the screen on the Y axis.

For each iteration, I'm setting a uniform with the position of my gear, and then drawing that gear.

That's 100 uniform sets and 100 draw calls, and as you may know, draw calls consume a lot of CPU cycles.

So it would be great if we could trim that down a bit.
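The non-instanced loop just described might look roughly like this. The two helper functions are hypothetical stand-ins that only count calls; in a real renderer they would set a position uniform and issue a draw call.

```c
/* Counters standing in for the CPU cost of real GL calls. */
int uniform_sets = 0;
int draw_calls = 0;

/* Hypothetical stand-ins for setting a position uniform and drawing the model. */
static void set_gear_position(float x, float y) { (void)x; (void)y; uniform_sets++; }
static void draw_gear(void) { draw_calls++; }

/* Draw a columns-by-rows grid of gears: one uniform set and one draw call per gear. */
void draw_gear_grid(int columns, int rows, float gear_size)
{
    for (int x = 0; x < columns; x++) {
        for (int y = 0; y < rows; y++) {
            set_gear_position((float)x * gear_size, (float)y * gear_size);
            draw_gear();
        }
    }
}
```

Calling `draw_gear_grid(10, 10, 1.0f)` runs both helpers 100 times, which is the per-frame overhead that instancing is meant to remove.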

[Pause] Here's what instancing does.

It allows you to draw the same model many, many times in a single draw call.

Each instance of that model can have different parameters.

You can have different positions for each model, a different matrix for each model, or a different set of texture coordinates.

Even though it's the same vertex data, these models can look significantly different.

So there are two forms of instancing that we're shipping on iOS 7, the first of which is using an extension called APPLE_instanced_arrays, and this allows you to send these instance parameters down via another vertex array.

The second form is Shader Instance ID, and we support this via an extension, APPLE_draw_instanced, and the way this works is there's a new built-in ID variable in the vertex shader that gets incremented for each instance drawn within the draw call that you made.

[Pause] Let me talk about the first method here: instanced arrays.

We're introducing a new call glVertexAttribDivisorAPPLE, which indicates the attribute array that is going to supply the instance data.

It also indicates the number of instances to draw before you advance to the next element in this array.

You could, for example, have ten instances that use the same parameter and then move on to the next parameter, but the most common case is to send a unique parameter down to each instance inside your draw call.
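The divisor rule can be stated in a few lines of C. This is a sketch of the indexing only; `instance_attrib_index` is a hypothetical helper, not a GL function.

```c
/* Which element of an instanced attribute array a given instance reads.
 * With a divisor of 1 every instance gets its own element; with a divisor of
 * 10, ten consecutive instances share one element. A divisor of 0 means the
 * array holds ordinary per-vertex data, so this helper returns -1. */
int instance_attrib_index(int instance, int divisor)
{
    if (divisor <= 0) {
        return -1;
    }
    return instance / divisor;   /* integer division rounds down */
}
```

For example, with a divisor of 10, instances 0 through 9 read element 0 and instance 10 moves on to element 1.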

Now we're introducing two new draw calls to use this form of instancing.

This includes glDrawArraysInstancedAPPLE and glDrawElementsInstancedAPPLE, and these work exactly the same as the usual glDrawArrays and glDrawElements, but there's an extra parameter which indicates the number of instances you would like to draw.

Alright. Here's our example.

We've got three vertex arrays that have model data, the first of which is the position, the second normal, and the third is vertex colors, and we have an extra array that I'll get to in a minute.

We set up our arrays the same as we usually do.

We use glVertexAttribPointer to specify the location of the array.

It also specifies things like the type, whether it's unsigned byte, float, etc., whether the elements in it are normalized or unnormalized, and the number of scalars or number of values per element.

We do this for our per vertex position here, and, again, for our normal, and then a third time for our vertex colors.

Now we also do the same thing for this other array, the instance positions, and, additionally, we make a call to glVertexAttribDivisor.

The first argument here specifies it's attribute number three that has our per instance attribute data.

These are the per instance parameters that we'd like to send to OpenGL.

The second argument here indicates that each instance will get its own value.

Alright. We've done the set up.

We're ready to draw.

This K argument here: this indicates the size of our model, the number of vertices in our model.

It's the same as in glDrawArrays.

The second or last argument here, N, is the number of instances we would like to draw, and since each instance is getting a unique value, we're setting it to the same value as the number of elements inside this instance array.

Alright. We're ready to submit our vertices to the vertex shader, and here's what happens.

That instance element gets sent to the vertex shader, and it's used for all of the vertices inside of the vertex array containing our model.

The second instance is drawn, in the same draw call, and we set the second value here, and all of the vertices inside of the model are submitted to the vertex shader.

They all use that same value throughout the entire array.

And we go through all of the instances in our instance array, and all of them get a unique instance value, and we submit for every element inside that instance array all of the vertices in our model.
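As a CPU-side picture of that walkthrough, the loop below mimics what one instanced draw with a divisor of 1 feeds to the vertex shader. It is a sketch, not GL code: `Vec2` and `simulate_instanced_draw` are hypothetical, and adding the instance value as an offset is just one illustrative use of it.

```c
#include <stddef.h>

typedef struct { float x, y; } Vec2;

/* Every vertex of the model is processed once per instance, and all vertices
 * of instance i see the same element i of the instance array.
 * `out` must hold vertex_count * instance_count elements. */
void simulate_instanced_draw(const Vec2 *model, size_t vertex_count,
                             const Vec2 *instance_pos, size_t instance_count,
                             Vec2 *out)
{
    for (size_t i = 0; i < instance_count; i++) {
        for (size_t v = 0; v < vertex_count; v++) {
            Vec2 p = model[v];
            p.x += instance_pos[i].x;   /* same offset for the whole instance */
            p.y += instance_pos[i].y;
            out[i * vertex_count + v] = p;
        }
    }
}
```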

Here's the API setup, just going over it again.

As we usually do, we call glVertexAttribPointer to indicate how we've set up our model data.

We also call glVertexAttribPointer for this instance array and glVertexAttribDivisor.

We're indicating that attribute three is our instance array, and we're iterating one element for each instance.

Finally, we're ready to draw.

We call glDrawArraysInstanced with the value 100 since we're going to render 100 gears.

Here's the vertex shader.

As usual, we've got attributes for our per vertex model data.

Here we've got position and normal.

And another attribute which will contain our per instance data.

Per instance position.

Not per vertex.

Per instance.

And we do a simple add of that instance position to the vertex position.

We're displacing all the vertices by this constant value, or at least it's constant throughout that instance.

And, finally, we will transform our model space position into clip space by transforming with our model view projection matrix and output to the built-in gl_Position variable.

We also will do any other per vertex processing such as maybe computing color via lighting or generating texture coordinates, etc.

Alright.

Here's the second method.

This is using the instance ID parameter.

We've built in this gl_InstanceIDAPPLE variable inside the vertex shader, and it gets incremented once for each instance.

You can use this ID in a number of ways.

You can calculate unique info for each instance.

You can use the standard math functions that are available in the vertex shader to figure out unique details of that instance, or you can use it as an index into a uniform array or a texture, and I'll talk about texturing in a vertex shader in just a minute.

This method also uses the same glDrawArraysInstanced or glDrawElementsInstanced as the previous method.

Here's how this works: We call glDrawArraysInstancedAPPLE, and the instance ID is set to zero inside the shader, and it's the same value for all the vertices.

It's incremented for the next instance, and we submit all the vertices using the value of one.

Finally, we iterate through the entire number of instances until we get to the Nth instance, and we submit all the vertices for each instance value.

And we can reference that gl_InstanceID within our vertex shader.

And here's what that looks like.

We use this gl_InstanceIDAPPLE variable, and it's actually an integer value, but we don't have integer math in the OpenGL ES Shading Language 1.00.

So the first thing we need to do is cast it to a float so that we can use our floating point math operations on it.

And now we perform a modulo of ten, which will give us the x position, and we multiply it by a gear size, then we divide by ten to give us the y position.

Now we have an instance position, which we can add to our vertex and output to this temp position.

And like the other method, we will do our model view projection matrix multiply, which will put our position into clip space and give us a position that we can output.
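The arithmetic in that shader can be mirrored in C to check the grid layout. This is a sketch with stated assumptions: `gear_position` and `GridPos` are hypothetical names, flooring the row and scaling the row by the gear size are assumptions (the talk only says "divide by ten"), and the modulo is written out the way GLSL defines `mod(x, y)`, as `x - y * floor(x / y)`.

```c
typedef struct { float x, y; } GridPos;

/* Floor for non-negative values, which is all an instance ID needs. */
static float floor_nonneg(float v) { return (float)(int)v; }

/* Cast the ID to float, take it modulo 10 for the column and divide by 10
 * for the row, then scale both by the gear size. */
GridPos gear_position(int instance_id, float gear_size)
{
    float id  = (float)instance_id;
    float row = floor_nonneg(id / 10.0f);
    float col = id - 10.0f * row;        /* mod(id, 10.0) */
    GridPos p = { col * gear_size, row * gear_size };
    return p;
}
```

With a gear size of 2, instance 37 lands in column 7 of row 3, so its offset is (14, 6).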

[ Pause ]

So that was instancing.

The next feature is vertex texture sampling.

Why would you want a texture in the vertex shader?

It's not like you can see an image [in the vertex stage], right?

Well, there are a number of uses for this.

The first and most obvious is displacement mapping.

You can put an image in memory and fetch it in the vertex shader, and if you've got a mesh, you can take the values from that texture and displace that mesh with the values in the texture.

You can also use it as an alternative to uniforms.

Uniforms have a much smaller data store whereas textures have a very large data store that you can now access in the vertex shader.

Here's a height mapping example.

On the left, we've got our grayscale height map image, and on the right, we've got the results of that.

And here's how we implemented it.

First, we've got an x and z position that we've sent down via a vertex array.

Just X and Z.

No Y here, and we have a height map sampler.

Now this looks exactly like it would in the fragment shader.

This, however, is a vertex shader, and this height map is a reference to a texture.

Now we sample from that texture and get our Y value from it.

Now, it splats the Y value across all four components of temp position.

And so we overwrite the X and Z values with the X and Z positions.

Now we have X, Y, and Z inside of our temp position.

The Y we just happened to have gotten from the texture.

And as with our other shaders, we can transform to clip space and output to gl_Position.
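The height-map steps can be followed on the CPU with a short C sketch. Everything here is illustrative: `Vec4`, `sample_height`, and `displaced_position` are hypothetical names, the fetch is a simple nearest-texel lookup, and resetting `w` to 1 before the transform is an assumption the talk does not spell out.

```c
typedef struct { float x, y, z, w; } Vec4;

/* Nearest-texel fetch from a width*height grayscale height map stored row by
 * row. u and v are expected in [0, 1]. */
static float sample_height(const float *texels, int width, int height, float u, float v)
{
    int col = (int)(u * (float)width);
    int row = (int)(v * (float)height);
    if (col < 0) col = 0;
    if (col > width - 1) col = width - 1;     /* clamp u == 1.0 to the last texel */
    if (row < 0) row = 0;
    if (row > height - 1) row = height - 1;
    return texels[row * width + col];
}

/* Build a model-space position from an (x, z) attribute plus a sampled height. */
Vec4 displaced_position(const float *texels, int width, int height,
                        float x, float z, float u, float v)
{
    float h = sample_height(texels, width, height, u, v);
    Vec4 p = { h, h, h, h };   /* the sampled value lands in all four components */
    p.x = x;                   /* overwrite x and z with the vertex attribute */
    p.z = z;
    p.w = 1.0f;                /* assumption: w is reset to 1 before the transform */
    return p;
}
```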

[ Pause ]

Alright. That's a pretty simple example of how you might use vertex texturing.

As I mentioned, the more interesting way you can use this is to store just about any kind of generic data into a texture for shader access.

It's really just a very large store of random access memory.

Read-only random access memory, that is.

Data normally passed in via a glUniform can be passed in via a texture.

There are a number of advantages here.

It's a really, a much larger store.

We support 4K by 4K textures on most iOS 7 hardware.

Whereas uniform arrays are limited to 128 uniforms (that's four values per uniform), there's way more storage inside of a texture.

This also potentially enables fewer API calls to set the data.

If you load your texture at app startup, and you have all these values inside this large data store, you can just bind the texture, and it's set up for you to draw.

You don't have to load a bunch of values to set up for your draw call.

There's a bit more variety in the types that you can use, whereas uniforms only allow you to use 32-bit floats.

You can use unsigned byte, half float, and float.

Any of the texture types that you can use, you can use for vertex texture sampling.

You can choose the appropriate type for the data that you'd like to consume in your shader.

You can use filtering with the texture.

Anything you can do with the texture, you can do with a vertex texture, and filtering is kind of nice because you can average sequential values that are in your texture, and with wrapping, you can actually average the last value in your texture with the first value.

So you can do a wraparound of averaging.
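The wraparound averaging can be illustrated with a CPU sketch of linear filtering on a one-dimensional texture with repeat wrapping. `sample_linear_repeat` is a hypothetical helper that follows the usual texel-center convention; it is not how any particular GPU is implemented.

```c
/* Linear filtering of a 1D texture with repeat wrapping.
 * Texel i is centered at coordinate (i + 0.5) / size. */
float sample_linear_repeat(const float *texels, int size, float s)
{
    float u = s * (float)size - 0.5f;        /* position relative to texel centers */
    int i0 = (int)u;
    if ((float)i0 > u) {
        i0--;                                /* floor for negative values */
    }
    float frac = u - (float)i0;
    int a = ((i0 % size) + size) % size;     /* wrap the left neighbor into range */
    int b = (a + 1) % size;                  /* right neighbor, wrapping to texel 0 */
    return texels[a] * (1.0f - frac) + texels[b] * frac;
}
```

Sampling the texture {0, 10, 20, 30} at coordinate 0 falls halfway between the last texel and the first, so the result is their average, 15.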

And because you can render to a texture, you can have the GPU produce data.

Instead of just loading it in from CPU generated values, you can render to the texture and then consume that data in the vertex shader.

[Pause] Now I'd like to show you a demo with some of these features.

Here we have 15,000 asteroids rotating about this planet, and this is using what we call immediate mode.

There is a draw call for each asteroid here.

So that's over 15,000 draw calls.

Now we're running at 17 frames per second, maybe 18 in some cases.

That's alright, I guess.

The real problem here is that we're consuming a lot of CPU cycles.

We're really leaving nothing for the app so that if you've got some logic there, the frame rate's going to slow down even more.

So what we like to do is offload this to the GPU.

Here we have the first improvement, which is using instance ID, the built-in variable within our vertex shader.

Now what's cool about this is we're actually rotating or spinning each asteroid.

They all have unique values, and, obviously, unique positions.

And here's another mode that we've implemented.

This uses the glVertexAttribDivisor method, and we're getting even a slightly better frame rate here.

This is due to our pre-computing all of the rotation and position values outside the shader, and we're just passing them in.

We're not actually doing much computation inside of our vertex shader.

What's cool to note about this is that a few years ago we presented this on a Mac Pro with, I don't know how many cores and a beefy desktop GPU.

This is really pretty nice that we are now showing this to you on an iOS device.

[ Pause ]

[ Applause ]

[ Pause ]

Let me talk about some implementation details here.

With that second mode using the instance ID, we calculate the transformation matrix in the vertex shader.

First, we figure out a spin value by doing a modulo of our instance ID, and this gives us some spin value in radians and we can then use the cosine and sine functions to build a rotation matrix.

We then apply a translation matrix that gives us the position of the asteroid.

We also use the instance ID variable to figure out the positions, and then we create this matrix.

Now the matrix calculations are done per vertex.

So even though this matrix will be the same for the entire asteroid, which is about 30 to 60 vertices (I think it's maybe a little bit on the lower end), that's at least 30 times that we're calculating this transformation matrix.

What we'd really like to do is just create this matrix once per instance, not per vertex.

This is what the instance arrays method does.

We actually calculate this matrix array up front at app startup, or all these matrices up front at app startup.

We calculate positions and rotations.

We stuff that into a vertex array, and then set up the vertex array with the glVertexAttribDivisor call, and pass the parameters down for each asteroid, not for each vertex.
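To put rough numbers on the difference between the two methods, the sketch below counts how many transformation matrices get built. The function names are hypothetical, and the figures used in the note come from the demo (15,000 asteroids, about 30 vertices each).

```c
/* Matrices built per frame when the matrix is computed in the vertex shader:
 * once for every vertex of every instance. */
long matrices_per_frame_in_shader(long instances, long vertices_per_instance)
{
    return instances * vertices_per_instance;
}

/* Matrices built in total when they are precomputed at startup and passed in
 * through an instance array: once per instance, none per frame afterwards. */
long matrices_precomputed_at_startup(long instances)
{
    return instances;
}
```

For the asteroid demo that is 450,000 matrix computations per frame in the shader versus 15,000 computed once at startup.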

There are a couple of advantages and disadvantages to each of these methods.

Using the instance ID method, we're not using any memory or really very little memory because we're doing all the calculation as needed on the GPU.

Another advantage is that you're using the GPU as another computation device.

If you're not GPU bound, and you need the CPU for a lot of cycles, well, then this may be the way to go.

But in general, if you have a large number of instances, you could potentially overload the GPU with computation, which would really slow it down if you need to do other computations.

So what we've got here is a different method where we use instance arrays.

Instance arrays are generally faster than computing on the GPU since you can save cycles on the GPU.

There's a lot more flexibility in types over uniforms.

You can use any type that a vertex array can use, including bytes, unsigned bytes, floats, half floats, etc.

Now there's a third method that I didn't demonstrate, but this would be to use the instance ID as an index into a texture.

So instead of passing parameters down via a vertex attribute array, you stuff them into a texture and then fetch using the instance ID variable to get the location, the position, and the rotation.

Now, as I mentioned before, the textures are just this large storage of random access memory.

It's often logically simpler [to store data in a texture], since you've got a 2D array, to put tables or any other sort of data inside of a texture.

So this is really cool for bone matrices, you can use the first row for the arm matrix, the second row for the other arm matrix, the third row for the leg matrix, head, and so on.

So it's actually a lot easier to use a texture for your bone matrix parameters.
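One way to address such a table from the instance ID is to turn a row and column into a texel-center coordinate. The layout below (one row per instance or bone, parameters in consecutive texels along the row) is a hypothetical choice for illustration, as are the names `TexCoord`, `texel_center`, and `instance_param_coord`.

```c
typedef struct { float s, t; } TexCoord;

/* Normalized coordinate of the center of a texel in a width*height texture. */
TexCoord texel_center(int column, int row, int width, int height)
{
    TexCoord tc = { ((float)column + 0.5f) / (float)width,
                    ((float)row + 0.5f) / (float)height };
    return tc;
}

/* Hypothetical layout: row = instance (or bone) index, column = which
 * parameter of that instance to fetch. */
TexCoord instance_param_coord(int instance_id, int param_index, int width, int height)
{
    return texel_center(param_index, instance_id, width, height);
}
```

Sampling at texel centers avoids blending neighboring entries when the data should be read back exactly.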

[ Pause ]

So here's a summary of instancing and vertex texture sampling.

Instancing allows you to draw many models in a single draw call, which is particularly important because draw calls consume a number of CPU cycles, and even though it's the same model that you're drawing, they can look distinct since you are passing down different parameters for each instance.

Vertex texture sampling: just think of it as a large data store for random access read-only memory in the vertex shader.

You can use it with the instance ID to fetch per instance parameters.

These extensions and these features are supported on all iOS 7 devices.

[ Pause ]

OK. Let's move on to the third feature in iOS 7 on OpenGL ES.

sRGB is an alternate color space, which is more perceptually correct.

It matches the gamma curve of displays.

If you're looking at blacks and greys and whites, what you'd see with the usual color space is that you'd move from black to grey much more quickly than from grey to white, which effectively means that your brighter colors are weighted more heavily when you're doing averaging or mixing of colors.

So it's not a linear distribution.

There's weight on some of the values.

sRGB compensates for this by basically applying an inverse curve so that the darker colors get a little bit more weight than usual, and this allows you to have a linear mixing when your image is presented on the display.
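For reference, the talk describes the curve only qualitatively. The standard sRGB transfer functions (from IEC 61966-2-1) between a linear channel value C_lin and its sRGB-encoded value C_srgb, both in [0, 1], are:

```latex
C_{\mathrm{srgb}} =
\begin{cases}
12.92\, C_{\mathrm{lin}}, & C_{\mathrm{lin}} \le 0.0031308 \\
1.055\, C_{\mathrm{lin}}^{1/2.4} - 0.055, & C_{\mathrm{lin}} > 0.0031308
\end{cases}
\qquad
C_{\mathrm{lin}} =
\begin{cases}
C_{\mathrm{srgb}} / 12.92, & C_{\mathrm{srgb}} \le 0.04045 \\
\left( \dfrac{C_{\mathrm{srgb}} + 0.055}{1.055} \right)^{2.4}, & C_{\mathrm{srgb}} > 0.04045
\end{cases}
```

Mixing or averaging is done on the linear values; the encoded values are what get stored in an sRGB texture or render buffer.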

Here's some API details.

There are two external formats that you would put your data in.

This is sRGB8 and sRGB8-Alpha.

There is an internal format, SRGB 8 alpha 8, and four compressed internal formats that you can read from that support this sRGB color space.

Now the non-compressed format here is renderable.

This allows you to do linear blending or color calculations in the shaders and have them come up in a linear fashion.

You need to check for the GL_EXT_sRGB extension string because this is supported on all iOS 7 devices except for the iPhone 4.

This is a great new feature.

It's perceptually correct.

However, you don't want to just turn this on.

You'll start getting some things that may not look right.

You need to author your textures for it.

You need your artists to keep the sRGB color space in mind so that when they're actually presented, they look as you intended them to.

And you should only use these sRGB textures for color data.

A lot of people encode normal maps or just use an alpha map perhaps.

You shouldn't even use this for alpha.

Alpha is often thought of as going with RGB, but alpha should use its own linear space.

[ Pause ]

Alright. So a lot of great new features in the OpenGL ES API, but you really need to have a rock solid foundation before you start adding to your rendering engines.

And, fortunately, Apple provides a slew of excellent GPU tools to help you build this foundation.

The first tool I'd like to talk about is the OpenGL ES frame debugger.

It allows you to capture a frame of rendering and debug it and play with it and experiment with it.

Now, there are a ton of widgets here, and I'll just go over a few of them.

The first thing I'd like to point out is the scrubber bar.

So you've captured a frame of rendering, and the scrubber bar allows you to position on a particular call through your frame.

You can stop at a draw call or a bind or a uniform set, etc., and you can see what has just been rendered.

You can see your scene as it gets built up not only in the color buffer, which is on the left, but also the depth buffer on the right, and whatever you've just rendered, the results of the last draw call you've made, shows up in green.

[ Pause ]

You can also examine all of the contents of context state at a particular call inside that frame.

You can see everything in the context, the whole state vector of OpenGL ES.

Everything that's bound, the programs, textures, etc.

Your blend state, your depth state, whatever state you'd like.

If you think something may be going wrong with the state vector, you can search in there for it.

But what's even nicer is that in Xcode 5, you can now view the information that pertains to the particular call that you're stopped on.

Instead of looking through all of the context state, you can look at what's really useful to you at the moment.

Here, I am stopped at a glUseProgram call.

And so now I can look at all of the information that pertains to that GLSL program.

All the uniforms and their values, what attributes are necessary for that program, etc.

You can set that view in the lower left-hand corner here.

There's this auto variables view, and this is new with Xcode 5.

[ Pause ]

You also have an object viewer.

You can view any of the objects in the OpenGL context.

You can view textures, vertex buffer objects, and I think the most powerful feature here, the most powerful object viewer is your shader viewer.

And you can take a look at the shaders and edit your shader within it, and hit this button here on the lower left-hand corner, which will compile your shader immediately, apply it to your scene, and then you can see how it has changed your rendering.

[ Applause ]

So this allows you to experiment and even debug shader compiler errors.

As you see here, I've got use of an undeclared variable, and it flags my error, and I can go ahead and fix it right away.

[ Pause ]

So an often overlooked feature of the OpenGL ES frame debugger is the OpenGL issues navigator.

Here we point out a number of things that you could do to improve your rendering.

There's also some information about things that may cause rendering errors, but more importantly, there is a lot of information about how you can improve your performance.

Also in Xcode 5, we have the performance analysis page, which allows you to hit this button in the upper right-hand corner, and we'll run a couple of experiments on your frame and figure out what bottlenecks you've got, whether you're vertex bound, fragment bound, etc., and there are some helpful suggestions as to what you might like to do next.

It also gives you some information such as whether your GPU is pegged or your CPU is pegged.

So a lot of useful information here as well.

[ Pause ]

And new in Xcode 5 is the ability to break on any OpenGL error.

Now, what you used to have to do is add a glGetError call after every single OpenGL call to stomp out these errors, to figure out if your OpenGL call produced some sort of error because you sent in some bad arguments or the state wasn't set up properly.

Well, you don't have to do this anymore.

In the lower left-hand corner here, you can just say add OpenGL ES breakpoint, and any OpenGL call that produces an error will break immediately, and you can immediately fix it.

[ Applause ]

We also have the OpenGL ES Analyzer instrument, and there are a number of very helpful views for improving performance.

And a very powerful part of the OpenGL ES Analyzer is the OpenGL ES Expert, which points out more information, more things that you can do to improve the performance in your application.

This points out a lot of data that is very similar to what comes up in the issues navigator.

The issues navigator can actually run some more in-depth experiments and give you more data, but it can only analyze one frame, whereas the OpenGL ES Expert can analyze multiple frames of rendering.

[ Pause ]

We offer a number of tools that really provide an excellent means for debugging your rendering.

Additionally, with the OpenGL ES Expert, the performance analysis page and the frame debugger with the issues navigator, we're providing lots of valuable data to improve performance.

But there is a lot of data coming at you, and it can be difficult to digest and assess the severity of the issues that come up.

So I think it would be helpful if I can give you a more in-depth understanding of how OpenGL works and, in particular, how the GPU beneath it takes the vertex data and transforms it into pixels on the screen.

That way, you can keep the OpenGL architecture in mind when you're designing your rendering architecture and really assess the severity of issues that crop up.

[ Pause ]

I'm going to give you an overview of the GPU architecture now.

All of the iOS GPUs are tile-based deferred renderers.

They are high-performance, low-power GPUs, and the TBDR pipeline is significantly different than that of traditional streaming GPUs that you would find on the Mac.

There are a number of optimizations to reduce the processing load, which increase performance and really save lots of power.

Very important on these iOS devices.

Now the architecture depends heavily on caches because large transfers to unified memory are costly not only in terms of performance and latency, but also in terms of power.

It takes a lot of power to reach out across the bus and grab something back in.

So we have these very nice, significantly large, caches so that we can do a lot of work on the GPU.

There are certain operations that developers can do that can prevent these optimizations or cause cache misses.

Fortunately, these operations are entirely avoidable.

[ Pause ]

What I thought I'd do is take you on a trip down the tile-based deferred rendering pipeline, and along the way, I'll point out some issues that you may stumble across and describe what's going on when we warn you about these issues.

Let's start out with the vertex processor.

On your left, you've got the vertex arrays that we've set up.

Hopefully, you've used a vertex buffer object or a vertex array object to encapsulate this data, and we issue a draw call, which begins this trip down the pipeline.

We shade the vertices, transform them into clip space, and actually also apply the view port transformation so that they're now window coordinate vertices.

The vertices are shaded and transformed, as I mentioned, and stored out to unified memory.

[ Pause ]

Now a frame's worth of vertices are stored.

Unlike a traditional streaming GPU where it only needs three vertices to produce a triangle to go onto the next stage and start rasterization and fragment processing, we defer all of that work until you call presentRenderbuffer or somehow change the render buffer another way, by either binding a render buffer or changing an attachment to a frame buffer object.

Let's say now we call presentRenderbuffer.

Now, and only now, is when we move to the next stage of the pipeline, which is the tiling processor.

Every render buffer is split into tiles.

This allows rasterization and fragment shading to occur on the GPU in little tile-sized pieces of embedded memory.

We can't push the entire frame buffer onto the GPU; that's just way too large.

So we just split up this render buffer into much smaller tiles, and then we can render to those one by one.

Here's what the tile processor does: It works in groups of triangles, and it figures out where the triangles would be rendered here.

Which tile they'll go to.

The larger triangles, which intersect multiple tiles, may be binned into these multiple tiles.
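A simplified picture of binning is to find every tile touched by a triangle's bounding box. This is a conservative sketch, not a description of the actual hardware: the tile size is a caller-supplied assumption (the talk gives no figure), and `Point`, `TileRange`, `bin_triangle`, and `tiles_touched` are hypothetical names.

```c
typedef struct { float x, y; } Point;
typedef struct { int x0, y0, x1, y1; } TileRange;   /* inclusive tile indices */

static float min3(float a, float b, float c) { float m = a < b ? a : b; return m < c ? m : c; }
static float max3(float a, float b, float c) { float m = a > b ? a : b; return m > c ? m : c; }

/* Range of tiles overlapped by the triangle's bounding box.
 * Assumes window coordinates are non-negative and inside the render buffer. */
TileRange bin_triangle(Point a, Point b, Point c, int tile_size)
{
    TileRange r;
    r.x0 = (int)min3(a.x, b.x, c.x) / tile_size;
    r.y0 = (int)min3(a.y, b.y, c.y) / tile_size;
    r.x1 = (int)max3(a.x, b.x, c.x) / tile_size;
    r.y1 = (int)max3(a.y, b.y, c.y) / tile_size;
    return r;
}

/* Number of tiles the triangle would be binned into under this scheme. */
int tiles_touched(TileRange r)
{
    return (r.x1 - r.x0 + 1) * (r.y1 - r.y0 + 1);
}
```

With 32-pixel tiles, a triangle spanning x from 10 to 40 on a single row of tiles crosses a tile boundary and is binned into two tiles.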

[ Pause ]

And then we're ready for raster set up, or set up for the rasterizer.

Here's the first issue that you could run across - logical buffer load, and here's what this means.

The rasterizer uses tile size embedded memory, as I said.

Now if there is data already in this render buffer, the GPU needs to load it from unified memory because you're going to write on top of it.

This is pretty costly, OK.

We need to reach out across the bus, pull it in.

Same for the depth buffer: if there is data in it, we also need to pull it in from unified memory.

Fortunately, you guys can avoid this.

Loading tiles is called a logical buffer load, and you can avoid such a logical buffer load if you call glClear before your rendering.

The driver knows that there is nothing important out in memory since you're clearing the buffer, so it can just start rendering to this tile memory.

Great. No load necessary.

Very fast.

[ Pause ]

Logical buffer loads can happen in some less obvious ways.

For instance, if we render to a texture, render to a new buffer or a new texture, and then render to that first texture again.

Here's what happens: we render to our texture.

Now we want to render to a new texture.

We clear it, and render to that.

Great. Now we would like to render to our first texture.

Well, logical buffer load.

Need to load both the color buffer and depth buffer.

Developers should avoid frequent switching of render buffers.

Complete your rendering to one buffer before switching to another.

Don't just say, "hey, you know, I've finished a pretty good amount of rendering.

Let's just switch my buffer.

Go out and render something new, and then now I'd like to go back to that first buffer."

You'll get this tile thrashing that I've just described.
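
To make the cost of switching concrete, here is a toy accounting model in C. It is not driver code: it simply assumes, following the description above, that a pass on a target that already has contents in unified memory triggers a logical buffer load unless the pass begins with a clear, and that every pass ends with a store. The types `Pass` and `TileTraffic` and the function `simulate` are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TARGETS 8 /* arbitrary limit for this sketch */

typedef struct { int target; bool clearFirst; } Pass;
typedef struct { int loads; int stores; } TileTraffic;

/* Counts modeled logical buffer loads and stores for a sequence of passes. */
static TileTraffic simulate(const Pass *passes, int count)
{
    bool stored[MAX_TARGETS] = { false }; /* does the target hold data in unified memory? */
    TileTraffic traffic = { 0, 0 };
    for (int i = 0; i < count; i++) {
        int id = passes[i].target;
        assert(id >= 0 && id < MAX_TARGETS);
        /* Existing contents must be pulled in unless the pass starts with a clear. */
        if (stored[id] && !passes[i].clearFirst) {
            traffic.loads++;
        }
        /* At the end of the pass the tiles are written back out. */
        stored[id] = true;
        traffic.stores++;
    }
    return traffic;
}
```

In this model, rendering to texture 0, then texture 1, then texture 0 again without a clear costs one load and three stores, while finishing all of texture 0 before moving to texture 1 costs no loads and two stores.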

[ Pause ]


We're ready to actually do some further computation.

The GPU reads the triangles assigned to the tile, and it computes the X and Y pixel coordinates and the Z value, the depth value.

The fragment shader is not yet run.

Positions and depth are calculated only.

This allows an optimization called hidden surface removal.

Now let's say we submit a triangle, and it's partially obscured by another triangle.

Well, a portion of that triangle is hidden.

We don't need to run the fragment shader on that hidden portion.

That saves us from fragment shader processing.

We can reject those fragments.

Now this is why we deferred all the rendering until you called presentRenderbuffer.

We have the entire frame's worth of triangles.

That's potentially a lot of fragments that we can reject.

[ Pause ]

But you can get this warning.

Loss of depth test hardware optimizations.

Loss of hidden surface removal.

It's really costly to enable blending or use discard in the shader.

Lots of times we like to use discard for things like implementing an alpha test, but it defeats the hidden surface removal optimization.

We submit a triangle that maybe is blending and it's transparent.

So you can see stuff behind it.

We need to run that fragment shader even for triangles that are behind that other triangle.

The shader must run a lot more times.

This is a cost of performance and power.

We're doing a lot more processing.

Therefore, you guys need to be judicious in your use of discard and blending.

Allow the GPU to reject as many fragments as possible.

[ Pause ]

Next up, we can perform fragment shading.

And what's great about the TBDR renderer is that, if the hidden surface removal algorithm is allowed to work, we only need to run the fragment shader on each pixel once.

It doesn't matter how many layers of triangles.

Doesn't matter what your depth complexity is.

Only one fragment shader is run on each pixel.

The fragment processor shades and produces color pixels, and those colors are written to the embedded tile memory on the GPU.

Now we're ready for tile storage.

[ Pause ]

Alright. The tile is stored into unified memory, and once all the tiles are processed, the renderbuffer is ready for use.

You can present it to the user on the screen or you can use it as a texture for another pass.

Storing a tile to unified memory is called a logical buffer store, and each frame needs at least one.

It's considered a frame because you've presented your buffer to the user, and that requires a logical buffer store.

However, you can get this warning - unnecessary logical buffer store.

And here's what that's about.

A depth buffer only needs to be stored if you're using an effect like shadowing or screen space ambient occlusion.

In general, if you're not using an effect like that, it doesn't need to be stored; it's unnecessary to push it out to unified memory.

So developers could call glDiscardFramebuffer to skip this logical buffer store on the depth buffer.

It's simply flushed away.

We don't need that after rendering is complete.

The same thing for multisample anti-aliased renderbuffers, and this is particularly important because these are big.

A 4x multisample anti-aliased render buffer has four times the amount of data as a regular color buffer.

Fortunately, you guys don't need the pre-resolved MSAA buffer.

What you need is the resolved, much smaller tile that you can store out to unified memory.

Not the large tile that has not been resolved yet.

You can call glDiscardFramebuffer for the MSAA color buffer as well.

Same thing for depth.

Don't need the MSAA depth buffer.

Call glDiscardFramebuffer on the MSAA depth buffer.

Don't store that out.
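
The size argument is simple arithmetic, sketched here in C. The 1024 by 768 resolution and 4 bytes per pixel used in the note below are assumptions chosen only for illustration, and `bufferBytes` is a hypothetical helper, not part of any API.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a buffer that keeps `samples` samples per pixel. */
static size_t bufferBytes(size_t width, size_t height, size_t bytesPerPixel, size_t samples)
{
    assert(samples >= 1);
    return width * height * bytesPerPixel * samples;
}
```

Under those assumptions, a single-sample color buffer is 3,145,728 bytes and the unresolved 4x buffer is 12,582,912 bytes, which is the extra traffic a discard of the unresolved buffer would avoid writing out.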

[ Pause ]

We finished our trip down the tile-based deferred rendering pipeline.

Here are some takeaways.

Hidden surface removal is a really unique strength of this architecture.

It greatly reduces work load which saves power, increases performance.

There are certain operations, however, that defeat this HSR process: alpha blending or using discard in the shader.

But I'm not saying you shouldn't use them.

There are some really cool effects that you can achieve by enabling blending or using discard, but there are some preferable ways to use them.

First of all, draw all your triangles using discard or blending after triangles that do not.

Hidden surface removal can at least be used for the triangles in that opaque group.

Additionally, trim the geometry around the triangles that need this sort of operation.

If you've implemented an alpha test, make sure the geometry wraps your alpha-tested object tightly so that you produce fewer fragments that need this operation.

It's worth adding more vertices to reduce fragments that need them.

[ Pause ]

Also, we've seen that transfers between the unified memory and the GPU are expensive, and one of the best things you can do to avoid them is to call glClear, which avoids the logical buffer loads so that the GPU can simply start rendering.

Doesn't need to read the framebuffer.

Also avoid frequent render buffer switches, which can cause tile thrashing.

And avoid logical buffer stores.

Use the glDiscardFramebuffer call, especially for large multisampled anti-aliased buffers.

[ Pause ]

There are a couple of things that didn't fit on that pipeline diagram, and I want to point those out to you now.

The first is dependent texture sampling.

Now this happens if you calculate a texture coordinate in the fragment shader and then sample from that texture with the texture function.

Here I've got our texture sampler and two varyings, and the first thing I do is add these values together to produce a coordinate, offset coord, and I use this offset coord in the texture function.

Because it's a result of two previously-calculated varyings, we now are making a dependent fetch or a dependent sample or dependent read.
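
The slide is not part of this transcript, so here is a hedged reconstruction of what such a shader might look like, written as a C string constant the way shader source is often embedded in an app. It uses GLSL ES 1.00 syntax (`texture2D`), which is an assumption; the identifiers `uTexture`, `vTexCoord`, `vOffset`, and `offsetCoord` are hypothetical.

```c
/* Fragment shader source (GLSL ES 1.00) with a dependent texture read.
   All identifiers are illustrative, not taken from the session slides. */
static const char *kDependentReadFragmentShader =
    "precision mediump float;\n"
    "uniform sampler2D uTexture;\n"
    "varying vec2 vTexCoord;\n"
    "varying vec2 vOffset;\n"
    "void main() {\n"
    "    // The coordinate is computed here, in the fragment shader,\n"
    "    // so the sample below cannot be issued ahead of time.\n"
    "    vec2 offsetCoord = vTexCoord + vOffset;\n"
    "    gl_FragColor = texture2D(uTexture, offsetCoord);\n"
    "}\n";
```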

Here's a more devious example, a much less obvious example of a dependent texture read.

Some developers get clever, and they think, "hey, you know what, I've got two textures I want to sample from, and I only need two scalars to get a 2D texture coordinate for each texture.

What I'm going to do is pack them into a single vec4.

So I've got an S and T texture coordinate in the first two components of the vec4 and another S and T texture coordinate in the second two components of the vec4.

And then what I'm going to do is I'm going to use the first two as the first texture coordinate, make the first texture fetch with the X and Y and then a second one with Z and W."

Now these are actually both dependent reads.

Because what happens is the texture coordinates need to be converted first from a vec4 to two vec2s.

This is happening all under the hood.

You don't actually see it, but there is some calculation being done which makes these dependent texture reads.

[ Pause ]

Here's why it's bad.

There's a high latency to sample a texture in unified memory.

Now we avoid this latency when you're not doing a dependent texture read because the rasterizer says, "Hey, this triangle uses a texture in this fragment shader, and we've already got the coordinates.

So let's signal out to memory and pull that data back in, and as soon as we start that fragment shader, we'll have the data."

We can't do that if you're calculating the texture coordinate in the shader.

The shader stalls.

It waits for the data to come back to it.

So minimize your dependent texture samples.

Hoist your calculation.

Do it in the vertex shader if possible, put it in a uniform or put it in the vertex array.

Try to avoid putting the calculation in the fragment shader.
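
One way to hoist a coordinate offset into the vertex shader is sketched below as C string constants holding GLSL ES 1.00 source. This is an illustration under assumptions, not the shader from the slides: the attribute, uniform, and varying names are hypothetical. Because varyings are interpolated linearly from the per-vertex values, adding two per-vertex quantities before interpolation gives the same result as adding the two interpolated values in the fragment shader.

```c
/* Vertex shader: the coordinate arithmetic happens once per vertex. */
static const char *kHoistedVertexShader =
    "attribute vec4 aPosition;\n"
    "attribute vec2 aTexCoord;\n"
    "attribute vec2 aOffset;\n"
    "uniform mat4 uModelViewProjection;\n"
    "varying vec2 vOffsetCoord;\n"
    "void main() {\n"
    "    vOffsetCoord = aTexCoord + aOffset;\n"
    "    gl_Position = uModelViewProjection * aPosition;\n"
    "}\n";

/* Fragment shader: samples with the varying exactly as it arrives. */
static const char *kHoistedFragmentShader =
    "precision mediump float;\n"
    "uniform sampler2D uTexture;\n"
    "varying vec2 vOffsetCoord;\n"
    "void main() {\n"
    "    gl_FragColor = texture2D(uTexture, vOffsetCoord);\n"
    "}\n";
```

If the offset is the same for every vertex in a draw, it could instead be a uniform added in the vertex shader, or be baked into the vertex array, as suggested above.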

Here's the fixed version of that devious shader here.

We've now split that vec4 into two vec2's.

There are no calculations done.

We simply fetch using these two separate variables.
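
Since the slides are not reproduced here, the following C string constants sketch what the packed and split variants could look like in GLSL ES 1.00. The identifiers are hypothetical, and adding the two samples together is only a placeholder for whatever the real shader does with them.

```c
/* Packed variant: both reads take swizzled components of one vec4 varying,
   which the talk describes as dependent reads. */
static const char *kPackedFragmentShader =
    "precision mediump float;\n"
    "uniform sampler2D uTexture0;\n"
    "uniform sampler2D uTexture1;\n"
    "varying vec4 vTexCoords;\n"
    "void main() {\n"
    "    gl_FragColor = texture2D(uTexture0, vTexCoords.xy)\n"
    "                 + texture2D(uTexture1, vTexCoords.zw);\n"
    "}\n";

/* Split variant: each read uses its own two-component varying unchanged. */
static const char *kSplitFragmentShader =
    "precision mediump float;\n"
    "uniform sampler2D uTexture0;\n"
    "uniform sampler2D uTexture1;\n"
    "varying vec2 vTexCoord0;\n"
    "varying vec2 vTexCoord1;\n"
    "void main() {\n"
    "    gl_FragColor = texture2D(uTexture0, vTexCoord0)\n"
    "                 + texture2D(uTexture1, vTexCoord1);\n"
    "}\n";
```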

[ Pause ]

Alright. Here's another warning that shows up.

Fragment shader dynamic branching or also Vertex shader dynamic branching.

Here we've got a varying or an attribute that varies from vertex to vertex, and because it varies, it becomes a little bit difficult for the GPU to manage: the outcome of the test in the if statement depends upon that varying value.

Here's why it's difficult.

GPUs are highly parallel devices.

They can process multiple vertices and fragments simultaneously.

We need a special branch mode for execution of a dynamic branch, and this adds a bit more latency for the parallel device to stay in sync.

If it's possible, calculate the predicate of your if statements outside of the shader.

A branch on a uniform does not incur that same overhead because it's constant across all of the vertices or fragments.

All of the shader execution.
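
The difference between the two kinds of branch can be sketched with two fragment shaders, given here as C string constants containing GLSL ES 1.00 source. The tinting effect and every identifier are invented for illustration; the only point is what the if statement tests.

```c
/* Dynamic branch: the predicate comes from a varying, so it can differ
   from one fragment to the next. */
static const char *kDynamicBranchFragmentShader =
    "precision mediump float;\n"
    "uniform sampler2D uTexture;\n"
    "uniform vec4 uTint;\n"
    "varying vec2 vTexCoord;\n"
    "varying float vTintAmount;\n"
    "void main() {\n"
    "    vec4 color = texture2D(uTexture, vTexCoord);\n"
    "    if (vTintAmount > 0.5) {\n"
    "        color *= uTint;\n"
    "    }\n"
    "    gl_FragColor = color;\n"
    "}\n";

/* Uniform branch: the predicate is decided by the application before the
   draw and is constant for every fragment in that draw. */
static const char *kUniformBranchFragmentShader =
    "precision mediump float;\n"
    "uniform sampler2D uTexture;\n"
    "uniform vec4 uTint;\n"
    "uniform bool uApplyTint;\n"
    "varying vec2 vTexCoord;\n"
    "void main() {\n"
    "    vec4 color = texture2D(uTexture, vTexCoord);\n"
    "    if (uApplyTint) {\n"
    "        color *= uTint;\n"
    "    }\n"
    "    gl_FragColor = color;\n"
    "}\n";
```

In the second form the application would evaluate the condition on the CPU and set `uApplyTint` before drawing, for example with `glUniform1i`.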

And really if there's a shader that uses both a dependent texture sample and dynamic branching, this adds a lot of latency and can be really costly.

Really look for that.

[ Pause ]

OK. I've talked a lot about how to utilize the GPU to its fullest.

You also really want to get to the GPU as quick as possible and minimize the CPU overhead.

And as you may know, a lot of time is spent in draw calls.

But what's less obvious is that state setting only looks inexpensive. If you make a bind call, an enable call, or a glUseProgram call, and you profile that or add timers around it, it doesn't look like much time, but that's because a lot of the work is deferred until draw.

We don't actually do a lot of processing during the state setting.

It's all done later on.

The more state you set before a draw, the more expensive that draw becomes.

So maximize the efficiency of each draw. The tools give you a couple of warnings that point to ways you can reduce the overhead for a particular call.

Redundant call and inefficient state update are these two warnings you should look out for.

There are some techniques you can use, such as state shadowing.

Keep the state vector that you've been changing in your application and don't set it in OpenGL if you've already set it.
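
A minimal sketch of state shadowing in C follows. To keep it self-contained, `driverBindTexture` is a stand-in that only counts calls; in a real app that is where `glBindTexture` would go. `StateShadow`, `bindTextureShadowed`, and `countDriverBinds` are hypothetical names, and only a single piece of state (the bound texture) is shadowed here.

```c
#include <assert.h>

typedef unsigned int TextureName;

static int gDriverBindCalls = 0;

/* Stand-in for the real bind call; it only counts how often it is reached. */
static void driverBindTexture(TextureName name)
{
    (void)name;
    gDriverBindCalls++;
}

/* The application's own copy of the state it last set. */
typedef struct { TextureName boundTexture; int valid; } StateShadow;

/* Forwards the bind only when it would actually change the state. */
static void bindTextureShadowed(StateShadow *shadow, TextureName name)
{
    assert(shadow != 0);
    if (shadow->valid && shadow->boundTexture == name) {
        return; /* redundant call, skip it */
    }
    driverBindTexture(name);
    shadow->boundTexture = name;
    shadow->valid = 1;
}

/* Demo helper: requests a sequence of binds and reports how many got through. */
static int countDriverBinds(const TextureName *names, int count)
{
    StateShadow shadow = { 0, 0 };
    int before = gDriverBindCalls;
    for (int i = 0; i < count; i++) {
        bindTextureShadowed(&shadow, names[i]);
    }
    return gDriverBindCalls - before;
}
```

For the request sequence 1, 1, 2, 2, 2, 1, only three binds reach the stand-in driver call. The shadow is only valid as long as nothing else changes the state behind its back.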

Also a more elegant algorithm is to use state sorting, which minimizes the number of state sets.

You can use a state tree, for example, and only set the expensive states once, and draw with a unique vector each time.
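
The talk mentions a state tree; the sketch below swaps in a simpler technique with the same goal, sorting the draw list by a state key so that draws sharing expensive state end up adjacent. It assumes the program is the most expensive state and the texture the next, which is an assumption for illustration only. `DrawItem` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { unsigned program; unsigned texture; int mesh; } DrawItem;

/* Orders draws by program first, then by texture. */
static int compareByState(const void *a, const void *b)
{
    const DrawItem *x = a;
    const DrawItem *y = b;
    if (x->program != y->program) return x->program < y->program ? -1 : 1;
    if (x->texture != y->texture) return x->texture < y->texture ? -1 : 1;
    return 0;
}

/* Counts program and texture switches needed to issue the draws in order. */
static int countStateChanges(const DrawItem *items, int count)
{
    int changes = 0;
    for (int i = 0; i < count; i++) {
        if (i == 0 || items[i].program != items[i - 1].program) changes++;
        if (i == 0 || items[i].texture != items[i - 1].texture) changes++;
    }
    return changes;
}

/* Sorts the list in place, then counts the switches for the sorted order. */
static int countStateChangesSorted(DrawItem *items, int count)
{
    assert(items != 0 && count >= 0);
    qsort(items, (size_t)count, sizeof *items, compareByState);
    return countStateChanges(items, count);
}
```

For four draws that alternate between two program and texture pairs, submission order needs eight state changes and the sorted order needs four. Sorting like this is only safe for draws whose order does not matter, so blended geometry that must be drawn back to front would need separate handling.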

[ Pause ]

However, there is some fixed overhead for a draw.

It doesn't matter how few state-setting calls you make.

We still have to do some state validation.

We need to check that the parameters you've set in the draw are appropriate for the state that has been set, and we need to make a call to the driver, and the driver needs to do some calculations to convert to hardware state.

So minimize the number of draw calls you make.

The most obvious way is not to draw things that don't show up on the screen.

Cull your objects.

You can use frustum culling if it's a 3D scene.

Just draw things that are in the area of visibility, and don't draw things that are not in the area of visibility.
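
A common way to do this is to test each object's bounding sphere against the six planes of the view frustum, sketched here in C. This is one standard approach, not something prescribed in the talk. The plane convention (unit-length normals pointing inward, with `a*x + b*y + c*z + d >= 0` meaning inside) is an assumption of this sketch, and `kUnitBox` is a simple box used in place of a real frustum so the example has concrete numbers.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { float a, b, c, d; } Plane; /* inside when a*x + b*y + c*z + d >= 0 */
typedef struct { float x, y, z, radius; } Sphere;

/* Returns false only when the sphere lies completely outside some plane.
   The test is conservative: a sphere near a corner may pass while not visible. */
static bool sphereInFrustum(const Plane planes[6], Sphere s)
{
    assert(planes != 0);
    for (int i = 0; i < 6; i++) {
        float distance = planes[i].a * s.x + planes[i].b * s.y
                       + planes[i].c * s.z + planes[i].d;
        if (distance < -s.radius) {
            return false; /* entirely behind this plane: skip the draw */
        }
    }
    return true;
}

/* Stand-in volume for the demo: the box from -1 to 1 on each axis. */
static const Plane kUnitBox[6] = {
    {  1,  0,  0, 1 }, { -1,  0,  0, 1 },
    {  0,  1,  0, 1 }, {  0, -1,  0, 1 },
    {  0,  0,  1, 1 }, {  0,  0, -1, 1 },
};
```

An app would run this per object each frame and issue the draw call only for objects that pass.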

You can combine your draw calls via instancing, which I talked about a lot earlier.

And also vertex batching and texture atlases.

[ Pause ]

Here's a way to reduce your binds.

What we would normally do is we'd have these four models and four textures.

We would bind, draw, bind, draw, bind, draw, and bind and draw.

Now that's four binds, four draws, and each draw needs to validate that that bind made sense for that draw.

We can reduce the number of binds, create a texture atlas by combining all of these textures into one.

Simply bind once, then we can draw, draw, draw, and draw.

Great. We can even go further and combine our draws, which would allow us to bind once and draw them all.

This would require us to combine all of our vertex data into one vertex buffer object.

[ Pause ]

There is a new texture atlas tool.

Sprite Kit is a new framework in iOS 7, and it is mainly for 2D games, but there are some nice tools that we can take advantage of in OpenGL.

The texture atlas tool combines images efficiently, and it produces a property list denoting the subimages.

You can scale your texture coordinates based on this property list, enabling you to render your 3D models with this texture atlas that has been produced.
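
The scaling itself might look like the C sketch below. The layout of the generated property list is not described in the talk, so `SubImage` is a hypothetical struct holding a subimage rectangle in atlas pixels, however it was read in. The sketch also assumes the model's original coordinates are in the 0 to 1 range and that the rectangle and the texture coordinates share the same origin.

```c
#include <assert.h>

typedef struct { float x, y, width, height; } SubImage; /* rectangle in atlas pixels */
typedef struct { float s, t; } TexCoord;

/* Maps a coordinate in the original image's 0..1 space into the atlas's 0..1 space. */
static TexCoord atlasTexCoord(TexCoord local, SubImage sub, float atlasWidth, float atlasHeight)
{
    assert(atlasWidth > 0.0f && atlasHeight > 0.0f);
    TexCoord out;
    out.s = (sub.x + local.s * sub.width) / atlasWidth;
    out.t = (sub.y + local.t * sub.height) / atlasHeight;
    return out;
}
```

For a 256 by 256 subimage placed at (256, 0) in a 512 by 512 atlas, the center of the original image, (0.5, 0.5), maps to (0.75, 0.25). Coordinates outside 0 to 1 that relied on texture repeat would sample neighboring subimages, so they need different handling.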

This texture atlas tool comes with Xcode.

[ Pause ]

For more information, you can talk to Allan Schaffer, our graphics and games technologies evangelist, and there's some excellent documentation on our developer site.

You can also contact the community via the developer forum, and there are some engineers that lurk on those forums as well.

So you can get your questions answered in a lot of detail.

There are a couple of related sessions.

There were 2 Sprite Kit sessions that happened yesterday, but you can catch the video of them.

And the Sprite Kit sessions talked a little bit more in detail about their texture atlas tool.

Later on in the afternoon there is "What's new in OpenGL for OS X."

OpenGL ES is derived from its big brother in the desktop world.

So you can get a bigger picture of what's happening in 3D graphics there.

[ Pause ]

In summary, you want to reduce your draw call overhead; use techniques including instancing and texture atlases to do that.

Consider the GPU's operation when you're architecting your rendering engine and in your performance investigations.

The GPU tools really help greatly in this effort while the tile-based deferred rendering architecture has some special considerations that you want to think about.

Thank you very much.
