OpenGL for Mac OS X

Session 420 WWDC 2010

OpenGL is the foundation for accelerated graphics in Mac OS X, taking advantage of the most recent innovations in graphics hardware. See how advances in OpenGL enable you to unlock the incredible rendering power of the GPU. Get all the details to take advantage of OpenGL extensions, and learn best practices and tips for modernizing and streamlining your graphics code.

Matt Collins: Good morning.

Thanks for all coming this morning.

A little early, I see a few bleary eyes out there.

Today we're going to talk about OpenGL on Mac OS.

We're going to give the desktops some love.

So let's begin.

My name's Matt Collins, I'm a member of Apple's GPU Software Team, we work on the OpenGL software stack and the driver stack.

So today I'm going to go over a little bit about things we added in 10.6.3 and some of the other software updates.

First you might be wondering where we're at.

So you wonder, I want to make a Mac game or some 3D app on the Macintosh and I want to know what I can target, what kind of features exist.

Want to know what's new in 10.6.3 and 10.6.4.

You may have heard of new extensions or new features we've added for you guys.

So we'll go over those.

And you might want to know what does that mean for your app.

What's new, what's cool, how can you leverage it.

We'll also talk a little bit about performance.

So say you got your rendering looking great, and now you want to get a little bit of performance, you want some speed.

So I'll go over tips and tricks, and we'll make your app really shine.

And lastly, we'll do some pretty pictures, because I'd be remiss if I said I'm going to come up here and talk about graphics, but there's going to be nothing cool to look at, so we'll have some cool rendering techniques and cool demos.

First let's talk about OpenGL.

Give you a little background.

You may have gone to some of the early talks, talking about OpenGL ES.

This is the lowest level access to the graphics hardware.

So on the desktop if you want to get the GPU's power and really use it, you've got to use OpenGL.

Most of our other frameworks are built on top of it.

So everything else, like Core Image, Core Animation, Quartz Composer, they're all built on top of OpenGL and they all use it to leverage the GPU's power.

So let's talk about last time at WWDC.

So some of this should be a quick recap for some of you.

You might have heard of buffer objects, vertex buffers, index buffers, frame buffer objects, fixed function pipeline, multitexturing, shader pipeline, and you've heard of the vertex shader, the geometry shader, the fragment shader.

This should all be a quick overview.

We'll go over some of these topics in more detail and if you have any questions you can always come and talk to us.

So that's where we were.

But where are we now?

Well, now we have new extensions, new features.

Give you better access to hardware functionality.

These are things your GPU could already do but you didn't have good access to.

Most of this is 10.6.3 and above only.

So if you want to target these cool new things you've got to use 10.6.3 and above.

But first some advice.

The first piece of advice, and you may have heard this at some of our other talks is use generic vertex attributes.

So when you send a vertex down to the GPU to render they have attributes.

This could be the position, the color, the normal, et cetera.

And those are all built in to OpenGL.

But we're going to tell you to use the generic ones.

This is because this is the best way to use a shader, and shaders are native.

So if you use fixed function, which is the old style rendering, glBegin, glEnd, et cetera, that's all actually emulated in the drivers.

If you went to Dan Omachi's talk you may have heard that whenever you set up fixed function state your driver will actually generate a shader behind the scenes to emulate that state.

And even better, you can port to OpenGL ES 2.0 if you use generic attributes, because OpenGL ES doesn't have any of the old fixed function stuff at all.

Now let's talk about some new things.

This is all stuff that's now available in 10.6.3.

First we have a set of extensions that are here really to make your life easier, compatibility.

So provoking vertex, vertex array BGRA, depth buffer float.

This is all to help you with compatibility and porting.

Next we have a set that's really about empowering your app.

So some new features that allow you to do techniques you just couldn't do before.

Frame buffer objects, texture arrays, and instancing.

And last we have a set for performance and memory: conditional rendering, different texture formats, texture RG, which is one- and two-channel textures, packed float, which is mainly for HDR, and shared exponent textures.

So we'll go over these in detail individually.

And we'll do some learning.

So first I'll talk about an extension, I'm going to tell you what it does.

Then I'll tell you why you should care.

Lastly, I'll show you a demo of something cool that you can do with it.

So let's get started.

First we'll talk about the flexibility and compatibility extensions.

The first one is provoking vertex selection, this is EXT_provoking_vertex.

Now who here is familiar with the term provoking vertex?

Anyone? That's good.

Okay, well I'll give you a little explanation.

When you draw something you can either have it smooth shaded or flat shaded.

A flat shaded thing is the same color.

So if you have a square, a quad that's flat shaded, the color has to be chosen from one of the vertices that make up that quad.

And that is the provoking vertex.

Now in OpenGL, normally the provoking vertex would be the last one.

This lets you select which one is the provoking vertex, so you can tell which vertex supplies that attribute.

You have a new entry point, very simple glProvokingVertex.

You can tell it pick the first or pick the last.

Couldn't be easier.
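
As a rough illustration of what "first" versus "last" means for a flat-shaded triangle, here is a small CPU-side Python sketch. It is not an OpenGL call; the names `FIRST_VERTEX`, `LAST_VERTEX`, and `flat_shade_color` are made up for this example and simply stand in for the two conventions you can pass to glProvokingVertex.

```python
# CPU-side model of flat shading; nothing here talks to OpenGL.
FIRST_VERTEX = "first"   # stands in for the first-vertex convention
LAST_VERTEX = "last"     # stands in for the last-vertex convention (the OpenGL default)

def flat_shade_color(triangle_colors, convention=LAST_VERTEX):
    """Return the single color a flat-shaded triangle is drawn with."""
    if convention == FIRST_VERTEX:
        return triangle_colors[0]
    return triangle_colors[-1]

triangle = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]   # red, green, blue vertices
print(flat_shade_color(triangle))                    # last vertex supplies the color
print(flat_shade_color(triangle, FIRST_VERTEX))      # first vertex supplies the color
```

With the default convention the triangle comes out blue; switching the convention makes it red, without touching the vertex data.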

Now I brought up quads.

What about quads?

Well quads are hardware dependent, because all of our GPUs, they're really good at rendering triangles.

You can make a quad with two triangles, but that's not how the hardware works, you know, on the very basic level.

So the behavior for a quad is hardware dependent.

And you can query this with this enum GL_QUADS_FOLLOW_PROVOKING_VERTEX, and it will tell you whether your setting is either ignored or obeyed.

Now that's interesting, but why do you care.

Well, for better flexibility, sure.

But mainly it allows you to pick which vertex you're pulling your color, your attributes from, without modifying your art assets.

So let's say you had a game and it had some cool particle system.

And when a Jeep went across the ground it kicked up a bunch of dust.

So you have a particle spray, and it's a gray dust cloud, but maybe you want a red dust cloud or a green dust cloud.

You could set a color on one of the vertices for that particle, and you could get a colored dust cloud.

So here's an example of a flat-shaded gear.

Now when you think about it, you don't just have to use color, you could use anything to be flat shaded, right?

So here you have an example of normals being visualized.

So even the normal can be flat shaded.

It's anything you want.

It doesn't just have to be color.

This is just an example of how you could use anything to be flat shaded.

Next let's talk about BGRA ordering.

Now in OpenGL normally when you provide a color it's going to be RGBA.

Other APIs may not work this way.

So this allows you to specify colors in BGRA order.

Again, you can use something without modifying your art assets.

Something a little strange about this extension, though, is that you actually supply GL_BGRA as the size parameter, because there's no order parameter.

So normally you would specify ColorPointer, SecondaryColorPointer, or VertexAttribPointer, and I put VertexAttribPointer in gold because that's how you supply a general vertex attribute which is good.

But if you use this, the size is implied to be four, even though you're actually setting the size as GL_BGRA.

And this has to be unsigned bytes only, because it has to do with how you're packing your stuff in.

So here's some code, it's very simple.

You bind a buffer with your color VBO name.

If you've gone to some of the other talks they talk about the importance of using VBOs.

So I put all my colors in this VBO, now I bind it.

And now I say VertexAttribPointer.

I give it the index, which is the index of my attribute in my shader.

And then instead of size I have GL_BGRA and UNSIGNED_BYTE, FALSE means I don't want it to be normalized, 0, NULL.

Pretty simple.
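
To make the byte ordering concrete, here is a minimal Python sketch of the reordering that BGRA-ordered color data undergoes before the shader sees it. It only models the swizzle on the CPU; `bgra_bytes_to_rgba` is a hypothetical helper, not part of OpenGL.

```python
def bgra_bytes_to_rgba(color_bytes):
    """Reorder one 4-byte BGRA color into (r, g, b, a), the order a shader works in."""
    if len(color_bytes) != 4:
        raise ValueError("a BGRA color is exactly four unsigned bytes")
    b, g, r, a = color_bytes
    return (r, g, b, a)

# An opaque orange stored the way a BGRA-ordered asset would store it.
stored = bytes([0x00, 0x80, 0xFF, 0xFF])   # B, G, R, A
print(bgra_bytes_to_rgba(stored))
```

The point of the extension is that this swap happens for you when the attribute is fetched, so the asset data can stay in its original order.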

Next we'll talk about some floating point depth buffers.

This is just a couple new formats for your floating point depth buffer or for your depth buffer in general.

It allows you to be floating point, 32-bits of float, and there's also a 32-bit with an 8-bit stencil.

New type. And one thing to keep in mind is you notice 32-bits plus 8 bits, that's more than a single double word.

So you're going to be using a lot more space if you use a 32-bit depth buffer and an 8-bit stencil buffer, something to keep in mind.

Now why would you want to do this?

Well, it's mainly for very deep scenes or very small scenes.

So keep in mind if you're rendering, like, the universe, that's really, really deep, right?

So a floating point number is going to be better at doing something like that.

Or if you're rendering something really, really small, like the insides of an atom and you want something that's .0001, you know some extremely small number, a floating point number is also going to excel at doing something like that.

This is worth keeping in mind, because the standard depth buffer actually has better precision closer to the near plane.

So you think of your depth, your depth is actually projected.

So it's the Z value over the W value.

And that results in a curve with greater precision, better precision closer to the near plane.

Floating point numbers also have better precision closer to 0.0, so when you use this, you have to keep both of those in mind.

That's something to watch out for.
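
The curve described above can be sketched numerically. The following Python snippet assumes a standard perspective projection and the default depth range of 0 to 1; the near and far values are hypothetical, and `window_depth` is a name invented for this example.

```python
def window_depth(distance, near, far):
    """Depth-buffer value in [0, 1] for a point `distance` units in front of the camera.

    Assumes a standard perspective projection and the default depth range [0, 1].
    """
    return far * (distance - near) / (distance * (far - near))

near, far = 1.0, 1000.0   # hypothetical clip planes
for d in (1.0, 2.0, 10.0, 100.0, 1000.0):
    print(f"distance {d:7.1f} -> depth {window_depth(d, near, far):.6f}")
```

With these planes, slightly more than half of the depth range is spent on the first unit past the near plane, and everything from distance 100 out to 1000 is squeezed into roughly the last one percent of the range.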

All right, now let's talk about empowerment.

These are some techniques, some extensions to allow you to use techniques you couldn't use before.

So the first one I'd like to talk about is array textures.

An array texture is an array of 1D or 2D images with each layer being a distinct image.

There's no filtering between layers and you have distinct mipmaps per layer.

This allows you to do some cool things.

It also means you have to use the programmable pipeline.

So the only way to use an array texture is to use a shader.

So you also have new texture targets, 1D array and 2D array, and new samplers.

So you sample these just like you'd sample a normal texture, except that in the 2D case the third texture coordinate will actually pick the layer.

So you have layer 0, layer 1, layer 2, et cetera.

And in the 1D case you actually have a 2D texture coordinate and the second coordinate picks the layer.
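
Here is a small CPU-side Python sketch of a 2D array texture lookup, where the third coordinate picks a whole layer instead of being filtered. The round-to-nearest-then-clamp rule in `select_layer` is my reading of how the layer coordinate is treated, so take it as an assumption; the helper names and the tiny letter "textures" are purely illustrative.

```python
import math

def select_layer(layer_coord, num_layers):
    """Pick the layer for an array texture lookup: round to nearest, then clamp."""
    layer = math.floor(layer_coord + 0.5)
    return max(0, min(num_layers - 1, layer))

def sample_2d_array(texture, s, t, layer_coord):
    """Nearest-neighbor lookup; `texture` is a list of layers, each a list of rows."""
    image = texture[select_layer(layer_coord, len(texture))]
    height, width = len(image), len(image[0])
    x = min(width - 1, max(0, int(s * width)))
    y = min(height - 1, max(0, int(t * height)))
    return image[y][x]

# Two hypothetical 2x2 layers holding single-letter "texels".
texture = [
    [["a", "b"], ["c", "d"]],   # layer 0
    [["w", "x"], ["y", "z"]],   # layer 1
]
print(sample_2d_array(texture, 0.75, 0.25, 0.0))   # texel from layer 0
print(sample_2d_array(texture, 0.75, 0.25, 1.0))   # same (s, t), but from layer 1
```

The same (s, t) coordinate returns a completely different texel depending on the layer, and no value is ever a mix of two layers.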

Now why is this interesting?

Well, it allows you to store a unique data slice per layer of this texture.

This is completely unique and isn't touched by any of the other ones.

It's like a distinct image you literally have an array of 2D images.

So you think well 3D texture is kind of like that.

It's a volume texture.

I could just think of it as unique things.

But that doesn't quite work because you can't mipmap each level.

If you've ever tried mipmapping a 3D texture everything becomes this blob that's, you know, an average of all the layers smashed together.

So let me show you a demo of this technique.

So this is a little terrain demo I wrote.

And you can see here there's a couple little guys on the mountains.

And there's the water and below that there's a little gray and some green grass going up to the snow.

Now you've probably seen terrain demos like this, but the cool thing is this is actually one texture that's an array texture with four layers depending on the elevation.

So there's a rock level, a grass level, a more mountainy one, and then the snow.

And I can dynamically change the terrain and it will actually update which texture is being sampled from.

So you can see that part that went up, and the top part has snow and the bottom part has grass, and it sort of blends in between the layers, and that can go down, changes again.

So this is really cool for, like, a dynamic terrain engine, because you have one texture, and it will automatically texture correctly based on the elevation of the point.
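
One way a shader could turn elevation into a pair of layers and a blend factor is sketched below in Python. This is an assumption about how such a terrain shader might work, not the demo's actual code; the elevation range, the `terrain_layers` helper, and the layer order (taken from the description above) are illustrative.

```python
import math

LAYER_NAMES = ["rock", "grass", "mountain", "snow"]   # assumed order, lowest to highest

def terrain_layers(elevation, min_elev, max_elev, num_layers=4):
    """Return (lower_layer, upper_layer, blend) for a given elevation."""
    t = (elevation - min_elev) / (max_elev - min_elev)
    t = max(0.0, min(1.0, t))                 # clamp to the valid elevation range
    position = t * (num_layers - 1)           # fractional layer index
    lower = min(int(math.floor(position)), num_layers - 1)
    upper = min(lower + 1, num_layers - 1)
    return lower, upper, position - lower

lower, upper, blend = terrain_layers(50.0, 0.0, 100.0)
print(LAYER_NAMES[lower], LAYER_NAMES[upper], blend)
```

In a shader you would sample the array texture twice, once per layer, and mix the two results by the blend factor, since the hardware does no filtering between layers.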

So there's actually sample code that you can download and check this out.

I highly encourage you to do so.

It's a pretty cool technique.

We'll come back to that a little bit later.

Next I want to talk about instancing.

How many people have heard of instancing.

A few people, that's good.

So we expose instancing with ARB_instanced_arrays.

And this allows you to reuse primitives with a single draw call.

Some of the other talks talked about the importance of batching your draw calls.

Fewer draw calls is always going to be better.

So if you can draw 100 things with a single draw call that's awesome.

Again, programmable pipeline only, and you must use generic vertex attributes to use this technique.

This is because you're essentially sourcing attributes at different rates.

If you have a position for a point you're always going to have a separate position per vertex, otherwise they're going to be on top of each other, right?

But you may think what other things could I pass?

Well, you could pass an orientation matrix, you could pass color, you could pass normals.

So you could reuse the same model, draw it 100 times with different orientation matrices and put the same guy in 100 different places.

So to do this you use glVertexAttribDivisorARB.

Now you specify a divisor, which tells OpenGL how to source these attributes differently.

So let's give a little example.

First you can see we have the positions and they're moving, and the attributes.

So different positions get the same attribute.

Gets the purple one, now it gets the red attribute, yellow position, completely different, gets the same red attribute.

Green one gets the yellow attribute and so on.

They're different positions, and they're reusing the same attributes.

So consider each position as its own instance.

And here, in this case, the divisor is 2, because each attribute is reused for 2 instances before moving on to the next one.
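
The indexing rule behind the divisor can be sketched in a few lines of Python. This models only the arithmetic on the CPU; `attribute_index` and the color list are hypothetical names for this example.

```python
def attribute_index(instance_id, divisor):
    """Which element of an instanced attribute array a given instance reads."""
    if divisor <= 0:
        raise ValueError("this sketch only models instanced attributes (divisor >= 1)")
    return instance_id // divisor

colors = ["purple", "red", "yellow"]   # hypothetical per-instance attribute data
for instance in range(6):
    print(instance, colors[attribute_index(instance, 2)])
```

With a divisor of 2, instances 0 and 1 read the first element, instances 2 and 3 read the second, and so on; a divisor of 1 would give every instance its own element.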

So you might want to do this because it saves overhead.

That's the main reason.

And performance is always going to be better the fewer draw calls you do.

There's different techniques.

This is commonly referred to as stream instancing, if you ever heard of that.

You're sourcing vertex attributes at different rates and the best way to do this and the example that you can download is sourcing different position and orientation matrices per instance.

So let's give a demo of that.

So this is a bunch of spinning gears, use the gear model I showed you earlier.

And it's a 9 by 9 grid times 6 faces.

And this is a single draw call.

There's one draw, glDrawElementsInstanced.

And you can see that they're all animating separately, some of them turn one direction, the others turn the other direction, they're all lit.

You can see the per-pixel lighting.

They're kind of shiny.

And it's actually a pretty simple technique.

There's just one big buffer that is the attributes.

And I have a matrix packed in there.

So each individual gear gets its own matrix that represents its orientation, the turning animation, and its position in space.

And finally I have a camera matrix that controls the camera so I can move around this cube of moving gears.

This is also available for you to download.

You can check it out.

This is really common in a lot of games, for say tree rendering if you want to render some foliage or some tufts of grass, anything like that.

Even particles are great for instancing.

Any time you have to render something over and over again that looks really similar this is a great technique to leverage.

All right, let's talk about frame buffer objects.

This is a big one that a lot of people have asked for, so we've implemented it for you guys.

A frame buffer object is a generalized off-screen render target.

Now we recommend when you do your rendering, you do render to an FBO.

This is an off-screen target.

You can render to a texture or a render buffer.

If you're familiar with the iPhone you have to render to an FBO.

So you should also be doing this on the desktop.

You can attach buffers of different dimensions to your FBO now, and you can attach different formats.

This is something you can't do on the iPhone, but you can do it here.

And there's a bunch of reasons you might want to.

Now, FBOs themselves are not new; this is a technique that's been around for a long time, a capability that we've had for a while.

But now you can attach different size buffers or different types of buffers.

So you can do things like reuse the Z buffer.

Say you want to render something at full size.

So you have a full resolution Z buffer.

Then you want to render at a quarter size or half size.

You can reuse that Z buffer so you don't have to rebuild that information, and you can get the benefits of having some early Z optimizations.

You can also use this to render data to a texture.

Now you could always do this before, but you had four components and you were pretty much forced to use RGBA or RGB.

But you may not always need that.

We'll get a little more on this later.

So with that in mind, let's move on to performance and memory.

First let's talk about texture_RG.

Texture_RG is a one- or a two-channel texture, R or RG, respectively.

There are many formats: 8-bit, 16-bit, and 32-bit unsigned int, and we also have 16-bit and 32-bit floating point.

And this is mainly for data storage.

And it can also be a render target which fits nicely into the ARB FBO extension.

So you can see here on the right we have a teapot.

And you might wonder well, why do I care about one- or two-channel textures.

It's only going to give me red and it's going to give me green.

Like this teapot is red and green.

Right? What am I looking at?

Well, you can combine this with ARB FBO to render data to a texture.

What type of data?

Oops, let's move on.

Sorry, luminance.

You think I can render to luminance, which is a one-channel texture, right?

Well, luminance is not renderable, so that won't quite work.

But going back to data, you might say why do I care.

Why do I need four components?

Well, take, say, screen space motion blur.

Screen space motion blur really only needs two components, an X and a Y vector.

So you can write your X and Y vectors out to an RG texture and then sample along those two vectors to get a nice blur.

This is also really useful for a technique called deferred shading.

So deferred shading, let's go into that. How many people have heard of this, by the way? Anyone?

This is commonly used in a lot of games.

And you need two passes.

On your first pass you're going to transform your geometry as usual, then you'll render the attributes for the lighting calculations to what's called a G-buffer.

Your second pass you draw a full screen quad and you read from the G buffer to perform lighting calculations in screen space.

Now let me give you an example of a possible G-buffer layout.

So here I have three render targets, three separate textures that I'm going to render to at the same time.

My first texture will store my position.

This is typically going to be your position in world space or camera space.

I've chosen to do it in camera space, because I think it's easier, but it's up to you.

Just make sure everything's in the same coordinate space so your lighting looks right.

So my first texture will contain my X, my Y, and my Z position.

My second render target, my second texture will actually represent the color of that pixel, the unlit colors.

So if I have a texture map, it will be that color.

Or if I don't have a texture map it can be an ambient color or the material color too, so red, green, blue, and alpha.

Lastly, I'm going to store my normal.

And you see here I actually only have the X and the Y.

That's because I have a two-channel texture, and I can reconstruct the Z component in the shader.

Consider that your normal XYZ, you know the length is going to be 1.

It's normalized.

So that way you can reconstruct it, because you know 1 has to be the length of your final vector.

So the Z component is pretty easy to reconstruct.

You also know that the Z is going to be at least facing moderately towards you, if it's in screen space, or if you store it in camera space.

Because if it's facing away from you it wouldn't be lit at all.
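
The reconstruction described above comes down to one square root. Here is a minimal Python sketch of it, assuming a unit-length normal stored in camera space whose Z component points toward the viewer; `reconstruct_normal` is a name made up for this example.

```python
import math

def reconstruct_normal(nx, ny):
    """Rebuild a unit normal from its stored X and Y, assuming Z faces the camera (z >= 0)."""
    z_squared = max(0.0, 1.0 - nx * nx - ny * ny)   # clamp guards against rounding error
    return (nx, ny, math.sqrt(z_squared))

n = reconstruct_normal(0.6, 0.0)
print(n)
print(math.sqrt(sum(c * c for c in n)))   # length of the rebuilt normal, close to 1
```

The clamp matters in practice: after being stored as 16-bit floats, X and Y can round to values whose squares sum to slightly more than 1, and taking the square root of a negative number would poison the lighting.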

Each of my components here is a 16-bit float.

So the whole thing here, you can see how much space I'm going to take up.

Space is important to always keep in mind.
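
For a sense of scale, the arithmetic for this layout can be worked out in Python. The component counts come from the layout above; the 1280 by 720 resolution is a hypothetical choice, and the sketch ignores any padding the hardware may add (a three-component target may well be stored as four).

```python
BYTES_PER_HALF_FLOAT = 2

# Component counts from the layout described above.
gbuffer_components = {"position": 3, "color": 4, "normal": 2}

def gbuffer_bytes(width, height):
    """Total G-buffer size in bytes, ignoring hardware padding and alignment."""
    per_pixel = sum(gbuffer_components.values()) * BYTES_PER_HALF_FLOAT
    return width * height * per_pixel

size = gbuffer_bytes(1280, 720)   # hypothetical render resolution
print(size, "bytes, about", round(size / (1024 * 1024), 1), "MiB")
```

That is 18 bytes per pixel. Storing the normal with all three components would add another 2 bytes per pixel, which is the saving the two-channel texture buys.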

So as I store out attributes, here's the pixel shader, the fragment shader: I have a varying position, a varying normal, and I have my color as well.

So frag data is what you use to store to multiple render targets in OpenGL.

So in frag data 0, my first render target, I'm storing the X, Y, and Z of the position.

My second render target, I'm just storing the color.

And my last one I'm going to store the X and Y components of my normal.

Here we go.

So let's take a look at what deferred shading looks like.

Here I have the Utah teapot with some lights.

So the big advantage with deferred shading, you might wonder why you want to use it, all your lights will be done in screen space.

Normally, you would do lighting either at the vertex stage or at the fragment stage.

The vertex stage you're going to apply your lights once per vertex, it will be interpolated across your primitive.

In a forward renderer, if you apply your light at the fragment stage you can redo the lighting calculations any number of times.

Typically, you'll have lots of overdraw in your scene.

Which means that everything you're rendering may not be visible because it may be behind something else, occluded by it.

That still means you have to do the lighting calculations because you don't know if it will be visible until the very end.

The good thing about deferred shading is you only do lighting calculations once per pixel.

So you're guaranteed everything that you're actually lighting will be visible.

So let's visualize these buffers that we had, we talked about before.

Move the teapot a little bit so we get a better view.

This is the position buffer, the X, Y, and Z.

Now consider that the Z is probably going to be pretty consistent across it because the teapot is an equal length away, so you can't really see the blue part.

But the X and the Y are going to be red and green.

And notice there's a big black part.

Well that's because those are negative coordinates, right?

And there's no negative color.

So that would be like negative 1 would be in the lower left.

And you just can't see it with color because that's how we're visualizing it.

This is the color buffer.

So my teapot is actually one color, it's just gray.

If you had a texture map, the texture map would appear here.

It's the unlit color of the screen.

This is commonly referred to as the albedo.

You may have heard that term.

So this is the color buffer.

Lastly, we have the normal buffer.

This is the transformed normal that we use in our final lighting calculation.

You can see the back faces being culled in the teapot's spout there.

The normals are nicely smooth around the side so you can see that I'm reconstructing them correctly.

Finally, we have the deferred shaded teapot.

It's pretty cool, you can see it has full specular lighting per pixel.

And it's running at 60 frames a second no problem.

I was going to show this with like, 18 lights, because it's easy to do.

But it's so overwhelming when you have a bunch of colors dancing, it's almost seizure-inducing.

So I decided to keep it a little simple.

Next we'll talk about packed floats.

This is a new format, it allows you to pack three floating point numbers into a 32-bit value.

New internal formats, R11, G11, B10, so 11, 11, 10 textures.

You may have heard of that if you're familiar with other APIs.

So you can see how this is packed in.

You have 11 bits for the red, 11 bits for the green, 10 bits for the blue.
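
To show how three floats can fit in 32 bits, here is a Python sketch that packs and unpacks an 11-11-10 value on the CPU. It assumes each channel is an unsigned float with a 5-bit exponent (6 mantissa bits for red and green, 5 for blue) and that red sits in the low bits; it truncates rather than rounds, clamps negatives to zero, and expects finite inputs. The function names are invented for this example.

```python
import struct

MAX_11 = 65024.0   # largest finite unsigned 11-bit float (5-bit exponent, 6-bit mantissa)
MAX_10 = 64512.0   # largest finite unsigned 10-bit float (5-bit exponent, 5-bit mantissa)

def _half_bits(value):
    """IEEE half-float bit pattern of a finite value."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def _half_value(bits):
    """Float value of an IEEE half-float bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def pack_r11g11b10(r, g, b):
    """Pack three floats into 32 bits, red in the low bits; mantissas are truncated."""
    r_bits = (_half_bits(min(max(r, 0.0), MAX_11)) >> 4) & 0x7FF   # 5 exp + 6 mantissa bits
    g_bits = (_half_bits(min(max(g, 0.0), MAX_11)) >> 4) & 0x7FF
    b_bits = (_half_bits(min(max(b, 0.0), MAX_10)) >> 5) & 0x3FF   # 5 exp + 5 mantissa bits
    return r_bits | (g_bits << 11) | (b_bits << 22)

def unpack_r11g11b10(packed):
    """Recover the three floats from a packed 32-bit value."""
    r = _half_value((packed & 0x7FF) << 4)
    g = _half_value(((packed >> 11) & 0x7FF) << 4)
    b = _half_value(((packed >> 22) & 0x3FF) << 5)
    return (r, g, b)

packed = pack_r11g11b10(1.0, 0.5, 2.0)
print(hex(packed), unpack_r11g11b10(packed))
print(unpack_r11g11b10(pack_r11g11b10(1.0078125, 0.0, 0.0)))   # the small fraction is lost
```

The trade-off is visible in the second call: there is no sign bit and only a handful of mantissa bits, so you keep the range of a 16-bit float but give up precision.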

So what is this useful for.

Well, this is mainly for HDR rendering, high dynamic range.

And you can do this because if you look at the sun or say you look at a bright light, like these here are shining on me, they're very bright, or look at the projector's light.

And then if you look at the ground at something in shadow, that's orders and orders of magnitude difference.

So if you looked at the sun and then you look at something that's completely dark, that's like a million times brighter, right?

Just this huge magnitude.

And 8 bits, not really enough to express that.

Floating point is way better at doing something that's vastly different.

So you could say well, I could just use some floating point numbers.

We have 16-bit floats, we have 32-bit floats.

That's true, but that's a lot of space.

It would be much better if you could somehow pack that into just 32 bits.

So that's what this allows you to do.

The next thing I'm going to talk about is called conditional rendering.

Now conditional rendering is rendering based on the values of an occlusion query.

Who here has heard the term occlusion query?

Okay, good.

An occlusion query is a test to see how many pixels are actually rendered by the GPU.

You can use this to do a lot of different rendering.

It will give you a value back called samples passed.

And you can say like, oh, these last rendering calls rendered 100 pixels, these last rendered 1,000.

You can do rendering based on that to see like, oh, is this house behind a skyscraper, is it not.

The problem is you have to query the GPU and it has to wait to give you that value back.

So there's a whole round trip involved here.

And we like to remove any round trip we possibly can.

So conditional render allows you to do that.

You can just give it the query, which is here, as ID.

You can give it a mode which tells it to wait or not.

So you would say begin conditional render with this information, put in all your rendering commands, and then end conditional render.

And the stuff that's bracketed between those two will be conditionally rendered based on the result of your occlusion query.

Little code here.

The first thing to keep in mind when you're rendering an occlusion query is you want some sort of coarse bounding volume.

But you don't want to actually draw that volume because you'll have some bounding box, right?

You have a complicated guy, he might be behind some house or skyscraper, you don't want to draw the whole guy, you just need a box that goes around him.

You know, sort of like the cone of silence or something.

So you turn color mask and depth mask off.

It is much faster to render to nothing than it is to actually render to something to a frame buffer, and you don't need to actually draw this stuff.

So you want to turn these off before you render your occlusion query.

Next, you render the occlusion query as normal.

The coarse bounding volume.

Begin query, samples passed, with your query name.

Draw elements or whatever your draw calls are, and then you end the query.

Now I actually want to start drawing again, so I have to turn my state back on.

Lastly, pretty simple, begin conditional render based on the same query I used before.

Draw all my other stuff, and then end conditional render.

Pretty simple technique, and could be pretty powerful.
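
The sequence of steps above can be modeled with a tiny software rasterizer in Python. This is a CPU-side sketch of the control flow only: `TinyRasterizer` and `draw_if_visible` are invented for illustration, the scene is a single scanline, and the Python `if` stands in for the decision that conditional rendering lets the GPU make without a round trip.

```python
class TinyRasterizer:
    """A one-scanline software 'GPU' with a depth test and write masks."""

    def __init__(self, width):
        self.depth = [1.0] * width
        self.color = [None] * width
        self.color_mask = True
        self.depth_mask = True

    def draw_span(self, x0, x1, z, color):
        """Draw pixels x0..x1-1 at depth z; return how many passed the depth test."""
        samples_passed = 0
        for x in range(x0, x1):
            if z < self.depth[x]:
                samples_passed += 1
                if self.depth_mask:
                    self.depth[x] = z
                if self.color_mask:
                    self.color[x] = color
        return samples_passed


def draw_if_visible(gpu, bounds, z, color):
    """Occlusion-test a coarse bounding span, then draw the real object only if needed."""
    gpu.color_mask = gpu.depth_mask = False          # test only, write nothing
    samples_passed = gpu.draw_span(bounds[0], bounds[1], z, None)
    gpu.color_mask = gpu.depth_mask = True           # turn writes back on
    if samples_passed > 0:                           # the decision the GPU makes for you
        gpu.draw_span(bounds[0], bounds[1], z, color)
    return samples_passed


gpu = TinyRasterizer(width=8)
gpu.draw_span(0, 4, 0.2, "wall")                     # occluder covering the left half
print(draw_if_visible(gpu, (1, 3), 0.5, "gear A"))   # fully behind the wall
print(draw_if_visible(gpu, (3, 6), 0.5, "gear B"))   # partly visible
print(gpu.color)
```

Gear A never touches the frame buffer because its bounding span produced no samples, while gear B is drawn only where it clears the wall.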

Now the funny thing about conditional rendering demos is there really isn't much to see, because if it's working you're not actually rendering anything.

Now here are something like 10,000 gears stretching off into infinity.

And this is just drawing them all normally.

I have a whole bunch of them, they're all lit per pixel and kind of multicolored and shiny.

And I put something between the camera and the back of the gears.

So here the problem is I'm only rendering like three rows of gears.

There's still, you know, 9,000 something rows behind that white plane that you can't see.

But they're still all being rendered.

That's wasting time that doesn't need to be wasted.

So I press this.

And you notice that it jumped and the frame rate actually got a little better.

That's because all the other 9,000 rows are not being drawn anymore.

So back to normal.

So sometimes you can get, like, a good 10 FPS or maybe even more, depending on the complexity of your rendering.

This is something you have to build into your engine or your app.

But it's not that hard to do if you're already using occlusion queries, that could be a very powerful technique.

I would urge you to take a look at the sample code because it's a lot easier to see and understand when you're looking at the code than it is for a demo.

Because as I said before, it's doing its job, there's nothing to see.

So now we move on to performance.

I promised you I'd talk a little bit about performance, and we're going to do that.

There's a bunch of performance characteristics on the desktop that you may not be familiar with if you're an iPhone programmer.

So let's begin with stuff to avoid.

First of all, I want to advise you not to use immediate mode.

Immediate mode is costly.

So when you do immediate mode, you say glBegin, glVertex, glVertex, glVertex, glVertex, glVertex, you're specifying every point.

And that's really slow.

Consider your average model, which could be anywhere from 2,000 to 10,000 points.

So do you really want to specify 10,000 points with glVertex, glVertex, you know, thousands of times? Probably not.

Also, you have to send that data over the bus every single frame.

Every single time you say GL vertex that's some more data to be sent over the bus.

You have all this VRAM.

All of our desktops have tons and tons of VRAM, 128 Megs, 256 Megs, 512 Megs, you want to use this stuff.

So send all your data up to the card and render from there instead of specifying it every time.

So if you have any code that looks like this in your application I want you to go and cut it all out.

Just get rid of it.

Use VBOs. Draw arrays, draw elements, this is the way to go.

By the same token, if you've ever heard of a display list, I'm here to say that display lists probably don't really help you.

They're really not much of a performance boost.

You may see a little, but it's really not good.

You're caching commands in the display list.

But what really hurts you is caching state.

Now if you went to my colleague's talk yesterday, he talked about state validation in the driver.

This is really where a lot of the CPU overhead with the drawing is going to go.

And since display lists inherit state, we can't really cache it for you.

You could say call list, but you can change all the state in between each call.

So we still have to revalidate all that state, which FBO is bound, which texture is bound, you have depth tests, you have alpha tests, all that could be different, different fragment programs.

We can't validate that, so you're really not getting any benefit of caching these draw commands.

So if you have stuff that looks like this in your apps, begin list, a bunch of stuff, end list, call list, that also needs to go away.

[inaudible] isn't available on the phone anyway, so if you really wanted to port any code or you wanted to share code between the two platforms you couldn't.

So it's much better just to use draw arrays, draw elements, use vertex buffer objects to draw.

So to reiterate what you might have heard yesterday, batch your state.

This is an important way to improve performance, because all state changes require validation by the driver.

There's a ton of state in OpenGL, and it all has to be consistent before you get good rendering.

So the driver has to go validate all this stuff, make sure it's coherent.

It also requires a vector to be sent down to the hardware of all your current state.

So if you're changing this all the time between your draws, it's constantly revalidating and constantly resending back down to the driver, or the driver is resending back to the hardware.

And that takes time.

This is expensive, that's precious time you could be doing something else with.

So you want to avoid it, if you batch all your similar draw calls together, all your similar objects, you don't have to repeat this.

Now as my colleague said yesterday there's sort of a hierarchy of what costs more, what costs less, and I would urge you to go look at his presentation, take a look at that and think about how you can re-architect your app to take advantage of grouping all your similar draws together.
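One way to picture grouping similar draws is to sort a draw list by the state each draw needs and count how many rebinds remain. This is a minimal sketch, not the driver's actual cost model: the `DrawItem` struct and its fields are hypothetical stand-ins for whatever state your renderer binds, and every change is treated as equally expensive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-draw record; the ids stand in for real GL state. */
typedef struct {
    unsigned program;  /* shader or fragment program id */
    unsigned texture;  /* bound texture id */
    unsigned mesh;     /* which geometry to draw */
} DrawItem;

/* Order by program first, then texture, so draws sharing state are adjacent. */
static int compare_by_state(const void *a, const void *b)
{
    const DrawItem *x = a;
    const DrawItem *y = b;
    if (x->program != y->program) return x->program < y->program ? -1 : 1;
    if (x->texture != y->texture) return x->texture < y->texture ? -1 : 1;
    return 0;
}

void sort_draws_by_state(DrawItem *items, size_t count)
{
    assert(items != NULL);
    qsort(items, count, sizeof *items, compare_by_state);
}

/* Count how many program or texture binds the given submission order needs. */
size_t count_state_changes(const DrawItem *items, size_t count)
{
    size_t changes = 0;
    for (size_t i = 0; i < count; ++i) {
        if (i == 0 || items[i].program != items[i - 1].program) ++changes;
        if (i == 0 || items[i].texture != items[i - 1].texture) ++changes;
    }
    return changes;
}
```

Calling `count_state_changes` before and after `sort_draws_by_state` gives a quick feel for how much revalidation a given ordering would trigger.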

You can also use Shark to check to see where time is spent.

We have a tool called OpenGL Driver Monitor that can help you as well.

You can look for things like CPU wait for GPU, or GPU wait for CPU to find out if you're bound by the CPU, if you're bound by the GPU.

If your GPU is constantly waiting for the CPU that means there's probably something you can do in your app to help your rendering go faster.

Also you may have heard of hoisting.

How many people have heard the term hoisting?

Okay, hoisting is moving something up, you want to pick it up, you want to move it up the pipeline.

So consider you have a vertex shader.

If I have a model with 10,000 vertices, that vertex shader is going to run 10,000 times.

Now consider I'm rendering at 1600 by 1200.

That's almost 2 million pixels.

My fragment shader is going to run on the order of 2 million times, maybe even more with overdraw.

A common scene may have four layers of overdraw, right?

Someone could be inside a house, and you could have the front wall of the house, and you could have a window so you can see inside the house, so you still have to draw everything inside because you don't know if the window's going to be able to see it or not, right?

That's way too much overdraw, and you're going to run this shader millions and millions of times per frame.

So if it's possible to move any calculations out of the fragment shader into the vertex shader your performance is going to jump.

Because would you rather run something 10,000 times or over 2 million times?

Pretty simple idea, but something I think a lot of people overlook.
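The arithmetic behind that comparison can be sketched in a few lines of C. The function name is hypothetical, and the overdraw factor is simply the average number of fragments rasterized per pixel, with four layers being the figure mentioned above.

```c
#include <assert.h>

/* Rough estimate of fragment shader invocations per frame.
   overdraw is the average number of fragments rasterized per pixel. */
unsigned long fragment_invocations(unsigned long width,
                                   unsigned long height,
                                   unsigned long overdraw)
{
    assert(overdraw >= 1);
    return width * height * overdraw;
}
```

At 1600 by 1200 with no overdraw this comes to 1,920,000 invocations, and with four layers of overdraw 7,680,000, against 10,000 vertex shader runs for the model in the example.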

Another thing to keep an eye out for is fallback.

To keep our platform fairly homogeneous, we implement everything in software.

So if something's not supported on the particular hardware you have, like if you're running on an Intel integrated part, it may fall back to the software path.

There's two parameters you can check to see, there's a separate vertex fallback and a separate fragment fallback.

And of course the software path is going to be much slower than the hardware path.

So if you just change the shader and your performance went down to like 1 or 2 frames a second, you might think, this is weird, something's wrong.

Well, the first thing to check is whether you're falling back to the software path.

And if you are, you can figure out exactly where you're falling back and then fix your app so you stay on the hardware.

Now the other thing I'm going to talk about is what's commonly referred to as a Z prepass.

A Z prepass is just drawing the depth information into a buffer.

So you can do this quickly and early in your drawing cycle.

You would draw all your objects just into the depth buffer.

So turn the color buffer off.

Depth-only writes are about two times as fast as color writes.

And since we're only interested in the depth information and constructing that information, we want to turn color writes off.

And this is done via glColorMask as we did in the occlusion query example.

And why would you want to do this?

Well typically you've seen some diagrams of the pipeline, and at the end of the pipeline you saw the fragment stage.

Well the Z test is done after the fragment stage.

This gets back to what I said earlier about rerunning the same fragment shader on stuff you can't see.

Normally, stuff you can't see is wiped out by the depth test, the Z test.

But if you already did all the expensive calculations, this doesn't help you much with performance.

If you have a pre-made Z buffer, it can allow your GPU to perform early Z optimizations and actually not do those expensive fragment operations if the resulting pixel won't actually be visible.

So it's important to do a Z prepass if you have really expensive shaders or lots of overdraw.

This can be mitigated by certain techniques such as deferred shading.

But there are things that happen even in those techniques that you can use this to take advantage of.
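To make the Z prepass benefit concrete, here is a small software model of a single pixel, written as a sketch under stated assumptions: the GPU rejects fragments early when they fail the depth test, the depth buffer is cleared to 1.0, and the shading pass uses a less-than-or-equal comparison. The function names are hypothetical and nothing here calls OpenGL.

```c
#include <assert.h>
#include <stddef.h>

/* Depth-only pass for one pixel: finds the nearest depth, shades nothing. */
float depth_prepass(const float *depths, size_t count)
{
    assert(depths != NULL);
    float nearest = 1.0f;  /* cleared depth buffer value */
    for (size_t i = 0; i < count; ++i) {
        if (depths[i] < nearest) nearest = depths[i];
    }
    return nearest;
}

/* Counts fragments that pass an early depth test and therefore get shaded.
   depth_buffer is the value already stored for this pixel; a fragment passes
   when its depth is <= the stored value, and then replaces it. */
size_t count_shaded_fragments(const float *depths, size_t count, float depth_buffer)
{
    assert(depths != NULL);
    size_t shaded = 0;
    for (size_t i = 0; i < count; ++i) {
        if (depths[i] <= depth_buffer) {
            depth_buffer = depths[i];
            ++shaded;
        }
    }
    return shaded;
}
```

With four layers submitted back to front, starting from a cleared depth buffer shades all four fragments, while starting from the prepass result shades only the visible one.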

There's also rendering techniques that need an incoming Z value.

If you were just doing normal forward render, no Z prepass, you wouldn't have a depth value of that pixel.

You could write to it, but you can't get it back.

Certain techniques, like screen space ambient occlusion, need an incoming Z value.

Now if any of you have played the game Crysis, they used this technique; they detailed it in a paper I'll cite later.

And this reads from the Z buffer of the pixel and the surrounding pixels to determine this ambient occlusive factor, which is sort of like how ambient light works.

It's just an emulation of the real ambient lighting in the real world.

So if I'm just writing to my depth buffer it's going to look something like this.

This is the terrain demo I showed you earlier.

This is just the Z pass.

You can see the closer part to me, it's going to be darker, the further away it's going to be lighter.

That's just visualizing the depth value as sort of a grayscale color.
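A grayscale view like that can be sketched as a direct mapping from depth to an 8-bit gray level. The function name is hypothetical, and the sketch assumes a depth value already in the range 0 to 1; real depth buffers are nonlinear, so an actual visualization may need to remap the values to get useful contrast.

```c
#include <assert.h>

/* Map a depth in [0, 1] to an 8-bit gray level: near is dark, far is light. */
unsigned char depth_to_gray(float depth)
{
    assert(depth == depth);  /* reject NaN */
    if (depth < 0.0f) depth = 0.0f;
    if (depth > 1.0f) depth = 1.0f;
    return (unsigned char)(depth * 255.0f + 0.5f);  /* round to nearest */
}
```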

So let's talk about extensions that are specific to the Mac.

The first one I'm going to talk about is FlushMappedBufferRange.

Normally when you modify a buffer you're going to use glMapBuffer and glUnmapBuffer.

This will actually map what's in the VRAM into your system RAM, then you can modify it or take a look at it.

It allows you to asynchronously modify that VBO, which is important if you're doing animation and you're updating things.

Now when you flush the buffer, you can choose to flush only a small amount.

Normally when you unmap it, you're going to DMA the entire buffer back up to the GPU.

So say you have about a megabyte of buffer data, you have a big complicated model, and just the vertex position is about a Meg.

Probably don't want to be pushing a Meg back up every frame if you're only modifying like 10 bytes.

So FlushMappedBufferRange allows you to only flush those 10 bytes.

This also minimizes data that needs to be copied back to system memory.

Normally, when you map a buffer the system doesn't know if you're going to read from it.

If you're going to read from the whole buffer, say the first 10 bytes and the last 10 bytes, it actually has to copy the entire thing back to your CPU memory.

This can be really slow, especially if you're constantly copying back, modifying, sending back up to the GPU, copying back down to the CPU, sending back up to the GPU.

You can imagine that takes a lot of time.

So if you're only mapping and flushing a portion of it, you can minimize the copying.

So to do this, here's some sample code.

You would set the buffer parameter to false, the flushing parameter.

Then you could do some unrelated work, your app goes along, does some things.

Then you map your buffer, you get this data pointer.

You update your buffer, and finally you say FlushMappedBufferRange with the offset that you started modifying and the number of bytes you modified.

Then later on when you're done with all your modifications you can do your unmap, start drawing with it, and go on your way.

Pretty simple.
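The sample code from the slide is not part of this transcript. As a stand-in, here is a sketch of the bookkeeping side: a small helper that remembers which bytes were touched through the mapped pointer, so the offset and byte count handed to the flush call are right. The `DirtyRange` type and its functions are hypothetical. The OpenGL calls appear only in comments, using the names from the APPLE_flush_buffer_range extension, because they need a live context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: smallest byte span covering every tracked write. */
typedef struct {
    size_t begin;  /* first modified byte */
    size_t end;    /* one past the last modified byte; begin == end is clean */
} DirtyRange;

void dirty_reset(DirtyRange *r) { r->begin = 0; r->end = 0; }

/* Copy bytes into the mapped pointer and grow the dirty span to cover them. */
void write_tracked(DirtyRange *r, unsigned char *mapped, size_t buffer_size,
                   size_t offset, const void *src, size_t length)
{
    assert(offset <= buffer_size && length <= buffer_size - offset);
    if (length == 0) return;
    memcpy(mapped + offset, src, length);
    if (r->begin == r->end) {
        r->begin = offset;
        r->end = offset + length;
    } else {
        if (offset < r->begin) r->begin = offset;
        if (offset + length > r->end) r->end = offset + length;
    }
}

size_t dirty_offset(const DirtyRange *r) { return r->begin; }
size_t dirty_length(const DirtyRange *r) { return r->end - r->begin; }

/* With a current context, the order described in the talk would be roughly:
 *   glBufferParameteriAPPLE(GL_ARRAY_BUFFER,
 *                           GL_BUFFER_FLUSHING_UNMAP_APPLE, GL_FALSE);
 *   ... unrelated work ...
 *   unsigned char *data = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
 *   write_tracked(&range, data, size, offset, src, length);
 *   glFlushMappedBufferRangeAPPLE(GL_ARRAY_BUFFER,
 *                                 dirty_offset(&range), dirty_length(&range));
 *   ... later ...
 *   glUnmapBuffer(GL_ARRAY_BUFFER);
 */
```

A single span is the simplest design, but two writes far apart would flush everything between them; in that case flushing each write separately is likely the better choice.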

This also ties into our next thing, which is glFence.

A fence lets you test when a command is done.

So you could be adding commands to the command stream, you can say like, I want to say GL bind buffer and then GL draw elements, GL draw elements, GL draw elements, then I want to say glMapBuffer, modify some things, GL FlushMappedBufferRange.

And then I'm going to say GL set fence.

And then I can see when my flush is done.

So remember, I mentioned this is asynchronous.

When I say this, it's going to tell OpenGL flush this up to the GPU via DMA.

And you don't know when the DMA is going to complete, exactly.

It's going to give you control back right away.

Later on if you want to actually draw with this, you want to make sure that the data's gotten up to the GPU.

So an easy way to check is to set a fence after your FlushMappedBufferRange.

And later on you can test to see if the fence is complete.

If the fence is complete you're guaranteed that everything before it has actually executed.

If you ever used a multithreaded engine with OpenGL, this is absolutely necessary.

Because there can be latency between the time you modify your thing, the time you say FlushMappedBufferRange, and the time you want to draw.

So if you start drawing and then you start modifying again, you don't know if anyone's actually using that buffer you modified as you draw.

But with fences you can make sure I'm not modifying it as I draw, and I'm not drawing it as I modify.

You can also use this for other multithreaded synchronization.

You may have seen multicontext and people mentioning you can upload textures on a background context.

Well, you can do this, sure.

But you also want to make sure that your texture upload is done before you start rendering with it.

If you have two contexts, you can set a fence after your texture upload and then check it, and then signal your first context, yeah, my texture upload's done, you can start drawing with this texture.

This is great for texture streaming in the background.

So this is a little bit of what a fence would look like.

I would map buffer, write some data, FlushMappedBufferRange, set my fence, and then unmap it.

And somewhere later on in my application I would call GL test fence.

It would test that same fence that I set before and tell me, yeah, we're done, we're good.

Or no, we're not done.

So you might want to wait.

Sample code is really easy.

You gen your fence, you do some work, you set your fence, and then later on, you test it.
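The slide's sample code is not in the transcript, so the following is only a model of the guarantee a fence gives, not the real API. It simulates an in-order command stream in plain C; all type and function names are hypothetical. In a real application the corresponding calls would come from the APPLE_fence extension (glGenFencesAPPLE, glSetFenceAPPLE, glTestFenceAPPLE).

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated in-order command stream. */
typedef struct {
    unsigned long submitted;  /* commands queued by the application */
    unsigned long executed;   /* commands the simulated GPU has finished */
} CommandStream;

/* A fence remembers how many commands had been submitted when it was set. */
typedef struct {
    unsigned long position;
} Fence;

void stream_submit(CommandStream *s) { ++s->submitted; }

Fence stream_set_fence(const CommandStream *s)
{
    Fence f = { s->submitted };
    return f;
}

/* Simulated GPU progress: retire up to n pending commands, in order. */
void stream_execute(CommandStream *s, unsigned long n)
{
    unsigned long pending = s->submitted - s->executed;
    s->executed += (n < pending) ? n : pending;
}

/* True once everything submitted before the fence has executed. */
bool stream_test_fence(const CommandStream *s, Fence f)
{
    assert(f.position <= s->submitted);
    return s->executed >= f.position;
}
```

The point of the model is the ordering rule: the test only reports done once every command queued ahead of the fence has finished, which is what makes it safe to reuse or draw from the buffer afterward.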

Now you might wonder if we can put this all together.

We talked about a bunch of different techniques, and we want to bring them all into one application, one demo, leveraging all this stuff.

So leveraging every technique we've talked about.

So what did we talk about.

Well, we talked about instancing.

So let's think of something that has a bunch of objects.

We talked about texture RG.

So we want to do some data storage, some form of deferred rendering would be cool.

We talked about array textures.

We had that cool terrain demo, can we bring this together into something else.

So let me show you a little something that I put together.

So this was the previous terrain.

This is just forward rendering.

Not as interesting.

Now I can switch it to deferred rendering and I can have multicolored lights flying around it.

Kind of crazy.

You can see I have these little robot guys which is from the Quest demo.

And he is actually being lit correctly.

And it's one draw call drawing all these, and they're actually all over the mountain, you just can't see them super-well.

See some of them there.

And I can throw on some animation of the ground swallowing up this guy.

This puts everything together.

This is using deferred shading to draw.

So you can see, you can use texture mapping with deferred shading.

And it's perfectly compatible with instancing as well.

And array texture.

So this is putting it all together, and it runs at a perfectly fine frame rate.

You know, we have quite a few things that are being drawn.

The textures aren't super-high resolution, but they do the job for this.

So all this sample code is available, you can go onto the web, onto the WWDC attendees site and download it, take a look at it, play around with it.

I highly urge you to try to integrate it into your apps, into your games, or if you just want to learn, take a look at it, see how we do things.

Feel free to look on the forums or send us email.

Here are some sources.

There's a lot of information on deferred shading on the Web, on NVIDIA's site, on other sites.

We have some sources from Crytek and other game companies to talk about deferred shading, the Crytek paper has information on screen space ambient occlusion, if you're interested.

If you want more information, you should e-mail Allan Schaffer; he is our Graphics and Game Technologies Evangelist.

We also have a lot of documentation that's really good on the OpenGL Dev Center, our programming guide and some other things.

Lastly, we have our developer forums.

You can go on, take a look, talk to your peers, get answers from us.

Now this is the second-to-last session, although the game sessions will be repeated tomorrow, but I urge you to go online and take a look at some of the past ones and see the presentations, the design practices, and all the ES overview and advanced rendering sessions.

Next up is Taking Advantage of Multiple GPUs and they have some cool techniques that you can use for desktops that have multiple graphics cards in them.

Thanks for coming, hope this was a little informative and enjoyable.

If you have any questions, you can come talk to me on the side.

Have a good day, and enjoy the rest of your WWDC.
