Jean-Francois Roy: Hello, everyone.
And welcome to OpenGL ES Tuning and Optimization.
This is going to be a session all about making your application as fast as possible.
So this is going to be a three-part session.
We're going to begin by showing you the new OpenGL ES Analyzer instrument.
Then my colleague, Alex, will come on stage to tell you all about tuning the graphics pipeline.
And then I will come back to give you a demo of the Analyzer instrument.
And so, without further ado, I'm Jean-Francois.
I'm an engineer on the GPU Software Developer Technologies group, and I'm here to tell you all about the new OpenGL ES Analyzer instrument.
You may have caught a glimpse of this instrument during the Developer Tools State of the Union.
This is a brand new instrument that will allow you to measure all the OpenGL ES activity in your application, get some key information out of those statistics, and then allow you to quickly troubleshoot correctness and performance problems in your application.
Now, whether you're a seasoned OpenGL developer or a newcomer, you may have been faced with something like this where the application doesn't quite render correctly.
You may even have faced the dreaded black screen of death where absolutely nothing renders.
With this new instrument, you'll be able to solve these sorts of issues in a very short amount of time.
Now, this instrument is itself a preview, so there are a few things you should be aware of.
First of all, the preview only functions on PowerVR SGX-based devices, so this is iPhone 3GS and anything that came after.
And it only functions with the new iOS 4 operating system.
So you'll need to install that on your device.
Additionally, you will need to launch your application from within Instruments to use this new instrument, meaning that you cannot both debug your application with GDB and use the Analyzer instrument at the same time.
Now, there's a lot more than that.
You can check the release notes for all the details.
And we also highly encourage you to give us feedback to improve this brand new instrument.
So, why an instrument?
Well, Instruments is this extremely powerful developer tool that we've been shipping for a number of years now.
It includes a wide variety of instruments that allow you to measure just about anything you do in your application, from disk I/O to network activity to graphics.
Not only that, but you can combine multiple instruments at the same time and correlate all the information they're gathering to get a much more comprehensive picture of what your application is doing, what its performance characteristics are, and where it's spending its time.
Additionally, Instruments provides you with powerful data mining tools, which will allow you to extract the information you need out of the sea of data that instruments gather to solve the specific problem that you're trying to solve.
Finally, Instruments for many of you will be a familiar interface, a familiar tool, so you'll be able to dig right in and start using this new instrument right away.
So, the new OpenGL ES Analyzer instrument has three main components.
The first is the Activity Monitor.
This guy traces all the OpenGL ES activity in your application and then extracts some key statistics out of that data to allow you to better understand what is going on.
The second component is the Overrides.
The Overrides allow you to do quick performance experiments by allowing you to disable or alter certain parts of the OpenGL ES pipeline to see what effect that has on how your application is performing.
Alex is going to go into much more detail on how to leverage these in combination with some knowledge of how the graphics pipeline works to really home in on where you're spending your time.
Finally, the last component, I'm actually going to keep it a secret for now; and we're going to get back to it in the second session of the Analyzer.
So Activity Monitor, Overrides; let's start with the Activity Monitor.
The first step in solving any general engineering problem is usually to get all the facts right, get all the information and really understand what is going on.
For the specific case of graphics, in a complex application such as a game or a visualization application, it may be very difficult for you to look at the source code and translate that into the OpenGL commands that your application will make at runtime.
There are many reasons for this.
A common one, for example, is a data-driven application where the commands that end up actually being produced by the application depend not entirely on the code but also on data that the application will receive.
So, without you really knowing, your application may actually be doing a lot more work than you thought it was doing.
It may also be doing work in an unexpected order.
A common case of this particular type of problem is when you're driving your rendering through some data structure, such as a tree.
If you have a subtle bug in your tree traversal algorithm, you may end up actually issuing OpenGL commands in an incorrect order.
Finally, your application may just be doing the wrong kind of work for the hardware it's running on.
For example, if you have produced assets for a game on multiple platforms, these assets may not be optimized for the graphics hardware in iPhone and iPad.
And this could severely impact your performance.
So the Activity Monitor allows you to record all the OpenGL ES activity in your application.
And then it presents you this information through four main hubs: frame statistics, API statistics, command trace, and call tree.
Let's go through all four of these now, starting with the frame statistics.
The frame statistics allow you to get an idea of your per-frame workload.
This is the amount of work your application does every single frame.
This table's also a great place to navigate the trace frame by frame.
And also know that you can constrain the amount of data this table shows by making a time selection in the instrument's timeline.
Now, let's go through a few key statistics and also some relationships between those statistics, starting with primitives and batches.
Primitives is the number of points, lines, and triangles your application rendered in any given frame.
And batches, or draw commands, is the number of OpenGL draw commands that you issued to draw those primitives.
Now, one of the key things to remember about these two is the ratio of primitives to batches.
You want to maximize this ratio as much as possible.
In other words, you want to draw as much as you can with as few draw commands as you can.
You also want to minimize the number of batches.
Just do the least amount of work possible to get the result you want.
And, finally, this is a great way of finding your most costly frame with respect to geometry simply by sorting the table by the number of primitives.
Oftentimes, the frame that renders the most geometry is going to be your most expensive frame; and you'll want to focus your optimization efforts on that frame.
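As a rough sketch of how you might reason about these two columns once you've copied the numbers out of the frame statistics table: the per-frame values and the field names below are made up for illustration and are not the instrument's own column names or export format.

```python
# Hypothetical per-frame numbers, as if copied by hand from the frame statistics table.
frames = [
    {"frame": 1, "primitives": 12000, "batches": 300},
    {"frame": 2, "primitives": 45000, "batches": 150},
    {"frame": 3, "primitives": 8000, "batches": 400},
]


def primitives_per_batch(frame):
    """Average number of primitives drawn by each draw command in a frame."""
    return frame["primitives"] / frame["batches"]


# The frame with the most geometry is usually the first one worth optimizing.
heaviest = max(frames, key=lambda f: f["primitives"])

# The frame with the lowest ratio is issuing many small draw commands.
least_efficient = min(frames, key=primitives_per_batch)

for f in frames:
    print(f["frame"], f["primitives"], f["batches"], primitives_per_batch(f))
```

Here frame 2 has the most geometry, while frame 3 draws the least per draw command, so they would call for different kinds of fixes.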
Next, we have OpenGL commands.
What you want to do here is minimize how many commands you issue per frame.
And you can look at this through two different columns: again, the number of batches, which are all OpenGL commands that draw something on the screen; and then you have all OpenGL commands, which is everything that you call in the OpenGL API.
Now, two key goals here.
Number one is to minimize how many commands you issue every frame.
So, again, do the least amount of work possible.
More importantly, you want to look at the ratio of batches to GL commands.
And you want this ratio to approach 1.
In other words, you want to do as little work that isn't drawing as possible relative to the work that is drawing.
If this ratio is small, you may benefit from doing state sorting, which will allow you to minimize the number of commands you have in between your draw commands that reconfigure the OpenGL state and improve your performance.
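To make the effect of state sorting concrete, here is a small model, not real OpenGL code. It assumes each draw needs exactly one piece of state (a hypothetical shader name) and counts one state command every time that state differs from the previous draw's.

```python
def count_state_changes(draws):
    """Count how many times the bound state has to change across a draw list."""
    changes = 0
    current = None
    for state, _mesh in draws:
        if state != current:
            changes += 1
            current = state
    return changes


def batch_ratio(draws):
    """Ratio of draw commands to all commands (draws plus state changes)."""
    return len(draws) / (len(draws) + count_state_changes(draws))


# Hypothetical draw list: (state needed, mesh drawn).
draws = [
    ("shaderA", "rock"),
    ("shaderB", "tree"),
    ("shaderA", "wall"),
    ("shaderB", "bush"),
    ("shaderA", "roof"),
]

# sorted() is stable, so draws that share a state keep their relative order.
sorted_draws = sorted(draws, key=lambda d: d[0])

print(count_state_changes(draws), batch_ratio(draws))
print(count_state_changes(sorted_draws), batch_ratio(sorted_draws))
```

In this toy case sorting drops the state changes from five to two and moves the ratio closer to 1. Reordering is only safe where draw order doesn't affect the result, so treat this as an illustration of the bookkeeping rather than a drop-in renderer change.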
Next, we have redundant state changes.
These are commands that your application issued to modify a part of the OpenGL ES state to its existing value.
So these commands will have no functional impact whatsoever.
However, they will still incur a cost in the framework and in the driver to validate the change, confirm that it's the same state, etc., etc. So these redundant state commands are not free, but they're completely wasteful work.
So you just don't want to do any of them.
This column should be full of 0s.
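One common way to get that column to 0 is to shadow the state on the application side and skip calls that wouldn't change anything. The sketch below is a minimal model of that idea; `StateCache`, the state names, and the `apply` callback are all hypothetical stand-ins, with the callback playing the role of the real OpenGL ES call.

```python
class StateCache:
    """Remembers the last value set for each piece of state and drops repeats."""

    def __init__(self, apply):
        self._apply = apply      # stand-in for the function that calls into OpenGL ES
        self._values = {}        # last value issued for each state name
        self.skipped = 0         # number of redundant changes filtered out

    def set(self, name, value):
        if name in self._values and self._values[name] == value:
            self.skipped += 1
            return False
        self._values[name] = value
        self._apply(name, value)
        return True


issued = []
cache = StateCache(lambda name, value: issued.append((name, value)))

cache.set("blend", True)
cache.set("blend", True)        # redundant: filtered out
cache.set("depth_test", True)
cache.set("blend", False)

print(issued, cache.skipped)
```

The first change to each state is always issued because the cache doesn't know the initial value. A cache like this only helps if every state change goes through it; anything that changes state behind its back leaves it stale.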
Finally, we have render passes.
The tile-based deferred renderer hardware that we have in iPhone and iPad structures your rendering through a number of render passes.
Now, these are very expensive to configure, to set up.
So you want to minimize how many render passes you have every frame.
Now, ideally, you only want one render pass.
But there are very good reasons to have more than one, particularly if you do off-screen rendering.
However, do note that some operations such as texture uploads can force the hardware to end the current render pass and begin a new one.
And so it is critical that you structure your application in a way that will minimize render passes.
And there's some things you can do that will have an impact on this.
Next we have API statistics.
The API statistics allow you to see which commands, which OpenGL commands your application used the most frequently and also which ones cost you the most.
So, starting with the cost and time, ideally, you want draw commands to dominate your application after your loading phase.
In other words, you want to be spending your time drawing, right?
However, do note that simple commands with a very low average time or average cost can still end up costing you a lot overall if you call them very, very frequently.
And a good example here shown on screen is BindBuffer, which, you know, only costs you a small amount; but, because you call it so often, it's in the top ten functions.
There are also some commands that are just expensive.
A very good example of this is all the API related to shaders.
Compiling a shader, linking a GLSL program are very expensive operations.
The key here is to do these operations once, ideally, when your application loads; or, if you need to be more dynamic, do these operations in a background thread such that they will not impact the main rendering of your application.
You can also look at API statistics from a frequency point of view.
So this is just the total number of times you've called each and every single API in OpenGL.
And the key here, again, is to make sure that you're not doing more work than you absolutely need to.
If you see many commands that alter the OpenGL state near the top of this table when you're sorting by cost, you need to ask yourself how many of these commands are redundant and whether you can eliminate them to improve your performance.
There are also a number of extensions, such as vertex array objects, that will allow you to eliminate or significantly reduce the number of times you have to call into the OpenGL API.
And so you should definitely try to take advantage of those extensions.
Finally, a note about time range filtering.
The API statistics are actually computed live to reflect the current time range selection in the instrument's timeline.
And, so, if you constrain the selection, the API statistics will only reflect the OpenGL activity for that time range.
This is great in combination with the frame statistics table to allow you to look at the API statistics only for a specific set of frames.
You can also combine this with other instruments.
These other instruments may highlight some activity that you didn't expect, such as a spike in CPU usage.
You can home in on that spike in activity based on that other instrument, then go look at the OpenGL ES activity statistics to see what OpenGL was doing at that time.
Next, we have the command trace.
This is very simply the complete list of OpenGL commands issued by your application.
Now, even a simple application is going to have tens of thousands if not hundreds of thousands of commands.
So what you want to do here is start with the other hubs of information, such as the frame statistics or the API statistics, find a particular command of interest or a particular point in time of interest, then go look at the OpenGL ES trace to see what was going on at that time: what commands came in before the command of interest, what commands came after.
Do note that every single one of those commands has a back trace so you can actually know where it was called from.
Finally, we have the call tree.
This is the standard Instruments call tree view, but focused exclusively on OpenGL commands.
Now, do note that there's a significant difference between this table and what you would get from the CPU sampler.
The CPU sampler is going to be a statistical sampling of the activity on the CPU.
The information you see here is exact.
It's exact tracing.
The amount of time, the running time is the wall clock time spent in that OpenGL function including both time spent actually running the code on the CPU and time spent by the CPU being blocked.
This is a great view for seeing which OpenGL ES commands took the most amount of time in your application but also where each of these commands was called from.
And so it's interesting because some commands will sometimes have a higher or lower cost, depending on where they were called in your application.
And this table will allow you to distinguish between those cases.
And so that is the Activity Monitor.
And moving on to the Overrides; and for that, I would like to invite my colleague, Alex Kan, on stage.
And he's going to explain how to tune the graphics pipeline, both using knowledge about how the graphics pipeline works and using the Overrides in the Analyzer instrument.
Alex Kan: I'm Alex.
I work in the iPhone GPU Software group at Apple.
And so we're going to talk about a few things.
We're going to talk about how to target your optimization at specific parts of the pipeline and how to tune your shaders.
But first we're going to talk about what goes into getting a rendered frame from your app onto the screen.
So, really, this happens in four stages.
First, the CPU translates your OpenGL commands into a set of commands for the GPU.
Next, the GPU operates on these commands.
On tile-based deferred renderers like the PowerVR MBX and SGX, this really happens in two phases.
So, in the first phase, the GPU will read and shade all the vertex data in your scene.
And next, in the second phase, it will rasterize and shade all the fragments that have been generated by the geometry in your scene and write those out to the frame buffer.
So, once that's done, then Core Animation will take your rendered frame buffer and composite it with the rest of the items in your view hierarchy.
So one thing to keep in mind when you're looking at this list of stages is that not all stages take the same amount of time.
And, in particular, you'll find that, depending on what your application actually renders, some stages will actually take significantly longer than others.
In addition, these stages all take place in different parts of the hardware.
And so what that actually means is that what will typically happen is that you'll have multiple components of the hardware working on different parts of different frames at the same time.
So what does this actually mean for you when you're optimizing your application?
It means that your frame rate is really determined by how long it takes the slowest stage in your pipeline to finish what it's doing.
And so that means that this is really the stage that you should be targeting when you're optimizing your application.
Now, the nice thing about this is that, because the faster stages in your pipeline don't necessarily affect the frame rate overall, you can sometimes add additional work to these stages without impacting your app's frame rate.
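A very simplified model of that idea follows. It assumes the stages overlap perfectly across frames, so steady-state throughput is set only by the slowest stage; the stage names and the millisecond figures are invented for illustration.

```python
def frames_per_second(stage_ms):
    """Steady-state frame rate when all stages run in parallel on different frames."""
    bottleneck_ms = max(stage_ms.values())
    return 1000.0 / bottleneck_ms


def bottleneck(stage_ms):
    """Name of the stage that takes the longest per frame."""
    return max(stage_ms, key=stage_ms.get)


# Hypothetical per-frame cost of each stage, in milliseconds.
stages = {"cpu": 8.0, "vertex": 5.0, "fragment": 20.0, "compositing": 4.0}

baseline_fps = frames_per_second(stages)

# Tripling the vertex work leaves the frame rate alone: fragment is still slowest.
more_vertex_work = dict(stages, vertex=15.0)

# Halving the fragment work is what actually raises the frame rate.
less_fragment_work = dict(stages, fragment=10.0)

print(bottleneck(stages), baseline_fps)
print(frames_per_second(more_vertex_work), frames_per_second(less_fragment_work))
```

Real hardware won't overlap this cleanly, but the model shows why work added to a faster stage can be free while work removed from a faster stage doesn't help.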
So how do you actually understand what this pipeline looks like in your app in particular?
And you do that via Instruments.
So there are a few instruments in Instruments that are useful for understanding how active your CPU and GPU are.
In particular, the CPU sampler can tell you, well, how much time your CPU is spending doing your rendering work and where it's doing it.
And the OpenGL ES Driver instrument can tell you how active both the vertex and fragment processing components of the GPU are.
And so you can take this information and use the Overrides in the OpenGL ES Analyzer to change your application's workload and see what effect this has on performance.
And so, with that, let's actually take a look at the stages of the pipeline, you know, stage by stage.
So, for the purpose of this discussion, we'll only focus on the stages that are relevant to OpenGL ES rendering, so that means the CPU and the GPU.
So let's start with the CPU.
So, first of all, you can typically identify a CPU bottleneck by noticing that the GPU is not active nearly as much as you think it would be.
And, in particular, you'll often see that, if you reduce the drawing complexity of your app by, say, changing each of your draw calls to draw basically a single triangle, you may see no impact on your frame rate whatsoever.
And what you'll also observe in situations like this is that the CPU is actually spending a lot of time in the OpenGL framework on behalf of your app, and it can actually be doing this in one of two ways.
It can be completely busy actually doing work; or it can simply be blocked, waiting for other components.
So let's take a look at the situation where the CPU is fully busy.
In this case, this usually occurs because the CPU is busy handling state changes issued by your application.
And the nice thing about that is that the Activity Monitor in the OpenGL ES Analyzer can tell you a lot about, you know, what's actually going on here and how many of your state changes are redundant.
And one thing to keep in mind when you're looking at this diagram is that not all of these state changes are equal.
So, in particular, certain types of state changes, particularly those having to do with vertex and fragment programs, can be especially expensive compared to other ones.
So what can you do about this?
A lot of this comes down to just structuring your application to minimize the number of state changes that you issue overall, and that includes redundant state changes.
So, in addition, when you're, you know, thinking about how to organize your app, you should also be thinking about where a state is stored in OpenGL.
So, in particular, there are a lot of objects that actually carry a good amount of state with them.
For example, textures have their own per-texture wrap modes and filters; and program objects have their own uniforms.
And vertex arrays can store a whole bunch of vertex submission state all, you know, in one thing.
And so the nice thing about this is that, when you bind that object, all that state comes along for the ride.
And so that's something you should be taking into consideration when you're writing your app so that you don't do a lot of redundant work just because you changed what object was bound.
So what happens if, instead, your application was simply spending a lot of time waiting?
So what this typically means is that your application is waiting for the pipeline that you saw earlier to drain, and usually this is because the CPU needs to access or modify some kind of resource that the GPU also happened to be using.
And so this can happen in a number of ways.
It can happen because you're reading back the contents of a frame buffer that you rendered to or you're modifying the contents of a texture or vertex buffer object or something like that.
And, generally, the thing that you want to do here is minimize the amount of time that the CPU spends waiting.
And how you do that is that, well, in the case of frame buffers, you might want to wait as long as possible before issuing a glReadPixels call to actually get the image data back from the GPU.
And, similarly, when you're modifying resources, you generally want to avoid modifying a resource that you've already used in the current frame.
And one thing that you can do to take this a step further, if you can afford it, is to keep multiple objects around.
And, so, if you're doing something like this, you would want to typically read back from or modify the oldest object in the pipeline.
And this can actually go a long way towards minimizing the amount of time that the CPU spends waiting.
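The rotation described here can be sketched as a small ring of resources. This is only a model of the bookkeeping: `BufferRing` and the buffer names are hypothetical, and nothing here talks to OpenGL ES or performs any actual synchronization with the GPU.

```python
from collections import deque


class BufferRing:
    """Rotates through several copies of a resource so the CPU touches the oldest one."""

    def __init__(self, names):
        self._ring = deque(names)

    def next_for_cpu(self):
        """Return the least recently used copy and move it to the back of the ring."""
        oldest = self._ring.popleft()
        self._ring.append(oldest)
        return oldest


# Three hypothetical copies of the same vertex buffer.
ring = BufferRing(["vbo0", "vbo1", "vbo2"])

# Each frame the CPU modifies the copy the GPU is most likely to be finished with.
order = [ring.next_for_cpu() for _ in range(5)]
print(order)
```

With three copies, the buffer the CPU writes in a given frame was last used for drawing three frames earlier, which gives the pipeline time to drain without the CPU blocking. The cost is the extra memory for the additional copies.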
Okay. So that was generally CPU bottlenecks.
Most of it comes down to exactly how you've structured your app.
And, for that, you know, generally there's a lot that you can get from the OpenGL Essential Design Practices session, and that's mostly reducing state changes and avoiding situations where the CPU has to wait for the GPU.
And, you know, Instruments can tell you a lot as Jeff showed, like about situations in which this is happening and, you know, why.
So now let's talk about the GPU.
And before we actually talk about the vertex and fragment stages, I'd like to, you know, introduce a metric that's generally useful for thinking about what the GPU is doing.
And that's that the amount of time that the GPU spends active is really a function of two factors.
The first factor is really the number of elements that you're sending to it, like vertices or fragments.
And the second is really the cost of processing each individual element.
And this, too, can be broken down into a few factors.
You can think of it as an interaction of both what it takes to fetch the data that the GPU needs to operate on and to write it back out when it's done, and what it takes for the GPU to actually perform the calculations on these elements.
So let's take a look at how this applies to vertex processing.
So, typically, if you have an application that's bottlenecked by vertex processing, what you'll look at in the OpenGL ES Driver instrument is the tiler utilization percentage; if you were here for the Shading and Rendering Techniques session, you should have seen an example of this.
You'll see the tiler utilization is near 100 percent.
And so, as I just mentioned, the workload size is really something that you control directly.
It's the number of vertices that you send through OpenGL.
And so what you'll see in this case is that you can often make the frame rate increase simply by using simpler models or by sending less vertex data through the system.
Now, let's talk a little bit about the cost per vertex.
And so the two factors here are really fetching of the vertex attributes that you specified for your data; and the actual computation, which is, you know, the shading, transformation, lighting that you may be applying to the vertex.
And, in this case, you can distinguish between vertex processing cases that are bound by data versus computation by simplifying what you're doing in your vertex shader to see if that increases the performance.
So, first, let's actually take a look at the data case.
So there are generally a few things that you can do to improve performance of an application that's bound by vertex fetching rather than computation.
And the first thing, which is actually the most important thing, is that, on PowerVR SGX in particular, you should be using vertex buffer objects for all your vertex data.
Now, the nice thing about this is that it allows the GPU to fetch the vertex data directly without, you know, involving the CPU in it, which can significantly speed things up.
When you're using vertex buffer objects, you should definitely keep in mind how you're actually using your vertex data and hint that to OpenGL appropriately.
Now, what OpenGL has is something called a usage hint for vertex buffer objects; and that specifies both how often you intend to modify the contents of that vertex buffer object, as well as how often you intend to actually render from it.
And, so, in OpenGL ES 2.0, there are three possible things that you can specify: GL_STATIC_DRAW, GL_DYNAMIC_DRAW, and GL_STREAM_DRAW.
And so we recommend that you pick the one that matches what you're trying to do.
In addition, we highly recommend that you use indexed draw calls wherever possible.
So this means using an index buffer and using glDrawElements instead of glDrawArrays.
And what's good about this is that it typically allows the GPU to benefit from any reuse of vertices that you have in your data which, you know, in turn, translates into more efficient fetching.
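To show where that reuse comes from, here is a small sketch that turns a plain triangle list into a vertex list plus an index list. It only prepares the data; the helper name is hypothetical, and uploading the result and calling glDrawElements is left out.

```python
def build_index_buffer(triangle_vertices):
    """Deduplicate vertices and return (unique_vertices, indices)."""
    unique = []
    lookup = {}      # vertex -> position in the unique list
    indices = []
    for vertex in triangle_vertices:
        if vertex not in lookup:
            lookup[vertex] = len(unique)
            unique.append(vertex)
        indices.append(lookup[vertex])
    return unique, indices


# A quad drawn as two triangles: six vertices, two of them repeated.
quad = [(0, 0), (1, 0), (1, 1),
        (0, 0), (1, 1), (0, 1)]

vertices, indices = build_index_buffer(quad)
print(vertices)
print(indices)
```

The six-vertex list shrinks to four unique vertices and six small indices. On real meshes, where most vertices are shared by several triangles, the savings are usually larger than in this toy quad.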
Another thing that you can do to improve the efficiency of vertex fetching is to take all your vertex attributes, and instead of having them in separate arrays, you can pack them all together into a single interleaved array.
And, for best performance when you're doing this, you should have all your attributes, both the pointers and the strides, aligned to 4-byte boundaries.
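As an illustration of an interleaved layout that meets the 4-byte rule, the sketch below packs a position, a texture coordinate, and a color into one record per vertex. The particular attribute set and the use of Python's `struct` module to build the bytes are assumptions made for the example.

```python
import struct

# One vertex: 3 floats position, 2 floats texcoord, 4 unsigned bytes color.
# "<" means little-endian with no implicit padding.
VERTEX_FORMAT = "<3f2f4B"
STRIDE = struct.calcsize(VERTEX_FORMAT)

# Byte offset of each attribute inside one interleaved vertex record.
OFFSETS = {
    "position": 0,
    "texcoord": struct.calcsize("<3f"),
    "color": struct.calcsize("<3f2f"),
}


def interleave(positions, texcoords, colors):
    """Pack separate attribute arrays into a single interleaved byte string."""
    data = bytearray()
    for p, t, c in zip(positions, texcoords, colors):
        data += struct.pack(VERTEX_FORMAT, *p, *t, *c)
    return bytes(data)


data = interleave(
    positions=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    texcoords=[(0.0, 0.0), (1.0, 0.0)],
    colors=[(255, 0, 0, 255), (0, 255, 0, 255)],
)
print(STRIDE, OFFSETS, len(data))
```

The stride comes out to 24 bytes and every offset is a multiple of 4. Storing the color as three bytes instead of four would make the stride 23 and break the alignment, which is why the fourth byte is kept even if you don't need alpha.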
So I haven't talked about computation yet.
We'll actually talk about that in the shader tuning section.
But just to remind you, for vertex processing, there's really just a few things that you should do.
You should take advantage of vertex buffer objects, vertex array objects, and you should generally structure your vertex data so that the GPU can fetch it as efficiently as possible.
Okay. Now let's talk a bit about fragment processing.
A fragment processing bound application is typically identified by a renderer utilization percentage that's near 100 percent.
So how do these two factors, the workload size and the cost per element, map to fragment processing?
So how things stack up in fragment processing is that the workload size is determined by the number of visible fragments.
And this is, you know, slightly less straightforward than the number of vertices that you're sending through the system.
But let's see.
One thing that you could do generally to identify fragment processing bottlenecks is to use the "minimize number of pixels rendered" override that is in the OpenGL ES Analyzer.
What this will actually do is restrict all your rendering to a 1 by 1 box at the corner of your frame buffer.
So, now, in terms of cost per fragment, data fetching is determined by what it takes to fetch both frame buffer and texture data for your fragments and to write frame buffer data back out.
Computation is determined by the contents of your fragment shader.
And so this is one of the things that we'll also talk about in the shader tuning section of this talk.
So let's take a closer look at the number of visible fragments since, you know, that's the interaction of a few factors.
In particular, let's talk about hidden surface removal.
On tile-based deferred renderers, such as the PowerVR SGX, there's a hidden surface removal mechanism that typically works by looking at groups of opaque geometry and determining which of these things are actually covered up by other opaque geometry.
So what that implies is that, when you're drawing in your application, you can generally get the best efficiency out of this mechanism by taking all your opaque objects and drawing them together at the start of your scene.
And, to follow onto that, once you've drawn all your opaque objects, what we typically recommend is that you draw all alpha-tested objects or objects using the discard keyword in GLSL ES after that, and only after all of that is done do you draw your alpha blended objects.
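The recommended ordering can be sketched as a simple sort of the draw list. The object names and the `kind` labels below are hypothetical; a real renderer would derive the category from each object's material or shader.

```python
# Lower number draws first: opaque, then alpha-tested or discard, then blended.
DRAW_ORDER = {"opaque": 0, "alpha_tested": 1, "blended": 2}


def order_for_hidden_surface_removal(objects):
    """Group objects by category while keeping the original order inside each group."""
    return sorted(objects, key=lambda obj: DRAW_ORDER[obj["kind"]])


scene = [
    {"name": "smoke", "kind": "blended"},
    {"name": "terrain", "kind": "opaque"},
    {"name": "fence", "kind": "alpha_tested"},
    {"name": "building", "kind": "opaque"},
    {"name": "glass", "kind": "blended"},
]

ordered = order_for_hidden_surface_removal(scene)
print([obj["name"] for obj in ordered])
```

Because the sort is stable, blended objects keep whatever relative order the application already gave them, which matters if that order was chosen for correct blending.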
So let's take a look now at the last two object types that I mentioned.
So, in particular, one thing that we want you to keep in mind when you're thinking about fragment processing is that, really, that processing work happens regardless of what result it actually has on the frame buffers.
So, in particular, for quads that may have a lot of transparent area in them, that work is really still happening.
And, in particular, this can be even worse if you have alpha testing enabled, as this can be particularly expensive on GPUs like the PowerVR SGX.
And, when you think about this, if you have a lot of layers that are drawn like this, the amount of wasted work can really accumulate as the number of layers increases.
So what can you do about this?
One thing that you can do is you can just try to reduce the amount of wasted area by simply trimming it out with your geometry.
And so the simplest way is to really just restrict your quads so that they bound the interesting parts of your sprites or whatever assets as closely as possible.
And, you know, you can go a step further and actually pick geometry that's custom tailored for what you're trying to draw.
So I mentioned earlier in the talk that you can often add work to other parts of the pipeline without impacting your frame rate overall.
And this is really an example of that.
This is an example of trading extra vertex processing to reduce the load on the fragment processing part of the pipeline.
So now let's look at the other side of the equation, the cost per fragment.
So, as before, there are always bandwidth and computation factors at play when determining the cost of a fragment.
And you can determine which of these two things is really impacting you, also using the Analyzer.
So there are two Overrides that the Analyzer provides for these situations.
One of them is called "minimize utilized texture bandwidth", which can greatly reduce any bottlenecks that you might be seeing on the bandwidth side.
And, similarly, you have "simplified fragment shader processing", which will replace whatever fragment shaders you have in your app with a trivial fragment shader that draws a single color.
And so, by completely removing the workload on one side or the other of this particular equation, you can determine which of these was actually your bottleneck.
So let's take a look at what you can do for a bandwidth bound app.
So a simple thing that you can generally do in almost every app is minimize the amount of write-out that's being done for frame buffer data.
And, generally, you know, we find that a lot of apps don't need the contents of their buffers from frame to frame.
And, generally, what you can do in this situation is you can issue a full screen clear of all the buffers at the start of your frame.
So that's, you know, color, depth, stencil, or whatever you have.
And, then, at the end of the frame, typically you only need color, especially if you're not reusing those other buffers anyway.
So what you can do in this situation is you discard all of the other buffers at the end of the frame.
And, as you can see, this can greatly reduce the amount of writes that the GPU does to memory.
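Since the slide isn't reproduced here, a back-of-the-envelope calculation gives a feel for the savings. The resolution and the buffer formats below are assumptions chosen for the example, and the model only counts the bytes written back at the end of a frame; actual memory traffic depends on the hardware and driver.

```python
def writeback_bytes(width, height, bytes_per_pixel, kept):
    """Bytes written to memory at the end of a frame for the buffers that are kept."""
    pixels = width * height
    return sum(pixels * bytes_per_pixel[name] for name in kept)


# Assumed configuration: 480x320 target, 32-bit color, 32-bit packed depth/stencil.
BYTES_PER_PIXEL = {"color": 4, "depth_stencil": 4}

keep_everything = writeback_bytes(480, 320, BYTES_PER_PIXEL, ["color", "depth_stencil"])
keep_color_only = writeback_bytes(480, 320, BYTES_PER_PIXEL, ["color"])
saved_fraction = 1 - keep_color_only / keep_everything

print(keep_everything, keep_color_only, saved_fraction)
```

Under these assumptions, discarding everything except color halves the end-of-frame write-out. Clearing at the start of the frame plays the complementary role on the read side, since the previous contents don't have to be loaded.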
Now, this becomes even more important once you start using multisampled rendering because the color and depth buffers in the multisample frame buffer can be so large.
So the biggest thing about this is that, with proper usage of both clearing and discarding, you can typically allow the GPU to resolve the contents of the multisample frame buffer directly to memory without any extra traffic.
So what can you do on the texture side?
Usually this comes down to using the smallest texture format that's suitable for whatever your assets are.
So the first thing that we typically suggest that you try, particularly if you have assets that are photographic in nature, is the PVR texture compression format.
So a lot of this comes down to actually compressing your images with that format and seeing if they look, you know, good enough for what you're trying to do.
And if they don't, there are still a number of other formats that you should be considering: single channel, luminance, and alpha formats or a number of 16-bit RGB and RGBA formats.
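As a rough illustration of why the format choice matters, the following Python sketch compares the memory footprint of a single 512x512 texture level in several formats. The format labels are names used only in this sketch (they are not API enums), and the calculation ignores mipmaps and the minimum block sizes that compressed formats impose on very small textures.

```python
# Bits per pixel for a few texture formats mentioned above.
# The keys are labels for this sketch, not OpenGL ES enum names.
BITS_PER_PIXEL = {
    "RGBA8888": 32,
    "RGB565": 16,
    "RGBA4444": 16,
    "LUMINANCE8": 8,
    "PVRTC_4BPP": 4,
    "PVRTC_2BPP": 2,
}


def texture_bytes(width, height, fmt):
    """Size in bytes of one texture level in the given format."""
    return width * height * BITS_PER_PIXEL[fmt] // 8


sizes = {fmt: texture_bytes(512, 512, fmt) for fmt in BITS_PER_PIXEL}

for fmt, size in sizes.items():
    print(f"{fmt:>11}: {size // 1024} KiB")
```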
In addition to choosing the right depth for your textures, you should also be sizing them, you know, based on how big they will actually appear on the screen.
Now, if you have a texture that will actually appear at a number of different scale factors, what you should generally be doing is generating mipmaps for these textures and using mipmapping when you actually render from this, as this basically allows the GPU to pick the right size.
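The cost of a full mipmap chain can be estimated with a short Python sketch. It assumes the usual rule that each level halves the previous one, rounding down, until it reaches 1x1; the 256x256 base size is just an example.

```python
def mip_chain(width, height):
    """Return the (width, height) of every mipmap level, from the base level down to 1x1."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width, height = max(1, width // 2), max(1, height // 2)
        levels.append((width, height))
    return levels


chain = mip_chain(256, 256)
base_texels = 256 * 256
total_texels = sum(w * h for w, h in chain)
overhead = total_texels / base_texels

print(f"{len(chain)} levels, {overhead:.3f}x the texels of the base level")
```

The whole chain costs about one third more memory than the base level alone, which is usually a small price for letting the GPU sample from a level that matches the on-screen size.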
So before we talk about shader tuning and fragment and vertex computation, I'd like to just quickly recap what we talked about for fragment processing.
So what you generally want to do is you want to minimize the number of actual fragments that the GPU thinks it has to work on, so that means minimizing the amount of screen area that goes in and giving the GPU the best chance that it can to remove all hidden surfaces in your scene.
And, similarly, you just want to minimize the amount of external bandwidth consumed for both frame buffer data and for textures.
So now let's talk a little bit about shader tuning.
So we're going to focus on a few topics that we think are generally relevant to writing high-performance shaders on this platform.
So we'll cover precision qualifiers in GLSL ES.
We'll talk about how to structure your shaders to minimize the amount of computation that's performed.
And we'll talk about dependent texture reads and why you should care about them.
So, first, precision qualifiers.
If you've come from the desktop and you've written GLSL shaders there, this will be something that's slightly new to you.
So what these are, are hints that you can give to the compiler regarding the precision of every variable in your shader.
And one thing that's interesting about this is that, if you have a varying variable that appears in both the vertex and fragment shader stages, you can specify a different precision for each.
Now, we emphasize these precisions because choosing smaller precisions can sometimes increase performance; and, in general, picking the right precisions for the variables in your shaders can often make the difference between an application that runs fast enough and one that's just not quite there.
So, with that, what precisions are available in GLSL ES?
First, we have highp.
This is typically a single precision floating point format.
And what we find that this is generally good for are things like position and texture coordinate transformation, like what you see in the shader snippet below.
So this shader snippet, you know, applies a model view and a projection transform to a vertex position and also uses that vertex position to generate a texture coordinate from something like a top-down light map or some other world space texture.
Next, we have mediump.
On this platform, this is a half precision floating point format.
This can sometimes give an increase in computation throughput, but that comes with a decrease in both range and precision.
But what we find is that this is often good enough for things like lighting calculations like what you see in the shader snippet.
This can also be good for storing texture coordinates that come out of your shaders; and, in particular, we recommend this if you're dealing with small textures and textures that don't use a whole lot of wrapping or perspective, you know, to ensure that you don't run into any precision issues.
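To see why small textures are the safer case for mediump texture coordinates, here is a minimal Python sketch that uses IEEE 754 half precision (via the standard `struct` module) as a stand-in for mediump. The exact hardware format may differ, so treat this as an approximation; the texture widths and texel indices are arbitrary examples.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 half-precision value (a stand-in for mediump)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def texel_center(texel, texture_width):
    """Normalized coordinate of the center of a texel."""
    return (texel + 0.5) / texture_width


small = texel_center(60, 64)      # coordinate into a 64-texel-wide texture
large = texel_center(4000, 4096)  # coordinate into a 4096-texel-wide texture

# Rounding error, expressed in texels of the texture being addressed.
small_error_in_texels = abs(to_half(small) - small) * 64
large_error_in_texels = abs(to_half(large) - large) * 4096

print(small_error_in_texels, large_error_in_texels)
```

In this model the coordinate into the small texture survives the conversion exactly, while the coordinate into the large texture is off by a visible fraction of a texel.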
Finally, we have lowp.
This is a much more restricted precision than the other two.
I mean, it only covers a range from negative 2 to 2; and it does so with 8-bit fractional precision.
But what's nice about this particular format is that that's enough range for things like texture samples, colors, normal data, other factors that you would use to mix between colors, and so on.
And so what that means is that this is really a precision that you want to be using a lot in your fragment shaders.
But, when you're using this precision, there's a few things that you'll generally want to keep in mind.
You want to stick to 3- and 4-component vectors where possible, and you generally don't want to swizzle the components of these vectors if you don't have to.
So, in this particular shader example, you know, what we're doing in this fragment shader is we're sampling a texture value.
And we're just modulating with another lowp color that we've passed through from the vertex shader.
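The lowp range and precision described above can be modeled in a few lines of Python. This is only a sketch: it assumes 8 fractional bits and clamps out-of-range values, whereas a real implementation is free to use more precision and is not required to clamp. The texel and tint values are made up.

```python
LOWP_STEP = 1.0 / 256.0     # 8 bits of fractional precision
LOWP_MIN = -2.0
LOWP_MAX = 2.0 - LOWP_STEP  # largest value in this model


def to_lowp(x):
    """Quantize to 8 fractional bits and clamp to the modeled lowp range (clamping is an assumption)."""
    quantized = round(x / LOWP_STEP) * LOWP_STEP
    return min(LOWP_MAX, max(LOWP_MIN, quantized))


def modulate(sample, color):
    """Component-wise product of two lowp RGBA values, stored back as lowp."""
    return [to_lowp(to_lowp(s) * to_lowp(c)) for s, c in zip(sample, color)]


texel = [1.0, 0.5, 0.25, 1.0]  # hypothetical texture sample
tint = [0.5, 0.5, 1.0, 1.0]    # hypothetical color passed from the vertex shader
result = modulate(texel, tint)

print(result)
```

Colors and texture samples in the 0 to 1 range fit comfortably, which is why this precision suits that kind of fragment shader work.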
So now that we've looked at all the different precisions that are available, what things do you have to keep in mind when you're actually choosing these precisions?
And the first thing is really that these precision hints are exactly that.
They're minimums that the compiler must respect when it's compiling your shader.
But the compiler is actually free to use more precision than what you've asked for.
And so, with that in mind, you generally want to pick precisions that actually, you know, make sense given the range or whatever of the computations that you're trying to perform.
And on that note, you should generally be querying the implementation to see exactly what precisions and ranges are available for those particular qualifiers in your specific GPU.
Additionally, when you're picking precisions, you also want to avoid introducing situations where the compiler has to convert variables between different precisions.
And this is actually even more important when you're dealing with lowp, because you generally want computations to stay in lowp once they're there.
[ Pause ]
So now let's talk about expressing your computations efficiently.
And we'll talk about both how and where.
But, actually, let's talk about where first.
So consider the case of like some model that you're rendering.
You know, there are really three places that you can express computations.
So let's look at uniforms first.
So, you know, in this situation, you basically calculate your uniforms on the CPU.
And then you just pass them in and then this value just gets used, well, everywhere in your shader.
And so, really, that computation happens once.
So now let's consider the case where you want to do a calculation in your vertex shader.
So, really, that computation happens once for every invocation of the vertex shader.
So that's every vertex.
And for, you know, some kind of model for a character that you might have, that, you know, amounts to maybe a few thousand times that this calculation occurs.
Now, if, instead, you choose to do the calculation with the fragment shader, that calculation now has to run once for every single fragment that's generated by this model that you might be drawing.
And so, if you think of it in terms of screen pixels, that can actually be a lot of times that this particular computation happens.
So what's the takeaway message from this?
You generally want to do your calculations as early as possible.
If things are constant, you want to express them as uniforms.
And, generally, this can do a lot to minimize the number of times that a particular operation happens.
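A quick way to compare the three options is to count how often the computation runs. The Python sketch below does that for a hypothetical character model; the vertex count and the covered screen area are invented for illustration.

```python
def invocations(stage, draw_calls, vertices_per_draw, fragments_per_draw):
    """How many times a computation runs, depending on where it is expressed."""
    per_draw = {
        "uniform": 1,                    # computed once on the CPU and passed in
        "vertex": vertices_per_draw,     # once per vertex shader invocation
        "fragment": fragments_per_draw,  # once per generated fragment
    }
    return draw_calls * per_draw[stage]


# Hypothetical model: 3,000 vertices covering roughly 200x300 pixels on screen.
counts = {
    stage: invocations(stage, 1, 3000, 200 * 300)
    for stage in ("uniform", "vertex", "fragment")
}

for stage, count in counts.items():
    print(f"{stage:>8}: {count}")
```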
So now let's talk a little bit more about the how of efficient computation.
So what you want to do in this case is to keep in mind that this GPU is really a scalar GPU.
And so you only want to operate on the elements of your variables that you actually need.
And so, for example, let's take a look at the way this attenuation factor is calculated in this shader.
As you can see, we actually operate on the elements individually to avoid extra work that might have been done on the X component of attenuation factor or the Y component of the attenuation factor, for that matter.
Similarly, when you're operating on a mix of scalar and vector variables, you generally want to keep all your scalar variables together in your computations.
And so this avoids situations where a scalar operation has to be applied to every single element of a vector.
So consider the way the direction factor and the attenuation are applied to this lighting calculation.
So the attenuation and ndotl are both scalar quantities.
And so, by performing these calculations together, you can do this division once instead of once for every single element or every single component of the color.
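The effect of grouping scalars can be checked by counting scalar operations. The Python sketch below is a hypothetical stand-in for the lighting expression on the slide, which is not reproduced in this transcript: it scales an RGB color by `ndotl` and divides by `attenuation`, once without grouping and once with the scalars combined first.

```python
class OpCounter:
    """Counts scalar multiplies and divides, the way a scalar GPU would execute them."""

    def __init__(self):
        self.ops = 0

    def mul(self, a, b):
        self.ops += 1
        return a * b

    def div(self, a, b):
        self.ops += 1
        return a / b


def lit_color_ungrouped(counter, color, ndotl, attenuation):
    # (color * ndotl) / attenuation: both operations touch every component.
    scaled = [counter.mul(c, ndotl) for c in color]
    return [counter.div(c, attenuation) for c in scaled]


def lit_color_grouped(counter, color, ndotl, attenuation):
    # color * (ndotl / attenuation): one scalar divide, then one vector multiply.
    factor = counter.div(ndotl, attenuation)
    return [counter.mul(c, factor) for c in color]


color = [0.5, 0.25, 1.0]  # made-up RGB value
ungrouped_counter, grouped_counter = OpCounter(), OpCounter()
ungrouped = lit_color_ungrouped(ungrouped_counter, color, 0.5, 2.0)
grouped = lit_color_grouped(grouped_counter, color, 0.5, 2.0)

print(ungrouped_counter.ops, grouped_counter.ops)
```

Both versions produce the same color, but the grouped one performs the division once instead of once per component.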
So now let's talk about another feature of GLSL, which are the built-in functions.
Now, these implement a lot of functionality that's generally useful for a lot of shader writers.
And there's actually another perk to using GLSL built-ins, which is that they give the compiler leeway to express these calculations in the way that makes the most sense for the hardware.
So let's take a look at this particular example which basically blends between two colors based on a third factor.
And so there are a number of ways that you could write this.
You could try to express this particular interpolation yourself in, you know, a few different ways.
Or you could simply use the GLSL mix function and let the compiler figure out what's best.
And so this is something that we highly recommend; because, well, you know, you don't require any hardware-specific knowledge of, you know, what makes the most sense.
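For reference, GLSL defines `mix(x, y, a)` as `x * (1 - a) + y * a`. The Python sketch below mirrors that definition next to one hand-written alternative to show that they compute the same blend; the helper names and color values are made up for this example.

```python
def mix(x, y, a):
    """Linear blend with the same definition as GLSL's mix: x * (1 - a) + y * a."""
    return x * (1.0 - a) + y * a


def lerp_alt(x, y, a):
    """An equivalent hand-written form of the same interpolation."""
    return x + (y - x) * a


def mix_color(c0, c1, a):
    """Apply mix to each component of two colors."""
    return [mix(x, y, a) for x, y in zip(c0, c1)]


red = [1.0, 0.0, 0.0]
blue = [0.0, 0.0, 1.0]
blended = mix_color(red, blue, 0.25)

print(blended)
```

In a shader, writing the built-in leaves the choice between these forms to the compiler.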
So now let's talk about dependent texture reads.
Some of you may already be familiar with this term.
What it generally is, is texture samples that use texture coordinates that have been calculated in the fragment shader.
Now, this is as opposed to nondependent texture reads, which use texture coordinates that have been passed directly from the vertex shader in varyings.
So this can actually happen in a number of different ways, some of which are more obvious than others.
So, first, let's consider the straightforward way in which a texture read can become dependent.
And that's basically when you modify that texture coordinate explicitly in the fragment shader.
Now, I've provided two examples here; because I specifically want to point out the second example.
Now, this example applies a constant bias to the texture coordinate.
So, in a situation like this, this is really a calculation that you could have done in the vertex shader instead.
And if you think about what happens if you take this particular calculation and move it to the vertex shader, you know, what occurs is that, well, one, this calculation happens a lot less because it's now occurring per vertex instead of per fragment.
And you've also turned this back into a nondependent texture read.
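Moving a constant bias to the vertex shader gives the same result because varyings are interpolated as a weighted sum of the per-vertex values, with weights that sum to 1. The Python sketch below checks this for one fragment; the texture coordinates, weights, and bias are made-up values.

```python
def interpolate(vertex_values, weights):
    """Weighted sum of per-vertex 2D values; the weights are assumed to sum to 1."""
    return tuple(
        sum(w * v[i] for v, w in zip(vertex_values, weights)) for i in range(2)
    )


def add(uv, bias):
    """Offset a 2D texture coordinate by a constant bias."""
    return (uv[0] + bias[0], uv[1] + bias[1])


uvs = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]  # texture coordinates at three vertices
weights = [0.25, 0.25, 0.5]                 # interpolation weights for one fragment
bias = (0.125, 0.25)                        # constant offset

# Bias applied per fragment, after interpolation (a dependent read in a shader).
per_fragment = add(interpolate(uvs, weights), bias)
# Bias applied per vertex, before interpolation (a nondependent read in a shader).
per_vertex = interpolate([add(uv, bias) for uv in uvs], weights)

print(per_fragment, per_vertex)
```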
So there are a few other slightly less obvious ways in which this might occur.
And, in particular, the ones that I'm about to point out are somewhat specific to the PowerVR SGX.
And, in particular, this is when you use texture samples that are projected or use an LOD bias or specifically select the LOD.
And so this generally happens when you use the special texture2D sampling functions in GLSL.
So I pointed out all these particular situations.
And why do they matter?
Generally, what you'll find is that doing nondependent texture reads is, well, faster than doing dependent texture reads.
And this happens for a number of reasons, the first of which is that this generally costs fewer shader cycles; and the second is really that doing nondependent texture reads allows the GPU to better take advantage of parallelism in the fragment shader.
So let's take a step back and look at what we've covered in shader tuning.
There really are a few messages that we want you to take away from this.
And, really, the first one is to choose precisions carefully.
You want to pick them based on what you're actually doing with a particular variable.
Second, we want to make sure that you're looking at where you're actually putting your calculations, like which stage of the pipeline you're putting your calculations; and, generally, that you're expressing them in a way that's as efficient for the compiler as possible.
So let's take another step back and just look at optimization in general.
And here, the biggest thing that you should take away is that you really want to be spending your time tuning the slowest stage in your pipeline.
And this is something that the tools can really help you with.
The tools can do a great job of pointing you at exactly which pipeline stage is the slowest.
And, with that, in turn, you should not be afraid to do more work in other stages if you think that can actually help alleviate your bottlenecks.
And the second thing is, of course, to do less work if you can.
Or you want to also give the GPU the ability to do less work by taking advantage of its hidden surface removal abilities.
And you want to really give it less work by giving it smaller data types to work on and just minimizing the amount of computation you perform overall.
So that was probably a lot of guidance.
And, you know, one thing that you might be wondering is, well, how do I know exactly which of these things applies to me?
And, for that, I'd like to bring Jeff back on to talk about the last feature of the OpenGL ES Analyzer.
Jean-Francois Roy: The last feature of the OpenGL ES Analyzer is the OpenGL ES Expert.
The OpenGL ES Expert is an expert system that has comprehensive knowledge of the OpenGL ES API, of our implementation of OpenGL ES, and of our hardware.
It will allow you to easily find problems in your application by finding those problems for you and will also help you fix these problems by providing you with actionable recommendations on how to address each one of those problems.
Here are some categories of problems that the Expert knows about: redundant state changes; invalid frame buffer and texture configurations; invalid OpenGL operations; suboptimal vertex formats, layouts, and storage (this also applies to textures); suboptimal operation order; and, finally, some hardware-specific performance conditions that would otherwise be very difficult or impossible for you to know about.
Rather than talk about each one of those categories in detail, let me give you a demo of the OpenGL ES Analyzer.
So we have an application here that is trying to draw pictures but, well, isn't quite doing so correctly.
And if I bring in a frames-per-second counter, it shows about 33 frames per second.
So this is not bad.
But this is a very, very simple application.
So we actually expect a lot more performance than this.
So let's see if we can use the OpenGL ES Analyzer to figure out why this application is rendering incorrectly and why it's so slow.
And the easiest way to get access to the OpenGL ES Analyzer is to use this new OpenGL ES Analysis Template.
This will get you both the Driver instruments and also the Analyzer instrument.
Now, like most instruments, the OpenGL ES Analyzer has some amount of configuration.
Specifically, you can toggle which frame statistics you're interested in.
And so now let's select our application and see what we can find out.
So immediately we see this red thing that looks very ominous.
So we probably have already found one of our two critical problems.
So let me just stop recording right away and start taking a look at what we have.
So the first thing I'd like to highlight is this red flag in the instrument's timeline.
Any issue that's considered by the Expert to be an absolutely critical issue, either for correctness or performance, is going to be highlighted right up there.
And this generally should be the first thing you start looking at.
In this case, the Expert is telling us that we're performing unoptimized multisampling resolves.
This is also what you can see in the list of recommendations or problems that the Expert has found.
You'll also notice that the problems are sorted by severity.
So the general guideline here is to go down the list top to bottom.
These are the broad categories of problems.
So once you've decided on what kind of issue you want to work on, you can click on the little arrow to focus on this category.
And this will bring you to a table that will show you every single unique instance of this particular problem.
These are unique by the back trace, meaning that, if you're doing something or your application is doing something incorrect every frame, you'll only get one entry for that particular problem.
And the occurrence count is the number of times it has actually occurred.
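The grouping described here can be pictured as counting findings keyed by problem and backtrace. The Python sketch below is only an illustration of that idea, not how the instrument is implemented; the file names are hypothetical, and each backtrace is reduced to a single source location for brevity.

```python
from collections import Counter

# Hypothetical recorded findings: one (problem, backtrace) pair per occurrence.
# File names and the second mipmap location are invented for this sketch.
findings = (
    [("suboptimal MSAA resolve", "Renderer.m:690")] * 3
    + [("mipmapping without complete mipchain", "Textures.m:933")] * 2
    + [("mipmapping without complete mipchain", "Textures.m:951")]
)

# One table row per unique (problem, backtrace), with its occurrence count.
occurrences = Counter(findings)

for (problem, location), count in occurrences.most_common():
    print(f"{count:>3}  {problem}  [{location}]")
```

A problem that repeats every frame from the same call site collapses into one row whose count tracks the number of frames recorded.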
So what we're seeing here is that we're doing a suboptimal MSAA resolve.
And the occurrence actually matches the number of frames we've recorded, so we're doing this every frame.
Now, how do we solve this problem?
Well, to get more information, we can bring up the instrument's extended detail view by using this icon here in the tool bar.
And we can immediately see that the OpenGL ES Expert is giving us a recommendation on how to address this problem.
So here it's telling us that we've performed an expensive, suboptimal multisampling resolve, and that this is typically caused by failing to use discard frame buffer after resolve multisample frame buffer.
It's also pointing us at documentation here, the specifications for an extension to get more information on this problem.
Now, if you've attended the OpenGL ES Overview session, you will know that this is a critical part of getting good performance out of multisampling.
So we really want to adopt the discard extension here to improve our performance.
Now, where should we change our code?
Well, like many things, Instruments provides you with a stack trace.
And so we know exactly where in our application this problem should be fixed.
In this case, it's in some rendering class on line 690.
So let's actually go fix our problem.
Going to bring up Xcode 4.
I conveniently happened to be on the source line.
And we, indeed, can see that we have the resolve multisample frame buffer API call here and no discard.
So that's bad.
Let's bring in our discard; I'm just going to bring over the discard command from the snippets area in Xcode 4.
And this is all you need.
This is two lines of code to get rid of the multisampled color attachment and depth attachment, since we're never going to use them again; and call discard.
And so I'm pretty confident that this will fix our performance problem.
So, all right.
That was fast.
But we also have black textures, and that's not cool.
So let's try to see if we can find that problem.
So we're going to go back to the categories and look at what else the Expert is telling us.
And, hmm; there's this mipmapping without complete mipchain thing here.
I'm not an expert on OpenGL, but I know that mipmapping has something to do with textures.
So, you know, let's take a look at this.
It's the next item in the list, right?
So we're going to focus on this.
And here we can see that there's two unique locations where this is happening.
And if you pay attention to the backtraces, you'll see that the line numbers differ.
So two different locations in the source code where this is happening.
And we can see that one of them is occurring more frequently than the other, so let's take a look at that one.
Here we can see that the Expert is telling us that mipmapping has been enabled for this texture, but we are missing levels in our mipchain.
The rendering results will likely be incorrect.
Hmm. Well, this sounds like what we're seeing.
So this may be our problem.
You can also see that it points to documentation on glTexImage2D and glTexParameter.
These are the OpenGL commands to configure texturing and add an image to a texture.
But to really make sure that this is our problem, we can leverage the other hubs of information in the Analyzer to really convince ourselves that this is our problem.
So a good strategy here is we know we're loading our texture at the beginning of the application.
So we're going to switch to the frame statistics table right here and highlight the second frame, which is going to move the instrument's time head to the very beginning of that second frame.
We can then just move it a little bit and select that time range.
And this basically selected the first frame.
So now we're going to focus our attention only on the first frame.
I'd also like to mention, if you notice, that the number of redundant state changes in this application is 3 at the beginning and then it's all 0s.
So we're good on that.
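As a reminder of what counts as a redundant state change, here is a minimal Python sketch that flags any call setting a piece of state to the value it already holds. It is a simplified model, not the instrument's actual logic: it ignores OpenGL's default state, and the state names and call sequence are invented.

```python
def count_redundant_state_changes(calls):
    """Count state-setting calls that set a value the state already holds."""
    current = {}
    redundant = 0
    for state, value in calls:
        if state in current and current[state] == value:
            redundant += 1  # nothing changes, so the call is wasted work
        current[state] = value
    return redundant


# Hypothetical sequence of (state, value) calls issued during one frame.
frame_calls = [
    ("blend", True),
    ("blend", True),       # redundant
    ("depth_test", True),
    ("blend", False),
    ("depth_test", True),  # redundant
]

redundant_count = count_redundant_state_changes(frame_calls)
print(redundant_count)
```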
So now that we selected this range, we can go take a look at the API statistics.
And, yeah; indeed, we are calling glTexImage2D in there 14 times.
So we have the right spot.
And we can finally go look at the trace and see all the OpenGL commands that this application issued in this time range.
So we see the usual suspects.
We're creating our context, creating our frame buffers.
So that's expected.
And, oh, look.
Texture calls that enable mipmapping as the minification filter, but we only ever specify one image for each of these textures, at the full resolution.
So that's clearly wrong.
That's not what we want to be doing.
And so, yeah.
This is definitely our problem.
We are missing mipmaps in our textures.
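The condition the Expert flagged can be expressed as a small check: if the minification filter is one of the mipmapped filters, every level down to 1x1 must have an image. The Python sketch below is a simplified model of that rule; a real completeness check also requires consistent sizes and formats across levels, and the function name is made up.

```python
# Minification filters that sample from mipmap levels.
MIPMAP_MIN_FILTERS = {
    "GL_NEAREST_MIPMAP_NEAREST",
    "GL_LINEAR_MIPMAP_NEAREST",
    "GL_NEAREST_MIPMAP_LINEAR",
    "GL_LINEAR_MIPMAP_LINEAR",
}


def missing_mip_levels(base_width, base_height, specified_levels, min_filter):
    """Return the mipmap levels that still need an image for the texture to be usable."""
    if min_filter not in MIPMAP_MIN_FILTERS:
        return []  # without mipmapping, only the base level is sampled
    # A full chain has floor(log2(max dimension)) + 1 levels.
    level_count = max(base_width, base_height).bit_length()
    return [level for level in range(level_count) if level not in specified_levels]


# The situation in the demo: mipmapped filtering, but only level 0 was uploaded.
missing = missing_mip_levels(256, 256, {0}, "GL_LINEAR_MIPMAP_LINEAR")
print(missing)
```

Either uploading the remaining levels or switching to a non-mipmapped filter makes the list empty in this model.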
So where can we fix this problem?
Well, let's take a look at this glTexImage2D command and see where it's called from.
It's called from some function called create texture, blah, blah, blah, on line 933.
So let's go right there, see if we can fix our problem.
Again, I'm going to bring up Xcode.
Go to line 933.
There we go.
And this is a generic function that this application is using for loading textures.
It has a convenient use mipmaps argument; but, well, we never submit all the other texture levels.
So there's typically two solutions to this.
The first is that you can pregenerate your mipmaps offline.
This is the recommended approach if your textures are static.
But if you just want the performance and quality improvements of using mipmapping without having extra assets, you can use the convenient generate mipmaps command to have OpenGL generate the mipmaps for you.
And so this is a single line of code.
You can just bring this over.
And there we go.
We have hopefully fixed our texturing problems.
Let's convince ourselves that this is the case by running the Analyzer again.
I'm going to magically switch to a fixed version of this application, click record, and let's see what the Expert tells us.
The big, red ominous flag is gone, and so is the mip recommendation, or problem.
And if we can switch to the demo phone, please.
We can see that the application renders the pictures correctly, and it seems a lot speedier too.
And, indeed, if I bring up the frames-per-second counter, you'll see that we're pretty much pegged at 60 frames per second.
Great. We fixed our application.
We've improved its performance.
We didn't have to scratch our heads.
So that was the OpenGL ES Expert, one of the three components of the new OpenGL ES Analyzer instrument, a brand new instrument designed to help you solve your performance and correctness problems in your application.
Now, I'd like to conclude this with a call to arms.
If you've attended the Game Design sessions, you've been told that play testing is critical; that you need to do it as much as you can, every day.
Well, this is also true for performance.
You need to be looking at your performance continuously until you ship your application and then continue to do so, even if it's for version 2.
So I want you to go out after this session and get the tool.
And I want you to start using it on your application every single day to improve your performance or keep it where it's at.
Make sure you never regress.
Also, I highly encourage you to send us feedback so that we can make this tool even better for you.
And, finally, go make an awesome app.
OpenGL is this fantastically powerful API that can let you create fantastic graphics.
And so, with great tools and a great API, you can make the most amazing applications.
For more information, there's two evangelists you may want to talk to: Allan Schaffer for anything graphics and games related, and Mike Jurewitz for anything developer tools related.
We have a lot of great documentation on OpenGL on the Developer Web site; in particular, the OpenGL programming guide includes on paper many of the recommendations that we've expressed in this session and that the Expert has internalized.
There's also the Khronos Group Web site.
Khronos is the standards organization responsible for OpenGL.
They have a lot of information on OpenGL, including the specifications, which I highly recommend you go read if you haven't done so.
And, finally, the Apple Developer Forums is the place to go to get information or ask questions and have them answered both by your peers and by Apple engineers.