Taking Advantage of Multiple GPUs

Session 422 WWDC 2010

Your application may be running on a Mac that contains more than one graphics processor. Understand how to adapt to renderer changes and the actions to take when the user drags your app from one screen to another. Discover how your application can drive multiple GPUs with OpenGL and OpenCL simultaneously, and see how to leverage the low-level power of IOSurface to share media data between them.

My name's Ken Dyke, I'm a member of the Graphics and Architecture Group at Apple Computer, and I'm going to talk to you guys this morning about how to take advantage of multiple GPU's in your apps.

So, what are we going to learn today?

First, the basics of supporting multiple GPUs in your apps: how you find all of the renderers in the system, and how you switch between them on your own or when you want the system to do it for you.

Second, we'll talk a little bit about how to support multiple GPU's at the same time and what that entails.

Some things involved with that are shared contexts, resource management, how synchronization works between the two GPUs, and some performance tips for when you really want to get the best that you can out of your system.

And lastly, we'll talk a little bit about IOSurface and multiple GPU's and how that can help you.

All right, so what are the motivations for using multiple GPUs?

So, we've been shipping systems with multiple GPU's for quite a long time, you know, you can buy a Mac Pro with up to four GPU's in it, recent MacBook Pros have multiple GPU's in them as well, so it's good for users to support them there.

But probably the biggest reason is, you know, getting a better user experience, you know, again, increased performance is one thing, if your apps respond to GPU changes the correct way, the system can tell you, "Hey, you know, the user moved their window from one GPU to the next."

If you don't do that, the window server might be stuck copying data between the GPU's on your behalf.

It works, but it's not the best for performance.

Another thing that goes along with that is that it's sort of important to support multiple GPU's for hot-plug, you know, it's like on a Mac Pro, you can yank the display card out of one, or not the display card, but the display cable out of one card, plug it into the other one, everything is supposed to move over, and most of the time it works, but again, if your app isn't written the right way, and the system decides to switch GPU's, things might not work quite so well.

Okay, so let's talk about the basics of Multi-GPU support.

So first, a little bit of terminology here.

So, a renderer is basically a single piece of graphics hardware in your machine, you know, it could be an AMD card, an NVIDIA card, or it could be the CPU software-based renderer, as well.

Now each one of these guys has a unique renderer ID in the system, that way if you have more than one NVIDIA card or more than one AMD card or like say you've got four GT120's in your machine, you can still identify specific pieces of hardware, even though they're all basically the same.

Now a pixel format object is something you normally think about being, hey, you know, I want 32-bit depth and 32-bit color and Multi-Sample and that sort of thing, but the important thing for this talk is that it also embodies what set of renderers your OpenGLContext is going to use.

In this case, I might have a pixel format that actually supports three renderers at the same time.

Okay, so how does that relate to context?

So when you create a context, you always have to pass in the pixel format that you want to use, again that's what decides what renderers your context is going to use.

Now once you have that, there's this concept of a virtual screen.

Those are assigned, basically into the renderer slots within your OpenGLContext.

Now, OpenGLContexts all have the concept of a current virtual screen.

You know, you're always saying, "For this OpenGLContext I want to be using my AMD card or I want to be using a software renderer, or the NVIDIA card, depending on maybe what screen I'm on, that sort of thing."

An important thing, I'll show you in the demo a little bit later, is that the virtual screen order does match the order of the renderers in the pixel format.

All of this stuff correlates so that you can go from a context virtual screen back to the pixel format, and figure out what piece of hardware you're on, figure out what display that might support, core renderer attributes, that sort of thing.

Now, we call it a screen, but the virtual part of it really is because it doesn't have any correlation with physical displays on your system.

You might have one display and three GPUs, you might have two displays and one GPU, they don't necessarily really line up with each other, so don't let that confuse you.

And last, and something that's, again, sort of cool for this talk, is that Mac OS X is the only platform where a single OpenGL context can support different renderers at the same time, and switch between them.

You know, it's just one way that Mac OS X makes supporting multi-head systems a lot easier than it is on other platforms.

Now OpenCL has similar concepts to all of this.

It's a different set of API's, but the concepts are all the same.

So instead of using Choose Pixel Format to choose a particular set of OpenGL renderers, you can call clGetDeviceIDs to get a list of all supported OpenCL devices in the system.

And instead of passing, you know a pixel format to NSOpenGLContext, when you call clCreateContext, you get to specify exactly the set of renderers that you want to use.

And instead of using a virtual screen to select the OpenGL renderer you want, in OpenCL land, what you're going to do is create a specific command queue against an OpenCLContext and that lets you pick the particular piece of hardware you're going to use.

Okay, so how do you allow your app to see multiple renderers in the system?

So, say you've got some code like this in your app, you know, setting up some pixel format attributes for OpenGL, you want accelerated, you want it to be double buffered, 32-bit color, but say you've also told the system, "Hey, this is the one display I want to use."

In this case I'm saying, "Hey, I just want whatever GPU's hooked up to the main display, give me that one."

Please don't do this.

First of all, it's guaranteed you're only ever going to get a single renderer if you do this, okay, which means your app is guaranteed not to support hot-plug, that sort of thing, or even if the user has two cards plugged in with four displays, you're only going to end up using one GPU to get any OpenGL acceleration, and the window server is going to be stuck copying the contents of your drawable across to the other card, not good.

The whole point of ScreenMask is really only from sort of legacy full-screen context where, you know, you would have your OpenGLContext and then you just tell it, Set Full-screen, you had no way to say, "Hey, I want to be using, you know, go full-screen on this particular GPU versus another."

Now in 10.5, we added a new API that lets you have multiple hardware supporting full-screen context, but as of 10.6, you don't need to use full-screen context anymore anyway.

You can just create a full-screen window, covers the display, and will automatically give you all the performance benefits, so don't use ScreenMask, it's really not necessary anymore.

Now what you should use though is allow offline renderers.

This one is the biggie.

If you don't specify this one, you won't get any GPU's that don't have displays attached at the time.

So, if, you know, you just do the normal thing and the user has two GPU's and they would start up your app and they switch to the other card, for whatever reason, your app isn't going to even be able to move to the other GPU, even if you wanted it to, okay?

So this is what allows you to see renderers that don't have displays attached.

Now, the reason we don't do this by default is primarily for compatibility.

Unlike most of the other pixel format attributes, this one actually adds to the list of renderers, rather than takes away from them, and there's some apps that just don't deal with that, so this is kind of an opt-in thing, but at this point, everybody should be doing it.

And, again, it's important for hot-plug.

So, how do you trigger a renderer change to happen within your OpenGLContext?

Well, normally the system will do it for you, but if you just call NSOpenGLContext's update method, the system will look at the window your OpenGLContext is attached to and automatically choose the right renderer based on, you know, whether I'm on this display or another display or, if I'm straddling, which one has more screen coverage, that sort of thing.
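As a rough illustration of the "which display has more screen coverage" rule, here is a small toy model in Python. It is not AppKit or OpenGL code: the `Rect` type, `overlap_area`, `pick_display`, and the display names are all hypothetical, and the real system may weigh more than overlap area when it re-chooses a renderer.

```python
from typing import Dict, NamedTuple, Optional


class Rect(NamedTuple):
    """Axis-aligned rectangle in global screen coordinates (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def overlap_area(a: Rect, b: Rect) -> float:
    """Area shared by two rectangles, or 0.0 if they do not intersect."""
    width = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    height = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(width, 0.0) * max(height, 0.0)


def pick_display(window: Rect, displays: Dict[str, Rect]) -> Optional[str]:
    """Return the name of the display covering most of the window, if any."""
    best_name, best_area = None, 0.0
    for name, frame in displays.items():
        area = overlap_area(window, frame)
        if area > best_area:
            best_name, best_area = name, area
    return best_name
```

In this model a window straddling two side-by-side displays is assigned to whichever one holds the larger share of its frame, and a window that touches no display gets `None`.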

So, normally you would do this in response to NSOpenGLView's update method, you just have almost a single line of code in there that just says, "Hey, my view could have moved displays, or it's been scrolled or something like that, OpenGLContext, now's a good time, go ahead and re-choose what the best renderer to be using is."

Now if you're not using NSOpenGLView and you're sort of just using a regular NSView yourself and attaching a context to it by hand, there's an AppKit notification you can sign up for that you see here, NSViewGlobalFrameDidChangeNotification. AppKit will post that notification when it believes that your view could have moved to a different display.

It doesn't post it all the time for performance reasons, but if you move to a different display, or you've been scrolled or something like that, AppKit will post it for you.

You can also tell the OpenGLContext to use a particular virtual screen.

This forces it to use a particular renderer, no matter what card it thinks you're on.

This is primarily useful for off-screen contexts, like say, you've got one context doing some rendering to an FBO or something like that, or doing some texture uploading, well you want to make sure your texture uploading goes to the same virtual screen or the same renderer that your onscreen does.

So, this is how you would accomplish that.

Now, how do you respond when a renderer change happens?

What do you have to do?

Well, for a lot of apps, you don't have to do anything.

If you're not using a lot of crazy or very hardware specific OpenGL extensions, calling update will be enough.

However, if you're doing something NVIDIA specific, or AMD specific, you're right up against some hardware limits, you want to basically look and see if the virtual screen has changed since the last time you called update.

Now let's say that it changed.

What do I do then?

Well, in that case, check the extension list, see if the extensions you want to be using have changed, now that you're on a new renderer.

Other things you can look for are, see if there's a hardware limit, some of your textures might exceed or something like that, and you might have to re-upload them.

And lastly, again, make sure that if you have any off-screen contexts, that they're synchronized with your onscreen ones, so we don't have to copy data across the bus for you.
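The checks just described (did the virtual screen change, are the extensions I rely on still there, do any textures exceed the new hardware limit) can be sketched as a small Python model. This is only an illustration under assumptions: the function name, the shape of the `renderers` table, and its capability values are hypothetical. In a real app the extension list and size limit would come from OpenGL queries such as `glGetString(GL_EXTENSIONS)` and `glGetIntegerv(GL_MAX_TEXTURE_SIZE)` on the new renderer.

```python
def renderer_change_actions(old_screen, new_screen, renderers,
                            needed_extensions, texture_sizes):
    """Report what needs attention after a possible renderer change.

    renderers is indexed by virtual screen; each entry is a dict with an
    'extensions' set and a 'max_texture_size' integer (hypothetical shape).
    texture_sizes is a list of (width, height) tuples for live textures.
    """
    if new_screen == old_screen:
        # Same virtual screen as last time: nothing to re-check.
        return {"changed": False, "missing_extensions": set(), "oversized": []}
    caps = renderers[new_screen]
    # Extensions the app relies on that the new renderer does not expose.
    missing = set(needed_extensions) - caps["extensions"]
    # Textures that no longer fit within the new renderer's size limit.
    oversized = [size for size in texture_sizes
                 if max(size) > caps["max_texture_size"]]
    return {"changed": True, "missing_extensions": missing, "oversized": oversized}
```

An app would run something like this right after calling update, comparing the context's current virtual screen against the value it remembered from the previous update.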

Okay, so how does resource management work when you've got one of these contexts supporting multiple GPUs?

Well, OpenGL and OpenCL will automatically track anything you've specified, via the API textures, anything like that, we can keep track of which GPU's that data is on and move it back and forth as necessary.

Resources that you specify, like an OpenGL texture, or VBO data, something that you're not going to be modifying using the GPU, we can just simply re-upload to the second GPU and everything will continue.

Now, if you've modified a texture, using like FBO, or something like that, on one GPU, when a renderer change happens, we're going to have to pull that data off one card and ship it over to the other one.

Now the nice thing is that that happens completely automatically just as a result of you doing the context update.

Okay, any objects that are bound at the time you perform the renderer update, we automatically move over, because you could just keep doing any rendering right on them and expect it to work.

Any other objects are basically done lazily as you bind to them, we'll detect that they're on the wrong GPU, pull them back to system memory, then upload them to whatever renderer is necessary.

So, this will sort of give you an idea of how all this works.

So let's say on my OpenGLContext, I go ahead and create a texture and give some texture data to OpenGL, in this case it's just a simple little brick texture.

At this point, the OpenGLContext in host memory has a copy of the texture, but it doesn't exist on either GPU yet.

So let's say I do some simple rendering code that draws with the texture. It's still sitting just in system memory; not until that command stream gets flushed to the GPU does it finally get uploaded, as we see here.

Now let's say I do something like I talked about before where I go ahead and modify that texture in some way. In this case, let's say that function was accelerated by this particular driver, and all of that data is now basically going to be in flight up to the GPU to modify that texture up in VRAM.

The host memory copy of that texture has now become invalid.

We can't really use it to upload to the other card if necessary.

And let's say we stay bound to that texture and we do a context update that switches us to the NVIDIA card. Part of that process involves paging that back to host memory, and then switching our virtual screen over to the NVIDIA renderer in this case.

So, again, let's say I draw with that texture over on the NVIDIA card, again nothing happens until I do a Flush.

At that point, the data gets copied up and everything is all good.
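The sequence just walked through can be written down as a toy state model in Python. This is a sketch of the bookkeeping only, not OpenGL: the `Texture` and `Context` classes, their method names, and the string "data" are all hypothetical stand-ins, and the real driver's behavior is more involved.

```python
class Texture:
    """One host-memory copy plus optional per-renderer VRAM copies."""

    def __init__(self, data):
        self.host = data
        self.host_valid = True   # False once a GPU has modified the texture
        self.vram = {}           # renderer name -> data resident on that card


class Context:
    """A single context whose pixel format contains several renderers."""

    def __init__(self, renderers):
        self.current = renderers[0]   # renderer behind the current virtual screen
        self.queue = []               # commands recorded but not yet flushed
        self.log = []                 # data movements performed by the model

    def draw(self, tex):
        self.queue.append(("draw", tex, None))

    def modify_on_gpu(self, tex, data):
        self.queue.append(("modify", tex, data))

    def _page_to_host(self, tex):
        # Pull the newest data back from the one card that holds it.
        if not tex.host_valid:
            owner, data = next(iter(tex.vram.items()))
            tex.host, tex.host_valid = data, True
            self.log.append(f"read back from {owner}")

    def flush(self):
        # Nothing reaches the GPU until the command stream is flushed.
        for op, tex, data in self.queue:
            if self.current not in tex.vram:
                self._page_to_host(tex)
                tex.vram[self.current] = tex.host
                self.log.append(f"upload to {self.current}")
            if op == "modify":
                # An accelerated modification leaves the host copy stale.
                tex.vram = {self.current: data}
                tex.host_valid = False
        self.queue.clear()

    def switch_renderer(self, new, bound=()):
        # Models a context update that lands on a different renderer:
        # objects bound at that moment are paged back to host memory eagerly.
        self.flush()
        for tex in bound:
            self._page_to_host(tex)
        self.current = new
```

Running the talk's steps through this model (create, draw, flush, modify, flush, switch, draw, flush) produces one upload to the first card, one read back from it, and one upload to the second card.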

All right, so let me give you a little demo here.

Was anybody here at my 2002 Quartz Extreme Talk?

Nobody, oh man, all right.

So, all right, Peter was.

So, this is a really simple demo, but it's designed to show that an app can basically seamlessly move between renderers without having to do much of anything at all.

So, I have the ever-important Allow Offline Renderers attribute in here so that I can detect things.

Now I should back up a step a little bit, the machine that I'm using for this has both a GTX 285 and an AMD 4870 card in it at the same time, but we're only hooked to one display.

Now we're driving the display off the NVIDIA card, so if I want to see or be able to get at the AMD renderer, I have to put Allow Offline Renderers in here, or I'm not going to be able to get at it.

Now for demo purposes, which you'll see why in a minute, I'm also signing up to just detect whenever the window moves, because I want really fine grained control over when we're going to switch GPU's.

Now in my GPU changed thing, for the purposes of this app, I really don't have to do anything important.

Everything that's in here, is just really to update the User Interface with what renderer I'm going to be showing you guys.

So the simple part of that is here, I'm just calling OpenGL, doing a little bit of Cocoa to take the C-string I get back from OpenGL and slam that into a text field, so we can see the name of the renderer I'm going to be on.

This code down here, which I'm not going to go into the gory details on, lets me go from the OpenGLContext virtual screen, whichever one that is, all the way back to being able to query the renderer for how much video memory it has.

So finally, down here, I can say, you know, does it have half a gigabyte or a gigabyte.

Now, also for the purposes of this demo: normally you would just do this, you'd just call update, because the system said you needed to. For the demo, I have to do something a little bit more interesting.

In this case, I want to switch renderers based on which half of the screen I'm on, so this part figures that out for me.

It lets me sort of simulate having two displays hooked up here.

So, let me go ahead and run this.

So, anybody recognize what this is from?

Come on, geez, all right, nobody grew up in the '90's I guess.

Ken: Thank you.

All right.

So, the other thing about this that's kind of funny: I implemented this to use VBOs for performance, and of course it doesn't matter in this case, because there aren't enough polys and it uses per-pixel lighting and you still can't tell. So anyway, now if I move this over, it switches; now everything's being rendered on the ATI card and the window server is just automatically figuring it out.

The take away from this is nothing happens on the screen when I do it.

There's no glitch, it just keeps going and the OS basically does the right thing for you.

That's what we want the user experience to be in this case, all right, you know, just nice and smooth transition as far as your user's concerned, life is good, nothing happened, that's the whole idea.

So, check out the sample code, it should be posted already and you can take a look at what the rest of the app is doing.

Okay, so let's talk a little bit about advanced Multi-GPU Support, when you want to purposely be using more than one GPU at once.

So, what are the motivations for this?

Well, the biggest one, obviously, is performance.

You know, you've got this other card sitting there, maybe it's really good at doing OpenCL stuff and you want to make use of it.

Due to the way GPUs context switch, or don't these days, you know, if you have an offline GPU doing a lot of compute, it doesn't really matter if you're tying that thing up for really long periods of time; the GPU driving your display will be free to just do GUI stuff.

Another reason is, you might just find yourself on a system that has two GPU's, but only one of them has the extensions you want, so you really have to use the one that, just for whatever reason the user doesn't have the display hooked up to.

In that case, you know, you might just simply need to use the other one along with the one that you're using for display.

So, there's some issues to consider with this.

First, context sharing, how does that work?

How do we, you know, make resources shared between multiple contexts running on different OpenGL renderers, and how do synchronization and resource management work when you have more than one renderer involved?

So, first we'll talk a little bit about context sharing in OpenGL, which is something not everybody might be familiar with.

On Mac OS X, you can have multiple OpenGL contexts sharing the same set of resources.

This means things like textures, display lists, FBOs, VBOs, a whole bunch of stuff like that.

That way, no matter how many different OpenGL contexts you have in the system, you only have to give it the texture data once.

It's better for system performance, there's not duplicates of everything, life is all good.

Now normally when you create a context, and this is just a really simple example, you can see that the second parameter here is null. That parameter is the shared context parameter, okay.

If you switch that out to a different context, now when you create context B, it's going to share all of your, you know, resources with the first one.

Now there's some limitations here.

Both contexts have to have exactly the same set of renderers in it, you can't have one context that just has the AMD card and another context that has just the NVIDIA card and share resources between them, that won't work.

Both contexts have to have both renderers in them.

So, be very careful if you're going to call ChoosePixelFormat once for each context.

If you use DisplayMask, or something else that limits you to one particular hardware device or another, you're probably not going to get what you want.

The safest thing you can do is just ask the first context for its pixel format and pass that in here, that way when you create the second context, you're guaranteed that it's going to work.

In OpenCL, it's a pretty similar story, but it's easier to sort of force the sharing stuff.

Whenever you create a command queue in OpenCL off of a given OpenCL context, all of the command queues created against that context are guaranteed to share stuff.

You don't necessarily have to jump through the same hoops you do in OpenGL.

Now the interesting thing you can do on Mac OS X is create an OpenCL context that automatically has in it the set of renderers that an OpenGL context is using. If you do it this way, as the code shows here, you automatically get resource sharing between OpenCL and OpenGL, so images and textures will be shared, or buffer objects will be shared automatically, and you can pass data between them.

That's how you do that.

It's worth pointing out that, in this particular case, your OpenGL context might have the software renderer in it, but this code here for OpenCL won't get you the CL software compute device; you would have to add that in as well.

So, let's talk a little bit about multiple-context synchronization and what goes on there.

So, if you've got two different contexts, even if they're on the same GPU, you really have to pay attention to order of operations if you're going to be sort of doing a producer consumer sort of thing, okay?

On Mac OS X, OpenGL at least uses what we call Flush and then Bind semantics, if you're going to do this.

Any context that's modifying a resource, like rendering to a texture using FBO, or anything along those lines, has to Flush that context before we do anything else with the data in another context.

That could be a glFlush, glFlushRenderAPPLE, or glFinish (not the best performance, but it would work); anything that basically drains the entire OpenGL pipeline dry will ensure that everything is good.

After that point, any contexts that are going to use that modified resource, must bind to it.

You can't just already be sitting there bound to a texture that you modify in another GPU or another context and then draw with it, it won't work right, you have to redo the bind.

The bind call is where we get a chance to go in and detect that the data might not be on the GPU where it's now needed.

Now this applies to both single and multi GPU cases.

Even if, you know, I've got two contexts on the same renderer, if I do a TexSubImage, or something along those lines, on one context and don't do a flush, and go to the second context, even though it's all on the same GPU, because of the command buffering that happens in host memory and everything else, those texture modifications might simply not be visible to that second context unless the first one has been flushed.

In the Multi-GPU case, it's also very critical, because at that point that's what allows us to pull the data off the first GPU and ship it over to the second one.

Now in OpenCL you can use the event model to accomplish the same sort of thing.

If you're shipping data from GL to OpenCL, for example, you kind of have to follow the appropriate rule for each API.

In this case, if you're going to render to a texture using OpenGL, and then you want to do some CL specific image processing on the result, you're going to have to Flush the GLContext, make sure all those commands are in flight first, then you can do an acquire on the image in CL, and everything will work the way you expect.

So, this is similar to my previous diagram, but what I've got to show here, is that I have two OpenGLContexts, okay, they're using the same Share Group, in this particular case, but each one is talking to a different GPU, so how's this going to work?

So, using my previous example, I create a new texture object and I put some data into it.

At that point, it's really the Share Group that owns the copy of the data.

In the previous animation that I showed for you, I didn't show the Share Group as a separate thing from the OpenGLContext because I didn't want to sort of muddy the waters, but the reality is is that even a standalone OpenGLContext always has a Share Group sitting under it, and that's really where all the resources are owned.

So in this case, again, I've specified a texture, it's now owned by the Share Group, now I come along and I render with it.

Now, again, that draw texture command, or whatever that entailed in my app, whatever those GL commands were, is still just sitting in host memory, it hasn't been shipped off to that other GPU yet.

Or to the AMD card in this case.

Once I call Flush, that's when everything gets going and that data will be uploaded to the card and consumed.

Now, again, let's say I do a TexSubImage operation that winds up being accelerated, so we're not going to end up modifying it in host memory first. That causes the host memory copy to now be out of date with respect to what's up in video memory on the AMD card.

And again, Flush it to make sure that everything is really up there.

Now let's say I go over to my other OpenGL context that's currently bound to the NVIDIA renderer, and do something interesting over there.

So, I'm going to bind to that same texture object.

Now the instant I do the bind, the Share Group is smart enough to say, "Hey, wait a minute, you know that texture, we don't have the most up-to-date copy in system memory.

I've got to reach all the way over to this other piece of hardware and pull the data back."

This is why both contexts have to have the same set of renderers in them, because even though over on the second OpenGL context, on your right, I'm talking to the NVIDIA card, I might need to have sort of device access to the AMD card so that I can pull resources on it back to host memory, on your behalf, okay?

So now I'll go ahead and I draw with it and I Flush, that causes that data to get pulled up to the NVIDIA card now, and everything continues the way you want it to.

Now, let's say I go over here and do an accelerated CopyTexSubImage.

Well the instant you do that, OpenGL knows that, not only is the Share Group copy out of date with respect to what's up on the NVIDIA card, but the AMD's data is now going to be out of date too, so if I go back to the first context and do something with that, it's going to have to start this entire process all over again and pull the data off the NVIDIA, back to host memory, and back up to the AMD card.

And, again, make sure you finish your rendering with a Flush.
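The two-context walkthrough, together with the flush-then-bind rule, can be modeled in a few lines of Python. This is a toy model under assumptions, not OpenGL: `SharedTexture`, `GLContext`, and their methods are hypothetical, and the `RuntimeError` is only the model's way of flagging a stale binding (the talk says that case simply won't work right).

```python
class SharedTexture:
    """Texture owned by a share group: a host copy plus per-renderer copies."""

    def __init__(self, data):
        self.copies = {"host": data}
        self.valid = {"host"}    # locations currently holding the newest data


class GLContext:
    """One context in the share group, pinned to a single renderer here."""

    def __init__(self, renderer):
        self.renderer = renderer
        self.bound = None
        self.queue = []          # recorded commands, not yet flushed
        self.transfers = []      # copies this model performed

    def bind(self, tex):
        # Bind is where out-of-date data is detected and pulled to the host.
        if self.renderer not in tex.valid and "host" not in tex.valid:
            owner = next(iter(tex.valid))
            tex.copies["host"] = tex.copies[owner]
            tex.valid.add("host")
            self.transfers.append(f"{owner} -> host")
        self.bound = tex

    def tex_sub_image(self, data):
        self.queue.append(("modify", self.bound, data))

    def draw(self):
        self.queue.append(("draw", self.bound, None))

    def flush(self):
        drawn = []
        for op, tex, data in self.queue:
            if self.renderer not in tex.valid:
                if "host" not in tex.valid:
                    raise RuntimeError("stale binding: flush the writer, then rebind")
                tex.copies[self.renderer] = tex.copies["host"]
                tex.valid.add(self.renderer)
                self.transfers.append(f"host -> {self.renderer}")
            if op == "modify":
                # Accelerated modification: every other copy is now stale.
                tex.copies[self.renderer] = data
                tex.valid = {self.renderer}
            else:
                drawn.append(tex.copies[self.renderer])
        self.queue.clear()
        return drawn
```

In this model, a modification that was recorded but never flushed is invisible to the other context, and a context that stays bound across someone else's modification cannot get the new data until it binds again.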

So, there's some performance things to think about, if you're going to use multiple GPU's within your application.

The biggest one is make sure you're doing enough work to make it worthwhile.

You have to take into account how much compute or rendering you're going to do versus what it's going to cost you to transfer the data, you know, back to host memory and up to the second GPU.

Especially in the Mac Pros where it might even matter what slots you're in, you have to pay attention to that, you know the user may have four cards in the system, but only two of them will have 16X slots, the other two will have 4X, getting data off certain GPU's is going to be considerably more expensive.
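A back-of-the-envelope version of that trade-off can be sketched in Python. The numbers are assumptions, not measurements: the default of 500 MB/s per lane is the theoretical PCI Express 2.0 peak per direction, real throughput is lower, and the function names and timing inputs are hypothetical.

```python
def transfer_seconds(num_bytes, lanes, bytes_per_lane_per_s=500e6):
    """Ideal time to move num_bytes over a PCIe link with the given lane count."""
    return num_bytes / (lanes * bytes_per_lane_per_s)


def offload_pays_off(num_bytes, stay_seconds, offload_seconds, lanes_src, lanes_dst):
    """True if moving the work to a second GPU beats leaving it where it is.

    The data is assumed to travel from the source card back to host memory
    and then up to the destination card, so both links are paid for.
    """
    overhead = (transfer_seconds(num_bytes, lanes_src)
                + transfer_seconds(num_bytes, lanes_dst))
    return offload_seconds + overhead < stay_seconds
```

For example, a 1920 by 1080 RGBA frame is 8,294,400 bytes, which in this idealized model costs about 1 ms to cross an x16 link and about 4 ms to cross an x4 link, so roughly 5 ms of transfer has to be won back by the second GPU before the move is worth it.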

So, what you want to do ideally as well, is decouple the workloads between the two GPUs as much as possible, you know, say if you had four GPUs in there, and you're doing some kind of, I don't know, like image processing thing where you don't have to have dependencies between the GPUs, you can just fire stuff off to one card, run some image processing, get it back.

You could probably set up four different threads, one to work on each GPU, get them all fired up and running completely in parallel with each other, that's the ideal, okay?

Another thing to watch out for is, while we just talked about how the resource synchronization stuff works in multiple context, don't rely on that to get the best performance.

You know, before we can pull data off of one GPU, we have to synchronously wait for it to be done before we can pull it off.

Okay, so ideally, you want to find some way, be it whatever API you're working with, GL or CL, to make sure that that data has somehow gotten back to system memory.

So, consider using extra buffering, sort of double or triple buffering your data, in cases where you know you're going to be streaming data from one GPU to the other; you want to get those two things working in parallel as much as you possibly can.
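To see why extra buffering helps, here is a small timing model in Python. It is a sketch under simplifying assumptions: one GPU produces a frame in a fixed time, the other consumes it in a fixed time, and a buffer is free again once the consumer has finished with it. The function name and parameters are hypothetical.

```python
def pipeline_time(frames, produce_s, consume_s, buffers):
    """Total time for a producer GPU feeding a consumer GPU via N buffers."""
    prod_done, cons_done = [], []
    for i in range(frames):
        # The producer is busy until its previous frame is finished...
        start = prod_done[i - 1] if i >= 1 else 0.0
        # ...and needs a free buffer: the one used `buffers` frames ago.
        if i >= buffers:
            start = max(start, cons_done[i - buffers])
        prod_done.append(start + produce_s)
        # The consumer needs this frame's data and must have finished the last one.
        cons_start = max(prod_done[i], cons_done[i - 1] if i >= 1 else 0.0)
        cons_done.append(cons_start + consume_s)
    return cons_done[-1] if frames else 0.0
```

With 4 frames, 3 time units to produce and 2 to consume, a single buffer serializes everything for a total of 20 units, while two buffers let the two devices overlap and bring the total down to 14.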

Another thing is, if you've got some compute device in the background, or GL doing offline rendering in the background, make sure you don't wind up bottlenecking yourself behind trying to display every single frame that you get out, it's really not necessary, show a progress bar, or just take snapshots, you know, every now and then of where the GPU's at, but don't try and shove every single frame that you compute onto the display, it's just going to slow everything down.

All right, so now I am going to bring up Abe Stevens, one of our OpenCL Engineers to show you a couple of demos that use multiple GPU's at the same time.

Abe?

[Applause]

Hi, my name is Abe Stevens, and I work with the OpenCL group at Apple.

Yesterday during the OpenCL, OpenGL sharing talk, we showed a simple demo that consisted of a number of objects that bounced around the desktop with some post-process effects that were rendered in OpenCL.

What we've done for this demo is we've taken that same code and made a certain number of changes to it, to enable Multi-GPU support.

So first, let's take a look at the demo that we saw, or that some people saw yesterday.

In this example, the spheres that are bouncing around the image are rendered with OpenGL, using a GLSL shader that computes refractions and reflections and then the caustic effect that you see is rendered in OpenCL as a post process.

Now with only a few spheres bouncing around the desktop, there really isn't much need for an additional GPU, however, if we switch to a more complicated example, with a much larger number of spheres, if we look at the frame rate, which is displayed in the lower corner here in milliseconds, the caustic is relatively expensive and incurs a relatively high latency.

If I switch the physics step in this program, which right now is running in OpenCL on the GeForce card, over to the Radeon card, the performance will increase by about, in this case, about 35% or so.

This increase in performance isn't, you know, it's not a 2X increase in performance, but in this case all we've done is we've taken a simple CL application and made a small modification to it that allows some of that CL work to be performed on the second GPU.

If we wanted to take this application and design it from the ground up, and I'll show another example in just a moment that does this, we could set up our application so that more work could be run concurrently between the two devices.

In that case, we might have to double buffer our data, so that one GPU doesn't always have to wait for the other GPU before it can perform work concurrently.

So this is another demo that was built from the ground up to run on multiple devices.

In this case, we're running the demo on the Radeon for graphics, so all the OpenGL rendering happens on the ATI device, and the simulation, which is simulating the movement of these particles around the desktop, is occurring in OpenCL on our NVIDIA device.

Now if we switch to the Radeon device for both graphics and compute, the performance changes significantly because now that one GPU has to perform both tasks, both the OpenGL display and the compute process to render the position of the particles, and of course, if I switch that back, our performance goes back up.

But this application is very different than the one that I showed just a second ago.

In this case, we've set up the application to double buffer the data, and we have really designed it from the ground up to use both devices, so if you consider using both devices and allowing work to execute concurrently on the two pieces of hardware, you might be able to get about a 2X performance increase.

If you just take your application and add support for multiple devices or multiple GPU's, the performance increase will be a lot less, but we're still able to achieve about a 20 to 30% improvement.

Anyway, I'm going to hand this back over to Ken who will talk about another advanced multi GPU topic.

[Applause]

Ken: Okay.

So let's talk a little bit about IOSurface.

So this API was introduced in Mac OS X 10.6 with not a lot of fanfare; this is actually the first place we're talking about it.

The whole point of IOSurface, and what's relevant to this talk, is that it makes resource sharing between different parts of the system a lot easier than it used to be.

IOSurface is basically nothing more than a really nice high level abstraction around a chunk of system shared memory.

So, what this is designed for is to do very efficient cross process and/or cross API data sharing, you know, you might need to send some data from CoreImage to OpenGL and you don't have control over the context involved, so you can't set up sharing, this can help you with that.

Germane to this talk is that it's integrated directly into the GPU software stack, for all supported hardware on Mac OS X.

This is what allows us to pull off some really cool tricks that we'll talk about a little bit later.

Now the really neat thing about it is that from the app developer's point of view, it hides nearly all of the details about moving data from one GPU to the other, or between the CPU and GPU and vice versa, okay?

If you follow a few simple rules, it pretty much is designed to just work.

So, let's talk about the GPU integration stuff, because it's important for this talk.

So, an OpenGL texture can be bound to an IOSurface.

This is sort of a live connection, it means that anytime the contents of that surface are modified anywhere in the system, that texture, anytime it gets used, is going to see those modifications right away, you don't have to keep copying the data into the texture.

Also, IOSurface does support planar image types, so you can bind an OpenGL texture to a single image plane. For example, say you had an IOSurface with NV12 4:2:0 style video in it, you can bind the IOSurface once to the luminance plane, once to the chrominance plane, and write a shader that puts them together to do the RGB conversion. It works out really well, and we do that internally in some cases.
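To make the two-plane layout concrete, here is a small CPU-side sketch of what such a shader would compute. It is an illustration, not the code Apple uses: the function names are hypothetical, the dimensions are assumed to be even, and the full-range BT.601 matrix is an assumption, since the talk does not say which conversion is applied.

```python
def nv12_planes(width, height):
    """Return (width, height, bytes_per_element) for the two NV12 planes."""
    luma = (width, height, 1)              # one Y byte per pixel
    chroma = (width // 2, height // 2, 2)  # one interleaved CbCr pair per 2x2 block
    return luma, chroma


def ycbcr_to_rgb(y, cb, cr):
    """Convert one pixel using full-range BT.601 (an assumed matrix)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)

    def clamp(value):
        return max(0, min(255, int(round(value))))

    return clamp(r), clamp(g), clamp(b)


luma, chroma = nv12_planes(640, 480)
total_bytes = sum(w * h * bpe for (w, h, bpe) in (luma, chroma))
print(luma, chroma, total_bytes)
print(ycbcr_to_rgb(128, 128, 128))  # neutral chroma gives a gray pixel
```

Each texture bound to a plane would use that plane's own width, height and element size, which is why the chrominance texture is half the size in each dimension.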

If you want to modify an IOSurface, you just take your IOSurface-backed OpenGL texture, bind it to an FBO and go, it's really no more complicated than that.

For the most part, you just get to use the standard OpenGL techniques to do it.

Now, OpenCL itself doesn't have direct binding to IOSurface at this time, but via the resource sharing stuff we talked about before, you can more or less take an OpenGL texture, bind it to an IOSurface, then take that OpenGL texture and use it with the appropriate extension, whose name escapes me at the moment, but you can take that texture and use it in OpenCL as an image and get access to it that way.

Now, what's also cool about IOSurface textures is that it doesn't matter how many textures in the system get bound to that IOSurface.

They all are going to use exactly the same video memory, on any given GPU.

Okay, so this means, if I have two different processes in the system, both looking at the same IOSurface, and they both create a texture off of it, and they're both using the same GPU, there's not going to be any copies that happen back to host memory just because we're crossing process boundaries, okay, that's part of the just works part of the whole API, and it's a good performance thing as well.

Also, no matter how many different renderers we have in the system, the host memory backing for IOSurface is shared between them.

So if we do have to transfer stuff from one GPU to host memory and up to another card, there aren't ever any CPU copies involved in this process.

With regular OpenGL texture objects, there actually can be. In this case it's DMA to system memory, DMA up to the other card, and that's it; the CPU does not touch any of the data.

So this is just a really simple example of creating an IOSurface and getting it usable inside of OpenGL.

So, IOSurface is a standard, sort of Mac OS X, Core Foundation-based API, you give it a dictionary of all of the properties you want for the IOSurface and away it goes.

In this case I'm cheating a little bit and using toll-free bridging because it's a lot less code.

In this case, I'm just going to create a simple 256 by 256 IOSurface, 4-bytes per pixel, and that's pretty much all I need to specify as far as IOSurface is concerned.

IOSurfaces do not have an intrinsic format associated with them.

You can give it a pixel format identifier, like, you know, BGRA or any of the sort of QuickDraw-style FourCCs, or NV12 or anything like that, but IOSurface really doesn't care.

The only reason that's there at all is just so that two processes can sort of pick something to agree upon.
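The slide code itself is not in this transcript, so the following is only a stand-in for the idea: a handful of properties describe the surface, and the pixel format is just a four-character tag that both sides agree on. The dictionary keys and helper names here are hypothetical, not the real IOSurface property keys, and a real allocation may pad rows beyond this minimum.

```python
def fourcc(code):
    """Pack a four-character code such as 'BGRA' into a 32-bit integer."""
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("a FourCC is exactly four ASCII characters")
    return int.from_bytes(raw, "big")


# Hypothetical stand-in for the properties dictionary described above.
properties = {
    "width": 256,
    "height": 256,
    "bytes_per_element": 4,
    "pixel_format": fourcc("BGRA"),  # optional hint; the surface itself ignores it
}


def minimum_size(props):
    """Smallest backing store that can hold the described pixels."""
    return props["width"] * props["height"] * props["bytes_per_element"]


print(hex(properties["pixel_format"]), minimum_size(properties))
```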

Now from the OpenGL side, all I really have to do is basically generate a new texture object and call this, you know, Mac OS X specific API, CGLTexImageIOSurface2D, it's kind of a mouthful, and that will take that currently bound texture object and bind it to the backing store of that IOSurface.

Okay, in this case, I'm telling OpenGL I want the internal format of this texture treated as RGBA, that it's 256 by 256, and I want OpenGL to look at that data as if it's BGRA, unsigned int 8_8_8_8 reverse, which is just your basic ARGB format.

And you give it the surface that's involved and, in this case it's not a planar surface, so I just specify zero.

If the IOSurface had multiple planes, this is where you would stick that argument.

I want to call out this, again, because it's kind of an important point.

OpenGL is going to view that data in the IOSurface, via these parameters.

It doesn't really matter what the data is, or what format it is, what you specify here is what OpenGL is going to interpret that data as.

When we transfer it back and forth between host memory and the GPU, there's no CPU touching, there's no data formatting, it's straight copy up, straight copy back, you know the GPU's might do hidden tiling or that sort of thing in their local video memory, but that's not exposed to the app developer in any way.
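A small sketch of what "interpreting" the same bytes means. It assumes a little-endian host and models the BGRA with unsigned int 8_8_8_8 reverse combination, where the first component (blue) sits in the lowest bits of each 32-bit word; the function names are made up for illustration.

```python
import struct


def read_as_bgra_8888_rev(texel):
    """Read 4 bytes as one little-endian 32-bit word laid out as 0xAARRGGBB."""
    (word,) = struct.unpack("<I", texel)
    return {
        "b": word & 0xFF,
        "g": (word >> 8) & 0xFF,
        "r": (word >> 16) & 0xFF,
        "a": (word >> 24) & 0xFF,
    }


def read_as_rgba_bytes(texel):
    """Read the same 4 bytes as plain R, G, B, A bytes in memory order."""
    r, g, b, a = texel
    return {"r": r, "g": g, "b": b, "a": a}


texel = bytes([0x33, 0x22, 0x11, 0xFF])  # bytes exactly as they sit in memory
print(hex(struct.unpack("<I", texel)[0]))  # prints 0xff112233, i.e. ARGB
print(read_as_bgra_8888_rev(texel))
print(read_as_rgba_bytes(texel))
```

The bytes never change; only the format and type handed to OpenGL decide which byte is treated as red.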

Now, the nice thing is, IOSurface follows the same synchronization rules that we talked about earlier, there isn't anything new to learn here, they work exactly the same way, okay?

If you're going to take a texture on one context, and modify it with the GPU and ship it over to another context, you just have to do the Flush and you just have to do the bind, behind the scenes, IOSurface sort of works outside the Share Group to figure out in the system, that hey, this data is not in the right card.

Now the neat part about this too, is that the two contexts involved in this don't have to know about each other at all and they don't have to even share the same renderers, this is where, because this is integrated at such a low level on the system, we can still get at a GPU that your app doesn't even necessarily know about to go pull the data off of it.

Now the other sort of neat thing about IOSurface is that it lets you get direct access to that backing memory.

You know, for regular textures, if you're not using all of the texture range extension and client storage and all of that stuff, normally you can't get at the sort of shared system memory copy of the textures.

With IOSurface, you can get direct access to it, but you have to be careful about synchronization.

If you're going to write into an IOSurface directly with the CPU and then consume it using the GPU, you have to do what we are doing here, you have to lock it, put your data in it, unlock it.

At that point, we realize that you've changed the host memory copy and then you can go off and use it with OpenGL and everything is good.

For the opposite case, where you want to consume some data using the CPU after you've used a GPU to modify it, again, you have to make sure that all of the commands that may have been buffered to that GPU have been flushed and are in flight.

If they're not, the kernel part of IOSurface has no way of knowing how long it has to wait before it can DMA that copy back to host memory so that you can use it, so, again, follow the same synchronization rules as you would before.
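The lock, write, unlock rule can be pictured with a toy model. This is not the IOSurface API: `ToySurface`, its change counter and `gpu_read` are all hypothetical, and they only illustrate why the system needs the unlock to learn that the host copy changed.

```python
from contextlib import contextmanager


class ToySurface:
    """Toy stand-in for a surface with a host copy and one GPU copy."""

    def __init__(self, size):
        self.host = bytearray(size)  # host-memory backing store
        self.seed = 0                # bumped on every unlock
        self._gpu = bytes(size)      # what the GPU currently has
        self._gpu_seed = 0           # seed the GPU copy was synced at

    @contextmanager
    def locked(self):
        """CPU access window: the unlock marks the host copy as changed."""
        try:
            yield self.host
        finally:
            self.seed += 1

    def gpu_read(self):
        """Re-upload only if an unlock has happened since the last sync."""
        if self._gpu_seed != self.seed:
            self._gpu = bytes(self.host)
            self._gpu_seed = self.seed
        return self._gpu


surface = ToySurface(4)
with surface.locked() as pixels:
    pixels[0] = 7                  # write through the lock
seen_after_lock = surface.gpu_read()[0]

surface.host[1] = 9                # write behind the system's back
seen_without_lock = surface.gpu_read()[1]
print(seen_after_lock, seen_without_lock)  # prints 7 0
```

In the toy, the unlocked write never reaches the GPU copy, which is the kind of stale data the locking rule is there to prevent.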

So, let's talk a little bit about some performance tips when using IOSurface.

So, as I alluded to earlier, the automatic synchronization that we do when data is on the wrong GPU, it's easy to use, but it's not asynchronous, you know, if you've got one GPU consuming some data, and you immediately need to use it on the second one, there's a synchronization point there that we just simply can't avoid, and we don't want to give you bad data, so we're going to go ahead and wait for the data to be done before we pull it back to host memory.

One trick you can pull here, and this is a little bit advanced, because you can force IOSurface to page the data back to host memory by performing a lock, one trick you might consider if you want to use IOSurface for doing double buffering between different GPU's, is you could produce some content, and on that same thread immediately do a lock.

What that means is that the GPU is going to do all of its work and then the first thing it's going to do is page it back to host memory so it's ready to go.

Then you could go and do a second frame, do the same thing.

Get a couple of those frames going like that, buffered in host memory, then fire up the second GPU and start consuming the data, that way you get a nice overlap, you know if you go ahead and bind to the IOSurface and the host memory copy is already up to date on a downstream GPU, you won't pay any synchronization penalty for that.

All right, and again, this gives you really good tight control over exactly when that DMA happens.

And again, earlier, I talked, said there's no CPU copies, that's true in this case as well, so you're not going to pay any extra CPU overhead, other than the wait, for getting the data from one GPU to another.

Another really neat trick you can do is, remember in the slide before I showed you that, you know, IOSurface is going to, you know, view that texture based on the format and type.

Well, one thing you might want to do, for whatever reason, is say you've got, you know, a luminance plane of a video, and you want to do something with all the luminance values, like run some kind of filter or something interesting like that, you know, change a gamma setting, something. What you could do is basically lie to OpenGL and say, you know what, this 1920, or let's do something that I can do in my head, this 640 by 480 video frame, it's really 160 by 480 RGBA, even though it's really luminance, and now I can basically process four pixels at the same time in my shader instead of one.

So, the neat thing about this is that you can have different textures all pointing at the same piece of video memory, viewing it as different pixel formats, which again, sort of cool trick for doing image processing stuff.

Now if you're going to do that trick, the total data sizes have to match, you know, if you're going to say it was 640 by 480, four bytes per pixel, then whatever width times the bytes per pixel OpenGL is going to use works out to, it's going to have to work out to that same amount, or things will fail.
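The size-matching rule is just arithmetic on bytes per row. Here is a minimal sketch with hypothetical helper names: it works out the width to claim when the same row is viewed with a wider pixel, and shows four luminance bytes landing in one RGBA texel.

```python
def reinterpret_width(width, src_bytes_per_pixel, dst_bytes_per_pixel):
    """Width to claim so that the bytes per row stay exactly the same."""
    row_bytes = width * src_bytes_per_pixel
    if row_bytes % dst_bytes_per_pixel:
        raise ValueError("row does not divide evenly into the new pixel size")
    return row_bytes // dst_bytes_per_pixel


def as_rgba_texels(luma_row):
    """Group a row of 1-byte luminance samples into 4-byte RGBA texels."""
    return [tuple(luma_row[i:i + 4]) for i in range(0, len(luma_row), 4)]


rgba_width = reinterpret_width(640, 1, 4)
print(rgba_width)                        # prints 160
print(640 * 480 * 1 == rgba_width * 480 * 4)  # prints True: sizes match
print(as_rgba_texels(bytes(range(8))))   # prints [(0, 1, 2, 3), (4, 5, 6, 7)]
```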

So what are some sort of cool examples for using IOSurface and how does it apply to the Multi-GPU stuff?

Well, say your plug-in, you know, you're an application developer and you want to support plug-ins and you're really having this quandary about, well, do I make this CPU based, or do I make it GPU based, and if we're going to make it GPU based, how do we tell them what renderer to use and Oh my God, this is really complicated, what do I do?

If you just say, here's an IOSurface, go modify it and hand it back to me, we'll abstract everything for you.

You could be looking at it with a CPU, they'll look at it with a GPU, they do their thing, they ship it back to you.

Or, if you're really lucky, they're going to use the same GPU you are, and there's not going to be any copies back and forth, so that's pretty cool.

Another really cool thing to use IOSurface for is Client Server applications.

Because we can pass stuff back and forth across process boundaries so cheaply, even keeping them on the same GPU, this is just really good if you need to use like a renderer server type operation, we use this internally in Mac OS X in a couple of situations as well, just to, you know, do things in sort of a secure manner.

And again, even in the Client Server situation any resources that are up on the GPU, will stay there if the downstream sort of client process is using those exact same GPU's, so again, there won't be any copies involved.

Now, probably the coolest thing you could do is combine both.

You could actually run your plug-in, in a different address space, on a different host architecture, and even on a different GPU, and it would all still just work, you know it's a really nice case to say, you know, as an app developer, "Hey, my plug-in guy crashes, it's not going to take down my app, I don't have to care if he's using a CPU to do his work, I don't care if he's using the GPU, everything is all pretty cool."

So, let me give you a real quick demo.

Okay, so this app here, this is the server, he's just generating Atlantis frames and waiting for clients to come and check in with him.

The client checks in with the server, and then the server basically starts sending these Atlantis Frames over to this other application.

Now in this case, they're both on the same GPU, there shouldn't be any transfers going back and forth.

But I can say, "Hey, server, start using the hardware, the other hardware renderer."

Now the system is just automatically still just shipping the frames across GPU's to the other application.

Now again, you can't really see any visual difference.

I can even force this guy to use the software renderer, and it starts writing into the IOSurface directly and the client is still just going.

Now, I wrote this server to actually support multiple clients simultaneously, so I can make a duplicate of this client, start it up, and just to be interesting, I'll force it to run 32-bit, okay, and now the server's running 64-bit in this case.

So I can launch another copy of it and now he's running, you know the ponies are in different positions, but they're basically on two different, or actually in this case let's even make it more interesting, so now, the software renderer is writing into the IOSurface, it has no idea what GPU either of the two clients are using.

The clients, in this case, each one of them is using a different GPU and one of them is even a different architecture than the first, and it all still just works.

I think that's pretty cool, I don't know about you guys.

[ Applause ]

So the code for this, you can check out the sample, it's really not all that complicated, let's see, here's where, you know, I set up a little pool of IOSurface buffers and I am going to do frame rendering in here, I'm using two or four, something like that, a bunch of Mach port goo to get the stuff between the two guys, but for the most part, the server really doesn't have to do too much complicated stuff.

When it starts up, I set up a texture and an FBO with a depth buffer all together so I can render into an IOSurface, I set it up so that if I, you know, had to stretch the IOSurface I get linear filtering, that sort of thing.

Then I have a little routine that lets me render that IOSurface, not that complicated, I bind the FBO that's attached to it, do all the Atlantis fun stuff, and then bind back into the system drawable, then I'd turn around and, just so you can visualize what I just drew, it just draws a copy of that IOSurface back into the window.

Now the client, he knows even less about what's going on.

For the most part, where is it, so again, he doesn't have to set up an FBO, he's just rendering from the IOSurface, it's not a big deal, so I just set up a texture, turn on linear filtering, clamp to edge, a few things like that and go.

I modified the blue pony code a little bit, so that I could pass in a texture name, and a set of dimensions, but that's pretty much all I had to do, and now I can have this guy rendering stuff that was generated on a completely different GPU, in a different process entirely, and everything just works.

Okay, all right, so in summary, please support systems with multiple GPU's whenever possible for you guys.

They're becoming more and more common, they probably, they're not going to go away anytime soon and your users will be happier.

Again, if it's advantageous to your app and you can get a performance win out of it, please try and take advantage of multiple GPU's when they are available as well.

Again, the person who spent money on his big scary Mac Pro is going to be very happy with you.

I know I wish more apps supported it on my system, so please take advantage of it if it helps you.

And lastly, you know, if you're in one of these tricky situations where you can't use OpenGL context sharing, or you need to be in a different process, or you don't want to have to care about what GPU you're on and you want to be using multiple GPU's, IOSurface is a great tool to help let you do that.

And read the sample code, you know, they're not particularly complicated, the whole idea of IOSurface is it's not this insanely complicated API.

It is kind of a big API when you look at the header file, but don't let it seem too daunting, it's really not that big a deal.

So, for more information, please contact Allan Schaffer, he's our Graphics and Game Technology Evangelist at Apple, or check out our Apple Developer Forums, you can ask questions in there and hopefully we can get back to you.

Related sessions, unfortunately, have all happened before this, but please use these for reference, and go and check them out, there's, you know, more details on OpenCL and how to do the sharing in that case, with the previous session to this, some performance, cool performance stuff for Mac OS X.

And with that, thank you very much.

[Applause]
