Harnessing the Power of the Mac Pro with OpenGL and OpenCL

Session 601 WWDC 2014

The new Mac Pro enables you to unleash the power of dual workstation-class GPUs and multiple CPU cores in ways that just weren't possible until now. Gain a deeper understanding of the integration between OpenCL and OpenGL and see how to tap into the parallel compute and rendering power of the Mac.

[ Music ]

Good morning and welcome to the session.

My name is Abe Stephens, and I'm an engineer in the GPU Frameworks Team at Apple, and this morning, I'm going to tell you about how to take advantage of the New Mac Pro Workstation, which is a really exciting platform to work with for graphics and compute applications.

We're going to talk a little bit about the Mac Pro and the hardware that it contains and the hardware that's available for you as a programmer.

Then we're going to take a look at some of the graphics and general purpose GPU compute APIs that you'll use to program on this computer.

And then, at the end of the talk, we're going to take a look at some common patterns that you might follow when you're writing an application to take advantage of the hardware that's available in this configuration.

So let's take a look at the Mac Pro.

And as you can see here, this is actually the Mac Pro tower and the new desktop.

And the tower is actually eight times larger than the new workstation, but it's only about four times heavier.

And so there's a lot of hardware packed into a very small package.

And if you're an application developer, this means that someone can have a very powerful workstation sitting on their desk that can run a pretty high-performance application, in a much smaller form factor than the tower system.

If we take a look at this computer in some detail, probably the most exciting thing as a graphics programmer is that the Mac Pro has two GPUs in every configuration.

And these are two identical devices.

So, in the tower Mac Pro, a customer could configure the system to have multiple GPUs, and there were actually cases in that system where you might end up with two different GPUs from different vendors.

In the new Mac Pro, your application will always have two identical GPUs available to use.

And as we'll see in a little while, that's an advantage because it means that your application doesn't need a lot of logic to query and find out whether the capabilities of the available GPUs are different.

You can write to a single platform.

And then there are some very specific things that you can do to distinguish the two devices.

In this configuration, one of the GPUs is directly connected to the display hardware, and the other GPU isn't.

And so there are some specific things that you can write into your application to take advantage of that, and to make sure that when you are sending graphics work or compute work to one of the GPUs, you know which one is connected to the display.

So inside this configuration, the GPUs have about two thousand stream processors.

That's important if you're working with OpenCL to do general purpose GPU compute work.

The configuration that we'll be looking at today when we show a demo later on has six gigabytes of memory and 3.5 teraflops peak.

And so it's a very capable package for graphics and for GPU compute.

So now I'd like to explain some of the APIs that you can use to program this configuration.

And if you've worked with Apple graphics or compute before, these will be familiar to you.

Actually, you've probably used OpenCL and OpenGL on, say, an ordinary laptop or a single-GPU system in the past.

Maybe you've even worked with a tower Mac Pro and done multi-GPU or multi-display programming.

What I'm going to talk about in this session is the parts of those APIs that you have to pay attention to and use when you are setting up an application for this Mac Pro configuration, because it's a little bit different from some of the other configurations that have been available on the Mac platform in the past.

Okay, so let's take a look at the Software Stack.

So at the top of the Software Stack is your application.

So this is your code that you've implemented and this might be a graphics application that does some type of 3D rendering.

It could be an application that's doing, say, image processing or video processing using OpenCL for GPU compute.

Maybe your application does a little bit of OpenCL and a little bit of OpenGL to fully take advantage of the GPU.

Anyway, the first sort of level of GPU programming that you might have in your app is some Cocoa code that is using an NSOpenGLView or a CAOpenGLLayer.

And I'm going to show you how to configure this level of the software to set up your application to take advantage of the two GPUs in the Mac Pro in the most efficient way.

And then at the next level down in this stack are a number of lower-level APIs.

The CGL API is something that you may be familiar with in the Mac from other systems.

It's a lower level API that you can use to configure the GPU and to find out information about the displays and the hardware that's in the system.

Then there are a few other programming APIs that you're probably quite familiar with, OpenCL and OpenGL. These are the APIs in which you'll end up writing a lot of the code that dispatches work to the GPU, and also the kernels and shaders that are executed on the GPU itself.

And then of course, underneath OpenCL and OpenGL, there are the graphics drivers.

And these graphics drivers are handling things like allocating memory on the devices and moving memory between the host and the device.

And we'll look at certain parts of the driver that end up performing some of the data movement for us, and we'll try to understand exactly how this will impact the way that we design applications.

So now I'm going to talk about OpenCL and OpenGL in some detail.

And most of the programming that we'll see in a little while, is going to focus on this level of the stack.

The CGL, OpenCL and OpenGL layer.

So, on the Mac Pro, we support OpenGL.

It's our accelerated 3D rendering API.

And we support OpenGL 4.1 Core Profile and the shaders that you write that are executed on the GPU are written in a language called GLSL.

And we support version 4.10.

And, OpenCL is the data parallel programming API that the Mac Pro supports.

The version that is supported is OpenCL 1.2, with a number of extensions.

So the Mac Pro has a pretty advanced pair of AMD graphics cards.

And there are a couple extensions that are supported, so on the Mac Pro you can use double precision.

And there's actually an extension that allows you to set the priority of the OpenCL command queues that are used when you enqueue work onto the GPU.

And so for example, on this configuration, you can set up an OpenCL command queue for background-priority work.

Say, for example, you are performing an operation that isn't related to the GUI and can be performed at a lower priority, maybe applying some type of final render in an image processing application. You can send that work off to a lower-priority queue.

And then if higher-priority work comes to the GPU, the lower-priority work won't interfere with, say, rendering work that is being used to display the GUI in the system.

And so, if you take advantage of these extensions, you can work around some of the challenges that we'll talk about in a little while.

Okay, so when you have this computer and the GPUs that are in it, you can do a couple different things actually.

You can take advantage of this second device.

One thing that you can do is take, say, the OpenCL compute portion of your application and simply run it on the second device, and let the primary GPU continue to be used for GUI rendering and maybe for OpenGL 3D graphics.

And this is actually a relatively easy type of operation to perform.

You can sort of offload compute work to the second GPU.

This is kind of similar to taking an application where you were, say, running most of your compute work on the CPU, and deciding to move some of that work from the CPU off to a GPU with OpenCL. Here we would perform the same operation.

We'd take that compute work and just move it to the secondary GPU.

Another common task that you might use the Mac Pro for is off-screen rendering on the second GPU.

So, if you were going to do some type of OpenGL work, and the results of that work don't necessarily have to be displayed every single frame, that's a great candidate for moving to the secondary GPU as well.

And we'll take a look at an example of that in a little while.

So let me tell you about how to set up your application to use this configuration and to take advantage of both GPUs.

And it's important to know that, if you have an application that's running on a Mac and using OpenCL and OpenGL, it will run just fine on the new Mac Pro, but there are a couple things that you can do to make sure that it's using one GPU or the other.

And also a couple things that you can do to make sure that if it's possible, you can divide your application into pieces and run one piece of the work on the first GPU and the other piece of the work on the second GPU with a small number of changes.

So to get started with modifying the application, there are a number of steps that you can go through.

There are four steps.

The first is creating the context that you're using, either the OpenCL context or the OpenGL context, in a specific way.

And if you follow this procedure, the runtime in the system will be able to do a lot of the tasks associated with moving data between the two devices and between the host automatically.

Then, your application should identify which device is the primary GPU and which is the secondary GPU.

Then once you've figured that out, you can dispatch work in a particular way.

And I'll show you how to do that.

The way that you select the GPU and send work to it is a little bit different in OpenCL and OpenGL.

And then after you've done that, you can synchronize data between the two devices or offload data from the secondary GPU to the host's main memory.

So, let's take a look now at context creation.

And it turns out that this process is similar in OpenCL and OpenGL, but there's some specific terminology that I'd like to go over that I think will make the process a little bit more clear.

So before we showed the Software Stack, now we're going to concentrate on the graphics APIs in the stack.

So, OpenGL is our graphics API and, like I said before, CGL is this API that we use.

It's a Mac platform API.

And we use it to set up our OpenGL context, to select devices in the system, and to get some information about the hardware that's in the system, the devices and the renderers.

And we also have OpenCL, and it turns out that, in OpenCL, a lot of the operations that we would perform in CGL, such as learning about displays and devices, are actually included in the OpenCL API itself.

And so as I walk through this, I'm going to show you an example of how to perform an operation using the graphics APIs, OpenGL and CGL.

And then I'll also show you how to perform the same operation using OpenCL.

And in most cases, we're performing very similar tasks or operations.

We're just using two different APIs to do that.

Okay, so the first thing to think about in OpenGL is the notion of a piece of hardware.

And in OpenGL, there are a certain number of renderers in the system.

And each renderer is assigned a render ID number.

So for example, in the Mac Pro, there's going to be one renderer for each of the GPUs, and then there will also be a software fallback renderer that's available on all Macs. The system might fall back to this software renderer if, for example, you were on a configuration that didn't have hardware support for a certain feature.

And so, if you look at the renderer IDs in the system, you'd see two renderer IDs for the two discrete GPUs, and then one for the software renderer.

So when you start to set up an OpenGL context and an OpenGL application, you have to figure out how to select between all of these different renderers and these different render IDs in the system.

And in OpenGL, you do that by putting together a list of attributes.

These are called pixel format attributes.

And these are things like, does the renderer you're looking for support double buffering?

Does it support a certain color format or a certain depth format?

Does it support Core profile or legacy profile OpenGL?

Anyway, you put together this list of attributes and one important attribute for setting up an OpenGL application on the new Mac Pro is the offline renderers attribute.

Now, the Mac Pro has two GPUs and one of those GPUs is always connected to the display hardware that's in the Mac Pro.

The other GPU isn't.

And the terminology for this on the Mac platform is that the display connected GPU is considered "online", and the GPU that's not connected to the display is considered "offline".

Now, it turns out that you know, both GPUs are powered up and both can perform rendering and compute operations, but the terminology is that the one that's connected to the display hardware is "online", and the other one's "offline".

So when we put together a pixel format attribute list, we add an attribute that says that we want to include offline renderers.

And then, when we send that into the system by calling the Choose Pixel Format API routine (we'll look at what that looks like in a second), we're going to get back a list of all the renderers in the system.

In this case, we're going to get both GPUs - the online GPU and the offline GPU and then the software renderer.

Now, the next step in this context creation process is to actually create the context.

And this is really just a container that points to these renderers and associates state with them.

So there's a lot of state in the OpenGL API and that state is associated with this context object.

Now, once we have a context, we need a way for that context to refer back to the renderers.

And this is actually a difference between OpenCL and OpenGL.

In OpenGL, the context assigns virtual screen numbers to each of the renderers that are in the system.

And so here we have a context with three renderers, and we have virtual screen numbers, zero through two.

Now, the last piece of terminology I want to go over in OpenGL before we start talking about OpenCL, is the share group.

If you have a context, now remember, a context is that state container.

If you have a context, and you want to set some state and create some objects, and then maybe you decide that you need another thread to do some OpenGL work with slightly different state, you can create another context that has the same set of renderers in it and can share objects, such as buffers and textures.

And you would obtain a share group from the first context and use that to create or to communicate with a second context.

So a share group in OpenGL terminology and CGL terminology is this entity that lets you bridge two OpenGL contexts.

And that's important because, as we'll see in a second, we can also use that share group to share objects and state between OpenGL and OpenCL.

Okay, so let's look at what the equivalent operations and components are in the OpenCL API.

Now, OpenCL has device IDs which are kind of like renderers.

A CL context object, that's a lot like that GL context, and then a command queue, which as it turns out is a little bit like the virtual screen.

We use it in a similar way.

The API is presented in a slightly different fashion, but these operations are very similar.

Now, our CL context is a lot like a GL context.

It turns out, when we set up our CL context, we actually are going to set it up using, in some cases, a share group that we obtain from a GL context.

And I'll show you what that looks like in just a moment.

Okay, so now that I've described the terminology (and remember, there's a lot of terminology in OpenGL and some in OpenCL, but the two APIs are performing very similar operations), let's take a look at the API that you have to use when you start to set up your application for working on the Mac Pro with OpenCL and OpenGL.

So, let's say that we're in an application that's using an NSOpenGLView in Cocoa.

Now, I'm going to create an NSOpenGLView and I'd like to use the Core profile.

So I'd like to use the newest features in OpenGL.

And so I'd like to make sure that I get a Core profile OpenGL context.

And in order to do that with an NSOpenGLView, I have to implement my own NSOpenGLView class that is derived from the Cocoa base class.

And then I would implement my own version of the initWithFrame method, and in that method, I'm going to set up my pixel format attribute list.

And as you can see here, at the top, I included the Core profile attribute, and then at the very bottom, and this is the important piece for the Mac Pro, I also said that I wanted to allow offline renderers.

And now when I create a GL context using this pixel format attribute list by passing it up to the superclass, I'll get an initialized context object that has all of the devices in the system.

And that's really the important part.

Okay, let's see how to do that in OpenCL.

Well, if you're in an ordinary OpenCL application, that is, an application that is just going to do some OpenCL programming with no sharing between OpenCL and OpenGL, the easiest way to get a context that has all the GPU devices in it is just to create a context from a device type.

So here, I'm calling clCreateContextFromType and I'm asking for CL_DEVICE_TYPE_GPU.

That's going to give me a CL context that contains all of the GPUs in the system.

On the Mac Pro, that means that I'm going to get a context that has two device IDs, one for each of the discrete GPUs.

Now, if we were in an application that was going to do some OpenCL and some OpenGL, and those operations were going to interact with each other, I'd want to create a context in a slightly different way.

So here, I've already set up my NSOpenGLView, as on the previous slide.

And then I would obtain the context object from that NSOpenGLView.

And then you can see here, I'm using the CGL API to get the share group that is associated with that GL context.

Then I take the share group. Remember, the share group is that entity that we use to create a pair of contexts that operate on the same objects and use the same devices.

I use that share group now with clCreateContext and another property list to create a CL context that contains the same devices that were in that original GL context.

And now I end up with my CL context here, C, that contains all the GPUs in the Mac Pro.

And that's really the important part.

It's very important that I always create a context, in either API, that contains all the devices in the system. In this case, I've created a GL context and then a CL context that both do.

Okay, so now that I've done this, now that I have this very versatile and flexible context, the next step is to take a look inside it and figure out which device corresponds to the primary GPU and which device corresponds to the secondary GPU.

And that's important because, if I'm doing a task whose results are going to be displayed on the screen, it might make sense for me to use the primary GPU first, because it's directly connected to the display hardware.

And then maybe if I have a task that isn't related to the GUI or isn't related to the display, I might want to always send that task to the secondary GPU.

So if I look at the OpenCL API and the OpenGL API, it turns out there are a number of different queries that I can make, but since the two devices in the Mac Pro are identical, all of those queries, all those CL device info queries, are going to return exactly the same information for both devices.

In order to distinguish the two devices, we have to do something different.

So what we're trying to do here is figure out which GPU is the online one (that's the primary GPU) and which GPU is the offline one (that's the secondary GPU).

And then we're going to try to figure out what its virtual screen number is if we're doing OpenGL work, or what the CL device ID is for it if we're doing OpenCL work.

Okay, so let's walk through some code here.

This is the process that you go through to decide which GPU is the primary GPU or the secondary GPU.

In this particular example, I'm going to be looking for the secondary GPU.

So I'm going to go through a bunch of steps here where I issue some queries against the system to figure out which GPU is the offline GPU.

So the first thing that I do is I iterate the renderers in the system.

I obtain these using this CGLQueryRendererInfo call.

I iterate over all of the renderers and I ask the system, "Is the renderer online or offline?"

So once I get past this step, I'll know whether I have the GPU that's connected to the display, the online one, or the one that's offline.

Of course, as I mentioned earlier, there are some other renderers in the system.

There's the software renderer and I have to make sure that I am able to distinguish between the offline GPU and the software renderer and so, we'll do that in just a second.

So, if I find the GPU that's offline, I then check to see if it supports accelerated compute.

This is basically saying, does it support OpenCL?

And now, in the Mac, the OpenCL API actually does have a CPU device, but it's presented to the system differently than the software renderer.

Those are two separate entities within the system.

And so if I've iterated onto the renderer ID for the software renderer, it won't match this accelerated compute query.

And so I'm able to rule it out by making this check.

And then, if I get past that step, I'm going to issue another query here using CGLDescribeRenderer and I'm going to ask for the renderer ID.

So I started by getting a renderer info object.

I then walked over all of the renderers that were in the object.

And then filtered them using a number of other queries, and eventually ended up querying them for their render ID number.

And in this case, I found the secondary GPU's renderer ID, and I'm going to store that in a variable.

And we'll use that in a little while.
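
The filtering just described can be sketched in C roughly as follows. This is a minimal sketch, not the code from the slide: the `RendererDesc` struct and the helper names `find_secondary_renderer_id` and `query_renderers` are hypothetical, and the selection logic is kept apart from the CGL queries so that it can be checked without the hardware. The CGL part assumes a display mask of `0xFFFFFFFF` to ask about every display, and it only compiles on the Mac.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical plain-data view of one renderer's properties. */
typedef struct {
    int  renderer_id;         /* value reported for kCGLRPRendererID */
    bool online;              /* value reported for kCGLRPOnline */
    bool accelerated_compute; /* value reported for kCGLRPAcceleratedCompute */
} RendererDesc;

/* Returns the renderer ID of the first renderer that is offline and supports
 * accelerated compute, or -1 if there is none. The software renderer fails
 * the accelerated-compute check, so it is skipped. */
int find_secondary_renderer_id(const RendererDesc *r, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (!r[i].online && r[i].accelerated_compute)
            return r[i].renderer_id;
    }
    return -1;
}

#ifdef __APPLE__
#include <OpenGL/OpenGL.h>

/* Fills `out` with up to `cap` renderer descriptions queried through CGL
 * and returns how many were written. */
static size_t query_renderers(RendererDesc *out, size_t cap)
{
    CGLRendererInfoObj info = NULL;
    GLint count = 0;
    size_t n = 0;

    /* Assumption: 0xFFFFFFFF covers the renderers for every display. */
    if (CGLQueryRendererInfo(0xFFFFFFFF, &info, &count) != kCGLNoError)
        return 0;

    for (GLint i = 0; i < count && n < cap; ++i) {
        GLint online = 0, compute = 0, id = 0;
        CGLDescribeRenderer(info, i, kCGLRPOnline, &online);
        CGLDescribeRenderer(info, i, kCGLRPAcceleratedCompute, &compute);
        CGLDescribeRenderer(info, i, kCGLRPRendererID, &id);
        out[n++] = (RendererDesc){ (int)id, online != 0, compute != 0 };
    }
    CGLDestroyRendererInfo(info);
    return n;
}
#endif
```

On a Mac Pro, you would call `query_renderers` once and pass the result to `find_secondary_renderer_id`. Flipping the `online` test in the helper would give you the primary GPU instead.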

So now I have a renderer ID but in order to actually select or send work to a GPU in the system, I need to know its virtual screen number because, if you recall, the virtual screen number is how the context refers to the different renderers that it contains.

And so here what I'm going to do is I actually have to have a context in order to have virtual screen numbers.

So I'll get a context from my NSOpenGLView, and then for each virtual screen in the context, I'll check its number and also its renderer ID.

So here I am, getting the number of virtual screens that are available.

And then the next step is to walk over those virtual screens, make them current, and then ask for the renderer ID associated with each virtual screen.

So I've iterated over all the virtual screens, gotten their render IDs and then matched those with the renderer ID that I'm looking for, and that tells me the virtual screen number that corresponds to that particular GPU.
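
Here is a rough C sketch of that lookup, under a few assumptions. The helper names `virtual_screen_for_renderer` and `lookup_virtual_screen` and the `MAX_SCREENS` cap are hypothetical, and the matching step is split out as a plain function so it can be checked on its own. Real code may want to mask both IDs with `kCGLRendererIDMatchingMask` before comparing; the sketch compares raw values to keep things short. The CGL part only compiles on the Mac.

```c
#include <stddef.h>

/* Given the renderer ID behind each virtual screen (array index = virtual
 * screen number), returns the virtual screen that uses `wanted`, or -1 if
 * no virtual screen uses that renderer. */
int virtual_screen_for_renderer(const int *screen_ids, size_t count, int wanted)
{
    for (size_t i = 0; i < count; ++i) {
        if (screen_ids[i] == wanted)
            return (int)i;
    }
    return -1;
}

#ifdef __APPLE__
#include <OpenGL/OpenGL.h>

#define MAX_SCREENS 16 /* arbitrary cap for this sketch */

/* Walks the context's virtual screens, collects the renderer ID behind
 * each one, and returns the virtual screen matching `wanted`. */
static int lookup_virtual_screen(CGLContextObj ctx, GLint wanted)
{
    GLint count = 0, original = 0;
    int ids[MAX_SCREENS];
    size_t n = 0;

    CGLDescribePixelFormat(CGLGetPixelFormat(ctx), 0,
                           kCGLPFAVirtualScreenCount, &count);
    CGLGetVirtualScreen(ctx, &original);

    for (GLint i = 0; i < count && n < MAX_SCREENS; ++i) {
        GLint id = 0;
        CGLSetVirtualScreen(ctx, i); /* select this virtual screen */
        CGLGetParameter(ctx, kCGLCPCurrentRendererID, &id);
        ids[n++] = (int)id;
    }

    CGLSetVirtualScreen(ctx, original); /* restore the caller's selection */
    return virtual_screen_for_renderer(ids, n, (int)wanted);
}
#endif
```

Doing this lookup once, right after creating the context, and caching the two virtual screen numbers avoids hard-coding zero or one anywhere in the application.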

Okay, so it's important to always check virtual screen numbers.

So in the example that we just looked at, when I walked through that and actually executed that code, it turned out that the primary GPU was actually virtual screen one.

And so if I had just assumed that you know, virtual screen zero would be the primary because primary comes before secondary, I would have been wrong and I might have ended up rendering all of my work on say the secondary GPU instead of the primary GPU.

Now, the Mac is very flexible.

It actually can handle this case.

It's just not as efficient as, say, rendering all that 3D work to the primary GPU and then displaying it immediately.

Okay, so that's how you do it in OpenGL.

Let's take a look at how to do the same set of operations in OpenCL.

So, in OpenCL, I would have started with the CGL API and gone through the same process of figuring out which renderer ID in the system is the secondary GPU. And if I had flipped that process around, I could have determined which one was the primary GPU.

Now I have to go from a renderer ID that I obtained from CGL to a CL device ID.

And in Yosemite, there's an API that we can use that will convert a renderer ID directly to a CL device ID, and that function is CGLGetDeviceFromGLRenderer.

I pass in the renderer ID and it gives me back a CL device ID.

Then I can use that CL device ID to create a command queue and dispatch work directly to that GPU.

And so instead of having to do a query for virtual screens in the OpenCL API, I can just create a command queue and then use that command queue to directly dispatch work, in this case, to that secondary GPU.

Okay, so the next step is dispatching work.

So in OpenGL, the context - the GL context - refers to the renderer or interacts with the renderer via this virtual screen number.

And to set the virtual screen (we saw an example of this earlier), if I'm going to issue some draw calls to one of the devices, the first thing I have to do is make sure that the context that I created is the current context.

So I'll call CGLSetCurrentContext and pass in the context that I'm interested in working with.

Then once that context is set, I can set the virtual screen number and, like I said a couple slides ago, it's really important and I can't emphasize this enough, to make sure that you know which virtual screen corresponds to the primary GPU and the secondary GPU as opposed to just assuming that the first one or the second one is always the primary or the secondary.

Anyway, I can call CGLSetVirtualScreen and pass in the number that I want, and then issue my bind calls and my draw calls in OpenGL.

In OpenCL, instead of having to set a virtual screen, I just use a command queue.

And so here, I'm not setting state.

Instead what I'm doing is I'm creating an object, this queue object, and then using that queue object to enqueue work to a particular device.

And so there are no bind calls in OpenCL.

Here I'm just creating a number of objects.

There was already a kernel here.

I set some arguments on it.

I have a command queue that I've created based on the device ID that I looked up using the process that we just described.

And I can queue work to that GPU.

Okay, now that I've created a context that has two GPUs in it, then identified the primary GPU and the secondary GPU, dispatched work using a virtual screen or a CL command queue, the last step is to get results or to get the data off the GPU that I've selected and to use it in my application.

And of course, in an OpenGL application, the results might be displayed by the primary GPU.

In an OpenCL application, if you were doing CL-GL sharing, you might end up sending the results from the secondary GPU to the primary GPU in order to render them, or you might download the results from the secondary GPU to host memory if you were working on something that wasn't related to rendering or to display.

And either of those techniques is possible.

So if you're in a CL-GL sharing case, there are a couple things that you have to do that we'll look at in a second, but the runtime is going to do most of the work for you.

If you follow a specific procedure, after you've dispatched the work to one GPU, when you start using that work on the second GPU, the runtime and the driver will take care of moving the data between the two devices.

And this is a great advantage because it means that we can very easily use the secondary GPU in our application, but we have to follow certain rules to make sure that the system behaves efficiently when we move data between the two devices.

So let's take a look at what to do when we're switching work from one GPU to another.

So, let's say that we were going to do some OpenGL work on the secondary GPU.

So we'd call CGLSetVirtualScreen and we'd pass in the secondary GPU's virtual screen number.

And then we would bind some objects, maybe some textures that we're going to work with, and do some drawing.

And then we would call glFlushRenderAPPLE(), and that's going to cause all of that GL work to be submitted to the device. It's going to push all of that work off to the GPU, and the GPU will start working on it.

At some point in the future, we're going to want to use the results that we had computed.

Maybe we are rendering into an FBO or something.

We want to use those results on the primary GPU.

And so, we're going to call CGLSetVirtualScreen, pass in the primary GPU's virtual screen number, and then start working with the data on the primary GPU.

Now, in this case, there is a single OpenGL context.

And because there was a single OpenGL context, it wasn't necessary for me to change the state or to re-bind the objects that we are working with.

That state was already set.

I simply changed the virtual screen and continued using the objects - the GL objects - that I was working with previously.

And that CGLSetVirtualScreen call allows the runtime to realize that I'm going to start sending work to the other GPU.

And it will take care of synchronizing the data that I wrote on the other device over to the device that I'm going to start using when I issue the next draw call.

Okay, so let's take a look at what that looks like in a schematic.

So I've taken some graphics work and I've issued a bunch of draw calls on the secondary GPU.

I call glFlushRenderAPPLE() and the runtime pushes all of that work onto the device.

Now, if the runtime were going to issue any more commands, for example if the runtime decided that it had to issue a page-off command, that page-off command would be sitting behind all the work that I'd previously flushed to the device.

And that's exactly what happens.

So when the runtime detects that CGLSetVirtualScreen call going to the primary GPU, it in turn will page off the data from the secondary GPU, after those previously flushed commands have been executed, and then page it on to the primary GPU so that I can execute my draw calls and continue working with the data on the primary GPU.

So the movement of the data takes place automatically, and as a programmer, I've made sure that the data movement happens in the right order, after the commands that I used to create the data, by calling glFlushRenderAPPLE().
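
To make the ordering argument concrete, here is a toy model in C. It is only an illustration under strong assumptions and says nothing about how the driver is really implemented: the `ToyGpu` type and the `toy_*` functions are hypothetical, `toy_flush` merely stands in for glFlushRenderAPPLE(), and `toy_switch_away` stands in for the runtime reacting to a CGLSetVirtualScreen call by queuing a page-off.

```c
#include <stddef.h>

typedef enum { CMD_DRAW, CMD_PAGE_OFF } Cmd;

enum { MAX_CMDS = 16 };

/* Toy model of one GPU: commands the app has issued but not yet flushed,
 * and commands already submitted to the device, in execution order. */
typedef struct {
    Cmd    pending[MAX_CMDS];
    size_t n_pending;
    Cmd    device[MAX_CMDS];
    size_t n_device;
} ToyGpu;

/* The app issues a draw call; it stays pending until a flush. */
void toy_draw(ToyGpu *g)
{
    if (g->n_pending < MAX_CMDS)
        g->pending[g->n_pending++] = CMD_DRAW;
}

/* Stand-in for glFlushRenderAPPLE(): submit all pending work to the device.
 * Commands that do not fit in the toy device queue are simply dropped. */
void toy_flush(ToyGpu *g)
{
    size_t i = 0;
    while (i < g->n_pending && g->n_device < MAX_CMDS)
        g->device[g->n_device++] = g->pending[i++];
    g->n_pending = 0;
}

/* Stand-in for the runtime noticing a switch to the other GPU: the page-off
 * goes onto the device queue behind whatever was already submitted. */
void toy_switch_away(ToyGpu *g)
{
    if (g->n_device < MAX_CMDS)
        g->device[g->n_device++] = CMD_PAGE_OFF;
}

/* Counts the draws that execute before the first page-off (or all submitted
 * draws if no page-off has been queued). */
size_t toy_draws_before_page_off(const ToyGpu *g)
{
    size_t draws = 0;
    for (size_t i = 0; i < g->n_device && g->device[i] != CMD_PAGE_OFF; ++i)
        ++draws;
    return draws;
}
```

In this model, two draws followed by a flush and then a switch leave both draws ahead of the page-off. Skipping the flush leaves the draws pending, so the page-off would move data that has not been produced yet, which is the situation the flush is there to prevent.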

Okay, so in OpenCL, it's a little bit different.

In OpenCL, we have command queues instead of virtual screens, and we're going to do something that's very similar.

We're going to enqueue work using a command queue that we created on the primary GPU and then flush it using just clFlush.

That will cause that device to start working on the commands that we enqueued.

And then when I enqueue work to the secondary queue, once that work gets to the head of the command queue, the system will execute a similar page-off operation that in this case is going to be guaranteed to be behind the work that was sent to the primary queue.

And so we'll see a similar type of behavior to what we saw in the GL case, where I made sure that the page-off would arrive at the GPU after it had already started working on the operations that were producing the data.

So, on Mac, you may have heard of a pattern called Flush and Bind.

And this is a pattern of APIs that is used in multiple GPU situations and in instances where there is more than one OpenGL context.

It's also used in a situation where you have an OpenGL context that's, say, producing the data and an OpenCL context that's consuming the data.

So, any instance where on Mac you have two different contexts, you have to use Flush and Bind.

And what that means is that after you enqueue the work that's doing the production, that's producing the texture or maybe some geometry, you always have to make sure that you flush it.

You flush that command queue or you flush before switching virtual screens.

And then after that, when you switch to the other API or to the other context, you have to make sure that you rebind any objects that were modified, in this case, by OpenCL.

So in the single-context instance that we looked at before, when we just had one OpenGL context, we could flush and then immediately use the objects on the other device.

In an instance where there are either two OpenGL contexts or there's an OpenGL context and an OpenCL context, we have to use Flush and Bind.

We have to flush like we did before, but then we have to rebind those objects once we switch to the other device.

So if you follow these steps, the runtime will take care of moving this data between the two GPUs for you as you work on the data in those two different places.
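The rule just described can be summarized as a small lookup. This is a sketch only: the `Handoff` enum, the `STEP_*` flags, and `required_steps` are hypothetical names invented for illustration and are not part of OpenGL, OpenCL, or CGL.

```c
#include <assert.h>

/* Hypothetical labels for the hand-off situations described in the talk. */
typedef enum {
    HANDOFF_SAME_GL_CONTEXT,  /* one OpenGL context, virtual screen switched */
    HANDOFF_TWO_GL_CONTEXTS,  /* producer and consumer are different GL contexts */
    HANDOFF_GL_AND_CL         /* one side is OpenGL, the other is OpenCL */
} Handoff;

enum { STEP_FLUSH = 1, STEP_REBIND = 2 };

/* Returns a bitmask of the steps needed when handing data to the other side. */
int required_steps(Handoff h) {
    assert(h <= HANDOFF_GL_AND_CL);
    int steps = STEP_FLUSH;             /* the producer always flushes */
    if (h != HANDOFF_SAME_GL_CONTEXT) {
        steps |= STEP_REBIND;           /* two contexts: rebind modified objects */
    }
    return steps;
}
```

With a single OpenGL context the result is just `STEP_FLUSH`. In both two-context cases it is `STEP_FLUSH | STEP_REBIND`, which is the Flush and Bind pattern.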

And the reason that the runtime's able to do this, is that you've created a context that contains all the devices in the system, and so that the runtime and the driver are able to track enough state to perform these operations for you.

And so it's very important to emphasize that when you create a context, always create a context that contains all the devices in the system, all of the GPU devices in the system.

There are some other design patterns that you might follow.

For example, you might be tempted to create a set of objects - a context, a command queue, a whole stack of objects - one stack per device in the system.

But on Mac, really the best thing to do is to always create the context to contain all of the devices in the system, even if, say, on a different configuration, you're only going to use one of the devices.

If you do this, it will be very easy when you move your application onto the Mac Pro to start using two GPUs because the application and the structure of the program has already been written to handle a context that contains both devices.

It makes it a lot easier to migrate to the system and allows the runtime to move objects between the two GPUs for you.

Okay, so now that I've shown you the API that's involved in programming for the Mac Pro, I'd like to show you some programming patterns.

And what I'm going to focus on here is what the system does, or what the system's doing on your behalf, when you perform different tasks on the GPU.

So, what I'm going to start with is an example of an offline or an offload task where you have some kind of operation that isn't related to display, and you're going to perform this operation on the secondary GPU.

And so, I call this an offline task.

And you might have an offline task in your application if it's, say, something that you apply once, at one set point in time.

So for example, say you have an image processing application, and the user goes to the Edit menu, selects a filter, changes some filter parameters, and clicks Apply. That might be a great offline task because you're not going to perform the bulk of that, say, OpenCL compute work until the user clicks Apply.

Then you're going to perform a large amount of work on some input data, on some giant image.

And then once you're done, you're going to, say, save that image off to main memory or maybe you're going to save that image off to disk.

And that operation is a discrete operation.

It takes a long time.

If you were to run it on, say, the main thread, it might cause the GUI to respond more slowly.

And it's something that's separate from the main, sort of, GUI loop of the application.

So let's take a look at what I'm talking about here.

So I have some OpenCL work in green, and this OpenCL work is going to apply my operation.

And it might take a long time.

And then I'm going to first have some OpenGL work that's going on that's related to my GUI.

And my application actually may be using OpenGL on the GPU if I'm using certain parts of the UI, even if my application itself doesn't use OpenGL.

Now, what this looks like is, we might have a lot of sort of short or inexpensive OpenGL operations being performed.

And then we have this giant compute operation.

And of course, if I then have some more GUI-related OpenGL work coming through, what's going to happen is, my system's going to lag.

I might end up with some sort of progress problem or maybe I'll even get a beachball, if this OpenCL program or this part of my OpenCL application takes too long.

And so what we're going to do is really just take that green box, the expensive OpenCL compute operation, and move it over to the secondary GPU.

And if we've set up our application so that we have both devices in our context, and we followed the API that we just described, it's very easy to perform this offload task and move an offline operation off to the secondary GPU.

So here's what this looks like.

It's very straightforward.

I have an application here.

The user is going to go into my Edit menu and select Apply Effect.

I end up in an action here.

And I have a kernel that I'm executing iteratively a large number of times.

This makes the application a little bit slower.

And I'm doing this right now in the primary queue, and all I'm going to do is make sure that I've set up the secondary queue and just send that operation off to the secondary queue.

And now, at some point in the future, after these operations are finished, the existing code in my application to move the data off the GPU and back to disk will just move that data, those memory objects, off the other GPU instead of the primary GPU that I was previously using.

Another pattern that you might end up following is an instance where you're going to perform graphics work on both GPUs.

And once you've divided the work, the rendering work, between the two devices, the window server actually will take care of copying data from the secondary GPU to the primary GPU for display.

So I'm going to show you what this looks like and what happens here.

And actually in a second, I'll show an example where we perform these kinds of operations in an actual application.

So, here I have an application's app thread, and it's going to make the context current.

It will set the virtual screen to the primary.

It's going to call the drawScene method and that's going to do a lot of OpenGL work.

And then I'm going to call glFlushRenderAPPLE and maybe I call flushBuffer to put the work on the screen.

And now I'm going to do the same thing on the secondary GPU and also call flushBuffer.

In this example, I might have two separate parts of my application and I'm going to render one part on one GPU and the other part on the other GPU.

Now, what happens here, if I look at the operations that are being performed, at some point in time, both of these GPUs are going to get flush commands and then flushBuffer commands.

And the window server is going to wake up and it's going to realize as it's getting ready to composite the image for the next frame, that some of the data it needs is on the secondary GPU.

And so it's going to actually perform a very similar operation to what we saw a little while ago with the page-off.

It's going to realize that the data's on the other device, send a page-off request for it to that device, move the data back, then page it on to the primary GPU, perform the composite, and then display the image.

And so, there's a period of time here where the window server has gotten involved for display and it's going to end up executing that page-off and then the secondary GPU is free to continue working on more graphics work.

But the primary GPU is going to have to end up copying the data back on, and then rendering the composite.

And so there's a certain amount of overhead that your application has to be aware of when you're performing work on both devices.

But the system, if you follow this API, the system will handle this for you and your application will be able to use both devices.

Okay, so the challenge here, when we're taking an application and modifying it to use this configuration, is really that we have to divide our work somehow.

We might divide it by taking a task that isn't related to display and moving it to the secondary GPU or maybe we can parallelize the work between the two devices in such a way that the overhead is not a problem.

And we have to always be aware that there is one device that's connected to the display, and the other device that isn't.

And so that can be a source of overhead, especially if the data has to be moved back and forth very frequently.

So, I'd like to take a second here and talk about some other situations that involve multiple GPUs on our platform.

So probably the most common multi-GPU situation is a laptop that has two GPUs: a discrete GPU and an integrated GPU.

And it's important to remember that although the APIs that you use when you modify an application to support automatic graphics switching are similar, the way that you use those APIs is a little bit different.

And so, if you are interested in supporting both GPUs in a laptop, be sure to take a look at the automatic graphics switching feature.

Also, the tower Mac Pro configuration is still out there, and your application may run on it.

And it is something that can support multiple displays connected to multiple GPUs, and there's a lot of infrastructure in Mac graphics for handling instances where an application moves from one GPU to another, based on the display that's connected to the GPU.

And that uses a different set of APIs and callback mechanisms.

And so, if you're concerned about that situation, you should take a look at that documentation.

So let's look at a complete example of everything that you have to do in an application.

The first step of course was to create a context that contains all the GPUs.

The second is to check and see if you have an offline device.

So if you're on a Mac Pro, basically you should have one device that's online and one device that's offline.

Then check to see if you have two identical devices - in this case, I'm looking at device names.

And if I fail any of these checks, it might mean that I'm in one of those other situations.

I might be in an instance where I have a laptop that has an integrated and discrete GPU that are different devices, or maybe I'm in an instance where I have a tower Mac Pro and I have two different - two displays connected to two different GPUs.

Once I've determined that I'm on a new Mac Pro, then I have to be concerned about dividing work between the GPUs and then synchronizing the results.

So, if I fail that first check, I'm going to go and take a look at supporting multiple displays.

And if the second check doesn't work out, then I might have to be concerned about automatic graphics switching.
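The sequence of checks described here can be sketched as a small classification function. This is an illustration under stated assumptions: `GpuInfo`, `GpuConfig`, and `classify` are hypothetical names, and in a real application the name and online flag would be filled in from the system's renderer or device queries rather than written by hand.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical summary of one GPU; this struct is not an Apple API. */
typedef struct {
    const char *name;  /* device name string */
    int online;        /* 1 if the GPU drives a display, 0 if it is offline */
} GpuInfo;

typedef enum {
    CONFIG_SINGLE_GPU,
    CONFIG_MULTI_DISPLAY,       /* first check failed: no offline device */
    CONFIG_GRAPHICS_SWITCHING,  /* second check failed: device names differ */
    CONFIG_DUAL_IDENTICAL       /* both checks passed: new Mac Pro style */
} GpuConfig;

GpuConfig classify(const GpuInfo *gpus, int count) {
    assert(gpus != NULL || count == 0);
    if (count < 2) return CONFIG_SINGLE_GPU;

    /* First check: is there an offline device? */
    int offline = 0;
    for (int i = 0; i < count; i++) {
        if (!gpus[i].online) offline++;
    }
    if (offline == 0) return CONFIG_MULTI_DISPLAY;

    /* Second check: do all devices report the same name? */
    for (int i = 1; i < count; i++) {
        if (strcmp(gpus[0].name, gpus[i].name) != 0) {
            return CONFIG_GRAPHICS_SWITCHING;
        }
    }
    return CONFIG_DUAL_IDENTICAL;
}
```

Only the `CONFIG_DUAL_IDENTICAL` result should lead into the work-division path; the other results point at the multiple-display or automatic graphics switching cases mentioned above.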

Okay, so let me show you a multi-GPU example now.

And this demo is a system that's performing some OpenCL work and some OpenGL work.

And we're performing the OpenGL work on the primary GPU, and the OpenCL work, in the demo mode that we'll see in just a second, is going to be performed partially on the primary GPU and then also on the secondary GPU.

So when I launch this application, this is performing a physics simulation where there's a particle system that's being rendered in OpenGL and it's being simulated in OpenCL.

And right now, we have a large number of particles.

We're performing the physics simulation and we're using one GPU.

And we're getting about 15 frames per second.

This is a relatively fluid animation, but it turns out we can do much better if we go over here and reconfigure the demo to use two GPUs.

And here you can see that we're doing about 30 frames per second, maybe a little bit more than 30 frames per second, and we were able to accomplish that by working through our application and making sure that we enable the application to move data between the two GPUs.

And then we found a way of dividing the data and the computation in such a way that the speedup that we obtained from dividing our work and executing with twice the amount of GPU capability, that speedup was a lot greater than the overhead of having to move a small amount of data back to the primary GPU for display.
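The trade-off between the speedup and the copy-back overhead can be written down as a rough cost model. This is a simplification for illustration, not a description of how the demo measures anything: the function names are hypothetical, the even split across GPUs is an assumption, and the numbers in the note after the code are made up rather than measured.

```c
#include <assert.h>

/* Frame rate under a simple model: per-frame work is split evenly across
   `gpus` devices, and with more than one GPU a fixed transfer cost is paid
   to bring results back to the display GPU. Times are in milliseconds. */
double frames_per_second(double work_ms, int gpus, double transfer_ms) {
    assert(work_ms > 0.0 && gpus >= 1 && transfer_ms >= 0.0);
    double frame_ms = work_ms / gpus + (gpus > 1 ? transfer_ms : 0.0);
    return 1000.0 / frame_ms;
}

/* Splitting pays off only when the time saved exceeds the transfer cost. */
int split_is_worthwhile(double work_ms, int gpus, double transfer_ms) {
    return frames_per_second(work_ms, gpus, transfer_ms) >
           frames_per_second(work_ms, 1, 0.0);
}
```

With illustrative inputs of 60 ms of work and 5 ms of transfer, the model gives 1000 / 35, or about 28.6 frames per second, on two GPUs against about 16.7 on one. If the transfer cost rose to 40 ms, the split would no longer be worthwhile, which is why the amount of data moved back for display has to stay small.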

So here, I guess towards the end, we're actually getting even faster than at the very beginning of the simulation.

So this is a relatively simple example.

Let's see if I can restart here.

This is a simple example.

And without a huge amount of effort, we're able to divide our application into pieces and then execute the compute part on both devices and obtain a significant speedup.

[ Silence ]

Okay, so in the demo what we saw was an application that we had modified to perform some of the work on one GPU and a lot of the work on the secondary GPU.

And we saw that that produced a pretty significant speedup.

There are a lot of other applications that can benefit from working on this configuration and I hope that using this API and understanding some of the terminology and the way that the system behaves will help you port applications to this configuration.

For more information about using OpenCL and OpenGL on the Mac Pro, please talk to the WWDR representatives.

Thank you very much for attending this session and please let us know how we can help you.

[ Applause ]

[ Silence ]
