Harnessing OpenCL in Your Application

Session 416 WWDC 2010

OpenCL is a groundbreaking technology that lets your application tap into the vast computing power of modern graphics processors and multicore CPUs. Understand the OpenCL architecture, its approachable C-based language, and its powerful API for managing parallel computations. Discover how to refactor your application code to unleash the full computational capabilities of the Mac OS X platform.

My name is Ian Ollmann and I'm the first speaker in a series that will talk to you about OpenCL.

OpenCL is Snow Leopard technology which was just released a couple of months ago, this fall.

And, I wanted to give you an overview of what OpenCL is all about before we get started on the more in-depth talks.

So, first I want to discuss OpenCL design and philosophy, and then explore some common developer questions that have emerged over the last year.

Some of you may have seen them on the developer forums.

And then, offer up a few debugging tips on how we go about tuning our OpenCL kernels.

Now, since OpenCL has been out for a while, I'm going to assume that many of you have already played with it.

Maybe you've downloaded some of the sample code and looked at it, read through the spec, maybe even tried to write a few applications of your own.

And so, what I wanted to do is provide an overview to kind of tie it all back together, because there are a lot of APIs and a lot of spec to read, and sometimes seeing the forest for the trees is a little tricky.

So, when we set out to build OpenCL, what we wanted to do is bring programmability to all of the devices on your system.

One API to bind them all; CPU, GPU.

If we had them, you could also program an accelerator.

And, this brings new problems into the programming paradigm.

Some of these devices may have their own memory attached to them, which sits off in a separate address space and does not fit together contiguously with the RAM you're used to using on your system in your ordinary C or C++ or Objective-C program.

Many of these devices run on a different instruction set than the Intel instruction set that you're used to.

So, we have to overcome these problems when we build OpenCL.

So, OpenCL, in a nutshell, is sort of the minimum set of objects that you would need to encapsulate this information and make it all work.

We have an object which represents a device.

This might be a CPU or a GPU.

We have a context, which does not do very much; it's just a sandbox to hold everything else.

This serves two purposes.

It allows you to keep the damage localized if something goes horribly wrong, which we hope won't happen.

And also, if you're doing data sharing between OpenCL and OpenGL, it acts as a counterpart to the CGL share group.

You'll want to put your data into something.

Because we have to copy the data up to a device sometimes, we need to know how big your data is, so a simple C pointer is not enough; we need an extent.

So, there are objects designed to encapsulate your data.

There are two kinds.

There is a buffer, which is almost exactly like what you'd get out of malloc.

When you call malloc, it's essentially a range of bytes; what you put in it is up to you.

And then, there's the image data type, which is useful for sampled data in a regularly spaced grid, and images are designed to be used by the GPU texture unit, which has hardware to make sampling out of the image much faster.

Collectively, these things are called MemObjects.

You also want to write code, so OpenCL provides a C-like programming language.

We essentially started with C99, and then we tagged on a few extra features, like vectors and vector intrinsics, and we stripped out a little bit of the C standard library that didn't make any sense for GPUs.

And then, so you'll build a program against multiple devices, and then you'll need some sort of function pointer-like thing to go find the functions within that program or compilation unit, and these things are called kernels.

And, you can make many kernels for the same function if you want, and that actually turns out to be quite useful.

Finally, we have to have some verbs in this sentence, and those are the command queues.

You can queue commands into the queue to make all of these objects start doing things.

They might be to copy data from one place to another.

They might be to run a program on some of your data.

And, that's basically it in a nutshell.

The important thing with the command queues though, is that these are fully asynchronous queues.

So, when the device is finished with command one, it's going to want to go straight to command two, and if there isn't a command two, then it's going to sit and be idle, and you're not taking full advantage of the computational horsepower of your entire system when they're sitting idle.

So, what you really want to do is enqueue a pile of stuff into the command queues and let it take off and run in asynchronous fashion.

As an example of how these things work, here I have, at top left, some data that you might have allocated, called My Buffer or My Image.

And then, in your program somewhere; I put it in Main, but it can be anywhere, you'd start enqueueing commands to then operate on that data.

We might start with an enqueue to write a buffer, and what this does is copy data from your buffer into OpenCL's counterpart.

You might call EnqueueWriteImage, which will do the same thing for images.

And then, you can call EnqueueNDRangeKernel, which will copy all of that data automatically up to whatever device you enqueue it for.

The device will run your program, and something will happen.

You could have some results.

And then finally, you might enqueue a read buffer to copy the data back to your thing, and proceed in that way.

The OpenCL API itself is very consistent.

We have about eight object types and they all have the same set of functions that are used with them.

They have a creation function; it's used to create the object.

We return the object out the left side and an error out of the right.

These things follow Core Foundation reference counting semantics, so you'll probably find the retain and release calls familiar.

When the reference count goes to zero, the object is then destroyed in the background by OpenCL.

We have getters and setters in the standard object paradigm to get information in and out of the object.

Almost all of the objects have getters.

Only a very few of them have setters, because that introduces mutable state into the object, which makes it harder to make a thread-safe back end.

And then finally, if you want to enqueue a command, we have clEnqueue followed by some command type; there's a bunch of functions like that.

So, the next question is; OK, I've got this object infrastructure, but I don't really understand, like how I write code for it.

I mean, what does the code I actually write look like?

And, I mean the code that you run on the device, not the code running in your regular C program.

So, we have to find some way to split up lots and lots of parallelism in a way that you write your code where it still makes sense to you.

So, a traditional way to do that would be to invoke task level parallelism.

This is something you might have done yourself, you know, using Pthreads or NSThread or something, where you divide the work up into different tasks.

So, if we take a mail reader app as an example, we might have a thread to get the mail from the server, another one to scan through the results and identify junk mail as it comes in, another one that might run mail filters, a thread to draw the UI, a thread to get keyboard input, a thread to play the audio if you want to make a little beeping noise.

You know, we can divide up many such tasks like that and, you know, usually you come up with five or ten different things you can do, and that's great when you've got five or ten different processors to work with, because you can get a five- or a ten-way parallelism if you can manage to stack those things all up concurrently.

But, in OpenCL we're really targeting a much more parallel system than that.

We're going after many core systems, you know, that might have hundreds or thousands of cores, so how do you break up your workload in a way that's more amenable to those kinds of systems?

And, what we can do is just learn from standard shader languages, like the OpenGL Shading Language, and take advantage of data-level parallelism; that is, rather than breaking things up by different kinds of tasks, we break things along data boundaries.

So, if we continue our example with the email reader, you might have a separate thread for each email.

So, if you're like me and you come in in the morning and you have 200 emails waiting to waste a good part of your morning, then, you know, this would be a great way to get 200-way parallelism out of your system.

And, hopefully these kinds of computations that you're doing on the different data elements are largely independent.

You know, in my case, what happens to one email probably doesn't influence much what happens to another one.

And, assuming you have enough data, then you can presumably get your 1,000-way parallelism potential.

Or, better yet, you know, find a million-way parallelism in there.

And, this is exactly the sort of problem that OpenCL is set up to go after: one where you can just get massive parallelism in your computation.

So, as an example of a way to break things down, let's take this image.

This is an old OpenCL logo.

And, I've broken it down so that you can see each pixel outlined in the grid, and each pixel we would call a workitem, and we would run a single function against that pixel.

So, for even a very small image of, you know, one kilopixel by one kilopixel, which is much smaller than you can get out of most cameras today, you can easily see that you could get a million workitems out of this, and you can get a great deal of parallelism.

And, what OpenCL will do is essentially implement the outer loop for you and go through and call the function in turn for each one of these workitems.

And then, in your function, when you get called, you call a little get_global_id function, which tells you which one you are, and then you go operate on that one.
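
As a rough illustration of that division of labor (a plain C model, not code from the session), the sketch below plays both roles. The hypothetical `brighten_workitem` stands in for a kernel body, and `run_over_all_workitems` stands in for the outer loop that OpenCL runs for you. In real OpenCL C the index would come from `get_global_id(0)` rather than a parameter, and the workitems would not run one after another in order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-workitem function: brightens one pixel, clamping at 255.
   In an OpenCL kernel, global_id would be obtained from get_global_id(0);
   it is a parameter here so the sketch runs as ordinary C. */
void brighten_workitem(size_t global_id, const unsigned char *in, unsigned char *out)
{
    unsigned int v = in[global_id] + 16u;
    out[global_id] = (unsigned char)(v > 255u ? 255u : v);
}

/* Stand-in for the outer loop that OpenCL implements on your behalf:
   call the same function once per workitem, each with its own ID. */
void run_over_all_workitems(size_t global_size, const unsigned char *in, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t id = 0; id < global_size; ++id)
        brighten_workitem(id, in, out);
}
```

The point of the model is only that the per-item function never loops over the data itself; it is told which element it is and handles that one.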

And then, just to be complete, it's not really required that the dimensions of your problem map directly to the workitem dimensions that you've called with OpenCL; it can be another dimension heading off in an orthogonal space and you can map it back, and I'll talk more about that in a minute.

However, there is a little bit of a complication here.

It turns out that modern silicon is generally not a vast array of very simple scalar processors all running in parallel.

Over the last few decades, as you know, we've been adding all sorts of ways to get instruction level and data level parallelism into the single instruction stream.

We have superscalar processors, simultaneous multithreading.

You can do a lot of work at the same time in a SIMD fashion using vector units.

And, it turns out that if we actually look at the possibility for doing concurrent work on general modern cores, these things are capable of running many, and sometimes a great many, workitems concurrently.

So, what we do is we take our giant grid of workitems, and we break it out into little groups, called workgroups.

And, here I've outlined them in yellow, in one way of breaking down the problem.

And, we run a workgroup on a particular core, and all of the workitems in that workgroup will run roughly concurrently on that core.

And this is actually a very convenient motif, because it means that you have spatially adjacent data working together at roughly the same time, which means you get great cache utilization.

You can share resources, like local memory.

If one workitem accesses a piece of memory and another one right next to it accesses the piece of memory right next to it and so on down the line, the compiler might be smart enough to spot; hey, you just loaded a contiguous range of data, and turn that into one big load rather than doing it in smaller loads.

So, that's called a coalesced load, or you can do it with stores, too.

It's also very cheap to synchronize.

It could be as cheap as a no-op to get all of those things to work together, because you're essentially doing it all in software.

Or, it could be a little hardware interlock within a single core.

What's really hard to do though, is to synchronize between workgroups, because they're running on different cores, which might be far apart on the silicon.

There's actually sort of speed-of-light information travel problems to get from one core to another, so communication can be slow.

It would add complexity to the chip to have each core capable of being interrupted by every other core, in sort of an N-squared fashion.

And then finally, you run up into sort of a memory limit.

Imagine that we have a relatively small image, say a million pixels, and we make a little stack for each workitem of four kilobytes, again very modest.

Multiply that out and you quickly realize you've just exhausted your address space in a 32-bit process.

So, we can't actually have all of those stacks living at the same time.
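
The arithmetic behind that claim can be checked with a small C sketch. It assumes the "million pixels" is the 1024 by 1024 image mentioned earlier and that each stack is 4096 bytes; the function names are made up for this illustration. Under those assumptions the stacks come to exactly 2^20 x 2^12 = 2^32 bytes, the whole of a 32-bit address space. With exactly 1,000,000 workitems the total is 4,096,000,000 bytes, which is still about 95% of it.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes needed if every workitem had its own live stack at once. */
uint64_t total_stack_bytes(uint64_t workitem_count, uint64_t stack_bytes_per_workitem)
{
    return workitem_count * stack_bytes_per_workitem;
}

/* Returns 1 if the stacks alone would fill (or overflow) a 32-bit address space. */
int exhausts_32bit_address_space(uint64_t workitem_count, uint64_t stack_bytes_per_workitem)
{
    const uint64_t address_space = UINT64_C(1) << 32;   /* 4 GiB */
    assert(stack_bytes_per_workitem > 0);
    return total_stack_bytes(workitem_count, stack_bytes_per_workitem) >= address_space;
}
```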

So, it's not possible to say, put a barrier in the middle of your code, have all million workitems come up to that barrier, then once they get there they all continue.

It just doesn't work.

So, how do we map workitems directly to hardware?

There are varieties of ways to do this.

One way is a simple, direct model, and this is what we do on a CPU today in Snow Leopard, where essentially one workitem is one hardware thread running there, just like a Pthread.

And, in this model, you have to vectorize your code if you want to get full use out of the hardware available, and then there might be other sources of parallelism on the core that you might want to take advantage of, like superscalar execution, or you might try to use the reorder buffers to get more to happen concurrently.

But, there are other ways to do it in a more GPU-like fashion.

We can parallelize our work items through the SIMD engine, and here we run a separate workitem down each lane of the vector register in the vector unit.

We can go further than that.

We can write a software loop to do multiple vectors at a time; meaning, in this case, 32 workitems at a time.

And of course, since it's a loop, we can turn the loop over many, many times, and pretty soon you can see how you can end up with a workgroup that's hundreds or thousands of workitems in size, all running on the same piece of parallel hardware.

And finally, we can parallelize this in a different way, using SMT, running each one of these vector things down through a simultaneous multithreading engine, and use hardware to do all of the scheduling.

So, these are just simple examples of how it might happen.

There are some ways to synchronize within a workgroup.

The simplest is mem_fence.

That actually synchronizes within a single workitem.

And, it's mostly there just for hardware that has really weak memory ordering.

This is largely unnecessary on Mac OS X, because all of our hardware is more strongly ordered than that.

So, you should not need to use mem_fence.

Barrier is very important, however.

If you are working with local memory, you'll find that you will want to copy data over.

Make sure all of the copies are done, issue a barrier, and then proceed forward.

Now you can use the data that's in local memory.

But, barriers only work across a single workgroup for the reasons I mentioned before.

Then finally, there's a call, wait_group_events, which is a barrier-like synchronization, but it works with an asynchronous sort of routine that copies data around, called async_work_group_copy.

There is also a problem with figuring out how many workitems to put in your workgroup.

You can pick just about any number, as long as the hardware will swallow it.

So, what will the hardware swallow?

Well, it depends on the hardware.

OpenCL provides quite a diversity of different interfaces to try to figure that out.

You can see I have eight of them here.

Six of them are APIs and there's also some constraints in the standard; the dimensions of your workgroup have to divide evenly into the total problem size and it might turn out that you need so much local memory per workitem in order to do your work.

There's only so much local memory on the system, so that kind of can limit the size of your workgroup, too.

So, that's the bottom entry.

So, how do you wade through all of this and try to figure out how big your workgroup can or should be?

Well, the first solution is to give up.

You can pass in NULL with the workgroup size.

We'll swallow that.

You can hopefully, you know, do some magic, which we'll try.

We'll do the best we can, and magic is wonderful stuff, but unfortunately it's not real grounded in reality, so we may not get it right for reasons I will describe in a minute.

So, in cases where you have more information than we do, it's often best to handle this yourself.

So, another approach is divide and conquer, which is the standard technique, of course.

And, you just go through and sort the information you're getting back from OpenCL according to dimensionality.

So, you won't need to call all of these interfaces.

You'll quickly realize that only some of them really apply to your particular problem.

But, you'll get a number of one-dimensional limits that you might have to take the minimum of; we can't have any more workitems than this.

And then, there's a couple of APIs that will give you a 3D shape, which will constrain the size of your workgroup, and this is required because there are devices out there that are only able to vectorize in one dimension and not all three.

So, these APIs can be used to do that, and then, of course, the overall size of your global problem set will constrain your data shape because it's got to divide evenly into your workgroup size, or by your workgroup size.

You can run into problems when the global work size is a prime number.

What divides into a prime number?

Well, the only things that divide into it are one and the prime number itself.

Prime number's not going to work; it's too big.

One means we're only running one workitem per core; you're not going to get any parallelism that way, so it's going to perform terribly.
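
To make the prime-number trap concrete, here is a small C sketch (an illustration only, not something from the session). The hypothetical helper `largest_dividing_local_size` picks the biggest workgroup size that both fits under a device limit and divides the global size evenly. The limit of 256 used in the usage note is an assumed value; a real one would come from the OpenCL device and kernel queries.

```c
#include <assert.h>
#include <stddef.h>

/* Largest local size that is <= max_local and divides global_size evenly.
   Falls back to 1 when nothing larger divides, as happens for primes. */
size_t largest_dividing_local_size(size_t global_size, size_t max_local)
{
    assert(global_size > 0 && max_local > 0);
    size_t candidate = max_local < global_size ? max_local : global_size;
    while (global_size % candidate != 0)
        --candidate;    /* stops at 1 at the latest, since 1 divides everything */
    return candidate;
}
```

With an assumed limit of 256, a global size of 1024 gets a workgroup of 256 and 1000 gets 250, but the prime 1009 collapses to a workgroup of 1.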

So, what do you do?

Well, one thing you can do is just make your problem domain a little bigger.

And then, in your kernel, when you go write your code and you go get, you know, which workitem am I, if you happen to be outside the original problem size, then you just have an early out.

So, that will let you solve the "local size must divide into the global size" problem and run with a prime-number global size.
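
A minimal sketch of that padding trick, again in plain C so it can run anywhere. The names `round_up_global_size` and `scale_workitem` are hypothetical. In a real kernel the ID would come from `get_global_id(0)`, and the original problem size would have to be handed to the kernel somehow, for example as an extra argument (an assumption here).

```c
#include <assert.h>
#include <stddef.h>

/* Pad the problem size up to the next multiple of the workgroup size. */
size_t round_up_global_size(size_t problem_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = problem_size % local_size;
    return remainder == 0 ? problem_size : problem_size + (local_size - remainder);
}

/* Model of a kernel body with an early out. Returns 1 if it did real work,
   0 if this workitem lies in the padding beyond the original problem. */
int scale_workitem(size_t global_id, size_t problem_size, float *data)
{
    if (global_id >= problem_size)
        return 0;                  /* early out: padding workitem */
    data[global_id] *= 2.0f;
    return 1;
}
```

For example, a prime problem size of 1009 with a workgroup size of 64 is padded to 1024, and the 15 extra workitems return immediately without touching memory.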

Another thing you can do, as I mentioned earlier, is enumerate your workitems out into some abstract dimension.

And then, you write a little function such as this, where you take in your global ID in the abstract dimension and map it back to something more real, like your x or y position.
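
The slide's function is not reproduced in this transcript, so the following is only a guess at its shape: a one-dimensional ID mapped back to a pixel position in a row-major image. The type `pixel_position`, the function name and the row-major layout are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: a pixel coordinate in the image. */
typedef struct {
    size_t x;
    size_t y;
} pixel_position;

/* Map an abstract one-dimensional workitem ID back to (x, y),
   assuming pixels are numbered row by row across an image_width-wide image. */
pixel_position position_from_global_id(size_t global_id, size_t image_width)
{
    assert(image_width > 0);
    pixel_position p;
    p.x = global_id % image_width;
    p.y = global_id / image_width;
    return p;
}
```

Because the mapping is just code, it could equally well walk the image in tiles or skip regions; the row-major version is simply the shortest one to write down.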

So, these are all things that you can just do yourself in code and, you know, it's limited only by your imagination.

So, I wanted to work through a few developer questions we've had over the years.

They come from a variety of topics.

Many of you have noticed that we've added half precision to the spec. This is a 16-bit floating point number.

And, it's easy to get quite excited about that; ooh, what is this thing?

You know; ooh, I have a new float to work with.

It's not quite that.

It's a storage-only format, which means that it only exists as a 16-bit floating point number in memory.

As soon as you pick it up and start trying to work on it, the first thing OpenCL does is convert it to a single precision number.

You do all of your arithmetic in single precision, and then when you go to store it back out somewhere, then it gets converted back.
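
To show what "storage-only" means in practice, here is a plain C sketch of the load-side conversion. It assumes the standard IEEE 754 binary16 layout of 1 sign bit, 5 exponent bits and 10 stored mantissa bits. It is not how OpenCL implements the conversion; in a kernel, `vload_half` does this for you. The function names are made up for this sketch.

```c
#include <assert.h>
#include <math.h>     /* only for the INFINITY and NAN macros */
#include <stdint.h>

/* Multiply by 2^power using exact doublings or halvings. */
float scale_by_power_of_two(float value, int power)
{
    assert(power >= -24 && power <= 5);   /* the only shifts half decoding needs */
    for (; power > 0; --power) value *= 2.0f;
    for (; power < 0; ++power) value *= 0.5f;
    return value;
}

/* Expand a 16-bit half precision bit pattern into a single precision float. */
float half_bits_to_float(uint16_t h)
{
    unsigned sign     = (h >> 15) & 0x1u;
    int      exponent = (int)((h >> 10) & 0x1Fu);
    unsigned mantissa = h & 0x3FFu;
    float value;

    if (exponent == 0x1F)
        value = mantissa ? NAN : INFINITY;                        /* NaN or infinity */
    else if (exponent == 0)
        value = scale_by_power_of_two((float)mantissa, -24);      /* zero or denormal */
    else                                                          /* implicit leading 1 */
        value = scale_by_power_of_two((float)(mantissa | 0x400u), exponent - 25);

    return sign ? -value : value;
}
```

The `mantissa | 0x400u` step is where the roughly 11 bits of precision come from: 10 stored bits plus the implicit leading one.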

Which interface you use to load and store your data depends on whether you're working with buffers or images.

Buffers will use vload_halfn/vstore_halfn; images will use read_imagef or write_imagef, as for other pixel formats.

There is an extension you'll see in the end of the spec, which is cl_khr_fp16, which actually specifies half precision direct arithmetic.

That's not supported on our platform.

There's a couple of good reasons for that.

The native hardware doesn't do it; we'd have to emulate it in software, which would be a lot slower than doing it in single precision.

And, the other problem is, of course, that half precision only has about 11 bits of precision.

If you do enough multiplies and adds and whatever else, you're going to start losing bits, and you'll be down to eight, seven, six bits of precision, and for most algorithms that's just not enough.

Many people see that; oh, OpenCL has four address spaces, which are sort of disjoint places to put memory.

What's that all about?

We have global, which is akin to the main system memory you're used to using.

We have local, which is a little, user-managed cache tied to your compute unit.

We also have a constant memory space, and private, which is just local storage for your particular workitem.

And, the confusing one is, what is local memory?

It's just a user-managed cache, and the way you use it is you either explicitly, or using a convenience function like async_work_group_copy, just pick up the data from global memory and write it over there.

So, it's as simple as, you know, just doing an assignment from A to B in your OpenCL C code.

So, what you want to do is have all of your workitems work together to copy in the data from global to local, then issue a barrier to make sure everyone's done, so that we don't try to read any of it before all of them are done.

And now, the data you know is resident in local memory and you can read that out much quicker than it would have taken to access it from global.

Now, a key point about local memory is that it only really works if you touch the data more than once.

If you just touched it once, then you would essentially be reading it once, copying it over here and then reading it back from over there.

You didn't save yourself any time.

You want to be in a situation where you read it once here, put it over here and then all sorts of different people use this over time.

So, suppose it turns out that you only plan to use your data once; let's say, in my image, I'm just converting RGB to YUV, so each pixel is largely independent and I only touch it once.

Then you want to use a variety of different approaches, depending on what kind of hardware you're on.

On a GPU, there's a texture cache designed to accelerate that kind of read-once access, and that is backed up by the image data type, so you'd want to use that.

On the CPU, there is a global cache backing up buffers, so as long as your data has good spatial locality, then you should get some acceleration out of the caches.

I should note that local memory, while it seems like a predominantly GPU technology, we have found that it's actually quite helpful on the CPU.

It can make vectorization easier.

It also can avoid polluting your caches, and what I mean in that sense is, let's say your global data structure is an array-of-structures (AoS) data type.

Here I have an AoS struct with x, y and z in it and then something like a telephone number, some unrelated piece of data.

And, I know many of you just love doing code like that.

I've seen it everywhere.

And, this might be an array in your buffer, but when you go actually work on it in your kernel, it can be quite useful to then transpose that around into an array of x's followed by an array of y's and an array of z's.

If you only intended to work on x, y and z, and you didn't care about the telephone numbers, then you end up compacting the data down into a much smaller space.

Also, because it's planar in orientation, it's much easier to vectorize.
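
The transposition described here can be sketched in plain C. The struct layout, the 16-byte phone field and the function name are assumptions made for the illustration; the slide's actual struct is not in the transcript. In an OpenCL kernel the destination arrays would live in local memory, and the copy would be shared among the workitems of a workgroup rather than done in one loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical array-of-structures record: a position plus unrelated data. */
typedef struct {
    float x, y, z;
    char  phone[16];
} record_aos;

/* Planar (structure-of-arrays) view holding only the fields we compute on. */
typedef struct {
    float *x;
    float *y;
    float *z;
} positions_soa;

/* Gather x, y and z out of the records into three contiguous arrays,
   leaving the telephone numbers behind. */
void gather_positions(const record_aos *records, size_t count, positions_soa out)
{
    assert(records != NULL && out.x != NULL && out.y != NULL && out.z != NULL);
    for (size_t i = 0; i < count; ++i) {
        out.x[i] = records[i].x;
        out.y[i] = records[i].y;
        out.z[i] = records[i].z;
    }
}
```

After the gather, each array holds one component back to back, so a vector load picks up several x values at once instead of one x, one y, one z and part of a phone number.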

The GPUs have a watchdog timer.

And, what is a watchdog timer?

It's somebody looking over your shoulder to make sure you don't monopolize the GPU for too long.

And, the reason for that is the UI will not respond as long as you're busy on the GPU.

If you use the GPU for more than a few seconds at a time in a single kernel, you'll probably get a message in the console, such as this one shown here.

You may see a flash on screen and you definitely will not get the right answer out of OpenCL.

Your context might be invalidated.

So, if you start running into this, the simple solution is just divide your task up into smaller chunks so that it runs faster and doesn't use up quite as much time.

You can enqueue them one after another in the queue; you know, the second one will start as soon as the first one's done, so you won't waste too much time.
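
One way to picture the chunking is the plain C sketch below, which only computes the pieces; the names `work_chunk`, `chunk_count` and `chunk_at` are hypothetical. Each chunk would correspond to one kernel enqueue. How the offset reaches the kernel, for example as an extra kernel argument, is an assumption and not something the session specifies.

```c
#include <assert.h>
#include <stddef.h>

/* One slice of the overall problem: where it starts and how many items it has. */
typedef struct {
    size_t offset;
    size_t count;
} work_chunk;

/* Number of chunks needed to cover total items, rounding up. */
size_t chunk_count(size_t total, size_t chunk_size)
{
    assert(chunk_size > 0);
    return (total + chunk_size - 1) / chunk_size;
}

/* The index-th chunk; the last one may be shorter than chunk_size. */
work_chunk chunk_at(size_t total, size_t chunk_size, size_t index)
{
    assert(index < chunk_count(total, chunk_size));
    work_chunk c;
    c.offset = index * chunk_size;
    size_t remaining = total - c.offset;
    c.count = remaining < chunk_size ? remaining : chunk_size;
    return c;
}
```

The chunk size is the tuning knob: small enough that the slowest GPU you support finishes each chunk well inside the watchdog limit, large enough that per-enqueue overhead stays negligible.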

You want to be careful though, because the breadth of capability between a low-end GPU and a high-end one can be quite large; an order of magnitude, maybe.

So, if you are doing this, be sure you're testing on a low-end system to make sure that it works everywhere.

Some of you have noticed that OpenCL provides a way to get your compiled kernels out as a binary.

There's this interface, clGetProgramInfo (CL_PROGRAM_BINARIES).

And, the intention of this thing is to give you a cache, or a way to create your own cache, to avoid having to compile your kernels every time your app runs.

Some people want to use it for code obfuscation, but it's not suitable for that use right now, and the reason for that is that Apple has not committed to an ABI for the kernels.

And so, that means that on some future OS we might change the ABI, and your binary will not work anymore.

When you go try to load it, you have to call clBuildProgram before you can actually use it, and that call will fail.

If the only thing you shipped with your app was the binary, you're now in deep trouble because you have no code to run.

So, what you need to do is ship your source.

If this happens to you, then you rebuild from your source fresh, overwrite the cache that you had set up for yourself, and continue on.

Some developers are curious; when should I use buffers, when should I use images?

I think if you're familiar with OpenGL, it should be obvious.

But, if you're from CPU land, like me, then it's not so clear.

Buffers are sort of native territory for a CPU.

We have caches to back them up; they're very fast.

On a GPU, on current generation there's no cache to back up global memory accesses, so you're taking essentially a big, long trip; several hundred cycles out to get uncached memory.

On the GPU, you'd want to copy that data into local memory, which is very close and much faster to use.

Or, use coalesced reads wherein multiple workitems are reading data from largely contiguous regions of memory.

Images are great on the GPU.

There's a texture unit to accelerate those accesses, if they have great spatial locality.

But, on the CPU there's no such hardware, so we have to emulate the whole thing in software.

So, these things look a lot more complicated than you would think just looking at the spec. But, at least the CPU is extremely accurate, so it's good for debugging to make sure you're doing the right thing.

But, as you can see, this is the implementation of a single pixel read using linear sampling; it's 168 instructions.

So, you would only want to use the read_image functions on the CPU if you've budgeted ample CPU time to go through and do all of that work.

Some developers are wondering; how do I use OpenCL in my multi-threaded app?

It says it's not completely thread safe.

The intended design is to use a separate queue for each thread from which you intend to enqueue work into OpenCL.

The other thing you have to do is make sure you're not getting reentrant access into individual objects.

And, you can end up in some patterns also where you can step on yourself.

For example, on this one, I might set the kernel argument to be a value and then get interrupted, have another thread which is using the same kernel come along and set it to a different value and queue its kernel, then finally I wake up and queue my kernel, but it has the wrong argument.

So, you want to make sure that you don't get these sorts of races happening in your app.

Now, you could implement some very fancy locking schemes to try to guarantee this, but it's going to be heavy, it's going to damage your concurrency, and you're not going to like it.

What we are actually thinking that you would do is, if you intend to call the same kernel from multiple threads, make multiple kernel objects, all pointed back to the same kernel function.

That's cheap to do.

You don't have to do any locking, because each thread will have its own copy of the kernel object, and it's safe to do.

Probably the number one thing that I've seen developers do is block too much when they're enqueueing stuff into OpenCL.

A number of the enqueue APIs have the capacity to block until the work is completely done in OpenCL before returning control to you.

But, that's extremely expensive because you don't get to do anything while OpenCL's doing stuff, and then once OpenCL is done, it has to wait for you.

And, so you end up losing a lot of your concurrency, and I'll show you an example of that a little bit later.

So, there's an API entirely intended to block your queue, clFinish.

You should almost never need to call that.

The only time I've seen where it was a good use case of that is where somebody wanted to shut down OpenCL completely, wanting to make sure all of the work was done and all of the reference counting had resolved itself and all of the objects were freed and all of the memory was released.

So, that's great use for clFinish, but you should otherwise almost never need to do it.

Some people just seem to instinctively put it in there proactively after every single call, and it's killing them, I'll tell you.

There are calls to read and write data in and out of OpenCL.

These can be made blocking if you want, but you'll only need to be blocking on these some of the time.

And, you can probably figure out for yourself when this is, but often you'll see, like for example; I need to enqueue multiple reads to read back results from multiple buffers after my computation.

It turns out because the queue is in order, which means each job is finished before the next one can start; you only actually need to block on the last one, because you know that the other ones have already completed.

Likewise, when you're doing writes, the typical pattern is write data into OpenCL and queue a bunch of kernels, and then read back the results; the last one is blocking.

Well, OK; I know my write finished a long time ago, way up here, so no need to block on that either.

So really, in any giant sequence of calls, you probably only need one blocking call at the end.

People also run into some performance pitfalls using half-full vectors.

This can come in two forms.

On the CPU, the vectors are fixed width; they're all 16 bytes on SSE, and so if you write, like, a float2, which is only an 8-byte type, you're essentially issuing a 16-byte instruction to work on 16 bytes of data, but you've only populated it half-full, and so we end up doing extra work on, who knows what's in the rest of the register.

So, this is bad for two reasons.

Obviously, you've wasted half of your potential to do work.

But, in floating point, if those lanes in the vector happen to get any NaNs or infinities or denormals, you might set yourself up to take a hundred cycle stall for each one of those, and that can hit you operation after operation after operation after operation, and make your code run, like orders of magnitude slower than it should.

So, you want to be sure, when programming for the CPU, on the direct model like we have, that you try to make sure you use 16-byte vectors or larger.

Larger vectors will actually get you a little free unrolling in the compiler, which is at times a little faster than just using the 16-byte vectors.

The GPU has sort of the reverse problem.

When it's revectorizing your problem along a different dimension in the way the GPU would like to do it, you might have a float4, but you've only put data in the first two elements and have a bunch of garbage after that because you couldn't figure out what to do and you declared it a float4 somewhere.

Well, when the GPU vectorizes that, it will make a big vector full of x's and a big vector full of y's, and then two big vectors full of junk, which it will then go do arithmetic on needlessly, so that just wastes time.
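To make the GPU case concrete, here is a small plain-C model of that regrouping. It is only an illustration run on the host: the type `float4_model`, the functions `regroup` and `useful_fraction`, and the group size `NUM_ITEMS` are all hypothetical, and a real GPU compiler does not expose this step.

```c
#include <stddef.h>

#define NUM_ITEMS 8   /* stand-in for the work-items the GPU groups together */

/* Plain-C stand-in for OpenCL's float4 (illustrative only). */
typedef struct { float x, y, z, w; } float4_model;

/* Regroup per-work-item float4 values into one "big vector" per component:
 * planes[0] holds every x, planes[1] every y, and so on. */
void regroup(const float4_model in[NUM_ITEMS], float planes[4][NUM_ITEMS])
{
    for (size_t i = 0; i < NUM_ITEMS; ++i) {
        planes[0][i] = in[i].x;
        planes[1][i] = in[i].y;
        planes[2][i] = in[i].z;   /* junk if only x and y were populated */
        planes[3][i] = in[i].w;   /* junk if only x and y were populated */
    }
}

/* Fraction of the four planes that carry real data when only `used`
 * components are meaningful; 2 of 4 means half the arithmetic is wasted. */
double useful_fraction(int used)
{
    return (double)used / 4.0;
}
```

With only x and y populated, two of the four planes hold junk, and any arithmetic applied to the whole float4 is applied to those planes as well.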

Finally, we've noticed that people often will make objects, use them once, then delete them; make them, use them once, delete them.

And, that can be kind of wasteful.

Many OpenCL objects are heavy; they're intended to be reused a lot.

Like a program, you'd have to compile it each time, which can take a big chunk of a second, sometimes.

Images and buffers have a big, giant backing store, megabytes in size; a bunch of driver calls to set it up, and then there's some state associated with who used it last, so we can track which device the actual data lives on right now.

And then, finally, any buffers that you make that are new on the system are subject to the usual zero-fill activity that the kernel will do the first time you use them.

So, if you reuse them, you save yourself this cost the second and later times.

However, it's only really useful to reuse things if they're about the same size, or in the case of images, exactly the same size as the previous use.

Otherwise, OpenCL has no concept of only copy part of this buffer up there; it'll copy the whole thing up to the device.

So, you only really want to reuse them if they're about the same size.

So finally, I'd like to talk about a few debugging tips.

These are standard techniques we use back in Infinite Loop.

Pretty much all of us run with the environment variable CL_LOG_ERRORS set all of the time.

I just put that in my bashrc.

And, you can set this to either standard out or standard error or console, depending on where you want the error messages to go.

And, what it does is, whenever you call an OpenCL API and you manage to miss some little gotcha in the spec, and OpenCL returns an error, you also get a hopefully human-understandable English message, sent to the console or standard error or wherever you chose, about what exactly you did wrong.

So, if you're encountering any problems with the API, getting that to do what you want, then CL_LOG_ERRORS is very much your friend.

There's also a way to hook into it programmatically.

You can roll your own function and pass it in when the context is created.

Finally, when you're working on the CPU, we make heavy use out of Shark and Instruments to see what's going on in there, and what I'd like to do is give you a quick look at what that process looks like.

So, this is iPhoto.

It's unmodified, and as it turns out, iPhoto on Snow Leopard for certain things will use OpenCL on the CPU.

So, we can take a look at what it's doing.

So, we call up our little Adjust panel and we can start, you know, adding little filters onto the image, and these are all processed in real time, as you can see.

And then, we can go and start jiggling stuff.

So, if we run Shark; Shark has a bunch of different ways to run samples, but the two most interesting are Time Profile and System Trace.

I'll run a Time Profile first; I'm sure many of you have done this.

So, while I'm getting OpenCL to do something through iPhoto, here I'm jiggling this thing around.

I can hit Option-Escape to get it to record, and what it's doing is taking a sample every so many milliseconds or microseconds, and recording which instruction the CPU or CPUs were on.

And then later, it puts it all back together, traces the samples back to the functions they fell in, and you can get a breakdown as to where your time was going using this stochastic technique.

So, you can see it was spending about 12 percent of its time in this function.

This is an OpenCL library.

OpenCL provides a number of libraries which are strangely named; they look a little bit more like pixel formats than libraries.

So, if you see this, this is a cost accrued to you due to the cost of going to use readImage on the CPU.

You will also see lines that say unknown library in them.

Well, why doesn't it say OpenCL?

Well, this is your code.

It doesn't actually exist on disk anywhere.

We just built this code, stuck it in memory, and ran it.

So, Shark doesn't really know what it is, so it says unknown library.

But, you can still drill down and see all of the code that we prepared, and if you're good, you can figure out what parts map onto your kernel and see if you got the code you wanted, and if there are large stalls in here, you can generally figure out what went wrong in your kernel.

I can also show you a system trace, which is very useful for understanding how your interactions with the OpenCL queues are progressing.

So, if we go back and play with the image a bit more, I'll now record a system trace for a second or two.

This is a 16-core system, or 8 cores with 2-way SMT.

So, we have here a bunch of iPhoto threads.

I'll just limit it down to iPhoto.

And, I'm looking at the timeline, and what you can see here; these are threads, horizontally.

Regions that are amber are times, along a timeline that progresses to the right, when the CPU was active.

And, what we can see is that it's single-threaded a lot, but then there are these little windows where we're multithreaded.

And, these are when OpenCL is running, and we can zoom in on these things.

These little telephones are system calls, and here we can see; here is the main thread on the top and it's running through, making various system calls, and we can go look at this and track this back to OpenCL.

This is a clReleaseMemObject call to release some memory object.

This one is enqueue a kernel, and you see a little while after a kernel we get a little blip of something happening.

And here we enqueue another kernel, and then this one's a big one.

And, it seems like there is some serial process here to kick off each CPU as it goes along, so you can see them all firing up.

But, sometime before we manage to get them all fired up, we're already done with the work.

So, maybe the kernel you enqueued is too small, because we didn't actually get all 16 of the threads up and running, the hardware threads, before we ran out of work to do.

Well, why did that happen?

Well, we can go and look at this one.

What's this thing?

The main thread wasn't doing anything during this time; that's kind of a little strange.

We can go track that back.

Oh, look! It's a clFinish.

Apparently they enqueued a kernel and then issued a clFinish to wait for it to be done, and then did some more work and then repeated the process again.

You can see another clFinish over here.

Well, this is quite costly for a number of reasons.

For example, let's see if I can zoom in here.

While we weren't doing anything here, we probably could have been doing this much work in the main thread.

It only would have cost OpenCL one thread out of 16.

So, it probably wouldn't have slowed down too much.

So, you could have gotten this much work done, which meant all of this dead time, all down here, would have gone away and we would have compressed the launch-to-launch time from here to over here by about that much time.

So, you can see, just that clFinish, that's what it's costing you.

Another thing is, all of the CPUs go back to sleep after we're done with this, and then we have to wake them up again.

If you didn't have that Finish in there, it's possible we would have just picked right up where we left off and started running these things, except all of the CPUs would be awake now.

So, rather than getting a little, tiny bit of work out of these threads, you might have gotten a full width.

So, the whole thing, you know, might have been about twice as fast if we go by one-half base times height on this little triangle here.
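The "one-half base times height" estimate can be written out as a few lines of arithmetic. This is only a back-of-the-envelope model under the assumption that threads wake up one after another at a steady rate and the last one comes up just as the work runs out; the function names are hypothetical.

```c
/* Thread-time delivered while hardware threads wake up one after another,
 * with the last one awake only as the work ends: a triangle,
 * one-half base (duration) times height (thread count). */
double ramped_thread_time(double duration, double threads)
{
    return 0.5 * duration * threads;
}

/* Thread-time delivered over the same window if every thread is already
 * awake when the work arrives: a full rectangle. */
double full_width_thread_time(double duration, double threads)
{
    return duration * threads;
}

/* Ratio of the two: how much more work fits into the same window. */
double full_width_speedup(double duration, double threads)
{
    return full_width_thread_time(duration, threads)
         / ramped_thread_time(duration, threads);
}
```

Under those assumptions the ratio is 2 regardless of the window length or thread count, which is where the "about twice as fast" figure comes from.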

So, you can use Shark to dig right in to how much residency you're getting on the CPU.

And, the same techniques apply in the GPU.

You won't see these little triangles, because the work is happening on the GPU, but you will certainly see time that's dead time in your main thread, when you could have been doing something else.

So, that's Shark with OpenCL in a nutshell.

And, what I'd like to do now is invite Abe Stephens up to tell you all about how to integrate your workflow between OpenCL and OpenGL, and that allows you to quickly and seamlessly, and with a minimum of cost, share data between the two.

[ Applause ]

Abe Stephens: Hi.

My name is Abe Stephens and I work with the OpenCL Group at Apple.

Today I'm going to talk about OpenCL and OpenGL sharing, which is a mechanism that allows us to create objects in OpenGL and then import them into OpenCL in such a way that the actual data that is operated on by both APIs is the same.

So, we can avoid copying data or making duplicate copies of the data, and accelerate our programs.

The motivation here is that we'd like to combine these two APIs.

Now, OpenCL and OpenGL are similar in many ways.

A lot of programs that are written for both APIs run on the GPU.

It's possible to write programs for the CPU in OpenCL, and that's a little bit less common for OpenGL, but really, we're looking for a mechanism that allows us to move data efficiently between these two interfaces.

Let's take a look at a simple example; a case where we have an application that's going to perform some kind of physical simulation and then visualize the results.

The physical simulation part of this example; for example, a bunch of objects bouncing around the screen, is a very OpenCL oriented kind of task.

It might involve collision detection, computing maybe Newtonian mechanics or something like that.

And then, it also might involve rendering.

We could compute our position and velocity in our compute side of the application, and then take that data, move it to graphics, and actually render the scene.

Now, this type of task could also be performed without OpenCL.

We could produce data on the CPU and then transfer it to the GPU, and then in the next frame, repeat that process.

Alternatively, with a sharing API, with a sharing mechanism, we could produce the data in CL on the GPU, and then move it from the CL side to the GL side, and use the same data over again.

So, let's take a look.

In OpenCL, we'll produce a list of vertex data and then move that data from CL into OpenGL, where we can provide, or we can implement, some kind of sophisticated shading operation.

In this example, we've rendered these spheres with a refraction shader and some other special effects.

That shader operation is obviously best suited to graphics, and the physics operation in this case is very well suited to OpenCL.

We can also perform the opposite kind of operation.

Instead of producing data in OpenCL, we can consume the data in OpenCL and produce something in OpenGL.

For example, in that previous slide, OpenGL could produce data about surface normals and surface positions of fragments, and then that data could be passed into OpenCL for some type of post-process.

So, for example, in our physics application, we could render the spheres using OpenGL to a vertex buffer or to a framebuffer object, and then take a surface of that framebuffer object and transfer it to OpenCL, where the CL program might use that data as input to a ray tracer, trace a caustic effect, and then move that caustic effect back into OpenGL for final display and compositing.

Now, under the hood, both of these APIs operate using similar data structures in the driver, and are implemented using shared structures.

As you might be familiar, OpenGL selects the devices that are used for computing using a pixel format.

So, a developer or a programmer sets up a pixel format and that format is matched with the rendering devices in the system, GPUs and CPUs.

Matching the pixel format to system devices and then passing that format to a create function generates an OpenGL context.

At this point in setting up an OpenGL system, the programmer is done.

We can use this OpenGL context to start sending draw commands to the system.

But, if we wanted to take our program a step further and add an OpenCL application or an OpenCL task, we have to take the context and convert it into OpenCL, and this is done by extracting the ShareGroup from the GLContext, and then taking that ShareGroup and moving the ShareGroup into an OpenCL context.

And then, if we looked inside that OpenCL context, we'd see that the CLContext contained all of the devices that were originally in that pixel format.

Now, the CGL structure, the CGLShareGroup, and the OpenCL context both contain very analogous types of structures.

The ShareGroup has a list of vertex buffer objects, textures and render buffers, and the CLContext has data objects that are wrapped by the cl_mem object type.

When we convert the ShareGroup into a CLContext, we're making all of those GL data structures available to the CLContext, and then we have to obtain references to those structures from the CLContext to use in our program.

There's some relationship between the CGLContext that's used by OpenGL, and the list of CL devices, although in CL, we have a very explicit representation for the device.

In OpenGL, we select a virtual screen, or we use another mechanism to select a rendering device that the system will use.

So, although there's a relationship, the devices aren't exactly analogous.

There are five steps for setting up a sharing process between OpenGL and OpenCL.

We've already looked at the first step, which was to obtain that CGLContext.

From the Context, we obtain a ShareGroup and then use that ShareGroup to create the CLContext with which we'll send commands on the CL side.

After the Context has been created, we can import data objects from OpenGL to OpenCL.

In the first example that we looked at, those data objects consisted of a vertex buffer and then in that second post-process example, the shared data object was a GL render buffer.

After we've imported the data objects, the set-up phase of our application is complete, and now we can concentrate on executing commands, and there's a specific flush and acquire semantic that we have to use to coordinate between the OpenGL side of the program and the OpenCL side of the program.

And then, finally, when we're done, we have to tear the whole system apart and clean up by making sure to release the objects safely in CL before destroying them in GL.

So, let's take a look at some source code.

The first step is to obtain the CGLShareGroup for an application from the CGLContext.

And, the example that we'll look at will focus on a Cocoa application.

In Cocoa, the NSOpenGLView is commonly extended to add OpenGL to an application.

Within the initialization function of our Cocoa program, we can obtain first the CGLContext associated with that OpenGLView, using an accessor function.

And then, we use the CGLContext to obtain a CGLShareGroup, which is essentially what we'll use to create the CLContext.

Now, the second step in our five-step process is to take that ShareGroup and use it to create a CLContext.

We do this using an OpenCL function with a special enumerant, the CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE enumerant, which is really hard to forget.

And, this is passed in a property list to the clCreateContext function and, in this case, without any other arguments except for the error argument.

Now that we've created a CLContext, we have to obtain a list of CL devices, and if you remember from the slide at the very beginning of the talk, those devices will reflect the devices that were used, that were passed in the pixel format when that CGLContext was originally created.

In a standard Cocoa application, the runtime has already taken care of creating a pixel format, and we simply obtained the CGLContext that was provided by the runtime.

If we wanted to obtain all of the devices that were associated with our CGLShareGroup in our CGLContext, we could simply use the existing or the standard CL accessor method, clGetContextInfo, passing it the correct enumerant, and get a list of all of the devices.

However, in that case, we might obtain a GPU device that doesn't correspond to the GPU device our Cocoa application is currently using for display.

If we want to obtain the device for the virtual screen that's currently being used for display, we have to use another special enumerant and another special function, clGetGLContextInfoAPPLE, to obtain the CL device that matches the current virtual screen that our application's running on.

So, let's take a look now at what we've done.

We've made our way around this entire figure, and we're back to the point where we have a list of CL devices inside of our CLContext; however, if we look at those devices, we'll notice that the CL API has removed the CPU device from that initial list of GL devices.

If we want to add the CPU device back in; for example, if we wanted to run some CL programs or some CL kernels on the CPU, and others on the GPU, when we create our CLContext we have to explicitly add the CPU device, as if we were creating a normal CLContext.

So, this involves getting a list of device IDs with the CPU device type, and then passing that to clCreateContext.

OK, now here's the fun part.

Now that we have our Context, our CLContext, we have to move data objects, or tell the runtime which data objects we'd like to move from OpenGL into OpenCL.

We have to import the shared objects.

OK, so here are the two ShareGroups we'd like to end up with.

If we'd like to move that vertex buffer object into OpenCL, we use the function clCreateFromGLBuffer, passing it both the CLContext and the GLBuffer, and then memory access flags that tell OpenCL how we plan on using that data structure.

Now, if we look under the hood, what we've really done by sharing the vertex buffer object between OpenGL and OpenCL is we've created a structure within the OpenGL runtime and the OpenCL runtime, but those two structures, the VBO and the cl_mem object, actually refer to the same piece of driver storage.

This piece of driver storage, called a backing store, is the actual memory that contains that vertex data, or texture data, or render buffer data, and is the piece of memory that is moved between devices as we execute our program.

Now, if we execute a command either in GL, like a draw command, or if we execute a kernel in CL, the runtime, the driver, will take care of moving a cached copy of that data to the device that the command is executed on.

So in this case, when I created that mem_object in OpenCL, it's not as if I was allocating memory on a device.

I was really allocating memory inside of this driver storage pool, and then if I execute a command using a device, that memory is cached, in this case in Device VRAM, on the GPU.

Now, if I execute a command in either API on a different GPU, the runtime will take care of moving that data back to the host and then to the other device.

Now, OpenCL is a little bit different than OpenGL in this respect. Each GPU has its own piece of device VRAM.

In OpenCL, the CPU device, unlike a GPU, shares its memory with driver storage.

So, when I execute a command on the CPU in OpenCL, this copy operation doesn't occur.

The CL kernel is able to use the same backing store as the runtime.

Now, the operations we've looked at so far, and the example in the demo, involve vertex buffer objects or pixel buffer objects.

But, there are three other functions we can use to create shared data between OpenGL and OpenCL.

The first two functions involve textures, 2D and 3D textures.

And, the third function allows us to manipulate render buffers, which was the structure we used for that post-process example that we saw earlier.

After these structures are created in OpenGL and then imported into OpenCL, we can do a lot of things with them.

In either API we can modify their contents, execute commands that use them; but, one thing we can't do is we can't modify their properties.

So, after an image is imported from OpenGL into OpenCL, we can't change the width and height of the image.

We can't change other properties of the image.

We'd have to create a new copy in GL and then move it back into CL in order to make those sorts of changes.

OK, now that we've created the objects and imported them into OpenCL, as we're launching commands and executing commands in either API, we have to use a specific type of semantic, called Flush and Acquire, in order to coordinate access between the two APIs.

Let's take a look at the standard queueing system that's used by OpenGL by itself.

As I execute commands, they're enqueued in a command queue on the host, and at a certain point, those commands are flushed to the device and executed.

Now, the order of commands as I call functions is sort of maintained by that command queue, and those commands will be executed in the same order when they get to the device.

And, since I have a single command queue, there's no chance of commands being executed in an order other than the one I've specified.

However, if I add the OpenCL command queue alongside that OpenGL command queue, and then execute a bunch of OpenGL commands in my thread, and then a bunch of OpenCL commands in my thread, without any explicit synchronization, I don't have any control over the order that those commands get submitted to the device.

So, it's very possible that even though I sent the two OpenGL functions first, they might be submitted to the device by the runtime in an interleaved order.

Now, if there are data dependencies between OpenGL and OpenCL such that I had to execute all of those GL operations because I was producing that render buffer in OpenGL before I consumed it in OpenCL, this type of unsynchronized execution would cause a problem.

Therefore, before we hand work from one API to the other, we have to make sure to flush the GL side, which sends those commands out to the device, then acquire the shared objects on the CL side, and then execute our CL commands.

Once we then call clEnqueueReleaseGLObjects, our CL commands will be flushed to the device after the GL commands.

This explicit type of Flush and Acquire semantic ensures that those GL commands are in progress on the device before the OpenCL commands have a chance to be submitted to the device, and the order between the two APIs will be maintained.

OK, now that we've gone over how to create the ShareGroup, move the GLShareGroup into OpenCL, then create data objects and import those data objects, and then safely execute commands, the very last step is how to safely clean up this system.

The key here is that we always release objects in the opposite order that we created them.

We always release the objects in OpenCL, and then destroy them in OpenGL.

This ensures that in OpenGL, the OpenGL driver won't take objects out from underneath the OpenCL implementation.

Now, as you might be aware, OpenCL automatically takes care of retaining objects for you, so if you pass a kernel into the runtime, or a data object into the runtime, the runtime will make sure that the reference count of that object reflects that the runtime is holding onto a pointer for it.

So, it's necessary to make sure that after you've enqueued commands that use memory objects, those commands have been executed and have been completed before OpenGL has an opportunity or might accidentally delete or destroy an object.

So, the objects have to be completely released by OpenCL before they can be destroyed by OpenGL.

And, releasing a mem_object is simple.

Essentially, there's one function, regardless of whether the mem_object was created on the GL side, whether it's a buffer or an image.

We simply call clReleaseMemObject.

OK, now I'd like to show you an example; a live demo of the example I showed earlier.

This is the case where a vertex buffer object is filled in by OpenCL and shared with OpenGL, and then a GL FBO is created and shared with OpenCL to perform some post-processing.

In this example, OpenCL is computing the physics interaction between the spheres that are bouncing around the screen, and then OpenGL is rendering the refraction and the reflection on each individual sphere.

As OpenGL renders that effect, it also produces a buffer which contains the surface normal and position on the surface of various fragments for the spheres.

That buffer is passed back to OpenCL using a shared render buffer.

OpenCL then performs some photon tracing to compute the caustic effect that you see towards the bottom of the screen.

This is a simple example.

The physics that's computed in OpenCL isn't particularly sophisticated, but this allows us to perform all of the computation on the GPU, instead of having to coordinate between the CPU and the GPU for each frame.

So, both the physical simulation and the rendering of the spheres can happen on the GPU, and then the photon tracing and the rendering of the caustic highlight can also occur on the GPU.

OK, so the three steps for that demo were simply to update vertex positions, perform the photon trace, and then render the scene.

And, by using the sharing API, it was very easy to perform all of those operations on a single device.

If we had some type of application where we wanted to perform, say updating the vertex positions or rendering the photons on the CPU or on another device, the OpenCL runtime would have allowed us to automatically move the data back and forth between those devices.

And so, even if we're not running applications that do all of their work on the GPU, it's still possible to use OpenCL and the sharing mechanism to handle moving data between the two devices.

OK, so five easy steps to share data between OpenGL and OpenCL.

The first step is to make sure that we select our pixel format and the devices that we're going to use for OpenCL, using that CGLPixelFormat function, and using that pixel format to create our CGLContext and ShareGroup.

We pass that ShareGroup to clCreateContext in our second step, to produce and initialize a CLContext containing those devices.

Then we create objects in GL, import them into CL.

We use a glFlush and CL acquire pattern to handle coordination of commands between the two APIs, and then lastly, when we're done, we release in CL before destroying in GL.

Now, this concludes the CL session.

For more information, you should contact Allan Schaffer, who's our Evangelist, and of course, look at the Apple developer forums.

There's a CL Dev forum and also a GL Dev forum that are great ways of getting in touch.

There are a number of other sessions this week that will address OpenCL and OpenGL.

Immediately following this session in this room, we'll hear from a number of vendors that will describe how to maximize OpenCL performance for different devices.

Tomorrow there's a session on OpenGL for the Mac, and then later in the day tomorrow is a session that will describe multi-GPU programming, both with OpenGL and OpenCL.
