The Accelerate Framework

Session 713 WWDC 2013

The Accelerate framework contains signal and image processing, matrix, and linear algebra computation. Learn about new signal and image processing functionality. Find out how you can use the Accelerate framework to achieve dramatic improvements in performance and energy consumption.

Good afternoon.

Welcome to the Accelerate Framework Session.

My name's Jeff Belcher.

I'm an engineer in the Vector and Numerics Group.

Today I want to start off with a pretty common scenario.

Imagine you've got a great idea for an application, and that application has a computationally intensive component.

You look around and you find an open source solution to the problem, you bring it into your application, you test it, and you find that it's too slow, or maybe it's a battery drain.

At this point you're forced to spend the next several hours or maybe days, profiling and optimizing that code to get the performance to where you need it to be.

We don't think that's right.

The goal of the Accelerate Framework is to solve this problem.

The Accelerate Framework is a collection of functions for commonly used, computationally intensive operations.

The Accelerate Framework is designed to be high performance and deliver great energy savings for all of these APIs that are available.

When you adopt the Accelerate Framework you're going to get great performance and amazing energy characteristics from the smallest iPhone all the way up through the biggest Mac Pro without changing a single line of code on your end.

Let's dive into the details of the Accelerate Framework and see how it can help you make a really great app.

So what is the Accelerate Framework?

When you think Accelerate Framework there's a few things that I want you to remember.

First, easy access to a lot of functionality.

There are more than 2,000 APIs available in the Accelerate Framework.

Throughout the rest of the talk we'll break this down into four easy-to-remember categories and show you what exactly is available.

Think accurate.

We spent a lot of time testing so that you don't have to.

The big one is fast with low energy usage.

You guys really pushed the limits of the hardware available today with your great applications.

When you use the Accelerate Framework you're going to get great performance, and that's going to come with amazing energy characteristics.

The best part for you is it works great on both OS X and iOS and it's optimized for all generations of hardware, so when new hardware comes out you're not going to have to revisit your code.

So I mentioned that there's a lot of functionality and the Accelerate Framework is geared toward commonly used computationally intensive operations, but what exactly is available?

We break it down into these four categories.

First we've got image processing with vImage; we've got digital signal processing in vDSP; transcendental math functions in vForce and vMathLib; and finally, linear algebra in LAPACK and BLAS.

At the end of this talk there's a few points that I want you to come away with.

The first of these is how the Accelerate Framework can help you create a really great application.

I'm going to show you some examples of real world performance and energy savings that you can expect when you utilize the Accelerate Framework.

I want you to have an idea of areas of your code that are likely to benefit from the Accelerate Framework, and finally, how to use the Accelerate Framework.

So this is going to range from linking against the Accelerate Framework up through some tips and tricks that can really allow you to get the most out of the Accelerate Framework.

I want to move now to why the Accelerate Framework is fast.

Understanding why the Accelerate Framework is fast can help in understanding when and why to use the Accelerate Framework.

One of the big reasons the Accelerate Framework is fast is we utilize SIMD instructions.

This is Single Instruction Multiple Data.

For those of you unfamiliar, if we're trying to for example add 2 arrays together, there are instructions on current hardware that allow us to add multiple elements simultaneously.

For those of you more familiar with SIMD operations, on Intel this means we're taking advantage of SSE, AVX, and now AVX2.

On ARM we're taking advantage of NEON.

Utilizing SIMD instructions in certain situations can have significant energy and performance savings.
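
As a rough illustration of the SIMD idea described above, here is a minimal C sketch that adds two arrays four floats at a time. It uses the GCC/Clang vector extension rather than any Accelerate API, and the names f32x4 and add_arrays_simd are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats, using the GCC/Clang vector extension (an assumption:
   other compilers need a different spelling). */
typedef float f32x4 __attribute__((vector_size(16)));

/* out[i] = a[i] + b[i], processing four elements per vector addition. */
void add_arrays_simd(const float *a, const float *b, float *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        f32x4 va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* load without alignment assumptions */
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb;                    /* one addition covers four elements */
        memcpy(out + i, &vc, sizeof vc);
    }
    for (; i < n; ++i) {                 /* leftover elements, one at a time */
        out[i] = a[i] + b[i];
    }
}
```

The compiler maps the vector addition to SSE or NEON instructions where they are available; the scalar tail handles lengths that are not a multiple of four.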

We also spend a lot of time matching the microarchitecture for the complete Apple hardware lineup.

This includes optimizations like instruction selection and instruction scheduling, as well as software pipelining and loop unrolling.

So I bring these up because it requires a certain amount of data before optimizations like loop unrolling become beneficial, so it helps to understand that this is sometimes happening behind the scenes in the Accelerate Framework.

The last reason the Accelerate Framework is fast is because it's multithreaded using GCD.

When it's appropriate we're going to take advantage of all the cores available.

So I wanted to talk about why it's fast so that you have an understanding of where some of the tips for successful use of the Accelerate Framework come from.

The first tip is preparation of your data.

When you prepare your data there's a few things that I want you to remember.

The first is, if you can, make your data contiguous.

This means that if you're creating an array, you want to make that array such that the elements are contiguous.

If you're allocating or have control over the layout of that buffer in memory, if you can align the beginning of that buffer to a 16-byte boundary, that's going to be ideal.

With the Accelerate Framework we always strive to deliver the greatest performance, but if you can meet these recommendations, in certain situations we can algorithmically exploit that to give you slightly more performance.
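
One way to follow both recommendations, sketched here with the standard C11 aligned_alloc call rather than anything specific to Accelerate, is to allocate a single contiguous block that starts on a 16-byte boundary. The helper name alloc_aligned_floats is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates a contiguous array of `count` floats whose first byte sits on a
   16-byte boundary. Returns NULL on failure; release the memory with free(). */
float *alloc_aligned_floats(size_t count) {
    const size_t alignment = 16;
    size_t bytes = count * sizeof(float);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0) {
        rounded = alignment;
    }
    float *p = aligned_alloc(alignment, rounded);
    assert(p == NULL || (uintptr_t)p % alignment == 0);
    return p;
}
```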

The next tip is to understand the problem size.

Any function call has a cost associated with it.

The Accelerate Framework is not immune to this.

On the previous slide we also saw that in certain situations optimizations like loop unrolling are used.

What this means for you is that when you're using the Accelerate Framework with really small datasets, it may not deliver the best performance.

There's no single problem size below which I can say don't use the Accelerate Framework; it's going to depend on the operation you're performing.

For example, if you're scaling a vector it might be on the order of 100 elements; whereas if you have a more complicated operation for example, Matrix Multiply, it could be as small as 8 elements.

The best thing you can do here is to experiment.

The Accelerate Framework is always going to deliver the functionality; it's just that for these smaller problem sizes it may not deliver the best performance.

The last tip for successful use is to do setup once and destroy once at the end.

There's a handful of operations in the Accelerate Framework that require a setup structure.

Creating this setup structure can be costly and time-consuming.

These setup structures are designed to be used multiple times, so if you find yourself in a situation where you need to do these setups, create the setup, do all of the computation that you want to do with that setup, and then destroy once at the end.

Throughout the rest of the talk we'll see some examples of this and it will become more clear.

Now I want to move on to using the Accelerate Framework.

For those of you brand new to the Accelerate Framework, including it is just like including any other framework.

Here we have a typical Xcode project, and we're just going to navigate to the build phases.

In the build phases we're going to find the Link Binary With Libraries section and we're going to find the Plus button.

This brings up the list of available frameworks.

The Accelerate Framework's right at the top, we'll just select it and click Add.

And then we can be sure that the Accelerate Framework is included in our project because it's going to show up in this Link Binary With Libraries section.

The only other step to using the Accelerate Framework is to include the headers.

This is Accelerate/Accelerate.h. That's all it takes to use the Accelerate Framework.

Linking from the command line is just as easy.

In your link step simply include -framework Accelerate.

So now I want to dive into the details of what's available in the Accelerate Framework.

I mentioned there's over 2,000 APIs and we've got these four categories so we'll start to step through these now.

And we'll begin with image processing.

For image processing we have vImage, our vectorized image processing library.

There's a lot of functionality in vImage, and rather than just list it I put together a short video to show you some of the features that are available.

We've got alpha blending and alpha compositing, dilation, erosion.

You can create Sobel filters to perform edge detection, various types of convolutions to perform blur, deblur, or multi-kernel convolutions, max filters, min filters, color transformations, warps, and shears.

So this is just some of what you'll find in vImage.

We also have some great additions and improvements in both iOS 7 and OS X.

First we have improved conversion support.

Conversions are operations like converting between planar and chunky data or changing between a pixel component type, so an 8-bit image format to a 16-bit image format or a floating point image format, just to name a few.

We also introduced vImage buffer creation utilities, so in the tips I talked about how important it is to create a buffer, getting the alignment right and getting everything contiguous, so to take some of the guesswork out of that for vImage, we introduced the utilities where you can just specify the size of the image, and this function will create the appropriately sized buffer to deliver the maximum performance.

We also introduced resampling of 16-bit images, so all the operations like Warp and Shear that were available for 8-bit and floating point image formats are now available for 16-bit image formats as well.

The last addition is streamlined core graphics interoperability.

This is a big one, and I want to dive into the details of this with an example.

So we got the question a lot.

How do I use vImage with my CGImageRef?

To solve this problem we introduced two new utility functions.

To go from a CGImageRef to a vImage buffer, we introduced the utility function vImageBuffer_InitWithCGImage, and for the reverse direction, we introduced vImageCreateCGImageFromBuffer.

Let's take a look at an example of this, and see just how easy it is to use.

So here we're going to look at how to go from a CGImageRef to a vImage buffer.

As always, we're going to begin by including the Accelerate Framework header and then we're going to create and open a CGImageRef.

I'm not going to go through the details of this here.

There's a lot of documentation and examples of this, but assume after this line that we have our CGImageRef open.

The first step that we're going to do then is specify the image format.

This image format describes the format of the vImage buffer that we want to create.

We've introduced the vImage_CGImageFormat structure.

You'll find several elements in here; for example, bits per component, bits per pixel, information about the color and bitmap info to name a few.

This descriptor is describing an ARGB 8-bit image.

We see that the first entry in this structure is bits per component of 8, so each component in the picture is going to be 8 bits.

The bits per pixel is 32, so there's going to be 4 components.

Color space, we pass null.

When we pass null this means that we're going to get a default RGB color space, so we have 3 color components.

And then in the bitmap info, we have kCGImageAlphaFirst.

This means we have a single alpha component and it's the first component.

So this describes our 8-bit ARGB image format.

With this format we're going to call vImageBuffer_InitWithCGImage.

The first argument is the vImage buffer that we want to create from our CGImageRef.

The second argument is the reference to that format description that we just created.

The third argument is unused in this case.

This is information about background color.

In certain conversions when alpha channels are involved, it may be necessary to provide information about a background color.

The next argument is inImage; this is our CGImageRef that we want to convert to the vImage buffer, and finally any additional flags.

In this case we don't have any so we pass kvImageNoFlags.

Upon successful return of this function, we've allocated a new vImage buffer.

It contains the image in the format that we've described, and we're free at this point to release the CGImageRef.

The reverse is just as easy, going from a vImage buffer to a CGImageRef.

So we've done our image processing, and we have our vImage buffer, outBuffer.

We haven't changed the format so we're going to use our same format specifier that we created before.

To create the CGImageRef we're going to call vImageCreateCGImageFromBuffer.

The first argument is going to be the output vImage buffer that we just finished processing, that same format type, because we haven't changed the format.

The next two arguments are a user callback function and user data.

For this particular conversion we don't need that so we're just going to pass null.

And then we pass any additional flags.

Again, in this case there are none, so we pass kvImageNoFlags.

And then finally a reference to a vImage_Error to capture the error state.

Upon successful return of this function, we're going to return the CGImageRef, outImage in this case.

This is going to be a freshly allocated CGImageRef containing the image information, and we are free to release the vImage buffer.
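
The slides are not part of this transcript, so the following is only a sketch of what the round trip described above might look like in C. The function name round_trip_through_vimage and the two enum constants are hypothetical; the vImage calls are the ones named in the talk. The Accelerate part compiles only on Apple platforms, hence the guard.

```c
#include <assert.h>
#include <stdlib.h>

/* Format values from the walkthrough: 8 bits per component, 32 bits per pixel. */
enum { kBitsPerComponent = 8, kBitsPerPixel = 32 };
static_assert(kBitsPerPixel / kBitsPerComponent == 4,
              "ARGB has one alpha and three color components");

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>

/* Hypothetical helper: CGImageRef -> vImage_Buffer -> CGImageRef.
   Returns a new image that the caller must CGImageRelease, or NULL on error. */
CGImageRef round_trip_through_vimage(CGImageRef inImage) {
    vImage_CGImageFormat format = {
        .bitsPerComponent = kBitsPerComponent,
        .bitsPerPixel = kBitsPerPixel,
        .colorSpace = NULL,                              /* default RGB color space */
        .bitmapInfo = (CGBitmapInfo)kCGImageAlphaFirst,  /* alpha comes first: ARGB */
        .version = 0,
        .decode = NULL,
        .renderingIntent = kCGRenderingIntentDefault,
    };

    /* CGImageRef -> vImage_Buffer; the background color argument is unused here. */
    vImage_Buffer buffer;
    vImage_Error err = vImageBuffer_InitWithCGImage(
        &buffer, &format, NULL, inImage, kvImageNoFlags);
    if (err != kvImageNoError) {
        return NULL;
    }

    /* ... image processing on `buffer` would go here ... */

    /* vImage_Buffer -> CGImageRef; no callback or user data is needed. */
    CGImageRef outImage = vImageCreateCGImageFromBuffer(
        &buffer, &format, NULL, NULL, kvImageNoFlags, &err);

    free(buffer.data);  /* with no flags the new image holds its own copy */
    return (err == kvImageNoError) ? outImage : NULL;
}
#endif
```

Because the format has not changed, the same vImage_CGImageFormat value is passed in both directions, as the talk describes.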

All of this is built around a really powerful API that we're introducing now called vImageConvert_AnyToAny.

What vImageConvert_AnyToAny does is it converts between the image format specifiers that we just saw, so you'll create two of these format types, one for the source and one for the destination type, and you'll create a converter.

Once you've created this converter, you can then convert as many images as you want from that source type to that destination type.

So this is one of those cases where you want to create that converter once and use it as many times as you can.

vImageConvert_AnyToAny is really fast, and I want to show you an example of this with a real world application.

I want to show you that with software JPEG encode performance running on the iPhone 5.

What I have here is a graph.

On the y-axis I've got megapixels per second, so this is the rate at which we can perform that software JPEG encode.

On the x-axis I have various image format types.

For the sake of this example, think of this software JPEG encode as happening in two steps.

Step one is to convert from our input image format type, so those that we see on the x-axis, to the image format type that the encode step consumes, and the second step is to perform the actual encode.

What we're interested in here is step one, so converting from the input image format type to the format type consumed by the encode.

Let's take a look at the performance the original way.

We see a few things here.

First we see a lot of variability.

For example, if you start from an 8-bit RGBA image, your encode performance is going to be almost twice as fast as if you start from a floating point RGBA image.

The reason that this is happening is because step one is so variable.

So what we wanted to do is change just step one.

We replace step one now with vImageConvert_AnyToAny, and let's look at the performance.

We see everything gets a lot faster now.

We also see that the performance is quite consistent.

So our 8-bit RGBA image is now only a few percent faster than our floating point RGBA image.

The reason that this happens is because we reduced the amount of time that we spent in step one, converting from the input image format to the other format, to a very small percent of the overall operation.

This type of result is what you can expect in your applications.

This is a real world application.

vImage is delivering great performance and consistent results.

I want to stay on the topic of conversion for a little bit longer.

I want to talk about an example of scaling a premultiplied image.

A lot of people will have an image format and they'll have it in a vImage buffer and they'll want to scale it.

They'll look through vImage and see that the only way you can scale an image is in a non-premultiplied image format.

So the way that you need to do this is three steps in vImage.

I'm not going to go into the details of each of these steps, but in step one, we're going to unpremultiply the data.

In step two, we're going to perform the scale.

And then in step three we're going to premultiply the results of that output.
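
The talk skips the details of these steps, so here is a small C sketch of what premultiplying and unpremultiplying mean for a single 8-bit component. This is plain integer arithmetic written for illustration, not vImage code, and vImage's exact rounding may differ. In vImage itself the three steps would use functions such as vImageUnpremultiplyData_ARGB8888, vImageScale_ARGB8888, and vImagePremultiplyData_ARGB8888. The names premultiply8 and unpremultiply8 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Premultiply: scale a color component by alpha / 255, rounding to nearest. */
uint8_t premultiply8(uint8_t color, uint8_t alpha) {
    return (uint8_t)((color * alpha + 127) / 255);
}

/* Unpremultiply: undo the scaling. A premultiplied component never exceeds
   its alpha, and a fully transparent pixel carries no color to recover. */
uint8_t unpremultiply8(uint8_t color, uint8_t alpha) {
    assert(color <= alpha);
    if (alpha == 0) {
        return 0;
    }
    unsigned v = (color * 255u + alpha / 2u) / alpha;
    return (uint8_t)(v > 255u ? 255u : v);  /* clamp defensively */
}
```

Note that the round trip is slightly lossy in 8 bits: a component of 200 at alpha 128 premultiplies to 100 and comes back as 199.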

A lot of people see this as three times the amount of work, and they get afraid and they go off and they implement their own scale.

I want to show you how much time we spent in each of these steps.

What I have here is the percentage of time in each of those three same steps as we saw them.

At the top we see unpremultiply, a little over 1%; at the bottom we see the premultiply, a little over 1/2%.

The vast majority of time is spent in the actual operation.

What I want you to take away from this is don't be afraid of the conversions; they're fast.

If your image isn't in the right format, use the conversions.

It's going to be worthwhile getting the image into the right format.

Now I want to talk about some performance of vImage as compared to some of the other options, and I want to do that by comparing to OpenCV.

OpenCV is a third party open source computer vision library.

It has an image processing module.

That image processing module has a lot of the same functionality that vImage has.

There's a couple points that I want to compare.

The first is execution time.

Everybody wants their applications to run fast.

The second is energy consumed.

We're increasingly reliant on our batteries so it's important that we get that performance while being aware of the energy consumption.

To begin we'll look at the execution time and we'll do that by looking at the speedup of vImage over OpenCV.

So on this graph I've got numbers where numbers above 1 means vImage is going to be that many times faster than OpenCV, and for numbers below 1 it means OpenCV is going to be faster.

I've got a handful of operations here, and we see that vImage is between 1.6 and over 20 times faster than OpenCV, so these are some really great performance results.

But as I mentioned, it's not just all about performance.

We're concerned also with energy consumption and battery life.

I want to explain this relationship between performance and energy consumption and battery life a little bit, and there's a few points.

First, fast code tends to decrease energy consumption, therefore, fast code tends to increase battery life.

Let's look at why this tends to happen.

What I have here is a typical energy consumption profile.

So we're measuring the instantaneous power.

Energy is the area underneath that power curve.

So on the x-axis I've got time.

On the y-axis I've got our instantaneous power measurement.

In the beginning we're running at some idle state and using a very small amount of power.

At time t0 our application begins and we increase the amount of power that we're consuming.

The application runs through time t1 and we return back to some idle state.

The amount of battery that we're using, the energy consumption, is the area underneath this curve.

Let's look at how an optimized routine compares to an unoptimized routine.

So here in blue I've got an optimized routine, which is much faster.

In certain situations it's going to take more power to make that routine run faster, but the important part here is that the energy consumption is the area underneath, and we can see that the optimized routine is using significantly less energy.
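
To make the "area under the power curve" point concrete, here is a minimal C sketch that integrates sampled power with the trapezoidal rule. The function name energy_joules is hypothetical, and the sample values mentioned below are invented for illustration; they are not measurements from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Energy in joules as the area under a power curve (watts) sampled every
   dt_seconds, approximated with the trapezoidal rule. */
double energy_joules(const double *power_watts, size_t samples, double dt_seconds) {
    assert(power_watts != NULL && dt_seconds > 0.0);
    double energy = 0.0;
    for (size_t i = 1; i < samples; ++i) {
        energy += 0.5 * (power_watts[i - 1] + power_watts[i]) * dt_seconds;
    }
    return energy;
}
```

With made-up one-second samples, a routine drawing a steady 1.0 W for four seconds ({1.0, 1.0, 1.0, 1.0, 1.0}) uses 4.0 J, while one that draws 1.5 W for one second and then drops to a 0.1 W idle ({1.5, 1.5, 0.1, 0.1, 0.1}) uses about 2.5 J, even though its peak power is higher.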

So now let's look at that same vImage OpenCV comparison for the energy numbers.

So I've got the vImage energy savings over OpenCV here.

So again, numbers above 1 mean vImage is using that many times less energy than OpenCV, and for numbers below 1 it means OpenCV is using less energy.

This ranges from 0.75 up through almost 7 times less energy.

So we're delivering really great performance, and we're also delivering really great energy savings.

This is what you can expect in your applications.

We love to get feedback about use of the Accelerate Framework and we found this tweet I wanted to share with you: "Using vImage from the Accelerate Framework to dynamically prerender my sprites, it's the only way to make it fast."

Now I want to move on to the next big category of operations available in the Accelerate Framework and that is digital signal processing.

You'll find digital signal processing in vDSP.

This is our Vectorized Digital Signal Processing library.

In vDSP you'll find basic operations on arrays: additions, subtractions, multiplies, conversions, accumulations.

You'll also find discrete Fourier transforms, discrete cosine transforms, as well as convolutions and correlations.

In both iOS 7 and OS X 10.9, we've introduced some great new features and functionality.

The first of these is a multi-channel IIR filter.

This is an infinite impulse response filter.

So whereas before if you needed to perform an IIR filter on multiple channels, maybe you have a surround sound system that you want to filter, you'd have to do that with individual calls into an IIR filter.

Now with this new multi-channel you can do that with a single function call, and we've been able to give you some great performance and energy savings by doing that operation in a single function.

We've also improved power of 2 support for the discrete Fourier transform and the discrete cosine transform.

I want to talk about this with an example.

So before we essentially had two entry points for the same operation based on the number of points that you wanted to evaluate.

So if you had a power of 2, you would call into the FFT.

If you had a non-power of 2 you would call into the DFT.

Starting in OS X 10.9 and iOS 7, the DFT supports certain powers of 2.

When the DFT supports the number of points that you want to compute, we recommend that you use the DFT.

So this brings up another question: How can I be sure that my number of points is supported?

If you can't find it in the documentation for some reason, you can always programmatically check.

The DFT is one of the routines that requires a setup structure, and that setup structure is designed to return 0 if the number of points isn't supported.

You can always be sure that you're using the correct routine.

Let's look at an example of the DFT.

Again, we'll start by including the Accelerate Framework, then we're going to create and prepare our data.

In this case we've got 4 buffers, 2 input buffers, one for the real numbers and one for the imaginary numbers, 2 output buffers again, one for the real and one for the imaginary.

We want to align these if possible.

Then we're going to perform a DFT setup, and we're going to do that with vDSP_DFT_zop_CreateSetup.

It takes a few arguments.

The first argument is information about any previous setups that may have occurred.

We don't have one in this case so we'll pass zero or null.

The next is the number of points that we want to compute, 1024, and then information that describes the DFT that we want to perform, in this case the forward DFT.

Once we've created a setup, we're going to execute our DFT.

We do that with vDSP_DFT_Execute, which takes that setup structure that we just created and the 4 buffers that we had set up before.

Again, we want to do this as many times as we can with that same setup structure.

We can use it over and over again.

Once we've done all the computation, one time at the end we want to clean up our setup with vDSP_DFT_DestroySetup.

So I want to do another comparison now: vDSP versus FFTW.

FFTW is called the Fastest Fourier Transform in the West.

This is another third party freely available library; it supports one and multidimensional transformations, on both real and complex data.

It's parallel.

It's a good freely available library.

It's a fair comparison.

I'm going to show again the vDSP speedup over FFTW on the iPhone 5.

So again, numbers above 1 mean vDSP is going to be that many times faster than FFTW, and for numbers below 1 FFTW is going to be faster than vDSP.

Across the x-axis I have several numbers of points that we're going to execute.

Let's take a look at the performance that we get.

We see that vDSP is between 1.8 and about 2.5 times faster than FFTW for all of the numbers of points that we looked at. Some really great performance results.

It's one thing to look at benchmarks, though.

It's another thing to look at the performance that you can expect from a real application.

So imagine you need to encode an audio signal using AAC Enhanced Low Delay.

This is a process that's done in FaceTime.

The DFT is one of many DSP routines in use, but it's the only one that we're looking at here.

And we're going to look at this by looking at the percentage of time that we spend in the DFT.

So what I've got here is the percentage of time for the DFT at 54% and at 47% is everything else in the operation.

This is when we're linking against FFTW.

The only thing we change is we link against vDSP so that we get the DFT out of vDSP.

And let's look at how this changes.

When the DFT is replaced with the DFT out of vDSP, the time spent goes to 30%.

This translates to significant performance and energy savings.

This is what you can expect in your applications.

A few more details about what vDSP supports.

It supports single and double precision, both real and complex values, as well as strided and non-strided data accesses.

So again, we love to get feedback.

Another tweet about using vDSP: "Want to do FFT on iOS? Use the Accelerate Framework. Highly recommended."

Thank you.

So now I want to move on to transcendental math functions.

And for that, I'm going to turn it over to Luke.

Luke: Hello, everyone.

My name's Luke Chang.

I'm here to talk about math functions.

In our group, we support math for every data length.

For scalar data, we have Libm; it takes a scalar input and returns a scalar output.

If you're writing vector code, we have vMathLib.

It takes a SIMD vector as input and then returns a SIMD vector as output.

And if you want to handle a lot of data, we have vForce.

It takes arrays as input and then returns arrays as output.

We're going to talk about them one by one.

First, Libm.

It's a standard C math library; it has a collection of [inaudible] like exponential, logarithm, trigonometry, and power functions.

You're probably very familiar with it, so I'm going to talk about what we added this year for Libm.

What we added is an extension to the C11 standard, so we prefixed the function names with double underscores.

They are available on both iOS 7 and Mac OS X 10.9.

They are the power of 10 function, trigonometry in terms of pi, and sine and cosine pairs.

First, power of 10, why do we add power of 10?

It's a very common operation in decibel calculation, so if you're writing audio apps, you need quite a lot of it.

Without a specific power of 10 function you have 2 options. One is to use pow with the constant 10 as the base.

However, this is inefficient, because pow is designed to handle generic inputs.

If you know your base is a constant, there are a lot of optimizations that we can do to make it go faster.

The other way is to use exp.

You can prescale your input by log(10) to do power of 10.

But it has its own problem.

It's not accurate.

There's rounding error in the multiplication.

For example, if you want to calculate 10^5 using this method, you will not get exactly 100,000.

There's a small error at the end.

That's why we added __exp10, so you can do power of 10 faster and more accurately.

Next is trigonometry functions in terms of pi.

Basically it's the same as the regular trigonometry function with your input scaled by pi.

It is faster because we can do argument reduction faster.

It's much easier to reduce the argument by a multiple of 2 than by a multiple of 2 pi.

It's also more accurate when you're dealing with degrees.

For example, if you want to calculate cosine of 90 degrees, 90 degrees [inaudible] into 1/2 pi.

With the regular trigonometry function you will have to say cos(pi * 0.5), and you will not get 0 back; you will get a very small number, because the value of pi is not exact.

So if you use __cospi(0.5), you will get exactly 0 back.

There's no error.

Next, sine/cosine pairs.

A lot of times when you calculate sine, you'll need cosine for the same value.

For example, if you want to do a polar-to-[inaudible] conversion you will need cosine for the x-axis and sine for the y-axis.

Because we do it simultaneously, there is only one argument reduction.

You don't have to do the argument reduction twice, which saves time.

And what's even better is that the compiler recognizes that we have __sincos, so it will optimize your code into calling __sincos without you even knowing it.

Of course, if you want to call __sincos yourself, you can.

We also added C11 support for CMPLX.

This macro is used to define a complex number.

Without this, you're more likely to write the real part + imaginary part * I.

But in that expression there's an addition and a multiplication, so sometimes you will not get what you expect, like this example: 0.0 + infinity * I.

Using CMPLX allows you to specify the real part and the imaginary part of the complex number directly, so you don't have to worry about the multiplication.

We also have CMPLXF and CMPLXL for float and long double.
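
The surprising case can be reproduced with a short C11 sketch. The two function names are hypothetical. With I defined as a complex constant (as is typical when the compiler has no imaginary types), infinity * I multiplies infinity by the zero real part of I, which produces a NaN.

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* Builds re + im*i with arithmetic: the multiplication and addition can
   disturb the real part when im is infinite. */
double complex from_expression(double re, double im) {
    return re + im * I;
}

/* Builds re + im*i directly from its two parts, with no arithmetic. */
double complex from_macro(double re, double im) {
    return CMPLX(re, im);
}
```

With these, creal(from_expression(0.0, INFINITY)) is NaN, while creal(from_macro(0.0, INFINITY)) is 0.0 and the imaginary part is infinity.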

So that's the new addition to Libm.

vMathLib is a SIMD vector math library.

It is designed to take a SIMD vector as input and then return a SIMD vector.

Similar to libem, it has a collection of [inaudible] functions.

We prefix the function name with a single v, so we have vexp, vlog, vsin, et cetera.

You want to use vMathLib when you're writing your own vector code.

The Accelerate Framework provides a wide range of functionality, but sometimes you have your own special algorithm that you write, and you want it to be fast, so you write it in vector code.

What if you need sine, for example?

You could use Libm and then use a for loop to iterate through each element in the SIMD vector.

But obviously you're not going to take full advantage of the vector unit, so we can replace it with vMathLib.

Instead of including math.h, you include the Accelerate header, Accelerate/Accelerate.h. Instead of the for loop you make one function call to vsinf.

It will take your SIMD vector and then return the result as a SIMD vector.

Then you can go on with your vector code.

The code looks simpler, cleaner, and it's also faster.

So that's vMathLib.

You use it when you write your own vector code.

Next, vForce.

vForce is designed to handle a lot of data; it's also called the vectorized math library.

It works on arrays, so we prefix the function name with a double v: vvexp, vvlog, vvsin, et cetera.

Let's say you want to write a signal generator app and you want to generate a sine wave, for example.

You can do it with Libm: again, write a for loop and go through each element in your buffer. But you could do better by using vForce.

Here's how.

Instead of using a for loop, you make one function call to vvsinf.

You pass in the output buffer, the input buffer, and a pointer to the length.

The generated sine wave will be ready in the output buffer right after this function call.

Again, the code looks simpler, cleaner, and most importantly, is faster.

Let's look at the performance measured on the iPhone 5.

As you can see, vForce is more than twice as fast as using a for loop.

Within the same amount of time it can generate more than twice as many results as the for loop.

That's not all.

It also has great energy performance.

It uses a lot less energy than using a for loop.

vForce uses only about 60% of the energy that a for loop does.

So your app will run longer and you will not drain the battery. And we did not cherry-pick just vvsinf to show you the performance.

There is performance improvement across the board.

The graph doesn't even fit into the screen.

For truncf, vForce is more than 5 times faster than using a for loop.

For all the other functions, it is at least twice as fast as using a for loop.

A few words about vForce.

vForce supports single and double precision floating point numbers.

It handles edge cases correctly, so if you have infinities or NaNs in your input, you don't have to worry about them.

vForce will handle the edge cases correctly.

vForce requires minimal data alignment.

We only require native data alignment: a single precision floating point number is 4-byte aligned, and a double precision floating point number is 8-byte aligned.

It supports in-place operation, so you don't have to create a temporary buffer.

That minimizes memory movement.

We get this question a lot.

Like Jeff mentioned before: how much data is enough for using vForce, or any other Accelerate function, to be beneficial?

Well, for vForce, I can give a rule of thumb; that is, if you have more than 16 elements in your array, consider using vForce.

Of course, the actual crossover point may vary for each function in vForce, but if you have more than 16, you're probably good to go.

So that's vForce.

I'm going to hand the presentation back to Jeff.

He'll talk about linear algebra, my favorite section of the presentation.


Jeff: Thanks, Luke.

So for linear algebra we've got the industry standard LAPACK and BLAS libraries.

LAPACK is Linear Algebra Package, and BLAS is Basic Linear Algebra Subprograms.

Let's begin with LAPACK.

In LAPACK you'll find high level linear algebra functionality.

This includes things like solving systems of linear equations, performing matrix factorizations, as well as computing eigenvalues and eigenvectors.

One of the great ways to tell how you're doing with LAPACK and BLAS is to look at the LINPACK benchmark.

So as I mentioned these are industry standard.

They've been around a long time, and people came up with the LINPACK benchmark to see how they're doing.

The LINPACK benchmark is essentially answering the question, how fast can you solve a system of linear equations?

There's a couple variations of the LINPACK benchmark.

The one that we're going to look at here is using a matrix of 1,000 x 1,000 elements.

Let's look at the performance.

So this is the LINPACK performance of Brand A.

Two years ago we did this comparison and we compared Brand A.

We looked around at all the published benchmarks that we could find, and they were at 40 megaflops.

In 2 years there's been a lot of time and improvements have been made, and that performance for Brand A has come up to 788 megaflops, just under a gigaflop. Pretty good.

Let's look at the performance of the LINPACK benchmark using the Accelerate Framework.

1,200 megaflops. This is 1.2 gigaflops.

This is pretty good.

There's just one thing.

We've had 2 years, too.

This is the performance running on the iPhone 4S.

Let's look at the performance of the Accelerate Framework running on the iPhone 5.

It's quite a bit better.

Thank you.

Well, the LINPACK benchmark using the Accelerate Framework on the iPhone 5 is at 3,400 megaflops.

That's 3.4 gigaflops.

This is a phone that fits in your pocket and runs on a battery.

This is really impressive.

As I said, the LINPACK benchmark's been around for a while, and so we wanted to do a comparison to an older machine for fun.

And so we're going to compare the iPad with the Retina display to a Power Mac G5.

For those of you that have been around for a while, you might remember some of the bake-offs with the Power Mac G5, so we're having a triumphant return.

This is a 10-year-old machine, and if any of you remember this machine, it's returning with all fans blazing.

I think there's 7 case fans; when you turn it on, you know it's in the room.

When you run the LINPACK benchmark, it sounds like you're driving down the highway with your head out the window.

Let's look at the performance.

The LINPACK benchmark on the Power Mac G5 is 3,643 megaflops.

Let's see how the iPad compares.

It just edges it out at 3,686 megaflops. Pretty impressive for a little tablet.

Thank you.

Let's look at an example of how to use LAPACK.

As always, we'll begin by including the Accelerate Framework header, and then we're going to create and prepare our data, so we'll create 2 matrices, A and B, which describe the system that we want to solve.

In this case, we're going to use a system solve that performs pivoting, so we need a vector to contain information about the pivots, and then we're going to perform the solve with DGESV.

There's a couple things I want to point out.

So as I mentioned, LAPACK is an industry standard; it's been around for a while.

It was originally written in FORTRAN and is still maintained in FORTRAN, so the entry points look like this.

It's going to be dgesv followed by an underbar.

It also means that all the values are going to be passed by reference, just something to be aware of.

It's pretty easy to get tripped up with this.

But to perform the system solve, we simply pass in the size of the matrix in N, the number of right-hand sides, which is the number of systems that we're going to solve, the matrix, the leading dimension of the matrix, the pivot vector that we created, and the right-hand sides, B.

Info will capture any errors that happen in this operation.

It's pretty easy to solve a system of linear equations with LAPACK.
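
As a rough sketch of the call just described: the wrapper name solve_linear_system is invented here, and the sketch assumes the classic Accelerate interface where LAPACK integers are plain int. On Apple platforms it calls dgesv_ with every argument passed by reference. On other systems it uses a small Gaussian elimination with partial pivoting as a stand-in, only so that the example compiles without Accelerate.

```c
#include <assert.h>
#include <math.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Solve A * X = B. A is n x n and B is n x nrhs, both column major.
   On return, A holds its LU factors, B holds the solution X, and ipiv
   holds the 1-based pivot rows. Returns LAPACK's info value: 0 on
   success, i > 0 if the i-th pivot is exactly zero.
   Hypothetical wrapper, not an Accelerate API. */
int solve_linear_system(int n, int nrhs, double *a, int *ipiv, double *b) {
    assert(n > 0 && nrhs > 0 && a && ipiv && b);
    int info = 0;
#ifdef __APPLE__
    int lda = n, ldb = n;
    /* FORTRAN entry point: trailing underbar, all arguments by reference. */
    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);
#else
    /* Stand-in: Gaussian elimination with partial pivoting. */
    for (int k = 0; k < n; ++k) {
        int p = k;
        for (int i = k + 1; i < n; ++i) {
            if (fabs(a[i + k * n]) > fabs(a[p + k * n])) p = i;
        }
        ipiv[k] = p + 1;
        if (a[p + k * n] == 0.0) return k + 1;  /* singular matrix */
        if (p != k) {                           /* swap rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = a[k + j * n];
                a[k + j * n] = a[p + j * n];
                a[p + j * n] = t;
            }
            for (int j = 0; j < nrhs; ++j) {
                double t = b[k + j * n];
                b[k + j * n] = b[p + j * n];
                b[p + j * n] = t;
            }
        }
        for (int i = k + 1; i < n; ++i) {       /* eliminate below pivot */
            double m = a[i + k * n] / a[k + k * n];
            a[i + k * n] = m;                   /* keep the L multiplier */
            for (int j = k + 1; j < n; ++j) a[i + j * n] -= m * a[k + j * n];
            for (int j = 0; j < nrhs; ++j) b[i + j * n] -= m * b[k + j * n];
        }
    }
    for (int j = 0; j < nrhs; ++j) {            /* back substitution */
        for (int i = n - 1; i >= 0; --i) {
            double s = b[i + j * n];
            for (int c = i + 1; c < n; ++c) s -= a[i + c * n] * b[c + j * n];
            b[i + j * n] = s / a[i + i * n];
        }
    }
#endif
    return info;
}
```

For example, with A = [2 1; 1 3] stored column major as {2, 1, 1, 3} and B = {3, 5}, the solution written back into B is approximately {0.8, 1.4}.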

Next is BLAS.

So LAPACK is the higher level linear algebra operations.

It's built heavily on BLAS, the lower level linear algebra operations.

All of BLAS is available through the Accelerate Framework.

It's typically broken down into three categories: vector operations, such as dot product, scalar product, and vector sums; matrix-vector operations, such as matrix-vector product and outer product; and matrix-matrix operations, like matrix multiply.

Let's look at an example of how to use BLAS in the Accelerate Framework.

We'll begin by including the Accelerate Framework header.

As always we'll create and prepare our data, so we'll align these buffers if we can.

In this case we have 2 operand matrices, A and B, and the result matrix, C.

And then we're going to call into cblas_dgemm.

BLAS supports both row and column major, so the first argument specifies whether we're row or column major.

The next 2 arguments specify if we want to perform a transpose on the 2 operand matrices.

It's important with BLAS and LAPACK to understand that these transposes don't actually happen; the operation is organized such that the transposes are implied.

And then the last several arguments to this function are information about the size of the matrices, the matrices themselves, their leading dimensions, and any scalar values which will scale the operands or the result matrix.
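
The argument order described above can be sketched as follows. The wrapper multiply_row_major is a made-up name for this example. On Apple platforms it calls cblas_dgemm with row-major storage, no transposes, alpha = 1 and beta = 0; on other systems a plain triple loop stands in so that the example compiles without Accelerate.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* C = A * B for row-major matrices: A is m x k, B is k x n, C is m x n.
   Hypothetical wrapper, not an Accelerate API. */
void multiply_row_major(int m, int n, int k,
                        const double *a, const double *b, double *c) {
    assert(a != NULL && b != NULL && c != NULL);
    assert(m > 0 && n > 0 && k > 0);
#ifdef __APPLE__
    cblas_dgemm(CblasRowMajor,   /* storage order                     */
                CblasNoTrans,    /* use A as given                    */
                CblasNoTrans,    /* use B as given                    */
                m, n, k,         /* sizes                             */
                1.0, a, k,       /* alpha, A, leading dimension of A  */
                b, n,            /* B, leading dimension of B         */
                0.0, c, n);      /* beta, C, leading dimension of C   */
#else
    /* Portable stand-in so the sketch builds where Accelerate is missing. */
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
#endif
}
```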

Just to cover some of the data types and details supported by both BLAS and LAPACK: they both support single and double precision values, both real and complex, and multiple data formats for your matrices, so dense matrices, banded matrices, triangular matrices.

As we saw before, they support transposes as well as conjugate transposes, and again, these disappear in the operation.

They aren't explicit transposes.

And then finally, BLAS supports both row and column major while LAPACK only supports column major.

Another tweet I wanted to share with you: "Playing with the Accelerate Framework today, having BLAST."

So in summary, there's a lot of functionality in the Accelerate Framework.

You'll find image processing in vImage, digital signal processing in vDSP, transcendental math functions in vForce and vMathLib, and linear algebra in LAPACK and BLAS.

When you think Accelerate Framework, think easy access to all this functionality, over 2,000 APIs.

It's accurate: we tested so that you don't have to.

You're going to get great performance with low energy usage.

It's going to work great on OS X and iOS, and it's going to work on the complete Apple hardware lineup, everything that's available now and everything that's to come.

Just a recap of the tips to be successful with the Accelerate Framework.

When you're preparing your data, if you can make the buffers contiguous and you can align the beginning of those buffers to a 16-byte boundary, we can in some cases get you slightly more performance.

Again, the Accelerate Framework is always going to give you the best performance possible even when you can't meet these recommendations.
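
One possible way to get a contiguous buffer that starts on a 16-byte boundary is the C11 aligned_alloc function. This is a sketch, not something shown in the session, and the helper names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` contiguous floats whose first element sits on a
   16-byte boundary. Returns NULL on failure; release with free().
   Hypothetical helper, not an Accelerate API. */
float *alloc_aligned_floats(size_t count) {
    size_t bytes = count * sizeof(float);
    /* aligned_alloc expects a size that is a multiple of the alignment,
       so round up to the next multiple of 16 (and never ask for 0). */
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    if (rounded == 0) {
        rounded = 16;
    }
    return (float *)aligned_alloc(16, rounded);
}

/* True when the address is a multiple of 16. */
int is_16_byte_aligned(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}
```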

Understand the problem size.

For small problem sets, the Accelerate Framework might not be able to deliver the best performance.

It's always going to deliver the functionality, though.

Finally, do setup and destroy once.

If you find yourself creating a setup structure, use that setup structure as many times as possible.

The Accelerate Framework is for you guys, and so I want to leave you with this.

If you need a feature, please request it.

The best way to do that is by filing a bug.

And one more tweet: "The discrete cosine transform was my feature request that made it into the Accelerate Framework.

I feel so special."

So we do listen.

Please request.

And then lastly: "Thanks, Apple, for making the Accelerate Framework."

Thank you, guys, for making it a success.


Just a little more information here, if you guys need to get in touch with us, contact Paul or George.

There's some documentation available online, and as always, check the Apple developer forums.

That's all we got, thank you, guys.

