What's New in LLVM

Session 405 WWDC 2016

The Apple LLVM compiler in Xcode 8 adds new language features, improved diagnostics, and more powerful optimizations. Get an overview of some new Objective-C and C++ features and learn how to use advanced optimizations to speed up your apps.

[ Music ]

[ Applause ]

Hi, I'm Alex Rosenberg.

I'm incredibly excited to share with you some amazing new features we have in store for the Apple LLVM Compiler.

But first, I want to talk a little bit about LLVM, the project upon which we build the Apple LLVM Compiler.

LLVM is a modular framework for building compilers and related tools.

However, it need not be used solely in this traditional way.

We all know and love Xcode, which uses many of the LLVM frameworks internally.

And now the amazing, new Swift Playgrounds app, has joined it with similar internal uses of LLVM.

LLVM is also an integral part of the inner workings of Metal and our other graphics APIs.

LVVM is open source.

Apple has a long history with open source compilers and languages.

The Swift language was born out of the LLVM project.

We're very happy with how you, the community of contributors, have move Swift forward via the open evolution process on Swift.org.

That process, and the Swift development processes were informed by the existing LLVM project processes.

LLVM has a thriving open source community, backed by the nonprofit LLVM Foundation.

LLVM is highly customizable and composable.

By leveraging the same robust and mature infrastructure, underpinning the Clang front end, we power the Swift compiler as well.

For more info about Swift, review the talk from earlier.

Let's take a brief look into some of these frameworks available from open source and how they might be used.

Clang is the front end for C, Objective-C, and C++.

Its libraries provide for advanced features, like static analysis, and in place for your writing, for code mitigation, and modification.

It also powers development environment integration features, like source code indexing and intelligent code completion that we all love in Xcode.

The open source tooling library helps harness the power of Clang for your own custom command-line tools to process source code.

Imagine the amazing things you could do with the power of a full parser at your command, running on your code.

The LLVM optimizer, which we'll speak more about in a little bit, has a full suite of modern compiler optimizations.

And this is where features like Link-Time Optimization happen.

Stay tuned for some exciting developments in LTO.

And then we have the back end, where final code generation happens.

There are libraries for object file manipulation, features like assembly and disassembly, and even advanced features like just-in-time compilation can be simply plugged together.

LLVM has many other tools composed from its frameworks.

Let's take a look.

As part of a complete LLVM based toolchain, there are the usual suite of binary file utilities, including a burgeoning open source effort on a framework for making linkers and related tools.

But perhaps the most prominent of these projects is LLDB.

LLDB combines many of the libraries from Clang, Swift, the Code Generator, and its own suite of debugger frameworks.

There will be a great session of LLDB tips and tricks on Friday.

All of the amazing features in Xcode, Swift Playgrounds, and the Apple LLVM Compiler wouldn't be possible without the contributions from the LLVM open source community.

The community is made up of many developers like you, inspired to bring their own ideas into the LLVM frameworks.

The LLVM open source project is growing at a stunning pace that would be extremely challenging for any one team, at any one company, to match.

We'd like to invite you to check out the project, at LLVM.org, and see how you can fit into the community and contribute your unique ideas.

And now I'd like to introduce Duncan, who will go over some of the exciting new language features we have in the Apple LLVM Compiler.

[ Applause ]

Let's talk about language support.

First, we'll talk about new language features, then updates to the C++ library, and finally new errors and warnings to help improve your code.

Let's start with new language features.

Objective-C now supports class properties.

This feature started in Swift as type properties, and we've brought it to Objective-C.

Interoperation works great.

In this example, the class property someString is declared using property syntax, by adding a class flag.

Later, this someString property is accessed using dot syntax.

Class properties are never synthesized.

You can provide storage, a getter, and a setter, in the implementation.

Or you can use @dynamic to defer resolution to runtime.

Over to C++.

LLVM has had great support for C++11 for years.

The only holdout has been support for the thread-local keyword.

This year, we added support.

Let me tell you a little about it.

If a variable is declared with a thread-local keyword, LLVM creates a separate variable per thread.

Initializers are called before the first use after thread entry, and destructors are called on thread exit.

C++ style thread-local storage, supports any C++ type.

And its syntax is portable with other C++ compilers.

The Apple LLVM Compiler, already has support for C-style thread-local storage, even when compiling C++ code.

There are two syntaxes available: One with a GCC style keyword, and another from the C11 standard.

C-style thread-local storage has lower overhead than C++ thread local, but it has restrictions.

It requires constant initializers and plain old data types.

If your code meets these restrictions, you should continue to use C-style thread-local storage for maximum performance.

Otherwise, use the C++ thread-local keyword.

Both work great.

Thread-local variables can help fix bugs in code that uses threads.

For more on thread related bugs, watch the Thread Sanitizer and Static Analysis talk.

That's it for new language features.

Let's move on to the C++ standard library.

Libc++ has been the default C++ standard library for years.

We've been encouraging you to move off of Libstandardc++ and in Xcode 8, we've deprecated it on all our platforms.

Please move on as soon as you can.

If your Xcode project still uses Libstandardc++, you should upgrade to Libc++ by changing the C++ standard library build setting.

Xcode's project modernization will offer to do this automatically.

The Libc++.dylib has a great update this year for all our platforms.

It has complete library support for C++14, as well as many other improvements.

Standard library features that require the dylib, now have availability attributes.

While Apple frameworks encourage deploying to old targets through the use of runtime checks, the C++ standard library availability checks are done at compile time.

To use C++ features that require the newest dylib, you need to target a platform that supports them.

That's the C++ library.

Between Xcode 7 and Xcode 8, we've added more than 100 new errors and warnings to help you find bugs in your code.

Let's talk about just a few of them.

In Xcode 7, we added a great feature to Objective-C called Lightweight Generics.

The kind of type modifier plays an important role, allowing implicit downcast from the kind of type to any of its subclasses.

In Xcode 8, we've improved the diagnostics for method lookup on kind of types.

In this example, getAwesomeNumber is declared inside My Custom Type, which inherits from NSObject.

Later, getAwesomeNumber is called on a kindof UIView.

This code is broken since My Custom Type is unrelated to UIView.

Xcode 8 gives an error here.

Type checking of methods called on kind of types, is restricted to the same class hierarchy.

This improved type checking also avoids misleading warnings when an unrelated type declares a method with the same name.

In Xcode 8, kind of types are way easier to use.

Next, containers in Objective-C, such NSArray or NSMutableSet can hold arbitrary objects.

However, the NSMutableSet called s in this example is added to itself.

This creates a circular dependency, which triggers a warning in Xcode 8.

Besides creating a strong reference cycle, a circular dependency can prevent some methods from being well-defined.

Infinite recursion.

This example implements the factorial function.

If n is positive, it returns n times factorial of n minus 1, recursing to calculate the answer.

If n is zero, it returns factorial of 1, also recursing.

The compiler now has a warning.

When all pass through a function, call itself.

This catches common cases of infinite recursion.

Here, one possible fix is to return 1 when n is zero.

Standard Move is a great C++ language feature.

It allows you to specify that ownership should be transferred from one container to another.

Moving a resource is faster than a deep copy.

However, using standard move on a return value, blocks named return value optimizations.

Typically, when a local variable is returned by value, the compiler can avoid making a copy at all.

In generateBars, calling standard move forces the compiler to move bars.

Although moving is fast, doing nothing is even faster.

LLVM now warns about this pessimizing move.

The fix is to avoid standard move on return values, getting back the lost performance.

Similarly, when a function takes an argument by value, and returns it by value, the compiler automatically uses standard move on the return.

There's no need to call it explicitly.

For example, the standard move on the return value in rewriteText is redundant.

Although this doesn't actively hurt performance, it makes the code less readable.

It's better to return text directly.

This is easier to maintain and consistent with returning local variables.

Finally, there are new warnings for references to temporaries in C++ range-based for loops.

Here, the loop is iterating through a vector of shorts, but the iteration variable i, is a const reference to an int.

Because of the implicit conversion between short and int, i is a temporary.

This can lead to subtle bugs, since it looks as if i is pointing inside the range, but it isn't.

The compiler now warns about this unexpected conversion.

One fix is to change i to a const reference of short.

Another is to make it clear that i is a temporary by removing the reference.

There is a similar warning when a range doesn't return a reference at all.

An iterator to standard vector of bool, does not return a reference.

So the iteration variable b in this example is a temporary.

It's surprising that b does not point inside the vector.

The compiler now warns about this unexpected copy.

The fix is to remove the reference making it clear that b is a temporary.

The new warnings for infinite recursion, standard move, and C++ range-based for loops are enabled by default.

To try them out in your expo project, add them to the Other Warning Flags Build setting.

That's it for new diagnostics.

Let's move on to advancements in compiler optimization.

We've made improvements throughout the LLVM compiler to optimize the runtime performance of your code.

We've picked just a few to highlight today.

We'll talk about improvements to Link-Time Optimization, highlight new code generation optimizations, and describe techniques for arm64 cache tuning.

It has been a couple of years since we've talked about Link-Time Optimization and we have major improvements to share.

Link-Time Optimization or LTO, optimizes the executable as a single monolithic unit.

It inlines functions across source files, removes dead code, and performs other powerful, whole program optimizations.

LTO blurs the line between the compiler and the linker.

To understand how LTO works, let's first look at the traditional compilation model.

Let's say we have four source files.

The first step is to compile, producing four object files.

The object files are linked with frameworks, to produce an app.

An LTO build starts the same way, by compiling the sources into object files.

In LTO, these object files contain extra optimization information which enables the linker to perform Link-Time Optimizations, and produce a single, monolithic object file.

The output of LTO is linked with frameworks such as Foundation, to produce the app.

LTO maximizes performance.

Apple uses LTO extensively on our own software, typically speeding up executables by 10 percent compared to regular release builds.

Its effects multiply when combined with profile guided optimization.

It can also reduce code size dramatically, when optimizing for size.

However, it has cost at compile time.

The monolithic optimization step can have large memory requirements, doesn't take advantage of all your cores, and has to be repeated in incremental builds.

Large C++ programs with debug info, have been the most expensive to compile.

Over the last two years, we've worked hard to reduce the overhead.

For example, let's look at the memory usage for linking the Apple LLVM compiler itself with LTO and debug info.

Smaller bars are better.

In Xcode 6, this took over 40 gigabytes.

Since then, we've reduced the memory usage by 4x.

We've also reduced compile time by 33 percent.

The Line Tables Only debug information level uses even less memory in LTO.

Linking LLVM itself, now takes only 7 gigs.

LTO has never been better.

But there is still a compile time trade-off, particularly for incremental builds.

We have an exciting new technology that designs away these costs.

Incremental LTO scales with your system.

It performs global analysis and inlining without combining object files.

This speeds up the build because it can optimize each object file in parallel.

Moreover, since the object files stay separate, and you build cache in the linker keeps incremental builds superfast.

Let's look at how an incremental LTO build works.

The compile step is the same as monolithic LTO, producing one object file for each source file.

Without combining object files, the linker runs an LTO analysis of the entire program.

This analysis feeds optimizations for each object file, enabling the objects to inline functions from each other, as well as other powerful whole program optimizations.

The LTO optimized object files are stored in a linker cache, before being linked with frameworks to produce the app.

The resulting runtime performance of programs built with incremental LTO, is similar to monolithic LTO, with a few benchmarks running slower, and a few running faster.

But it's fast at compile time too.

Let's look at build times with my favorite C++ project: The Apple LLVM compiler itself.

In this graph, smaller bars are better.

The time at the top is for building without LTO.

Monolithic LTO adds significant build time overhead, taking almost 20 minutes, instead of 6.

Incremental LTO is much faster at under 8 minutes, adding only 25 percent overhead.

Let's zoom in on the link step where LTO is doing its work.

Without LTO, linking the Apple LLVM compiler itself, takes less than two seconds.

You can't see the bar in the graph, because the linker isn't performing any compiler optimizations.

Monolithic LTO takes almost 14 minutes, when it can use all your cores.

Incremental LTO takes two minutes and a quarter, more than 6x faster than monolithic LTO.

Memory usage is great too.

Without LTO, linking the Apple LLVM compiler uses a little more than 200 megabytes.

As we saw earlier, monolithic LTO takes 7 gigabytes.

Incremental LTO uses less than 800 megabytes.

The scaling is incredible.

[ Applause ]

All these results are for a fresh build.

It gets even better.

With incremental LTO, incremental builds don't repeat unnecessary work.

Let's look at an example where the controller changes, and you start an incremental build of the app.

Changing the controller invalidates the link itself, but the other LTO object files are still in the linker cache.

However, if a function from the controller is inline into main, then main needs to be reoptimized at LTO time as well.

So we start the build, and only the controller needs to be recompiled.

After rerunning LTO analysis, both controller.O and main.O are optimized and new LTO object files are stored in the linker cache.

Then the LTO objects are linked as before to produce the app.

Incremental LTO gives you the performance you expect from an incremental build.

When a source file with small helper functions is changed, object files that use them get reoptimized.

But for a typical incremental build, most LTO object files are linked directly from the linker cache.

Let's take a final look at the time to link the Apple LLVM compiler itself.

The top three bars show the times for a fresh build from before.

If we change the implementation of an optimization pass, monolithic LTO takes the same amount of time.

It's the same as a fresh build.

But incremental LTO takes only 8 seconds.

This is 16x faster than the initial build, and 100x faster than monolithic LTO.

[ Applause ]

This is stunning.

State of the art, link time optimization, with low memory requirements and fast incremental builds.

Try incremental LTO today.

The improvements to LTO are fantastic, but if you're using LTO with a large C++ project, you can minimize compile time with the line tables only, debug information level.

This gives you rich back traces in the debugger, at the lowest cost.

That's it for Link-Link optimization.

I invite Gerolf on stage to talk about new code generation optimizations.

[ Applause ]

So, let's move on with optimizations in the code generator.

We put a lot of effort into the Xcode 8 Apple LLVM compiler to improve the performance of all apps.

In this section, I will talk about three of them: Stack packing, shrink wrapping, and selective fused multiply-add.

Let's start with stack packing.

This is about local variables and runtime stack memory.

The Apple LLVM compiler always had optimizations, that try to reduce the stack memory space.

In Xcode 8, the compiler got better at doing that than ever before.

And here is why.

Let's look at that example.

Pay attention to the definition of x inside the scope of the if statement.

Then look at the definition of y after the if statement.

If the compiler does not optimize this code snippet, you would expect that the two variables x and y, live at two different stack locations on the runtime memory stack.

However, according to C-style language rules, the lifetime of a variable ends at the scope where it is defined.

And this is what the compiler exploits.

Let's take a look what this does to our two variables.

x is defined inside the if statement.

The scope of that if statement ends just before y is defined.

When y is defined, it starts a lifetime of y and what you see in this picture is that x and y, the lifetimes for x and y, do not overlap.

So the compiler can assign them the same stack location.

This reduces the total memory stack required for your program, and that may provide better performance for your app.

Now this is all great, but there is a little caveat that comes with it.

Programming language rules are sometimes really hard to check.

And when they are violated, your program may give unpredictable results, or in a technical term, it may have undefined behavior.

Let's take a look at the example.

Pay attention to the pointer variable.

Inside the if block, it gets assigned the address of x.

Now look at the print statement.

At that point, the address of x is used via the pointer variable.

So what is happening here is that the address of x escapes the scope of the if statement, and that [inaudible] our language rules as undefined behavior.

Luckily, it is easy to fix.

It's just something to be aware of.

The way you fix this is by simply extending the scope of the lifetime for x by moving the definition of x outside the if statement, just before the condition is checked.

And now the definition of x is in the same scope as the address that is used in the print statement.

The takeaway from this is simply please make any effort you can think of to make your program adhere to programming language rules, that will give a much better experience for you, for the optimizing compiler, and for our users.

That is stack packing.

Let's move on to shrink wrapping.

Shrink wrapping is about the code the compiler generates in the entry and exit of your function.

It's resource management code that manages the runtime stack and the registers.

The observation here is that this code is not needed for all the paths through your function, and shrink wrapping will place this resource managing instruction only to places where they are actually needed.

So let's take a look at a simple example.

So here's a simple function that takes two parameters, a and b, simple compares the two.

And if a is smaller than b, then it calls a function foo which has the address of a local variable as a parameter, and eventually it returns.

To understand shrink wrapping, let's look at the pseudo assembly code akin to the code the compiler actually generates.

So here you'll see an entry code that allocates stack memory, saves registers, does the comparison, branches.

And the exit block that restores the registers, deallocates the stack, and returns.

And then if the condition is right, the function foo is called.

So what we see here is we have two paths in that program.

One from the entry to the exit block, and another path from the entry block to the call to the exit block.

The key observation here is that the resource managing instructions for the runtime stack memory and for the registers are only needed because we have a call instruction in our function.

So what shrink wrapping does, it recognizes this condition and moves these instructions out of the entry block and out of the exit block to the place where they actually need it.

So it shrinks the lifetimes of the resource managing instructions and wraps them around the region, where they are actually needed.

In this case, the region is just the call.

But now imagine, that the hot code in your function is the path from the entry to the exit.

No longer do we have to execute the resource allocating instructions.

And if this is hot functions, you have many functions like this.

You can imagine that this way we can save millions of instructions, providing nice performance gains, and also power savings for your app.

And that is shrink wrapping.

[ Applause ]

Let's move on with fused multiply-adds.

It takes us back to very simple arithmetical operations, additions and multiplications.

The arm64 processor has one instructions a fused multiply-add instructions instruction, that computes an expression like a plus b times c, in one single instruction.

You would naively assume that whenever you see an expression like this, it's always the best to generate this instruction.

And that's exactly what the compiler has done so far.

But it may surprise you.

In some cases, it's actually faster to generate two instructions, an add instruction, and a multiply instruction, to get faster performance for your app.

Why? I've prepared a simple example to demonstrate this.

This function takes four integer parameters and it computes a simple expression, a times b plus c times d.

What does the code look like when the compiler generates a single fused multipy-add instruction?

Well, it will compute a times b, then the multiply-add had to wait for that result.

And the multiply-add will compute the value of our expression.

How long does this take?

It takes four cycle for the multiply, four cycle for the add.

So in total, that simple sequence takes eight cycles.

When we generate 2 multiplies and 1 add, how can that be faster?

Let's take a look at that sequence.

First we issue the compiler issues a multiply, computes a times b, then computes c times d.

Eventually adds the result.

The secret here is that modern processors have instruction level parallelism.

They can execute two or more multiplies at the same time, in parallel.

So what is happening here is that the two multiplies get executed in parallel.

So within four cycles, we don't get the result of one multiply, but the results of two multiplies.

Then we just add the results in one cycle, and now we see that for this to compute the value of this expression, we only need five cycles.

Let's compare this with the sequence for with the sequence with a fewest multiply-add and now we see that for this simple sequence, we get a speed up of about 2x.

So with selective fused multiply-add, you can now speed up many simple expressions in your app.

And that is fused multiply-add.

[ Applause ]

So let's move on to arm64 cache tuning.

Here I will talk about two techniques: The compiler has to impact which data gets stored into the cache and the data that gets stored into the cache, impact the performance of your app.

Before we go into the details, what the compiler is doing, I want to quickly review the memory hierarchy.

At the top, you see the main memory that hosts the program variables.

And it's at the top.

At the bottom, you see the registers.

It's very slow to load data from main memory to the registers.

To bridge that gap, there's a cache, a temporary storage.

It's about 10,000 to 100 times to 100,000 times smaller than main memory, but it's much, much faster to load data out of the cache.

About 10 times to 100 times faster.

Then data is loaded out of main memory.

We not only load a single register value, but an entire cache line out of main memory.

So cache line contains more than one register value.

The reason why this design is usually so successful, is because your program data have two locality properties: Temporal locality and spatial locality.

Temporal locality simply means the data your program accesses now, it will access this very soon again.

Spatial locality simply means data your program accesses now, it will also access neighboring data.

So when you access an array field, it will also access the field next to it.

When it accesses a field in your data structure, it will also access a field next to that.

And so on and so forth.

Now, you look at this design, and you wonder so it's so fast to load data out of the cache.

Can we somehow magically preload the data into the cache, out of main memory while the processor is doing some other operations, so that all data get loaded quickly when we need them, when our programs need them?

It might surprise you, on your processor, in your iPhone, that magic is already happening.

It's hardware prefetching.

The processor looks at each address that is loaded, tries to find pattern in those addresses.

When it finds a pattern, it predicts which data your program will need in the future, prefetches this data out of main memory, again, while other complications are happening, puts the data into the cache and eventually when your program needs them, the program can load them quickly, out of the cache.

Now this year, we worked with the hardware architects to see how the compiler can even make that prefetching magic work better for your apps.

And we found a few patterns.

So what the compiler does now, it analyzes your source code, predicts which data your app will need in the future, issues prefetch instructions for this app, for this data.

Now while your app is running, the prefetch instruction executes, fetches the data from main memory, puts it into the cache, and when your program is ready to need those data, it simply and quickly loads it out of the cache.

Your program no longer has to wait for the data to be loaded out of main memory.

And that is the magic of software prefetching.

So far [ Applause ]

so far, I have talked about optimizations the compiler does automatically for you.

For the next optimization, non-temporal stores it will need your help.

To understand what is going on there, we need to take a look at what is happening when data is stored into main memory.

So again, let's take a look at our memory hierarchy, and assume we have a simple assignment in our program where we want to assign a value of say 100, to a variable in main memory, which has the current value of 55.

The first thing that is happening, the old data from the address you want to store the new value in is loaded into the cache.

And since we load data from main memory, not only the old data for that variable is loaded, but also the neighboring data also the neighboring data because we fill the entire cache line.

In the second step, we store the value in our register, into the cache line, in the cache.

And then finally, the data in the cache line or the cache line is needed for different data and our values get written back in the third step to main memory.

So it's a three step process.

Loading data from main memory into the cache, store the register value into the cache, and then eventually store the data back into main memory.

And this all because data usually have the locality property, and specifically temporal locality.

But what if your data do not have temporal locality?

Wouldn't it be much faster to simply store the value directly from the register into the main memory in just one, single step?

This is exactly what non-temporal stores do.

They avoid the extra load of a cache line, and also the other benefit of that you have is since you know that the data are no longer needed, that don't go into the cache, so the cache can now have other data for your app that are more useful.

The way you instruct the compiler to generate non-temporal stores is with a compiler builtin.

So you use the builtin to instruct the compiler to generate the non-temporal stores.

It takes two parameters.

The value you want to store, and the address where you want to store it to.

When do you want to use non-temporal stores?

When there is no [inaudible] use, no temporal locality in your code, you copy a large chunk of data, preferably larger than the size of the cache, and to make it really count for your app, it should be where the performance is hiding.

So it should be in hot loops.

And you don't want to use a non-temporal storage basically when any of these condition does not hold.

What can you gain from this?

So in this slide?

We look at the performance gains from the non-temporal stores on three benchmarks that contain hot loops that look similar to the example that I showed you before.

So what this data show you is that for very common loop bodies, you get very significant speed ups in the range from 30 to 40 percent.

And that is what non-temporal stores can hopefully do for many hot loops in your app.

[ Applause ]

It has been a long journey.

We have looked at many, many new features and a lot of great new optimizations that the new compiler brings to your app.

So we've talked about the Apple LLVM compiler is based on an open source project.

You can interact with us in the open source community, even provide patches for your favorite compiler.

We have seen many great new features like the objective class properties and C++11 thread local storage support.

Now the new Libc ++ has full support for C++14 and has many new improved features, but keep in mind, that Libstandardc++ is deprecated.

You get many more warning and diagnostics to make your code cleaner than ever before.

And you heard about this fantastic new feature, incremental LTO that gives you basically the same performance as monolithic LTO, and really, truly, awesome compile time.

And we talked about a number of optimizations in the code generator that automatically should speed up all your apps, and we finally talked about long term pro stores where the compiler provides the means, but you provide the smarts, and then to use them.

So I hope all this convinces you that we put together a really great compiler release.

We couldn't be happier about it.

Please get your hands on it.

Check it out, what it can do for your app.

You'll find more information on our website.

There are a number of really cool talks out there that you should be interested in.

Check them out.

Thank you all for watching.

Thank you for coming.

Have a great time at the conference.

[ Applause ]

Apple, Inc. AAPL
1 Infinite Loop Cupertino CA 95014 US