Modernizing Grand Central Dispatch Usage

Session 706 WWDC 2017

macOS 10.13 and iOS 11 have reinvented how Grand Central Dispatch and the Darwin kernel collaborate, enabling your applications to run concurrent workloads more efficiently. Learn how to modernize your code to take advantage of these improvements and make optimal use of hardware resources.

[ Applause ]

Good morning.

And welcome to the Modernizing Grand Central Dispatch Usage session.

I'm Daniel Chimene from the Core Darwin Team and my colleagues and I are here today to show you how you can take advantage of Grand Central Dispatch to get the best performance in your application.

As app developers, you spend hundreds or thousands of hours building amazing experiences for your users.

Taking advantage of our powerful devices.

You want your users to be able to have a great experience.

Not just on one device, but across all the variety of devices that Apple makes.

GCD is designed to help you dynamically scale your application code.

From the single-core Apple Watch all the way up to a many-core Mac.

You don't want to have to worry too much about what kind of hardware your users are running.

But there are problematic patterns that can affect the scalability and the efficiency of your code.

Both on the low end and on the high end.

That's what we're here to talk about today.

We want to help you ensure that all the work you're putting into your app to make it a great experience for your users translates across all of these devices.

You may have been using GCD APIs like dispatch async and others to create queues and dispatch work to the system.

These are only some of the interfaces to the concurrency technology that we call Grand Central Dispatch.

Today, we're going to take a peek under the covers of GCD.

This is an advanced session packed full of information.

So, let's get started right away, by looking at our hardware.

The amazing chips in our devices have been getting faster and faster over time.

However, much of the speed is not just because the chips themselves are getting faster, but because they're getting smarter, and smarter about running your code.

And they're learning from what your code does over time to operate more efficiently.

However, if your code goes off core, because it's completed its task, then it may no longer be able to take advantage of the history that that core has built up.

And you might leave performance on the table when you come back on core.

We've even seen examples of this in our own frameworks. When we applied some of the optimization techniques that we're going to discuss today, we saw large speedups from simple changes to avoid these problematic patterns.

So, using these techniques lets you bring high performance apps to more users with less work.

Today, we're going to give you some insight into what our system is doing under the covers with your code.

So, you can tune your code to take the best advantage of what GCD has to offer.

We're going to discuss a few things today.

First, we're going to discuss how you can best express parallelism and concurrency.

How you can choose the best way to express concurrency to Grand Central Dispatch.

We're going to introduce Unified Queue Identity, which is a major under the hood improvement to GCD that we're publishing this year.

And finally, we're going to show you how you can find problem spots in your code with Instruments.

So, let's start by discussing parallelism and concurrency.

So, for the purpose of this talk, we're talking about parallelism which is about how your code executes in parallel, simultaneously across many different cores.

Concurrency is about how you compose the independent components of your application to run concurrently.

The easy way to separate these two concepts in your mind, is to realize that parallelism is something that usually requires multiple cores and you want to use them all at the same time.

And concurrency is something that you can do even on a single core system.

It's about how you interleave the different tasks that are part of your application.

So, let's start by talking about parallelism and how you might use it when you're writing an app.

So, let's imagine you make an app and it processes huge images.

And you want to be able to take advantage of the many cores on a Mac Pro to be able to process those images faster.

What you do is break up that image into chunks.

And have each core process those chunks in parallel.

This gives you a speed up, because the cores are simultaneously working on different parts of the image.

So, how do you implement this?

Well first, you should stop and consider whether or not you can take advantage of our system frameworks.

For example, Accelerate has built-in support for parallel execution of advanced image algorithms.

Metal and Core Image can take advantage of the powerful GPU.

Well, let's say you've decided to implement this yourself, GCD gives you a tool that lets you easily express this pattern.

The way you express parallelism in GCD is with the API called concurrentPerform.

This lets the framework optimize the parallel case because it knows that you're trying to do a parallel computation across all the cores.

concurrentPerform is a parallel for loop that automatically load balances your computation across all the cores in the system.

When you use this with Swift, it automatically chooses the correct context to run all your computation in.

This year, we've brought that same power to the Objective-C interface, dispatch_apply, with the DISPATCH_APPLY_AUTO keyword.

This replaces the queue argument, allowing the system to choose the right context to run your code in automatically.
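As a rough illustration of the parallel loop described here, the following Swift sketch sums a large buffer in chunks with DispatchQueue.concurrentPerform. The helper name sumInParallel, the chunking scheme, and the sample data are assumptions made for this example and were not shown in the session.

```swift
import Dispatch

/// Sums `values` by splitting them into `chunkCount` pieces that
/// concurrentPerform distributes across the available cores.
/// (Hypothetical helper; the name and chunking scheme are illustrative.)
func sumInParallel(_ values: [Int], chunkCount: Int) -> Int {
    precondition(chunkCount > 0)
    // Round up so that every element belongs to some chunk.
    let chunkSize = (values.count + chunkCount - 1) / chunkCount
    var partialSums = [Int](repeating: 0, count: chunkCount)

    partialSums.withUnsafeMutableBufferPointer { buffer in
        // Copy the pointer so each iteration writes through a plain value.
        let sums = buffer
        // Blocks until every iteration has finished.
        DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
            let start = min(chunk * chunkSize, values.count)
            let end = min(start + chunkSize, values.count)
            var sum = 0
            for index in start..<end { sum += values[index] }
            // Each iteration owns exactly one slot, so no lock is needed.
            sums[chunk] = sum
        }
    }
    return partialSums.reduce(0, +)
}

let total = sumInParallel([Int](repeating: 1, count: 100_000), chunkCount: 100)
print(total)
```

Writing each partial result into its own slot keeps the iterations independent, which is what lets the load balancer move them freely between cores.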

So, now let's take a look at this other parameter, which is the iteration count.

This is how many times your block is called in parallel across the system.

How do you choose a good value here?

You might imagine that a good value would be the number of cores.

Let's imagine that we're executing our workload on a three-core system.

Here, you can see the ideal case, where three blocks run in parallel on all three cores.

The real world isn't necessarily this perfect.

What happens if the third core here is taken up for a while with UI rendering?

Well, what happens is the load balancer has to move that third block over to the first core in order to execute it, because the third core is taken up.

And we get a bubble of idle CPU.

We could have taken advantage of this time, to do more parallel work.

And so, instead our workload took longer.

How can we fix that?

Well, we can increase the iteration count and give the load balancer more flexibility.

It looks good.

That hole is gone.

There's actually another hole over here, on the third core.

We could take advantage of that time as well.

So, as Tim said on Monday, let's turn the iteration count up to 11.

There. We filled the hole, and we have efficient execution.

We're using all of the available resources on the system until we finish.

This is still a very simplistic example.

To deal with the real-world complexity, you want to use an order of magnitude more, say 1000.

You want to use a large enough iteration count so that the load balancer has the flexibility to fill gaps in the system and take maximum advantage of the available resources.

However, you should be sure to balance the overhead of the load balancer versus the useful work that each block in your parallel for loop does.
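One way to picture that trade-off is a small helper that caps the iteration count so each block still has a meaningful amount of work. This is only a sketch: the function name, the default of 1000 target iterations (taken from the order of magnitude mentioned above), and the minimum of 64 items per iteration are assumed tuning knobs, not GCD API or recommendations from the session.

```swift
/// Picks an iteration count for a parallel loop over `itemCount` items.
/// `targetIterations` and `minimumItemsPerIteration` are hypothetical
/// tuning parameters chosen for this example.
func iterationCount(forItems itemCount: Int,
                    targetIterations: Int = 1000,
                    minimumItemsPerIteration: Int = 64) -> Int {
    guard itemCount > 0 else { return 0 }
    // Cap the count so every block still does enough useful work
    // to outweigh the cost of scheduling it.
    let maxUseful = max(1, itemCount / minimumItemsPerIteration)
    return min(targetIterations, maxUseful)
}

print(iterationCount(forItems: 1_000_000)) // 1000
print(iterationCount(forItems: 6_400))     // 100
print(iterationCount(forItems: 10))        // 1
```

The right minimum depends on how expensive each item is, so it is worth measuring rather than assuming.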

Remember that not every CPU is available to you all the time.

There are many tasks running concurrently on the system.

And additionally, not every worker thread will make equal progress.

So, to recap, if you have a parallel problem, make sure to leverage the system frameworks that are available to you.

You can use their power to solve your problem.

Additionally, make sure to take advantage of the automatic load balancing inside concurrentPerform.

Give it the flexibility to do what it does best.

So, that's the discussion about parallelism.

Now, let's switch to the main topic for today, which is concurrency.

So, concurrency.

Let's imagine you're writing a simple news app.

How would you structure it?

Well, you start by breaking it up into the independent subsystems that make up the app.

Thinking about how you might break up a news app into its independent subsystems, you might have a UI component that renders the UI, that's the main thread.

You might also have a database that stores those articles.

And you might have a networking subsystem that fetches those articles from the network.

To give you a better picture of how this app works and how breaking it up into subsystems gives you an advantage, let's visualize how it executes concurrently on a modern system.

So, let's say here's a timeline that shows at the top the CPU track.

Let's imagine that we only have one CPU remaining.

The other CPUs are busy for some reason.

We only have one core available.

At any time only one of these threads can run on that CPU.

So, what happens when a user clicks the button and refreshes the article list in the news app?

Well, the user interface renders the response to that button and then sends an asynchronous request to the database.

And then the database decides it needs to refresh the articles, which issues another command to the networking subsystem.

However, at this point, the user touches the app again.

And because the database is done off the main thread of the application, the OS can immediately switch the CPU to working on the UI thread, and it can respond immediately to the user without having to wait for the database thread to complete.

This is the advantage of moving work off the main thread.

When the user interface is done responding, the CPU can then switch back to the database thread, and then finish the networking task as well.

So, taking advantage of concurrency like this lets you build responsive apps.

The main thread can always respond to the user's action without having to wait for other parts of your application to complete.

So, let's take a look at what that looks like to the CPU.

These white lines above here show the context switches between the subsystems.

A context switch is when the CPU switches between these different subsystems or threads that make up your application.

If you want to visualize what this looks like in your application, you can use the Instruments System Trace, which shows you what the CPUs and the threads are doing when they're running in your application.

If you want to learn more about this, you can watch the "System Trace In-Depth" talk from last year, where the Instruments team described how you use System Trace.

So, this concept of context switching is where the power of concurrency comes from.

Let's look at when these context switches might happen and what causes them.

Well, they can start when a high priority thread needs the CPU as we saw earlier, with the UI thread pre-empting the database thread.

It can also happen when a thread finishes its current work, or it's waiting to acquire a resource.

Or it's waiting for an asynchronous request to complete.

However, with this great power of concurrency comes great responsibility as well.

You can have too much of a good thing.

Let's say you're switching between the network and database threads on your CPU.

A few context switches are fine. That's the power of concurrency: you're switching between different tasks.

However, if you're doing this thousands of times in really rapid succession, you run into trouble.

You're starting to lose performance because each white bar here is a context switch.

And the overhead of a context switch adds up.

It's not just the time we spend executing the context switch. It's also that history that the core has built up: it has to regain that history after every context switch.

There are also other effects you might experience.

For example, there might be others ahead of you in line for access to the CPU.

You have to wait each time you context switch for the rest of the queue to drain out and so you may be delayed by somebody else ahead of you in line.

So, let's look at what might cause excessive context switching.

So, there's three main causes we're going to talk about today.

First, repeatedly waiting for exclusive access to contended resources.

Repeatedly switching between independent operations, and repeatedly bouncing an operation between threads.

You'll note that I repeated the word "repeatedly" several times.

That's intentional.

Context switching a few times is okay, that's how concurrency works, that's the power that we're giving you.

However, when you repeat it too many times, the costs start to add up.

So, let's start by looking at the first case, which is exclusive access to contended resources.

When can this happen?

Well, the primary case in which this happens is when you have a lock and a bunch of threads are all trying to acquire that lock.

So, how can you tell if this is occurring in your application?

Well, we can go back to system trace.

We can visualize what it looks like in Instruments.

So, let's say this shows us that we have many threads running for a very short time and they're all handing off to each other in a little cascade.

Let's focus on the first thread and see what Instruments is telling us.

We have this blue track, which shows when a thread is on CPU.

And the red track shows when it's making a system call.

In this case, it's making [inaudible].

This shows that most of its time is waiting for the [inaudible] to become available.

And the on core time is very short at only 10 microseconds.

And there are a lot of context switches going on on the system, which it shows you on the context switches track at the top.

So, what's causing this?

Let's go back to our simple timeline and see how excessive contention could be playing out.

So, you see this sort of staircase pattern in time.

Where each thread is running for a short time, and then giving up the CPU to the next thread, rinse and repeat for a long time.

You want your work to look more like this.

Where the CPU can focus on one thing at a time, get it done, and then work on the next task.

So, what's going on here that causes that staircase?

Let's zoom in to one of these stair steps.

So, here we're focusing on two threads, the green thread and the blue thread.

And we have a CPU on top.

We've added a new lock track here that shows the state of the lock and what thread owns it.

In this case, the blue thread owns the lock, and the green thread is waiting.

So, when the blue thread unlocks, the ownership of that lock is transferred to the green thread, because it's next in line.

However, when the blue thread turns around and grabs the lock again, it can't because the lock is reserved for the green thread.

This forces a context switch, because we now have to do something else.

And we switch to the green thread, and the CPU can then finish the lock and we can repeat.

Sometimes this is useful.

You want every thread that's waiting on the lock to get a chance to acquire the resource. However, what if you had a lock that works a different way?

Let's start again by looking at what an unfair lock does.

So, this time when blue thread unlocks, the lock isn't reserved.

The ownership of the lock is up for grabs.

Blue can take the lock again, and it can immediately reacquire and stay on CPU without forcing a context switch.

This might make it difficult for the green thread to actually get a chance at the lock, but it reduces the number of context switches the blue thread has to have in order to reacquire the lock.

So, to recap: when we're talking about lock contention, you actually want to make sure to measure your application in System Trace and see if you have an issue.

If you do, often the unfair lock works best for objects, properties, or global state in your application whose lock may be taken and dropped many, many times.
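For a sense of what guarding a small piece of state with an unfair lock can look like, here is a minimal Swift sketch built on os_unfair_lock. It is specific to Apple platforms, and the HitCounter type and its counter are hypothetical; only the os_unfair_lock calls come from the system. The lock is heap-allocated so that it keeps a stable address, which is an implementation choice of this sketch.

```swift
import Foundation

/// A tiny counter protected by os_unfair_lock (Apple platforms only).
/// The type and its API are made up for this example.
final class HitCounter: @unchecked Sendable {
    // Heap-allocate the lock so that it has a stable address.
    private let lock: UnsafeMutablePointer<os_unfair_lock>
    private var count = 0   // only touched while holding `lock`

    init() {
        lock = .allocate(capacity: 1)
        lock.initialize(to: os_unfair_lock())
    }

    deinit {
        lock.deinitialize(count: 1)
        lock.deallocate()
    }

    func increment() {
        os_unfair_lock_lock(lock)
        defer { os_unfair_lock_unlock(lock) }
        count += 1
    }

    var value: Int {
        os_unfair_lock_lock(lock)
        defer { os_unfair_lock_unlock(lock) }
        return count
    }
}

// Many short critical sections, taken and dropped repeatedly.
let counter = HitCounter()
DispatchQueue.concurrentPerform(iterations: 1000) { _ in counter.increment() }
print(counter.value)
```

Whether this actually beats a fair lock for a given workload is something to confirm in System Trace rather than assume.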

There's one other thing I want to talk about when we're mentioning locks, and that is lock ownership.

So, remember the lock track we had earlier, the runtime knows which thread will unlock the lock next.

We can take advantage of that power to automatically resolve priority inversions in your app between the waiters and the owners of the lock.

And even enable other optimizations, like directed CPU handoff to the owning thread.

Pierre is going to discuss this later in our talk, when talking about dispatch sync.

We often get the question, which primitives have this power and which ones don't.

Let's take a look at which low-level primitives do this today.

So, primitives with a single known owner have this power.

Things like serial queues and os_unfair_lock.

However, asymmetric primitives, like dispatch semaphore and dispatch group, don't have this power, because the runtime doesn't know what thread will signal the primitive.

Finally, for primitives with multiple owners, like private concurrent queues and reader-writer locks, the system doesn't take advantage of that today, because there isn't a single owner.

When you're picking a primitive consider whether or not your use case involves threads of different priorities interacting.

Like a high-priority UI thread with a lower-priority background thread.

If so, you might want to take advantage of a primitive with ownership, which ensures that your UI thread doesn't get delayed by waiting on a lower-priority background thread.

So, in summary, these inefficient behaviors are often emergent properties of your application.

It's not easy to find these problems just by looking at your code.

You should observe it in the Instruments System Trace to visualize your app's real behavior, so you can use the right lock for the job.

So, I've just discussed the first cause on our context switching list, which is exclusive access.

To discuss some other ways your apps can experience excessive context switching, I'm going to bring out my colleague Daniel Steffen to talk to you about how you can organize your concurrency with GCD to avoid these pitfalls.

[ Applause ]

All right.

Thank you, Daniel.

So, we've got a lot to cover today, so I won't be able to go into too many details on the fundamentals of GCD.

If you're new to the technology, or need a bit of a refresher, here are some of the sessions at previous WWDC conferences that covered GCD and the enhancements that we've made to it over the years.

So, I encourage you to go and see those on video.

We do need a few of the basic concepts of GCD today however, starting with the serial dispatch queue.

This is really our fundamental synchronization primitive in GCD.

It provides you with mutual exclusion as well as FIFO ordering.

This is one of these ordered and fair primitives that Daniel just mentioned.

And it has a concurrent atomic enqueue operation, so it's fine for multiple threads to enqueue operations into the queue at the same time, as well as a single dequeuer thread that the system provides to execute asynchronous work out of the queue.

So, let's look at an example of this in action.

Here we're creating a serial queue by calling the dispatch queue constructor, and that will give you a piece of memory that, as long as you haven't used it yet, is just inert in your application.

Now, imagine there are two threads that come along and call the queue.async method to submit some asynchronous work into this queue.

As mentioned, it's fine for multiple threads to do this, and the items will just get enqueued in the order that they appeared.

And because this is the asynchronous method, this method returns and the threads can go on their way, so maybe this first thread eventually calls queue.sync.

This is the way you interact synchronously with the queue.

And because this is an ordered primitive here, what this does is enqueue a placeholder into the queue so that the thread can wait until it is its turn.

And now, there's this automatic worker thread that will come along to execute the asynchronous work items, until you get to that placeholder at which point the ownership of the queue will transfer to the thread waiting in queue.sync so that it can execute its block.
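The sequence just described can be sketched in Swift roughly as follows. The ArticleStore type, its methods, and the queue label are invented for the illustration; the async and sync calls are the DispatchQueue methods discussed above.

```swift
import Dispatch

/// Hypothetical database subsystem that serializes access to its state
/// on one serial queue. The label and type name are made up for this sketch.
final class ArticleStore: @unchecked Sendable {
    private let queue = DispatchQueue(label: "com.example.news.database")
    private var titles: [String] = []   // only touched on `queue`

    /// Asynchronous: enqueues the work item and returns immediately.
    func add(_ title: String) {
        queue.async { self.titles.append(title) }
    }

    /// Synchronous: waits for its turn behind earlier work items,
    /// then runs the block on the calling thread's behalf.
    func allTitles() -> [String] {
        queue.sync { titles }
    }
}

let store = ArticleStore()
store.add("first")
store.add("second")
// FIFO ordering means both appends have run before this block executes.
print(store.allTitles())
```

Because the queue is serial and ordered, the synchronous read observes the items in the order they were submitted.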

So, the next concept that we'll need is the dispatch source.

This is our event monitoring primitive in GCD.

Here we are setting one up to monitor a file descriptor for readability with the makeReadSource constructor.

You pass in a queue, which is the target queue of the source, which is where we execute the event handler of the source, which here just reads from the file descriptor.

This target queue is also where you might put other work that should be serialized with this operation, such as processing the data that was read.

Then, we set the cancel handler for the source, which is how sources implement the invalidation pattern.

And finally, when everything is set up, you call source.activate() to start monitoring.
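Put together, the source setup described above might look something like this Swift sketch. The DescriptorReader wrapper, the queue label, and the use of a pipe as a stand-in for a network connection are all assumptions made so that the example is self-contained.

```swift
import Foundation

/// Hypothetical wrapper that monitors one file descriptor with a read source.
final class DescriptorReader: @unchecked Sendable {
    private let queue = DispatchQueue(label: "com.example.news.connection")
    private let source: DispatchSourceRead
    private var received: [UInt8] = []   // only touched on `queue`

    init(fileDescriptor: Int32) {
        source = DispatchSource.makeReadSource(fileDescriptor: fileDescriptor,
                                               queue: queue)
        // Event handler: runs on the target queue when data is readable.
        source.setEventHandler { [weak self] in
            var buffer = [UInt8](repeating: 0, count: 1024)
            let count = read(fileDescriptor, &buffer, buffer.count)
            if count > 0 {
                self?.received.append(contentsOf: buffer[0..<count])
            }
        }
        // Cancel handler: the invalidation step, here closing the descriptor.
        source.setCancelHandler { close(fileDescriptor) }
        // Nothing is monitored until the source is activated.
        source.activate()
    }

    /// Work serialized with the event handler goes on the same queue.
    func receivedBytes() -> [UInt8] { queue.sync { received } }

    func cancel() { source.cancel() }
}

// Usage, with a pipe standing in for a network connection.
var pipeEnds: [Int32] = [0, 0]
precondition(pipe(&pipeEnds) == 0)
let reader = DescriptorReader(fileDescriptor: pipeEnds[0])
let message = Array("hello".utf8)
_ = write(pipeEnds[1], message, message.count)
```

The event handler fires asynchronously some time after the write, so a caller that needs the bytes has to check receivedBytes() later rather than immediately.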

So, it's worth noting that sources are really just an instance of a more general pattern throughout the OS, where you have objects that deliver events to you on a target queue that you specify.

So, if you're familiar with XPC, XPC connections would be another example of that.

And it's worth noting that everything we're telling you today about sources really applies to all such objects in general.

So, putting these two concepts together, we get what we call the target queue hierarchy.

So, here we have two sources with their associated target queues: the sources S1 and S2, and the queues Q1 and Q2.

And we can form a little tree out of this situation by adding yet another serial queue to the mix: a mutual exclusion queue, EQ, at the bottom.

The way we do this is simply by passing in the optional target argument into the dispatch queue constructor.

So, this gives you a shared single mutual exclusion context for this whole tree.

Only one of the sources or one item in one of the queues can execute at one time.

But it preserves the independent individual queue order for queue 1 and queue 2.

So, let's look at what I mean by that.

Here I have the two queues, queue 1 and queue 2, with items enqueued in a specific order.

And because we have this extra serial queue at the bottom, when they execute, they will execute in EQ, and there will be a single worker thread executing these items, giving you that mutual exclusion property: only one item executing at one time.

But as you can see, the items from both queues can execute interleaved while preserving the individual order that they had in their original queues.
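A small Swift sketch of such a hierarchy, under the assumption that plain work items stand in for the sources: two serial queues target a shared bottom queue through the optional target argument. The labels and the ExecutionLog helper are invented for the example.

```swift
import Dispatch

/// Records execution order. Every mutation happens under the mutual
/// exclusion of the bottom queue, so no extra lock is used here.
final class ExecutionLog: @unchecked Sendable {
    var entries: [String] = []
}

// Bottom of the tree: the shared mutual exclusion queue (EQ in the talk).
let exclusionQueue = DispatchQueue(label: "com.example.EQ")
// Q1 and Q2 keep their own FIFO order but execute on EQ.
let queue1 = DispatchQueue(label: "com.example.Q1", target: exclusionQueue)
let queue2 = DispatchQueue(label: "com.example.Q2", target: exclusionQueue)

let log = ExecutionLog()
for step in 1...3 {
    queue1.async { log.entries.append("Q1-\(step)") }
    queue2.async { log.entries.append("Q2-\(step)") }
}

// Wait for both queues to drain before reading the log.
queue1.sync {}
queue2.sync {}
print(log.entries)
```

The interleaving between the two queues can differ from run to run, but the entries for each individual queue always appear in submission order.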

So, the last concept that we'll need today, is the notion of quality of service.

This is a fairly deep concept that was talked about in some detail in the past.

In particular, in the "Power, Performance and Diagnostics" session in 2014.

So, if this is new to you, I would encourage you to go and watch that.

But what we'll need today from this is really mostly its abstract notion of priority.

And we'll use the terms QoS and priority somewhat interchangeably in the rest of the session.

We have four quality of service classes on the system.

From the highest, user interactive (UI), to user initiated (IN), utility (UT), and background (BG), the lowest priority.

So, let's look at how we would combine this concept of quality of service with the target queue hierarchy that we just looked at.

In this hierarchy, every node in the tree can actually have a quality of service label associated with it.

So, for instance the source 2 might be relevant to the user interface.

It might be monitored for an event where we should update the UI as soon as the event triggers.

So, it could be that we want to put the UI label onto the source.

Another common use case would be to put a label on the mutual exclusion queue to provide a floor of execution, so that nothing in this tree can execute below this level, UT in this example.

And now if anything else in this tree fires, for instance source 1, we will be using this floor from the tree if it doesn't have its own quality of service associated.

And a source firing is really just an async executed from the kernel.

And the same as before happens: we enqueue the source handler eventually into the mutual exclusion queue for execution.

For asyncs from user space, your quality of service is usually determined from the thread that called queue.async.

Now, we have a user initiated thread that enqueues an item at IN into the queue, and eventually into EQ for execution.

And now, maybe we have source 2 that fires with this very high priority UI-relevant event and enqueues its event handler into EQ.

So, now you'll notice that we have a priority inversion situation.

We have three items in queue with a very high priority item at the end preceded by some low priority items.

And these have to execute in order.

The system resolves this inversion for you by bringing up a worker thread at the highest priority of anything that is currently in queue.
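In Swift, attaching quality of service labels to a hierarchy like this one could look roughly as follows. The labels, the OrderLog helper, and the use of plain work items instead of sources are assumptions for the sketch; the qos parameters are the ones DispatchQueue provides.

```swift
import Dispatch

/// Records execution order; only mutated on the serial hierarchy below.
final class OrderLog: @unchecked Sendable {
    var entries: [String] = []
}

// The label on the bottom queue acts as a floor for everything targeting it.
let floorQueue = DispatchQueue(label: "com.example.EQ.floor", qos: .utility)
let itemQueue = DispatchQueue(label: "com.example.Q1.items", target: floorQueue)

let orderLog = OrderLog()
// Submitted without an explicit QoS; it never runs below the utility floor.
itemQueue.async { orderLog.entries.append("earlier item") }
// Submitted later with an explicit, much higher quality of service.
itemQueue.async(qos: .userInteractive) {
    orderLog.entries.append("later UI-relevant item")
}

// Wait for the queue to drain before reading the log.
itemQueue.sync {}
print(orderLog.entries)
```

The high-priority item still runs after the earlier one, because the queue is ordered; the priority of the pending work is what the system raises to resolve the inversion.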

And it's worth keeping this little tree on the right hand side here in mind because it comes up again later in the session.

And with that let's move on to our main topic of the section which is how to use what we just learned to express good granularity of concurrency to GCD.

Let's go back to our news application that Daniel introduced earlier and focus on the networking subsystem for a little bit.

In a networking subsystem, you'll have to monitor some network connections in the kernel.

And with GCD you'll do that with a dispatch source and a dispatch queue, like you just saw.

But of course, in any networking subsystem you usually don't have just one network connection. You'll have many of them, and they will all replicate the same setup.

So, let's focus on the three connections on the right-hand side here and see how they execute.

If the first connection triggers, the same thing we just saw happens: we will enqueue the event handler for that source onto its target queue.

Of course, if the other two connections fire at the same time, that is replicated, and you'll end up with three queues with an event handler enqueued.

And because you have these three independent serial queues at the bottom, you've really asked the system to provide you with three independent concurrency contexts.

If all these become active at once, the system will oblige and create three threads for you to execute the event handlers.

Now, this may be exactly what you were after, but it is quite common for these event handlers to be small and only read some data from the network and enqueue it into a common data structure.

Additionally, as we saw before, you don't have just three connections, you may have many, many of them if you have a number of network connections in your subsystem.

So, this can lead to a situation where you have the kind of excessive context switching pattern that Daniel talked about, where you execute a small amount of work, context switch to another thread, and do that again, and again, and again.

So, how can we improve on this situation in this example here?

We can apply the single mutual exclusion context idea that we just talked about by simply putting in an additional serial queue at the bottom and forming a hierarchy, you can get a single mutual exclusion context for all of these connections.

And if they fire at the same time, the same thing as before will happen: the event handlers will get enqueued onto the target queues. But because there's an additional serial queue at the bottom here, it's a single thread that will come and execute them in order, instead of the multiple threads that we had before.

So, this seems like a really simple change, but it is exactly the type of change that led to the 1.3x performance improvement in some of our own code that Daniel talked about earlier in the session.
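As an illustration of that change, the sketch below gives every connection its own serial queue but targets all of them at one shared serial queue for the subsystem. The NetworkingSubsystem type, its methods, and the labels are hypothetical, and plain work items stand in for the source event handlers.

```swift
import Dispatch

/// Hypothetical networking subsystem: every connection gets its own
/// serial queue, and all of them target one shared serial queue.
final class NetworkingSubsystem: @unchecked Sendable {
    /// Bottom of the hierarchy: one mutual exclusion context.
    let rootQueue = DispatchQueue(label: "com.example.news.networking")
    private var handledEvents = 0   // only touched under `rootQueue`

    func makeConnectionQueue(id: Int) -> DispatchQueue {
        DispatchQueue(label: "com.example.news.networking.connection-\(id)",
                      target: rootQueue)
    }

    /// Stand-in for a small event handler that updates shared state.
    func handleEvent() {
        // Checks that we are running inside the subsystem's hierarchy.
        dispatchPrecondition(condition: .onQueue(rootQueue))
        handledEvents += 1
    }

    func eventCount() -> Int { rootQueue.sync { handledEvents } }
}

let networking = NetworkingSubsystem()
let connectionQueues = (0..<8).map { networking.makeConnectionQueue(id: $0) }
// All connections "fire" at once.
for queue in connectionQueues {
    queue.async { networking.handleEvent() }
}
// Drain every connection queue before reading the count.
connectionQueues.forEach { $0.sync {} }
print(networking.eventCount())
```

Because every handler runs under the shared bottom queue, the counter needs no additional lock, and the handlers do not ask the system for eight independent concurrency contexts.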

So, this is one example of how we can avoid the problematic pattern of repeatedly switching between independent operations.

But it really comes under the general heading of avoiding unwanted and unbounded concurrency in your application.

One way you can get that is by having many queues becoming active all at once.

And one example of this is the pattern of independent sources, each with its own queue, that we just saw.

You can also get this if you have independent per-object queues.

If many objects in your application have their own serial queues and you put asynchronous work into them at the same time you can get exactly the same phenomenon.

You can also see this if you have many work items submitted to the global concurrent queue at the same time.

In particular if those work items block.

The way the global concurrent queue works is that it creates more threads when existing threads block, to give you a continuing good level of concurrency in your application.

But if those threads then block again, you can get something that we call the thread explosion.

This is a topic that we went into in some detail in the "Building Responsive and Efficient Apps with GCD" session in 2015.

So, if this sounds new to you, I'd encourage you to go and watch that session.

So, how do you choose the right amount of concurrency in your application to avoid these problematic patterns?

One idea that we've recommended to you in the past is to use one queue per subsystem.

So, here back in our news application, we already have one queue for the user interface, the main queue, and we could choose one queue for the networking and one queue for the database subsystem in addition.

But what we've learned today a more general way to think of this is really to use one queue hierarchy per subsystem.

This gives you a mutual exclusion context for the subsystem, and you can leave the rest of the queue and source structure in your subsystem alone and just target that network queue or database queue at the bottom of your queue hierarchies.

But, that may be a bit too simplistic a pattern for a complex application or a complex subsystem.

The main thing that is important here is to have a fixed number of serial queue hierarchies in your application.

So, it may make sense to have additional queue hierarchies for a complicated subsystem, say a secondary one for slower work, or larger work items, so that the first one, the primary one can keep the subsystem responsive to requests coming in from outside.

Another thing that's important to think about in this context is the granularity of the work submitted to those subsystems.

You want to use fairly large work items when you move between subsystems to get a picture like what we saw earlier in the session, where the CPU is able to execute your subsystem for long enough to reach an efficient performance state.

Once you're inside the subsystem, say the networking subsystem here, it may make sense to subdivide into smaller work items and have a finer granularity to improve the responsiveness of that subsystem.

For instance, you can do that by splitting up your work and re-asyncing to another queue in your queue hierarchy.

And that doesn't introduce a context switch because you're already in that one subsystem.
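One possible shape for that splitting, sketched in Swift: a helper handles a batch of items and then re-asyncs the remainder onto a queue in the same hierarchy, so other work items can interleave between batches. The function name, the batch size, and the Results helper are assumptions for this example, and the semaphore is only there so the sample can wait for completion.

```swift
import Dispatch

/// Handles `items` in batches of `batchSize`, re-asyncing the remainder
/// onto `queue` after each batch. (Hypothetical helper for illustration.)
func processInBatches(_ items: ArraySlice<Int>,
                      batchSize: Int,
                      on queue: DispatchQueue,
                      handle: @escaping @Sendable (Int) -> Void,
                      completion: @escaping @Sendable () -> Void) {
    precondition(batchSize > 0)
    queue.async {
        // Do a bounded amount of work in this work item.
        items.prefix(batchSize).forEach(handle)
        let rest = items.dropFirst(batchSize)
        if rest.isEmpty {
            completion()
        } else {
            // Re-async the remainder so queued work can run in between.
            processInBatches(rest, batchSize: batchSize, on: queue,
                             handle: handle, completion: completion)
        }
    }
}

/// Collects results; only mutated on the serial queue below.
final class Results: @unchecked Sendable {
    var values: [Int] = []
}

let results = Results()
let batchQueue = DispatchQueue(label: "com.example.news.networking.batches")
let finished = DispatchSemaphore(value: 0)
processInBatches(ArraySlice(1...10), batchSize: 3, on: batchQueue,
                 handle: { results.values.append($0 * 2) },
                 completion: { finished.signal() })
finished.wait()
print(results.values)
```

The batch size is the knob that trades responsiveness inside the subsystem against per-item overhead.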

So, in summary, what have we looked at in this section?

We saw how we can organize queues and sources into serial queue hierarchies.

How to use a fixed number of the queue hierarchies to give GCD a good granularity of concurrency.

And how to size your work items appropriately earlier in the section for parallel work and here for concurrent work inside the subsystem as well as between subsystems.

And with this, I'll hand it over to Pierre to dive into how we have improved GCD to always execute the queue hierarchy on a single thread and how you can modernize your code to take advantage of this.

[ Applause ]

Thank you, Daniel.

So, indeed, we have completely reinvented the internals of GCD this year to eliminate some unwanted context switches and execute single queue hierarchies, like the ones that Daniel showed, on a single thread.

To do so we have created a new kind of concepts that we call Unified Queue Identity that let us do that.

And we will walk you through how it works.

So, really this part of the talk will focus on a single queue hierarchy, like the ones Daniel showed earlier.

However, we'll work on simplified ones with the sources at the top, and your mutual exclusion context at the bottom.

The internal GCD nodes are not quite relevant for that part of the talk.

So, when you create an execution context, you use the DispatchQueue constructor, and that creates just a piece of memory in your application that is inert.

And one of the first things that you may do is to dispatch async some work items to it.

So, you will have code in your application that will enqueue a [inaudible] on the queue. When that happened before, we used to request a thread anonymously from the system.

And the resolution of what that was meant to do happened late, inside your application.

In this release, we changed that: what we do is create a counterpart object, the Unified Queue Identity, that is tied to your queue and is exactly meant to represent your queue in the kernel.

We can tie that object with the required priority to execute the work, which here is background.

And that causes the system to ask for a thread.

The thread request, that dotted line on the slide, may not be fulfilled for some time, because here that's a background thread, and maybe the system is loaded enough that it's not even worth giving you a thread for it.

Later on, some other part of your application may actually try to enqueue more work.

Here, a utility [inaudible] that is slightly higher priority.

We can use the queue identity, the unified identity in the kernel, to notice and resolve the priority inversion, and elevate the priority of that thread request.

It may be that is the small nudge that the system needed to actually give you a thread here to execute your work.

But this thread is in the scheduler queues, not yet on core.

Not executing.

And the reason why is because there is another thread in your application that is interacting with the queue and working synchronously at an even higher priority, user initiated.

Now that we have that Unified Queue Identity, since that thread has to block to enqueue the placeholder that Daniel told you about a bit earlier, we can block the synchronous execution of that thread on the Unified Queue Identity.

The same one that we use for asynchronous work, [inaudible].

But now that we have unified the asynchronous and the synchronous parts of the queue in a single identity, we can apply an optimization and directly switch to the thread that's blocking you, bypassing the scheduler queue and reducing the queueing delays that Daniel introduced while talking about the scheduler very early on.

So, that's how the unified queue identity is used for asynchronous and synchronous work items.

Now, how did we use that for events?

Why is it useful?

So, that is the small tree that we've been using so far, let's look at the creation of these sources.

When you create the source using the makeReadSource factory method, you set a bunch of event handlers and properties.

But what is really interesting is what happens when you activate the object.

It is actually at that moment that we will take note of the utility QoS at which the handler for your source will always execute.

Because it's inherited from your queue hierarchy.

We will also know now, with the new system, that the handler will eventually execute on that EQ, the mutual exclusion context.

And we will now register the source up front with the same unified identity that I just talked about a bit earlier.

If we look at the higher, UI QoS source that we have on the tree, the way it is treated is very similar to the first one, except that when you're setting the handler here, you're specifying the QoS that you actually want.

And again, what's interesting is what happens at activation.

That is when we take the snapshot, and unlike before, when we got the utility QoS from your hierarchy, here we get it from your hint.

We still record the fact that both sources will execute on the same execution context.

And we will register that second source up front again, with the same unified identity in the kernel.
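The two-source setup described here could look roughly like the following Swift sketch. It is an illustration under assumptions, not the code from the slides: the queue label is made up, two pipes stand in for whatever the real sources monitor, and each source is cancelled after its first event just to keep the example short.

```swift
import Foundation

// Hypothetical label; `eq` stands in for the mutual exclusion queue at the bottom of the tree.
let eq = DispatchQueue(label: "com.example.eq", qos: .utility)

// Two pipes stand in for the sockets or files that real sources would monitor.
var pipeA: [Int32] = [0, 0]
var pipeB: [Int32] = [0, 0]
precondition(pipe(&pipeA) == 0 && pipe(&pipeB) == 0, "pipe() failed")

var received: [String] = []              // only mutated on `eq`
let fired = DispatchSemaphore(value: 0)

func drain(_ fd: Int32) -> String {
    var buffer = [UInt8](repeating: 0, count: 64)
    let count = read(fd, &buffer, buffer.count)
    return count > 0 ? String(decoding: buffer[0..<count], as: UTF8.self) : ""
}

// First source: no QoS is given, so it takes utility from the queue hierarchy.
let s1 = DispatchSource.makeReadSource(fileDescriptor: pipeA[0], queue: eq)
s1.setEventHandler {
    received.append(drain(pipeA[0]))
    s1.cancel()                          // one event is enough for this sketch
    fired.signal()
}

// Second source: same queue, but the handler carries an explicit QoS hint.
let s2 = DispatchSource.makeReadSource(fileDescriptor: pipeB[0], queue: eq)
s2.setEventHandler(qos: .userInteractive) {
    received.append(drain(pipeB[0]))
    s2.cancel()
    fired.signal()
}

// Everything is configured; only now are the sources activated.
s1.activate()
s2.activate()

let first = Array("first".utf8), second = Array("second".utf8)
_ = write(pipeA[1], first, first.count)
_ = write(pipeB[1], second, second.count)

fired.wait()
fired.wait()
print(received.sorted())
```

Both handlers target the same serial queue, so they never run at the same time even though their QoS differs.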

So, really, what we're trying to solve with that quite complex identity is a problem that we had in previous releases of the OS, where related operations would actually bounce across threads.

Let's look at how it used to work.

So, remember that's our queue hierarchy, and let's bring up the timeline that you've seen a bunch of times now in our talk.

At the top is the CPU, but now there is a new track, the exclusion queue track, that will show you what is executing at any given moment on that queue.

So, that's really how the runtime used to work before this release, in macOS Sierra and iOS 10.

So, let's look at what happens if the first source fires.

Before, like I said thread requests were anonymous.

We would ask for an anonymous thread, deliver the event on the thread and then we would look at the event.

And when we look at the event inside your application, it is only then that we realize that this event is meant to run on a queue.

We would then enqueue the event handler.

But since the queue is unclaimed, the thread could actually become that queue and start executing the event handler for your source.

And we do so.

Now, the interesting thing is what happens when the second source that is higher priority fires?

The same actually.

Since it's a higher QoS here, higher priority than what you're executing right now, we would bring up a new anonymous thread and deliver that higher priority event on that thread.

And look at what that event means.

And only then would we notice that it is for exactly the same queue hierarchy.

And then enqueue the handler after the one we just preempted.

As you see, we caused our first context switch.

It was because of that higher priority event.

But we cannot make forward progress, because unlike the first time, that second thread cannot take over the queue: it is already associated with a thread.

We cannot take it over.

So, the thread is done.

Which, as Daniel explained, is one reason why you context switch again.

And that's what we do, we context switch back to the first thread that is the one that can actually make progress.

We execute the rest of the first handler and finally move to the second one.

So, as you can see, we use two threads and two context switches that you really didn't want for a single execution context.

We fixed that using Unified Identity in macOS High Sierra and iOS 11.

We got rid of that thread.

And we also, of course got rid of the two context switches that we had, that were unwanted.

And of course, it's important, because this is unlike what happened when Daniel showed you the preemption with that UI touch event, where we could take advantage of the fact that we actually had two independent threads to make your application more responsive.

Here, we didn't benefit from any of these context switches, because these event handlers, S1 and S2, had to execute in order anyway.

So, knowing about that event early was not useful.

And if you look at how this actually, what the flow is today, it looks more like this.

What happened here?

The most important thing on that flow is that now, if you look at the thread, it's called EQ, because that's the point of the unified identity: the thread and the EQ are basically the same object.

And the kernel knows that it's really executing a queue, which is reflected on the CPU track: you don't see the events anymore, it's just running your queue.

However, you might ask, how did we manage to deliver that second event without requiring a helper?

That is actually a good question.

When the event fires, now we know where it will execute, where you will handle it.

We just mark the thread.

No helper needed.

And at the first possible time, we will notice that the thread was marked as having pending events.

And we dequeue the events at the right time, right after the first handler finishes.

We can grab the events from the kernel, look at them, and enqueue their handlers on your hierarchy.

So, why did we go through that quite complex explanation?

That's so that you can understand how to best take advantage of the runtime behavior.

Because clearly, the runtime uses every possible hint you're giving us to optimize behavior in your application.

And it's important to know how and when to hint the runtime correctly so that we make the right decisions.

Which brings me to what you should do to existing code bases to take advantage of all that core technology that we rebuilt.

There are actually two steps to follow to take full advantage of that technology.

The first one is no mutation after activation.

And the second one is paying extra attention to your target queue hierarchies.

So, what does that mean?

No mutation past activation really means that when you have any kind of property on a dispatch object, you can set it, but as soon as you activate the object, you should stop mutating it.

As an example, that's our source that we've seen quite a few times already in the talk.

That [inaudible] for readability.

And you're setting a bunch of properties and handlers: the event handler, the cancel handler.

You may have registration handlers.

You can even change them a few times, that's fine, you can change your mind.

And then you activate the source.

The contract here is that you should stop mutating your objects.

It's very tempting to, after-the-fact, for example change the target queue of your source.

That will cause problems.

And the reason why is exactly what I showed a bit earlier, at activate time we take a snapshot of the properties of your objects, and we will take decisions in the future based on that snapshot.

And if you change the target queue hierarchy after the fact, it will render that snapshot stale, and that will defeat a bunch of very important optimizations, such as the priority inversion avoidance [inaudible], the direct handoff that we have for dispatch sync that I presented earlier, or all the event delivery optimizations that we just went through.
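One way to follow that contract in Swift is sketched below, using a queue rather than a source to keep it short. The labels are hypothetical. The queue is created inactive, fully configured, and only then activated; the retargeting that the talk warns about is left as a comment.

```swift
import Dispatch

// Hypothetical labels, for illustration only.
let root = DispatchQueue(label: "com.example.root")

// Created inactive, so its properties can still be set safely.
let worker = DispatchQueue(label: "com.example.worker",
                           attributes: .initiallyInactive)

worker.setTarget(queue: root)   // configure first...
worker.activate()               // ...then activate, and treat the object as frozen.

// Anti-pattern, deliberately not executed: changing the target after
// activation would make the snapshot taken at activation time stale.
// worker.setTarget(queue: someOtherQueue)

var ranAfterActivation = false
worker.sync { ranAfterActivation = true }
print(ranAfterActivation)
```

The same ordering applies to sources: set handlers and the target queue, then call `activate()` last.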

And I insist on the points that Daniel made early on, which is that many of you probably never had to create a dispatch source in your application.

And this is fine, this is really how it's supposed to work.

You probably actually use a lot of them through system frameworks.

Each time you hand a dispatch queue to a framework because it asyncs some notifications on that queue on your behalf, behind the scenes, it has one of these sources.

So, if you're changing the assumptions of the system, you will actually break all of these optimizations as well.

So, I hope I made the point really clear that your target queue hierarchy is essential and you have to protect it.

What does that mean?

And how to do that?

The first way, which is very simple advice, is that when you're building one, start from the bottom and build it toward the top.

When you watch that slide build up, as you see, these white arrows there are your target queue relationships.

None of them have to be mutated if you [inaudible] in that order.

However, when you have a large application, or you're writing frameworks and you're vending one of these queues to another part of your engineering company, you may want to have stronger guarantees than that.

You may want to lock down these relationships, so that really no one can mutate them after the fact.

This is actually something that you can do with the technology that we call static queue hierarchy.

We introduced it last year, and actually, if you are using Swift 3, then you can stop listening to me, because you're already in that world, and that's the only world you're living in.

However, if you have an existing code base, or you use older versions of Swift, you need to do some extra steps.

So, let's focus on the relationship between Q1 and EQ here.

When you created that with Objective-C, you probably wrote code that looks like this.

You create your queue, and then in a second step, you set the target queue of Q1 to EQ.

That is not protecting your queue hierarchy.

Anyone can come along and call dispatch_set_target_queue again and break all your assumptions.

That's not totally great.

There is a simple step to fix that code in a way that is safe, which is to adopt a new API we introduced last year, dispatch_queue_create_with_target, which in a single atomic step will create the queue, set the queue hierarchy right, and protect it.

And that's it.

These were the two steps to follow for you to really work with the [inaudible] well.

However, a bit like the mutated case that Daniel walked you through early on, finding when you're doing one of these things wrong is fairly challenging, especially in a large code base.

Finding that in an existing code base through code inspection is hard.

This is why we created a new GCD performance instrument to find problem spots in an existing application.

And I will call Daniel back to the stage to demo it for you.

[ Applause ]

Thank you, Pierre.

All right, to start out with, please note that this GCD performance instrument that we'll see is not yet present in the version of Xcode 9 that you have, but it will be available in an upcoming seed of Xcode 9.

So, for this demo, let's analyze the execution of our sample news application in some detail.

So, what happens here if you click this connect button at the bottom, is that this app creates a number of network connections to a server, to read lists of URLs from, which are then displayed in the WebViews whenever the refresh button is hit.

So, let's jump into Xcode to see how we are setting up those network connections.

So, here we are in Xcode in the create connections method, which does just that.

It's very simple.

We have a for loop, where we just create some sockets and connect them to our server.

And we monitor that socket for readability with one of these dispatch read sources that we've seen so many times already in this session.

And here it is with the trusted C API.

We then set up the event handler block for that dispatch source here.

And when the socket becomes readable, we just read from it with the read system call until there is no more data available.

Once we have the data, we pass it to our database subsystem in the application with this process 0 method.

So, let's build and run, and take a system trace of this application and see how it executes.

So, here we are in Instruments, in system trace, and in addition to the usual tracks in system trace, we've added this new GCD performance instrument.

When we click on it, we see a number of events that have been reported for performance problems.

One of these is this mutation after activation event, which we can also see when we mouse over the timeline.

You can also click on one of the other events here, such as this retarget after activation event.

And the list will take us directly there.

If we want more details on this, we can disclose the backtrace on the right-hand side of Instruments, which will show us where exactly this event occurred in the application.

So, here for instance it is in our create connections method.

If we double-click on this frame, Instruments will show us directly the line of code where the problem occurred.

This is actually a dispatch_set_target_queue call here that indeed occurs after activate.

This is the pattern that Pierre just told you about.

To go and fix that, we can jump directly into Xcode with the Open File in Xcode button in Instruments.

So, here we are at that dispatch_set_target_queue line, and indeed it, as well as the dispatch_source_set_event_handler setup, happens after activate.

So, here in this example, it's really easy to fix.

We just move these two lines down below.

And we have fixed the problem.

We have activate after we set up the source, and not before.

So, let's jump back into Instruments and see what we can see in the system trace now.

It looks the same as before, except when you click on the GCD performance track, you will see there are no more significant performance problems detected.

And that's what you ought to see if you use this instrument.

So, of course this was very simple in this application.

You may have to do some work.

So, let's focus on the points track in the application.

This shows us a number of network event handlers.

And these are the source event handlers in our application.

How did we manage to make these show up in Instruments?

That's actually really interesting to understand, because it's something you can apply to your own code to understand how it executes in Instruments.

Well, going back to Xcode, in our create connections method, when we set up our source and its event handler, we are interested in the execution of that event handler, and want to understand its timing.

To see that in Instruments, we've added the kdebug_signpost_start function at the beginning of the handler, and the kdebug_signpost_end function at the end.

And that is all it takes for the section of code to appear highlighted in the points track in Instruments system trace.
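A small Swift sketch of that bracketing follows. The wrapper function, the code value 1, and the stand-in handler body are assumptions made for illustration; only the kdebug_signpost_start and kdebug_signpost_end calls come from the talk. Later SDKs mark these two calls as deprecated in favor of os_signpost, so expect a warning there, and on non-Darwin platforms the sketch simply runs the body.

```swift
import Foundation

// Hypothetical code value; it is whatever number you then look for in the trace.
let networkEventCode: UInt32 = 1

// Runs `body` between a start and an end signpost (Darwin only).
func withSignpost<T>(_ code: UInt32, _ body: () -> T) -> T {
    #if canImport(Darwin)
    _ = kdebug_signpost_start(code, 0, 0, 0, 0)
    #endif
    let result = body()
    #if canImport(Darwin)
    _ = kdebug_signpost_end(code, 0, 0, 0, 0)
    #endif
    return result
}

// Stand-in for the body of the source's event handler.
let bytesHandled = withSignpost(networkEventCode) { () -> Int in
    let fakePayload = [UInt8](repeating: 0, count: 97)
    return fakePayload.count
}
print(bytesHandled)
```

Wrapping the handler body this way keeps the start and end calls paired even when the handler grows.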

So, if we switch back to Instruments, that is these red dots at the top in the points track, and we can see in the backtrace that it matches our event handler for one of these events.

If you zoom in on one of these interesting looking areas in the points track, here, you can see that there is a number of event handlers that are occurring very close together.

And by mousing over, we can actually see that they execute for very short amounts of time.

The pop-up will tell us the amount of time it has executed and we can even see that sometimes we have overlapping event handlers that are all executing concurrently at the same time.

So, this is one of the symptoms of potentially unwanted concurrency in our application, where something that didn't look like it would cause concurrency in your code actually does run concurrently on multiple threads and potentially causes extra context switches.

So, to understand this better, let's bring up the threads in Instruments system trace that are executing this code.

So, here I've highlighted the three worker threads that are executing these event handlers.

And we can see, as before, when they are executing on core and how long they were running.

But here we can see they were again, running for a very short amount of time in this area.

And we can verify that they are making these read system calls that we saw earlier in the event handler.

And we can get some more detail by looking at the back trace again, and seeing, yes it is us that is calling that read system call and here it reads 97 bytes from our socket.

And looking at the other threads, the same pattern repeats.

You can see it's the same read system calls occurring there, more or less at the same timeframe and so on the second thread here or on the first thread.

They're really all doing the same thing, and overlapping.

It would be much better for our program if these things executed on a single thread.

Here we don't really get any benefit from the concurrency because we are executing such short pieces of code.

And we are probably getting more harm than good from adding these extra context switches.

So, let's apply the patterns that we saw earlier to fix this problem in this sample application.

Jumping back into Xcode, let's see how we set up the target queue for this source that we have.

So, that queue is created at the top of this function here, and as you can see, we do it simply by calling dispatch_queue_create.

And that creates an independent serial queue that isn't connected to anything else in our application.

This is exactly like the case we had earlier in my example of the networking subsystem.

So, let's fix that by adding a mutual exclusion context at the bottom of all of these queues for all of these connections.

And we do that by switching to the dispatch_queue_create_with_target function that Pierre introduced to you earlier.

So, here we add dispatch_queue_create_with_target.

And we use a single mutual exclusion queue as the target queue for all of these.

And this is a serial queue that we created somewhere else.
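The demo code is not in the transcript, but the shape of this fix can be sketched in Swift, where the target is passed at creation time. The labels, the number of connections, and the counters below are made up for illustration; the counters only exist to show that handlers no longer overlap once every connection queue shares one serial target.

```swift
import Dispatch

// Hypothetical labels; `networkingEQ` plays the role of the single mutual exclusion queue.
let networkingEQ = DispatchQueue(label: "com.example.networking.eq")

// One queue per connection, each created with the shared target in a single step.
let connectionQueues = (0..<4).map { index in
    DispatchQueue(label: "com.example.connection.\(index)", target: networkingEQ)
}

// Bookkeeping to observe overlap. It is only touched from the work items,
// which the shared target queue runs one at a time.
var running = 0
var maxRunning = 0
var handled = 0

let group = DispatchGroup()
for queue in connectionQueues {
    for _ in 0..<25 {
        queue.async(group: group) {
            running += 1
            maxRunning = max(maxRunning, running)
            handled += 1          // stand-in for a short event handler
            running -= 1
        }
    }
}
group.wait()
print("handled \(handled) events, max overlap \(maxRunning)")
```

With independent serial queues instead, the same counters would be a data race, which is another way of seeing the unwanted concurrency from the trace.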

And with that, we build and run again and look at the system trace again.

And now it looks very different.

Here we have still the same points track and we still see the same network events that occur, but as you can see, there's no more overlapping events in that track, and there's a single worker thread that executes this code.

And if we zoom in on one of these clusters we can see this is actually many instances of that event handler executing in rapid succession, which is exactly what we expected.

And when you zoom in more on one particular event, you can see it's still executing for a fairly short amount of time, and making those same read sys calls.

But now that is much less problematic because it's all happening on a single thread.

So, this may seem like a very simple and trivial change, but it's worth pointing out that it's exactly this type of small tweak that led to the 1.3X performance improvement in some of our own framework code that Daniel pointed out at the beginning of the session.

So, very small changes like this can make a significant difference.

All right so, let's look back at what we've covered today.

Daniel, at the beginning, went over with you the details of how not going off core unnecessarily is ever more important for modern CPUs, so that they can reach their most efficient performance state.

We looked at the importance of sizing your work for parallel workloads, and for work moving between subsystems in your application as well as inside those subsystems.

We talked about how to choose good granularity of concurrency with GCD by using a fixed number of serial queue hierarchies in your application.

And Pierre walked you through how to modernize your GCD usage to take full advantage of improvements in the OS and in our hardware.

And finally, we saw how we can use Instruments to find problem spots in our application and how to fix them.

For more information on this session, I will direct you to this URL where the documentation links for GCD are as well as the movie for the session, and we have some related sessions this week that might be worthwhile going to.

Introducing Core ML has already happened; the next two are going to help you with parallel computing [inaudible] tasks in your application, like we talked about at the beginning.

And the last two are going to help you with more performance analysis and improvements of different aspects of your app.

And with that, I'd like to thank you for coming.

If you have any questions, please come and see us at the labs.

[ Applause ]
