[ Music ]
[ Applause ]
This is session 411, System Trace in Depth.
My name is Chad Woolf.
And I'm Joe Grzywacz.
And we are performance tools engineers for Apple.
Now in last year's In Depth session we covered the Time Profiler.
And we showed you how to analyze your applications all the way down to the disassembly level.
And our goal, then, was to show you how to make your code as fast as possible.
Now at some point you may want to run that optimized code on multiple CPUs in order to get even more work done.
But as you do this, you also increase the system load.
Now as the system load increases, a couple of second-order effects creep in, such as increases in preemption, lock contention, and virtual memory activity.
Now any one of these three things is enough to potentially offset the gains that you had made when you were doing your time profiling.
So in today's session, we're going to show you how to use system trace to analyze the second-order effects.
And we're going to show you how to efficiently load the system while still maintaining optimum performance.
So our session's going to look like this today.
We're going to talk a little bit about system trace and why it applies to application developers.
And then Joe and I are going to walk you through system trace.
We're also going to talk a little bit about threading, signposts, some a little bit on virtual memory, and show you some best practices on how to get the most out of the tools.
So why system trace for application developers?
Well, when your application becomes front and center on the device, from the users' perspective that is the system.
They don't see your application as a peer with all these system services and daemons contending for shared resources like CPU time and memory.
They just see your app.
And so if your app stutters do to a flurry of virtual memory activity or due to maybe a misprioritized thread, they're going to come to you looking for a fix.
So that's the bad news.
The good news is that when your app is front and center on the device, it is the most important thing on the device.
So the operating system knows this, and it's going to give you as much CPU time and memory as it can possibly spare.
So when we talk about tuning with system trace, we're not really talking about tuning the system, we're talking about tuning your application to use the resources that it's been given.
Now system trace is a template in Instruments.
It works great in all four of our platforms.
And when you take a recording with it, it puts the kernel into a special tracing mode that records all the scheduling activity, system calls, and virtual memory operations that are occurring.
Now this can accumulate over time.
It could be a lot of data.
So one of the changes we made in Instruments 8 is that we put the template into Windowed Mode by default.
And what that means is that we keep about the last five seconds worth of data around.
And the advantage here is that you can take you can start your recording, setup your application, take as long as you need to get it to reproduce the performance problem, and when that performance problem reproduces, you hit stop and you get that last five seconds worth of actionable data.
Now this is what five seconds worth of data can look like.
These traces can get pretty dense.
We found it's very useful if you're able to correlate this data with some of the high-level activities inside your application, such as am I updating a table view, did I have a download going on in the background, maybe am I updating a graph.
Those kind of high level things.
So this year in Instruments 8 we included the Points of Interest instrument.
Now the Points of Interest instrument is essentially a blank canvas where you tell Instruments what you consider interesting and we'll put it up there on the graph for you.
Now you do this using Signposts from within your code.
And Signposts have been released for a while, but we used to have the way you used to get to it was to call this direct system call and you had to dig these macros out of these header files and put it altogether.
And it was kind of clunky.
Now on top of it being clucky in today's SWIFT world, it just simply doesn't work.
So what we're going to do is deprecate that approach.
But we have added in the new operating system these nice functions to do the exact same thing.
And they're kdebug signpost.
So they work great from C, Objective C, C++, and now SWIFT as well.
So to get started, the easiest way is to drop a point event on a timeline.
That's one of those little red lollipops.
And the way you do that is invoke kdebug signpost and that's it.
So inside of our Mouse Down, every time we hit our Mouse Down, you'll see one of these appear on the timeline.
Now it does take a few arguments.
The first one is a code that's just some arbitrary integer between 0 and 16383.
And it helps you identify your Signpost.
The next four arguments are integers as well and can be anything you want.
By default Instruments will just pass that right up to the UI.
Now you can name these things.
Just got to toggle over to the configuration section of the Instrument and add to the Table View.
And so we have our code 5 now being Mouse Down.
So next time I take a recording, I can see that those Points of Interest are now Mouse Downs.
Now if you like Points of Interest, you'll also like Regions of Interest.
They're basically the same thing, but they work for states and actions or things that occur over a period of time.
Now it's a little bit more complicated because instead of one call you do have two.
You got a start and an end.
And you also have a pairing rule that's in Instruments.
Now by default it's a very simple pairing rule.
It just uses the code, so all you have to do is supply the same code in both the start and the end.
Now we know that that's not going to work for all applications because some applications have to issue a flurry of starts followed by a flurry of ends.
And what that can be inside Instruments there can be some ambiguity in how to pair those things up.
So we do allow you to change the pairing rule.
Come down here to the Configuration.
You can select Code and First Argument or Code and Thread.
So Code and First Argument means that both your start and end have to have the same code and same first argument.
So in this example what we've done is started a bunch of URL downloads in parallel, and we've used the URL download task pointer as a first argument.
So now you can see these parallel downloads on the Instruments timeline.
If you use Code and Thread, you have to issue the same code start and end on the same thread.
And here's an example of that happening.
But you can see it's happening with inside of a dispatch apply block.
So now we have four Regions of Interest on the graph that are we've got one for each thread or a worker thread.
Now so far, the Regions of Interest you've seen have all been read.
It's all monochromatic.
But if you're willing to sacrifice your last argument, you can come over here and check this box that says use the last argument for color.
And in your fourth argument, you just need to supply one of these enumerations between 0 and 4, and those are the five basic colors in Instruments.
And in this example when my download task completes successfully, I've colored it green.
And if it completes with an error, I color it red.
So you can see very clearly a difference between pass and fail.
So now when we put this altogether, you can see how it's much easier to correlate that big, complicated trace I showed you with the high level activity in your application.
So for example, you can see that this flurry activity was actually created by our download tasks.
So let's talk about our demo application today.
We're going to be looking at an app we wrote for you guys called Graphasaurus 2, it's the spiritual successor to Graphasaurus from last year's session.
But just like last year's session, we're going to be looking at real-world problems.
So these are problems that we encountered when we were tuning Instruments 8.
And we decided to just include them in an iOS application for demo.
There is a new graphing style.
This looks more like the state graphing style that you see in Instruments because that's something that we're tuning this year.
And another difference from last year is that we're going to assume that our code is has already been time profiled, and it's already optimal given the constraints we have.
And just like last year, we're going to be generating our graphs on the CPU using core graphics.
Now, what we did is we did some initial timing.
And we found that to generate all four of these graphs.
It was going to take about 20 milliseconds in total.
And that's in the worst case when everything's zoomed out.
Now that is larger than our 16 millisecond deadline if we wanted to hit 60 frames per second.
So we decided that we needed to try to introduce some parallelism.
Because what we know is that all four of these graphs can be generated independently.
And they take about five milliseconds in the worst case.
So we're thinking if throw all this work at dispatch and we have perfect scalability, then our two-core iPad over here should be able to cut that work down in half.
To see how we did, show you how to use system trace.
I'm going to turn it over to Joe.
Thank you, Chad.
[ Applause ]
All right, so what you're looking at here is a Quick Time mirror of the iPad Pro I'm playing with here.
So you can see just like in Instruments, you can pan around nice and smooth.
But when I eventually pinch out a punch, the animation gets just a little bit stuttery way out here.
It's not awful, but it could be better.
So what I want to do now is figure out why that is.
So let's go to it here to Xcode.
And let's click and hold on the Run button and choose Profile.
So that'll build your application release mode.
Install it on the device.
And then Instruments will come up with its template chooser where you decide how you want to profile.
I'm going to go ahead and double click on the System Trace template.
And now from here before I start recording, I went ahead and added some of the kdebug signpost start and end points to my code ahead of time.
And so I want to configure how those will appear inside of Instruments.
And so you do that down here in the lower right in the Record settings.
And first off, I did want to use my first argument for color, so I put some unique numbers inside of my kdebug signpost.
And I added three different codes.
Code 0 is CADisplayLink.
So that's that 60 hertz timer that's driving the animation.
So basically this region will correspond approximately to my frame time because this is where I do all of my rendering.
Next is Code 1, and that is, I'll call it CreatePath.
So this one I'm actually creating the CG paths, the rectangles and the labels that are going to be appearing on the screen.
Finally Code 2 is called RenderGraph.
So that's when I take those CG paths and actually render them into a CG bitmap context and then display them on the screen.
Finally, since my code's going to be running in parallel and they're going to be emitting the same two codes here, I need to tell Instruments how to differentiate them.
And in this case, the thread is good enough since it'll be running on different threads.
All right, so I did all that configuration.
I don't want to be doing this again and again every time I come to Instruments.
So we go to file and choose Save As Template.
I'll give it a descriptive name.
Graphasaurus System Trace and hit Save.
And now let's say you close this document, you close Instruments, you come back a week later, whenever you get to the template chooser, move over to the Custom tab, there's your template ready to go.
You don't have to redo your configuration.
Just hit record.
Now Instruments is going to wait for me to kind of reproduce the problem here.
So I'm just going to pinch out, just like you saw me do before.
Pinch out where it gets to the problem.
And then I'm going to just reproduce that problem for a couple of seconds so that it fills that window buffer with the data I'm interested in.
And once I do that, I stop the recording.
So now Instruments will go download all of that data off the device and then begin to analyze it.
And since this was a Windowed Mode recording, you only get those last few seconds.
Make sure that when you reproduce the problem, you stop the recording right afterwards.
Otherwise, newer events are going to come in and kind of push out the stuff you were actually interested in.
So we'll wait here for Instruments to finish analyzing.
And there we go.
So we're looking at a whole bunch of stuff here.
So let's make this a little bit larger so we can see what's going on.
All right, so this first selected instrument that's the new Points of Interest Instrument.
So let's zoom in on a section here randomly and kind of see what we're looking at.
So now we can see there's all those codes I created.
I see DisplayLink, the CreatePath, the RenderGraph.
And it looks like I'd expect.
I have my big blue CADisplayLink time here.
And inside there's four pairs of the green and purple create and rendering those graphs.
And so that looks good.
But when I mouse over to this region, you get a little tool tip showing you those arguments you provided along with the duration.
And I'm actually running here close to 30 milliseconds.
So it's about half of the speed I want to be running at.
So that's not good.
But this was just one frame.
I rendered a ton.
So what does it look like in aggregate?
Well, we come down here and look at this detail table.
This is currently showing us a time sorted list of all those regions.
So you could look through this huge list here and look at the arguments and et cetera.
But we did a table here that aggregates that for you.
It's called the KDebug Interval Signposts by Code table.
And when I select that here's all the codes that Graphasaurus 2 emitted.
And we can see here's my CADisplayLink.
I rendered 152 frames, and on average, they were taking about 28 milliseconds.
So yeah, in average, I'm not running so well.
You can see the min, the max, the standard deviation that sort of thing.
You can dive in with this focus button next to the code.
And now that'll give you a table of all of that data.
Those are all the events that happened that were my CADisplayLink events.
So I'm going to from here you could sort them by anything you want, different arguments, whatever is important in your application.
Here I just want to sort them by duration.
And then what I'm going to do is, I don't know, I'm going to pick one of these ones here somewhere in the middle, and now what I want to point out is when I click on one of these rows, the graph view up above shifted.
And what happened was it went and it made the region I'm interested in visible.
Here it is.
And it put this blue inspection head at the beginning of that region.
So you kind of correlate the thing you clicked on in the bottom with where it is up above in the track view.
So now I'm looking at this frame.
You can also control-click on that row and choose set time filter.
And what that'll do is gray out everything in the detail view that's outside of that time range, and it does the same up in the track view up above.
So you can use that to kind of filter out your data that you're interested in or here just use it as kind of a visual cue at the frame you're looking at.
So now that I've done that.
I'm looking at this frame.
I can see that my CADisplayLink started here.
It ended here.
But I don't really know why it looks like this.
All this is telling you is when it started, when it ended.
You don't know if your application was doing work, if it went to sleep.
You can't tell from this graph.
So we need to dive in deeper.
Here in the top right of Instruments in the toolbar, we're currently on the Instruments strategy, which is those list of all the instruments in this template.
You could click here on this thread data to see all the threads in your application.
Alternatively, let's say we were already down here looking at a thread in our detail view.
If you option-click, you'll see you get these hyperlink kind of style that you can click on that and choose Reveal in Thread Strategy.
So that'll jump you to the Thread Strategy and preselect that thread for you.
So we can see here, make this a little bit larger.
It selected the main thread for us.
And if I look at this, there's a couple of other dispatch worker threads that are running.
And these two in particular, this one right here and the one below it, are doing a lot of work.
These are all those red dots that are showing up.
So if I option drag to zoom in on one of those regions, we can actually start to see what those are by hovering over them.
Here, this is a ulock wake system call, so it's waking up from some sort of lock.
Here's a ulock wait system call, so it's waiting on some lock.
And if you keep hovering over them, you're actually going to see that pattern repeating.
There's a lot of this ulock wait and waking going on.
So if you click, you'll set that inspection head at that point.
And if you come down here to what we call the threads and narrative view, down here in this table, it'll actually show you in that table kind of where I clicked up above, what was going on in the thread at this time.
All right, this is a list of everything this thread was doing, kind of a story of this thread's life.
So here we can see one of those wait calls.
Okay, well where did that happen?
Whenever possible Instruments is going to take a backtrace along with those system events.
And you can find those over here on the right in the Extended Detail View.
So we can see I have some SWIFT code building paths and inside of it it's actually creating an NSAttributedString.
Okay, and down below that a couple frames is where it's taking that lock.
Kind of unexpected, wasn't expecting a lock inside of NSAttributedString, but there it is.
So what are the ramifications of that?
Let's go back over to the threads narrative view and see what happens.
So it took us 109 microseconds just to take that lock.
And then we blocked for another 6 microseconds.
What's really cool this year is now it'll show you which thread made your thread runnable.
So basically who released that lock so that you could take it.
And we can see that it was made runnable by a Graphasaurus thread 0x8468b.
Okay. And we can see that even after that lock was released, we still waiting another 98 microseconds before we actually give a CPU back.
All right, let's look at what that other thread was doing.
Why did he release that lock?
Option click on that, choose reveal and thread strategy.
That'll select that other thread up above, and then down below in the narrative view, we'll see what thread was doing at that time.
And we can see he was calling ulock wake.
So he was releasing that lock.
That makes sense.
So we can kind of see that we have these two threads running in parallel, except they're kind of contending over this lock.
And so they're doing a bunch of things besides actually just running.
Another way to see that is up in this track view.
We have those thread states visible here.
Let me make this a little bit larger.
So if you hover over the thread states, you'll see it was running for this period of time, 64 microseconds.
Then it was blocked for a little while.
Then it was runnable for kind of a long time, so that means it's not actually running.
And then finally it runs.
And if we kept digging around, looking around, we would see this pattern repeating over and over again.
And so what becomes clear is these two threads are fighting over this lock.
And I'm creating a whole bunch of strings during this time.
So if this is taking so much time and is so important, why don't I see that in the time profiler?
Well, truth be told if you go back and look at the time profiler, it does show up.
But it's only like four or five percent or so.
It didn't really stick out as a giant red flag.
And a big reason for that is because these threads are spending a lot of time in the blocked and runnable states.
And time profiler only samples what's actually running on a CPU, so it's not going to show up there.
And so what I need to do to fix this problem is realize, well, I'm just putting some attributed strings onto a state graph.
There's not that many states that need to be displayed.
I should just cash them ahead of time, and then look them up in a dictionary lock free.
And everything should run a lot more smoothly.
And to show you what that looks like, back over to Chad.
[ Applause ]
Okay, so what Joe is seeing is a textbook Lock Contention pattern.
And this is where we have two threads now working away.
And they're contending for a shared resource somewhere in the attributed string lock creation code.
Now even though we're only holding that lock for a few microseconds, the performance impact is much more significant than that.
And Joe showed you that a little bit.
But I want to talk about it just a little bit more.
So when your thread is in the running state that means that the thread is on the CPU.
It's running its full stride, and all of those performance optimizations that you were making with a time profiler are now paying off.
Now at some point, you do call into ulock wake, and then some short period after that it puts the thread into the block state.
And so what's happening here is ulock wait system call realizes that that lock is being held by another thread, and so it asks the kernel to take you off the CPU and put you back when that thread or sorry, when that lock is actually acquired.
Now 3.42 microseconds later that does happen.
We do go into the runnable state.
But now when we're in the runnable state, this is the amount of time it takes us to get back onto the CPU.
Now if you'll notice we're in the runnable state for about 7 microseconds.
That's nearly twice the amount of time that that lock was actually contended.
So we're getting a significant amount of overhead.
Now another way to look at this quantitatively is you can go into the thread strategy, select the thread, create a time filter selection the way that Joe showed you, and then change the detail section from the narrative over here to the thread summary.
And what that will show you is the total time spent per thread state.
In this particular example, we can see that our thread was only running about 82 percent of the time.
Now what that means is that we're still getting some benefit from the multicore.
We are getting some work done, but our scaling is not that linear scaling, perfect scaling that we were hoping for.
We are still wasting a little bit of time.
Now when Joe makes the fix that he was talking about, what's going to happen well, two things are going to happen.
The first one is that the update graph regions, they're going to get a little shorter because we are doing less work when we're caching these strings.
But more importantly, that thread is no running a hundred percent of the time.
So you're going to get that perfect scalability.
So if you add a so if you double the number of CPUs, you're going to half the amount of time that that code takes.
And that's great.
So if you have a fix like that, you should definitely try to take it.
Now let's talk about preemption.
We didn't get to see any preemption in our examples here, it is something that shows up quite frequently in a system trace.
And what preemption is is an involuntary removable of your thread from the CPU.
So some other higher priority work was needed needed the CPUs.
There were none available, and so your thread had to be removed.
Now there's an exception to that.
There is a voluntary form of preemption that you might see from time to time.
And that occurs inside of a spin lock.
When a spin lock realizes it's not making any more forward progress, it can call into the thread switch system call and essentially yield its quantum of time over the holder of the lock.
And so what that looks like in system trace in the thread narrative is you'll see it called a thread switch.
And then the preemption narrative after it will say that it was yielding the CPU rather than being I think removed from the CPU.
So another lighter weight form of preemption is called the interrupted state.
And the interrupted state is when your thread is running on a CPU and that CPU has to handle a hardware interrupt.
So your thread is suspended.
The interrupt handler runs, and then your thread is resumed.
Now at this point, the priority of your thread doesn't matter.
You can have the highest priority.
It really won't matter.
The interrupt will always take precedence.
Now the good news is that these are typically brief, a couple of microseconds.
And typically they won't add up to contribute to any sort of performance problem inside your application.
But they do show up, so that's why I wanted to present them.
Now another nice feature of Instruments 8 is called a new System Load instrument.
And what that instrument does is helps you identify hotspots in your system trace that could be contributing to dropped frames, for example.
Now it does that in two ways.
The first way is this bottom table view.
And this shows you a picture of what the scheduling state look like.
All of the threads that were ready to run underneath the blue inspection line.
So you can tell that at that moment in time, we had three threads.
One was a kernel thread and two are a Graphasaurus threads.
And these are the threads that were not blocked and trying to run.
You can see the core assignments for those as well.
Now the other feature of this instrument is called the User Interactive Load Average graph.
And what this is is a each one of these bars represents a 10 millisecond window of time.
And the height of the bar is the average number of active threads in that 10 millisecond period.
So that's threads that are either running, runnable, preempted, or interrupted.
Essentially threads that are not blocked.
Now since it's the User Interactive Load Average, we only include threads that have a priority greater to or equal to 33 because those are the those are the threads with a priority that could interfere with the user interactive quality of service class.
Now to make it stand out a little bit more clearly, when the thread bars turn orange that means that your load average has exceeded the number of cores on that particular device.
So wherever you see a burst of orange, you can see that that's a frame drop waiting to happen.
So you might want to zoom in to those particular regions that are orange, make sure that the threads that you have running are balanced at the right quality of service level.
Now when Joe makes this fix we're going to end up with a little bit more head room.
We're going to be able to add a new feature.
And that new feature are these hover labels, very similar to the hover labels that you see in Instruments, except on Graphasaurus you do a long press.
And then the hover labels follow your fingertip.
So to show how that feature is going, I'm going to turn it back over to Joe.
[ Applause ]
So yeah, so I added the NSAttributedString fix, went back to 60 frames per second looked good, added the new generation of tool tips work, and things got slow again.
You can kind of tell there's a couple frames here dropping here and there.
So looked in at time profiler, wasn't anything obvious that I could remove.
There wasn't any extra work I was doing.
So I went back and took the system trace that you're looking at here.
Before I took that trace, I went ahead and added a new Signpost code, number 3, and that's going to represent my GenToolTips work.
And you'll see that show up in red up above.
So let's zoom in on one of these sections here.
All right, so we can see here's my new red bars, these GenToolTips.
And so it's important for me to kind of describe how my algorithm works here.
Basically whenever this CADisplayLink region starts, for every single graph on the scene, I go and do a dispatch async of all the render work for that graph and I dispatch async to go generate the tool tip dictionary, look up stuff.
And I do that for every single graph on the scene.
But then I had kind of a clever realization that in order to kind of call my rendering complete, I don't actually need to wait for the tool tip stuff to finish.
And so I do a dispatch group wait on just the render work.
And we can see that actually kind of worked here pretty well.
Here's my start of my frame.
The CADisplayLink time.
We can see some of those tooltips working here.
Let me scroll over to the right.
You can see actually one of them here.
Actually, barely doesn't even start until my render frame is done.
So it looks like I did a good job.
I can give myself a pat on the back.
That looks nice.
However, when I look at my CADisplayLink time here, it's taking 17.4 milliseconds.
Pretty close, but it's not my 16.6 that I want.
So again, that was just one of the regions.
Let's look at what we were doing in aggregate.
Let's go back to that KDebug Interval Signpost by Code table.
Here we can see our CADisplayLinks.
I did about 260.
That's more than we did the last time, so that sounds good.
Sixteen milliseconds on average.
That's actually less than my 16.6.
So that actually sounds pretty decent.
However, this max is still sitting here at about 19.27.
And if we look at all the individual events, let's sort this by duration from longest to shortest, we can see here's that one at 19.
There's a bunch in the 18s.
There's some here in the 17s.
A lot in the 17s.
More in these upper 16s.
So we still have a number of frames that are actually rendering too slowly.
Not by much, but they're still too slow.
That means we're going to be dropping frames.
So this time, where do we go from here?
We could go back to diving down into the thread strategy looking at all our threads and the system calls and VM events and thread events and all sorts of things.
But whenever possible, well this system trace template has a bunch of instruments up here, and they kind of give you sort of like higher level aggregate information that's kind of useful to look at that first.
So before you go diving down into the 100,000 plus events, take a look at these higher level aggregates.
And so what I'm going to do is let's take a look at this User Interactive Load graph.
That's part of that System Load instrument.
And let's go ahead and zoom out here to snap track to fit, so we can see all the data again back on the screen.
And when we've done that, I'm going to zoom in over here.
You can see there's a fair bit of orange in this graph.
Make this a little bit larger.
So you can see there's a fair bit of orange up there, which means we have more user interactive threads running than we have cores, right.
These are threads that are saying, I have a lot of work to do, and I need to do it now.
Give me a CPU.
Well, we're running out of CPUs and that's why our graph here is orange.
So let's zoom on one of these large regions of orange over here.
You can tell what the value is just by hovering over a region.
Zoom in a little bit.
So we can see this particular 10 millisecond bucket, on average there's about 2.84 user interactive threads that we're trying to run, again, on a dual-core machine, so about .8 threads are lacking CPU time.
And that's why that's orange.
And we can see on average, there's a lot of regions here that are too big.
Here actually it's four.
We were trying to run twice as many threads as we have cores.
So let's look at that region in more detail.
So let me scooch this over so you can see this.
As Chad mentioned, you can see what threads you're actually trying to run in that period by click and holding in the ruler view up here, and you can move that blue inspection head back and forth and Instruments will tell you which threads we're trying to run during that instant of time.
And if we come down and look at this table, I'll sort it by the priority.
We can see, I looked right here and there are two Graphasaurus threads that are running, cool.
There's actually two location D threads that are trying to run at a slightly lower priority.
And that's part of being on a real system.
You're going to see system daemons coming in and trying to do their work.
But it's okay.
They're running on a slightly lower priority than my stuff, so I still have the CPU, looks good.
I do, however, have this third thread that's running at the same time, well, trying to run at the same time.
And he's not getting any CPU resources.
And I know what these threads are.
The two of them are my render work.
And then I have the third thread that's trying to do my generation of tool tips.
And so what's happening is one of them is not able to get a CPU.
And we can kind of see this again.
If we look at, say, one of our frames.
Let's go back up to that Points of Interest Region.
You know, we could see that we're doing our rendering here inside of our frame, and the generation of tool tips happen, so it gets a little bit of CPU time here.
But what it's doing is when it does get that CPU time, it's taking away from my rendering.
And so basically I've kind of misprioritized my work because when I stop and think about it, well, I need that rendering to happen right now.
Being at that user interactive level makes perfect sense because I want it to be nice and smooth and 60 frames per second.
But these tool tips, they're not quite as high priority.
I do want them done quickly because as soon as that user long presses on that screen, I want them to show up.
But they're not really as important as that render work.
And you can see it pretty clearly here.
They are definitely taking away some of that time that CPU resources from this render work that's now being delayed.
And that's what's helping kind of drag out that CADisplayLink time.
So the fix for that is actually simple in this case.
Let's go over to Xcode.
So I have this view controller class.
And one of the things it does is it creates a tool tip queue.
We can see that created right down here.
This is where I do all my tool tip work.
And it's created with a couple attributes.
One is, you know, it's concurrent, so good, they can run on multiple CPUs if they're available.
And it's set for the user interactive QOS class.
And that's that same QOS class that my render work is happening, so they're all contending for resources.
So like we said, it's not actually as important, so I'm going to change that class.
You can read about the different classes right in the header file.
I'm going to take it a couple of notches down and go with the utility level class.
And what that'll do is give that CPU prioritization to do my rendering work, and then when there is a little bit of CPU time at the end of the frame or whenever, then the tool tips will run.
And they're still at a high enough priority that when that user taps on the screen, they should be ready to go.
And so to show you what that looks like, back over to Chad.
[ Applause ]
Okay, so when Joe makes that fix, the graph's going to look like this.
We're going to notice that our CADisplayLink times have come down to about 12.7 milliseconds on average, which is much better than before.
But even better still is that our max duration is only about 14.6 milliseconds.
So we're not dropping any frames, and we're well within our 16 millisecond deadline.
Now we're doing that in spite of also continuing to overload the system.
If you think about it, we are still running three different threads.
But because we've correctly prioritized the work, our Gen Tool Tips code is running down here at priority four.
So that's going to stay out of the way of the User Interactive code.
So we're still getting a lot of work done on the CPU.
We still have a very high system load.
But at the same time, we're still getting a perfectly smooth user experience.
So what is Quality of Service, really?
Quality of Service, in case you hadn't seen it yet, is an attribute that you attach to blocks, queues, and threads.
And it's basically an expression to the kernel about how much of the system resources you're willing to devote to getting that particular piece of work done quickly.
Now the different Quality of Service classes can constrain the priority range.
So you can see that our utility classes put it down into a priority of four, so in our User Interactive code is running in the high 30s, high 40s.
But the Quality of Service classes can also throttle things like IO and also the CPU frequency that the code is running at.
So when you pick a Quality of Service for your code, make sure that you look through the documentation very carefully and make sure that it matches the kind of work that you're doing.
Now another thing that can affect the performance of your application is virtual memory faults.
They do get a little worse under load as memory pressure increases.
But the good news is they are manageable.
Now System Trace has all the tools that you need in order to analyze virtual memory faults.
Inside the thread strategy, virtual memory faults appear as these little blue capsules.
Inside the thread narrative it reports virtual memory faults and even attaches a backtrace on where that fault was resolved inside your code.
Now we also have an instrument that's dedicated to analyzing virtual memory faults.
So for example, you can see where your code is more susceptible to one type of fault or another.
So for example, maybe you have code that is experiencing more zero fills or more copy on writes.
Now the next thing you need to know about virtual memory faults is that the fault occurs on access rather than allocation.
So you can ask for a large allocation from the kernel, let's say 500 meg, but you don't actually back that with physical memory until you start touching or accessing the pages of that allocation in your process.
So it's something to think about when you're allocating memory.
The other thing that's important to know about virtual memory faults is that they are resolved inline.
So there's no explicit call that you need to make to resolve a virtual memory fault.
All you need to do is touch any byte inside a page that's marked as requiring a fault, and the kernel will take control of your thread, resolve the fault, and then give you control back.
And so when you see those blue capsules in your System Trace on the thread strategy that's exactly what's happening.
So what do you do about virtual memory faults inside your application?
Well the easiest thing to do excuse me.
The easiest thing to do is just simply absorb them.
What I mean by that is leave enough room inside your performance budget where you can handle a certain amount of virtual memory faults before you run your deadlines.
This will make you more resilient under a load.
So as memory pressure increases, if your budgets are big enough and you have enough slack, you won't notice the difference in your performance.
Now we realize that some people do not have these kinds of lax deadlines in terms of their UI generation.
So another alternative is to try to do the faulting on a background thread.
So let's say you have a game, for example, and your player is coming to the end of level one and they're going to transition to level two.
Well, what you might do is dispatch async to a background queue and then touch the pages for the content of level two on that background queue.
And then by the time your rendering thread comes around to pick up that new content there will be no stutter.
Now, we'll have to give you a warning here on this approach.
Make sure you only touch the pages that you're absolutely going to use.
Because if you start touching more of the pages than you need, then you're actually going to make the problem worse.
And that about does it for today's session.
We think that the System Trace makes a great companion to the Time Profiler.
The Time Profiler helps you make your code fast, but System Trace allows your application to scale better under a higher load.
We encourage you to try a System Trace out on your own apps.
We know that when we try it against Instruments, we always find something that's worth fixing.
And if you've used System Trace in the past, we invite you to come back to Instruments 8 and give it another look because we've done some major improvements to both the approachability and the power of the tool.
We think it'll make a great addition to your toolbox.
For more information, here's our link Session 411.
We have some related sessions today that happened also happened during the week and on Friday.
Enjoy the rest of your morning.