Optimizing App Startup Time

Session 406 WWDC 2016

Launching an App is a complicated and subtle process and the ramifications on launch times of different App design patterns are often non-obvious. Come learn what happens in the time between when an App begins launching and when the main() function gets control and how that time relates to the code and structure of your App. Learn about the inner details of the dynamic loader, dyld, and best practices for structuring your code to perform at its best from the very start.

[ Music ]

Good morning and welcome to session 406, Optimizing App Startup Time.

My name is Nick Kledzik, and today my colleague Louis and I are going to take you on a guided tour of how a process launches.

Now you may be wondering, is this topic right for me.

So we had our crack developer marketing team do some research, and they determined there are three groups that will benefit by listening to this talk.

The first is app developers who have an app that launches too slowly.

The second group, is app developers that don't want to be in the first group [laughter].

And lastly, is anyone who's just really curious about how the OS operates.

So this talk is going to be divided in two sections, the first is more theory and the second more practical, I'll be doing the first theory part.

And in it I'll be walking you through all the steps that happen, all the way up to main.

But in order for you to understand and appreciate all the steps I first need to give you a crash course on Mach-O and Virtual Memory.

So first some Mach-O terminology, quickly.

Mach-O is a bunch of file types for different run time executables.

So the first executable, that's the main binary in an app, it's also the main binary in an app extension.

A dylib is a dynamic library; on other platforms you may know those as DSOs or DLLs.

Our platform also has another kind of thing called a bundle.

Now a bundle is a special kind of dylib that you cannot link against; all you can do is load it at run time with dlopen, and that's used on macOS for plug-ins.

Last, is the term image.

Image refers to any of these three types.

And I'll be using that term a lot.

And lastly, the term framework is very overloaded in our industry, but in this context, a framework is a dylib with a special directory structure around it to hold files needed by that dylib.

So let's dive right into the Mach-O image format.

A Mach-O image is divided into segments; by convention all segment names use uppercase letters.

Now, each segment is always a multiple of the page size; in this example the TEXT is 3 pages, and the DATA and LINKEDIT are each one page.

Now the page size is determined by the hardware: for arm64 the page size is 16K, and for everything else it's 4K.
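To make the page rounding concrete, here is a minimal Python sketch. The helper name `pages_needed` and the byte counts are illustrative assumptions, not figures from the talk; only the 16K and 4K page sizes come from it.

```python
ARM64_PAGE_SIZE = 16 * 1024  # 16K pages on arm64, per the talk
OTHER_PAGE_SIZE = 4 * 1024   # 4K pages elsewhere, per the talk


def pages_needed(content_size: int, page_size: int) -> int:
    """Return how many whole pages a segment with this much content occupies."""
    # Ceiling division: a partially used page still costs a full page.
    return (content_size + page_size - 1) // page_size


# Hypothetical segment content sizes in bytes (assumed for illustration).
EXAMPLE_SEGMENTS = [("TEXT", 40_000), ("DATA", 3_000), ("LINKEDIT", 9_000)]

for name, size in EXAMPLE_SEGMENTS:
    print(name, pages_needed(size, ARM64_PAGE_SIZE), "page(s)")
```

With these assumed sizes the result matches the shape of the example in the talk: three pages of TEXT and one page each of DATA and LINKEDIT.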

Now another way to look at the thing is sections.

So sections are something the compiler emits.

But sections are really just a subrange of a segment, they don't have any of the constraints of being page size, but they are non-overlapping.

Now, the most common segment names are TEXT, DATA, LINKEDIT, in fact almost every binary has exactly those three segments.

You can add custom ones but it usually doesn't add any value.

So what are these used for?

Well TEXT is at the start of the file; it contains the Mach header, any machine instructions, and any read-only constants such as C strings.

The DATA segment is read-write; it contains all your global variables.

And lastly, there's the LINKEDIT.

Now the LINKEDIT doesn't contain your functions or global variables; the LINKEDIT contains information about your functions and variables, such as their names and addresses.

You may have also heard of universal files, what are they?

Well suppose you build an iOS app for 64-bit, and now you have this Mach-O file. So what happens in Xcode when you say you also want to build it for 32-bit devices?

When you rebuild, Xcode will build another separate Mach-O file, this one built for 32-bit, ARMv7.

And then those two files are merged into a third file, called the Mach-O universal file.

And that has a header at the start, and all the header has is a list of all the architectures and what their offsets are in the file.

And that header is also one page in size.
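As a rough illustration of that header, the following Python sketch reads the architecture list from the start of a universal file. It assumes the classic 32-bit fat header layout (a big-endian magic and count, followed by five 32-bit fields per architecture) and does not handle the 64-bit variant. The function name `parse_fat_header` and the offsets and sizes in the example bytes are made up for the demonstration.

```python
import struct

FAT_MAGIC = 0xCAFEBABE       # magic number that starts a universal (fat) file
FAT_ARCH_FORMAT = ">iiIII"   # cputype, cpusubtype, offset, size, align


def parse_fat_header(data: bytes) -> list:
    """Return (cputype, cpusubtype, offset, size) for each slice in the file."""
    # The fat header and its entries are stored big-endian.
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a universal (fat) file")
    entry_size = struct.calcsize(FAT_ARCH_FORMAT)
    slices = []
    for index in range(count):
        cputype, cpusubtype, offset, size, _align = struct.unpack_from(
            FAT_ARCH_FORMAT, data, 8 + index * entry_size)
        slices.append((cputype, cpusubtype, offset, size))
    return slices


# Synthetic header with two slices. cputype 12 is ARM (subtype 9 is ARMv7)
# and 0x0100000C is ARM64; the offsets and sizes are hypothetical.
example = (
    struct.pack(">II", FAT_MAGIC, 2)
    + struct.pack(FAT_ARCH_FORMAT, 12, 9, 16384, 1000, 14)
    + struct.pack(FAT_ARCH_FORMAT, 0x0100000C, 0, 32768, 2000, 14)
)

for cputype, cpusubtype, offset, size in parse_fat_header(example):
    print(hex(cputype), cpusubtype, offset, size)
```

Note that the assumed slice offsets are multiples of the page size, which is what lets each slice be mapped directly.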

Now you may be wondering, why are the segments multiples of the page size?

Why is the header a page in size? It's wasting a lot of space.

Well the reason everything is page based has to do with our next topic which is virtual memory.

So what is virtual memory?

Some of you may know the adage in software engineering that every problem can be solved by adding a level of indirection.

So the problem that virtual memory solves is: how do you manage all your physical RAM when you have all these processes?

So they added a level of indirection.

Every process has a logical address space which gets mapped to physical pages of RAM.

Now this mapping does not have to be one to one, you could have logical addresses that go to no physical RAM and you can have multiple logical addresses that go to the same physical RAM.

This offers lots of opportunities.

So what can you do with VM?

Well first, if you have a logical address that does not map to any physical RAM, when you access that address in your process, a page fault happens.

At that point the kernel stops that thread and tries to figure out what needs to happen.

The next thing is if you have two processes, with different logical addresses, mapping to the same physical page, those two processes are now sharing the same bit of RAM.

You now have sharing between processes.

Another interesting feature is file backed mapping.

Rather than actually read an entire file into RAM, you can tell the VM system, through the mmap call, that you want this slice of this file mapped to this address range in your process.

So why would you do that?

Well, rather than having to read the entire file, with that mapping set up you access those addresses as if you had read the file into memory; each time you access an address that hasn't been accessed before, it causes a page fault and the kernel reads just that one page.

And that gives you lazy reading of your file.
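A small Python sketch of a file-backed mapping, using the standard `mmap` module as a stand-in for the mmap call mentioned above. The temporary file and its contents are made up for the demonstration, and the program cannot observe the page faults themselves; it only shows that mapping a file and touching one byte gives you that byte without an explicit read of the whole file.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # page size of the machine running this sketch

# Create a throwaway 8-page file so there is something to map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for page_number in range(8):
        f.write(bytes([page_number]) * PAGE)
    path = f.name

with open(path, "rb") as f:
    # Map the whole file read-only; this call does not read the file's contents.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        # Touching a byte is what makes the kernel bring in that page if needed.
        first_byte_of_page_5 = view[5 * PAGE]
        mapped_length = len(view)

os.remove(path)
print(first_byte_of_page_5, mapped_length // PAGE)
```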

Now we can put all these features together with what I told you about Mach-O, and you now realize that the TEXT segment of any dylib or image can be mapped into multiple processes, it will be read lazily, and all those pages can be shared between those processes.

What about the DATA segment?

The DATA segment is read-write, so for that we have a trick called copy-on-write; it's kind of similar to the cloning that's seen in the Apple File System.

What copy-on-write does is optimistically share the DATA page between all the processes.

As long as the processes are only reading from the global variables, that sharing works.

But as soon as one process actually tries to write to its DATA page, the copy-on-write happens.

The copy-on-write causes the kernel to make a copy of that page in another physical page of RAM and redirect the mapping to go to that.

So that one process now has its own copy of that page.
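The same behavior can be seen from user space with a private mapping. This Python sketch uses `mmap.ACCESS_COPY`, which requests a copy-on-write mapping of a file; the temporary file and its one-page contents are assumptions for the demonstration. The write lands in a private copy of the page, and the file on disk is left unchanged.

```python
import mmap
import os
import tempfile

# A one-page file standing in for a DATA page on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * mmap.PAGESIZE)
    path = f.name

with open(path, "rb") as f:
    # ACCESS_COPY asks for a private, copy-on-write mapping.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as private_view:
        private_view[0:1] = b"B"            # first write: the page gets copied
        seen_in_mapping = private_view[0:1]

with open(path, "rb") as f:
    seen_in_file = f.read(1)                # the file itself is untouched

os.remove(path)
print(seen_in_mapping, seen_in_file)
```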

Which brings us to clean versus dirty pages.

So that copy is considered a dirty page.

A dirty page is something that contains process specific information.

A clean page is something that the kernel could regenerate later if needed, such as by rereading it from disk.

So dirty pages are much more expensive than clean pages.

And the last thing is that permission boundaries are on page boundaries.

By that I mean you can mark a page readable, writable, or executable, or any combination of those.

So let's put this all together, I talked about the Mach-O format, something about virtual memory, let's see how they play together.

Now I'm going to skip ahead and talk a little about how dyld operates, and in a few moments I'll actually walk you through this, but for now I just want to show you how this maps between Mach-O and virtual memory.

So we have a dylib file here, and rather than reading it into memory we've mapped it into memory.

So, in memory this dylib would have taken eight pages.

The reason that's different from the file on disk is these ZeroFills.

So it turns out most global variables are zero initially.

So the static linker makes an optimization that moves all the zero global variables to the end, where they take up no disk space.

And instead, we use a VM feature to tell the VM, the first time this page is accessed, fill it with zeros.

So it requires no reading.

So the first thing dyld has to do is it has to look at the Mach header, in memory, in this process.

So it'll be looking at the top box in memory, when that happens, there's nothing there, there's no mapping to a physical page so a page fault happens.

At that point the kernel realizes this is mapped to a file, so it'll read the first page of the file, place it into physical RAM, set the mapping to it.

Now dyld can actually start reading through the Mach header.

It reads through the Mach header, the Mach header says oh, there's some information in the LINKEDIT segment you need to look at.

So again, dyld drops down to what's in the bottom box in process one.

Which again causes a page fault.

The kernel services it by reading the LINKEDIT into another physical page of RAM.

Now dyld can inspect the LINKEDIT.

Now, in the process, the LINKEDIT will tell dyld, you need to make some fix-ups to this DATA page to make this dylib runnable.

So, the same thing happens, dyld now reads some data from the DATA page, but there's something different here.

dyld is actually going to write something back, it's actually going to change that DATA page and at this point, a copy on write happens.

And this page becomes dirty.

So if I had just malloced eight pages and then read the stuff into them, I would have eight pages of dirty RAM.

But now I only have one page of dirty RAM and two clean pages.

So what's going to happen when the second process loads the same dylib?

So in the second process dyld goes through the same steps.

First it looks at the Mach header, but this time the kernel says, ah, I already have that page in RAM somewhere, so it simply redirects the mapping to reuse that page; no I/O was done.

The same thing with LINKEDIT, it's much faster.

Now we get to the DATA page, at this point the kernel has to look to see if the DATA page, the clean copy already still exists in RAM somewhere, and if it does it can reuse it, if not, it has to reread it.

And now in this process, dyld will dirty the RAM.

Now the last step is the LINKEDIT is only needed while dyld is doing its operations.

So it can hint to the kernel, once it's done, that it doesn't really need these LINKEDIT pages anymore, you can reclaim them when someone else needs RAM.

So the result is now we have two processes sharing these dylibs, each one would have been eight pages, or a total of 16 dirty pages, but now we only have two dirty pages and one clean, shared page.

Two other minor things I want to go over are how security affects dyld; there are two big security things that have impacted dyld.

So one is ASLR, address space layout randomization, this is a decade or two old technology, where basically you randomize the load address.

The second is code signing. Many of you have had to deal with code signing in Xcode, and you think of code signing as running a cryptographic hash over the entire file and then signing it with your signature.

Well, in order to validate that at run time, the entire file would have to be re-read.

So instead what actually happens at build time, is every single page of your Mach-O file gets its own individual cryptographic hash.

And all those hashes are stored in the LINKEDIT.

This allows each page to be validated, at page-in time, that it hasn't been tampered with and was owned by you.
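The idea of hashing each page separately can be sketched in a few lines of Python. This is only an illustration of the principle: the choice of SHA-256, the 4096-byte page size, and the function names `page_hashes` and `page_is_valid` are assumptions, and the real code signature format stores considerably more than a list of digests.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def page_hashes(image: bytes, page_size: int = PAGE_SIZE) -> list:
    """Hash every page on its own (the build-time step)."""
    return [hashlib.sha256(image[start:start + page_size]).digest()
            for start in range(0, len(image), page_size)]


def page_is_valid(image: bytes, index: int, hashes: list,
                  page_size: int = PAGE_SIZE) -> bool:
    """Check one page against its stored hash without reading the other pages."""
    page = image[index * page_size:(index + 1) * page_size]
    return hashlib.sha256(page).digest() == hashes[index]


original = bytes(3 * PAGE_SIZE)          # a made-up three-page image
stored = page_hashes(original)

tampered = bytearray(original)
tampered[PAGE_SIZE + 10] ^= 0xFF         # flip one byte inside page 1

print(page_is_valid(original, 1, stored))          # page 1 matches its hash
print(page_is_valid(bytes(tampered), 1, stored))   # modified page 1 does not
```

Only the page being checked is hashed, which is what makes validation at page-in time affordable.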

Okay, so we finished the crash course, now I'm going to walk you from exec to main.

So what is exec?

Exec is a system call.

When you trap into the kernel, you basically say I want to replace this process with this new program.

The kernel wipes the entire address space and maps in that executable you specified.

Now for ASLR it maps it in at a random address.

The next thing it does is, from that random address back down to zero, it marks that whole region inaccessible, and by that I mean it's marked not readable, not writable, not executable.

The size of that region is at least 4KB for 32-bit processes and at least 4GB for 64-bit processes.

This catches any NULL pointer references, and for 64-bit it also catches any pointer truncations.

Now, life was easy for the first couple of decades of Unix, because all the kernel had to do was map a program, set the PC into it, and start running it.

And then shared libraries were invented.

So who loads dylibs?

People quickly realized that loading them got really complicated fast, and the kernel people didn't want the kernel to do it, so instead a helper program was created.

In our platform it's called dyld.

On other Unixes you may know it as ld.so.

So when the kernel's done mapping a process it now maps another Mach-O called dyld into that process at another random address.

It sets the PC into dyld and lets dyld finish launching the process.

So now dyld's running in process and its job is to load all the dylibs that you depend on and get everything prepared and running.

So let's walk through those steps.

This is a whole bunch of steps and it has sort of a timeline along the bottom here, as we walk through these we'll walk through the timeline.

So first thing, is dyld has to map all the dependent dylibs.

Well what are the dependent dylibs?

To find those it first reads the header of the main executable that the kernel already mapped; in that header is a list of all the dependent libraries.

So it's got to parse that out.

Then it has to find each dylib.

And once it's found each dylib it has to open and read the start of each file, make sure that it is a Mach-O file, validate it, find its code signature, and register that code signature with the kernel.

And then it can actually call mmap on each segment in that dylib.

Okay, so that's pretty simple.

The kernel maps in your app and dyld; dyld then says, oh, this app depends on A and B dylib, load the two of those, and we're done.

Well, it gets more complicated, because A.dylib and B.dylib themselves could depend on other dylibs.

So dyld has to do the same thing over again for each of those dylibs, and each of the dylibs may depend on something that's already loaded or something new so it has to determine whether it's already been loaded or not, and if not, it needs to load it.

So, this continues on and on.

And eventually it has everything loaded.
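The recursive loading described above amounts to walking a dependency graph while remembering what is already loaded. Here is a minimal Python sketch; the image names, the graph, and the breadth-first visiting order are assumptions chosen for illustration, not a statement of how dyld orders its work.

```python
def load_all(main_image: str, dependencies: dict) -> list:
    """Return every image reachable from main_image, each loaded exactly once."""
    loaded = []
    seen = set()
    pending = [main_image]
    while pending:
        image = pending.pop(0)
        if image in seen:
            continue  # already loaded, nothing more to do
        seen.add(image)
        loaded.append(image)
        # Each newly loaded image may bring in dependencies of its own.
        pending.extend(dependencies.get(image, []))
    return loaded


# Hypothetical dependency graph: C.dylib is needed by both A and B.
LOAD_GRAPH = {
    "App": ["A.dylib", "B.dylib"],
    "A.dylib": ["C.dylib"],
    "B.dylib": ["C.dylib", "D.dylib"],
}

print(load_all("App", LOAD_GRAPH))
```

C.dylib shows up once in the result even though two images depend on it.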

Now if you look at the average process in our system, it loads anywhere between 100 and 400 dylibs, so that's a lot of dylibs to be loaded.

Luckily most of those are OS dylibs, and we do a lot of work when building the OS to pre-calculate and pre-cache a lot of the work that dyld has to do to load these things.

So OS dylibs load very, very quickly.

So now we've loaded all the dylibs, but they're all sitting there floating independent of each other, and now we actually have to bind them together.

That's called fix-ups.

But one thing about fix-ups is we've learned, because of code signing we can't actually alter instructions.

So how does one dylib call into another dylib if you can't change the instructions of how it calls?

Well, we call back our old friend, and we add a level of indirection.

So our code-gen is called dynamic PIC.

It's position-independent code, meaning the code can be loaded at any address, and it's dynamic, meaning things are addressed indirectly.

What that means is, to call from one thing to another, the code-gen actually creates a pointer in the DATA segment, and that pointer points to what you want to call.

The code loads that pointer and jumps to the pointer.

So all dyld is doing is fixing up pointers in DATA.

Now there's two main categories of fix-ups, rebasing and binding, so what's the difference?

So rebasing is adjusting a pointer that points to something within your image. The second is binding.

Binding is if you're pointing to something outside your image.

And they each need to be fixed up differently, so I'll go through the steps.

But first, if you're curious, there's a command, dyldinfo, with a bunch of options on it.

You can run this on any binary and you'll see all the fix-ups that dyld will have to be doing for that binary to prepare it.

So rebasing.

Well, in the old days you could specify a preferred load address for each dylib, and the static linker and dyld worked together such that, if the dylib was loaded at that preferred load address, all the pointers in DATA that were supposed to point internally were correct and dyld wouldn't have to do any fix-ups.

But these days, with ASLR, your dylib is loaded at a random address.

It's slid to some other address, which means all those pointers in DATA are still pointing to the old address.

So in order to fix those up, we need to calculate the slide, which is how much it has moved, and for each of those interior pointers, basically add the slide value to it.

So rebasing means going through all your data pointers, that are internal, and basically adding a slide to them.

So the concept is very simple, read, add, write, read, add, write.

But where are those data pointers?

The locations of those pointers in your DATA segment are encoded in the LINKEDIT segment.
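The read, add, write loop can be sketched directly. In this Python example a `bytearray` stands in for a DATA page and a plain list of offsets stands in for the locations encoded in LINKEDIT. The addresses, the 64-bit little-endian pointer layout, and the name `rebase` are assumptions for illustration.

```python
import struct

POINTER_MASK = 0xFFFFFFFFFFFFFFFF  # keep results within 64 bits


def rebase(data: bytearray, pointer_offsets: list, slide: int) -> None:
    """Add the slide to every interior pointer, in place."""
    for offset in pointer_offsets:
        (value,) = struct.unpack_from("<Q", data, offset)   # read
        value = (value + slide) & POINTER_MASK              # add
        struct.pack_into("<Q", data, offset, value)         # write


# Hypothetical addresses: where the image expected to load and where it landed.
PREFERRED_BASE = 0x100000000
ACTUAL_BASE = 0x104E3C000
slide = ACTUAL_BASE - PREFERRED_BASE

# A made-up DATA page: pointers at offsets 0 and 16, a plain integer at offset 8.
data_page = bytearray(24)
struct.pack_into("<Q", data_page, 0, PREFERRED_BASE + 0x1000)
struct.pack_into("<Q", data_page, 8, 42)
struct.pack_into("<Q", data_page, 16, PREFERRED_BASE + 0x2000)

rebase(data_page, [0, 16], slide)
print([hex(v) for v in struct.unpack_from("<QQQ", data_page, 0)])
```

Only the slots listed as pointers are changed; the integer at offset 8 is left alone, which is why the pointer locations have to be recorded somewhere.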

Now, at this point, all we've done is map everything in, so when we start doing rebasing, we're actually causing page faults to page in all the DATA pages.

And then we're causing copy-on-writes as we change them.

So rebasing can sometimes be expensive because of all the I/O.

But one trick we do is we do it sequentially, so from the kernel's point of view, it sees the faults on DATA happen sequentially.

And when it sees that, the kernel reads ahead for us, which makes the I/O less costly.

So next is binding, binding is for pointers that point outside your dylib.

They're actually bound by name; there's actually a string, in this case malloc, stored in the LINKEDIT, that says this data pointer needs to point to malloc.

So at run time, dyld needs to actually find the implementation of that symbol, which requires a lot of computation, looking through symbol tables.

Once it's found, that value is stored in that data pointer.
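Binding can be sketched as a lookup by name followed by a store. In this Python example each pointer slot carries the symbol name it needs, and each image exposes a table of exported symbols. The image names, addresses, and the simple search across all images are assumptions; the real lookup is more structured than this.

```python
def bind(slots: dict, exports_by_image: dict) -> dict:
    """Map each pointer slot to the address of the symbol it names."""
    bound = {}
    for slot_address, symbol in slots.items():
        for exports in exports_by_image.values():
            if symbol in exports:
                # Found the implementation: store its address in the slot.
                bound[slot_address] = exports[symbol]
                break
        else:
            raise LookupError(f"symbol not found: {symbol}")
    return bound


# Hypothetical export tables and pointer slots.
EXPORTS = {
    "libA.dylib": {"_helper": 0x2000},
    "libSystem.dylib": {"_malloc": 0x7000, "_free": 0x7100},
}
SLOTS = {0x1008: "_malloc", 0x1010: "_helper"}

bound_slots = bind(SLOTS, EXPORTS)
print({hex(slot): hex(target) for slot, target in bound_slots.items()})
```

Compared with rebasing, each fix-up here costs a name lookup rather than a single addition, which is the extra computation the talk refers to.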

So this is way more computationally complex than rebasing is.

But there's very little I/O because rebasing has done most of the I/O already.

Next, ObjC has a bunch of DATA structures: the class DATA structure, which has a pointer to its methods and a pointer to its superclass, and so forth.

Almost all those are fixed up via rebasing or binding.

But there's a few extra things that the ObjC run time requires.

The first is that ObjC is a dynamic language and you can request that a class be instantiated by name.

So that means the ObjC run time has to maintain a table of all the names and which class they map to.

So every time you load something that defines a class, its name needs to be registered with a global table.

Next, in C++ you may have heard of the fragile ivar problem, sorry.

Fragile base class problem.

We don't have that problem with ObjC because one of the fix-ups we do is we change the offsets of all the ivars dynamically, at load time.

Next, in ObjC you can define categories which change the methods of another class.

Sometimes those are for classes that are not in your image but in another dylib, and those method fix-ups have to be applied at this point.

And lastly, ObjC [inaudible] is based on selectors being unique, so selectors need to be uniqued.

So now we've done all the DATA fix-ups that can basically be described statically.

So now's our chance to do dynamic DATA fix-ups.

So in C++, you can have an initializer, you can say [inaudible] equals whatever expression you want.

That arbitrary expression needs to be run, and it's run at this point.

So the C++ compiler generates initializers for these arbitrary DATA initializations.

In ObjC, there's something called the +load method.

Now the +load method is deprecated, we recommend that you don't use it.

We recommend you use +initialize.

But if you have one, it's run at this point.

So, now I have this big graph: we have your main executable at the top and all the dylibs it depends on, this huge graph, and we have to run initializers.

What order do we want them in?

Well, we run them bottom up.

And the reason is, when an initializer is run it may need to call into some dylib, and you want to make sure that dylib is already ready to be called.

So by running the initializers from the bottom all the way up to the app, you're safe to call into something you depend on.
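Bottom-up order is a post-order walk of the dependency graph: an image is initialized only after everything it depends on. A minimal Python sketch follows, with a made-up graph and the hypothetical function name `initializer_order`.

```python
def initializer_order(root: str, dependencies: dict) -> list:
    """Return images in an order where dependencies come before dependents."""
    order = []
    visited = set()

    def visit(image: str) -> None:
        if image in visited:
            return
        visited.add(image)
        for dependency in dependencies.get(image, []):
            visit(dependency)
        # Appended only after everything this image depends on.
        order.append(image)

    visit(root)
    return order


# Hypothetical graph: the app at the top, shared leaf dylibs at the bottom.
INIT_GRAPH = {
    "App": ["A.dylib", "B.dylib"],
    "A.dylib": ["C.dylib"],
    "B.dylib": ["C.dylib", "D.dylib"],
}

print(initializer_order("App", INIT_GRAPH))
```

In the result every dylib appears before anything that depends on it, and the app comes last.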

So once all initializers are done, now dyld actually, finally, gets to call main in your program.

So you survived this theory part, you now all are experts on how processes start, you now know that dyld is a helper program, it loads all dependent libraries, fixing up all the DATA pages, runs initializers and then jumps to main.

So now to put all this theory you've learned to use, I'd like to hand it over to Louis, who will be giving you some practical tips.

[ Applause ]

Thanks, Nick.

We've all had that experience where we pull our phone out of our pocket, press the home button, and then tap on an application we want to run.

And then tap, and tap, and tap again on some button because it's not responding.

When that happens to me, it's really frustrating, and I want to delete the app.

I'm Louis Gerbarg, I work on dyld, and today we're going to discuss how to make your app launch instantly, so your users are delighted.

So first off, let's discuss what we're going to go through in this part of the talk.

We're going to discuss how fast you actually need to launch so that your users are going to have a good experience.

How to measure that launch time.

Because it can be very difficult.

The standard ways you measure your application don't apply before your code can run.

We're going to go through a list of the common reasons why your launch can be slow.

And finally, we're going to go through a way to fix all the slowdowns.

So I'm going to give you a little spoiler for the rest of my talk.

You need to do less stuff [laughter].

Now, I don't mean your app should have less features, I'm saying that your app has to do less things before it's running.

We want you to figure out how to defer some of your launch behaviors in order to initialize them just before execution.

So, let's discuss the goals, how fast we want to launch.

Well, the launch times for various platforms are different.

But a good rule of thumb is that 400 milliseconds is a good launch time.

Now, the reason for that is that we have launch animations on the phone to give a sense of continuity between the home screen and your application, when you see it execute.

And those animations take time, and those animations, give you a chance to hide your launch times.

Obviously that may be different in different contexts: your app extensions are also applications that have to launch, and they launch in different amounts of time.

And a phone and TV, and a watch are different things, but 400 milliseconds is a good target.

You can never take longer than 20 seconds to launch.

If you take longer than 20 seconds, the OS will kill your app, assuming it's going through an infinite loop, and we've all had that experience.

Where you click an app, it comes up to a home screen, it doesn't respond, and then it just goes away, and that's usually what's happening here.

Finally, it's very important to test on your slowest supported device.

So those timers are constant values across all supported devices on our platforms.

So, if you hit 400 milliseconds on an iPhone 6S that you're using for testing right now, you're probably just barely hitting it, and you're probably not going to hit it on an iPhone 5.

So let's do a recap of Nick's part of the talk.

What do we have to do to launch, we have to parse images, map images, rebase images, bind images, run image initializers, and then call main.

If that sounds like a lot, it is, I'm exhausted just saying it.

And then after that, we have to call UIApplicationMain; you'll see that in your ObjC apps, and in your Swift apps it's handled implicitly.

That does some other things, including running the framework initializers and loading your nibs.

And then finally you'll get a call back in your application delegate.

I'm mentioning these last two because those are counted in that 400 milliseconds time that I just mentioned.

But we're not going to discuss them in this talk.

If you want a better view of what goes on there, there's a talk from 2012, iOS App Performance: Responsiveness.

I highly recommend you go back and view the video.

But that's the last we're going to speak of them right now.

So, let's move on, one more thing I want to talk about, warm versus cold launches.

So when you launch an app, we talk about warm and cold launches.

And a warm launch is a launch where the application is already in memory, either because it's been launched and quit previously and it's still sitting in the disk cache in the kernel, or because you just copied it over.

A cold launch is a launch where it's not in the disk cache.

And a cold launch is generally the more important to measure.

The reason a cold launch is more important to measure is that's when your user is launching an app after rebooting the phone, or for the first time in a long time, that's when you really want it to be instant.

In order to measure those, you really need to reboot between measurements.

Having said that, if you're working on improving your warm launches, your cold launches will tend to improve also.

You can do rapid development cycles on warm launches, but then every so often, test with a cold launch.

So, how do we measure time before main?

Well, we have a built-in measurement system in dyld; you can access it by setting an environment variable.

DYLD_PRINT_STATISTICS.

And it's been available in shipping OSes actually, but it prints out a lot of internal debugging information that's not particularly useful, and it's missing some information that you probably want.

And we're fixing that today.

So it's significantly improved on the new OSes.

[ Applause ]

It's going to put out a lot more relevant information for you that should give you actionable ways to improve your launch times.

And it will be available in seed 2.

So, one other thing I want to talk about with this is that the debugger has to pause launch on every single dylib load in order to parse the symbols from your app and load your breakpoints; over a USB cable that can be very time-consuming.

But dyld knows about that and it subtracts the debugger time out from the numbers it's registering.

So you don't have to worry about it, but you may notice it because dyld's going to give you much smaller numbers than you'll observe by looking at the clock on the wall.

That's expected and understood, and everything's going correctly if you see that, but I just wanted to make note of it.

So let's move on, to setting an environment variable in Xcode, you just go to the scheme editor, and you add it like this.

Once you do that you'll get the new output logged to the console.

And what does that look like?

Well this is what the output looks like, and we have a time bar on the bottom representing the different parts of it.

And let's add one more thing.

Let's add an indicator for that 400 milliseconds target, which this app I'm working on is not hitting.

So, if you look, these are basically the steps that Nick discussed for launching an app, so let's just go through them in order.

So dylib loading, the big thing to understand about dylib loading and the slowdown that you'll see from it, is that embedded dylibs can be expensive.

So Nick said an average app can load 100 to 400 dylibs.

But OS dylibs are fast because when we build the OS, we have ways of pre-calculating a lot of that data.

But we don't have every dylib in every app when we're building the OS.

We can't pre-calculate them for the dylibs you embed with your app, so we have to go through a much slower process as we load those.

And the solution for this is that we just need to use fewer dylibs and that can be rough.

And I'm not saying you can't use any, but there are a couple of options here: you can merge existing dylibs.

You can use static archives and link them into apps that way.

And you have an option to lazy load, which is to use dlopen, but dlopen causes some subtle performance and correctness issues, and it actually results in doing more work later on, but it is deferred.

So, it's a viable option but you should think long and hard about it and, I would discourage it if at all possible.

So, I have an app here that currently has 26 dylibs, and it's taking 240 milliseconds just to load those, but if I change it and merge those dylibs into two dylibs, then it only takes 20 milliseconds to load the dylibs.
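A quick back-of-the-envelope check of those numbers, as a Python sketch. It assumes load time is spread evenly across the embedded dylibs, which is a simplification; only the counts and timings come from the talk.

```python
def per_dylib_ms(total_ms: float, dylib_count: int) -> float:
    """Average load cost per embedded dylib, assuming an even split."""
    return total_ms / dylib_count


before = per_dylib_ms(240, 26)   # 26 dylibs taking 240 ms in total
after = per_dylib_ms(20, 2)      # 2 merged dylibs taking 20 ms in total
saved_ms = 240 - 20

print(f"before: {before:.1f} ms each, after: {after:.1f} ms each, saved: {saved_ms} ms")
```

Under that assumption the cost per dylib stays at roughly 10 milliseconds, so the saving comes from the number of dylibs rather than from any one of them getting faster.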

So I can still have dylibs, I can still use them to share, functionality between my app and my extension, but, limiting them will be very useful.

And I understand this is a tradeoff you're making between your development convenience and your application launch time for your users.

Because the more dylibs that you have, the easier it is to build and re-link your app, and the faster your development cycles are.

So you absolutely can and should use some, but it's good to try to target a limited number; I would say, offhand, a good target's about half a dozen.

So now that we've fixed up our dylib count let's move on to the next place where we're having a slowdown.

We're spending 350 milliseconds in binding and rebasing.

So as Nick mentioned, rebasing tends to be slower due to I/O, and binding tends to be computationally expensive but the I/O has already been done.

So that I/O is for both of them, and they're comingled; the timing's also comingled.

So if we go in and look at that, all that is fixing up pointers in the DATA section.

So what we have to do, is just fix up fewer pointers.

Nick showed you a tool you can run to see what pointers are being fixed up in the DATA section, dyldinfo.

And it shows what segments and sections things are in, so that will give you a good idea of what's being fixed up.

For instance, if you see a symbol for an ObjC class in an ObjC section, that probably means you have a bunch of ObjC classes.

So, one of the things you can do is just reduce the number of ObjC classes, objects, and ivars that you have.

So there are a number of coding styles that are encouraging very small classes, that maybe only have one or two functions.

And, those particular patterns may result in gradual slowdowns of your applications as you add more and more of them.

So you should be careful about those.

Now having 100 or 1,000 classes isn't a problem, but we've seen apps with 5, 10, 15, 20,000 classes.

And in those cases that can add up to 700 or 800 milliseconds to your launch time for the kernel to page them in.

Another thing you can do is you can try to reduce your use of C++ virtual functions.

So virtual functions create what we call vtables, which are the same as ObjC metadata in the sense that they create structures in the DATA section that have to be fixed up.

They're smaller than ObjC metadata, but they're still significant for some applications.

You can use Swift structs.

So Swift tends to use less data that has pointers for fix-ups of this sort.

And Swift is more inlinable and can do better codegen to avoid a lot of that, so migrating to Swift is a great way to improve this.

And one other thing: you should be careful about machine-generated code. We have seen instances where you may describe some structures in terms of a DSL or some custom language and then have a program that generates other code from it.

And if those generated programs have a lot of pointers in them, they can become very expensive because when you generate your code you can generate very, very large structures.

We've seen cases where this causes megabytes and megabytes of data.

But the upside is you usually have a lot of control, because you can just change the code generator to use something that's not pointers, for instance, offset-based structures.

And that will be a big win.
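To make the pointer versus offset distinction concrete, here is a minimal sketch of what a code generator might emit in each style. The table contents and all names are hypothetical; the point is only that the second table stores integers rather than addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical generated tables; the names and contents are illustrative only.

// Pointer-based layout: each entry stores an address, so each entry
// needs a fix-up when the image is loaded at a slid address.
struct PointerEntry {
    const char* name;
    int value;
};
static const PointerEntry kPointerTable[] = {
    {"alpha", 1}, {"beta", 2}, {"gamma", 3},
};

// Offset-based layout: all names live in one character blob and each entry
// stores a 32-bit offset into it. The table holds only integers.
static const char kNameBlob[] = "alpha\0beta\0gamma";
struct OffsetEntry {
    std::uint32_t nameOffset;
    std::int32_t value;
};
static const OffsetEntry kOffsetTable[] = {
    {0, 1}, {6, 2}, {11, 3},
};
constexpr std::size_t kEntryCount = sizeof(kOffsetTable) / sizeof(kOffsetTable[0]);

// Resolves an entry's name at the point of use instead of at load time.
inline const char* entryName(std::size_t index) {
    assert(index < kEntryCount);
    return kNameBlob + kOffsetTable[index].nameOffset;
}

// Linear lookup by name; returns -1 when the name is not present.
inline int lookupValue(const char* name) {
    for (std::size_t i = 0; i < kEntryCount; ++i) {
        if (std::strcmp(entryName(i), name) == 0) return kOffsetTable[i].value;
    }
    return -1;
}
```

The address arithmetic moves from load time to the first access, which is usually a good trade when most entries are never touched during launch.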

So in this case, let's look at what's going on here with my load time.

And I have at least 10,000 classes, I actually have 20,000, so many it scrolled off the slide.

And if I cut it down to 1,000 classes, I just cut my time in this part of the launch from 350 to 20 milliseconds.

So, now, everything but the initializer is actually below that 400 millisecond mark, so we're doing pretty good.

So for ObjC set up, well Nick mentioned everything it had to do.

It had to do class registration, it has to deal with the non-fragile ivars, it has to do category registration and it has to do selector uniquing.

And I'm not going to spend much time on this one at all, and the reason is, we solved all of those by fixing up the rebasing and binding before.

All the reductions there are going to be the same thing you want to do here.

So we just get a little bit of a free win here, it's small.

It's 8 milliseconds.

But we didn't do anything explicit for it.

And now finally, we're going to look at my initializers which are the big 10 seconds here.

So I'm going to go a little more in depth on this than Nick did.

There are two types of initializers. First, explicit initializers, things like +load.

As Nick said, we recommend replacing that with +initialize, which will cause the ObjC run time to initialize your code when the class is instantiated instead of when the file is loaded.

Or, in C/C++ there's an attribute that can be put onto functions which will cause them to be generated as initializers, so that's an explicit initializer that we'd just rather you didn't use.

We'd rather you replace them with call site initializers.

So by call site initializers I mean things like dispatch_once.

Or if you're in cross-platform code, pthread_once.

Or if you're in C++ code, std::call_once.

All these functions have basically the same sort of functionality, where any code in one of these blocks will be executed the first time it's hit and only then.

dispatch_once is very, very optimized in our system.

After the first execution of it, it's basically equivalent to a no-op running past it, so I highly recommend that instead of using explicit initializers.
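As a rough illustration of a call site initializer in C++, the sketch below builds a lookup table on first use with std::call_once rather than in a load-time initializer. The table, the function names, and the build counter are all hypothetical and exist only to show the pattern.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical lookup table used only for illustration.
namespace {
std::once_flag g_tableOnce;  // constant-initialized, runs no code at load
std::unordered_map<std::string, int>* g_table = nullptr;
int g_buildCount = 0;        // counts how many times the setup code ran

void buildTable() {
    g_table = new std::unordered_map<std::string, int>{
        {"red", 1}, {"green", 2}, {"blue", 3}};
    ++g_buildCount;
}
}  // namespace

// Call site initializer: the table is built the first time this function
// is called, not while the binary is loading.
int colorCode(const std::string& name) {
    std::call_once(g_tableOnce, buildTable);
    assert(g_table != nullptr);
    auto it = g_table->find(name);
    return it == g_table->end() ? -1 : it->second;
}

// Exposes the counter so callers can confirm the setup ran only once.
int buildCount() { return g_buildCount; }
```

Both globals here are a flag and a null pointer, so nothing has to execute before main; the cost of building the table is paid by the first caller of colorCode.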

So let's move on to implicit initializers.

So implicit initializers are what Nick described, mostly from C++ globals with non-trivial constructors.

And one option is you can replace those with call site initializers like we just mentioned.

There are certainly places where you can replace globals with non-global structures or pointers to objects that you will initialize.
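One way this replacement can look in C++ is an accessor with a function-local static, which since C++11 is initialized on first call in a thread-safe way. This is a minimal sketch; the Settings type, its contents, and the constructed counter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical settings object; the type and its contents are illustrative.
struct Settings {
    std::vector<std::string> searchPaths;
    Settings() : searchPaths{"/usr/lib", "/usr/local/lib"} { ++constructed; }
    static int constructed;  // counts constructor runs for demonstration
};
int Settings::constructed = 0;

// Instead of `Settings gSettings;` at namespace scope, which would run the
// constructor before main, expose the object through an accessor.
// The function-local static is constructed the first time this is called.
Settings& sharedSettings() {
    static Settings instance;
    return instance;
}

// Convenience accessor that goes through the lazily built object.
const std::string& searchPath(std::size_t index) {
    const Settings& s = sharedSettings();
    assert(index < s.searchPaths.size());
    return s.searchPaths[index];
}
```

If the object is never needed during launch, its constructor simply never runs on the launch path.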

Another option is that you don't have non-trivial initializers.

So in C++ there's something called a POD, plain old data.

And if your objects are just plain old data, the static linker will pre-calculate all the data for the DATA section and lay it out as just data there; it doesn't have to be run, it doesn't have to be fixed up.

Finally, it can be really hard to find these, because they're implicit, but we have a warning in the compiler -Wglobal-constructors and if you do that it will give you warnings whenever you're generating one of these.

So it's good to add that to the flags your compiler uses.
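The following sketch shows two ways a C++ global can avoid a load-time initializer: a plain-old-data aggregate and a class with a constexpr constructor. The Limits and Version types and their values are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical configuration types; the names are illustrative only.

// Plain old data: no constructors, no virtual functions, no pointers.
struct Limits {
    int maxConnections;
    int timeoutMs;
    double backoffFactor;
};
static_assert(std::is_trivial<Limits>::value &&
                  std::is_standard_layout<Limits>::value,
              "Limits should stay plain old data");

// Aggregate-initialized from constants, so the bytes can be computed at
// build time and no initializer has to run at launch.
const Limits kDefaultLimits = {64, 5000, 1.5};

// A constexpr constructor gives a similar effect for a class type: a
// constexpr variable must be constant-initialized, so the compiler rejects
// anything that would need code to run at load time.
class Version {
public:
    constexpr Version(int majorPart, int minorPart)
        : majorPart_(majorPart), minorPart_(minorPart) {}
    constexpr int packed() const { return majorPart_ * 100 + minorPart_; }
private:
    int majorPart_;
    int minorPart_;
};
constexpr Version kVersion(10, 12);
static_assert(kVersion.packed() == 1012, "evaluated at compile time");

// By contrast, a namespace-scope `std::string gName = "app";` needs code to
// run for construction or destruction, which is the kind of declaration
// -Wglobal-constructors is meant to flag.
```

Building with -Wglobal-constructors enabled is a reasonable way to check that declarations like these stay free of load-time code as the types evolve.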

Another option is just to rewrite them in Swift.

And the reason is, Swift has global variables and they'll be initialized, they're guaranteed to be initialized before you use them.

But the way it does it is, instead of using an initializer, it uses dispatch_once for you behind the scenes.

It uses one of those call site initializers.

So moving to Swift will take care of this for you, and I highly encourage it if that's an option.

Finally, in your initializers please don't call dlopen, that will be a big performance hit for a bunch of reasons.

When dyld is running, it's before the app has started, and we can do things like turn off our locking, because we're single threaded.

As soon as a dlopen has happened in those situations, the graph of how our initializers have to run changes, we could have multiple threads, we have to turn on locking, and it's just going to be a big performance mess.

You also can have subtle deadlocking and undefined behaviors.

Also, please don't start threads in your initializers, basically for the same reason.

You can set up a mutex if you have to, and mutexes even have predefined static values that you can set them up with that run no code.

But actually starting a thread in your initializer is potentially a big performance and correctness issue.
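As a small illustration of a mutex set up with a predefined static value, the sketch below uses POSIX PTHREAD_MUTEX_INITIALIZER. The counter and function names are hypothetical; the relevant part is that the lock is usable without any code running at load time.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical counter guarded by a lock; the names are illustrative only.

// PTHREAD_MUTEX_INITIALIZER is a constant, so this mutex is ready to use
// without an initializer function running when the binary loads.
static pthread_mutex_t g_counterLock = PTHREAD_MUTEX_INITIALIZER;
static long g_counter = 0;

// Increments the shared counter under the lock and returns the new value.
long incrementCounter() {
    int rc = pthread_mutex_lock(&g_counterLock);
    assert(rc == 0);
    long value = ++g_counter;
    rc = pthread_mutex_unlock(&g_counterLock);
    assert(rc == 0);
    (void)rc;  // silences unused-variable warnings when asserts are disabled
    return value;
}
```

Any thread that needs this lock can then be started later, from main or from the first call site that needs it, rather than from an initializer.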

So here we have some code, I have a C++ class with a non-trivial initializer.

I'm having trouble with the connection.

Please try again in a moment.

Well, thank you Siri.

I have a non-trivial initializer.

And I guess I had it in for debugging. With it all commented out, okay, I'm down to 50 milliseconds, total.

I have plenty of time to initialize my nibs and do everything else, we're in very good shape.

So now that we've gone through that, let's talk about what you should take away from this, because this was really long and pretty dense.

The first one is, please use DYLD_PRINT_STATISTICS to measure your times, and add it to your performance regression suites.

That way you can track how your app is performing over time, so you notice a problem while you're actively working on something instead of finding it months later and having trouble debugging it.

You can improve your app launch time by reducing the number of dylibs you have, reducing the number of ObjC classes you have, and eliminating your static initializers.

And you can improve in general by using more Swift because it just does the right things.

Finally, dlopen usage is discouraged, it causes subtle performance issues that are hard to diagnose.

For more information you can see the URL up on screen.

There are several related sessions later in the week, and again, there's the app performance session from 2012 that goes into the other parts of app launch, which I highly recommend you watch if you're interested.

Thank you for coming everybody, have a great week.

[ Applause ]
