Effective HTTP Live Streaming

Session 502 WWDC 2012

Designed for mobility, HTTP live streaming dynamically adjusts playback quality to match the available speed of wired or wireless networks. Gain a practical understanding of how HTTP live streams are made. Learn best practices for constructing and testing your HTTP live streams.

Roger: Good morning, everyone. My name is Roger Pantos, and welcome to the first of two sessions this week on HTTP Live Streaming. It's actually really remarkable when you think about the last three years. I was chatting with a few of the other folks who work on streaming back at Apple, and we were talking about how, in our collective experience, no media technology we've ever seen has been adopted as quickly as HTTP Live Streaming, and certainly no new device has acquired so much new content as quickly as the iPad has. I think all of us in this room can take a lot of credit for that. We at Apple, of course, build some great devices, and we built a solid platform for you; everyone who's working on tools to produce the content; and of course, most importantly, all you folks who work on applications, the last mile between the network and the user. What the user sees is what's so vitally important. I think we can all be very proud of what we've accomplished.

Having said that, once you get over the initial enthusiasm of, "I've got primetime TV on my phone. This is awesome." Once you get over that sort of initial thing, and you start looking at the applications, and you start looking at the streams with a bit of a critical eye, you realize that, you know what, there are actually a few things we can do that could make these experiences even better. That's what we're going to talk about today.

To take us through that, we actually have an old hand with us today. He is our Media Technologies Evangelist. He is our go-to guy when you folks send us crazy, puzzling questions about what's wrong with your apps. He has contributed some of the sections in the spec, the internet draft for HTTP Live Streaming. Many of you have actually already met him, either in person, or over email, or on the developer forums. Ladies and gentlemen, I'd like to introduce to you Eryk Vershen.

Eryk: Good morning. Thank you all for coming. I hope you're not tired already; we've still got another three days to go. As Roger mentioned, I'm the Media Technologies Evangelist. I'm one of the engineers on the evangelism team. The talk today is Effective HTTP Live Streaming.

I want to start out by asking: what makes a great streaming app? Well, high performance, of course: fast start-up, fast seeking, no stalls. You also want navigation. You want to be able to seek fluidly. You want to be able to fast forward and rewind. It's also important that your client be getting the right stream, that is, the most appropriate stream for the bit rate they can currently sustain on the network they're currently on. Lastly, we believe you should have your content localized. Now, I want to emphasize that everything you need is in iOS 5 already. I assume you were at the keynote yesterday, so you will have noticed that the adoption rates for iOS 5 are pretty terrific, and with the combination of iOS 5 and iOS 4, you're covering almost all the devices, so there's not really any point in keeping backward support for older versions.

Before we get started, I want to do a little bit of a review of HTTP Live Streaming. Now, those of you who are familiar with HTTP Live Streaming already, please bear with me; this is only going to take a couple of minutes. Those of you who are new to HTTP Live Streaming, this might be a little bit fast for you, but I think you'll still get something out of the talk, and we have a lot of really great material in the documentation and in earlier WWDC presentations.

HTTP Live Streaming starts out with a really simple idea, really a simple concept, and that's that we want to deliver video over HTTP, and we want to do it with an ordinary web server. We don't want any magic in the web server, and the mechanism is pretty simple itself. You take your video, you break it into small, roughly equal-sized chunks, and you have a list of those pieces; in the case of live video, you have to update that list periodically. Now, that list, which we call a playlist, actually comes in two types: a media playlist, which is a list of media files, or segments; and a master playlist, which is a list of other media playlists.

Let's talk about the first one, the media playlist. The media playlist is, at its heart, just a list of those files, or segments; that's the fileSequenceA, B, C, D that you see here. Remember that the names are unimportant. We could be calling those Bob, and Jane, and Ted. What's important is the order. Now, the other lines that you see in this file, the ones that start with a hash, are tags, and those are the way we convey more information to the client: that this is an M3U8 file, a playlist file; what version we're using; what the maximum duration for a segment should be; and so on.

Now, this particular playlist is a video on demand (VOD) playlist. It has a playlist type of VOD, and it has an ENDLIST tag indicating that you're at the end. Now, if this were a live playlist, we wouldn't have the playlist type, and we wouldn't have the ENDLIST tag. Instead of being a static list, it would be a list that changes. The server would be updating it periodically, based on the target duration, so the client could fetch it and get a view of the newly produced content.
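
As a rough sketch of what that looks like on disk (the segment names and durations here are invented), a VOD media playlist might be:

    #EXTM3U
    #EXT-X-VERSION:3
    #EXT-X-TARGETDURATION:10
    #EXT-X-PLAYLIST-TYPE:VOD
    #EXTINF:9.97,
    fileSequenceA.ts
    #EXTINF:10.03,
    fileSequenceB.ts
    #EXTINF:9.90,
    fileSequenceC.ts
    #EXTINF:10.00,
    fileSequenceD.ts
    #EXT-X-ENDLIST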

Now, the other kind of playlist that you have is a master playlist. A master playlist is again a list of things, but in this case, it's a list of other playlists, media playlists. The other important thing in there is that it's a list of bit rates; that BANDWIDTH attribute tells the client how many bits per second this thing is going to take. This is essential so that the client can make the decision about switching between different bit rates.
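
A minimal sketch of a master playlist (the URIs and bit rates here are invented) might be:

    #EXTM3U
    #EXT-X-STREAM-INF:BANDWIDTH=200000
    low/prog_index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=400000
    medium/prog_index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=800000
    high/prog_index.m3u8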

We have this simple mechanism. Now, how do you take that mechanism and turn it into something that gives you high performance? Well, I want to start by asking the question: what do we mean by "high performance"? Fast startup, obviously; fast seeking; and absolutely no playback stalls. You also want to be able to switch between streams easily. There's a lot that goes into fast startup and seek. The first thing is that you want to make a good initial choice of variant, that is, the first entry in your master playlist. That's the one the client is going to start with, so you want its bit rate to be something that most of your clients are going to be able to sustain. Later, when I talk about getting the right stream to the right device, I'll cover how you can arrange to have different master playlists delivered to different devices, so that you can tune that initial choice.

Now, the other thing you want to do is serve your playlists with GZip. Playlists can have, in the case of VOD, hundreds of entries, and even a live playlist, if you're delivering a large window of content, can again have hundreds of entries. GZip reduces the size dramatically, and it's very easy to add to your server; it's typically a one-line fix in your server configuration (a sketch follows below).

In order to have fast startup and seek, you need to have an IDR-frame. Now, those of you who aren't that familiar with H.264 might not be familiar with IDR-frames. It's an instantaneous decoder refresh, and it's a special kind of I-frame. What it indicates to the decoder is: no frame that occurs after this frame depends on any frame that occurred before. It means, essentially, that I can reset the decoder. Now, you want that IDR-frame at the beginning of your segment, because if we are seeking, or if this is a live stream and we're starting partway through, we need an IDR-frame to get started. If you've put your IDR-frames partway into the segment, you're just wasting our time until we get to that IDR-frame.
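
On that GZip point, here's what the one-line fix might look like on Apache, assuming mod_deflate is enabled; the exact directive depends on your server and its configuration:

    # Compress playlist responses (Apache, with mod_deflate enabled)
    AddOutputFilterByType DEFLATE application/vnd.apple.mpegurl audio/mpegurl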

Now, in MPEG transport streams, there's a certain amount of overhead in there, and there's some padding that occurs, and depending on how your transport stream is getting created, you might have more overhead than in other cases. We've done a lot of work with our media file segmenter and stream segmenter to try and make sure that we don't have too much overhead. Typically, we have less than 10%, and in fact, in most cases, less than 5% overhead. Now, in looking at streams that come from other encoders, we've seen overhead as high as 45%. To make that a little more concrete for you, let's say you had a stream that was nominally a hundred kilobits per second. If you've got 45% overhead, then you've only got 55 kilobits of actual video data going through. If you've got 5% overhead, you've got 95 kilobits. Obviously, I can do a much better stream if I've got 95 kilobits for video data than 55. You want to minimize that overhead if you can. If you're working with a different vendor, I suggest that you take your original content, encode it with our media file segmenter, and see what kind of overhead you're getting, and use that as a negotiating point with your vendor.
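
For reference, running our segmenter over a piece of content might look something like this; the flags shown here (-t for target duration, -f for the output location) are a sketch, so check the tool's man page for the exact options in your version:

    # Segment content.mov into 10-second segments plus a media playlist
    mediafilesegmenter -t 10 -f /Library/WebServer/Documents/stream content.mov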

Now, the target duration, that is, the size of your segments, does have some effect on startup, but I want to emphasize over and over again that a 10 second target duration is what we recommend. It's what we recommended at the beginning, and it's what we still recommend. We believe that a 10 second target duration is the best choice. If you go with a smaller target duration, you're increasing the likelihood that you're going to get a stall, and that comes from two different causes. If you're delivering live content and you're going through a CDN, you're going to have propagation delays for that new content to make it all the way out to the edge nodes of the CDN, and that's going to be variable. Also, if the client is fetching this stuff over a cellular network, they've got higher latencies, and both of those things make you much more likely to encounter a stall if you have a small target duration.

Now, the other thing you want to do to prevent stalls is to make sure that you're not under-reporting your bandwidth in your master playlist. If I declared a particular variant, one of my choices, as being 200 kilobits per second, then please, its maximum bit rate should be 200 kilobits per second. If you don't do that, let's say I'm running at something lower, and I want to go up to that 200 kilobit per second stream; if that thing actually peaks at 300 kilobits, I might not have that much headroom, and I may stall, because you lied to me about how much bandwidth I needed.

The other piece that comes in here has to do with ads, and this is another thing that we've recommended over and over, and we continue to emphasize. People who are coming from some other environment will say, "I want to put in ad content. I'll just have a separate player, and when I need to do my ad, I'll just cut over to the separate player." When you're doing things with HLS, if you've got two players, two things that are fetching streams, the second one, when you start it up, doesn't have any of the knowledge the first one has about what bit rate it was getting, so it has to go through the whole algorithm again. It has to start with that first bit rate, which is typically going to be of lower quality, which means your ad, one of the things your client wants to be very good quality, is not going to be good quality; that's bad on their side. The other thing that's happening is that the second player is competing with the first stream. It's trying to do its pre-fetches to get set up, and they're competing for bandwidth. So we absolutely recommend that you have your ad content in the stream, and there are a number of techniques you can use to do that.
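
One common technique (the segment names here are invented) is to splice the ad segments into the media playlist itself and mark the splice points with the EXT-X-DISCONTINUITY tag, which tells the client that encoding parameters and timestamps may change at that point:

    #EXTINF:10.00,
    program_segment_41.ts
    #EXT-X-DISCONTINUITY
    #EXTINF:10.00,
    ad_segment_1.ts
    #EXTINF:10.00,
    ad_segment_2.ts
    #EXT-X-DISCONTINUITY
    #EXTINF:10.00,
    program_segment_42.ts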

There's another point that I want to bring up about preventing stalls, and this one has a little bit of a complicated setup, so bear with me here. This is about updating live playlists. Imagine you're delivering this content, and you're going through a CDN, right? In my little diagram here, the blue bars indicate when I've got my encode done, and that blue box is how long it takes me to get that segment out to the edge node. That's going to be variable. Remember that the segments are going to be different sizes depending on what kind of material I was encoding. Like, I might have been on a fade to black; I was on a black screen for a while, and now I'm into something where there's smoke or fog, so the encoding is more difficult. I end up with a larger segment that takes me longer to upload, on top of just the delays in the CDN.

As you remember, when you're delivering a playlist in the live environment, the segments that it refers to have to be on the system already. They have to be available. That means that you can't deliver the playlist for segment seven when segment seven is not available, because then you're going to get a stall as well. The naive view is: once I've got my segment up, I'll just ship my playlist up. But because I've got a lot of variation in how long it takes the segments to go up, I can end up with a lot of variation in how long it takes my playlist to come up. The playlist for segment seven may come up really fast, because it had a short segment, and the playlist for segment eight may take a long time, at which point the time between when I picked up the playlist for segment seven and segment eight might be more than one-and-a-half target durations, and I'm actually out of spec, and I could get a stall. What you want to do is delay when you're uploading those playlists. You want those playlists to come up and be available on a pretty regular cadence, like clockwork, because that's what the client needs.

Now, the other thing that you want is fast stream switching. Here, your IDR-frames are really important. That's where we can switch. If you think about it, when you're switching between two streams, you essentially have a discontinuity. You're potentially changing the profile and the level that you're going to, or the resolution, so the decoder is having to restart, and it needs that IDR-frame. This is where you can also have difficulty with longer segments. If you say, "I'll just go with bigger segments. If 10 is good, let's go to 20. Let's go to 30." The problem there is, when I want to do a stream switch, I always grab the segment before the segment where I believe I want to get in, partially because I don't know; there's not an absolute requirement that there be an IDR-frame at the beginning. I want to have that lead-in, which means that if I'm trying to switch and I've got bigger segments, I'm having to pull down a lot more data before I can make the transition.

You've done all this work to increase your performance. You want to use the access logs that you can get from the client to keep track of your performance.

These are great things, but remember, this is extracting data from the user that knows something about the user, so you've got to ask the user's permission to get this stuff. You want to look especially at duration watched, observed bit rate, and indicated bit rate, so you can tell whether you're actually increasing your performance or not. The three things I want you to take away are: use 10 second target durations; serve your playlists using GZip, it's really easy; and absolutely have at least one IDR-frame per segment.
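
Reading those fields on the client might look roughly like this in Swift; a sketch using AVPlayerItem's access log, and again, get the user's permission before collecting any of it:

    import AVFoundation

    // A sketch: pull playback metrics out of the player item's access log.
    func reportPlaybackMetrics(for item: AVPlayerItem) {
        guard let accessLog = item.accessLog() else { return }
        for event in accessLog.events {
            print("duration watched: \(event.durationWatched) seconds")
            print("observed bit rate: \(event.observedBitrate) bits/sec")
            print("indicated bit rate: \(event.indicatedBitrate) bits/sec")
        }
    }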

Now, we've got high performance, but we also need to have great navigation. I want to start talking about great navigation by asking the question: what makes great navigation? A generous live content window, and I'll talk about that in a little more detail; we want seeking to work well; and you absolutely want to have fast forward and reverse playback supported.

Remember, when you have live content, your playlist is a window into that content, because the content is essentially infinite; I can only show you a range of it. Now, according to the spec, you only have to have three target durations' worth of content, but the larger you make that window, the more flexibility you give to the user, so they can pause or seek back. To illustrate that, I've got a couple of little animations here. Let's say I have a small content window; it's three segments. I can't pause playback, because as I'm moving along (that arrow is the segment that I'm about to play), I've got no room here. I can't pause; I fall off of the window. Now, if I have a bigger content window, and I've only made it slightly bigger here because I don't have a large enough screen, as I'm moving along, I can actually pause while the stream's moving along, and I can start up again later.

Now, how big should you make this window? Well, we believe that at a minimum you should have a minute of content in this window, but what we really think you should do is have a lot more than a minute. You should have half an hour, or even an hour. You should have enough content that when the user gets a phone call, somebody's at the door, or there's a noise in the other room, something that interrupts them, they can pause, but get back to where they were in the content.
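
As a live playlist's window slides forward, segments drop off the front and new ones appear at the end; the EXT-X-MEDIA-SEQUENCE tag tells the client where the window currently starts. A sketch, with invented sequence numbers and names:

    #EXTM3U
    #EXT-X-VERSION:3
    #EXT-X-TARGETDURATION:10
    #EXT-X-MEDIA-SEQUENCE:182
    #EXTINF:9.90,
    fileSequence182.ts
    #EXTINF:10.01,
    fileSequence183.ts
    #EXTINF:9.95,
    fileSequence184.ts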

Now, you also want your seeking to work well. The most important thing is floating-point durations. In the very first version of HLS, you could only have integer durations, but please, use floating-point durations. There's really no point in sticking with integer durations. You need this because the only information the client has about how long the segments are is the duration you specify in the media playlist. If I want to seek forward, I haven't seen those segments. I don't know exactly how long they are. I only have the durations you've told me. If you give me integer durations, I'm accumulating errors, so if I attempt to seek well into the program and start up, I've accumulated an error, and I'm at a very different time; the current time that my app reports is going to be very different from what I would have gotten if I'd just played forward. For example, if every segment is really 9.97 seconds but the playlist reports 10, then a thousand segments in, a seek lands about 30 seconds away from where straight playback would have been.

The other part about seeking is that there are actually two kinds of seek: fast seek and precise seek. On AVPlayerItem, if I use seekToTime:, I'm getting a fast seek. In the case of a fast seek, what we're doing is finding the segment where we believe that time is, picking up the first IDR-frame, and going ahead and playing. We're just trying to get in the general ballpark. That goes pretty quickly, but if I really want precise seeking, if I want to get close to the frame that I want, I use that second method, seekToTime:toleranceBefore:toleranceAfter:.

Now, please remember that if you want precise seeking with HLS, those tolerances have to be zero. If you don't have them at zero, it won't invoke the precise seek. What happens in this case is we're actually going to go and pick up that segment, and the one before it, and we'll pick up the IDR-frame and try to start exactly where you want us to. It does take a little more time, because we have to fetch that segment beforehand, since we don't have a guarantee that you have an IDR-frame at the beginning, so we might actually have to start from there; also, there can be some minor variations in time codes, so the point we want may not be exactly in the segment that we're interested in.
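
In code, the two kinds of seek might look like this in Swift; a sketch, with the target time invented:

    import AVFoundation

    func seekExamples(on playerItem: AVPlayerItem) {
        let target = CMTime(seconds: 120.0, preferredTimescale: 600)

        // Fast seek: lands in the general ballpark of the target time.
        playerItem.seek(to: target) { _ in }

        // Precise seek: both tolerances must be zero,
        // or the fast path is used instead.
        playerItem.seek(to: target,
                        toleranceBefore: .zero,
                        toleranceAfter: .zero) { _ in }
    }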

The other thing that you want to have, in order to give the user a good experience, is fast forward and reverse playback. The way you do this is with I-frame playlists. This is really important if you're using AirPlay and Apple TV. If I move my content over to Apple TV, I don't have a scrub bar that I can touch; I've just got the fast forward button. If I don't have an I-frame playlist, the users are not going to get any feedback when they're doing a fast forward. The higher your I-frame frequency, the smoother your playback is going to be in fast forward. If I've got, say, four seconds between each I-frame, then when I'm going at 4x, I've got one frame to show to the user per second. If the separation is two seconds, then I'd have two frames to show. The higher we get this, the smoother the playback is going to be.

Your I-frame playlists can reuse your existing media files. If you've already got content, the I-frame playlist is pulling out references to those I-frames. It can point at exactly the same media files that you had for your normal playback content. The playlist uses our byte-range syntax to indicate the particular offset and length at which the I-frame sits within that segment.
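
A sketch of such an I-frame playlist (the byte offsets, lengths, and spacing here are invented) might be:

    #EXTM3U
    #EXT-X-VERSION:4
    #EXT-X-TARGETDURATION:10
    #EXT-X-I-FRAMES-ONLY
    #EXTINF:2.00,
    #EXT-X-BYTERANGE:5120@0
    fileSequenceA.ts
    #EXTINF:2.00,
    #EXT-X-BYTERANGE:4608@376832
    fileSequenceA.ts
    #EXT-X-ENDLIST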

You want to have multiple I-frame playlists in your master playlist. At the beginning, I've got my four variants here, and then I've got some I-frame playlists (a sketch of this follows below). Now, what you can't see in this is that the low I-frame m3u8 is pointing to the same media that the low audio-video m3u8 is pointing to. It's just pointing to pieces of it. Even though we're using the same media, the decision about which bit rate to choose is made independently in fast forward from normal playback. If, in normal playback, I'm playing my medium-level bit stream, then when I go into fast forward, I'm not required to pick the medium-range I-frame playlist. I'll pick whichever one has the most appropriate bit rate. Those are the essentials for having great navigation: a nice, generous live content window, floating-point durations, and I-frame playlists. We've got good performance. We've got great navigation.
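
For reference, I-frame playlists are declared in the master playlist with the EXT-X-I-FRAME-STREAM-INF tag; a sketch, with invented names and bit rates:

    #EXT-X-STREAM-INF:BANDWIDTH=200000
    low/audio_video.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=800000
    medium/audio_video.m3u8
    #EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=80000,URI="low/iframe_index.m3u8"
    #EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=320000,URI="medium/iframe_index.m3u8"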

The next thing is getting the right stream to the right device. The tag line I want you to remember is: the right stream to the right device. You want to choose your bit rates carefully. You want to know something about your client: what are the capabilities of this particular device? And there are some special considerations for cellular. We have this great tech note, TN2224, on bit rates. Rather than bore you by going through all of the detail in there, I thought I'd try to distill it down to some high points.

The first thing is that adjacent bit rates should be a factor of about one-and-a-half to two apart. The reason you want to keep them in a particular range is that you don't want them too close together. If you keep them too close together, you're essentially wasting bandwidth; you're just going to pick one. It's not really going to make much difference if I've got 150 and 180. What's the point? But you don't want them too far apart either, because if they're too far apart, you may end up in a situation where a client could actually have managed a better stream, but you don't have one available, because you're going from 100 kilobits straight to 300 kilobits. It's too big of a jump.

Key frames: IDR-frames. In all of our recommendations, they are no more than three seconds apart, and in fact, in some cases, they are less. The first bit rate that you have in your playlist should be the one that most of your clients can sustain. In choosing bit rates, you do have some considerations. The problem here is that I can't tell you the absolute right answer, because there is no absolute right answer. You have to figure it out for yourself. You have your own constraints. You're using some particular encoding hardware. You may have a limitation on the number of variant bit rates that you can produce. Maybe with your particular hardware you can only produce five. You can't produce six, so you've got to pick five that work for you.

Also, in the case of live, if you're delivering through a CDN, you've got to worry about how much bandwidth you need into your CDN to get all of these streams up at the same time, because you're not going through the big fat pipe where I've got a whole bunch of different edge nodes to talk to. I'm trying to get this in from the backend. I may have a bandwidth budget; in fact, I may even have financial constraints on that side.

Lastly, you want to think about what we talked about before: what's my ability to switch; what's the distance between these various bit rates? Once you've picked your bit rates, once you've deployed all this stuff, you want to verify your assumptions. Just like when you're measuring your performance, you want to measure this stuff. You want to track the client performance in the field. You've got access logs and error logs. Once again, don't forget to get the user's permission to grab this data. We've got a lot of fields in the access log, so you want to look at: what streams am I actually getting? How long are they playing for? Where are my streams switching? Am I getting stalls?

You also want to know about your clients. Devices have different capabilities. They have different screen resolutions. They have different versions of H.264 that they can handle, different profiles and levels. You might, in some cases, want to provide a different playlist to different models, like a particular playlist for iPads versus iPhones. And you want to use the capabilities you have on the client side to find out about your network.

The first thing you can do is add things into the playlist that allow the client to select based on resolution. Different devices have different resolutions that they can handle. Let's imagine that we have a playlist like this: I've got a 640x360, a 720p, and a 1080p. If I'm delivering to an older device, a 3GS, the 3GS can't play 720p or 1080p, so it's going to pick that 640x360, whereas the new iPad will pick the 1080p if it can handle the bit rate. You should also remember that if you've got an app, and you actually have a smaller window, you're not delivering your video to the whole screen. Let's say that on my iPad I had a window that was 640x360; then we can pull the 640x360 stream, because there's no point in getting the 1080p and downscaling it. If you're only showing the video in a small window, why would we waste the user's bandwidth? If you then go to full screen, we're going to switch up to the 1080p if we can.
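
Declaring this is done with the RESOLUTION attribute on each variant; a sketch, with invented URIs and bit rates:

    #EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
    mid/prog_index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
    hi/prog_index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=4000000,RESOLUTION=1920x1080
    full/prog_index.m3u8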

You can also add things into your playlist that allow the client to filter based on the codec, the particular profile and level that you're encoding. These CODECS tags are somewhat obscure, I'll admit, and we have some documentation available that will help you navigate to the correct one. That's why I added the comment on the slide, and by the way, that's not a comment. You can't put comments in your playlist, so don't try to use that slash. It won't work.

The first one is baseline, the second one is main, the third one is high. Once again, if I'm sending it to, say, the 3GS, it doesn't handle main or high, so it's going to take that baseline profile. It's going to ignore the other two variants, whereas, say, the iPhone 4 would take the main profile.
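
The CODECS attribute spells out the exact profile and level; a sketch (the specific profile and level strings, URIs, and bit rates here are illustrative and depend entirely on your encode):

    #EXT-X-STREAM-INF:BANDWIDTH=500000,CODECS="avc1.42001e,mp4a.40.2"
    baseline/prog_index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=1000000,CODECS="avc1.4d001f,mp4a.40.2"
    main/prog_index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=2000000,CODECS="avc1.64001f,mp4a.40.2"
    high/prog_index.m3u8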

You can also select based on your device model, by sniffing the user-agent string. This is something you do on the server side rather than the client side. You can sniff the user-agent string on the server: iPhone versus iPad, I'll deliver you a different playlist. The first two here are what you get if you're coming over Safari; the last two are what you get from an app. If you'd like to go to Stump the Experts: which model of phone can that third one not be from? It's a fairly obscure question.

You also want to know about your network. You've got the reachability API available to you, SCNetworkReachability. This allows you to find out whether you're on cellular or WiFi. Knowing this, you can either go to a different URL, or add it as extra information in your request, so that the server can tell which network you're on, and you get a different playlist. This is particularly important so that you can have that first item in the playlist be appropriate for the network you're on.
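
A sketch of that check in Swift, using SCNetworkReachability (the host name here is a placeholder):

    import SystemConfiguration

    // Returns true when the route to the host goes over the cellular (WWAN) radio.
    func isOnCellular(host: String = "example.com") -> Bool {
        guard let target = SCNetworkReachabilityCreateWithName(nil, host) else {
            return false
        }
        var flags = SCNetworkReachabilityFlags()
        guard SCNetworkReachabilityGetFlags(target, &flags) else {
            return false
        }
        return flags.contains(.reachable) && flags.contains(.isWWAN)
    }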

There are some special considerations for cellular, and this is something that we've repeated over and over again, but it bears repeating: if you're delivering this content over a cellular network, you absolutely have to provide a stream that is no more than 64 kilobits per second, or you will be rejected. And, as I was reminded by someone this morning, when you're doing that, when you submit to the App Store, give us the URL to your cellular stream, the cellular playlist, so that we can test it. Otherwise, what we have to do is try to sniff the network traffic to find that stream, and if we can't find it, we'll reject your app and you'll have to go through another cycle. The other thing that you need to remember: some people have been grabbing media in some other format, bringing it into their app, and running a local web server to serve it as HLS. You can't do that on cellular. Don't do that.

Delivering the right stream to the right device is about choosing your bit rates carefully, using the access logs to verify your assumptions, and customizing your master playlists. We've talked about great performance, great navigation, and getting the right stream. There's one other piece, which is localizing your content. You're probably delivering your app to the world, but even if you're delivering it just to a particular country, many countries have significant minorities that speak some other language, so you probably want to get your audio into multiple languages. The way you do this is with alternate audio playlists.

Here's a conceptual diagram of a master playlist. It's got a bunch of variants: an audio-only one, and then three audio-video variants at different bit rates. When I add audio alternates, what I'm doing is adding some more playlists, one that has English language audio, one with German, one with French; I'm grouping those audios together, and then associating that group with each of the variants.

Let me walk through what the master playlist looks like. To get one of those audio alternates, you use the media tag, which is something we added in 5.0, but it's backward compatible. The important things in here are the URI, which tells us where the playlist is; the language that it's in; and the name that we're going to show to the user for this alternate. When I add the German and French in here, again, I've got URIs, the language attribute indicates the language, and the name is the appropriate name in the language that the user speaks, so it's "Deutsch" and it's "Français"; it's not "German" and "French", because those are English. The way these things are connected together is with that group ID. The group ID says these things are all related; they form a group of audio that the user can switch between.

When I add this into a playlist, you can see the rest of this playlist is a typical master playlist with a bunch of variants. The connection is still that group ID, and the group ID is just an arbitrary name that you can come up with. The AUDIO attribute on the stream info tag says that this particular variant can have its audio substituted by any of the audios in this group. Those alternates that you have are constrained by the streams that use them. That is, if my variant says I'm using a particular codec, a particular sample rate and bit rate, then the alternates should have the same sample rate and bit rate. In fact, if my variant playlist has a particular target duration, the alternates should have the same target duration as well. If I have discontinuities, or I have ad insertions, the same things apply. These alternates have to be substitutable for the variants. The media tag is backward compatible, so I can take that playlist and provide it to a client who's not on 5.0, and they'll still be able to play the variants. They just won't be able to detect the alternates.
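
Putting those pieces together, a sketch of the relevant lines (the URIs, group name, and bit rates here are invented):

    #EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="en",NAME="English",DEFAULT=YES,URI="audio/en.m3u8"
    #EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="de",NAME="Deutsch",URI="audio/de.m3u8"
    #EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="fr",NAME="Français",URI="audio/fr.m3u8"
    #EXT-X-STREAM-INF:BANDWIDTH=800000,AUDIO="aud"
    mid/prog_index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=1600000,AUDIO="aud"
    hi/prog_index.m3u8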

As I start to wrap up, I want to mention a bunch more things that you can do. Along with the audio alternates, you can have video alternates: I can have an event with multiple cameras on it, and I can provide the feeds from both of those cameras to my users. I can add timed metadata, which allows me to attach metadata to particular instants in my movie. I can add American-style closed captioning. I can add wall-clock times and dates associated with points in my stream. I can also consolidate segments together: with that same byte-range syntax we used to identify I-frames in your media files, you can actually consolidate your media files into a single media file and reference the individual segments. This allows you to have fewer files on your server. And of course, streams can be encrypted.

We have a lot of resources available if you're interested in HTTP Live Streaming. Remember this URL: developer.apple.com/resources/http-streaming. We have pointers to our documentation there, including the internet-draft spec. We also have pointers to all of the tech notes that we've done, pointers to the tools, and to some example streams that we've recently put up. These example streams are really great. They're using all of the things that I've talked about today: floating-point durations, alternate audio, I-frame playlists, and more, timed metadata and closed captions. They're a great way to test your player if you're writing a player. This gives you some content you can play, so you can make sure that you're picking up timed metadata correctly, that you're dealing with closed captions, and that you're able to switch between alternate audio. I encourage you to take advantage of these example streams.

That's almost it. I'm Eryk Vershen. You can email me. The documentation and, of course, the tools are available where all of our special downloads are, at developer.apple.com/downloads. Just go there and search for HTTP, and you'll find the tools. I also want to recommend that if you have questions about HTTP Live Streaming that you don't get a chance to get answered here, you go to the dev forums. Engineers from the HLS team are on the dev forums answering questions. And another pitch for our talk tomorrow: please do come back tomorrow and find out all the cool new stuff we've done. Thank you very much.
