Error Handling Best Practices for HTTP Live Streaming

Session 514 WWDC 2017

HTTP Live Streaming (HLS) reliably delivers media content across a variety of network and bandwidth conditions. However, there are many factors that can impact stream delivery, such as server or encoder failures, caching issues, or network dropouts. Learn the best-practice behaviors that your servers should adopt to maximize reliability, and gain a practical understanding of the errors your app may encounter and how to handle them.

Hello, welcome to our session on error handling best practices for HTTP live streaming.

My name is Shravya Kunamalla and I am an AVFoundation engineer.

Let's get started.

There are a huge number of Apple developers streaming content using our very popular HTTP live streaming.

Over the years, the usage has evolved into multiple complex delivery scenarios.

The developers are doing live event broadcasts, prerecorded [inaudible], and in each of these there are possibly multiple different media selections: variants at different bit rates, and audio and subtitles in different languages.

The content itself might be protected and there could possibly be millions of simultaneous viewers subscribing to your streams.

Given the enormity, the system is bound to run into errors.

A lot of developers and content providers have asked us one question in particular over the years: what is the right thing to do when an error happens?

And by very popular demand, we present to you today the best practices for handling errors on both the app and the server side.

Most of you listening to this talk might already know all about HLS delivery, but let's quickly go through the overview.

We have a master playlist, this consists of alternate versions of the same presentation.

In this example, there is a 6 megabit and 2 megabit video, English and French audio, English and French subtitles.

Each of these is called a media playlist and has its own [inaudible] playlist.

The media playlist consists of segments.

In the case of live, the segment list is updated at regular intervals on playlist re-fetch.

The segments may be dropped off from the beginning and new segments are added to the end.

If the segments are protected, the media playlist also contains keys.

We also have session data, this can be for example titles or lyrics.

These are the resources that the server is expected to deliver and the HLS client needs for playback.

So, what should the server do when it's unable to deliver due to errors?

What are the best practices for handling both content and delivery errors?

There are a number of iOS, macOS, and tvOS clients expecting resources from the server.

The server should aim to deliver the resources in time and if it fails to do so, communicate the right error code to AVPlayer.

This error code should clearly convey the cause of error.

Was the request a valid request, was it authorized, has the server encountered an error?

Is the server incapable of performing the request, for example due to an unsupported feature request?

Next, let's see the recommended way to signal these various errors to AVPlayer.

So, here is the list of failures and the recommended error codes.

These are in compliance with the standard HTTP error codes specified in RFC 7231.

If segments are protected and AVPlayer does not have the required authentication, send 401.

If the client doesn't have authorization for the content, send 403.

For all temporary resource unavailable cases like [inaudible], send 404.

For permanent resource unavailability, send 410.

For all unexpected server conditions where no other specific message applies, send 500.

Most content providers or CDNs are caching proxies, which get the content itself from some encoder somewhere.

To notify of invalid response from gateway, send 502.

If the server is down for maintenance or overloaded, or is unavailable for any other reason, send 503.

For gateway timeout, send 504.
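The list above can be kept in one lookup table on the server. The following is a minimal sketch, assuming a Python-based origin; the `Failure` names and the `status_for` helper are hypothetical and only restate the pairings given in the talk.

```python
from enum import Enum
from http import HTTPStatus


class Failure(Enum):
    """Failure cases from the talk (the member names are illustrative)."""
    MISSING_AUTHENTICATION = "client lacks required authentication"
    NOT_AUTHORIZED = "client is not authorized for the content"
    TEMPORARILY_UNAVAILABLE = "resource temporarily unavailable"
    PERMANENTLY_UNAVAILABLE = "resource permanently unavailable"
    UNEXPECTED_SERVER_ERROR = "unexpected server condition"
    INVALID_GATEWAY_RESPONSE = "invalid response from gateway"
    SERVER_UNAVAILABLE = "server down for maintenance or overloaded"
    GATEWAY_TIMEOUT = "gateway timeout"


# Pairings as recommended in the talk.
STATUS_FOR_FAILURE = {
    Failure.MISSING_AUTHENTICATION: HTTPStatus.UNAUTHORIZED,            # 401
    Failure.NOT_AUTHORIZED: HTTPStatus.FORBIDDEN,                       # 403
    Failure.TEMPORARILY_UNAVAILABLE: HTTPStatus.NOT_FOUND,              # 404
    Failure.PERMANENTLY_UNAVAILABLE: HTTPStatus.GONE,                   # 410
    Failure.UNEXPECTED_SERVER_ERROR: HTTPStatus.INTERNAL_SERVER_ERROR,  # 500
    Failure.INVALID_GATEWAY_RESPONSE: HTTPStatus.BAD_GATEWAY,           # 502
    Failure.SERVER_UNAVAILABLE: HTTPStatus.SERVICE_UNAVAILABLE,         # 503
    Failure.GATEWAY_TIMEOUT: HTTPStatus.GATEWAY_TIMEOUT,                # 504
}


def status_for(failure: Failure) -> int:
    """Return the numeric HTTP status code to send for a failure case."""
    return int(STATUS_FOR_FAILURE[failure])
```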

Now these error codes aren't necessarily new they have been around for a while.

And if we look closer at these errors there is a class of errors that are temporary like resource and server temporary unavailability.

Starting with iOS 11, we now have a way to explicitly communicate such temporary failures to AVPlayer by means of the GAP tag.

We mark segments as GAP by using the EXT-X-GAP tag.

This can be applied to one or more segments.

Put this in your playlist to indicate GAP and enable AVPlayer to make an informed decision.

On seeing this tag AVPlayer will know that this is a temporary failure and may decide to go to a backup alternate or switch down.

If nothing viable is available in the utmost case AVPlayer will play the available media until we recover from the error condition.
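As an illustration of what a playlist with GAP segments might look like, here is a minimal sketch of a playlist writer, assuming a Python packager. The `Segment` type, the `render_media_playlist` function, and the segment file names are hypothetical; the sketch emits only the tags needed to show where EXT-X-GAP goes (immediately before the segment it applies to) and leaves out everything else a production playlist would carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    uri: str
    duration: int           # whole seconds, to keep the EXTINF line simple
    available: bool = True  # False when the encoder could not produce it


def render_media_playlist(segments: List[Segment],
                          target_duration: int,
                          media_sequence: int) -> str:
    """Render a live media playlist, marking missing segments with EXT-X-GAP."""
    lines = [
        "#EXTM3U",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}",
    ]
    for segment in segments:
        if not segment.available:
            # The tag applies to the segment that follows it.
            lines.append("#EXT-X-GAP")
        lines.append(f"#EXTINF:{segment.duration},")
        lines.append(segment.uri)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_media_playlist(
        [Segment("segment1.ts", 6), Segment("segment2.ts", 6, available=False)],
        target_duration=6,
        media_sequence=1,
    ))
```

The segment entry is still listed with its URI and duration, so the timeline stays intact and the player knows exactly which span of media is missing.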

So, going back to failures and error codes.

For which of these errors is the GAP tag applicable? For 404 (temporary resource unavailability) and 503 (server unavailability), always use the GAP tag.

Keep in mind, this tag is applicable to both live and [inaudible] playback, but the use case is typically the live scenario.

Next, let's move on to HLS specific media error cases.

On live playback, the HLS spec specifies that the playlist needs to be updated at regular intervals.

If the server is unable to update the playlist in time according to the published target duration, we recommend communicating the stale playlist to AVPlayer by sending 404.

Now returning stale playlist itself is fine, but that leaves the onus of identifying the stale playlist on the AVPlayer which it does eventually.

And on identifying that AVPlayer will try to recover by means of switching to other available [inaudible] or retries.

This may be too late in some cases leading to stalls.

Sending 404 instead will communicate the stale playlist to AVPlayer much more quickly.

There is another advantage here, it would also give immediate notification of stale playlist to any new AVPlayer joining the stream.
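A server-side staleness check could be sketched as follows. This is an assumption-laden illustration, not a rule from the talk: the function name `playlist_status` is hypothetical, and the `tolerance` factor of 1.5 target durations is a placeholder, since the talk only says the playlist must be updated "in time according to the published target duration".

```python
def playlist_status(last_update: float,
                    now: float,
                    target_duration: float,
                    tolerance: float = 1.5) -> int:
    """Return 200 for a fresh live playlist and 404 for a stale one.

    last_update and now are timestamps in seconds.
    tolerance is an assumed multiple of the target duration after which
    the playlist is treated as stale; tune it to your own delivery setup.
    """
    age = now - last_update
    if age > tolerance * target_duration:
        return 404  # tell the player right away instead of serving stale data
    return 200
```

For example, with a 6-second target duration and the default tolerance, a playlist last updated 5 seconds ago is served normally, while one last updated 20 seconds ago is answered with 404.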

For unsupported features, for example byte range not supported, send 501.

For all authentication failures, send 401.

Next, an example going through a typical live playback scenario.

Let's say we have two video variants, one of 6 megabit and one of 2 megabit.

We also have the corresponding encoder/packagers, one providing 6-megabit content and another providing 2-megabit content to our server.

And the server is distributing this content to the HLS client requesting it.

Let's say the [inaudible] bandwidth of the app is good enough to handle the 6-megabit variant, so it goes ahead and fetches the 6-megabit media playlist.

Gets the response back and moves on to fetch the first segment, segment one.

Everything seems to be good until now.

Then suddenly the 6-megabit encoder or packager is down with substantial downtime for example.

The next time AVPlayer re-fetches the playlist the server now has a way to communicate the failure to it, GAP tag.

For this re-fetch request, we recommend that the server send 200 OK and that the subsequent segments in the media playlist be marked as GAP.

AVPlayer on seeing this GAP tag switches down to 2-megabit variant media playlist and moves on to fetch the next segment, segment two, from the 2-megabit variant.

With this we have switched down smoothly and in time to avoid a stall.

For backward compatibility, the server should still send 404 for any request for a segment marked as GAP.
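The two halves of this recommendation (200 for the playlist that carries the GAP tags, 404 for the gapped segments themselves) can be sketched as a single request handler. This is a simplified model, assuming an in-memory store; the `respond` function and its parameters are hypothetical.

```python
from typing import Dict, Set, Tuple


def respond(path: str,
            playlist_text: str,
            segments: Dict[str, bytes],
            gap_uris: Set[str]) -> Tuple[int, bytes]:
    """Return (status, body) for one request during an encoder outage.

    playlist_text: current media playlist, already containing EXT-X-GAP tags
    segments:      segment URI -> bytes for segments the server really has
    gap_uris:      segment URIs that the playlist marks as GAP
    """
    if path.endswith(".m3u8"):
        # The playlist itself is valid, so it is served normally.
        return 200, playlist_text.encode("utf-8")
    if path in gap_uris or path not in segments:
        # Clients that do not understand EXT-X-GAP may still request the segment.
        return 404, b""
    return 200, segments[path]
```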

Next, let's move on to failover.

What is a failover?

It is a method of protecting the system from failure in which a standby or backup system takes over when the main system fails.

So, what failover can our server have?

One viable approach is to have redundant variants on backup servers: have variants on different servers with the same bit rate and include them in the master playlist.

This will give the AVPlayer the ability to smoothly switch over in case of error.

Backup alternates will be tried first before switching down.

If the server wants to explicitly trigger a failover, it should send 404 to the playlist request.
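A master playlist with redundant variants might be generated along these lines. This is a minimal sketch: the `render_master_playlist` function is hypothetical, the `example.com` host names are placeholders, and only the required BANDWIDTH attribute is written, whereas a real master playlist would normally carry more attributes per variant.

```python
from typing import List, Tuple


def render_master_playlist(variants: List[Tuple[int, List[str]]]) -> str:
    """Render a master playlist with backup URIs for each bit rate.

    variants: (bandwidth in bits per second, [primary URI, backup URI, ...])
    Each URI gets its own EXT-X-STREAM-INF entry with the same BANDWIDTH,
    primary first, so the backup is listed as an equivalent variant.
    """
    lines = ["#EXTM3U"]
    for bandwidth, uris in variants:
        for uri in uris:
            lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth}")
            lines.append(uri)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_master_playlist([
        (6000000, ["https://primary.example.com/6mbit/prog_index.m3u8",
                   "https://backup.example.com/6mbit/prog_index.m3u8"]),
        (2000000, ["https://primary.example.com/2mbit/prog_index.m3u8",
                   "https://backup.example.com/2mbit/prog_index.m3u8"]),
    ]))
```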

To summarize, always notify the HLS client of error with correct error code.

Have backup playlists on different servers to failover in case of server failures, having some redundancy is good.

Send 501 for unsupported features.

And in the case of live, update the playlist in time as specified by HLS Spec. Prefer GAP tag in case of temporary failures.

And send 404 to indicate stale playlist.

Next, let's move on to how to handle AVFoundation errors.

When an error occurs, the user viewing the actual stream wants to know two things.

First, that the error happened and second, what caused the error to happen.

And not all errors can be anticipated on the server.

The AVFoundation client or app should be written to respond appropriately to various error conditions originating from the AVPlayer.

So, how can we identify the error?

The error can be identified by looking at AVPlayer.status and AVPlayerItem.status.

These will change to AVPlayerStatusFailed and AVPlayerItemStatusFailed respectively on error.

For the exact error that caused the status to change to failed, look at AVPlayerItem.error.

This describes what caused the item to be no longer playable.

Listen to AVPlayerItemFailedToPlayToEndTimeNotification to get notified that the item did not play to end.

The user info dictionary of this notification contains an error object that describes the problem and can be retrieved by AVPlayerItemFailedToPlayToEndTimeErrorKey.

Dig deeper, look at AVPlayerItem.errorLog.

This gives the snapshot of all the error events that happened during the playback session.

So, what do these errors mean?

They can mean one of these four things, network errors, timeouts, format errors, and live playlist update errors.

Network errors are all the 4xx and 5xx errors that the server sends, as well as TCP/IP and DNS errors.

After requesting a resource, there are timeouts for each master playlist, media playlist, media file, and key.

And failure to get a response within this timeout will cause timeout errors.

Any incorrect format of the playlist, key, or session data will result in format errors.

And in the case of live, the playlist needs to be updated according to the published target duration, and failure to do so will cause live playlist update errors.

What are the corresponding AVFoundationErrorDomain error codes?

For network errors and timeouts, it will be AVErrorContentIsUnavailable or AVErrorNoLongerPlayable.

AVErrorContentIsUnavailable indicates that the content was never playable.

This could mean authentication failures or authorization failures.

AVErrorNoLongerPlayable indicates that the content was playable, but over the course of time one or more errors happened, resulting in it being no longer playable.

AVErrorFailedToParse indicates parsing failures.

AVErrorContentNotUpdated means the playlist was not updated in time.
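The pairing of the four error categories with these codes can be written down as a small lookup, which may be handy for analytics or log tooling outside the app. This is a sketch in Python rather than app code; the `Category` enum and `av_error_name` function are hypothetical, and the codes are carried as plain strings.

```python
from enum import Enum


class Category(Enum):
    NETWORK = "network error"
    TIMEOUT = "timeout"
    FORMAT = "format error"
    LIVE_UPDATE = "live playlist update error"


def av_error_name(category: Category, was_playable: bool = False) -> str:
    """Return the AVFoundation error code name the talk associates with a category.

    was_playable distinguishes content that never played from content that
    played for a while before failing (network errors and timeouts only).
    """
    if category in (Category.NETWORK, Category.TIMEOUT):
        if was_playable:
            return "AVErrorNoLongerPlayable"
        return "AVErrorContentIsUnavailable"
    if category is Category.FORMAT:
        return "AVErrorFailedToParse"
    return "AVErrorContentNotUpdated"
```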

Always look at the user info of the error to get the underlying error.

Keep in mind, this can be nested if more than one error caused the item to fail.
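The nesting can be pictured as a linked chain that you walk from the top-level error down to the root cause. The following is a language-neutral model in Python, not AVFoundation code: the `ErrorRecord` type and `error_chain` function are hypothetical stand-ins for an error object and the underlying error stored in its user info.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrorRecord:
    domain: str
    code_name: str
    underlying: Optional["ErrorRecord"] = None  # models the nested underlying error


def error_chain(error: Optional[ErrorRecord]) -> List[ErrorRecord]:
    """Flatten a nested error into a list, outermost error first."""
    chain = []
    while error is not None:
        chain.append(error)
        error = error.underlying
    return chain


if __name__ == "__main__":
    # Illustrative values only: a playback failure wrapping a network timeout.
    root = ErrorRecord("NSURLErrorDomain", "NSURLErrorTimedOut")
    top = ErrorRecord("AVFoundationErrorDomain", "AVErrorNoLongerPlayable", root)
    for depth, err in enumerate(error_chain(top)):
        print("  " * depth + f"{err.domain}: {err.code_name}")
```

Logging the whole chain rather than only the top-level error keeps the root cause visible when more than one error contributed to the failure.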

When a new error log entry is added to the error log, AVPlayerItemNewErrorLogEntryNotification is sent.

So, listen to this for immediate notification of error.

I would like to stress on one point here, AVPlayer will try its best to continue playback by means of retries and switching to different available variants.

The AVPlayerItem.status will change to failed only when there is no viable variant to use to continue playback and we have played out whatever buffer we have.

For all temporary errors, AVPlayer will attempt switching and/or retry.

If there is nothing to switch to AVPlayer will retry for a reasonable amount of time before giving up.

After a given amount of time it will attempt to switch back up to failed variant if the network conditions are suitable.

For permanent errors like 410 no retries will be attempted and AVPlayer only tries switching to a different variant.

The permanent and temporary error codes are in compliance with the standard HTTP error codes specified in RFC 7231.
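The behavior described here can be summarized as a small decision function. This is only a reading of the talk, not AVPlayer's actual implementation: the `next_action` function is hypothetical, and it covers just the status codes the talk explicitly labels as temporary (404, 503) or permanent (410).

```python
TEMPORARY_STATUSES = {404, 503}  # labeled temporary in the talk
PERMANENT_STATUSES = {410}       # labeled permanent in the talk


def next_action(status: int, alternates_available: bool) -> str:
    """Return 'switch', 'retry', or 'fail' for a failed request.

    alternates_available: whether another viable variant exists to switch to.
    """
    if status in TEMPORARY_STATUSES:
        # Temporary: switch if possible, otherwise keep retrying for a while.
        return "switch" if alternates_available else "retry"
    if status in PERMANENT_STATUSES:
        # Permanent: never retry; switching is the only way to continue.
        return "switch" if alternates_available else "fail"
    raise ValueError(f"status {status} is not classified in this sketch")
```

Retries in the temporary case are bounded ("a reasonable amount of time" in the talk), so a real model would also track how long it has been retrying before giving up.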

All session data errors are not fatal and not ignored.

Next, let's go to a code snippet.

To view the error, once you have done the usual things (create your asset, create your player item, create a player with that item), the first thing you should do is add an observer to track the status of the player.

Then add an observer to track the status of the player item.

And here you register to listen to AVPlayerItemFailedToPlayToEndTimeNotification.

Once you have that and the status of the item changes to failed, look at AVPlayerItem.error to print out what the error is.

This is the place where you should add code to display relevant messages about the error to the user.

On getting AVPlayerItemFailedToPlayToEndTimeNotification, extract the error as the value of AVPlayerItemFailedToPlayToEndTimeErrorKey and again, take appropriate action.

For instance, print the error or display relevant error messages to the user.

To summarize, always monitor AVPlayer and AVPlayerItem.status.

Listen to notifications: AVPlayerItemFailedToPlayToEndTimeNotification tells you when the item did not play to end.

If you want to monitor errors more actively, for example to send debug info to a server for analytics, listen to AVPlayerItemNewErrorLogEntryNotification to know when a new error log entry is added.

In conclusion, when an error occurs always take appropriate action, don't ignore it.

Notify the user of the error and always, always display meaningful messages or pop-ups when suitable.

For more information, go to the WWDC site and use the session number 514.

Thank you and have a great conference.
