Creating Photo and Video Effects Using Depth

Session 503 WWDC 2018

The TrueDepth camera in the iPhone X streams high-quality depth data in real time allowing you to enhance your photo and video apps in fun and creative ways. Dive deep into the principles and best practices for working with depth data, learn how to use the new Portrait Segmentation API for still images, and see how these techniques can create special effects like background replacement and perspective changes.

[ Music ]

[ Applause ]

Good morning, it's a pleasure to be here with you.

My name is Emmanuel and I'm an engineer on the Core Image team.

This morning we'll be looking at creating photo and video effects using depth.

Let's get started.

With iOS 11 we started delivering portrait depth data alongside your portrait still images.

During last year's WWDC sessions we showed how you can leverage that depth data to achieve amazing effects, such as forced perspective, simulated depth of field effects, as well as various foreground and background separation effects.

This year we're extremely excited to announce that we're coming out with a new feature, a portrait matte.

So during the first half of this session we'll be focusing on portrait still images and how you can apply really great effects on them.

During the second half my colleague Ron from video engineering will be focusing on using the TrueDepth camera to achieve real-time video effects.

All right, let's take a look at the portrait segmentation API.

So I mentioned a portrait matte, so what is a portrait matte?

A portrait matte is a segmentation from foreground to background and what this means precisely is that you have a mask which is 1.0 in the foreground and 0.0 in the background and you get soft and continuous values in between.

The portrait matte is of extremely high quality and is able to preserve fine details, such as curls and hair on the outline of your subject.

This is amazing.

The matte can be used to achieve a great many effects; here is just one of them, where we essentially do a foreground and background separation by darkening the background.
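
As a rough illustration of how a matte with these properties can drive the darkening effect, here is a minimal pure-Python sketch. It is not the Core Image code from the session: the function name `darken_background` and the `background_gain` parameter are hypothetical, and images are plain nested lists of linear values in [0, 1].

```python
def darken_background(image, matte, background_gain=0.3):
    """Blend each pixel with a darkened copy of itself using the matte.

    image: rows of (r, g, b) tuples with linear values in [0, 1].
    matte: rows of floats, 1.0 = foreground, 0.0 = background.
    background_gain: hypothetical scale applied to background pixels.
    """
    result = []
    for image_row, matte_row in zip(image, matte):
        out_row = []
        for pixel, m in zip(image_row, matte_row):
            # m keeps the original pixel; (1 - m) lets the darkened copy through.
            out_row.append(
                tuple(m * c + (1.0 - m) * c * background_gain for c in pixel)
            )
        result.append(out_row)
    return result
```

Because the matte is soft, pixels on the outline of the subject get a proportional mix of both versions, which is what keeps fine details such as hair from looking cut out.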

But we're really putting this tool into your hands so that you can create amazing apps and new effects and delight your users.

All right, so the portrait effects matte is coming to you with iOS 12.

It is available for both the front and the rear facing camera.

It is available to you with portrait still images and at the moment only when there are people in the scene.

Note that the portrait matte is linearly encoded, so it's not gamma encoded and what you get is a grayscale buffer.

And also there's no guarantee that you will be getting a portrait matte alongside your portrait still images, you always do get the depth data but you need to make sure to test for its existence.

All right, let's take a look at the API and how we can actually load that data in.

So ImageIO provides a low-level API that allows you to load portrait effects matte.

So when calling CGImageSourceCopyAuxiliaryDataInfoAtIndex there's a new key you can pass, kCGImageAuxiliaryDataTypePortraitEffectsMatte.

And this call returns a dictionary containing three main pieces of information.

The image data itself as a CFDataRef, metadata pertaining to the buffer itself as a CFDictionary, as well as metadata pertaining to the capture itself.

AVFoundation also provides a higher-level API that sits on top of ImageIO that you can use.

So taking the output from CGImageSourceCopyAuxiliaryDataInfoAtIndex you can feed it to the AVPortraitEffectsMatte class.

And what you get out of it is very simple: it's a CVPixelBuffer along with a pixel format type, so you can use that CVPixelBuffer for your further processing needs. It's really that simple.

AVFoundation also supports portrait matte delivery at capture time.

So starting with your typical AVFoundation setup with your AVCaptureInput, device, as well as capture session.

The first thing you want to do is make sure that your environment supports the delivery of the portrait effects matte.

To do this you'll be checking isDepthDataDeliverySupported, as well as isPortraitEffectsMatteDeliverySupported.

The reason we have the two there both depth and portrait effects matte is that they come together.

You can either opt in to get only the depth data, but whenever you want the portrait effects matte you also need to activate the depth data delivery.

And to activate or to opt in for that delivery make sure to modify your AVCapturePhotoSettings and set isPortraitEffectsMatteDeliveryEnabled, as well as isDepthDataDeliveryEnabled, to true.

Then at capture time your didFinishProcessingPhoto callback will give you the portrait effects matte.

It's really that simple.

All right Core Image also provides you with a way to load and save your portrait effects matte.

A new key was introduced, auxiliaryPortraitEffectsMatte, which you just pass when creating your CIImage with contents of URL, and you get a CIImage back which contains the portrait effects matte.

Core Image also allows you to save your portrait effects mattes directly into your files.

To do this there's a new context option called portraitEffectsMatteImage; you pass in your CIImage containing the portrait effects matte, and then you can write your file to disk using, for example, writeHEIFRepresentationOfImage.

All right so one thing that's important to note here is that the three images, so your RGB, the depth buffer, and the portrait matte buffer, live at different resolutions.

So for example, the portrait matte for the rear facing camera is half-size and the depth data is even smaller.

So let's look at the images side-by-side in the case of the rear facing camera.

What this means is that in your applications you need to make sure to either down-sample your RGB image to the size of your portrait depth or portrait matte, or up-sample them to the size of your RGB image.
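
To make the resampling step concrete, here is a minimal sketch of bilinear resampling for a single-channel buffer such as the matte. It assumes the buffer is a nested list of floats; the function name `resize_bilinear` is hypothetical, and a real app would normally let Core Image or the GPU do this work.

```python
def resize_bilinear(src, out_width, out_height):
    """Resample a 2-D grid of floats to out_width x out_height."""
    src_height, src_width = len(src), len(src[0])
    out = []
    for y in range(out_height):
        # Map the output pixel center back into source coordinates.
        sy = (y + 0.5) * src_height / out_height - 0.5
        sy = min(max(sy, 0.0), src_height - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, src_height - 1)
        fy = sy - y0
        row = []
        for x in range(out_width):
            sx = (x + 0.5) * src_width / out_width - 0.5
            sx = min(max(sx, 0.0), src_width - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, src_width - 1)
            fx = sx - x0
            # Interpolate horizontally on two rows, then vertically.
            top = src[y0][x0] * (1 - fx) + src[y0][x1] * fx
            bottom = src[y1][x0] * (1 - fx) + src[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

This sketch is best suited to up-sampling the matte or depth to the RGB size. When shrinking by a large factor, plain bilinear sampling skips source pixels, so a pre-blur or area average would usually be added first.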

So that's all I wanted to talk about today for the portrait segmentation API and we have a great demo for you to see this live in action.

So during this demo I'll be making use of a Jupyter Notebook which is a browser-based, real-time interpreter for Python.

And we'll be making use of Python bindings for Core Image which we'll be introducing later today in a separate session.

So let's start and load an image that contains portrait depth and portrait matte in.

So this is the image we're going to be working with.

The first thing I want to show you is what the depth data looks like for that image, so let's look at the two side-by-side.

So we have the portrait depth on the left and the portrait matte on the right-hand side and you can just see how fine the details are.

And we'll do a zoom crop in just a minute so that you can better appreciate just how high quality it is.

The next thing we do is we resize the images.

As you see the RGB and the depth data vary greatly in size.

So we resize our images and let's have a look at them side-by-side.

So we have our RGB and our depth data on the right-hand side.

During the first part of this demo I'll focus on depth data, then we'll see how things get much, much simpler when you use a portrait effects matte.

So the effect I'm going to be working on today is depth thresholding and essentially what I'll be doing is computing a histogram of the gray level values in my portrait depth.

And I'll be applying a threshold or a clipping point in that histogram so that everything becomes zero or one depending if it's sitting below or above that threshold.

Then we'll be closing holes in the image by using morphological closing operations and then blurring the mask so that we get a nice feathered look.
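
The steps just described can be sketched in plain Python as follows. This is not the notebook code from the demo: all function names are hypothetical, the depth buffer is assumed to hold disparity-like values where larger means closer, and a box blur stands in for whatever blur the demo used.

```python
def percentile_threshold(depth, percentile):
    """Return the value below which `percentile` percent of pixels fall."""
    values = sorted(v for row in depth for v in row)
    index = min(len(values) - 1, int(len(values) * percentile / 100.0))
    return values[index]


def binarize(depth, threshold):
    """1.0 where the value is at or above the threshold, else 0.0."""
    return [[1.0 if v >= threshold else 0.0 for v in row] for row in depth]


def _neighborhood(mask, x, y, radius):
    """Values in a square window, clamped at the image borders."""
    h, w = len(mask), len(mask[0])
    return [mask[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]


def _filter(mask, radius, reducer):
    """Apply `reducer` to the window around every pixel."""
    return [[reducer(_neighborhood(mask, x, y, radius))
             for x in range(len(mask[0]))]
            for y in range(len(mask))]


def close_holes(mask, radius=1):
    """Morphological closing: dilation (max) followed by erosion (min)."""
    return _filter(_filter(mask, radius, max), radius, min)


def feather(mask, radius=1):
    """Box blur that softens the hard edge of the binary mask."""
    return _filter(mask, radius, lambda values: sum(values) / len(values))
```

A typical call chain would be `feather(close_holes(binarize(depth, percentile_threshold(depth, p))))`, with the percentile and the two radii being the parameters that get tuned by hand.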

Let's have a look at this in action.

So remember all of this is executed live in the browser using Core Image as a back end.

All right so the first thing I want to show you is how changing the percentile point changes my mask here.

So the higher it is the less aggressive I am on clipping the foreground.

So let's pick a value that's reasonable, maybe something like this here.

And what you can see here is that there are regions or islands we call them that are connected to my foreground and there's a bit of the subject here I'd like to take out.

So what I'll do is I'll add a bit of morphological closing.

Look at how this appears magically.

If I go too far obviously I lose my entire subject, I don't want to do that.

So let's pick something like this.

What I can do then is change the feathering by applying the mask on top of it all.

Let's take a look at how the RGB is thresholded.

This is not the effect we're coming up with, it's just to give a sense of how the mask is applied, so let's keep going.

So I've chosen a few parameters for this thresholding here and this is what I'll be using for my foreground.

Next, I'm applying an effect only on my foreground, so in this particular case I'm using the Core Image photo effect Noir which turns everything grayscale and has a bit of contrast.

I'm doing an exposure adjustment, as well as desaturating my image slightly and augmenting the contrast even further.

Let's take a look at the output.

This is going to be the foreground that I'll be using and what I want to do here is leverage the depth data mask that I have to composite this foreground onto a background.

Let's just generate a background which is just a darker version of the original image.

We can then composite the two together using the Core Image filter blendWithMask and we have the result right there, it's that simple.

All right.

Thank you.

[ Applause ]

Okay so as you saw that required a bit of fiddling with the parameters for clipping, for smoothing, and so on and so forth.

The portrait effects matte enables you to do this without actually doing much processing, this is really exciting so let's have a look at this.

So here we're starting with another image that also has portrait depth and portrait matte information embedded into it, so let's look at these.

As I mentioned earlier, this is an extremely high-quality foreground mask.

So let's have a look at a crop on this hair.

Look at how fine the detail is, this is beautiful.

On the right-hand side here is the depth data, and it shows you just how coarse the depth data is compared to the portrait effects matte.

So we'll do another foreground separation effect here, similar to what we just did, but we'll use the portrait matte instead of the portrait depth.

So let's look at our foreground, similar to before, but in this case we'll desaturate the foreground slightly, add a bit of contrast, as well as some vibrance to the image.

Now let's generate a background, in this case we'll do a disk blur which is another Core Image filter, as well as bring down the exposure so things get pretty dark.

But we still get a bit of background remaining, it's quite faint but it's still there.

And again, I use CIBlendWithMask which is going to do the compositing for us with the mask and we have left and right.

Isn't this beautiful?

[ Applause ]

Thank you.

All right let's look at another great demo which we call Big Head.

So because the portrait matte is so fine we can actually do things like change the [inaudible] size of our subject with respect to the background using it.

So let's do just that and we'll do it live.

So here's our input image on the left here and the portrait matte on the right-hand side.

And what I'll be doing here is playing with the size of the subject in my frame; notice how the subject is getting smaller and bigger.

And you can actually you know give more weight to the subject in your frame, but you can also do pretty cool things like now that I have this let's say that I pick my favorite size here.

You can do things like pseudo simulated depth of field here by giving more contrast to the foreground and using a [inaudible] in the background to give more pop to your subject.

All right let's take a look at just another image to see just how easy that is because they're using the exact same pipeline under the hood which is very simple just using the portrait matte and using it to blur the foreground and background.

Again our input image here, we can change the size of it, make it bigger, then apply a bit of contrast to give more focus to our subject.

It's that easy, really exciting.

[ Applause ]

All right let's take a look at another demo which we call Marching.

I'm not even going to try to explain what it does, let's just take a look at the filter in action.

There you go, fun stuff and just because we can do it I can expose how many of these I want to stitch together.

So going from just a few to really, really pushing that way, way too far.

Really exciting stuff.

All right.

So that's it for this demo, I hope you enjoyed this.

So if you'd like to know more about using Python bindings for Core Image I encourage you to come to this afternoon's session, Core Image: Performance, Prototyping, and Python.

That's all for me today, let me introduce you to my colleague Ron from video engineering who will be talking about real-time video effects with TrueDepth.

Thank you everybody.

[ Applause ]

Thank you Emmanuel.

Great photo effects, but what about video?

My name is Ron Sokolovsky and I am from video engineering.

In this part we are going to leverage the TrueDepth camera to create similar effects with real-time video, like for example this background replacement app.

In order to create such effects we are going to deep dive into the stream coming from the TrueDepth camera, the characteristics, best practices, and challenges.

We are also going to show you how to work with point clouds, a completely different way to process and render rich depth information.

And that background replacement app we're calling it Backdrop and we'll show you how to make it step-by-step.

But first things first, the stream from the TrueDepth camera is made of frames; each frame is a depth map, a 2-D image in which each pixel contains the depth information or the distance to the scene in that direction.

We've chosen a specific coloring scheme: close pixels are colored in red while far-away pixels are colored in blue.

In between them there is a colorful spectrum so you can see the texture of the depth map.

There are also black pixels, those are holes in a depth map.

For those pixels we have no information about the depth.

We are releasing today a new tool, a sample app for you to explore this stream and we call it TrueDepth Streamer.

You can slide between the video stream and the TrueDepth stream.

Now because the TrueDepth camera has active illumination even in complete darkness while the video is pitch black it is business as usual for the TrueDepth camera.

So now you see me and now you don't.

[ Applause ]

So how do you add the stream from the TrueDepth camera into your application?

Well I'm glad you asked.

The first thing you need to do is to discover the built-in TrueDepth camera and then you initialize the device capture input.

And you add the depth data output into your session.

At this point you're good to go, you can start the session and you will have the TrueDepth stream with your session.

This stream can come in two forms of data, disparity or depth.

Now disparity is the inverse of depth and vice versa, so which one should you choose?

Well disparity usually yields better results, especially for machine learning applications but the depth data has more meaning in terms of real-world measurements.

Note that if you work with depth, the depth error grows with the depth squared.

That means that an object at 1 meter would have four times the depth accuracy as an object at 2 meters.
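
The relationship can be checked with a few lines of Python. This is a first-order sketch, assuming disparity is measured in 1/meters and that the disparity error is the same at every distance; the function names and the error value used below are hypothetical.

```python
def depth_from_disparity(disparity):
    """Depth in meters from disparity in 1/meters (depth = 1 / disparity)."""
    return 1.0 / disparity


def depth_error(depth, disparity_error):
    """First-order depth error for a fixed disparity error.

    Since z = 1 / d, a small change dd in disparity produces a change
    of about z**2 * dd in depth.
    """
    return depth ** 2 * disparity_error


# Hypothetical disparity error of 0.01 (1/m), compared at 1 m and 2 m.
error_at_1m = depth_error(1.0, 0.01)
error_at_2m = depth_error(2.0, 0.01)
```

With these numbers the error at 2 meters is four times the error at 1 meter, which matches the statement above.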

We have two streams, video and depth, and they don't necessarily share the same resolution.

The native resolution of the TrueDepth camera is VGA or 640x480 and that's what you'll get if you choose a video preset with an aspect ratio of 4:3.

If however you choose an aspect ratio of 16:9 you'll get a depth map of 640x360.

In both cases the depth map will cover the entire field of view of the RGB image.

Now we are talking about video applications, so we are crunching a lot of numbers very, very fast and that could create system pressure over time.

So you can test your application and gauge the system pressure level which goes from nominal to fair, serious, critical, and then shutdown.

And the responsibility is in your hands because the system will let you go all the way to shutdown but when it does it's bye-bye every capture device.

Another thing you can do is to adopt a degradation scheme, if the pressure level gets serious you can reduce the frame rate to 15 frames per second or you can choose a more elaborate scheme with gentle degradation going from 30, 24, 20 and 15 frames per second anytime the pressure level increases.
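
One possible shape for the more elaborate scheme is a simple lookup from pressure level to frame rate. The pairing of each level with a specific rate is an assumption made for this sketch (the talk only lists the levels and the rates), and the names below are hypothetical, not AVFoundation API.

```python
PRESSURE_LEVELS = ["nominal", "fair", "serious", "critical", "shutdown"]

# Assumed mapping: one step down the 30/24/20/15 ladder per level.
FRAME_RATE_FOR_LEVEL = {
    "nominal": 30,
    "fair": 24,
    "serious": 20,
    "critical": 15,
}


def target_frame_rate(level):
    """Return the frame rate to request, or None once capture has shut down."""
    if level not in PRESSURE_LEVELS:
        raise ValueError(f"unknown pressure level: {level}")
    return FRAME_RATE_FOR_LEVEL.get(level)
```

In an app, a function like this would be called whenever the device reports a change in system pressure, and the result applied to the capture device's frame duration.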

So we have holes in the depth map what can we do about it?

Well, in fact you could get the stream already filtered for you.

There is a parameter called isFilteringEnabled and it defaults to true, which means you get a filtered depth map that is smoothed spatially and temporally, and the holes are filled from the RGB image.

This is especially useful for photography and segmentation applications because you know every time you query a pixel you get a depth value.

In TrueDepth Streamer you can switch to the filtered stream and see that it is smoother and the holes are filled.

So this is great, but it is not applicable to 100% of the use cases.

If you're working with point clouds or any type of real-world measurements you're better off staying with the raw data which holds the highest fidelity.

If you do you will have holes, you will have pixels marked as zero, it does not mean that they are the distance of zero meters from the camera it just means we have no information about them.

Therefore, you should watch out for operations like averaging and downsampling because you don't want to mix those real values with those zeros.
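
A hole-aware downsample might look like the following sketch. It assumes holes are stored as 0.0 in a nested list of depth values; the function name and the choice to leave an all-hole block as a hole are assumptions for illustration.

```python
def downsample_ignoring_holes(depth, factor=2):
    """Average each factor x factor block, skipping zero-valued holes.

    A block that contains only holes stays a hole (0.0).
    """
    out = []
    for y in range(0, len(depth) - factor + 1, factor):
        row = []
        for x in range(0, len(depth[0]) - factor + 1, factor):
            # Collect only the pixels that carry a real measurement.
            valid = [depth[y + dy][x + dx]
                     for dy in range(factor)
                     for dx in range(factor)
                     if depth[y + dy][x + dx] > 0.0]
            row.append(sum(valid) / len(valid) if valid else 0.0)
        out.append(row)
    return out
```

For a block holding three pixels at 1.0 meter and one hole, a naive average would report 0.75 meters, a distance that was never measured, while this version reports 1.0.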

But why do we even get holes?

Well the TrueDepth camera detects objects up to a distance of about 5 meters, but not all materials are made the same.

Some materials have low reflectivity; they absorb most of the light.

For example, this extreme scenario is a very low-reflectivity fabric; watch what happens when we switch to the depth map and I walk away from it.

Even though there are objects in the scene with larger distance we see holes forming on this fabric because it's absorbing most of the light.

If we switch to the filtered stream and repeat the same motion those holes are filled.

But it's not only about the amount of light reflected back, it's also about the direction in which it is reflected.

Some materials are specular or shiny and they are very picky and choosy in which direction they send back the light.

An extreme scenario would be this display, you can watch the video stream to see the reflection.

And when we switch to the depth map holes are forming depending on the angle between the device and the screen.

And if we switch to the filtered stream those holes are filled but with less fidelity.

Another challenging scenario is outdoor, typically in an outdoor scene the background is very far away so we don't expect to get any depth on the background.

Also, the sun acts as an aggressor to the active illumination.

To demonstrate that I went outside on a very sunny afternoon, positioned myself against the sun.

And when we switch to the depth map you can see there's no depth on the background and we get some holes around the frames, specifically in this case the hair.

But still most of the depth on the foreground is intact and very useful.

One final point I want to cover for getting holes is the fact that from the perspective of the TrueDepth camera some of the light projected hits an object on the way back so we get shadows from the parallax between the projector and the camera.

You can see an example on the right side of this mug, but something is different here.

This is not a depth map so why do we even get holes?

Well in TrueDepth Streamer you can switch from 2-D mode into 3-D mode, which gives us a point cloud view.

With point clouds we can dynamically change the perspective of the scene creating even more holes when we do so.

And now I can ask you, is this mug half empty or half full, and the answer is we have no idea.

By virtually changing the point of view of the camera we don't add new information, and that is because the TrueDepth camera can do many things, but bending light into the mug is not one of them.

Let's see this live.

[ Applause ]

So I'm starting with a video view and I want you to look in this corner.

Watch what happens when I touch the screen, I will touch my forehead, you can see an indication of the depth.

And if I move the phone you can see the [inaudible] changing and the reason I can do so is because we have the stream from the TrueDepth camera running as well and you can see that it is overlaid on the video.

So we have this live stream at 30 frames per second and we can switch to the filtered stream and then all the holes are filled.

If I switch to the point cloud view I can dynamically change the point of view of the scene.

So even though I'm looking directly to the device it looks as if somebody's watching me from up above.

Now the reason we call this a point cloud is if I zoom in you can actually see the points in 3-D space.

But being here with you in WWDC I feel like I have to pinch myself just to get things back in perspective.

[ Applause ]

Thank you so that brings us to point clouds.

How do we create them?

We're starting from a depth map, a 2-D image in which the depth Z is a function of the pixel coordinates U and V.

And we want to transform it to a new coordinate system in 3-D space, X, Y, Z.

Now we already have Z right, that's the depth from the depth map but we want to get X and Y.

For that we need to get help from the Intrinsics Matrix which holds information for the focal lengths and principal point.

If for example I want to get X I need to start with the pixel coordinate U, subtract the principal point, multiply by the depth, and divide by the focal length.

And naturally I have to do the same thing for the other dimension as well.
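
Written out as code, the transformation just described is a standard pinhole back-projection. In this minimal sketch the function name is hypothetical, `fx` and `fy` are the focal lengths in pixels, and `cx` and `cy` are the principal point, all assumed to be expressed at the resolution of the depth map.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth into 3-D space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 depth map.
point = unproject(u=420, v=240, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

If the intrinsics are reported for a different image size than the depth map, the focal lengths and principal point would need to be scaled by the ratio of the two sizes before use.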

Now this Intrinsics Matrix is accessible through the camera calibration data.

In fact, this operation is done in every frame of the TrueDepth stream.

The reason for that is that the video stream and the depth stream are coming from two separate cameras.

But because the TrueDepth camera gives us a depth map we can transform it into a point cloud and re-project it to the perspective of the RGB image, so the depth stream is already registered on the video stream for you and you get RGBD data.

Now, thank you.

Yeah, it's pretty cool.

Now these types of operations are best done in Metal graphics shaders.

And you can download the code for TrueDepth Streamer and you want to focus on two areas.

In the vertex shader we control the location of the points, we'll start with the depth map and transform it to real-world coordinates or X, Y, Z.

Then we can multiply it with a view matrix to change the point of view of the scene.

In the fragment shader we get the output of the vertex, but we have to see if it's a real value or a hole in the depth map.

If it's a hole and it's marked as zero we don't know its depth, so we cannot transform it to X, Y, Z and we would need to discard this point.

If it is a real value we can sample the RGB texture and add color to the fragment or point in this case.
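
The logic of the two shader stages can be mimicked on the CPU in a few lines. This is a simplified Python stand-in, not the Metal code from TrueDepth Streamer: the names are hypothetical, the view transform is reduced to a 3x3 rotation (a real view matrix would also carry translation and projection), and colors are whatever values sit in a same-sized grid.

```python
import math


def rotation_about_x(angle_radians):
    """3x3 rotation matrix around the X axis."""
    c, s = math.cos(angle_radians), math.sin(angle_radians)
    return [[1.0, 0.0, 0.0],
            [0.0, c, -s],
            [0.0, s, c]]


def render_points(depth_map, colors, fx, fy, cx, cy, view):
    """Return (position, color) pairs for every non-hole pixel."""
    points = []
    for v, row in enumerate(depth_map):
        for u, z in enumerate(row):
            if z <= 0.0:
                # Hole: no depth, so there is no 3-D position to draw.
                continue
            # "Vertex" step: pixel plus depth to X, Y, Z, then apply the view.
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            rotated = tuple(sum(view[r][k] * p[k] for k in range(3))
                            for r in range(3))
            # "Fragment" step: attach the color sampled at the same pixel.
            points.append((rotated, colors[v][u]))
    return points
```

Changing the angle passed to `rotation_about_x` plays the role of dragging the point cloud view: the same points are kept, only their positions relative to the virtual camera change.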

So I understand this part was a bit technical and a lot of you come from different backgrounds.

Have no fear we have just the app for you, an app to replace your background, let's see it live.

So I can put myself in Yosemite, I can swipe down put myself in something more abstract.

I can even go all the way to Antelope Canyon, Arizona, it took me 15 hours to get there last time, I could have just swiped down, saved a lot of money on gas.

In fact, this application can even put you in space where nobody can hear you stream.

[ Applause ]

So how do we create that?

Anytime we're dealing with a video application there's other things that are going on a per frame basis, in this case we have to detect a face, create a brand-new mask from the depth map, smooth and upscale it to the RGB resolution.

And then we take this foreground mask and upscale it again to the loaded background image.

And then we can blend or [inaudible] them, but there's something we can do to reduce some of the complexity.

If anytime we load a background image we resize it to the RGB resolution just once, not per-frame, then we don't need that second upscale and the blending is done at low resolution which makes a big difference.

So let's deep dive into those steps.

The first thing we need is to find the center of the face.

And in iOS there are actually quite a few ways you can get face metadata.

You can use a Core Image detector or the Vision Framework, but in this case since we just need the pixel at the center of the face we can use AVMetadataObject.ObjectType.face.

But it gives us the center in the coordinate system of the RGB image and we need to map it to the depth map which might not be in the same resolution.

Once we have the value of the depth of the face we can use it plus a margin of characteristic 25 centimeters to threshold the depth map and create a binary mask, foreground is one, background is zero.
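
These two steps, mapping the face center into depth-map coordinates and thresholding around the face depth, can be sketched as follows. The function names are hypothetical, depth values are assumed to be in meters, and treating holes (zeros) as background is a choice made for this sketch.

```python
def map_to_depth_coords(x, y, rgb_size, depth_size):
    """Scale a pixel position from RGB resolution to depth resolution.

    rgb_size and depth_size are (width, height) tuples.
    """
    dx = min(int(x * depth_size[0] / rgb_size[0]), depth_size[0] - 1)
    dy = min(int(y * depth_size[1] / rgb_size[1]), depth_size[1] - 1)
    return dx, dy


def foreground_mask(depth_map, face_depth, margin=0.25):
    """Binary mask: 1.0 for valid pixels no farther than face depth + margin."""
    cutoff = face_depth + margin
    return [[1.0 if 0.0 < z <= cutoff else 0.0 for z in row]
            for row in depth_map]
```

For example, a face centered at (960, 540) in a 1920x1080 frame maps to (320, 180) in a 640x360 depth map, and the depth read at that pixel becomes `face_depth`.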

In fact, we can stop here, we can use this binary mask and create the effect.

The transition from background to foreground will be very sharp, but we'll get some fidgeting around the edges.

So we want to filter it a bit.

The first stage will be to apply some smoothing from the background to the foreground, in this case Gaussian Blurring.

The radius of Gaussian Blurring will determine the slope of the transition and you can play with the value to get different effects.

Another processing stage we add is gamma adjustments, it allows us to further fine-tune this transition from background to foreground.

If we use a gamma value which is higher than one we'll get a narrower foreground mask.

On the other hand, if we use a gamma value that is smaller than one we'll get a wider foreground mask and maybe some aura.
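
The effect of the gamma value is easy to see numerically, because a gamma adjustment amounts to raising each mask value to a power. In this small sketch the function name is hypothetical and the mask is a nested list of values in [0, 1].

```python
def apply_gamma(mask, gamma):
    """Raise every mask value in [0, 1] to the power gamma."""
    return [[value ** gamma for value in row] for row in mask]


# A soft edge going from background (0.0) to foreground (1.0).
edge = [[0.0, 0.25, 0.5, 0.75, 1.0]]
narrower = apply_gamma(edge, 2.0)  # values in between are pushed toward 0
wider = apply_gamma(edge, 0.5)     # values in between are pushed toward 1
```

The end points 0.0 and 1.0 never move, so only the transition zone produced by the blur is reshaped.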

So you can create different effects by combining those two parameters.

If you use a large blur radius and a large gamma value you create this transparent transition that makes you seem as if you're a hologram in space, or similarly it could be underwater, and you can play with the values to create different effects.

If I keep the radius high and reduce the gamma value to a very low number I create this halo around my head.

So you can play with this to create your own effects.

How do we implement this?

In Core Image it is very straightforward, we can concatenate three filters in a row.

We start with a Gaussian Blur, we add the gamma adjustment, and we upscale to the RGB resolution.

But there are a couple of small points I want to emphasize as best practices.

Anytime you work with a convolutional based operation such as Gaussian Blurring the best practice will be to start by clamping to extent.

By repeating the border pixels outwards we can make sure all the borders of the image are handled correctly by the filter.

Moreover, after the filtering and just before the upscaling the best practice will be to crop back to the original extent because that's the part of the image we really care about.
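
To show why clamping matters, here is a small sketch of the clamp, blur, crop sequence. It is not Core Image code: the function names are hypothetical and a box blur stands in for the Gaussian blur. With `clamp=False` the image is padded with zeros instead, which mimics blurring without clamping first.

```python
def pad(mask, radius, clamp):
    """Extend the mask by `radius` pixels on every side.

    clamp=True repeats the border pixels outward; clamp=False pads with 0.0.
    """
    h, w = len(mask), len(mask[0])
    padded = []
    for y in range(-radius, h + radius):
        row = []
        for x in range(-radius, w + radius):
            if clamp:
                row.append(mask[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)])
            elif 0 <= y < h and 0 <= x < w:
                row.append(mask[y][x])
            else:
                row.append(0.0)
        padded.append(row)
    return padded


def blur_and_crop(mask, radius, clamp=True):
    """Box-blur the padded mask and keep only the original extent."""
    padded = pad(mask, radius, clamp)
    size = 2 * radius + 1
    out = []
    for y in range(len(mask)):
        row = []
        for x in range(len(mask[0])):
            window = [padded[y + dy][x + dx]
                      for dy in range(size) for dx in range(size)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out
```

On a mask that is entirely foreground, the clamped version stays at 1.0 everywhere, while the zero-padded version fades toward the borders even though nothing in the scene changed there.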

At this point we have an alpha matte of the foreground and you can use it to create different kinds of effects for the background and the foreground just like Emmanuel showed in the first half.

In Backdrop, we blend the RGB stream with a loaded background image in a single line of Core Image code using the alpha matte we created from the TrueDepth camera to create this background replacement effect.

So the TrueDepth camera gives us a depth map at a resolution of 640x480 coming at you at 30 frames a second, already registered to the video stream.

You can use it to create point clouds and dynamically change the perspective of the scene or use the filtered depth to create different kinds of video effects.

You can go to the webpage and download the Jupyter Notebook, TrueDepth Streamer, and Backdrop, and we hope they inspire you as a starting point for many new cool video effects to create in your applications.

Come say hi at the AVCapture lab, thank you so much for your time.

Have a great day.

[ Applause ]
