Welcome to Session 507.
I'm Brad Ford.
I'm from the Camera Software Team, and I'm very excited to share some deep thoughts with you this afternoon.
Did you see what I did there?
All right [applause].
This session is part one of a two-part series on a very important initiative for Apple this year, and that is media containing depth information.
I'll introduce depth at the conceptual level, I'll familiarize you with key terms, and I'll teach you how to capture depth data on the iPhone, much like this.
You'll see a lot of ghostly images in this session.
Here's the agenda.
First we're going to cover depth and disparity on iPhone 7 Plus at a high level.
Then we'll move on to streaming depth data from the camera, capturing photos with depth data, and finally we'll end with a slight tangent, which is dual photo capture.
It is the most highly requested feature we've had on dual camera, and I'm very excited to talk about it.
Your job is to listen for all the truly horrible depth puns that I have sprinkled throughout this session, and let's make a game of it.
Okay? Every time you hear one, just give me a nice big groan to let me know that you care.
Here, let's practice.
Everybody ready for a deep dive?
[ Group groaning ]
Thank you, from the bottom of my heart.
[ Group groaning ]
Good. The reason you're all here today is this guy right here.
This is the iPhone 7 Plus.
The product, of course, has sold exceptionally well, even better than its plus size predecessor, and that's thanks in large part to the quality of the dual camera system.
It is a dual prime lens system consisting of a 28-millimeter equivalent wide-angle camera, and a 56-millimeter equivalent telephoto camera.
Both of them are 12 megapixels.
They share the same feature set, the same formats.
You can run either of these cameras on its own, or you can address them in tandem using a third virtual camera, the first time we've ever delivered one on iOS, and it's called the dual camera.
It runs them in a synchronized fashion, the same frame rate, and running them together enables two marquee features.
The first is dual camera zoom.
This switches between the wide and the tele automatically as you zoom.
It matches exposure, focus and frame rate so that it's kind of magical.
You don't even realize that we're switching cameras, but all of this happens very seamlessly.
We also are compensating for the parallax shift to make it a smooth transition as you go back and forth between the wide and the tele.
And the second marquee feature is, of course, the Portrait mode, where the dual camera system locks into the tele camera's narrower field of view, but then uses images from both the wide and the tele to generate a beautiful shallow depth of field effect that you'd expect from a much more expensive camera with a fast, wide-open lens.
The foreground is sharply in focus, while the background is progressively blurred in these pleasing little bokeh circles.
The depth effect has gotten even better in iOS 11.
We've made improvements to the rendering of the out-of-focus area.
It more accurately represents a wide-open fast lens with sharp and well-defined bokeh circles.
We've also improved how the rendering handles the edges between the foreground and the background.
Please check it out if you haven't yet.
I think you'll be pleasantly surprised at how great the quality of the shallow depth of field effect is in iOS 11.
To generate an effect like this you need to be able to separate foreground from background.
In other words, you need depth.
And up to now that depth information has been exclusive to the Apple camera app's Portrait mode, but now new in iOS 11 we are opening up depth maps to third party apps.
Here's a gray scale visualization of the depth map that was embedded in this image file.
Having depth information opens up a world of possibilities for image editing, such as applying different filters to the background and the foreground, like this.
I've applied a noir black and white filter to the background and the fade filter to the foreground.
And notice how the little girl's tights are still pink, but everything behind them is black and white.
Knowing the gradations of depth, I can get even fancier and I can move the switch-over point forward or backward, like this.
Keep your eyes on the flower.
So now notice that just her hand and her flower are in color, while everything else is in black and white.
You can even control foreground and background exposures differently, like this.
So now she looks like she was photoshopped into her very own photo.
I'm not saying you should do it.
I'm saying you could do it.
Let's get technical.
I like to call this section deep learning [group groaning].
First we need to define what a depth map is.
In the real world depth means the distance between you and an observed object.
A depth map is a transformation of a three-dimensional scene into a two-dimensional representation, and you do that by setting the depth to a constant distance.
Let me explain what I mean.
I'm going to use a diagram of a pinhole camera often during this presentation.
If you've studied computer vision, you'll be really familiar with pinhole cameras.
A pinhole camera is a simple lightproof box without a lens.
Instead, it just has a little poked hole, a single small aperture that permits light to enter and project itself as an inverted image on the opposite side of the box.
That opposite side is known as the image plane, or sensor.
The aperture through which the light rays pass is called the focal point, and the field of view of the image captured depends on the focal length.
So the focal length is the distance from the focal point to the image plane.
A shorter focal length means wider field of view; whereas, longer focal length, longer box, means narrower field of view.
The focal length is that constant distance by which real world distances are flattened into a 2D image.
Put simply, a depth map is a transformation of a 3D depth into a 2D, single channel image where each pixel value is a different depth, like five meters, four meters, three meters.
Now, to truly measure depth you need a purpose-built camera for this, something like a time-of-flight camera.
For instance, a system that bounces light signals off of objects and then measures the time that it takes to return back to the sensor.
The iPhone 7 Plus dual camera is not a time-of-flight camera.
Instead, it is a disparity-based system.
Disparity is a measure of the magnitude of shift of an object when observed from two different cameras, like your eyeballs.
Disparity is another name for parallax.
You can observe this effect by holding your head steady and fixing your gaze on something close, and then without moving your head, close one eye and then the other eye.
So, for instance, this would be left eye, right eye; left eye, right eye.
And you can see the colored pencils appear to shift a lot more than the markers in the back because they are closer.
That's the parallax effect, or disparity.
Now back to our pinhole camera model.
Now I've taken a bird's eye view of two cameras that are said to be stereo rectified.
That means, one, that they are parallel to one another, they're pointing in the same direction, and two, they have the same focal length, which is very important.
That's the distance from the focal point to the image plane or sensor.
Each camera will have a measured optical center or a principal point, and if you draw a perpendicular line from the pinhole to the image plane, then the optical center is the point at which it intersects with the image plane.
Now, there's another term that you should be familiar with and that is baseline.
Baseline refers to the distance between the two optical centers of the lenses in a stereo-rectified system.
Here's how it works.
Rays of light from an observed object pass through the apertures and land at different points on the image planes of the two cameras.
A fourth term that I'm going to throw at you right now is Z.
Z is the canonical term for depth, or real-world depth.
Now, watch what happens to the points on the image plane as the observed point gets farther away.
They moved closer together.
I'm going to show that to you one more time.
So as the real point gets farther away, they get closer together on the image plane, and as the object gets closer, the dots move farther away from each other.
So when the cameras are stereo rectified, these shifts only move in one direction.
They either move closer or farther away from one another, but on the same line, or the epipolar line.
Now, knowing the baseline you can essentially line up the cameras along their optical centers like this and subtract the distance between the observed points on the image planes to get the disparity.
That's what disparity is.
You can express this distance in whatever units make sense for your processing.
It could be pixels, meters, microns.
And it's common to store it in pixels since we think of RGB images in pixels.
Now, storing pixel shifts works fine, as long as the image that they accompany never changes size.
It's not so good if you're going to edit that image because if you've scaled the image down, you've now effectively changed the pixel size.
So you have to go through the map and you have to scale each value in the depth map.
That's a very brittle representation.
Instead, we at Apple have chosen to express disparity using normalized values that are resilient to scaling operations.
So here's how we do that.
Again, going to our observed point, you'll notice that there are two similar triangles being formed.
I'll highlight them for you.
These triangles have equal ratios of sides and proportions.
Now, if I get rid of the cameras to just show you the triangles, the real-world triangle sides are Z, in meters, and baseline, the distance between the two optical centers.
Inside the light box, or the lightproof box, that same triangle is represented as the focal length in pixels and the disparity in pixels.
Do you feel math coming on?
I feel math coming on.
So stay with me here.
This is pretty painless.
Baseline is to Z as pixel disparity is to focal length.
Okay. Well, what if we divide both sides by the baseline so the b's cancel out on the left, and what you're left with is 1 over z.
That's pretty nice.
1 over z is inverse depth.
That is literally what disparity means.
When an object moves farther away, the disparity shrinks.
When it moves closer, the disparity grows.
So it is the inverse of depth.
What remains on the right is what we call normalized disparity.
So it's not a pixel shift anymore, it's d divided by the product of focal length and baseline.
The baseline is baked in so you don't need to carry that information with you separately when you're dealing with the depth map.
The units are 1 over meters, just as it's 1 over z, and it withstands scaling operations, and as you can see, converting from depth to disparity is trivial, since it's just a 1-over operation.
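Written out, the similar-triangles relationship is b / Z = d / f, which rearranges to 1 / Z = d / (f * b). Here is a minimal Swift sketch of that arithmetic. The function names and the sample focal length, baseline and pixel shift are illustrative assumptions, not values from any real device.

```swift
// Illustrative helpers (hypothetical names), not part of AVFoundation.

/// Normalized disparity in 1/meters: d / (f * b).
func normalizedDisparity(pixelDisparity: Double,
                         focalLengthInPixels: Double,
                         baselineInMeters: Double) -> Double {
    return pixelDisparity / (focalLengthInPixels * baselineInMeters)
}

/// Depth in meters is the inverse of normalized disparity.
func depth(fromNormalizedDisparity disparity: Double) -> Double {
    return 1.0 / disparity
}

// Assumed sample values: 2800 px focal length, 1 cm baseline, 28 px shift.
let disparity = normalizedDisparity(pixelDisparity: 28,
                                    focalLengthInPixels: 2800,
                                    baselineInMeters: 0.01)
let meters = depth(fromNormalizedDisparity: disparity)

// Scaling the image to half size halves both the pixel shift and the
// pixel focal length, so the normalized value does not change.
let halfSize = normalizedDisparity(pixelDisparity: 14,
                                   focalLengthInPixels: 1400,
                                   baselineInMeters: 0.01)
print(disparity, meters, halfSize)
```

With these sample numbers the disparity works out to about 1 per meter, so the depth is about 1 meter, and the half-size image gives the same normalized value.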
Is anyone feeling way beyond their depth at this point?
[ Group groaning ]
This is a little tricky stuff, but the takeaways are simple.
We have a disparity-based system, not a true time-of-flight camera, but disparity is a great proxy for depth, and normalized disparity is the inverse of depth.
Hey, speaking of normalized disparity as being the inverse of depth, here's a deep thought.
This image has a disparity map, so I guess this would make this a depth-defying leap.
[ Group groaning ]
So in our depth API set we use the term depth data, and this is a generic term for anything that's depthy.
It can refer to either a true depth map or a disparity map.
Both are related to depth, they're both depthy, so they are both depth data.
And we have a purpose-built object for this.
The canonical representation on our platform for depth is called an AVDepthData.
It's available on iOS, macOS and tvOS.
It's a class in the AVFoundation framework and it represents either depth or disparity maps.
It also provides some nice facilities to convert between depth and disparity.
Okay. Let's get into the nuts and bolts of depth maps.
Depth maps are images, if you haven't figured out by now.
They're kind of like RGB images, except they're single channel, but they can still be expressed as CVPixelBuffers, and now CoreVideo defines four new pixel formats for the types that we saw on the previous slide.
They're all floating point.
The first two are for normalized disparity and it's measured in 1 over meters.
Notice that there's a 16-bit flavor and a 32-bit flavor.
The second two are for depth and they're measured in meters.
They also come in 16- or 32-bit flavors.
Why would we do this?
Well, if you're going to be working with depth on the GPU, it would make sense for you to request 16-bit or half float values of depth.
If you'll be working on the CPU, you should work with the full 32-bit float variants.
They'll work better.
We'll talk later about where an AVDepthData object might come from, but for right now let's just focus on its core properties.
Given an AVDepthData object you can query its depthDataType, which is one of those four pixel formats; you can get access to the depthDataMap itself which, again, is a CVPixelBuffer; you can iterate through it by row and column using standard CVPixelBuffer APIs.
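As a rough sketch of querying those properties in Swift: the function name inspect is hypothetical, and choosing 32-bit disparity as the working format is an assumption made for CPU-side reading.

```swift
import AVFoundation

/// Logs the core properties of a depth data object and returns a
/// 32-bit disparity version that is convenient to read on the CPU.
/// (Hypothetical helper for illustration.)
func inspect(_ depthData: AVDepthData) -> AVDepthData {
    // depthDataType is one of the four CoreVideo depth/disparity formats.
    let type = depthData.depthDataType
    let isDisparity = (type == kCVPixelFormatType_DisparityFloat16 ||
                       type == kCVPixelFormatType_DisparityFloat32)

    // The map itself is a single-channel CVPixelBuffer.
    let map: CVPixelBuffer = depthData.depthDataMap
    let width = CVPixelBufferGetWidth(map)
    let height = CVPixelBufferGetHeight(map)
    print("disparity: \(isDisparity), size: \(width)x\(height)")

    // The two properties tied to the capture problems discussed next.
    print("filtered: \(depthData.isDepthDataFiltered)")
    print("absolute accuracy: \(depthData.depthDataAccuracy == .absolute)")

    // Convert only when the map is not already 32-bit disparity.
    if type == kCVPixelFormatType_DisparityFloat32 { return depthData }
    return depthData.converting(toDepthDataType: kCVPixelFormatType_DisparityFloat32)
}
```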
And the final two properties I want to highlight here have to do with inherent problems in capturing depth data, and we're going to go through these problems one at a time and discuss the solutions.
The first problem is holes, holes in the depth data.
To calculate disparity both cameras need to observe that same point, but from two different perspectives.
If they can't see it, no disparity.
So why might they not be able to see it?
For one, occlusions, such as a creepy finger coming in and suddenly blocking one of your cameras.
If it's partially or fully obscuring one camera, you don't have two points of view anymore, therefore, you have no disparity.
Another more common reason is difficulty in finding features.
When camera one and camera two's images are compared, remember, they line them up by optical center and look for features matching key points.
Let's say it's dark out and the observed point may not have very well-defined features anymore, the color is a little bit noisy, the edges are hard to find.
Another example would be if you point the cameras at a flat, white wall with no texture to it, there are no features so it's very hard to find differences in matching.
For any of these reasons you might have areas in your image where there is no disparity, and those are called holes.
Holes are expressed in the depthDataMap as NaN (not a number), a standard floating-point representation, either 16-bit or 32-bit.
Depth maps may also be processed to fill in the holes.
We can do this by interpolating based on surrounding depth data that's good or by using metadata present in the RGB image.
The isDepthDataFiltered property of AVDepthData tells you whether the map has been processed in this way.
If you receive an unfiltered AVDepthData, you can expect to find NaN values within that map.
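One way to look for holes is to walk the map and test each value for NaN. This is a minimal sketch: the function name holeCount is hypothetical, and converting to 32-bit disparity first is an assumption made so that each pixel can be read as a Swift Float.

```swift
import AVFoundation

/// Counts the holes (NaN values) in a disparity or depth map.
/// (Hypothetical helper for illustration.)
func holeCount(in depthData: AVDepthData) -> Int {
    // Work with 32-bit floats so each pixel can be read as a Float32.
    let converted = depthData.converting(toDepthDataType: kCVPixelFormatType_DisparityFloat32)
    let map = converted.depthDataMap

    CVPixelBufferLockBaseAddress(map, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(map, .readOnly) }

    guard let base = CVPixelBufferGetBaseAddress(map) else { return 0 }
    let width = CVPixelBufferGetWidth(map)
    let height = CVPixelBufferGetHeight(map)
    let bytesPerRow = CVPixelBufferGetBytesPerRow(map)

    var holes = 0
    for y in 0..<height {
        // Rows may be padded, so step by bytesPerRow rather than by width.
        let row = (base + y * bytesPerRow).assumingMemoryBound(to: Float32.self)
        for x in 0..<width where row[x].isNaN {
            holes += 1
        }
    }
    return holes
}
```

A filtered map would be expected to return zero here, while an unfiltered one usually would not.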
Okay. We'll talk a little bit more about how you can request filtering later on.
The second problem that interferes with accurate disparity generation is calibration error.
There are lots of different kinds of calibration errors that can happen that we can correct, but there's one that we can't, and that is incorrect accounting of the optical center in either of the two cameras.
So for this one I've rotated our pinhole cameras by 90 degrees and moved them down to the bottom to give myself a little more room at the top.
In an ideal stereo-rectified system, perspective only shifts in one direction, left or right, along these same lines.
So if there's a ray that's observed from camera one, it would be viewed as a series of intersecting points on a line from camera two, like this.
So for disparities to be measured accurately you must have an accurate baseline.
And baseline, again, is the distance between the two optical centers.
If you don't have an accurate baseline, you can't align those two cameras' optical centers and you can't figure out how much disparity there is.
Now, what happens if the optical center is calculated wrong or just misreported?
Let's say the true optical center is here, but for some reason it's misreported as being here.
Now suddenly all of our disparity points on camera two's image plane are shifted to the left by the same fixed amount.
Now all the objects will be reported as being farther than they truly are.
If the error were in the other direction, then the objects would be misreported as being too close.
So we can detect and fix a lot of problems, but this one we can't detect and fix because, again, all of those points still look like they're on the same correct line.
We don't know the difference between the baseline being wrong and the person actually moving further or closer.
Now, how can this happen; why would there be problems with optical center calculation?
iPhone cameras don't use pinholes, they have lenses, and on iPhones those lenses don't stay still.
If OIS is engaged, then the lens may be moving laterally to counteract hand shake.
Gravity can come into play because it can cause the lenses to sag.
The focus actuators are actually springs to which an electrical current is applied.
So all of these reasons might cause it to move around laterally a little bit, and these very small errors in optical center position can result in large errors in disparity.
When this occurs, the result is a constant amount of error in every pixel in the map.
The disparity values are still usable relative to one another, but they no longer reflect real-world distances.
For this reason AVDepthData objects have to have a concept of accuracy.
An accuracy value of absolute would mean the units do reflect real-world distances, there's no calibration problem.
Relative accuracy means that the Z ordering is still preserved, but the real-world scale has been lost.
Depth data captured from, say, a third-party camera can be reported as either absolute or relative, but iPhone 7 Plus always reports relative accuracy due to the calibration errors that I just mentioned.
But I don't want you to be frightened by that.
Relative accuracy is not bad accuracy.
Dual camera depth is still totally usable, and let me show you how.
Awesome, formulas on slides.
Okay. Here comes a bit of math again.
Let's say we've got a relative accuracy disparity value on the left, which is the d with the little dunce cap over it because it's bad, and that's equivalent to an absolute disparity d plus a fixed amount of error.
We don't know what the fixed amount of error is, but it's there.
Now, let's take a common operation such as finding the difference between two disparities in the same map, it's like subtracting the differences.
So let's say the equation looks like this.
You're subtracting two bad disparities, and that's the same as subtracting two good disparities that each carry the same fixed error.
If we reorder things, we find that actually we can get rid of the errors because they cancel each other out and we're left with a very happy coincidence here.
This happy discovery is that the differences are the same, whether your disparity is perfect or your disparity is relative.
This formula kind of proves that relative is just as good as absolute if you're creating effects that only rely on, say, differences within the same map.
And that's why the effects produced from relative accuracy depth still look fantastic.
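In symbols, if each reported value is dHat = d + e for some unknown constant e, then dHat1 - dHat2 = (d1 + e) - (d2 + e) = d1 - d2. A small Swift sketch with made-up numbers shows the same thing; all of the values below are assumptions chosen only to make the arithmetic easy to follow.

```swift
// Illustrative values only: two "true" normalized disparities (1/m)
// and an unknown constant calibration offset.
let trueNear = 2.0
let trueFar = 0.5
let calibrationError = 0.25

// What a relative-accuracy map would report.
let reportedNear = trueNear + calibrationError
let reportedFar = trueFar + calibrationError

// Differences within the same map are unaffected by the offset.
let trueDifference = trueNear - trueFar
let reportedDifference = reportedNear - reportedFar

// Absolute depth, however, is no longer trustworthy.
let trueDepth = 1.0 / trueNear
let reportedDepth = 1.0 / reportedNear
print(trueDifference, reportedDifference, trueDepth, reportedDepth)
```

The two differences match, while the depth computed from the reported value drifts away from the true depth, which is exactly the distinction between relative and absolute accuracy.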
And with that I think we've wrapped up our AVDepthData intro, or maybe we've gotten to the bottom of it [group groaning].
It's time to move on to our first capture case, which is streaming depth, and I feel a demo coming on.
Okay. Let's start with a demo called AVCamPhotoFilter.
This is an app that we released last year as sample code with the show, and this was to show you how to apply an effect in real time to a preview and render that same effect to the photo.
So last year it just had one button at the top and that was to filter the video, and it did, you know, kind of a cheesy little rosy effect to the video, but it shows it to you in real time on the preview and it also renders it to the photo when you take a photo.
This year we've added some depth to this sample by showing you how to preview depth in a streaming fashion.
So now what we're doing is turning on depth and we're previewing it by mixing between full RGB and full depth.
I'm going to call up my lovely assistant Vanna. Actually, it's Eric.
He's going to come up and show us something that's dynamic, like a baseball glove.
I love it.
Now, notice that it's quite noisy, there's a lot of jumping around happening.
You can definitely see what it is, but it's not perfect and there's a lot of temporal problems going on, but I can click the Smooth button and suddenly we have filtered the depth to fill in the holes and temporally smooth them, and now it's a really nice-looking disparity.
I'm going to go ahead and take a photo.
And now if I go back to the Photos app, we'll find that we just captured a really lovely looking depth representation, and now this is an educational app because finally we can answer the question how deep is your glove, how deep is your glove [group groaning].
You really need to learn.
Let's go back to slides.
I know it's late.
I'm trying to keep you awake.
How did we do that?
The AVFoundation framework's camera capture classes are divided into three main groups.
The first is the AVCaptureSession, which is just a control object.
You tell it to start or stop running, but it doesn't do anything unless you give it some input.
For that we have AVCaptureInputs, such as an AVCaptureDeviceInput.
I've made one here associated with the dual camera, and that provides input to the session, but now you need to direct it somewhere as an output.
And now we have a new kind of output called an AVCaptureDepthDataOutput.
This is affectionately referred to on our team as the DDO, and it functions similarly to our VideoDataOutput, except that instead of delivering CoreMedia sample buffers, it delivers AVDepthData objects, that canonical representation that I was talking about.
It delivers them in a streaming fashion.
Now, where is AVCaptureDepthDataOutput supported?
You can, of course, add it to any session anywhere, but you're not going to get depth unless you are on the dual camera because that is the only dual system or stereo system that we have for calculating disparity.
When you attach a DepthDataOutput to your session, some things happen.
The dual camera automatically zooms to 2X, that is the full field of view of the tele, and that's because in order to calculate disparity, the focal lengths need to be the same and at 2X zoom the wide-angle camera's focal length matches the tele.
Also zoom is disabled while you are calculating depth.
We've added some new accessors to AVCaptureDevice.
On the dual camera you can discover which video formats support depth by querying each format's supportedDepthDataFormats property.
And there's also a new activeDepthDataFormat property that lets you see what the activeDepthDataFormat is or select a new DepthDataFormat.
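Putting the pieces together, a session that streams depth might be set up roughly as follows. This is a sketch under stated assumptions: the function name makeDepthSession is hypothetical, the Photo preset is picked arbitrarily, and selecting the widest 32-bit disparity format is just one possible policy.

```swift
import AVFoundation

/// Builds a session that streams depth from the dual camera.
/// Returns nil when there is no dual camera or an input/output cannot be added.
/// (Hypothetical helper for illustration.)
func makeDepthSession() throws -> (session: AVCaptureSession,
                                   depthOutput: AVCaptureDepthDataOutput)? {
    guard let device = AVCaptureDevice.default(.builtInDualCamera,
                                               for: .video,
                                               position: .back) else {
        return nil
    }

    let session = AVCaptureSession()
    session.beginConfiguration()
    defer { session.commitConfiguration() }
    session.sessionPreset = .photo

    let input = try AVCaptureDeviceInput(device: device)
    guard session.canAddInput(input) else { return nil }
    session.addInput(input)

    let depthOutput = AVCaptureDepthDataOutput()
    guard session.canAddOutput(depthOutput) else { return nil }
    session.addOutput(depthOutput)

    // Depth formats are listed per video format, so ask the active one.
    let candidates = device.activeFormat.supportedDepthDataFormats.filter {
        CMFormatDescriptionGetMediaSubType($0.formatDescription) == kCVPixelFormatType_DisparityFloat32
    }
    let widest = candidates.max {
        CMVideoFormatDescriptionGetDimensions($0.formatDescription).width <
            CMVideoFormatDescriptionGetDimensions($1.formatDescription).width
    }
    if let format = widest {
        try device.lockForConfiguration()
        device.activeDepthDataFormat = format
        device.unlockForConfiguration()
    }
    return (session, depthOutput)
}
```

If the GPU is the consumer, filtering for kCVPixelFormatType_DisparityFloat16 instead would follow the earlier advice about half-float values.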
We currently support three video resolutions or presets for depth, and let me go through them one at a time.
The first is the ever-popular Photo Preset.
In the Photo Preset you get a screen-sized preview coming out of VideoDataOutput, and you get full res 12-megapixel images coming out of the photoOutput.
So here you see that the VideoDataOutput is delivering 1440x1080, which is screen-sized.
Accompanying that, if you use a DepthDataOutput, you get 320x240 at a maximum of 24 fps.
Why so small?
Well, it takes a lot of horsepower to do that disparity map 24 times a second.
You can also get it at a lower resolution if you would like, 160x120.
Next we have a 16x9 format.
This is a new format this year.
Last year we had a 720p 16x9 format that went up to 60 fps.
This is a new one that goes up to 30 fps, but it supports depth.
And again, it is aspect correct in the DepthDataOutput at 320x180 or 160x90.
And finally, we have a very small VGA-sized preset or active format that you can use if you just want something very small very fast.
Let's talk about frame rates.
AVCaptureDevice allows you to set the min and max video frame rates, but it does not allow you to set the depth frame rates independent of the video frame rate.
That is because depth needs to be delivered coincident with the video or at an even fraction of the video frame rate.
So, for example, if you select a max video frame rate of 24, the depth can keep up with that, so you get 24 fps of depth.
If, however, you select 30 fps video, the depth cannot keep up so it will select not 24, but 15, so that you get nice even multiples.
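The arithmetic behind those two examples can be sketched in a few lines of Swift. This illustrates only the "even fraction" rule as described here and is not the framework's actual selection logic; the function name, the 24 fps ceiling as a default parameter, and the restriction to whole-number rates are all assumptions.

```swift
/// Largest rate of the form videoRate / n (n = 1, 2, 3, ...) that is a
/// whole number and does not exceed maxDepthRate.
/// (Hypothetical helper for illustration.)
func depthFrameRate(forVideoRate videoRate: Int, maxDepthRate: Int = 24) -> Int {
    precondition(videoRate > 0 && maxDepthRate > 0, "rates must be positive")
    var divisor = 1
    // Skip divisors that do not divide evenly or that are still too fast.
    while videoRate % divisor != 0 || videoRate / divisor > maxDepthRate {
        divisor += 1
    }
    return videoRate / divisor
}

print(depthFrameRate(forVideoRate: 24)) // 24
print(depthFrameRate(forVideoRate: 30)) // 15
```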
DepthDataOutput supports filtering depth data, as I just showed you in the AVCamPhotoFilter demo.
That fills the holes and it also smooths things out as you move around so that you don't see temporal jumps from frame-to-frame.
Let's look at our current landscape as far as data outputs.
We have four of them now.
The first is the VideoDataOutput, which has been around since iOS 4, and it is the thing that gives you video frames one at a time in a streaming fashion at 30 fps or 60 fps, whatever you set it to.
We also have an AudioDataOutput which typically pushes PCM frames to you 1024 at a time at 44.1 kHz.
We also have a MetadataOutput that can deliver either faces, detected faces or barcodes, and these come in sporadically.
They may have some latency, up to four frames of latency for finding faces.
And now we're adding DepthDataOutput, which, as I just mentioned, is either delivered at the frame rate of the video or at a rate evenly divisible by the video.
So now this is kind of getting ridiculous.
In order to work with all of these data outputs you have to have a very sophisticated buffering mechanism to keep track of when everything's coming in if you care about dealing with all of them at the same time, or dealing with a certain presentation time altogether.
We have recognized this as a problem for a while now, but the DepthDataOutput has proven to be the bridge too far.
That wasn't very loud.
Next one, better effort, please.
In iOS 11 we've added a new synchronizing object called an AVCaptureDataOutputSynchronizer.
It delivers all of the available data for a given presentation time in a single unified callback, and it delivers a collection object called an AVCaptureSynchronizedDataCollection.
So this allows you to designate a master output, the one that's most important to you, the one that you want everything else to be synchronized to, and then it will do the job of holding on to the media as long as it needs to, to ensure that all of the data for a given presentation time is available before it gives you that single unified callback.
It will either give you all of the data for all of the outputs, or if it's assured that there is no data for a particular output, it will go ahead and give you the collection with what it had.
So here's a little code snippet showing how to work with the data output synchronizer's unified delegate callback, which passes you, again, a SynchronizedDataCollection.
You can use it like an array or like a dictionary, depending on what you want to do with it.
You can iterate through it like you would an array, using fast enumeration if you just want to get a list of everything that's in the current collection.
Or if you want to deal with it in a dictionary like fashion, you can index by subscripting a data output that you're concerned with.
For instance, here I'm just looking for the particular result that came from the DepthDataOutput and if it's present, it will give it to me.
You have to guard your code to look for nil because, again, there might not be any depth for that given presentation time.
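For readers without the slide, a delegate along these lines might look like the following sketch. The class name DepthSyncHandler and the queue label are hypothetical, and it assumes both outputs have already been added to the session before the synchronizer is created.

```swift
import AVFoundation

/// Receives video and depth together in one callback.
/// (Hypothetical class for illustration.)
final class DepthSyncHandler: NSObject, AVCaptureDataOutputSynchronizerDelegate {
    private let videoOutput: AVCaptureVideoDataOutput
    private let depthOutput: AVCaptureDepthDataOutput
    private let synchronizer: AVCaptureDataOutputSynchronizer
    private let queue = DispatchQueue(label: "example.depth.sync") // label is arbitrary

    init(videoOutput: AVCaptureVideoDataOutput, depthOutput: AVCaptureDepthDataOutput) {
        self.videoOutput = videoOutput
        self.depthOutput = depthOutput
        // The first output in the array is treated as the master.
        synchronizer = AVCaptureDataOutputSynchronizer(dataOutputs: [videoOutput, depthOutput])
        super.init()
        synchronizer.setDelegate(self, queue: queue)
    }

    func dataOutputSynchronizer(_ synchronizer: AVCaptureDataOutputSynchronizer,
                                didOutput synchronizedDataCollection: AVCaptureSynchronizedDataCollection) {
        // Dictionary-style lookup by output; nil means no depth for this time.
        guard let syncedDepth = synchronizedDataCollection.synchronizedData(for: depthOutput)
                as? AVCaptureSynchronizedDepthData,
              !syncedDepth.depthDataWasDropped else {
            return
        }
        let depthData = syncedDepth.depthData
        print("depth at \(syncedDepth.timestamp.seconds)s, type \(depthData.depthDataType)")

        // The matching video frame, if one arrived and was not dropped.
        if let syncedVideo = synchronizedDataCollection.synchronizedData(for: videoOutput)
            as? AVCaptureSynchronizedSampleBufferData,
           !syncedVideo.sampleBufferWasDropped {
            _ = syncedVideo.sampleBuffer // process the frame alongside depthData
        }
    }
}
```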
For an example of how to use AVCaptureDataOutputSynchronizer, again, use AVCamPhotoFilter.
That sample code is already available.
It's associated with this session.
You can download it right now.
There's another new streaming feature in iOS 11, a slight tangent here, and that is support for delivering camera intrinsics with each video frame when you're using VideoDataOutput.
If you recall our pinhole camera, in order to transform points from a 3D space to a 2D space, we needed two bits of information.
We needed the optical center, or principal point, and we needed the focal length.
In computer vision you can use these properties to re-project a 2D image back to the 3D space by using the inverse transformation, and this figures prominently in the new ARKit.
New in iOS 11 you can opt in to receive such a set of intrinsics with each and every video frame that you're delivered, and you opt in by setting the AVCaptureConnection's isCameraIntrinsicMatrixDeliveryEnabled property to true.
When you do that you can expect to get one attachment per buffer with the intrinsics.
Let me show you what the matrix itself looks like.
It may look imposing, but it's really quite simple.
Camera intrinsics are a 3x3 matrix that describe the geometric properties of the camera.
fx and fy are the pixel focal length.
They're separate x and y values because sometimes cameras have anamorphic lenses or anamorphic pixels.
On iOS devices, our cameras always have square pixels, so fx and fy are always going to be the same value.
Then x naught and y naught are the pixel coordinates of the lens' principal point, or optical center.
These are all in pixel values and they're given at the resolution of the video buffer with which they're provided.
So, once you've opted in, you can expect to get sample buffers in a streaming fashion and you can get this attachment from them, and the payload is a CFData that wraps a matrix_float3x3, which is a SIMD data type.
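To make the matrix concrete, it is laid out as [fx 0 x0; 0 fy y0; 0 0 1], and projecting or re-projecting a point uses only those four values. The Swift sketch below uses a plain struct rather than the delivered matrix_float3x3. The struct, its method names and the sample numbers are all assumptions for illustration.

```swift
/// Pinhole intrinsics in pixels. (Hypothetical struct for illustration;
/// the framework delivers these values packed in a 3x3 matrix.)
struct Intrinsics {
    var fx: Double, fy: Double   // pixel focal lengths
    var x0: Double, y0: Double   // principal point (optical center)

    /// Projects a 3D camera-space point (meters) to pixel coordinates.
    func project(x: Double, y: Double, z: Double) -> (u: Double, v: Double) {
        return (fx * x / z + x0, fy * y / z + y0)
    }

    /// Re-projects a pixel back into 3D, given its depth in meters.
    func unproject(u: Double, v: Double, depth z: Double) -> (x: Double, y: Double, z: Double) {
        return ((u - x0) * z / fx, (v - y0) * z / fy, z)
    }
}

// Assumed sample values for a 1440x1080 buffer with square pixels.
let camera = Intrinsics(fx: 1000, fy: 1000, x0: 720, y0: 540)

// A point half a meter right, a quarter meter up, two meters away.
let pixel = camera.project(x: 0.5, y: -0.25, z: 2.0)

// Going back requires the depth, which is what a depth map supplies.
let point = camera.unproject(u: pixel.u, v: pixel.v, depth: 2.0)
print(pixel, point)
```

Here the point lands at pixel (970, 415), and re-projecting that pixel with a depth of 2 meters recovers the original 3D coordinates.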
If you're doing computer vision, you'll be really interested in this new feature.
Okay. I think we've officially deep sixed the streaming topics.
[ Group groaning ]
Let's move on to the photo capture, and let's start with a demo.
This is a two-for-one.
We're going to do two apps here.
AVCam is the venerable piece of sample code that shows how to take photos and movies using AVFoundation.
And notice here, though we've added depth support to it, you don't see anything happening with depth.
That's because while I'm able to take a picture of these pencils here, you don't actually see a representation of the depth, but it was stored in the photo.
So when I go into the Photos app and I look at it, and let's say I go into the Editing menu, look what popped up, Depth at the top.
So I can now touch the Depth and suddenly it will apply that blur effect to the background, which is pretty cool.
So now photos that you take in your app are eligible to have the shallow depth of field effect applied to them as well.
That's pretty cool.
We can also do other more interesting things with depth, knowing now that we've got them in all of these photos.
And by the way, in iOS 11 all of the photos that you take in the Portrait mode are now storing depth information in the photos, so they are fodder for your new creative apps.
I'm going to use this app called Wiggle Me to show some creative things that you can do with the depth.
I'll select an easy one to begin with.
What it's doing is taking something that was flat and it's re-projecting it out into a 3D space and it's kind of rolling it around, or I can just stop it from rolling and I'm just going to use the gyro to move my phone around.
Isn't that a neat effect?
It sort of comes to life.
I'm going to pick a different one.
I really like the dog.
The dog looks great.
So now he kind of moves around from side-to-side.
You can also do something which is force the perspective to change.
Knowing where the depth is, you can mess with the depth, like this.
Dolly zoom [laughter].
Dolly zoom, dog in your face.
I prefer to rotate it while dolly zooming, because it's sort of like a gangster dog.
I think the appropriate music for this part would be "Rolling in the Deep," don't you?
[ Group groaning ]
You guys are doing a great job.
I really appreciate it.
Okay. When taking photos with depth, we support a wide gamut of capture options.
You can do flash captures with depth, you can do still image stabilization with depth.
You can even do auto exposure brackets, such as a plus 2, minus 2, 0 EV.
You can do Live Photos with the depth stored in the photo itself.
AVCapturePhotoOutput is what you need to use to get photos with depth.
This is a class that we introduced last year as the successor to AVCaptureStillImageOutput.
It excels at handling complex photo requests.
I'm talking about a request where you expect to get multiple assets and they need to be tracked and delivered, such as you're going to get a raw and a JPEG, and a live photo movie, et cetera.
You could get multiple things and they're coming in at different points.
The programming model is that you fill out a request, which is called an AVCapturePhotoSettings, and you initiate the photo capture by passing the request and a delegate to be called later.
And AVCapturePhotoOutput is the one and only interface for capturing Live Photos, Bayer RAW images, and Apple P3 wide-color images.
Also, now in iOS 11 it is the one and only way to capture HEIF file format, which was mentioned in the keynote.
A great many changes needed to be made to the AVCapturePhotoOutput to support HEIF and so in iOS 11, to accommodate those great many changes, we have added a new delegate callback.
It's a simple one.
This is a replacement for the callbacks where you would get a sample buffer.
Instead, you now get a new object called an AVCapturePhoto.
AVCapturePhoto is the only delivery vehicle for depth, so if you want depth, you need to opt in by implementing this new delegate callback.
In addition, you need to explicitly opt in for DepthDataDelivery before starting your session.
Why? Well, remember, the dual camera needs to do some special behavior when it's doing depth.
It needs to zoom up to 2X so that the focal lengths match, and it needs to lock itself there so that you're not zooming.
So the way that you do that is before you start running your session, you tell the photoOutput I want DepthDataDeliveryEnabled, and then on a per photo request basis, that would be when you actually snap the photo, you would fill out a settings object and say, again, I want depth in this particular photo.
Then you work with the resulting AVCapturePhoto that comes back, and it has an accessor called depthData that returns an AVDepthData.
Wow, that AVDepthData, it's everywhere.
It's like pervasive.
It's like deeply integrated into the API.
[ Group groaning ]
On iOS most AVCaptureDevice formats have the ability to take higher resolution stills than their streaming resolution.
Looking at our formats that support depth on iPhone 7 Plus, here you see the streaming video resolution compared to the high res photo resolution that you get.
So, for instance, for photo, if you're streaming, you only get screen-sized buffers, but you get 12-megapixel stills.
The same holds true for depth.
Remember what I told you: when we're streaming depth, there's a lot of work to be done in real time to meet that 24 fps. But when doing a photo, we have a little extra time, since it doesn't need to be delivered in real time, so we can give you a very high quality, great looking map that's over twice the resolution of the streaming one.
The aspect ratio always matches that of the video.
So if you're doing 16x9 video, you get a 16x9 map.
Now it's time to talk about the dirty little subject of distortions.
The depth maps that we capture and embed in photos are distorted.
I'm sorry to be the bearer of that news, but it's actually a good thing.
Let me explain why.
All the camera diagrams that I showed you up to this point were pinhole cameras.
Pinhole cameras have no lenses, so the images are rectilinear; that is, light passes through the little aperture in straight lines and presents a geometrically perfect, inverted replica of the object on the image plane.
So if you had a perfect grid of squares like this and you took a picture of it with a pinhole camera, it would look like this on the image plane, but upside-down.
So straight lines would remain straight.
Unfortunately, in the real world we need to let more light in, so we need lenses, and lenses have radial distortions.
These distortions are present in the captured images as well, because the light rays were bent in slightly odd ways on their way to the image sensor.
And in an extreme case, straight lines captured through a bad lens might look something like this.
This is no good for finding disparity, since two images need to be matched to find features.
Well, if camera one has got a set of distortions and camera two has got a different set of distortions, how are you going to find the same set of features in those two images since they're warped differently?
I left out an important step when I described how we calculate disparities and I'm going to fill it in right now.
Before comparing the tele and the wide images, we have to do an extra step.
We have to make those warped images rectilinear; that is, we unwarp them using a calibrated set of coefficients and those characterize the lens' distortions.
After each image is corrected, it looks like this. Satisfying: straight lines, straight.
Now we can, with certainty, compare points in the two images and find a perfect, real-world, rectilinear disparity map, which looks like that.
Now we have the opposite problem.
The disparity map matches the physical world, but it doesn't match the image that we just took, which has warping due to the lens, so now we have to do another step, which is to rewarp the disparity map so that it matches the image. We use a set of inverse lens coefficients to do this, and the final disparity map has the same geometric distortions as its accompanying image.
So I said that this was a good thing.
Let me explain why.
It means that out of the box our depthDataMaps that come with photos are meant for filters, for effects.
They always match the image that they accompany.
So if you're working on effects, if you want to do stuff like the Wiggle Me app, interesting effects with the image such as I showed at the very beginning, they're perfect for that.
What they're not perfect for is reconstructing a 3D scene.
If you want to do that, you should make them rectilinear, and you can do that.
I'm going to talk about that in a minute.
I'd like to just touch briefly on the physical structure of the depth data in our image files.
In iOS 11 we support two kinds of images with depth.
The first is HEIF HEVC, the new format, also called HEIC files, and there, there is first-class support for depth.
There's an area inside the file called the auxiliary image, which can store a disparity or a depth or an alpha map, and that's where we store it.
We encode it as monochrome HEVC, and we also store metadata that's important for working with that depth, such as information about whether or not it was filtered, what is its accuracy, camera calibration information like lens distortions, and also some rendering instructions.
All of those are encoded as XMP along with the auxiliary image.
The second format we support is JPEG.
Boy, JPEG wasn't meant to do tricks like this, but we made it do this trick anyway.
The map is 8-bit lossy JPEG if it's filtered, or, if it has NaNs (not-a-number values) in it, we use 16-bit lossless JPEG encoding to preserve all of the NaNs, and we store it as a second image at the bottom of the JPEG, so it's like a multipicture object, if you're familiar with that.
Again, we store the metadata as XMP, just as we do with HEIF HEVC.
On to the most requested developer feature for the dual camera, and that's dual photo capture.
What do I mean by this?
So far, when you use the dual camera and take a picture, you still just get one image.
It's either from the wide or it's from the tele, depending where you're zoomed, or if you're in the area between one and 2X you might get portions of both as we do some blending to make an even nicer picture, but you still only get one.
You've been clamoring for both images and that's what we're giving you now.
With a single request, you can get both the wide and the tele in their full 12-megapixel glory and you can do whatever the heck you want with them.
[ Applause ]
Here's how you do it.
It's very similar to opting in for depth.
Before starting the capture session, you need to opt in by telling the photoOutput I'm going to ask for dual photo so enable it.
And then as you are capturing on a per photo request basis you can fill out your settings by saying I would like this particular photo to be a dual photo, give me both wide and tele.
When you do that, the number of photo callbacks that you get doubles.
It's not just that you get two callbacks.
Let's say you're asking for RAW plus HEIF dual photo.
Well, that would be four because you're going to get two wides and two teles of RAW and HEIF.
So whatever you were expecting to get before, the number of callbacks will double.
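The doubling rule can be written down as a tiny calculation. This is only an illustrative sketch in plain Python, not AVFoundation code; the function name and the list-of-format-names input are hypothetical and exist just to make the arithmetic concrete.

```python
def expected_photo_callbacks(requested_formats, dual_photo):
    """Return how many photo callbacks one capture request should produce.

    requested_formats: format names asked for in one request, e.g. ["RAW", "HEIF"].
    dual_photo: True when both the wide and the tele images were requested.
    (Hypothetical helper: it only models the counting rule from the talk.)
    """
    per_camera = len(requested_formats)
    # Dual photo delivery doubles whatever you were expecting before.
    return per_camera * 2 if dual_photo else per_camera


print(expected_photo_callbacks(["HEIF"], dual_photo=False))        # 1
print(expected_photo_callbacks(["HEIF"], dual_photo=True))         # 2
print(expected_photo_callbacks(["RAW", "HEIF"], dual_photo=True))  # 4
```

A delegate that counts incoming callbacks against this number knows when the whole request has been delivered.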
Now, we support all of the same gamut of features that we do with depth, and that is you can do flash with dual photo, auto SIS, exposure brackets and you can optionally get depth if you need it.
How do we deal with zoom?
This is a problem of security and confidence.
Let's say that your app only shows the field of view of the tele.
Well, the wide-angle camera has more information, so if you take a picture, you're actually giving people something outside of the viewable area and that might be a privacy concern.
So if you are zooming, we deliver dual photos, but with the outside blackened so that they match the field of view that's seen in preview.
If you want the full image, you can have it; just don't set the zoom to anything other than 1X.
How do you know if it has this blackened area on the outside?
Well, inside the image we store a clean aperture rectangle that defines the area with valid pixels.
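As a rough illustration of what the clean aperture rectangle is for, here is a minimal Python sketch that crops an image down to its valid pixels. The image-as-nested-lists representation and the (x, y, width, height) layout of the rectangle are assumptions made for the example; how the rectangle is actually read from the file is not shown here.

```python
def crop_to_clean_aperture(pixels, clean_aperture):
    """Keep only the pixels inside the clean aperture rectangle.

    pixels: image as a list of rows, each row a list of pixel values.
    clean_aperture: (x, y, width, height) of the valid region, in pixels
                    (an assumed layout for this sketch).
    """
    x, y, width, height = clean_aperture
    return [row[x:x + width] for row in pixels[y:y + height]]


# A 4x4 image whose outer ring is blackened (0) and whose center 2x2 is valid.
image = [
    [0, 0, 0, 0],
    [0, 5, 6, 0],
    [0, 7, 8, 0],
    [0, 0, 0, 0],
]
print(crop_to_clean_aperture(image, (1, 1, 2, 2)))  # [[5, 6], [7, 8]]
```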
Dual photos can be delivered with camera calibration data, too.
Camera calibration data is the kind of data that you need to do augmented reality, virtual reality, lens distortion correction, et cetera.
So with both a wide and a tele and camera calibration data, you can make your own depth maps.
I challenge you to make one better than Apple does.
You can also augment reality, of course, because you get the intrinsics.
Let's talk about the individual properties of camera calibration.
This is the last object that I'm going to introduce today.
The AVCameraCalibrationData is our model class for camera calibrations.
Where does it live?
Well, if you ask for depth, you get it with an AVDepthData.
It is a property of that.
You can also get it from an AVCapturePhoto, if you've opted in.
So you opt in by saying I would like camera calibration with this photo, which works rather nicely.
If you're doing dual photo capture, and you ask for dual photo and you ask for the camera calibrations, you get two photo callbacks, and you get the calibrations for the wide with the wide result and for the tele with the tele result.
What does an intrinsicMatrix look like?
I hope this is a little bit familiar, since it's the same as what we looked at earlier for the streaming VideoDataOutput case.
Again, it's a 3x3 matrix in the AVCameraCalibrationData, and it's used for going from the 3D space to the 2D space when flattening an image.
You can apply the inverse when going back to the 3D space.
It has the pixel focal length, which, again, appears as two separate values, but because we have square pixels, they are the same number.
And it also has an x and y for the optical center.
The pixel values are given at the resolution of a reference frame.
Again, the depth data might be very low resolution.
We don't want to give it to you at that low resolution so, therefore, we provide a separate set of dimensions.
Typically, they're the full size of the sensor, therefore, you get a lot of accuracy, a lot of resolution in the intrinsicMatrix.
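To make the intrinsic matrix concrete, here is a small sketch in plain Python. It assumes the usual pinhole layout (focal lengths on the diagonal, optical center in the last column), written row by row as nested lists; the numbers are made up and are not real iPhone calibration values. It shows scaling the intrinsics from the reference dimensions down to a smaller depth map, then going from 3D to 2D and back.

```python
def scale_intrinsics(K, reference_size, target_size):
    """Rescale a 3x3 intrinsic matrix from reference dimensions to a target size."""
    ref_w, ref_h = reference_size
    tgt_w, tgt_h = target_size
    sx, sy = tgt_w / ref_w, tgt_h / ref_h
    return [
        [K[0][0] * sx, K[0][1] * sx, K[0][2] * sx],  # x focal length, skew, x center
        [K[1][0] * sy, K[1][1] * sy, K[1][2] * sy],  # y focal length, y center
        [0.0, 0.0, 1.0],
    ]


def project(K, point):
    """3D camera-space point (X, Y, Z) to 2D pixel coordinates (u, v)."""
    X, Y, Z = point
    return (K[0][0] * X / Z + K[0][2], K[1][1] * Y / Z + K[1][2])


def unproject(K, pixel, depth):
    """2D pixel plus a depth value back to a 3D camera-space point."""
    u, v = pixel
    return ((u - K[0][2]) * depth / K[0][0], (v - K[1][2]) * depth / K[1][1], depth)


# Made-up intrinsics expressed at assumed 4032x3024 reference dimensions.
K_full = [[2800.0, 0.0, 2016.0],
          [0.0, 2800.0, 1512.0],
          [0.0, 0.0, 1.0]]
# Rescale to a hypothetical 768x576 depth map before using it with that map.
K_small = scale_intrinsics(K_full, (4032, 3024), (768, 576))
pixel = project(K_small, (0.1, 0.05, 1.0))
print(pixel)
print(unproject(K_small, pixel, 1.0))
```

This simple scaling ignores sub-pixel center conventions, which is usually acceptable for effects but worth revisiting for precise reconstruction.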
Next is the extrinsicMatrix.
This is a property that describes the camera's pose in the world.
You need it when you're working with images from stereo-rectified cameras to triangulate where one is compared to another one.
And our extrinsics are presented as a single matrix, but kind of two matrices squashed together.
So the first one, the one on the left, is the rotation matrix.
It's a 3x3 that describes how the camera is rotated with respect to the world origin, wherever that happens to be.
And there's also a 1x3 matrix describing the camera's translation, or sort of distance from the world origin.
It's important to note that the tele camera is the origin of the world when you're using the dual camera, which makes it very easy.
If you're just getting a tele image, the matrix that you get will be an identity matrix.
If you're working with wide and tele, then the wide will, of course, not be an identity matrix, since it's describing its pose and distance from the tele camera.
But using the extrinsics, you could, for instance, compute the baseline between the wide and the tele.
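As a sketch of that baseline computation, assume the extrinsics are laid out as three rows of four numbers, with the rotation in the first three columns and the translation in the last one. Because a rotation does not change lengths, the distance between the two camera centers is simply the length of the translation. The matrices below are made-up examples, and the result is in whatever unit the translation is expressed in.

```python
import math


def baseline_from_extrinsics(extrinsic):
    """Distance between this camera and the world origin (the tele camera).

    extrinsic: 3 rows x 4 columns, rotation R on the left and translation t
               in the last column (assumed layout for this sketch).
    """
    t = [row[3] for row in extrinsic]
    return math.sqrt(sum(c * c for c in t))


# The tele camera is the world origin: identity rotation, zero translation.
tele = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]]
# Made-up wide camera pose, offset 10 units along x.
wide = [[1.0, 0.0, 0.0, -10.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]]

print(baseline_from_extrinsics(tele))  # 0.0
print(baseline_from_extrinsics(wide))  # 10.0
```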
There are also several properties dealing with the geometric distortions of the lens, as we talked about earlier.
These are useful for when you need to make either an image or a depth map rectilinear.
There are two properties that you need to be concerned with.
The first is lensDistortionCenter.
This describes the point on the sensor that coincides with the center of the lens' distortion.
This is frequently different from the optical center of the lens.
It's like if you looked at all of the distortions, radial distortions on the lens sort of like tree rings, this would be the center of the tree rings.
Also, along with this distortion center we have a lensDistortionLookupTable, which you can think of as being a number of floating point dots connecting the lensDistortionCenter to the longest radius.
Again, if you drew little circles from each of these dots, you would get something that looks like tree rings that would show you the radial distortions of the lens.
The lensDistortionLookupTable is a C array of floats wrapped in a Data.
If each and every point along those dotted lines was a 0, you would have the one and only perfect lens in the world.
It has no radial distortions at all.
If there is a positive value, it indicates that there is a lengthening of the radius there.
If you have a negative value, it indicates that it was shrunk there.
But looking at this entire table together, you can sort of get a feel for where the bumps in the lens are.
To apply distortion correction to an image you'd begin with an empty destination buffer and then iterate through it row-by-row and for each point you would use the lensDistortionLookupTable to find the corresponding value in the distorted image, and then write that value to the right position in your output buffer.
This is extremely tricky code to write.
We know this.
So, we've provided a reference implementation for you in AVCameraCalibrationData.h. We actually put code in a header file.
It's all commented out.
It's a big Objective-C function.
Please take a look at it.
It describes how to rectify an image or how to rewarp an image, depending on which table you pass it.
There is also, as you might expect, the inverse of that table, which describes how to go from the unwarped back to warped.
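The lookup-table idea can be sketched in a few lines of plain Python. This is not the reference implementation from the header; it is a simplified illustration under stated assumptions: the table entries are taken to be relative radial magnifications at linearly spaced radii from the distortion center out to the longest radius, the packed floats are assumed to be little-endian 32-bit, and nearest-neighbor sampling stands in for proper interpolation.

```python
import math
import struct


def decode_lookup_table(data):
    """Unpack bytes holding packed 32-bit floats (little-endian assumed)."""
    count = len(data) // 4
    return list(struct.unpack(f"<{count}f", data))


def lens_distortion_point(point, table, distortion_center, image_size):
    """Map a point through the table (needs at least two entries)."""
    cx, cy = distortion_center
    width, height = image_size
    # Longest radius: from the distortion center to the farthest image corner.
    r_max = math.hypot(max(cx, width - cx), max(cy, height - cy))
    vx, vy = point[0] - cx, point[1] - cy
    r = math.hypot(vx, vy)
    if r < r_max:
        # Linear interpolation between the two nearest table entries.
        pos = r * (len(table) - 1) / r_max
        idx = min(int(pos), len(table) - 2)
        frac = pos - idx
        magnification = (1.0 - frac) * table[idx] + frac * table[idx + 1]
    else:
        magnification = table[-1]
    # Positive values lengthen the radius, negative values shrink it.
    return (cx + vx * (1.0 + magnification), cy + vy * (1.0 + magnification))


def remap(image, table, distortion_center):
    """Fill an empty destination buffer row by row from the source image."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            sx, sy = lens_distortion_point((x, y), table, distortion_center, (width, height))
            sxi, syi = int(round(sx)), int(round(sy))
            if 0 <= sxi < width and 0 <= syi < height:
                output[y][x] = image[syi][sxi]
    return output


# An all-zero table is the "perfect lens": every point maps to itself.
print(lens_distortion_point((10, 20), [0.0, 0.0], (50, 50), (100, 100)))  # (10.0, 20.0)
```

Passing one table or its inverse to the same routine is what switches between rectifying and rewarping.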
It's really easier to show you with a demo.
Let's do a demo.
This will be our fourth and final sample app of the day, and it's called Straighten Up.
I bet you can guess what it does.
This is an app that uses the AVCameraCalibrationData, specifically the lens distortion characterizations, to make an image rectilinear.
This morning I went outside and I took a series of dual photos.
You can tell that they're dual photos and that I was zoomed in to 2X because of the black border around them.
This one is, of course, from the tele and this is the distorted image.
Now, when I press the Undistort button, you'll see something that's a little bit subtle.
You can definitely see it, but it's pretty subtle.
Typically telephoto lenses have less curvature so they have fewer radial distortions at the edges than wide lenses.
I'll zoom in to a portion so you can see the difference.
This is rectilinear, straight lines are straight, and this is distorted.
Now, if I go into the wide image, again, we don't have the corners, but you can see in the image data that we do have already that the distortions are more prominent.
So distorted, undistorted.
You can definitely see around the edges that they are pulling in more.
Back to slides.
Time for the wrap-up.
iPhone 7 Plus dual camera is not a time-of-flight camera system.
If you leave with only one piece of knowledge, it's that, and I hope you know how disparity differs from depth.
Also, the canonical representation on our platform for depth is AVDepthData.
We learned about intrinsics, extrinsics, lens distortion info.
These are all properties of an AVCameraCalibrationData.
We learned about the AVCaptureDepthDataOutput, and that it provides streaming depth which you can filter, or not.
And we learned that you can capture photos using AVCapturePhotoOutput and have depth delivery enabled.
Finally, we spent a little bit of time talking about the dual camera, dual photo delivery which produces a wide and a tele for a single image with which you can do interesting computer vision tasks, and I hope you do.
We have three pieces of sample code that are available right now and are associated with this session: AVCam, PhotoFilter and Wiggle Me.
For more information, here is the URL for the site.
And don't tune me out just yet.
Directly following this session there is an informal get-together for developers with an interest in photography.
[Whispering] That's all of you.
So you can come and mingle with members of the Apple Media Technologies Group.
You can ask questions, of course, or we can just talk and socialize.
Tomorrow there is a sister session at 11:00 a.m. where you'll learn how to read and manipulate the depth data that's in image files.
Today we just briefly touched on the surface of what you can do with images with depth.
Tomorrow you get a whole host of demos.
So I really hope you'll make time for that tomorrow.
If you do, I'd deeply appreciate it [group groaning].
And finally, I will be presenting a dedicated session on working with HEIF on Friday morning that I also hope you'll attend.
In that one I will delve deeply into the AVCapturePhoto interface.
Thank you, and enjoy the rest of the show.
[ Applause ]