VINIT MODI: Hi, everyone. Welcome to the session on
the new multi-camera API. My name is Vinit Modi, and
I'm the product manager on the camera platform. Just a quick reminder--
after this talk, please step outside
to the sandbox area if you'd like to ask
us more questions. Before we talk
about the new API, let me give a quick update
on the state of camera. Historically, most camera usage has focused on the native camera app that ships with the device. It turns out, however, that more than twice as much camera usage occurs in the apps that you build. And it's extremely
important that you support the new
features that are available in the
new Android APIs. When we speak to a lot of
developers, what we find is that the number one question is the state of the camera2 API and where it's headed going forward. We've been working very hard. And starting with Android
P, what you'll find is almost all new devices will
support camera2 and HALv3. What this means is when you look
at the camera characteristics, you'll find that the device will advertise itself as either camera2 LIMITED, which is similar to a point-and-shoot; camera2 FULL, which offers advanced capabilities like per-frame control; or camera2 LEVEL_3, which enables YUV reprocessing and RAW. In addition, we've been
working with several OEMs, or manufacturers, to open
up new APIs at launch. So we're excited this year
that both the Google Pixel 3 and the Huawei Mate
20 series support the new multi-camera API. Now, let me step
back and say why this new API is so important. Prior to Android P, as developers, you only got access to one of the physical sensors, whereas the native camera app got access to the full hardware capability. But starting with P,
you'll get the same access as the native camera app. This includes all
the physical sensors plus the logical camera. And the logical camera
is an abstraction of all the physical sensors
that allows you to easily take advantage of the hardware. There are several new use
cases and possibilities with the new multi-camera API. Today, Oscar's going to
talk about optical zoom, and Emilie is going
to cover bokeh. Thank you very much. Oscar is up next. [APPLAUSE] OSCAR WAHLTINEZ: Hi
everyone, my name is Oscar. I work in the developer
relations team, and we're going to start
off with a live demo. What could go wrong? [LAUGHTER] So here I have a Mate 20 phone. We are implementing
multi-camera zoom. What we're doing here is we are swapping, in the UI layer, the two camera streams. I'm not doing any kind of digital zoom or cropping. I'm simply swapping the streams. As you can see, it's almost instantaneous. There's no tearing down and bringing up of the camera session. It's just a single session,
and I'm swapping the two camera streams. The idea is that, as I said,
single camera session-- two streams. And we're going to swap
between the streams, and we're going to show
you how this was built. The key component, though,
is that we had the same code running on both devices. As many camera developers
know, it is quite a feat to have the same code running
across such different devices, especially for something as tied to the hardware as multi-camera is. So first, let's talk about
how we can use multiple camera streams simultaneously. The basic guarantee
provided by the framework in the multi-camera
APIs is that you can use at least
two physical camera streams at the same time. Recall the guaranteed
stream configurations for single camera devices. It is a set of rules based
on hardware level, target type, and target size. If we use the multi-camera
APIs correctly, we can get an exception
to these rules. Let's illustrate
this with an example. We have a single YUV
stream of maximum size. As per the previous table,
devices with limited hardware level will be able to
use a single stream with that configuration. If we use the
multi-camera APIs, we can actually use two streams
of equivalent configuration from the underlying
physical cameras. Let's walk through
what we need to do to implement the app that
we just demoed earlier. We broke it down to five steps. Are you ready? Step number one-- find
the physical cameras. We start by identifying
pairs of physical cameras that can be opened
simultaneously. Using the camera characteristics, we look at the available request capabilities, and if LOGICAL_MULTI_CAMERA is one of them, we know this device is a logical camera. Now that we've found a logical camera, we store it. We'll need the ID later, as we'll see. And we get the physical cameras associated with it. Then, we can move on to the next step. Here's a visualization of what we just described. We take that logical camera ID and, with its characteristics, we call getPhysicalCameraIds, and now we retrieve the physical cameras associated with the logical camera group.
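In Kotlin, a minimal sketch of this first step might look like the following (the helper name and return shape are illustrative, not from the talk):

```kotlin
import android.hardware.camera2.CameraCharacteristics
import android.hardware.camera2.CameraManager
import android.hardware.camera2.CameraMetadata

// Sketch: list every logical multi-camera on the device together with the
// physical camera IDs behind it. Requires API level 28 (Android P) or higher.
fun findLogicalCameras(manager: CameraManager): List<Pair<String, Set<String>>> =
    manager.cameraIdList.mapNotNull { id ->
        val characteristics = manager.getCameraCharacteristics(id)
        val capabilities = characteristics.get(
            CameraCharacteristics.REQUEST_AVAILABLE_CAPABILITIES) ?: return@mapNotNull null
        val isLogical = capabilities.contains(
            CameraMetadata.REQUEST_AVAILABLE_CAPABILITIES_LOGICAL_MULTI_CAMERA)
        // Keep the logical camera's own ID plus its physical camera IDs for later.
        if (isLogical) id to characteristics.physicalCameraIds else null
    }
```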
Onto the next step: open the logical camera. The second step is nothing new. We open the camera. Recall the logical camera ID we saved earlier. That is the only one we pass to the camera manager. So to reiterate, we only open the logical camera. The state callback will trigger when the device is ready. We have now opened the logical camera.
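Step two, sketched in Kotlin (assuming the app has already been granted the CAMERA permission and has a background Handler):

```kotlin
import android.annotation.SuppressLint
import android.hardware.camera2.CameraDevice
import android.hardware.camera2.CameraManager
import android.os.Handler

// Sketch: open only the logical camera; the physical streams come later,
// when we build the output configurations.
@SuppressLint("MissingPermission")
fun openLogicalCamera(
    manager: CameraManager,
    logicalCameraId: String,
    handler: Handler,
    onCameraOpened: (CameraDevice) -> Unit
) {
    manager.openCamera(logicalCameraId, object : CameraDevice.StateCallback() {
        override fun onOpened(device: CameraDevice) = onCameraOpened(device)
        override fun onDisconnected(device: CameraDevice) = device.close()
        override fun onError(device: CameraDevice, error: Int) = device.close()
    }, handler)
}
```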
In the next step, we'll create the output configuration objects. They will be used to create the camera session. For each desired output target, we may have a physical camera ID from the list we found earlier, if we want to retrieve frames from a specific hardware camera. Let's go into more detail. We create the output configuration object using our desired output target. And if we want to associate that output with a specific physical camera, then we pass the ID to the setPhysicalCameraId API. If we want to use the logical camera, we can simply skip this step. We may also have a combination of both. So at the end of the day, we have a list of output configurations, some of which may be associated with physical cameras, and some with the logical camera. The goal is to put all the configurations into a single session configuration. As we just explained, each output configuration has an associated output target and, optionally, a physical camera ID.
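Sketched in Kotlin, the output configurations for the two-stream zoom demo could look like this (the surface and camera ID names are placeholders):

```kotlin
import android.hardware.camera2.params.OutputConfiguration
import android.view.Surface

// Sketch: one OutputConfiguration per output target. Calling
// setPhysicalCameraId pins that target to a specific physical sensor;
// leaving it out means the target is fed by the logical camera.
fun buildOutputConfigurations(
    wideSurface: Surface, teleSurface: Surface,
    wideCameraId: String, teleCameraId: String
): List<OutputConfiguration> = listOf(
    OutputConfiguration(wideSurface).apply { setPhysicalCameraId(wideCameraId) },
    OutputConfiguration(teleSurface).apply { setPhysicalCameraId(teleCameraId) }
)
```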
Now, we create the capture session. How do we create the capture session using the new session configuration object? We start off with the list of output configurations that we just created. With that, we instantiate a session configuration, which includes the capture session state callback. From that callback, we're going to get an instance of the created camera session. We take that session configuration object and the camera device, which we got from step number two when we opened the logical camera, and we send the framework a request to create a new session with our desired configuration. The callback provided in the session configuration object will then be triggered, and we'll have our camera session ready to be used.
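A Kotlin sketch of step four, using the API 28 SessionConfiguration overload (the Executor is assumed to come from the app):

```kotlin
import android.hardware.camera2.CameraCaptureSession
import android.hardware.camera2.CameraDevice
import android.hardware.camera2.params.OutputConfiguration
import android.hardware.camera2.params.SessionConfiguration
import java.util.concurrent.Executor

// Sketch: wrap all the output configurations in one SessionConfiguration and
// ask the camera device for a session; onConfigured hands back the session.
fun createMultiCameraSession(
    device: CameraDevice,
    outputs: List<OutputConfiguration>,
    executor: Executor,
    onReady: (CameraCaptureSession) -> Unit
) {
    val config = SessionConfiguration(
        SessionConfiguration.SESSION_REGULAR, outputs, executor,
        object : CameraCaptureSession.StateCallback() {
            override fun onConfigured(session: CameraCaptureSession) = onReady(session)
            override fun onConfigureFailed(session: CameraCaptureSession) = session.close()
        })
    device.createCaptureSession(config)
}
```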
Last step: capture requests. Once that has happened, we can start getting frames out of the cameras. For example, if you want to capture a frame from two physical cameras simultaneously, we take the session
we created earlier and a pair of output targets. In this particular
case, each target will be associated with
a specific camera ID. We create the capture
request that we normally do, in this case,
using template preview. We attach the output targets to
it, again like we normally do. And now, we dispatch
the capture request. Nothing different here. Except in this case,
the output surfaces will receive image
data from each of the associated
physical cameras, and the capture request
callback will trigger only once. So again, it's just like
any capture request. The big difference is that
the completion callback will give me back two start
exposure timestamps instead of just a single value from
normal capture requests.
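And a Kotlin sketch of that final step-- one request, two physical output targets (surface names are again placeholders):

```kotlin
import android.hardware.camera2.CameraCaptureSession
import android.hardware.camera2.CameraDevice
import android.hardware.camera2.CaptureRequest
import android.hardware.camera2.TotalCaptureResult
import android.os.Handler
import android.view.Surface

// Sketch: a single capture request that feeds both physical streams at once.
fun captureFromBothStreams(
    session: CameraCaptureSession,
    device: CameraDevice,
    wideTarget: Surface,
    teleTarget: Surface,
    handler: Handler
) {
    val request = device.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW).apply {
        addTarget(wideTarget)
        addTarget(teleTarget)
    }.build()

    session.capture(request, object : CameraCaptureSession.CaptureCallback() {
        override fun onCaptureCompleted(
            session: CameraCaptureSession,
            request: CaptureRequest,
            result: TotalCaptureResult
        ) {
            // Fires once for the whole request; per-physical-camera metadata
            // (e.g., each sensor's timestamp) is in result.physicalCameraResults.
        }
    }, handler)
}
```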
So to recap, this is how we implemented our optical zoom demo. We found the physical cameras. We opened the logical camera that is part of that group. We created the output configurations. We put them in a list. We create our capture session. And then, we dispatch
our capture requests. One more topic I wanted to
touch on is lens distortion. A classical example of lens
distortion is the fisheye lens. This is not a real example. It is here for
illustration purposes only. All lenses have some
amount of distortion. For logical cameras, you can
assume that the distortion will be minimal. For the most part, it'll be corrected by the drivers. However, for physical
cameras, distortion can be pretty significant. The physical lens
distortion is described by a set of radial and tangential coefficients. The coefficients can be queried using the LENS_DISTORTION camera characteristics key. The documentation has a lot more
details if you're interested. The good news is that there's
a way to correct distortion without doing a bunch of math. We can simply set the
distortion correction mode on our capture requests. OFF means that no distortion correction is applied. We may need to use
this if we want to do things like
[INAUDIBLE] synchronization. Emilie will touch on that later. FAST means that the
best possible correction is applied while meeting
the advertised frame rate. If no FAST correction
is possible, this may be the same as OFF. HIGH QUALITY means
that distortion will be corrected as much as
the lens allows, potentially at the cost of frame rate. If we don't specify
a correction mode, it will be either
FAST or HIGH QUALITY. It is going to be up to the
implementation details, which is the default.
You, as a developer, can query to see which one
was applied to your capture request. Let's see a code
snippet demonstrating how this lens distortion is
set to high quality, which is probably what we want
for a still image capture. Assuming we already
started our camera session, we instantiate the
capture request builder using our desired
template, in this case, as I said, image capture. Then, we use the
camera characteristics to determine if HIGH QUALITY
distortion correction mode is available. Now that we know that
we have a HIGH QUALITY correction for distortion, we
set it on the capture request, and we do what we always do--
dispatch the capture request. For more sample code
and technical details, take a look at our blog post. We covered this and some more. We published it
earlier this week. And now, I'll hand
it over to Emilie. [APPLAUSE] EMILIE ROBERTS: Thanks, Oscar. My name is Emilie Roberts. I'm a partner
developer advocate, and I'm going to show
you a cool demo that uses some of these
multi-camera APIs to do a bokeh effect
on the Pixel 3. So we actually have
three-- well, two demos. 2.5 demos. The first one is
a single cam demo. There's no multi-camera at all. But I wanted to sort
of show the mechanisms for creating the bokeh effect. Then, when we get into the dual
cam demo, you'll see exactly-- you know, we can focus on
the multi-camera aspects and not worry so much about
the bokeh effect itself. And it's going to
be published soon-- Open Source. So don't worry about
scribbling down too much code. So can we go to this phone? Demo. Excuse me. Okey dokey. So we have-- I didn't set this up properly. OK, let's do the single
cam bokeh effect. Taking a selfie here. And I think you can see on the
screen, it's finding my face. It's cutting it out. Let me bump up the
final result. And it's kind of pasting the
portrait mode in there. This is kind of a rough
cut portrait mode. And I do have an
optimization on this. Let's see how that goes. I'll show you the
output steps here. So it's trying to do a better
job of finding the foreground. Hey, it didn't do too bad. It's generating the foreground
image, the background, which is just monochromed and
blurred a little bit, and-- come on, app. Don't let us down-- pasting on the final result. So that's not too
bad for a single cam. Let's try the dual cam demo. And with these stage
lights, I'm not sure. Come on. Hey, not too bad. We're doing good. So you can see a depth map being
created in the bottom left-hand corner that's detecting me
in the foreground and then the rest of y'all a
little bit faded out. You can see the closer
folks are a gray. And then black goes
right to the back. You can also see the
lights wreaking some havoc. Let me show you
the final result. Obviously, there are a few
optimizations that can happen, but it's working pretty well. So again, this is using the two front cameras on the Pixel 3. You can see the two
streams going at once. Oops. Will this connect back up? No. Anyway, both streams at once. The wide angle lens and
the normal angle lens going at the same time. Can we head back to
the slides please? So let's talk about
how we do that. Oh, there we are. Anyway, so we had
the normal camera and the wide angle lens
running at the same time. Again, we're going to publish
this on probably GitHub, open-source it, so you can dig
into it, help us optimize it, make it even better. So the first case--
the single cam. Let's look at that quickly. The floating head
bokeh effect I call it. We're going to take a
photo with face detect. We're going to make two copies. So we have background,
foreground. Do some sort of fancy
background effects and then paste that floating
head back where it belongs. Face detect is built
into the Camera2 API. It's quite easy in
code to implement. First thing we want to
do is check the camera characteristics to see if
your camera device supports FaceDetect. If it does, you find
the mode you want. There is OFF, SIMPLE, and FULL, depending on your camera device. Then, when we make our
camera capture request, we just include
that in the request. When we get our
results, you can see if the mode was set,
if you found any faces. And in this example, the first face it finds is the one I use. We could imagine expanding this to have multiple faces.
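A rough Kotlin sketch of that flow (mode selection simplified; helper names are illustrative, not from the app):

```kotlin
import android.hardware.camera2.*
import android.hardware.camera2.params.Face

// Sketch: pick the best supported face-detect mode, request it, and read back
// the first detected face from the capture result.
fun enableFaceDetect(builder: CaptureRequest.Builder, characteristics: CameraCharacteristics) {
    val modes = characteristics.get(
        CameraCharacteristics.STATISTICS_INFO_AVAILABLE_FACE_DETECT_MODES) ?: intArrayOf()
    val mode = when {
        CameraMetadata.STATISTICS_FACE_DETECT_MODE_FULL in modes ->
            CameraMetadata.STATISTICS_FACE_DETECT_MODE_FULL
        CameraMetadata.STATISTICS_FACE_DETECT_MODE_SIMPLE in modes ->
            CameraMetadata.STATISTICS_FACE_DETECT_MODE_SIMPLE
        else -> CameraMetadata.STATISTICS_FACE_DETECT_MODE_OFF
    }
    builder.set(CaptureRequest.STATISTICS_FACE_DETECT_MODE, mode)
}

// In the capture callback: grab the first detected face, if any.
fun firstFace(result: TotalCaptureResult): Face? =
    result.get(CaptureResult.STATISTICS_FACES)?.firstOrNull()
```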
Just a note-- FaceDetect really grabs just the face. So I just bumped those
bounds out a little bit. So it's more of a head
getting chopped off. That sounds bad. A head being pasted
onto the background. Let's talk about the
fun background effects. So you can do what
you want here. I did a couple things. First, using
RenderScript, we just did a blur on the background. And because it's a
multi-camera talk, some cameras have a manual zoom. So you could-- if you're
working with multi-cam, you could do the background
with another camera and zoom way out of focus. So you could actually
do an optical blur, which would be kind
of cool, and also save you that software step. In this demo, we also did a
custom software sepia effect using RenderScript. But if you're using multi-cam
again, lots of cameras have built in effects,
like monochrome and sepia, that you can query and include
in your capture request as well. If you haven't used
RenderScript before, it looks something like this. And for our blur
effect, we care most about the three middle lines. And it's a built-in script intrinsic, ScriptIntrinsicBlur. It's pretty handy. And it basically works out of the box. In this case, it blurred outside of the box, because the box is not blurry.
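For reference, a Kotlin sketch of that blur using the built-in intrinsic (treat the radius and setup as illustrative):

```kotlin
import android.content.Context
import android.graphics.Bitmap
import android.renderscript.*

// Sketch: blur a copy of the background with ScriptIntrinsicBlur.
fun blurBackground(context: Context, source: Bitmap, radius: Float = 20f): Bitmap {
    val output = Bitmap.createBitmap(source.width, source.height, Bitmap.Config.ARGB_8888)
    val rs = RenderScript.create(context)
    val input = Allocation.createFromBitmap(rs, source)
    val result = Allocation.createFromBitmap(rs, output)
    ScriptIntrinsicBlur.create(rs, Element.U8_4(rs)).apply {
        setRadius(radius)   // RenderScript blur radius must be in (0, 25]
        setInput(input)
        forEach(result)
    }
    result.copyTo(output)
    rs.destroy()
    return output
}
```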
This is a custom RenderScript script for the sepia effect. You can see in those
first three lines, basically we're taking
the input red, green, and blue channels, kind
of muting the colors, making them a bit
yellow, and sending those to the output channels. Okey dokey. So we've got the background. It's got this cool
bokeh effect on it. What do we do with
the foreground? From FaceDetect, we've
got the face cut out. And we just apply a PorterDuff
with a linear gradient to make the edges a bit softer. So when we paste it on,
it's not that harsh line. And ta-da. Paste it on, and things
look pretty good.
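A small Kotlin sketch of that feathering step (the gradient geometry and names are assumptions, not the app's actual values):

```kotlin
import android.graphics.*

// Sketch: fade the bottom edge of the cut-out to transparent with a
// linear-gradient alpha mask applied through PorterDuff DST_IN.
fun featherEdges(cutout: Bitmap): Bitmap {
    val result = Bitmap.createBitmap(cutout.width, cutout.height, Bitmap.Config.ARGB_8888)
    val canvas = Canvas(result)
    canvas.drawBitmap(cutout, 0f, 0f, null)

    val fade = Paint().apply {
        shader = LinearGradient(
            0f, cutout.height * 0.7f, 0f, cutout.height.toFloat(),
            Color.WHITE, Color.TRANSPARENT, Shader.TileMode.CLAMP)
        // DST_IN keeps the existing pixels only where this layer is opaque.
        xfermode = PorterDuffXfermode(PorterDuff.Mode.DST_IN)
    }
    canvas.drawRect(0f, 0f, cutout.width.toFloat(), cutout.height.toFloat(), fade)
    return result
}
```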
There are a couple of optimizations. One you saw, which is with
the GrabCut algorithm. This is built into OpenCV, the
Open Computer Vision library that we're using for the
depth map demo later on. Basically, I found the face. And then I chose a
rectangle a bit larger to try to guess where
the body might be. And then GrabCut does its best-- like the Magic Wand tool in
your favorite photo editor-- to shrink down that foreground
to the actual foreground bounds. We could also, as I mentioned, add in multiple faces.
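With OpenCV's Java bindings, that GrabCut refinement might be sketched like this (the rectangle choice and iteration count are assumptions):

```kotlin
import org.opencv.core.*
import org.opencv.imgproc.Imgproc

// Sketch: seed GrabCut with an enlarged face rectangle and keep whatever it
// labels as definite or probable foreground. Input must be an 8-bit BGR Mat.
fun refineForeground(imageBgr: Mat, roughRect: Rect, iterations: Int = 3): Mat {
    val labels = Mat()
    val bgModel = Mat()
    val fgModel = Mat()
    Imgproc.grabCut(imageBgr, labels, roughRect, bgModel, fgModel, iterations,
        Imgproc.GC_INIT_WITH_RECT)

    val definite = Mat()
    val probable = Mat()
    Core.compare(labels, Scalar(Imgproc.GC_FGD.toDouble()), definite, Core.CMP_EQ)
    Core.compare(labels, Scalar(Imgproc.GC_PR_FGD.toDouble()), probable, Core.CMP_EQ)

    val foreground = Mat()
    Core.bitwise_or(definite, probable, foreground)
    return foreground  // 255 where GrabCut thinks "person", 0 elsewhere
}
```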
Now, the moment you've all been waiting for. Let's talk about dual cam
bokeh with the depth map. We're going to use two
cameras simultaneously. And we're going
to create a depth map, which is the hard part-- which is why I highlighted it in bold. But then we go ahead and
use the same mechanism we already talked about. Okey dokey. How does this work? First of all, the
double capture. So this, on the left, is
me hanging out with my pets at home. The left is the normal camera
in the Pixel 3 front cameras. And the right is
the wide angle shot. To do that, just as
Oscar walked through, we set up multiple output configurations. So for each lens, we set up-- here, we have the preview surface as well as an image reader for the normal lens. We use setPhysicalCameraId with the normal lens's ID. And we do the same thing for the wide angle lens. So we end up with four output configurations that we're putting into our session configuration. From then-- or from
there, it's just a matter of choosing our
output targets for the capture. In this case, we
want those photos so we can operate on them. So we say we want the image readers from the normal lens and the wide angle lens.
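As a Kotlin sketch, the still capture that grabs one photo per lens could look like this (reader names are placeholders for whatever the app set up):

```kotlin
import android.hardware.camera2.CameraCaptureSession
import android.hardware.camera2.CameraDevice
import android.media.ImageReader
import android.os.Handler

// Sketch: target only the two ImageReaders so we get a still frame from each
// physical lens to run the depth-map math on.
fun captureStillPair(
    session: CameraCaptureSession,
    device: CameraDevice,
    normalReader: ImageReader,
    wideReader: ImageReader,
    handler: Handler
) {
    val request = device.createCaptureRequest(CameraDevice.TEMPLATE_STILL_CAPTURE).apply {
        addTarget(normalReader.surface)
        addTarget(wideReader.surface)
    }.build()
    session.capture(request, null, handler)
}
```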
OK, so we have our images. Now we have to do a bunch of math and some magic to make that bokeh effect happen. I want to give a brief
introduction to stereo vision before we get into all the code. But I have to say,
looking at these slides, working on these slides,
I got a little bit bored. I like geometry, but you
know, it's a lot of letters. And I started asking myself,
what does P stand for anyway? Obviously, it's a
pile of chocolate. P stands for pile of chocolate. And this is what we're
going to be focusing on for the rest of this demo. And you know, camera one is a
little bit boring, camera two. So S here, we're going
to replace with a shark. This is my friend,
Pepper the Shark. And H is Hippo. So these are our
helpers that are going to help us talk
about stereo vision. So left camera, normal
lens, is Pepper the Shark. Wide angle lens is Susie
Loo, the couch hippopotamus. And they're both zeroing in
on that big pile of chocolate. And already, it's
a lot more fun. I hope you agree. So those skewed
rectangles there. That's the 2D surface. That's like the image that the
cameras are going to capture. In other words, the
2D representation of that real live
3D object we have. Let's take a look at
what that looks like. The shark eye view
is right in there on the almonds, sea
salt, and dark chocolate, whereas the hippo cam is focused
in on the raspberry crunch. So they're both seeing
the same 3D object, but they have this
2D representation. And what we really want to do
is take their separate views and be able to combine
them, so we get a little bit more information
than that 2D view and be able to create
a great depth map. So we have, again, the normal
view, the wide angle view. Well in this case,
they're both normal. But the left-hand,
the right-hand overlay on each other, you get that
kind of 3D ruler effect from elementary
school that I hope you got to enjoy as a child. And from there, we can
create a depth map, which allows you to do really
cool things like awesome bokeh effects as well as know how
far away the chocolate is so that you can reach out
and grab it, obviously. Okey dokey. So those two cameras,
those two pictures, are at a different
orientation from each other. And they're separated in space. So we need to get those
on top of each other. This is what we call
the camera extrinsics. How the two cameras
relate to each other. So we need to rotate and
translate each of those images so they appear on
top of each other. Normally, we say that-- normally, we give the rotation
and translation parameters for a camera in
relation to World. So instead of Camera
1 to World, we'll have Shark to World
and Hippo to World. But when we're doing stereo
vision, what we really need to worry about
is Shark to Hippo. So how are these two cameras
related to each other? Like a good engineer,
all I know is I have to switch Hippo to
World to be World to Hippo. And now I have this pathway
from Shark to World to Hippo. I hope that was a fun
introduction to the math, which you can read all
about on Wikipedia, and it looks something like this. To get the rotation matrix, we invert the rotation matrix for Camera 2 and multiply it with Camera 1's. And for translation, it's something like this-- take the product and subtract. You can read all about it on Wikipedia or other sources.
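As a rough formula sketch, in one common camera-to-world convention (notation is mine, not necessarily what was on the slides), with R and t being each camera's rotation and translation:

```latex
R_{\text{shark} \to \text{hippo}} = R_{\text{hippo}}^{-1} \, R_{\text{shark}}
\qquad
t_{\text{shark} \to \text{hippo}} = R_{\text{hippo}}^{-1} \left( t_{\text{shark}} - t_{\text{hippo}} \right)
```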
So one thing I want to point out, if you're working on this yourself, is the translation matrix for a Pixel 3
from the normal camera to the wide camera. This is what I got out. What do you notice about it? The 9 millimeter separation
between the cameras looks just about right. If you look at the phone,
you know there's a good-- what's the American? A good-- anyway, there's
a good nine millimeters between those cameras. That makes perfect sense. But what I didn't
notice, and which cost me about a week of time, is
that it's in the y-coordinate. So the cameras are
on top of each other. And so while I'm
working with this phone, looking at the two
cameras beside each other, I just assumed that
they were obviously horizontally displaced. No big deal, except that
the depth map function that I'm using
assumes that they're going to be beside each other. It assumes horizontal
displacement. So you just have-- because-- oh, I didn't
say the important part. Camera sensors are often
optimized for landscape, which makes sense. If you do it wrong, your
depth maps don't work. You pull your hair out. You have a great
week like I did. Anyway, just a note if
you're implementing this. So we have the
camera extrinsics, how we get the pictures from the
cameras on top of each other, how they relate to each other. Camera intrinsics are properties
of the cameras themselves. So we know we have a normal
lens and a wide angle lens. And they have
different properties. So there are two things. One is the camera
characteristics. This is things like the focal length, the principal axis, and whether that axis is skewed for some reason. This often appears as a three-by-three matrix. And distortion-- the wide angle
lens and any wide angle lens-- near the edges,
especially, you're going to get a little
bit of distortion going on that we need to
consider as we're mapping the two images to each other. Another note-- so we're going
to use the intrinsic distortion properties of the lens
to undistort the image. But as Oscar told
us, by default, the camera undistorts
the image for us. So we're going to
undistort it and then reundistort, which means we're
actually going to distort it, which is bad news. So we actually need to turn
off the distortion correction if you want to do depth maps. That's easy enough with
our camera requests. We just make sure that the distortion correction mode is OFF.
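In Kotlin, that's a one-liner on the request builder (sketch):

```kotlin
import android.hardware.camera2.CameraMetadata
import android.hardware.camera2.CaptureRequest

// Sketch: disable in-camera distortion correction so our own undistortion
// (driven by the calibration data) is the only one applied.
fun disableDistortionCorrection(builder: CaptureRequest.Builder) {
    builder.set(
        CaptureRequest.DISTORTION_CORRECTION_MODE,
        CameraMetadata.DISTORTION_CORRECTION_MODE_OFF)
}
```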
Okey dokey. So here are the four things: rotation, translation, the camera characteristics matrix, and the lens distortion. How do you get these properties? It's pretty easy. You just take an
entire afternoon, print out a checkerboard
sheet, or-- has anyone in this room done this before? It's called camera-- yeah? It's fun, right? It's great-- camera calibration. Take a whole series of
shots with both the cameras. You run a bunch of algorithms. You figure out these four
camera characteristics. And from then, you can go ahead
and start making depth maps from the cameras. You can tell from
my cheerful face, it's not actually that fun. Don't do it. It's no good. Luckily, in the camera2
multi-camera APIs, we have these great fields-- rotation, translation, calibration, and distortion. So you can get them straight out of the API, which is wonderful.
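Read straight off a physical camera's characteristics, a Kotlin sketch looks like this (the data class is illustrative):

```kotlin
import android.hardware.camera2.CameraCharacteristics

// Sketch: the four calibration fields, per physical lens, on API 28+.
data class LensCalibration(
    val poseRotation: FloatArray?,     // LENS_POSE_ROTATION: quaternion (x, y, z, w)
    val poseTranslation: FloatArray?,  // LENS_POSE_TRANSLATION: meters (x, y, z)
    val intrinsics: FloatArray?,       // LENS_INTRINSIC_CALIBRATION: f_x, f_y, c_x, c_y, s
    val distortion: FloatArray?        // LENS_DISTORTION: radial + tangential coefficients
)

fun readCalibration(characteristics: CameraCharacteristics) = LensCalibration(
    characteristics.get(CameraCharacteristics.LENS_POSE_ROTATION),
    characteristics.get(CameraCharacteristics.LENS_POSE_TRANSLATION),
    characteristics.get(CameraCharacteristics.LENS_INTRINSIC_CALIBRATION),
    characteristics.get(CameraCharacteristics.LENS_DISTORTION)
)
```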
I'm going to just tell you a few notes if you're implementing these yourself. So the camera characteristics-- the focal length and the axis information-- come in five parameters. This is in the
Android documentation. But to create that
three by three matrix, you just have to follow
the documentation and plug-in the numbers. Another thing that
might throw you off is the distortion coefficients
again are five values. But the OpenCV library uses
them in a different order than the values you
get out of the API. So you just need to know
that it goes 0, 1, 3, 4, 2. The good news is if you
use them in the 0, 1, 2, 3, 4 order, when you
undistort your images, they look like they've
been in a whirlpool. So you're sure something's
wrong with those coefficients. Anyway, so once we have
all those parameters, we can go ahead and start
preparing our images to do a depth map comparison. This is me in my kitchen. And I don't know if
you can see from there, but if you look at
the ceiling, you'll notice there's kind
of a curve going down. We don't live in a fun house. It's the distortion effects
we were talking about with the wide angle lens with
the distortion correction off. As well when you're
comparing two images, the straight lines--
well, and the curved lines-- need to line up
in each of the images when you're making depth map. We call that rectifying. And we use the camera
characteristics to do that. That's just showing
the bent roof. All of these functions are in
the OpenCV library, the Open Computer Vision library. The first one is stereoRectify. This gets us a set of parameters we can use to perform these calculations. So we pass in the-- sorry, the values we got from the API: the camera matrix, the distortion coefficients, and the rotation and translation that we calculated before. We get these parameters out, and we call initUndistortRectifyMap, which creates a map telling us how we can take two images from these two different cameras and map them onto each other. And the remap function does just this.
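Here's a hedged Kotlin sketch of that pipeline with OpenCV's Java bindings (in some OpenCV versions initUndistortRectifyMap lives under Imgproc rather than Calib3d; the matrices are assumed to be built from the calibration keys above):

```kotlin
import org.opencv.calib3d.Calib3d
import org.opencv.core.*
import org.opencv.imgproc.Imgproc

// Sketch: stereoRectify -> initUndistortRectifyMap -> remap, for both lenses.
fun rectifyPair(
    k1: Mat, d1: Mat, k2: Mat, d2: Mat,  // per-lens camera matrix + distortion
    r: Mat, t: Mat,                      // relative rotation and translation
    size: Size, img1: Mat, img2: Mat
): Pair<Mat, Mat> {
    val r1 = Mat(); val r2 = Mat(); val p1 = Mat(); val p2 = Mat(); val q = Mat()
    Calib3d.stereoRectify(k1, d1, k2, d2, size, r, t, r1, r2, p1, p2, q)

    val map1x = Mat(); val map1y = Mat(); val map2x = Mat(); val map2y = Mat()
    Calib3d.initUndistortRectifyMap(k1, d1, r1, p1, size, CvType.CV_32FC1, map1x, map1y)
    Calib3d.initUndistortRectifyMap(k2, d2, r2, p2, size, CvType.CV_32FC1, map2x, map2y)

    val out1 = Mat(); val out2 = Mat()
    Imgproc.remap(img1, out1, map1x, map1y, Imgproc.INTER_LINEAR)
    Imgproc.remap(img2, out2, map2x, map2y, Imgproc.INTER_LINEAR)
    return out1 to out2
}
```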
So let's see what that gives us. Here is, on the left again, from the normal cam-- front cam-- of the Pixel 3, and the wide
angle lens from the Pixel 3. You can see they
look pretty good. The shark lines are lined up. The crop is about right. You know, the wide angle
has a lot more crop region. That's all lined up. The roof lines, the
door lines are straight. There's no wacky distortion. And actually, I'd say,
from where you're sitting, you probably have
to look closely to notice that the left-hand
picture is a little bit closer to the left-hand
side of the frame. So they're actually offset by
a little bit, which is just about what you'd expect
if you had two cameras 9 millimeters apart. So we got the images. We've undistorted. We've rectified them. We're very close to
creating the depth maps. All we have to do is call
the depth map function. We use stereoBM or stereoSGBM. One has a few more
parameters than the other. And when you get to play
with the open-source demo, you can see how these parameters
work, and play around with it, optimize them,
commit your changes, and help make that app better. And we call compute and make this depth map.
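For example, with the OpenCV Java bindings (parameter values here are just a starting point, not the app's tuned values):

```kotlin
import org.opencv.calib3d.StereoSGBM
import org.opencv.core.Mat

// Sketch: semi-global block matching over the rectified grayscale pair.
fun computeDisparity(leftGray: Mat, rightGray: Mat): Mat {
    // minDisparity = 0, numDisparities must be a multiple of 16, blockSize = 5
    val matcher = StereoSGBM.create(0, 64, 5)
    val disparity = Mat()
    matcher.compute(leftGray, rightGray, disparity)
    return disparity
}
```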
And when you do that, you'll get an amazing photo-- something like this. Actually, sometimes it looks
a lot better than that. But anyway. This isn't quite what
we want to work with. What we really want to do is
filter that, in this case, using a weighted least
squares filter, which smooths that out and gives us
a little bit more useful depth map. So the darker pixels,
as we saw in the demo, are the ones farther back. The whiter pixels
are the closer ones. And it's probably a
little hard to see-- you can see the shark's snout
and the hippo's snout are a little bit grayed out. So it's actually working
to some extent there. This is how we call the filter.
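Roughly, with the contrib (ximgproc) bindings-- a sketch with illustrative parameter values, assuming OpenCV was built with the contrib modules:

```kotlin
import org.opencv.calib3d.StereoMatcher
import org.opencv.core.Mat
import org.opencv.ximgproc.Ximgproc

// Sketch: confidence-based WLS filtering wants a right-view matcher too.
fun filterDisparity(leftMatcher: StereoMatcher, leftGray: Mat, rightGray: Mat): Mat {
    val rightMatcher = Ximgproc.createRightMatcher(leftMatcher)
    val leftDisp = Mat(); val rightDisp = Mat()
    leftMatcher.compute(leftGray, rightGray, leftDisp)
    rightMatcher.compute(rightGray, leftGray, rightDisp)

    val wls = Ximgproc.createDisparityWLSFilter(leftMatcher)
    wls.setLambda(8000.0)
    wls.setSigmaColor(1.5)
    val filtered = Mat()
    wls.filter(leftDisp, leftGray, filtered, rightDisp)
    return filtered
}
```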
It's also included in the OpenCV libraries, in the contrib modules. It's all open source. And it's really cool. When you get a depth map that
is perfect, it's exhilarating. OK, here we have our depth map. What do we do with it? So we can just apply this depth
map as a mask on top of it. And the black areas,
we want to fade out, and we want to highlight
the foreground. That's pretty easy to
do with a PorterDuff. And the result is
something like this. So indeed, the foreground
is more present. And then the background
is faded out. Personally, I have
high standards. I see like a
translucent floating shark over my shoulder. My face is a little bit grayed
out, my eyeball's missing. So I'm going to put another
big red X through this and say, not quite good enough. It's a good start. But what we really want is
a depth map more like this. So we're going to put a hard
threshold on the depth map and decide foreground,
background, that's it. In other apps, you may want
to do something similar but maybe not such
a harsh distinction. It could be a smoother curve. To do that, we can use the
OpenCV function, threshold. We give it some cutoff value. For the app, it's somewhere
around 80 to 140 out of 255. And that's just that limit
where something is considered foreground or background. I wanted to note
this in case you're implementing any of this. When we applied the
mask like I showed you, you actually need to
turn those black pixels to transparent pixels. So this function
just goes through and converts all the black ones to transparency.
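Putting those two steps together in a Kotlin sketch (the 120 cutoff is just one value in the 80-to-140 range mentioned above):

```kotlin
import org.opencv.core.Core
import org.opencv.core.Mat
import org.opencv.imgproc.Imgproc

// Sketch: binarize the filtered depth map, then use it as the alpha channel of
// the RGBA foreground so background pixels become fully transparent.
fun maskForeground(foregroundRgba: Mat, depth8Bit: Mat, cutoff: Double = 120.0): Mat {
    val mask = Mat()
    Imgproc.threshold(depth8Bit, mask, cutoff, 255.0, Imgproc.THRESH_BINARY)

    val channels = ArrayList<Mat>()
    Core.split(foregroundRgba, channels)
    channels[3] = mask              // replace alpha with the foreground mask
    val result = Mat()
    Core.merge(channels, result)
    return result
}
```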
And here we go. We're almost there. So I wanted to note one thing on this slide. The middle picture-- you can
see my eye is a bit blacked out. Just remember that for
three more slides or so. So we have our initial picture,
we've got our depth map, we do this hard threshold on it. And we can again
create our background just like we did
in the first demo, blur it out, monochrome it,
and cut out that foreground. We have all the pieces
we need to paste it on. And this is our amazing,
final, portrait shot, which is pretty good. I'm proud of it. So let's talk about
an optimization. Remember that eyeball
thing I was talking about? So anything kind of
gleaming and shiny can get messed up in
this current iteration of the application. Or bright lights can throw
off the depth map generation. And so I did one
optimization, which was we have the FaceDetect region. I'm pretty sure I want the
face in the foreground. So I just used it and
hard cut it in and said, anything on the face is going
to be in the foreground. So that protected like
my teeth and my eye from that masking out effect. I don't know if you
noticed-- can I go back-- my fuzzy red hair
and the red couch-- there we go-- they
kind of blend in. And so I'm thinking we
could use GrabCut possibly to do a little bit better
job of figuring out exactly what's in the foreground. So thanks a lot. We really hope that this
gave you a bit of a deep dive into using camera2 and
the multi-camera APIs, giving you some
exciting creative ideas. We really want to
hear your ideas, and we really want to
see them in your apps. And we also want to know what
features you're looking for. We think they're
great, and we want to keep pushing the camera
ecosystem forward and doing more and more stuff
really ecosystem wide. Thanks so much again. And please do come to the
sandbox, camera sandbox, if you want to ask
us any questions, if you want any follow ups,
you want to try this app, and see if it works. And look for it
soon open source. Thanks a lot. [APPLAUSE] [MUSIC PLAYING]