Getting the most from the new multi-camera API (Android Dev Summit '18)

Captions
VINIT MODI: Hi, everyone. Welcome to the session on the new multi-camera API. My name is Vinit Modi, and I'm the product manager on the camera platform. Just a quick reminder-- after this talk, please step outside to the sandbox area if you'd like to ask us more questions.

Before we talk about the new API, let me give a quick update on the state of camera. Historically, most camera apps focus on the native camera app that ships with the device. It turns out, however, that more than twice as much camera usage occurs in the apps that you build. And it's extremely important that you support the new features that are available in the new Android APIs. When we speak to a lot of developers, what we find is that the number one question is the state of the camera2 API and where we're headed going forward. We've been working very hard, and starting with Android P, what you'll find is that almost all new devices will support camera2 and HALv3. What this means is that when you look at the camera characteristics, you'll find that the device will advertise itself as either camera2 LIMITED, which is similar to a point and shoot, camera2 FULL, which offers advanced capabilities like per-frame control, or LEVEL_3, which enables YUV reprocessing and RAW.

In addition, we've been working with several OEMs, or manufacturers, to open up new APIs at launch. So we're excited this year that both the Google Pixel 3 and the Huawei Mate 20 series support the new multi-camera API.

Now, let me step back and say why this new API is so important. Prior to Android P, as developers, you only got access to one of the physical sensors, while the native camera app got access to the full hardware capability. But starting with P, you'll get the same access as the native camera app. This includes all the physical sensors plus the logical camera. And the logical camera is an abstraction of all the physical sensors that allows you to easily take advantage of the hardware. There are several new use cases and possibilities with the new multi-camera API. Today, Oscar's going to talk about optical zoom, and Emilie is going to cover bokeh. Thank you very much. Oscar is up next.

[APPLAUSE]

OSCAR WAHLTINEZ: Hi everyone, my name is Oscar. I work in the developer relations team, and we're going to start off with a live demo. What could go wrong?

[LAUGHTER]

So here I have a Mate 20 phone. We are implementing multi-camera zoom. What we're doing here is swapping, in the UI layer, the two camera streams. I'm not doing any kind of digital zoom or cropping. I'm simply swapping the streams. As you can see, it's almost instantaneous. There's no tearing down and bringing back up of the camera session. It's just a single session, and I'm swapping the two camera streams. The idea is that, as I said, it's a single camera session with two streams. We're going to swap between the streams, and we're going to show you how this was built. The key point, though, is that we had the same code running on both devices. As many camera developers know, it is quite a feat to have the same code running across such different devices, especially for something as tied to the hardware as multi-camera.

So first, let's talk about how we can use multiple camera streams simultaneously. The basic guarantee provided by the framework in the multi-camera APIs is that you can use at least two physical camera streams at the same time. Recall the guaranteed stream configurations for single camera devices.
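Since those hardware levels are reported through the camera characteristics, a minimal Kotlin sketch of checking them might look like the following. The function name and the log format are just for illustration; it assumes a CameraManager obtained from the system service.

```kotlin
import android.hardware.camera2.CameraCharacteristics
import android.hardware.camera2.CameraManager

// Print the camera2 hardware level advertised by every camera on the device.
fun logHardwareLevels(manager: CameraManager) {
    for (id in manager.cameraIdList) {
        val level = manager.getCameraCharacteristics(id)
            .get(CameraCharacteristics.INFO_SUPPORTED_HARDWARE_LEVEL)
        val name = when (level) {
            CameraCharacteristics.INFO_SUPPORTED_HARDWARE_LEVEL_LEGACY -> "LEGACY"
            CameraCharacteristics.INFO_SUPPORTED_HARDWARE_LEVEL_LIMITED -> "LIMITED"
            CameraCharacteristics.INFO_SUPPORTED_HARDWARE_LEVEL_FULL -> "FULL"
            CameraCharacteristics.INFO_SUPPORTED_HARDWARE_LEVEL_3 -> "LEVEL_3"
            else -> "OTHER ($level)"
        }
        println("Camera $id advertises hardware level $name")
    }
}
```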
It is a set of rules based on hardware level, target type, and target size. If we use the multi-camera APIs correctly, we can get an exception to these rules. Let's illustrate this with an example. We have a single YUV stream of maximum size. As per the previous table, devices with the LIMITED hardware level will be able to use a single stream with that configuration. If we use the multi-camera APIs, we can actually use two streams of equivalent configuration from the underlying physical cameras.

Let's walk through what we need to do to implement the app that we just demoed earlier. We broke it down into five steps. Are you ready?

Step number one-- find the physical cameras. We start by identifying pairs of physical cameras that can be opened simultaneously. Using the camera characteristics object, we look at the request capabilities, and if LOGICAL_MULTI_CAMERA is one of them, we know this device is a logical camera. Now that we've found a logical camera, we store its ID-- we'll need it later, as we'll see. And we get the physical cameras associated with it. Then, we can move on to the next step. Here's a visualization of what we just described. We take that logical camera ID, and with the characteristics, we call getPhysicalCameraIds, and now we retrieve the physical cameras associated with the logical camera group. On to the next step.

Open the logical camera. The second step is nothing new. We open the camera. Recall the logical camera ID we saved earlier-- that is the only one we pass to the camera manager. So to reiterate, we only open the logical camera. The state callback will trigger when the device is ready. We have now opened the logical camera.

In the next step, we'll create the output configuration objects. They will be used to create the camera session. For each desired output target, we may have a physical camera ID from the list we found earlier if we want to retrieve frames from a specific hardware camera. Let's go into more detail. We create the output configuration object using our desired output target. And if we want to associate that output with a specific physical camera, then we pass the ID to the setPhysicalCameraId API. If we want to use the logical camera, we can simply skip that call. We may also have a combination of both. So at the end of the day, we have a list of output configurations, some of which may be associated with physical cameras and some with the logical camera. The goal is to put all the configurations into a single session configuration. As we just explained, each output configuration has an associated output target and, optionally, a physical camera ID.

Now, we create the capture session. How do we create the capture session using the new session configuration object? We start off with our list of output configurations that we just created. With that, we instantiate a session configuration, which includes the capture session callback. From that callback, we're going to get an instance of the created camera session. We take that session configuration object and the camera device, which we got from step number two when we opened the logical camera, and we send the framework a request to create a new session with our desired configuration. The callback provided in the session configuration object will be triggered now, and then we'll have our camera session ready to be used.

Last step-- capture requests. Once that has happened, we can start getting frames out of the cameras.
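Put together, steps one through four might look roughly like this Kotlin sketch. It assumes API level 28, that the CAMERA permission is already granted, and that the caller supplies the output Surfaces and an Executor; the helper names are made up for illustration.

```kotlin
import android.annotation.SuppressLint
import android.hardware.camera2.CameraCaptureSession
import android.hardware.camera2.CameraCharacteristics
import android.hardware.camera2.CameraDevice
import android.hardware.camera2.CameraManager
import android.hardware.camera2.params.OutputConfiguration
import android.hardware.camera2.params.SessionConfiguration
import android.view.Surface
import java.util.concurrent.Executor

// Step 1: find a logical multi-camera and the physical cameras behind it.
fun findLogicalCamera(manager: CameraManager): Pair<String, Set<String>>? {
    for (id in manager.cameraIdList) {
        val chars = manager.getCameraCharacteristics(id)
        val caps = chars.get(CameraCharacteristics.REQUEST_AVAILABLE_CAPABILITIES) ?: continue
        if (CameraCharacteristics.REQUEST_AVAILABLE_CAPABILITIES_LOGICAL_MULTI_CAMERA in caps) {
            // Save the logical ID; the physical IDs come from the same characteristics object.
            return id to chars.physicalCameraIds
        }
    }
    return null
}

// Step 2: open only the logical camera; the state callback fires when the device is ready.
@SuppressLint("MissingPermission")
fun openLogicalCamera(manager: CameraManager, logicalId: String, executor: Executor,
                      onReady: (CameraDevice) -> Unit) {
    manager.openCamera(logicalId, executor, object : CameraDevice.StateCallback() {
        override fun onOpened(device: CameraDevice) = onReady(device)
        override fun onDisconnected(device: CameraDevice) = device.close()
        override fun onError(device: CameraDevice, error: Int) = device.close()
    })
}

// Steps 3 and 4: one output configuration per target, each pinned to a physical camera,
// then a single session configuration for the whole list.
fun createDualCameraSession(device: CameraDevice, executor: Executor,
                            targets: List<Pair<Surface, String>>, // surface plus physical ID
                            onReady: (CameraCaptureSession) -> Unit) {
    val outputs = targets.map { (surface, physicalId) ->
        OutputConfiguration(surface).apply { setPhysicalCameraId(physicalId) }
    }
    val sessionConfig = SessionConfiguration(
        SessionConfiguration.SESSION_REGULAR, outputs, executor,
        object : CameraCaptureSession.StateCallback() {
            override fun onConfigured(session: CameraCaptureSession) = onReady(session)
            override fun onConfigureFailed(session: CameraCaptureSession) = session.close()
        })
    device.createCaptureSession(sessionConfig)
}
```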
For example, if we want to capture a frame from two physical cameras simultaneously, we take the session we created earlier and a pair of output targets. In this particular case, each target will be associated with a specific camera ID. We create the capture request like we normally do, in this case using the preview template. We attach the output targets to it, again like we normally do. And now, we dispatch the capture request. Nothing different here, except that in this case, the output surfaces will receive image data from each of the associated physical cameras, and the capture request callback will trigger only once. So again, it's just like any capture request. The big difference is that the completion callback will give us back two start-of-exposure timestamps instead of just the single value from normal capture requests.

So to recap, this is how we implemented our optical zoom demo. We found the physical cameras. We opened the logical camera that is part of that group. We created the output configurations and put them in a list. We created our capture session. And then, we dispatched our capture requests.

One more topic I wanted to touch on is lens distortion. A classic example of lens distortion is the fisheye lens. This is not a real example-- it is here for illustration purposes only. All lenses have some amount of distortion. For logical cameras, you can assume that the distortion will be minimal. For the most part, it will be corrected by the drivers. However, for physical cameras, distortion can be pretty significant. The physical lens distortion is described by a set of radial and tangential coefficients. The coefficients can be queried using the lens distortion camera characteristics key. The documentation has a lot more details if you're interested.

The good news is that there's a way to correct distortion without doing a bunch of math. We can simply set the distortion correction mode on our capture requests. OFF means that no distortion correction is applied. We may need to use this if we want to do things like [INAUDIBLE] synchronization-- Emilie will touch on that later. FAST means that the best possible correction is applied while meeting the advertised frame rate. If no FAST correction is possible, this may be the same as OFF. HIGH_QUALITY means that distortion will be corrected as much as the lens allows, potentially at the cost of frame rate. If we don't specify a correction mode, the default will be either FAST or HIGH_QUALITY-- which one is up to the implementation. You, as a developer, can query to see which one was applied to your capture request.

Let's see a code snippet demonstrating how distortion correction is set to HIGH_QUALITY, which is probably what we want for a still image capture. Assuming we have already started our camera session, we instantiate the capture request builder using our desired template, in this case, as I said, still image capture. Then, we use the camera characteristics to determine if the HIGH_QUALITY distortion correction mode is available. Now that we know we have HIGH_QUALITY correction for distortion, we set it on the capture request, and we do what we always do-- dispatch the capture request.

For more sample code and technical details, take a look at our blog post. We covered this and some more. We published it earlier this week. And now, I'll hand it over to Emilie.

[APPLAUSE]

EMILIE ROBERTS: Thanks, Oscar. My name is Emilie Roberts.
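The last step and the distortion snippet Oscar describes might look roughly like this in Kotlin: one repeating request that feeds two physical-camera targets, and a helper that opts into HIGH_QUALITY distortion correction when the device advertises it. The function names are made up; the keys and constants are the API 28 camera2 ones.

```kotlin
import android.hardware.camera2.CameraCaptureSession
import android.hardware.camera2.CameraCharacteristics
import android.hardware.camera2.CameraDevice
import android.hardware.camera2.CameraMetadata
import android.hardware.camera2.CaptureRequest
import android.view.Surface

// Step 5: one capture request whose two targets are backed by different physical
// cameras; the completion callback fires once for both streams.
fun startDualStreamPreview(session: CameraCaptureSession,
                           normalTarget: Surface, wideTarget: Surface) {
    val request = session.device.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW).apply {
        addTarget(normalTarget)
        addTarget(wideTarget)
    }
    session.setRepeatingRequest(request.build(), null, null)
}

// Ask for HIGH_QUALITY distortion correction when the device supports it.
fun requestHighQualityCorrection(chars: CameraCharacteristics,
                                 builder: CaptureRequest.Builder) {
    val modes = chars.get(CameraCharacteristics.DISTORTION_CORRECTION_AVAILABLE_MODES) ?: return
    if (CameraMetadata.DISTORTION_CORRECTION_MODE_HIGH_QUALITY in modes) {
        builder.set(CaptureRequest.DISTORTION_CORRECTION_MODE,
            CameraMetadata.DISTORTION_CORRECTION_MODE_HIGH_QUALITY)
    }
}
```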
I'm a partner developer advocate, and I'm going to show you a cool demo that uses some of these multi-camera APIs to do a bokeh effect on the Pixel 3. So we actually have three-- well, two demos. 2.5 demos. The first one is a single cam demo-- there's no multi-camera at all-- but I wanted to show the mechanisms for creating the bokeh effect. Then, when we get to the dual cam demo, we can focus on the multi-camera aspects and not worry so much about the bokeh effect itself. And it's going to be published soon, open source, so don't worry about scribbling down too much code.

So can we go to this phone? Demo. Excuse me. Okey dokey. So we have-- I didn't set this up properly. OK, let's do the single cam bokeh effect. Taking a selfie here. And I think you can see on the screen, it's finding my face. It's cutting it out. Let me bring up the final result. And it's kind of pasting the portrait mode in there. This is kind of a rough-cut portrait mode. And I do have an optimization on this. Let's see how that goes. I'll show you the output steps here. So it's trying to do a better job of finding the foreground. Hey, it didn't do too bad. It's generating the foreground image, the background-- which is just monochromed and blurred a little bit-- and-- come on, app, don't let us down-- pasting on the final result. So that's not too bad for a single cam.

Let's try the dual cam demo. And with these stage lights, I'm not sure. Come on. Hey, not too bad. We're doing good. So you can see a depth map being created in the bottom left-hand corner that's detecting me in the foreground and then the rest of y'all a little bit faded out. You can see the closer folks are a gray, and then black goes right to the back. You can also see the lights wreaking some havoc. Let me show you the final result. Obviously, there are a few optimizations that can happen, but it's working pretty well. So again, this is using the two front cameras on the Pixel 3. You can see the two streams going at once. Oops. Will this connect back up? No. Anyway, both streams at once-- the wide angle lens and the normal lens going at the same time.

Can we head back to the slides, please? So let's talk about how we do that. Oh, there we are. Anyway, so we had the normal camera and the wide angle lens running at the same time. Again, we're going to publish this, probably on GitHub, and open-source it, so you can dig into it, help us optimize it, and make it even better.

So the first case-- the single cam. Let's look at that quickly. The floating head bokeh effect, I call it. We're going to take a photo with face detect. We're going to make two copies, so we have a background and a foreground. We do some sort of fancy background effect and then paste that floating head back where it belongs.

Face detect is built into the camera2 API, and it's quite easy to implement in code. The first thing we want to do is check the camera characteristics to see if your camera device supports face detect. If it does, you find the mode you want-- there is OFF, SIMPLE, and FULL, depending on your camera device. Then, when we make our camera capture request, we just include that in the request. When we get our results, you can see if the mode was set and whether any faces were found. In this example, the first face it finds is the one I use. We could imagine expanding this to handle multiple faces. Just a note-- face detect grabs really just the face, so I bumped those bounds out a little bit.
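A small Kotlin sketch of that face-detect flow, assuming a CaptureRequest.Builder already wired to the session. The padding value is an arbitrary stand-in for bumping the bounds out a little, and the returned rectangle is in sensor coordinates, so it still needs mapping into the captured image.

```kotlin
import android.graphics.Rect
import android.hardware.camera2.CameraCharacteristics
import android.hardware.camera2.CameraMetadata
import android.hardware.camera2.CaptureRequest
import android.hardware.camera2.CaptureResult

// Turn on the best face detect mode the device offers.
fun enableFaceDetection(chars: CameraCharacteristics, builder: CaptureRequest.Builder) {
    val modes = chars.get(CameraCharacteristics.STATISTICS_INFO_AVAILABLE_FACE_DETECT_MODES)
        ?: return
    val mode = when {
        CameraMetadata.STATISTICS_FACE_DETECT_MODE_FULL in modes ->
            CameraMetadata.STATISTICS_FACE_DETECT_MODE_FULL
        CameraMetadata.STATISTICS_FACE_DETECT_MODE_SIMPLE in modes ->
            CameraMetadata.STATISTICS_FACE_DETECT_MODE_SIMPLE
        else -> return // only OFF is available
    }
    builder.set(CaptureRequest.STATISTICS_FACE_DETECT_MODE, mode)
}

// Read back the first detected face, padded a little so we keep more than just the face.
fun firstFaceBounds(result: CaptureResult, padding: Int = 60): Rect? {
    val face = result.get(CaptureResult.STATISTICS_FACES)?.firstOrNull() ?: return null
    return Rect(face.bounds).apply { inset(-padding, -padding) } // negative inset expands
}
```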
So it's more of a head getting chopped off. That sounds bad-- a head being pasted onto the background.

Let's talk about the fun background effects. You can do what you want here. I did a couple of things. First, using RenderScript, we just did a blur on the background. And because it's a multi-camera talk-- some cameras have a manual focus, so if you're working with multi-cam, you could capture the background with another camera and take it way out of focus. So you could actually do an optical blur, which would be kind of cool, and also save you that software step. In this demo, we also did a custom software sepia effect using RenderScript. But if you're using multi-cam, again, lots of cameras have built-in effects, like monochrome and sepia, that you can query and include in your capture request as well.

If you haven't used RenderScript before, it looks something like this. For our blur effect, we care most about the three middle lines. It's a built-in script, ScriptIntrinsicBlur. It's pretty handy, and it basically works out of the box. In this case, it blurred outside of the box, because the box is not blurry. This is a custom RenderScript script for the sepia effect. You can see in those first three lines, basically, we're taking the input red, green, and blue channels, kind of muting the colors, making them a bit yellow, and sending those to the output channels.

Okey dokey. So we've got the background, and it's got this cool bokeh effect on it. What do we do with the foreground? From face detect, we've got the face cut out. We just apply a PorterDuff with a linear gradient to make the edges a bit softer, so when we paste it on, it's not that harsh line. And ta-da-- paste it on, and things look pretty good.

There are a couple of optimizations. One you saw, which uses the GrabCut algorithm. This is built into OpenCV, the open computer vision library that we're also using for the depth map demo later on. Basically, I found the face, and then I chose a rectangle a bit larger to try to guess where the body might be. And then GrabCut does its best-- like the magic wand tool in your favorite photo editor-- to shrink that foreground region down to the actual foreground bounds. We could also, as I mentioned, add in multiple faces.

Now, the moment you've all been waiting for. Let's talk about dual cam bokeh with the depth map. We're going to use two cameras simultaneously, and we're going to create a depth map, which is the hard part-- I highlighted it in bold. But then we go ahead and use the same mechanism we already talked about.

Okey dokey. How does this work? First of all, the double capture. So this, on the left, is me hanging out with my pets at home. The left is the normal lens of the Pixel 3 front cameras, and the right is the wide angle shot. To do that, just as Oscar walked through, we set up multiple output configurations. So for each lens-- here, we have the preview surface as well as an image reader for the normal lens, and we use setPhysicalCameraId with the normal lens's ID. And we do the same thing for the wide angle lens. So we end up with four output configurations that we put into our session configuration. From there, it's just a matter of choosing our output targets for the capture. In this case, we want those photos so we can operate on them, so we say we want the image reader from the normal lens and the wide angle lens.

OK, so we have our images. Now we have to do a bunch of math and some magic and make that bokeh effect happen.
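For reference, the intrinsic-blur half of that slide might look something like this minimal Kotlin sketch. It assumes an ARGB_8888 source bitmap; the radius is an arbitrary value (ScriptIntrinsicBlur accepts radii up to 25), and the custom sepia script is left out.

```kotlin
import android.content.Context
import android.graphics.Bitmap
import android.renderscript.Allocation
import android.renderscript.Element
import android.renderscript.RenderScript
import android.renderscript.ScriptIntrinsicBlur

// Blur a bitmap with the built-in ScriptIntrinsicBlur kernel.
fun blur(context: Context, source: Bitmap, radius: Float = 20f): Bitmap {
    val output = Bitmap.createBitmap(source.width, source.height, Bitmap.Config.ARGB_8888)
    val rs = RenderScript.create(context)
    try {
        val input = Allocation.createFromBitmap(rs, source)
        val result = Allocation.createFromBitmap(rs, output)
        ScriptIntrinsicBlur.create(rs, Element.U8_4(rs)).apply {
            setRadius(radius) // blur radius must be in (0, 25]
            setInput(input)
            forEach(result)   // run the kernel over every pixel
        }
        result.copyTo(output)
    } finally {
        rs.destroy()
    }
    return output
}
```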
I want to give a brief introduction to stereo vision before we get into all the code. But I have to say, working on these slides, I got a little bit bored. I like geometry, but you know, it's a lot of letters. And I started asking myself, what does P stand for anyway? Obviously, it's a pile of chocolate. P stands for pile of chocolate. And this is what we're going to be focusing on for the rest of this demo. And you know, camera one is a little bit boring, camera two. So S here, we're going to replace with a shark. This is my friend, Pepper the Shark. And H is Hippo. So these are our helpers that are going to help us talk about stereo vision. The left camera, the normal lens, is Pepper the Shark. The wide angle lens is Susie Loo, the couch hippopotamus. And they're both zeroing in on that big pile of chocolate. And already, it's a lot more fun, I hope you agree.

So those skewed rectangles there-- that's the 2D surface. That's the image that the cameras are going to capture. In other words, the 2D representation of that real live 3D object we have. Let's take a look at what that looks like. The shark's-eye view is right in there on the almonds, sea salt, and dark chocolate, whereas the hippo cam is focused in on the raspberry crunch. So they're both seeing the same 3D object, but they each have their own 2D representation. What we really want to do is take their separate views and combine them, so we get a little bit more information than that 2D view and can create a great depth map. So we have, again, the normal view and the wide angle view-- well, in this case, they're both normal-- but when the left-hand and the right-hand views overlay on each other, you get that kind of 3D ruler effect from elementary school that I hope you got to enjoy as a child. And from there, we can create a depth map, which allows you to do really cool things like awesome bokeh effects, as well as know how far away the chocolate is so that you can reach out and grab it, obviously.

Okey dokey. So those two cameras, those two pictures, are at a different orientation from each other, and they're separated in space. So we need to get those on top of each other. This is what we call the camera extrinsics-- how the two cameras relate to each other. We need to rotate and translate each of those images so they appear on top of each other. Normally, we give the rotation and translation parameters for a camera in relation to the world. So instead of Camera 1 to World, we'll have Shark to World and Hippo to World. But when we're doing stereo vision, what we really need to worry about is Shark to Hippo. So how are these two cameras related to each other? Like a good engineer, all I know is I have to flip Hippo to World around to be World to Hippo, and now I have this pathway from Shark to World to Hippo.

I hope that was a fun introduction to the math, which you can read all about on Wikipedia, and it looks something like this. To get the relative rotation matrix, we invert the rotation matrix for Camera 2 and multiply it with Camera 1's. And for the translation, it's something like this-- take the difference of the translations and apply that same inverted rotation. You can read all about it on Wikipedia or other sources.

So one thing I want to point out, if you're working on this yourself, is the translation vector for a Pixel 3 from the normal camera to the wide camera. This is what I got out. What do you notice about it? The 9 millimeter separation between the cameras looks just about right.
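In symbols, under one common convention (each camera's pose R_i, T_i maps camera coordinates into world coordinates, X_w = R_i X_i + T_i), the relative pose from camera 1 to camera 2 works out as below, which matches the "invert Camera 2 and multiply by Camera 1" description; a different pose convention flips which matrix gets inverted.

```latex
% A world point seen by both cameras satisfies X_w = R_1 X_1 + T_1 = R_2 X_2 + T_2
\[
  X_2 = R_2^{-1}\bigl(R_1 X_1 + T_1 - T_2\bigr)
  \quad\Longrightarrow\quad
  R_{1\to 2} = R_2^{-1} R_1,
  \qquad
  T_{1\to 2} = R_2^{-1}\,(T_1 - T_2).
\]
```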
If you look at the phone, you know there's a good-- what's the American unit? Anyway, there's a good nine millimeters between those cameras. That makes perfect sense. But what I didn't notice, and what cost me about a week of time, is that it's in the y-coordinate. So the cameras are effectively on top of each other. While I was working with this phone, looking at the two cameras beside each other, I just assumed they were horizontally displaced. No big deal, except that the depth map function I'm using assumes that they're going to be beside each other-- it assumes horizontal displacement. Oh, I didn't say the important part-- camera sensors are often oriented for landscape, which makes sense. If you get this wrong, your depth maps don't work, you pull your hair out, and you have a great week like I did. Anyway, just a note if you're implementing this.

So we have the camera extrinsics-- how we get the pictures from the two cameras on top of each other, how they relate to each other. Camera intrinsics are properties of the cameras themselves. We know we have a normal lens and a wide angle lens, and they have different properties. There are two things. One is the camera characteristics-- things like the focal length, the principal axis, and whether that axis is skewed for some reason. These often appear as a three-by-three matrix. And distortion-- with the wide angle lens, or any wide angle lens, you're going to get a little bit of distortion going on, especially near the edges, that we need to consider as we map the two images to each other.

Another note-- we're going to use the intrinsic distortion properties of the lens to undistort the image. But as Oscar told us, by default, the camera undistorts the image for us. So we would be undistorting it and then re-undistorting it, which means we'd actually be distorting it, which is bad news. So we actually need to turn off the distortion correction if we want to do depth maps. That's easy enough with our capture requests-- we just make sure that the distortion correction mode is OFF.

Okey dokey. So here are the four things-- rotation, translation, the camera characteristics matrix, and the lens distortion. How do you get these properties? It's pretty easy. You just take an entire afternoon, print out a checkerboard sheet, or-- has anyone in this room done this before? It's called camera-- yeah? It's fun, right? It's great-- camera calibration. You take a whole series of shots with both cameras, you run a bunch of algorithms, and you figure out these four camera characteristics. And from there, you can go ahead and start making depth maps from the cameras. You can tell from my cheerful face, it's not actually that fun. Don't do it. It's no good. Luckily, in the camera2 multi-camera APIs, we have these great fields-- rotation, translation, calibration, and distortion-- so you can get it all straight out of the API, which is wonderful.

I'm going to give you a few notes if you're implementing this yourself. The camera characteristics-- the focal length and the axis information-- come as five parameters. This is in the Android documentation, but to create that three-by-three matrix, you just have to follow the documentation and plug in the numbers. Another thing that might throw you off is that the distortion coefficients, again, are five values, but the OpenCV library uses them in a different order than the values you get out of the API. So you just need to know that it goes 0, 1, 3, 4, 2.
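A minimal Kotlin sketch of those two notes, assuming the API 28 physical-camera characteristics: build the three-by-three camera matrix from the five intrinsic calibration values, and reorder the five distortion coefficients into the order OpenCV expects. The pose rotation (a quaternion) and pose translation come from the same characteristics object but still need converting before they feed the equations above.

```kotlin
import android.hardware.camera2.CameraCharacteristics

// Returns (3x3 camera matrix in row-major order, distortion coefficients in OpenCV order),
// or null if the device does not report calibration for this camera.
fun intrinsicsForOpenCv(chars: CameraCharacteristics): Pair<FloatArray, FloatArray>? {
    val i = chars.get(CameraCharacteristics.LENS_INTRINSIC_CALIBRATION) ?: return null
    val d = chars.get(CameraCharacteristics.LENS_DISTORTION) ?: return null
    val (fx, fy, cx, cy, skew) = i.toList() // documented order: f_x, f_y, c_x, c_y, s
    val cameraMatrix = floatArrayOf(
        fx, skew, cx,
        0f, fy,   cy,
        0f, 0f,   1f)
    // Android reports [k1, k2, k3, p1, p2]; OpenCV wants [k1, k2, p1, p2, k3],
    // hence the 0, 1, 3, 4, 2 reordering mentioned above.
    val distCoeffs = floatArrayOf(d[0], d[1], d[3], d[4], d[2])
    return cameraMatrix to distCoeffs
}
```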
The good news is that if you use them in the 0, 1, 2, 3, 4 order, when you undistort your images, they look like they've been in a whirlpool, so you're sure something's wrong with those coefficients.

Anyway, once we have all those parameters, we can go ahead and start preparing our images for a depth map comparison. This is me in my kitchen. And I don't know if you can see from there, but if you look at the ceiling, you'll notice there's kind of a curve going down. We don't live in a fun house-- it's the distortion effect we were talking about, from the wide angle lens with the distortion correction off. As well, when you're comparing two images, the straight lines-- well, and the curved lines-- need to line up in each of the images when you're making a depth map. We call that rectifying, and we use the camera characteristics to do it. That's just showing the bent roof.

All of these functions are in OpenCV, the open computer vision library. The first one is stereoRectify. This gets us a set of parameters we can use to perform these calculations. We pass in the values we got from the API-- the camera matrix and the distortion coefficients-- along with the rotation and translation that we calculated before. We get these parameters out, and we call initUndistortRectifyMap, which creates a map telling us how to take two images from these two different cameras and map them onto each other. And the remap function does just that.

So let's see what that gives us. Here is, on the left again, the normal front cam of the Pixel 3, and on the right, the wide angle lens of the Pixel 3. You can see they look pretty good. The shark lines are lined up. The crop is about right-- you know, the wide angle has a lot more crop region, and that's all lined up. The roof lines and the door lines are straight. There's no wacky distortion. And actually, I'd say, from where you're sitting, you probably have to look closely to notice that the left-hand picture is a little bit closer to the left-hand side of the frame. So they're actually offset by a little bit, which is just about what you'd expect if you had two cameras 9 millimeters apart.

So we've got the images, we've undistorted them, and we've rectified them. We're very close to creating the depth maps. All we have to do is call the depth map function. We use StereoBM or StereoSGBM-- one has a few more parameters than the other. And when you get to play with the open-source demo, you can see how these parameters work, play around with them, optimize them, commit your changes, and help make that app better. We call compute and make this depth map. And when you do that, you'll get an amazing photo-- something like this. Actually, sometimes it looks a lot better than that. But anyway, this isn't quite what we want to work with. What we really want to do is filter that, in this case using a weighted least squares filter, which smooths it out and gives us a little bit more useful depth map. So the darker pixels, as we saw in the demo, are the ones farther back, and the whiter pixels are the closer ones. It's probably a little hard to see, but the shark's snout and the hippo's snout are a little bit grayed out, so it's actually working to some extent there. This is how we call the filter. It's also included in the OpenCV libraries, in the contrib modules. It's all open source, and it's really cool. When you get a depth map that is perfect, it's exhilarating.

OK, here we have our depth map. What do we do with it?
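As a rough illustration of the matching step, here is a Kotlin sketch using OpenCV's Java bindings. It assumes the two frames are already undistorted, rectified RGBA Mats of the same size; the StereoSGBM parameters are placeholder values you would tune, and the weighted least squares filter from the contrib modules is left out.

```kotlin
import org.opencv.calib3d.StereoSGBM
import org.opencv.core.Core
import org.opencv.core.CvType
import org.opencv.core.Mat
import org.opencv.imgproc.Imgproc

// Compute a rough disparity (depth) map from two rectified frames.
fun disparityMap(rectifiedLeft: Mat, rectifiedRight: Mat): Mat {
    val grayLeft = Mat()
    val grayRight = Mat()
    Imgproc.cvtColor(rectifiedLeft, grayLeft, Imgproc.COLOR_RGBA2GRAY)
    Imgproc.cvtColor(rectifiedRight, grayRight, Imgproc.COLOR_RGBA2GRAY)

    // numDisparities must be a multiple of 16 and blockSize odd; both are tuning knobs.
    val matcher = StereoSGBM.create(0, 64, 7)
    val disparity = Mat()
    matcher.compute(grayLeft, grayRight, disparity)

    // Scale the 16-bit disparity into an 8-bit image we can use as a mask later.
    val disparity8 = Mat()
    Core.normalize(disparity, disparity8, 0.0, 255.0, Core.NORM_MINMAX, CvType.CV_8U)
    return disparity8
}
```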
So we can just apply this depth map as a mask on top of the image. The black areas we want to fade out, and we want to highlight the foreground. That's pretty easy to do with a PorterDuff. And the result is something like this. So indeed, the foreground is more present, and the background is faded out. Personally, I have high standards. I see a translucent floating shark over my shoulder, my face is a little bit grayed out, and my eyeball's missing. So I'm going to put another big red X through this and say, not quite good enough. It's a good start.

But what we really want is a depth map more like this. So we're going to put a hard threshold on the depth map and decide foreground, background, and that's it. In other apps, you may want to do something similar but maybe not with such a harsh distinction-- it could be a smoother curve. To do that, we can use the OpenCV function threshold. We give it some cutoff value-- for the app, it's somewhere around 80 to 140 out of 255-- and that's just the limit where something is considered foreground or background. I wanted to note this in case you're implementing any of this: when we apply the mask like I showed you, you actually need to turn those black pixels into transparent pixels. So this function just goes through and converts all the black ones to transparency. And here we go-- we're almost there. I wanted to note one thing on this slide. In the middle picture, you can see my eye is a bit blacked out. Just remember that for three more slides or so.

So we have our initial picture, we've got our depth map, and we do this hard threshold on it. And we can again create our background just like we did in the first demo, blur it out, monochrome it, and cut out that foreground. We have all the pieces we need to paste it on. And this is our amazing, final portrait shot, which is pretty good. I'm proud of it.

So let's talk about an optimization. Remember that eyeball thing I was talking about? Anything kind of gleaming and shiny can get messed up in this current iteration of the application, or bright lights can throw off the depth map generation. So I did one optimization: we have the face detect region, and I'm pretty sure I want the face in the foreground. So I just hard cut it in and said, anything on the face is going to be in the foreground. That protected my teeth and my eye from that masking-out effect. I don't know if you noticed-- can I go back-- my fuzzy red hair and the red couch-- there we go-- they kind of blend in. And so I'm thinking we could possibly use GrabCut to do a little bit better job of figuring out exactly what's in the foreground.

So thanks a lot. We really hope that this gave you a bit of a deep dive into using camera2 and the multi-camera APIs, and gave you some exciting creative ideas. We really want to hear your ideas, and we really want to see them in your apps. We also want to know what features you're looking for. We think they're great, and we want to keep pushing the camera ecosystem forward, doing more and more stuff, really, ecosystem wide. Thanks so much again. And please do come to the sandbox-- the camera sandbox-- if you want to ask us any questions, if you want any follow-ups, or if you want to try this app and see if it works. And look for it soon, open source. Thanks a lot.

[APPLAUSE]

[MUSIC PLAYING]
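Closing the loop on the thresholding and masking step described above, a Kotlin sketch might look like this. It assumes the 8-bit depth map has already been resized to the photo's dimensions; the cutoff default is one arbitrary value from the 80-to-140 range mentioned in the talk, and the channel-merge trick is one way to make the black pixels transparent before the PorterDuff composite.

```kotlin
import android.graphics.Bitmap
import android.graphics.Canvas
import android.graphics.Paint
import android.graphics.PorterDuff
import android.graphics.PorterDuffXfermode
import org.opencv.android.Utils
import org.opencv.core.Core
import org.opencv.core.Mat
import org.opencv.imgproc.Imgproc

// Hard-threshold the depth map into a foreground mask, then keep only the
// foreground pixels of the photo with a PorterDuff DST_IN composite.
fun cutOutForeground(photo: Bitmap, depthMap8Bit: Mat, cutoff: Double = 120.0): Bitmap {
    // White (255) where the depth map says "close", black (0) everywhere else.
    val mask = Mat()
    Imgproc.threshold(depthMap8Bit, mask, cutoff, 255.0, Imgproc.THRESH_BINARY)

    // Reuse the mask as the alpha channel so background pixels end up transparent rather
    // than opaque black; DST_IN keeps the destination only where the source alpha is set.
    val rgbaMask = Mat()
    Core.merge(listOf(mask, mask, mask, mask), rgbaMask)
    val maskBitmap = Bitmap.createBitmap(photo.width, photo.height, Bitmap.Config.ARGB_8888)
    Utils.matToBitmap(rgbaMask, maskBitmap)

    val result = photo.copy(Bitmap.Config.ARGB_8888, true)
    Canvas(result).drawBitmap(maskBitmap, 0f, 0f,
        Paint().apply { xfermode = PorterDuffXfermode(PorterDuff.Mode.DST_IN) })
    return result
}
```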
Info
Channel: Android Developers
Views: 15,075
Id: u38wOv2a_dA
Length: 34min 27sec (2067 seconds)
Published: Thu Nov 08 2018