Building a Realtime Video Processor with Swift and Metal

Captions
Hi everybody, I'm Ben, and today I'm going to talk about building a real-time video processor with Swift and Metal. Briefly, about me: I'm one half of the team behind two popular iPhone photography apps. Spectre generates real-time long exposures, so we use Metal to blend multiple frames together to create a long exposure handheld. Halide is a manual photography app that, in addition to supporting RAW, has a depth mode similar to portrait mode in the first-party camera; there we combine multiple video sources in real time to give you a live preview. I'll return to that a few times during the talk as an example of the concepts we'll cover.

I'm going to structure this talk in three sections: first the architecture of a real-time video renderer, then some of the performance bottlenecks and how to get around them, and finally some of the nuances of working with image data. Those nuances are usually handled for you by a high-level framework like Core Image or Core Graphics, but once you get your hands dirty you have to worry about them yourself. I have a lot to cover in half an hour, so let's dive right into architecture.

I'm sure everyone here would love for me to spend the half hour reading the AVFoundation documentation line by line, which would be just a wonderful way to spend your night, but I'm only going to talk about it at a high level. There's a delegate method in AVFoundation that you hook into to say "give me the frames as they come off the camera," and that's where things get interesting. Inside that delegate method you kick off your render loop: every time a frame arrives you update your internal state, which, depending on what your app does, could be fairly complicated, and then you dispatch commands to the GPU, which execute asynchronously. Those commands might apply filter effects or other operations to the image, but ultimately you either render the result on screen or hand it to an AVAssetWriter to write out to disk. Most of the work on the logic side, updating your state, is done in Swift, and everything on the GPU side is done in shaders. If you've ever done game programming, this loop probably looks very familiar: it's almost identical to a game loop, where a timer fires sixty times a second, you process the queued-up inputs, and you dispatch the GPU to render.
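Here is a minimal sketch of that capture-and-render skeleton, assuming a default camera and a Metal-capable device; the class name CameraProcessor and the numbered placeholder comments are illustrative, not the talk's actual code.

```swift
import AVFoundation
import Metal

final class CameraProcessor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()
    private let frameQueue = DispatchQueue(label: "camera.frames")
    private let device = MTLCreateSystemDefaultDevice()!
    private lazy var commandQueue = device.makeCommandQueue()!

    func start() throws {
        guard let camera = AVCaptureDevice.default(for: .video) else { return }
        session.addInput(try AVCaptureDeviceInput(device: camera))
        videoOutput.setSampleBufferDelegate(self, queue: frameQueue)
        session.addOutput(videoOutput)
        session.startRunning()
    }

    // Called once per frame as it comes off the camera.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer),
              let commandBuffer = commandQueue.makeCommandBuffer() else { return }

        // 1. Update internal state (filter parameters, blend weights, and so on).
        // 2. Encode compute/render passes against `pixelBuffer`; they run asynchronously.
        _ = pixelBuffer
        commandBuffer.addCompletedHandler { _ in
            // 3. Recycle per-frame resources once the GPU is finished with them.
        }
        commandBuffer.commit()   // never block here; the next frame is already on its way
    }
}
```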
I don't think a lot of people here have much GPU experience, so I'm going to spend some time on it. There are two tools available through Metal. The first, which I'll spend most of my time on, is the compute pipeline. It's a fairly recent idea, roughly the last ten years, and the premise is that you can use the GPU as a general-purpose computing platform; if a GPU isn't being used for image editing or image effects, it's probably mining Bitcoin. Any problem that is massively parallelizable and involves matrices lends itself really well to GPU processing. The second pipeline, and the original reason GPUs were invented, is for 3D games: there's a separate path that's optimized for rendering 3D content, but it's a little more complicated. Let's start with the compute pipeline, because it's the simplest to understand.

A shader is a tiny program that runs on the GPU massively in parallel; we're talking hundreds of threads executing at once, and in an image processor you're probably running one thread per pixel of your image. The language is based on a subset of C++11, and Apple calls it the Metal Shading Language; a couple of annotations make it look a little different from normal C++, which I'll come back to in a moment. All the "hello world" shader does is take the bitmap that comes in, which I call the input texture, read its color value into a half-precision float, and write it to the output texture. Notice I'm referring to these as textures, which is a very GPU-specific term. You can think of a texture as a super-powerful abstraction around a bitmap: with a bitmap you read color values one by one out of a blob of data, but once you upload it to the GPU as a texture you get hardware-accelerated operations. For example, if you sample at a coordinate that falls between two pixels, the GPU can filter and blend them for you, so you get nice smooth transitions. There are lots of tools like that which I won't go into.

So how do you kick off that shader to do your work? Over on the Swift side, here's a sample of the Metal code you'll write (I've cut out a lot of the boilerplate setup). At the start of your app you look up the functions you've written in .metal files; Xcode compiles them for you and makes them accessible in your app bundle. You ask for that "hello world" function and stash it in a compute pipeline state, and you do this at launch because it can take a few milliseconds to load. Then at render time you set up a command encoder, set it to that pipeline state, and bind values to the function, in this case with setTexture, though you can also set buffers, bytes, samplers, and all sorts of other parameter types. You imperatively say "I want this value in this slot of my shader," and then you call dispatchThreadgroups, which kicks the work off to the GPU. That double-bracket syntax we saw earlier is an annotation that tells the compiler what you expect to be slotted into each parameter, so when you call setTexture with index 0, it knows where to plug it in.

That's the simpler option. The 3D pipeline is a bit more complicated, with multiple stages, and I'm not going to go into it because honestly it's the compute pipeline with more steps. So why would you ever want it? If you look at Apple's documentation, it tells you explicitly that the render pipeline, the 3D pipeline, is the fastest and most efficient way to display something on screen: it can take various shortcuts when it knows you're not going to read values back from a texture later. So if you're rendering something on screen, it's highly recommended that the final step use the render pipeline. Fortunately that's really easy even for a 2D image: you render the world's most boring 3D scene. You take four vertices, cover the viewport with a flat plane, and texture-map your image onto that plane.
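As a self-contained sketch of that flow, here is a hello-world compute pass in Swift. To keep it runnable without an Xcode target, the kernel is compiled from a source string with makeLibrary(source:options:) instead of from a .metal file in the bundle as in the talk; the kernel name copyKernel, the 16x16 threadgroup size, and the texture slots are all illustrative.

```swift
import Metal

// A minimal "hello world" kernel: read a pixel, write it straight back out.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void copyKernel(texture2d<half, access::read>  inTexture  [[texture(0)]],
                       texture2d<half, access::write> outTexture [[texture(1)]],
                       uint2 gid [[thread_position_in_grid]]) {
    half4 color = inTexture.read(gid);
    outTexture.write(color, gid);
}
"""

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!

// Done once at launch: building a pipeline state can take a few milliseconds.
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let copyFunction = library.makeFunction(name: "copyKernel")!
let pipelineState = try! device.makeComputePipelineState(function: copyFunction)

// Done every frame, given already-created input and output textures.
func encodeCopy(from inTexture: MTLTexture, to outTexture: MTLTexture) {
    let commandBuffer = commandQueue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipelineState)
    encoder.setTexture(inTexture, index: 0)    // matches [[texture(0)]]
    encoder.setTexture(outTexture, index: 1)   // matches [[texture(1)]]

    // One thread per pixel, grouped into 16x16 tiles.
    let threadsPerGroup = MTLSize(width: 16, height: 16, depth: 1)
    let groups = MTLSize(width: (inTexture.width + 15) / 16,
                         height: (inTexture.height + 15) / 16,
                         depth: 1)
    encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
    encoder.endEncoding()
    commandBuffer.commit()   // asynchronous: do not wait on it
}
```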
So when would you want to use each? As I said a moment ago, the render pipeline is the most efficient, but there's one other situation worth understanding. Say you wanted to render a UI like the group FaceTime window, where you have multiple planes and some of them occlude others. If you use the 3D pipeline and submit multiple flat planes, it's smart enough to know that the corner of this image is covered by that one, and it will discard that geometry rather than do what's known as overdraw: rendering pixels that never actually get displayed to the user, which is just wasted resources. That's all handled on the hardware side; you submit the geometry, say "render it," and the GPU skips what's hidden. So if you're building something that looks a lot like UIKit, you probably want the 3D pipeline in the end, but if you're doing straight-up image filtering, you probably want the compute pipeline.

I'm giving this talk at SLUG, so let's talk about the things in Swift that are really nice. The first one, which you may already know from working with Objective-C, is that the interoperability with C is awesome. The way you communicate with shaders and pass larger bundles of data back and forth is that instead of forty individual parameters, you wrap your parameters in a struct and send that down. The arrangement of the struct matters, because you're just sending a buffer of bytes. One caveat: if you maintain separate struct definitions in Swift-land and C-land and you add a field to one but not the other, everything falls apart. So what I do is create a header file that defines the structures used by both Metal and Swift. Swift is happy to instantiate that C struct directly, its layout is identical to what I pass to the GPU, and the two stay in sync. I instantiate it just like any other Swift struct, fill in the parameters I want to pass down, bind it at index 1 in the shader, and it all just works. That syntax is pretty gnarly and imperative, though, so with Swift 5.1's function builders I've started playing around with building a DSL, a bit like SwiftUI, where I can declaratively define how my different parameters get passed down into the shaders. It does nothing other than make it much clearer what's going on as I pass data back and forth.
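A sketch of that parameter-passing pattern follows. The FilterUniforms struct and its fields are hypothetical; in the real setup the struct is declared once in a shared C header imported by both the .metal file and the Swift target, rather than redeclared in Swift as it is here.

```swift
import Metal

// Hypothetical uniforms. In practice this lives in a shared C header so the
// CPU-side and GPU-side layouts can never drift apart.
struct FilterUniforms {
    var blurRadius: Float
    var intensity: Float
    var tint: SIMD4<Float>
}

func encodeFilter(encoder: MTLComputeCommandEncoder,
                  input: MTLTexture, output: MTLTexture) {
    var uniforms = FilterUniforms(blurRadius: 8,
                                  intensity: 0.5,
                                  tint: SIMD4<Float>(1, 0.9, 0.8, 1))

    encoder.setTexture(input, index: 0)
    encoder.setTexture(output, index: 1)
    // Copies the struct's raw bytes into buffer slot 1, matching a
    // `constant FilterUniforms &uniforms [[buffer(1)]]` parameter on the shader side.
    encoder.setBytes(&uniforms, length: MemoryLayout<FilterUniforms>.stride, index: 1)
}
```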
One last thing I want to touch on segues into performance. As I said earlier, everything here is asynchronous: you encode commands for the GPU, they're sent off, and they'll probably execute a few milliseconds later. You don't want to block when you submit them. There is an API that says "execute this and wait for it to return," but you don't want to use it, because you're constantly receiving frames in this loop and you never want to block it. It's also your responsibility, when you're done producing a frame, to recycle your resources. Think of it almost like UITableView cell recycling, but with much worse consequences. How much memory does a 12-megapixel image take up? It isn't 12 megabytes: with four image channels (RGBA) it comes out to roughly 48 megabytes of memory. So it's very easy to run out of resources, and don't even try to allocate that much in 16 milliseconds on every frame. You have to be extremely conservative with resources, and that is the main pain point in performance.

Here's one example where we run into a problem with our portrait mode. The depth data that comes to us is unbounded (it's in meters), and I need to clamp it between zero and one to use it as part of my filtering effect. To do that I need to find the minimum and maximum values across an image of some 10,000 pixels, and I have to do it in 30 milliseconds; if I were running at 60 frames per second it would be 16 milliseconds. So I have to process a lot of data in a very short period of time.

Now, if you were to ask a game developer how to make a program fast, they'd tell you the CPU has almost nothing to do with it. Thanks to Moore's law, CPUs are plenty fast; they're great, they're amazing. Memory, on the other hand, has not gotten much faster; it's been mostly flat for a couple of decades. So in all likelihood, when you have a local performance problem, it has to do with memory access. The way hardware usually hides this is that modern computer architectures lay memory out in a hierarchy. You have main memory, around 1.8 gigabytes on this hardware, but right next to the CPU is the L1 cache, a tiny bit of memory located on the CPU die that is blazing fast. If your data doesn't all fit there, it spills into an L2 cache, which is larger and slower; as that fills up, data spills into L3, and some hardware has an L4. It's a hierarchy of memory that gets faster but smaller the closer you get to the CPU.

I'm going to borrow a demonstration from Scott Meyers, a noteworthy C++ expert who has written some great books and talks about memory access speeds in a few of his talks (I'll link to them later). Imagine an L1 access takes about a second. An L2 access is several times that, L3 several times more, and if you have to reach out to main memory on a cache miss you might as well go grab a coffee, check Twitter, do your taxes, and browse some image macros, because it's going to take a really long time. To put numbers on it, using what I believe is an iPhone 7 as the baseline, since that's the majority of hardware out there: you can expect 64 kilobytes (that's with a K) of L1 cache accessible in about 4 cycles, 1 megabyte of L2 at about 10 cycles, 4 megabytes of L3 at about 40 cycles, and main memory around 30 times slower than L1. There's more you can think about, like cache lines and prefetching and instruction caches, but I'm simplifying to get the idea across: most performance work comes down to using as little memory as possible and arranging your data in ways that work with these caching facilities.

For more information, check out Scott Meyers, and also Mike Acton, the former engine director at Insomniac Games (they made that awesome Spider-Man game last year). They talk a lot about data-oriented design, which is about structuring your app so that the hardware can basically run in a straight line and access data as fast as possible. That usually means arrays, or arrays of structs: arranging everything contiguously in memory based on your access patterns.
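As a toy illustration of that straight-line access pattern (not necessarily how the app solves the depth problem in production), here is the min/max-and-clamp pass over a depth map stored as one flat, contiguous buffer of floats; depthValues stands in for the real pixel data.

```swift
// Normalize a depth map stored as one contiguous buffer of Floats.
// The CPU streams through it cache line by cache line, twice.
func normalizeDepth(_ depthValues: inout [Float]) {
    guard var minDepth = depthValues.first else { return }
    var maxDepth = minDepth

    // First linear pass: find the range.
    for value in depthValues {
        if value < minDepth { minDepth = value }
        if value > maxDepth { maxDepth = value }
    }

    // Second linear pass: clamp everything into 0...1.
    let range = max(maxDepth - minDepth, .ulpOfOne)
    for i in depthValues.indices {
        depthValues[i] = (depthValues[i] - minDepth) / range
    }
}
```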
Let me give you one of the practical implications. Say you're building Stardew Valley and you have an entity in your game, a wheat patch: a grid of wheat that grows over time, so you'll want to increment its height. How would you represent this? If you're an object-oriented programmer, of course you start by designing a class called WheatPatch that probably inherits from ten other things in the hierarchy. Then you have a position property, which of course is a class, because you might want to subclass Position someday. You have a height property, but these are objects talking to each other now, and we'd never want to reach in there, violate the Law of Demeter, and modify it directly; no, you send it the grow message so it increments its own counter, and you have maximum flexibility, someday. That's enterprise game development, and it will slaughter your performance in most languages. For one, class instances tend to be thrown all over the heap, all over the place like glitter, so you can almost guarantee you're not going to get multiple instances paged in for free at a time. They also use dynamic dispatch by default, especially in Objective-C, so you're going to miss the instruction cache and stall while the method implementation is looked up. And just philosophically, you're looking at the problem in terms of isolated instances instead of batches of data that you're going to process all at once.

The reason Swift is awesome and compelling, on a par with C++, starts with that last section we talked about: explicit data layout. You can pack your structs to make use of every bit, in as small a space as possible, to fit as much as you can into your various cache levels. It uses static dispatch by default, so you'll hit the instruction cache more often. And just philosophically, it encourages you to look at the problem in terms of values instead of object abstractions, while still finding a nice compromise: you keep some of the nice message passing through protocols, but without the dynamism that costs a lot of performance.

So if the class hierarchy is the clichéd approach, here's about the most data-oriented design you can come up with: an array of arrays, where a value's position in the two arrays implicitly encodes its position in the grid, and of course we use a small unsigned integer to take up as little space as possible. Then, when you have to increment all the heights, your CPU can just run in a straight line and update them, which is great in a game loop. Even if you don't have to operate at that level, it behooves you to think about your data and how you're processing it, because really it's a way of thinking.
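A minimal sketch of that layout in Swift follows; WheatPatch and its field names are invented for illustration, and the talk's array of arrays is flattened here into a single buffer with computed indices, which keeps the same contiguous-memory idea.

```swift
// Data-oriented wheat patch: one flat, contiguous buffer of heights,
// where the index encodes the grid position (row * width + column).
struct WheatPatch {
    let width: Int
    let height: Int
    var stalkHeights: [UInt8]   // one byte per stalk

    init(width: Int, height: Int) {
        self.width = width
        self.height = height
        self.stalkHeights = Array(repeating: 0, count: width * height)
    }

    subscript(column: Int, row: Int) -> UInt8 {
        get { stalkHeights[row * width + column] }
        set { stalkHeights[row * width + column] = newValue }
    }

    // Called once per game tick: the CPU streams straight through the buffer.
    mutating func grow(by amount: UInt8 = 1) {
        for i in stalkHeights.indices {
            // Wrapping add keeps the sketch short; a real game would clamp
            // at some maximum height instead.
            stalkHeights[i] = stalkHeights[i] &+ amount
        }
    }
}
```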
So let me give you an example that doesn't involve code as much. For that portrait blur we actually do something a little different from a Gaussian blur, but let's say you have to apply a Gaussian blur. A Gaussian blur has a time complexity of O(n × r²), which basically means that when you want a really blurry image it's going to slaughter performance: the number of positions it has to sample just explodes as the radius grows to produce that beautiful blur. There are alternative approaches that approximate a Gaussian; some of them give you weird artifacts, which you can get away with for a background blur like UIKit's, but if you want something really beautiful, nothing quite beats a Gaussian. So what if we shrink the image to a quarter of its size, apply the Gaussian, and then upscale the blurred version? I challenge anyone in this room to tell the difference between these two images: the one on the left is downscale, blur, upscale, and the one on the right is the true Gaussian. It means we're processing a quarter the number of pixels, and the blur radius is about half what it would be at full size, so it's a win on both counts.

That transitions us into the third section, where I apologize in advance for some of the nuances you'll run into when you process this data. Keep it in the back of your head that memory is a constraint. So you get to this point: you have your shaders set up, the images are coming in, you have this image from the camera, and you expect the output to look like it. But when you go to render it, you end up with a red image that's missing a couple of color channels. That's weird; what's going on? It turns out the system is actually giving you two images for every frame of video: one is a grayscale, single-channel image, and the other is a two-channel image. This is known as YUV color encoding. The image is split into its brightness values and its color values; the UV refers to a position in a plane of color values. It's also sometimes called YCbCr, where Cb and Cr stand for the blue and red chroma components, so if you see the terms used interchangeably, they all mean the same thing.

Why split it up like this? I had to shrink this for the projector, and it might be too high-fidelity even at this size, but there's an optical illusion you might have seen on the internet; it went around Twitter a couple of months ago. It looks like all the kids in the photo have shirts of different colors, but when you zoom in, it turns out all the shirts are grayscale and a criss-cross pattern supplies the color. It works because our eyes are more sensitive to shifts in brightness than to shifts in color. That has an interesting implication: since the image is already separated, we can store the two-channel color image at a quarter of its size, and because chroma is two channels and luminosity is one, the total savings is about 50 percent. That's 50 percent less data to pass around, conserving bandwidth. The downside is that you have to convert back to RGB to display it on screen, and for whatever reason Apple doesn't include Metal sample code to do this, although recently they included it in ARKit (I have a link at the bottom). All you have to do is copy and paste that code; it's one matrix operation to get back to RGB. It isn't difficult; you can copy and paste the magic values, or go watch a college class online to understand them, but it's not hard.
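For reference, here is that matrix operation transcribed into Swift with simd, using the standard full-range BT.601 coefficients (Apple's ARKit sample shader applies an equivalent matrix per pixel on the GPU); the exact constants depend on the video range and color matrix your capture format reports, so treat these as illustrative.

```swift
import simd

// Convert one YCbCr sample (all components in 0...1, full range) to RGB.
func rgb(fromLuma y: Float, cb: Float, cr: Float) -> SIMD3<Float> {
    // Columns are the contributions of Y, Cb, and Cr respectively.
    let conversion = simd_float3x3(columns: (
        SIMD3<Float>(1.0,    1.0,     1.0),
        SIMD3<Float>(0.0,   -0.3441,  1.772),
        SIMD3<Float>(1.402, -0.7141,  0.0)
    ))
    // Chroma is stored centered around 0.5, so re-center it before the matrix.
    let ycbcr = SIMD3<Float>(y, cb - 0.5, cr - 0.5)
    return conversion * ycbcr   // clamp to 0...1 before display
}
```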
What is hard is that the system gives you these Metal textures, and by themselves they mean nothing: they're just blobs of data. They could be luminosity, they could be chroma, they could be RGB, they could be anything, so it's up to you to keep track of what each one means as you pass it through the different subsections of your render loop. The way I do it is to wrap the values in structs, to make sure the two textures always travel together as best buddies and that I know explicitly that this texture represents, say, RGB. That way, if I forget to implement support in a shader for one format or the other, it's caught at compile time.

All right, so we've got it displaying on screen, but now it turns out it's way too dark, or maybe it's displaying way too light. What's going on? You don't need to check your monitor. Imagine for a moment a perfectly linear gradient. You probably think that on a scale from zero to 100, or in this case 255, the 50 percent value sits right in the middle. In fact, if you wrote a loop that just output values linearly like that, it would look more like the gradient on top, and the reason is that our eyes are more sensitive to shifts in the shadows than in the highlights. The value 128, which you'd expect in the middle of a perceptually uniform gradient, actually sits way over on the right, and that has a very bad consequence, because you now have only those 128 values to represent everything below it. No one has ever been able to show this properly on a projector, so there's a link at the bottom you can view on your own monitor, and I've also blown out the gradient in Photoshop to really show it. When you have so few bits to work with, you end up with banding at the low end of the gradient, you lose detail in the shadows, and your photos look off.

This is the source of the bane of your existence: gamma correction. The idea is that when you take a photo with a camera, the values in the real world are linear, and before they're encoded into bytes or saved to disk they're lifted up using a gamma curve, so that more of the limited bits go where our eyes are most sensitive. When the image is displayed later, the inverse of that curve is applied so the output on the monitor looks correct. The reason this is the bane of your existence, and why I strongly recommend understanding gamma correction, is that if you do a calculation like blending two images 50/50 while you're in gamma space, your math is going to be totally off. Again Apple provides sample code, this time in a PDF, so it's a little harder to copy and paste: it's literally on the second-to-last page of the Metal Shading Language specification, for getting values into linear space. I don't have a type-safe solution for this one; it's just a matter of picking one space to work in and making sure everyone works in the same space. It's like asking a server person what to set the server clocks to: UTC, always UTC. Think of gamma versus linear the same way; everyone agree on one space and stick with it.
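For concreteness, here are the standard sRGB transfer functions written out in Swift; this is the same math the Metal Shading Language specification lists, though in a real pipeline you would usually do the conversion in the shader or let Metal handle it via an _srgb pixel format. The blend at the end just shows why the space you calculate in matters.

```swift
import Foundation

// Decode an sRGB-encoded (gamma) value in 0...1 to linear light.
func srgbToLinear(_ value: Double) -> Double {
    value <= 0.04045 ? value / 12.92 : pow((value + 0.055) / 1.055, 2.4)
}

// Encode a linear-light value in 0...1 back to sRGB for display.
func linearToSRGB(_ value: Double) -> Double {
    value <= 0.0031308 ? value * 12.92 : 1.055 * pow(value, 1 / 2.4) - 0.055
}

// Blending two pixels "50/50" in gamma space gives a different result
// than doing the same blend in linear light.
let a = 0.1, b = 0.9
let gammaBlend = (a + b) / 2                                     // 0.5
let linearBlend = linearToSRGB((srgbToLinear(a) + srgbToLinear(b)) / 2)
// linearBlend is roughly 0.66, noticeably different from the gamma-space 0.5
```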
Here's the last thing that will annoy you, and it's a more recent problem, from just the last few years. You go to render on screen, and if you have really sharp eyes you'll notice the red is slightly off from what you were expecting; or you won't notice, and you'll ship it to production. The reason starts with this horseshoe diagram, which represents the entire visual spectrum our eyes can see. For several decades we could only represent the part inside the inner triangle on computer monitors; that's known as the sRGB color space, and that was fine. But then Apple ruined everything by releasing displays that can show more, and now you have to keep track of which color space the images coming in through the system are in and which one you're going to output to the screen. So we just add a tag for color space to that same lightweight wrapper, along with various affordances and exhaustive enumerations to make sure every one of your methods supports both color spaces, and you should be all set. Again, you just pick one color space for everyone to work in, and everything should be fine. And with that, you end up with the final image.
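Here's a minimal sketch of what such a wrapper could look like; TaggedTexture, PixelEncoding, and WorkingColorSpace are hypothetical names, and the point is simply that the format and color space travel with the texture, and exhaustive switches make the compiler catch a forgotten case.

```swift
import Metal

// Hypothetical tags describing what a raw MTLTexture actually contains.
enum PixelEncoding {
    case rgb(MTLTexture)                                // one RGBA texture
    case ycbcr(luma: MTLTexture, chroma: MTLTexture)    // biplanar camera frame
}

enum WorkingColorSpace {
    case sRGB
    case displayP3
}

// The texture never travels alone: its meaning rides along with the data.
struct TaggedTexture {
    var encoding: PixelEncoding
    var colorSpace: WorkingColorSpace
}

func bind(_ input: TaggedTexture, to encoder: MTLComputeCommandEncoder) {
    // Exhaustive switches: forgetting to handle a format or a color space
    // becomes a compile-time error instead of a red frame in production.
    switch input.encoding {
    case .rgb(let texture):
        encoder.setTexture(texture, index: 0)
    case .ycbcr(let luma, let chroma):
        encoder.setTexture(luma, index: 0)
        encoder.setTexture(chroma, index: 1)
    }

    switch input.colorSpace {
    case .sRGB:
        break // no extra handling needed
    case .displayP3:
        break // pick the matching shader variant or drawable color space
    }
}
```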
So what did we cover today? One, the render loop. Two, shaders. Three, and I think the most relevant thing for everyone in this room, a data-oriented approach to system design, and why Swift is so powerful: you really don't have to accept the compromises you'd get in other systems. You can get great performance from a language that's perfectly suitable for game programming, and you still get all the niceties of something that isn't based on C++. And finally, I talked a little about keeping everything in order with YUV, gamma, and color spaces. That's everything I wanted to talk about, so let's take three questions; raise your hand and I'll come over with the mic.

Q: Did you have any experience with shaders from other game engines or languages before this, or was Metal your first take at it? Was it a good way in, and how confusing was it to pick up?

A: Good question. The alternative to Metal right now is OpenGL, plus Vulkan, a newer API from the same working group that created OpenGL. OpenGL was created back in the 1990s, before GPUs were even a thing in their modern form, and it's a terrible abstraction now; that's a big reason Apple invented Metal. I learned Metal and OpenGL in parallel, and I'm really glad Metal existed, because it did away with all the abstractions that no longer fit how GPUs work, so it's a great place to start. As for Vulkan, I haven't played with it much, but I've heard mixed things. If you were starting right now, I'd say dive into some Metal tutorials; it's much easier afterwards to go back and learn all the sharp edges around OpenGL.

Q: Have you come across any good tooling to determine how much of your memory is actually sitting in the different cache levels of the CPU?

A: I believe dtrace can do some of that, although I don't think you can run dtrace on iOS. Scott Meyers says you develop an intuition around these things: when you notice an issue like false sharing, you ask yourself, "am I accessing the same value from multiple threads?" I believe there are counter-based tools on the Mac, but I haven't run across one for iOS. Really, though, with a lot of this you can't show up at the eleventh hour with a performance problem and hope it's one bug you can fix; if you didn't structure everything in advance, you're already doomed. You have to be thinking about it from the design side.

Q: One last question. Would you prefer to build this in C++ over Swift, and what are the benefits of doing one or the other?

A: Good question. Swift doesn't really exist anywhere outside Apple platforms; let's be honest with ourselves. If you're really brave you can write it on the server side, but I wouldn't touch it on Android; it's just not there. So if I were doing something cross-platform, no question, I'd be using C++, and my heart would ache because I'd really wish I could use Swift. Also, especially with game programming, there's a much larger community around C++, so when you run into a problem there are way more people who can answer it with sample code. For certain shader effects I've gone back to OpenGL-adjacent resources like ShaderToy, which uses WebGL, and translated them in my head into the Metal Shading Language; you lose a lot of the community when you go down this path. So I can be super selfish and write everything in Swift, but if I were on a team or doing something cross-platform, I'd absolutely use C++.

Cool, awesome. Thank you. Thanks, Ben.
Info
Channel: Swift Language User Group
Views: 1,164
Rating: 5 out of 5
Id: rPZRVFLpCGM
Length: 32min 1sec (1921 seconds)
Published: Thu Dec 12 2019