Explain me Metal like I'm 5 - iOS Conf SG 2020

Video Statistics and Information

Captions
[Music] Great, hello everybody, my name is Andrey and I'm really proud to be here. Today we are going to speak about GPUs. I've been an iOS developer for seven years now, and I've barely touched UIKit throughout my career, because most of my time I spent developing game engines, and now we make apps full of machine learning and computer vision. So basically all my career I've been writing Metal shaders, and today I'm going to show you that it is not that hard, because I see a lot of people thinking that it is a really complex technology that is hard to approach, and all of those people are simply contributing to my imposter syndrome, because I know that it's not that hard, but everybody thinks it is. So today we are going to bust that myth.

But in order to get a better idea of what a GPU is and how it works, I think it's important to look back at the history of how rendering evolved and get some insight into why GPUs are built the way they are built now. We'll start with 1977, when the Atari 2600 was released, because it was the first notable system with separate hardware for graphics. It only had 128 bytes of RAM, and of course it could never fit a framebuffer in RAM, because that was all the memory the whole system had, so the graphics had to be generated literally in real time. The TVs back then worked the same way: they scanned the image with an electron beam, line by line, and once a portion of the image was scanned it stayed on the TV until the electron beam made its round trip back and redrew it. The problem was that the Atari 2600 could only display about five objects at one time, and that of course wasn't enough even for games in the 70s. So the way developers worked around it was to move objects as the TV was scanning the image: once the beam had passed an object, they moved it further down so it could be scanned again, and once the scan finished they returned the objects to their initial positions. And now we complain about the bugs Xcode has; back then it was way harder. In this way they could draw a lot of objects; this is a screenshot from Space Invaders, and there is actually a very nice book on this topic. Another fun thing is that they didn't have enough time to process inputs from the gamepad or compute the game logic, so they had to include fake scanlines where they didn't process the image and instead processed inputs. The Pitfall developers didn't even have enough time for that, so they had to create huge ground and huge trees, so they would have more time to compute what the player wants to do.

The next iteration was the NES. It introduced a Picture Processing Unit that worked with tiles. A tile is simply a small piece of memory that is very fast, and the whole image was processed as tiles. This provided additional opportunities, like collision detection, but it was very limited, and that's why all of the games from that era look a bit square-ish. After that came the first dedicated 3D hardware, but it was very expensive and mostly limited to professional simulators. The first GPU as we know it was only released in 1999 by Nvidia. It was called the GeForce 256; it was barely programmable, but it was very fast, and it was the very first piece of hardware you could buy and install in your PC to accelerate 3D graphics. In 2001 they released the GeForce 3 and basically introduced the concept of shaders, which we will be covering heavily today.
Right after the GeForce 3 was introduced, it became obvious that GPUs were going to take off in terms of performance, especially compared to CPUs, which fell behind. If you look at the trends, you'll see that the GFLOPS metric, a metric for evaluating how hardware performs, was growing something like six times per year, while the clock speed actually went down; that's because the number of transistors went up. From here we can see that GPUs are massively parallel machines and are best used to solve problems that are friendly to parallelism.

To understand why, we have to understand how rendering currently works. The modern rendering process is the key to understanding why the GPU is so performant. GPUs themselves are very limited in what they can do: they are very dumb hardware, they don't have an OS, all they can do is render triangles, lines and points, and they are highly optimized for floating-point operations. That's why GPUs are not used, for example, for font rendering: the precision is simply not enough. If you've ever played video games, then you know that all 3D objects consist of triangles, or, as they are called, polygons. In order to get them drawn on a screen you have to do two things: first, you have to put all those vertices somewhere, and then fill them with something. The positioning is done by the vertex shader. It is a very common but also a very simple shader: the only thing it does is transform every vertex from your model into screen space, so it takes the triangle from the model coordinate space and transforms it into your monitor's one. Once each triangle gets on the screen, rasterization comes into play: the rasterizer overlaps your triangle with a grid of pixels, chooses only the pixels that overlap with the triangle you have, and then calls the fragment shader for each of those pixels. The fragment shader is, again, a very small program that is launched for each pixel that ends up on the screen.

To better understand it, let's imagine that we draw two triangles that fill the whole screen and see how fragment shaders can help us draw something. The most basic fragment shader you can imagine just returns a solid color: if you return red, your screen will basically turn red and you will see the whole screen in the same color. But if you try to be a little bit smarter, you can calculate, for example, the distance to the center of the screen and use it as the red channel, and then you will see a gradual gradient. What else you can do is pass in a sine of time each frame, and then you will see an animation on the screen, because each frame the fragment shader will be drawing a different value. But what fragment shaders do most of the time is simply take a texture and put it at the needed position on the screen.

The idea of programmable images actually predates shaders. A guy called Paul Heckbert in 1984 launched a challenge where he proposed to write code that fits on a business card and can actually generate an image, and that's the first image he made. But nowadays people do amazing stuff writing only fragment shaders: all of those images are actually animated, and they don't use any image, any texture or any 3D model; all of them are generated with code, with mathematical formulas.
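To make that concrete, here is a minimal sketch of such a full-screen fragment shader in the Metal Shading Language, which the talk introduces later. It is only an illustration, not code from the talk: the function name and the time and screenSize parameters are assumptions.

```metal
#include <metal_stdlib>
using namespace metal;

// Shades every pixel of two full-screen triangles: the distance from the
// screen center drives the red channel, and a sine of time animates it.
fragment float4 gradient_fragment(float4 pixelPosition [[ position ]],
                                  constant float&  time       [[ buffer(0) ]],
                                  constant float2& screenSize [[ buffer(1) ]]) {
    float2 uv = pixelPosition.xy / screenSize;        // normalize to 0...1
    float distanceToCenter = length(uv - float2(0.5));
    float red = distanceToCenter * (0.5 + 0.5 * sin(time));
    return float4(red, 0.0, 0.0, 1.0);                // solid red would be float4(1, 0, 0, 1)
}
```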
You can get flames with only 66 lines of code using nothing but fragment shaders, and this is a really powerful technique that transformed the world as we know it. Today all of those fancy video games are basically powered by shader technology, where it is as simple as launching a small program per pixel and asking it to return the color for that pixel.

So what does Apple have to offer? That's the most interesting part for us, since we all work on Apple platforms. We've actually been blessed with very good GPU support since 2007, since the first iPhone was released: it supported OpenGL ES, and at that time it could render graphics comparable to the PlayStation 1. When the iPhone 4 was released, OpenGL ES 2 support landed and we saw beautiful games like Infinity Blade, and throughout that time a lot of frameworks and game engines were released that were essentially proxies to OpenGL, because OpenGL was kind of the only way to reach the GPU. When the iPhone 5S was released, OpenGL ES 3.0 was introduced, but nobody really cared, and there are multiple reasons for that. First of all, there is Vulkan. Vulkan was announced by Khronos, the committee behind OpenGL, in 2015, and Apple was originally in the working group. The important thing to understand is that OpenGL is not a library: it is a standard that vendors have to conform to in hardware and software, and Vulkan is a standard as well. And as we know, Apple doesn't like standards and doesn't like playing by other people's rules, so they decided to make their own API. It's called Metal, and this is what we are going to focus on today.

The key aspects of Metal: it is designed to have very low CPU overhead, and this is how it compares to, for example, OpenGL or even more high-level frameworks. It features all of the modern GPU features you can imagine, from tessellation to instanced rendering. It is designed to make expensive tasks happen less often, so you move redundant work into the initialization stage and then do less throughout the frame. It is very well suited for CPU/GPU parallelism: since these are separate physical devices, you can prepare your next frame on the CPU while the GPU renders the previous one. And it is a very thin API, which we will cover in detail today.

At the core of the Metal API is MTLDevice. It basically represents a single GPU. In order to get it, you usually call the MTLCreateSystemDefaultDevice() function, but it's not a singleton object. On iOS it effectively is, but if you use Metal on a Mac, Macs can have multiple GPUs, one for high-power processing and one for low power, and they can have external GPUs, so by deciding which MTLDevice you use, you define where the code will be executed. An important thing to understand about the Metal API is that it extensively uses the pattern called dependency injection: all of the objects are created out of other objects. The next key object is the command queue. The command queue is something you create out of the device: you just call a method called makeCommandQueue(), and this pair is often referred to as the Metal context. These are the two core objects you need to actually process something on a GPU. The command queue is a basic queue from computer science that you all know: you submit something there, and it will be executed in the order it was submitted. The important thing to understand is that you are not the only one submitting workloads there; Apple does it as well. For example, when you use MapKit, it submits things to the command queue to render maps, and other frameworks submit work from their core logic.
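As a minimal sketch in Swift, assuming nothing beyond the standard Metal framework, getting this context looks like the following.

```swift
import Metal

// The two core objects: the device (a single GPU) and its command queue.
// Together they are what the talk refers to as the Metal context.
guard let device = MTLCreateSystemDefaultDevice(),
      let commandQueue = device.makeCommandQueue() else {
    fatalError("Metal is not supported on this device")
}
// On macOS you could instead pick a specific GPU from MTLCopyAllDevices().
```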
So you have to remember that you are not alone, and you will probably need to wait for some time until your commands are executed. Those bags you see being submitted to the queue are actually command buffers. Command buffers are storage objects for commands: once you fill a command buffer, nothing actually happens; it only starts executing once it is committed. To create a command buffer you call the makeCommandBuffer() function, so you get a kind of empty package that you have to fill. What can we fill it with? There are multiple types of commands: render commands, blit commands and compute commands. Render commands are straightforward: you just render triangles, lines or points. Blit commands are for extra-fast copying from one texture to another, and compute commands are for general-purpose work; we'll see how that works later.

In order to put a command into the command buffer, you have to use special objects called encoders. For each command type there is a dedicated encoder that you create out of the command buffer. You see the pattern here: you create objects out of objects you already have, then you use the encoder and end it before you can create another one. For a render command it looks like this: we take the empty package we just got from the command queue and create a dedicated encoder for it. The process of creating a render command is very close to crafting in video games: you have slots where you can put something, and then you just ask the encoder to bake something out of it. But one ingredient always has to be there, and it's the pipeline state. The pipeline state represents the state the GPU needs to be in for the current command. It is composed out of the shader functions we covered earlier, and each pipeline state has its own optional parameters, for example textures or maybe some constants as well. Usually you create the pipeline state once in the initialization stage of your app and save it for later reuse. To create it, you create a descriptor, assign a vertex function and a fragment function, which you obtain from a library, and then you ask the device to compile the pipeline state for you and store it somewhere, usually in a property. Once you have it, you just put it into the first slot by calling setRenderPipelineState, and this is just the first ingredient you will need. After that you can add, for example, the geometry, the actual vertices of your 3D model, and then maybe uniforms or a texture to cover your model with, using dedicated methods like setVertexBuffer or setFragmentBuffer. And once everything is in place, you just call drawPrimitives. What really happens is that your package gets filled with one command; nothing actually gets drawn on the screen until the package is submitted to the queue. So you can proceed further, creating more commands: assign another state or another geometry, call drawPrimitives once again, and boom, you have another command. Once you are finished filling your buffer with any kind of commands, and you can interchange the types of commands, first encoding a render command, then a blit command, then a compute one, in any order you want, you have to call endEncoding on the encoder to indicate that you are done filling this package.
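Here is a hedged sketch of that crafting flow in Swift, using plain Metal calls. The shader names "vertexMain" and "fragmentMain", the vertex buffer, and the render pass descriptor (normally taken from an MTKView) are assumptions, not code from the talk.

```swift
import Metal

// Done once at initialization and stored in a property for reuse.
func makePipelineState(device: MTLDevice) throws -> MTLRenderPipelineState {
    let library = device.makeDefaultLibrary()!
    let descriptor = MTLRenderPipelineDescriptor()
    descriptor.vertexFunction = library.makeFunction(name: "vertexMain")
    descriptor.fragmentFunction = library.makeFunction(name: "fragmentMain")
    descriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
    return try device.makeRenderPipelineState(descriptor: descriptor)
}

// Per frame: fill an empty "package" (command buffer) with one render command.
func encodeFrame(commandQueue: MTLCommandQueue,
                 pipelineState: MTLRenderPipelineState,
                 renderPassDescriptor: MTLRenderPassDescriptor,
                 vertexBuffer: MTLBuffer) {
    let commandBuffer = commandQueue.makeCommandBuffer()!
    let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor)!
    encoder.setRenderPipelineState(pipelineState)               // ingredient 1: pipeline state
    encoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)  // ingredient 2: geometry
    encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3)
    encoder.endEncoding()  // done filling this package; it still has to be committed, as described next
}
```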
Until you do that, you cannot submit the command buffer. The last stage is to call commit on our filled package, so it gets delivered to the command queue and executed. In order to wait for the results to be ready, you can call waitUntilCompleted, but that is actually considered bad practice, because it prevents you from having CPU/GPU parallelism: instead of stalling, the CPU could be preparing more work while the previous commands execute on the GPU. So the advice is to use the addCompletedHandler function to get a callback once the command is done.
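A small sketch of those two ways of learning that the GPU has finished, assuming a commandBuffer like the one filled above; the completion handler is the preferred one.

```swift
// Must be registered before committing the buffer.
commandBuffer.addCompletedHandler { buffer in
    // Called on a background thread once the GPU is done with this buffer.
    print("GPU time: \(buffer.gpuEndTime - buffer.gpuStartTime) s")
}
commandBuffer.commit()

// The blocking alternative, considered bad practice because it stalls the CPU:
// commandBuffer.waitUntilCompleted()
```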
Now let's see how to code something yourself. Maybe not rendering, because rendering is not really applicable to most day-to-day iOS developers; it's mostly used in the video games industry. What you can do instead is leverage what's called GPGPU, which stands for general-purpose GPU computing, to implement, for example, machine learning algorithms or some computer vision, or maybe you want to create a photo editor with fancy filters like Instagram has, which is actually what we are going to do right now. But before that, we'll dig into the history once again; yes, I'm a boring guy. The early GPGPU days started when the first programmable GPUs were released, and it was mostly done by PhD folks whose primary research was computer graphics. It was a funny time, because financial companies hired game developers to implement very fast solutions to their problems. The term GPGPU originated in 2002, when GPGPU.org was launched: a website where enthusiasts tried to use GPUs for general-purpose computing. What they tried to do was simulate physical collisions, chemical reactions and so on using parallel instructions, and the way they did it was to encode their data into textures, putting each float into a separate channel and then reinterpreting it during rendering. So they would look and say: oh, in pixel 5 there is a 2, that means something, let's interpret it that way. But in 2007 our beloved Nvidia released CUDA, which actually changed the game; now, from cryptocurrency mining to machine learning, most of that stuff is done through CUDA. CUDA was the first GPU architecture and software platform designed for parallel computing; it is a C/C++-based language for GPUs, and it started a whole new industry of very fast computation. What CUDA initially did was introduce the very simple concept of compute shaders. They took fragment shaders and said: hey, we don't really need to use them only for pixels on a screen; we can just feed the shader an arbitrary ID instead of its position and have it compute anything else.

So what we'll do right now is write our own compute shader which implements a brightness filter, like the one in Instagram, so we can make our images brighter or darker. We have a few prerequisites before we proceed. First of all, all of the code you'll see is written with the Alloy library, but don't be scared: it's a library I have been developing for three years now, and it stands on one very important principle: it has no new concepts. It is not a wrapper library around Metal; it's still the same Metal, the same types, just a bunch of extensions that make your code look safer and more idiomatic, and it doesn't have any performance overhead. What it essentially does is transform the outdated Objective-C-like API into something more modern, but you don't have to learn anything and you don't pay anything; it is just a more Swifty way to do stuff. Another thing we'll use today is the encoder pattern. It doesn't really exist in the wild; it's something I came up with myself to organize my code. An encoder is essentially just a class that represents all of the needed information for a particular shader: it knows how to create the pipeline state for the shader and how to encode the shader into the package.

Our first step is to actually create a compute pipeline state. It is very easy: we introduce our class, called Brightness because it stands for the brightness shader, and we store the compute pipeline state in a property of that class, using the library as an injection, because this is where the shader is located. It is basically a single line of code where we just ask the library to get us a compute pipeline state for the brightness shader. The name you pass as a string has to correspond to the shader name you'll write in the shader code; we'll see it later. The next thing we have to do is introduce the encoding API. Encoding basically fills our command buffer with the instructions on how to do the actual brightness. Since we are doing compute, we'll need a special kind of encoder called MTLComputeCommandEncoder, and it looks something like this: we have a command buffer on which we call compute, and we pass a closure where the system provides us the encoder we are going to use. In order to craft a command for computing, we first have to put our ingredients in. We'll start with the textures; we'll have two textures as input. The first is the source texture, the texture we will actually be processing, so it's our image, and the destination texture will store the result. We'll also have an intensity, a constant defining how much brightness we want to bring in. We put the textures in first by calling set textures on the encoder, passing the array of textures we want to use, and that's all we need to do there. The second step is to provide the intensity for the shader, which is as simple as can be: you just set the intensity at index zero, because it's the first constant we are going to use, and that's it, the constant is set.
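As a sketch of this encoder pattern, here is roughly what such a Brightness encoder could look like using plain Metal calls rather than the library used in the talk; the type name, error type, and thread sizing are assumptions, and the dispatch at the end is described in more detail just below.

```swift
import Metal

enum EncoderError: Error { case functionNotFound }

final class BrightnessEncoder {
    private let pipelineState: MTLComputePipelineState

    init(library: MTLLibrary) throws {
        // Create the compute pipeline state once and keep it for reuse.
        guard let function = library.makeFunction(name: "brightness") else {
            throw EncoderError.functionNotFound
        }
        pipelineState = try library.device.makeComputePipelineState(function: function)
    }

    func encode(source: MTLTexture,
                destination: MTLTexture,
                intensity: Float,
                in commandBuffer: MTLCommandBuffer) {
        var intensity = intensity
        let encoder = commandBuffer.makeComputeCommandEncoder()!
        encoder.setTexture(source, index: 0)       // the image we process
        encoder.setTexture(destination, index: 1)  // where the result goes
        encoder.setBytes(&intensity, length: MemoryLayout<Float>.stride, index: 0)
        encoder.setComputePipelineState(pipelineState)
        // One thread per pixel; dispatchThreads needs non-uniform threadgroup support (A11+).
        let threadsPerGroup = MTLSize(width: pipelineState.threadExecutionWidth,
                                      height: pipelineState.maxTotalThreadsPerThreadgroup
                                              / pipelineState.threadExecutionWidth,
                                      depth: 1)
        let grid = MTLSize(width: destination.width, height: destination.height, depth: 1)
        encoder.dispatchThreads(grid, threadsPerThreadgroup: threadsPerGroup)
        encoder.endEncoding()
    }
}
```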
The last step is to assign the pipeline state. This requires two things: first, you assign the pipeline state itself, and second, you tell the GPU how many threads you want to launch. In our case we want as many threads as there are pixels in our image, because for each pixel we want a dedicated thread that will transform that pixel. So we just say dispatch2d, pass our pipeline state, which we stored previously in our property, and say exactly how many threads we want by passing destination.size; this is the size of the grid we'll use. Once that's done, our encoder is set up and our command buffer has one command, but nothing really happens at that moment: you only have a package with something inside, and you still have to launch it on the GPU.

But before doing that, we have to write the GPU instructions. GPU instructions are written in the Metal Shading Language, which is basically a C++ subset; it mostly limits you to writing plain C-style functions with some special keywords. This is the shader we are going to write, but we'll split it into smaller pieces to better understand what's going on. First of all, we start our shader with the keyword kernel; kernel stands for compute shaders, but you could also write fragment here if it's a fragment shader, or vertex if it's a vertex shader. Next you declare your arguments. Our arguments here are the textures, input and output, and we also specify the access we are going to use them with, because this way Metal can optimize things: if you declare the access precisely, it can use the cache more aggressively and you'll see even more performance. The next input we'll use is factor; this is still our intensity, just with a fancier name. It is the multiplier we are going to multiply our pixel by to get the expected result. And the last argument is position. It indicates the location of the thread, so you know which pixel to process with each individual thread. You can also ask Metal for other dispatch information, not only the position of the thread: you can ask for the position of the threadgroup or how many threads there are, but that is a more advanced technique; for simple stuff we only need the position of the thread. So first of all you read the pixel you need, using the position to read it from the texture. Next, we multiply it by the factor, to make the image brighter if the factor is more than one, or darker if it's less than one, and we write it back, not into the original texture but into the output texture, which is another one.
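Putting those pieces together, here is a minimal sketch of what such a brightness kernel can look like in the Metal Shading Language; the bounds check is an addition not mentioned in the talk, useful when the grid is rounded up to threadgroup size.

```metal
#include <metal_stdlib>
using namespace metal;

// One thread per pixel: read it, scale it by `factor`, write it to the output.
kernel void brightness(texture2d<float, access::read>  input  [[ texture(0) ]],
                       texture2d<float, access::write> output [[ texture(1) ]],
                       constant float& factor                 [[ buffer(0)  ]],
                       uint2 position                         [[ thread_position_in_grid ]]) {
    // Skip threads that fall outside the texture.
    if (position.x >= output.get_width() || position.y >= output.get_height()) {
        return;
    }
    float4 pixel = input.read(position);
    // factor > 1 brightens, factor < 1 darkens; alpha stays untouched.
    output.write(float4(pixel.rgb * factor, pixel.a), position);
}
```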
OK, so now we've got our shader ready, and we actually want to launch it and see what it produces. In order to do that you have to do very few things. First, there is the initialization stage, which consists of only three lines of code. The first one: we create the context, which basically stores our device and command queue, as you remember. Then you create a library for Brightness.self; you pass the type of the encoder just to get the needed library for your bundle, in case your encoder sits in a framework or somewhere else, or you can just pass the bundle you want. And you create the encoder you just wrote by passing the library as an injection, so it can extract the compute pipeline state and prepare itself for encoding.

The launching is a very simple process. First of all you prepare your textures: you convert the CGImage to a texture using a built-in method on the context, and then you create a matching texture for the destination, so it will use the same size, the same height, the same width, the same pixel format, but we say that the usage will be shader write, so we get more optimization from the Metal side. Next we use the schedule-and-wait method. This is very similar to how encoding works: you pass a closure where you receive a buffer that you have to fill in, and filling the buffer is as easy as calling the encode method we just introduced on our encoder, so it puts the actual command inside. And that's it: once you are finished with that, your command goes off to be executed on the GPU. Once you have waited for it to finish, either by saying schedule-and-wait or, if you want to be fancier and utilize parallelism even more, by just saying schedule and then subscribing to the buffer's callback, you call cgImage on the output texture to convert it back to a CGImage, and you can see the result.

So can we make some practical use out of what we just did? Of course you can. Now, if Airbnb photos look fake, you can just lower the exposure to see what the real room will look like: not like paradise, just a regular room. Or, if you are an Airbnb host, you can do it the other way around and pass a multiplier greater than one to create the illusion that your apartment is the best place in the world.
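Here is a hedged sketch of that launch flow, written with plain Metal and MetalKit instead of the library's built-in CGImage helpers; BrightnessEncoder is the sketch from earlier, and the input CGImage is assumed to come from elsewhere in your app.

```swift
import CoreGraphics
import Metal
import MetalKit

func applyBrightness(to image: CGImage, intensity: Float) throws -> MTLTexture {
    // Initialization: device, queue, library and our encoder (normally done once and reused).
    let device = MTLCreateSystemDefaultDevice()!
    let queue = device.makeCommandQueue()!
    let library = device.makeDefaultLibrary()!
    let brightness = try BrightnessEncoder(library: library)

    // Prepare the textures: the source from the CGImage, the destination matching it.
    let loader = MTKTextureLoader(device: device)
    let source = try loader.newTexture(cgImage: image, options: nil)
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: source.pixelFormat,
                                                              width: source.width,
                                                              height: source.height,
                                                              mipmapped: false)
    descriptor.usage = [.shaderRead, .shaderWrite]  // the kernel will write into it
    let destination = device.makeTexture(descriptor: descriptor)!

    // Fill a command buffer with our single compute command and send it off.
    let commandBuffer = queue.makeCommandBuffer()!
    brightness.encode(source: source, destination: destination,
                      intensity: intensity, in: commandBuffer)
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()  // blocking here only to keep the sketch simple
    return destination                  // convert back to a CGImage to display it
}
```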
For the last part, I have a few tips for those of you who are brave enough to start doing something with the GPU. First of all, remember that the GPU is a separate device that physically sits outside the CPU, and that's why you not only pass data between two languages, from the Metal Shading Language to Swift or Objective-C, but also between two physical devices. So in case you pass structs, remember that Swift doesn't have a guaranteed memory layout, and that's why you have to use custom Clang attributes to preserve it. Next, remember that when you fill a command buffer nothing really happens; you are only encoding commands. So if you mistakenly think that calling draw-something will draw something, it won't: it will just put a command into the command buffer, and you have to wait, and the waiting time is somewhat undetermined, because you are not the only one submitting commands. You have to keep code divergence to a minimum: GPUs aren't really good at handling if conditions, because all of the threads work in parallel and have to execute the same set of instructions. GPUs are best used with floats; they don't support doubles at all on iOS, and using ints is usually better avoided. A more advanced tip is to cache the reusable objects: if you create a pipeline state, don't let it be released and collected; put it in a property so you can reuse it every time you need it. And usually you shouldn't make the CPU wait for the GPU to finish execution; what you should do instead is subscribe to completion handlers, so that while the GPU does something, you can actually utilize your CPU for something useful.

So that's it for today. I hope you enjoyed the talk and will probably try something with Metal yourself. You can follow me on Twitter to get the slides and the code for this presentation; I will post them there today. And I think we actually have some time for questions; this is unexpected, but I tried my best.

OK, so the first one is: can I have a live demo, a hello triangle? Not right now, since we have limited time, but we can hack it together after the presentation if you want. What kind of calculations should be done on the CPU side, and what should be done on the shader side? This is actually a very challenging question, but generally you decide by whether or not your algorithm is friendly to parallelization. It can be very obvious, or it can be non-trivial to detect, but the general rule is: if the algorithm involves a lot of dependencies, like semaphores, mutexes, conditions or waits, or if it's a loop where one iteration depends on the previous one, it is not friendly to the GPU. But if you have a workload like a for loop where each iteration doesn't depend on anything but, for example, the iterator, that is a very good sign that it should run on a GPU. How difficult is it to migrate existing OpenGL code to Metal, is it a full rewrite, and, a related question, is it worth going for that migration in terms of performance? OK, so yes, it is a full rewrite, but it's not hard, and I will try to explain why. Once you get into any single GPU API, be it DirectX, Vulkan, OpenGL or Metal, the thing is that they all work the same: the same paradigms, the same concepts, the same command buffers, queues and shaders everywhere, the same names for uniforms and so on. So once you are comfortable with OpenGL, it's very easy to convert to Metal, because if you have an OpenGL codebase, that means you already understand all of the concepts needed for Metal; Metal doesn't bring anything new to the table compared to what OpenGL has. Actually, the performance may sometimes even be worse, contrary to what Apple claims, because Metal has some behaviors that differ from OpenGL. But you still definitely should do it, because OpenGL is officially deprecated by Apple, so maybe in one or two versions of iOS it will be completely removed from the system, and your app will not be maintainable anymore. So you should start now, especially given that the OpenGL implementation on iOS 13 is very buggy, so your users are likely to experience problems right now if you have an OpenGL codebase. Thanks, Andrey.
Info
Channel: iOS Conf SG
Views: 10,594
Keywords: ios, singapore, apple, iphone, ipad
Id: zXBEJzAaHY8
Length: 35min 19sec (2119 seconds)
Published: Thu Jan 30 2020