Custom Hand Gesture Recognition with Hand Landmarks Using Google’s Mediapipe + OpenCV in Python

Captions
Hey, what's up everybody, it's Ivan here. Over the last couple of weeks I was helping a former student of mine with a really interesting Python project, all about recognizing letters of the Russian sign language. While helping her I got to dive into the topic a bit myself, found it pretty interesting, and thought I'd share it with you. So in this video we'll use a really cool repo that I found (and talk about the folks who built it) to train a system to recognize our own hand gestures. They can be anything: a peace sign, thumbs up, thumbs down (hopefully mostly thumbs up, if we're talking about this video). Above all, it'll be a really beginner-friendly video, so whether you have a project for college or it's just something you're learning, as long as you have basic knowledge of Python and of how to install modules, you'll be fine.

Now let's talk about the structure of the video. First we'll go through a top-level overview of what are, in my opinion, the two main approaches you can take when performing hand gesture recognition: the pixels-first approach and the landmarks-first approach. We'll do that with a very fancy Google presentation that I put together in about ten minutes, but it covers the main ideas you need, so that when we jump into the code you understand why things are organized the way they are. Once we have that top-level understanding, I'll show you a really cool repo I found, and its English translation, that makes it easy to get started with hand gesture recognition and with training a neural net on custom hand gestures. That repo uses Google's MediaPipe framework, which you can simply install in Python and which has all sorts of nice things that make it easy to get started with this type of computer vision task. On top of MediaPipe there's a really simple Keras neural network that takes hand landmarks and classifies them into actual gestures; we'll cover all of that. Then we'll look at the code, see how to use the repo to train on our custom gestures, and I'll also walk you through the interesting parts of the code that might be useful to know if you want to modify it, build it into your own projects, or build something completely new, so you know where the main magic is happening. If you're excited, smash that like button, consider subscribing to the channel, and as always, if you have any questions throughout the video,
leave them in the comment section down below, and let's get started.

Hi again everybody, it's Ivan from the future, actually, and I wanted to tell you two things. First, this is going to be a slightly longer video than usual, but I really enjoyed going in depth on the coding and training explanations, so I hope you'll enjoy it too. If you want to skip to any specific part of the video, I'll leave timestamps, so feel free to jump around; I always find that useful, so hopefully you will too. Second, the little fragment where I explain the top level of how everything works didn't get recorded properly, so I'll quickly re-record it and then proceed normally with the video.

So now let's talk about the top-level understanding of how things work in the repository, and how we can approach building a system that grabs frames from the camera and recognizes the hand gestures in those frames. The first part that needs to be handled is actually grabbing the frames, in my case from the webcam, in a way that lets us interact with them and work with them in Python. In the repository that's handled by a really powerful library that I enjoy working with a lot, called OpenCV (open computer vision). That's good news, because OpenCV is such a widespread, massive library used in so many places that it'll be pretty easy for you to build this into your own projects. In this video I'll use the example of building such a system to recognize gestures like the peace sign, the okay sign, the thumbs-up sign, or the rock-and-roll sign. By the way, if any of these gestures mean something different in your part of the world than they mean to me (I know they can mean different things around the world), let me know; I'm pretty sure I'm not doing anything too crazy, but I think it would be a nice educational thing.

Cool. Now that we've covered the first part of the system, grabbing frames with OpenCV, how do we build the main, crucial part: the system that takes those frames and recognizes the hand gestures in them? (All the drawing you see, the little dots and labels, is done with OpenCV.) The first hint at how to approach this problem comes from thinking about the frames themselves: hands come in all sorts of shapes, gestures come in all sorts of shapes with slight variations and specifics, people have different hands, backgrounds can be different, lighting can be different. All of this hints that it
would be really difficult, if not impossible, for a human being to write a comprehensive system purely algorithmically, with handwritten code, that could tell us exactly which hand gesture is being shown. The task is so broad, with so many ways the gestures can be presented and so many possible backgrounds, that it hints we should use a machine learning approach, or, as we'll soon see, a deep learning approach specifically.

The first machine learning (or deep learning, since it uses deep neural networks) approach is what I'd call the pixels-first approach. We take the RGB matrices; images are represented in computers as matrices, and the RGB matrix is a very well-known format, and that's essentially what we get when we grab frames from the webcam with OpenCV. That RGB matrix contains pretty much all of the information in the image. We pass it through a convolutional neural network, a special type of neural network optimized to work particularly well on images, and we get some sort of classification out. That's the pixels-first approach. (And that's where I hand it back over to the Ivan from the past to continue with the video.)

So that's approach number one. The second approach is a little more multi-layered, but it starts off very similarly: we grab the frames from the camera and pass them through a convolutional neural network, but this network is different. It's not supposed to give us the definitive answer about which gesture the hand in the frame is showing; it's supposed to extract hand landmarks, those pointy, skeleton-looking points on the hand. We then take those landmarks and pass them through another much, much smaller and simpler neural network, a feed-forward architecture, and that one gives us the final classification of what sign the hand is showing.

The main drawback of this approach is that training a neural network to extract these landmarks is a much harder task. But we actually don't have to train it, since it's a well-known problem and there are pre-made, ready-to-use neural nets for exactly this, and that's where Google's MediaPipe is going to come into play in just a few moments. So if "we have to train this thing" isn't actually a problem, we can start looking at the advantages of this approach, and the main advantage is that we get a really reliable neural net that's capable of extracting hand landmarks. Training a neural network on landmarks to recognize gestures is a vastly simpler task than training it on pixels, because, as I was saying, hands can have different shapes, different finger lengths, different skin colors; they can just be really, really different, and lighting and cameras can be very different too. So for us to train that
kind of system on raw pixels, we'd need to collect a lot of training data: hands on the fancy flowery background that I have, but also hands on a blank white wall, on a brick-wall background, on a countryside forest background, if you want the system to be really reliable. With hand landmarks, though, all the heavy lifting ends up being done by the neural net that extracts the landmarks, and once the landmarks are extracted, that's pretty much all the data the smaller neural network needs in order to do the classification. We leverage this really powerful pre-trained neural net that extracts the landmarks so that we can train a much smaller, much lighter network that's much easier to train and requires far fewer training images. For instance, if I want it to recognize a peace sign, instead of collecting peace signs on all sorts of different backgrounds, I can just show the peace sign in a couple of different orientations and a couple of different ways of showing it, and that will be enough. That's the power of this approach: leveraging the pre-trained net that extracts the landmarks so we can train a much smaller, simpler neural network to perform the actual classification.

Now that we've talked about this landmarks-first approach, you might notice that one step is missing. The neural network that extracts the hand landmarks, and subsequently the neural net that classifies hand gestures, doesn't really need to look at the whole frame, at every single pixel, to recognize a hand gesture. As you may guess, it only needs to look at the actual hand, which means the background, the rest of my arm, me, all of that is not important. So we can make the whole system a lot more reliable by first cropping out the hand: the area in the image that contains all the data we need to predict which gesture it's showing. The pipeline then looks something like this: we grab the frames from the camera (it starts off the same), then we pass those frames through a hand detection model, which detects the hand in the image and crops it; then we take that cropped hand image and pass it through the powerful neural net that gives us the hand landmarks; and then we pass those hand landmarks through a feed-forward neural network that performs the classification. I'd say that's, if not the most effective, then at least in the context of what we're talking about, definitely the most efficient way of recognizing hand gestures on a single hand.

Looking at this approach, it might seem a little complicated and difficult to implement, but I can actually assure you it's going to be really easy, for one main reason: a lot of the work has been done for us. For example, the stage with hand detection and
landmark extraction is handled really well by Google's MediaPipe framework, and, as I'll show you in a couple of moments, we pretty much just run some Python code and it detects the landmarks. The second stage, which has to do with training the smaller feed-forward neural network, collecting training images and all that, is handled really well by the repo, which I'll also show you in a couple of moments. So in reality it's super easy to get started, but as always when working with these types of approaches, I find it useful to have a top-level understanding of what the different pieces are doing; that way it's much easier to apply it to your own problems, to improve on it, or just to debug it if something goes wrong. So that's the top-level overview of how things work here.

Right now we're on the website of Google's MediaPipe framework, the one used in the repo I keep telling you about and will show in a couple of moments. This page explains what it's doing, and here we can see the models that are used. The first is the palm detection model, which, as we've talked about, detects the palms. The next is the hand landmark detection model, which takes the cropped hand images and performs precise keypoint localization of 21 3D hand-knuckle coordinates inside the detected hand regions via regression, that is, direct coordinate prediction. In other words, it gives us the landmarks. They also provide a Python API for us to try it out. All you need to run it is to have OpenCV installed, which is really easy to do through pip, and then to install MediaPipe; it probably runs some TensorFlow Lite runtime in the background, but all of the inference is done on the CPU, and it runs pretty quickly. In my case (I'll show you in a moment) I have two webcams; this one is my laptop's built-in webcam, and as you can see, I run it and it's running pretty fast, all on CPU, and it's detecting the hand landmarks. It's pretty cool how well it works: my hands can be in all sorts of orientations, they can even obscure each other to a certain extent and it still works; I can even partly hide my hand behind the mic and it's still able to get the whole picture. And if your gestures operate with some sort of closed palm, as you can see it handles that quite well too. So that's pretty much the simplest way we can work with Google's MediaPipe framework and get it to show us some results. If it's giving you errors along the lines of "mediapipe not installed" or "cv2 not installed", just go and install OpenCV and MediaPipe; for OpenCV it's pip install opencv-python in the console, and for MediaPipe it's pip install mediapipe, just in case any of you run into problems here.
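To make that "just run some Python code" part concrete, here is a minimal sketch of the kind of OpenCV + MediaPipe loop described above, assuming pip install opencv-python mediapipe. The confidence thresholds and window name are my own illustrative choices, not values taken from the video or the repo.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # 0 = default webcam
with mp_hands.Hands(max_num_hands=2,
                    min_detection_confidence=0.7,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.flip(frame, 1)                    # mirror for a selfie view
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB
        results = hands.process(rgb)
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(frame, hand_landmarks,
                                          mp_hands.HAND_CONNECTIONS)
        cv2.imshow('MediaPipe Hands', frame)
        if cv2.waitKey(1) & 0xFF == 27:               # press Esc to quit
            break
cap.release()
cv2.destroyAllWindows()
```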
So that's us testing the basic MediaPipe hand landmark detection application. Now that we've looked at the MediaPipe part of our hand gesture recognition pipeline, which pretty much covers the extraction of frames, palm detection, and hand landmark detection, I'll fill in the remaining steps required to recognize custom hand gestures, which are collecting data and then training a smaller neural net to perform the task, using a really, really cool repo. Let's take a look at it. The repo is called hand-gesture-recognition-using-mediapipe, by Kazuhito Takahashi. It's a really awesome repo, but the one we're going to use is not that one; it's a fork of the original, translated into English by Nikita Kiselov, who is the author of the translated repo.

To get the repo you can either git clone it using its URL, or just download the ZIP file and extract it somewhere; I've put it on my desktop. I have it opened in an IDE called PyCharm, which is essentially just a place where you can write Python in a nice way; I'm keeping this beginner friendly, but you can obviously use whatever IDE you like to write Python in. I just prefer PyCharm for this type of thing. (When you clone it, it won't have this one extra item you see here; that only got created on my machine a couple of seconds ago.) Now we can go ahead and launch the app.py file, which is pretty much the main file in this repository: it performs inference on the frames from our webcam and processes them in a way that detects hand gestures. Right out of the box it's able to detect a single hand (I'll show you how to enable multi-hand detection in a few minutes), and it has hand gesture recognition built in; in this case it recognizes the open palm, the closed palm, the okay sign, and the pointer. There's also a part of this repo that deals with recognizing point history, those drawings it makes when you point, but that's not what this video is focused on; feel free to explore it. We're focused primarily on the part where it recognizes hand gestures (closed hand, open hand, okay sign) and on how we can add some custom hand gestures of our own.

So now let's start looking at the code and maybe tweak a few things here and there. The first thing we notice is that it's only detecting hand gestures on one hand; you may want to detect gestures on two hands, or three, or however many hands you want to put to work here. If we look at the MediaPipe docs, we can see there are several supported configuration options, and one of them is max_num_hands, which is defined in the Python code. In our case, if we find it in our code by searching for max_num_hands,
we see it's set to 1; we can set it to 2, and it will then be able to detect two hands. And now, if we look at the OpenCV window (which, as I was saying, handles all of the displaying and drawing), you can see the detection is happening on two hands.

Here's something interesting that I didn't notice at first. As you can see, here I have my right hand; if I flip it around, it's still labeled the right hand; if I put it on the other side of the screen, still the right hand; flipped around there, still the right hand. Same thing with the left hand. You can't really fool it: however I rearrange my hands, it knows this one is the right hand (which it is) and this one is the left hand, even if they're completely swapped and on completely opposite sides of the screen. That tells us MediaPipe also has a built-in way to detect the handedness of a hand, whether it's the right or the left, which can be useful. For instance, the former student of mine that I mentioned at the beginning of the video, who was training a system to recognize Russian sign language letters that can be shown with one hand, used this handedness detection to always have one hand be the one that inputs the letters (the hand you read the gestures from) and the other hand be the one that controls the inputting, the typing, and the other actions. So you can use this functionality to separate gestures detected on one hand from gestures detected on the other, and have them perform different actions in whatever application you're working on.

One thing to note: right now it's running at something like 16-18 fps, and when none of my hands are detected, all it's doing is running the palm detection model that detects and crops palms in the frames. If I put a hand in, it also starts running the hand landmark detection and gesture recognition models, which takes more processing power, so the frames per second drop; and with a second hand in the frame there's a bit more processing still. So pretty much the only limitation on running it with a lot of hands is that it might slow the application down, but as you can see it's pretty remarkable: running on CPU, running pretty fast, and doing what it needs to do. Here in the docs, in the output section, it also talks about multi-handedness. One thing I want to note is that when determining the handedness of a hand, it assumes the image was taken with a front-facing (selfie) camera, as it says here, and if that's not the case you need to swap the handedness in your application; just something to keep in mind.
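As a small illustration of that handedness output, assuming the results object returned by hands.process() in the earlier sketch (the field names below are from MediaPipe's Python hands solution):

```python
# Assuming `results = hands.process(rgb)` from the loop shown earlier.
if results.multi_handedness:
    for handedness in results.multi_handedness:
        label = handedness.classification[0].label   # "Left" or "Right"
        score = handedness.classification[0].score   # confidence score
        print(label, round(score, 2))
```

As mentioned above, these labels assume a mirrored, selfie-view image.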
By the way, here in the docs they also say how the hand landmark detection model was trained: on manually annotated images of people's hands, and also on high-quality synthetic hand models rendered over various backgrounds, which were also fed into the training data.

So now let me give you a walkthrough of the main points in the code, so you have that understanding. First of all, the main action, in terms of performing this kind of inference on the frames from the camera, happens inside the main function. We'll first focus on the OpenCV structures that fetch the frames; then there's preprocessing performed on those frames, which will be quite familiar to anybody who's worked with OpenCV (and if you haven't, I have a series of videos about it, if you're interested). We define the camera capture and so on, which essentially picks which webcam we'll grab frames from into the Python code. Then there's this structure with the while loop, and the waitKey call inside it, which probably everybody who's worked with OpenCV will recognize. Since we're grabbing frames from the camera constantly, it's essentially an endless loop where we grab and grab and grab; but for this type of loop to work with OpenCV, there has to be a waitKey call present. I can show you: if we just remove it, first it will crash, and if I tweak things so it doesn't crash but still has no waitKey, the window will just stay blank; as you can see, it's completely not working, because without waitKey OpenCV never gets the chance to actually draw anything into the window. The waitKey call also adds a delay to the while loop, which makes sense because there's a natural delay between frames from the camera (it's not shooting at a billion fps), and it gives you the ability to wait for key input. Anyway, this isn't an OpenCV tutorial; I'm just pointing out a very general structure that's used in a lot of places and is very flexible in how these things are put together.

Scrolling back, we use the camera capture's read() call, which grabs frames; the first element it returns is either true or false, false if there aren't any frames, and since we're reading from a webcam it's always true, because there are always new frames. The image gets flipped and then converted from BGR to RGB, because OpenCV has its thing where it prefers to keep everything in BGR from the get-go, but we can convert it with this call. Then the interesting stuff happens: this hands object was defined above, using mp, which in this case is a shortcut for mediapipe, so the hands variable is essentially a MediaPipe
hands application, configured with those hyperparameters, that tracks hands. In this OpenCV code we're grabbing frames, converting them to RGB, and then pretty much just passing our image through hands.process(), which in turn returns the hand landmarks to us.

I've added a handful of prints here, and what I want to do may get a little convoluted at this point, but bear with me, because it's pretty cool. We're going to start printing the landmarks. This first print shows how we first get them from MediaPipe: as you can see, it's a bunch of points that have x, y, and z coordinates, which is cool, but it's not what we'll ultimately use; we won't need the third coordinate, we'll stick to two-dimensional life. So let's jump to the next print, which happens after the hand landmarks get converted into a two-dimensional list, and this one is a bit more interesting, because it's closer to what we'll end up using. After that conversion we can see two main differences: first, there are just two coordinates, x and y, per point (which is also what's used to draw them on screen), and second, the values are now actual pixel values.

Now let me show you something cool. This first point, according to the MediaPipe docs, is the wrist point, this point right here. In OpenCV pixel coordinates the top-left corner is (0, 0), zero x and zero y, and the bottom-right corner is (full width, full height). Let me print just the first element, the wrist point, so there isn't a billion numbers on screen at the same time, which I agree is confusing. So now I'm printing the zeroth element of that processed list, the wrist point, and as you can see, the closer it gets to the top-left corner, the lower the values get (I'm trying to get it as close as I can without losing it), and the closer it gets to the opposite corner, the larger the values get in general. If I move it over here, the x values get really small while the y values, the height values, stay roughly the same; and if I instead move it somewhere over here, the y values are really low, around 100, 115, 120 or so, while the x values in this case are around 700. So using this one point as an example, you can see how, at this stage of the processing pipeline, the landmarks are represented as absolute pixel values. There are a few more preprocessing steps coming, but it's a nice intuition to know that these things aren't just painted dots; they're actual values, and we can understand where they come from.
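As a hedged sketch of that conversion step, similar in spirit to the repo's helper that turns MediaPipe's normalized 0-1 landmark coordinates into the absolute pixel values we just printed (the function and variable names here are mine, not the repo's):

```python
def landmarks_to_pixels(image, hand_landmarks):
    """Convert MediaPipe's normalized landmarks (0..1) into pixel coordinates."""
    h, w = image.shape[:2]
    points = []
    for lm in hand_landmarks.landmark:
        x = min(int(lm.x * w), w - 1)   # clamp so we stay inside the frame
        y = min(int(lm.y * h), h - 1)
        points.append([x, y])           # the z coordinate is simply dropped
    return points

# points[0] is the wrist: its values approach (0, 0) near the top-left corner
# of the frame and (width, height) near the bottom-right corner.
```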
Now, the next part is really cool, but behind the scenes I've been bashing my head for the last thirty minutes thinking about how to best explain it, because it's really crucial to understanding how the final part with the feed-forward neural net is going to work, yet it's a little convoluted. I'll do my best; please bear with me.

The whole benefit of using this landmarks approach, feeding landmarks into the feed-forward neural network in the last step for classification, is that we don't have to worry about the position of the hand, about the individual shape or color of a given hand, or about the background. But for that we need to find a good way to preprocess these landmarks before feeding them into the feed-forward network. Preprocessing is a crucial concept: figuring out how to properly preprocess a dataset for training a neural net is honestly half the battle many times. Right now we have a problem: as we've seen, these values are defined as absolute pixel values. For instance, right now the value of this wrist point, the zeroth point, might be around (50, 50) in pixel coordinates, but if I move my hand over here it becomes something like (400, 300). The same applies to all of the other 20 points on the hand. Why is that a problem? If you feed that to a neural network as-is, this hand over here is a vastly different hand, in terms of values, from this hand over there. Take just these two points, the wrist and another point: they're at pretty much the same distance relative to each other here as they are there, but their values will be vastly different, because the current representation of the landmarks, this list that we (well, not we, the author of the repo) have computed, relies heavily on the actual pixel values. And it doesn't rely on them just for the sake of it; it's because there's a next step coming up, and that next step addresses exactly this problem.

To address it, we need some scheme where the landmarks, when they're fed into the neural network (and they're fed as a vector, a one-dimensional list of numbers, something like 0.5, -0.3, 0.4, -0.4, 0.1 and so on, pairs of x and y coordinates, as we've seen with the wrist point), get converted from absolute pixel values in terms of the frame into relative values, where this hand here and this hand there end up with roughly the same values. That's done through a process called normalization, where we
normalize these values into the range between -1 and 1. I have a whole little painting prepared for this that we'll dive into in a second, but for now we'll follow the algorithm, see what it's doing, and see how we can actually normalize this whole thing so that everything is relative. This isn't just going to solve the problem of the hand being in all sorts of different places; it will also help when my hand is a little farther away or a little closer. As you can see, the distances change with scale, but if we normalize to relative coordinates, this hand ends up with roughly the same values as this hand, so the scale problem gets handled this way too.

So let's dive into what's actually going on in this preprocessing algorithm. Next up in the code, the list we've talked about, of absolute pixel values in the frame, gets passed through a pre_process_landmark function that does all the conversions, and in the output of that function we can see how the values have changed from absolute pixel values to (give it a couple of seconds to load) floating point numbers that never go below -1 or above 1; they're normalized. It's worth really understanding what's going on inside this algorithm, because before making this video I had my own idea of how I would do such a thing, and what the author of the repository did obviously does the job, is pretty cool, and isn't what I would have thought of immediately off the top of my head, so it's worth taking a dive into it.

The first thing that happens is that we create a temporary landmark list, a copy, which is essentially that list of pairs of pixel values, 21 pairs of hand landmark coordinates, each with an x and a y. The first thing we do with it is convert the whole list to relative coordinates. What does that mean? I have a little, not fancy at all, drawing I made in Paint. For example, imagine this is our wrist point, the point with index 0. Converting everything to relative coordinates is done by taking the point with index 0, the wrist point, and subtracting that base point from every other point, which is what's happening here. This is our base point, the wrist point (it's the base point because it has index 0), and we subtract it from every point. If we subtract the base point from itself we get (0, 0): base point minus
base point is (0, 0). Now say we're looking at this other point right here; let's say it has coordinates (400, 250), a little to the left and a little higher than the base point, and remember that over here is (0, 0) and down there is (max x, max y). For this point the subtraction is 400 minus 600, which is -200, and 250 minus 300, which is -50. By performing that subtraction for each of the points, every point becomes represented in terms of its distance to the wrist point: this point is this far away, that point is that far away, and so on. Each point gets represented as itself minus the base point, in other words as how far away it is from the base point. That's pretty much what happens here; it happens for every point in the list, and then the list gets flattened into a one-dimensional list, because the input to the feed-forward neural network is going to be a vector, a one-dimensional list.

And here another really interesting thing happens. We now have these relative values, but they're still things like -200 and -50, or -100 and -250; they're still expressed in pixels. They're relative to the wrist, so they no longer depend on where the hand is in the frame (if the hand moved slightly to the right, the base point moves with it), but we still want to normalize them into floating point values between -1 and 1. The way the author of the repo does it is by using the maximum value. So we end up with this list (let me switch to a different color and pretend I'm drawing something fancy in Paint) filled with values like 0, 0, -200, -50, -150, -250: still pixel distances relative to the wrist point, not normalized floating point values yet. We get the floating point values by taking this list of relative values and finding the maximum absolute value in it. That's what the expression here does, something like max(map(abs, temp_landmark_list)): the map(abs, ...) part disregards whether a number is negative or positive and just gives its absolute magnitude, so a list like [-1, 2, -3] becomes [1, 2, 3]. That's useful in our case because we take all of these values, convert them into absolute distances, find the maximum element, and then, spoiler alert, divide all of the values by that maximum. You may ask how that looks in the actual drawing. The point that's farthest away from the
wrist point, the zeroth point, is, I'm eyeballing it here, this one, and let's say it has coordinates 450 x and 50 y, such that when we perform the subtraction, 450 minus the base 600 gives -150, and 50 minus 300 gives -250. That's our list of relative pixel distances; we compute it for every single point, and -250, converted to an absolute magnitude, is just 250, which in this example ends up being the maximum value, the one we identified in that line of code. Then what we do is very simple: we divide every single value in this list of relative distances by the maximum value. We end up with something like this: -150 divided by 250 is -0.6, and -250 divided by 250 is just -1, i.e. the farthest point away from the base point, relative to which everything is happening here. For this one it's -200 divided by 250, which is -0.8, for this one -0.2, and for the base point, as you might have guessed, it's still 0, because 0 divided by 250 is still 0. That's pretty much what we end up with, and that's what this function returns as the preprocessed landmark list.

To put it in simpler words, the intuition behind the preprocessed landmark list is that it contains normalized values between -1 and 1, determined by how far away each point is from the base point, which in our case is the wrist point. It's pretty cool, and now we haven't just gotten the intuition, we've also walked through the code, so we've seen both. I really tried my best with this drawing, so that's why I'm excited about it; hopefully it helped with the understanding. One last thing I'll say about the drawing: in this case, for the middle finger, the values are -0.6 and -1; if the hand were pointed in the other direction, they would just be 0.6 and 1, because the relative values would be positive (something like 700 minus 300, for instance). Just another nice observation. Okay, cool. Now that we've talked about all this, we're this close: we pretty much know everything we need to start training.
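Putting the steps we just walked through together (relative-to-wrist coordinates, flattening, then dividing by the maximum absolute value), a sketch of that preprocessing function might look like the following. It mirrors the logic of the repo's pre_process_landmark() as described above, but I've rewritten it from the explanation, so treat the details as approximate.

```python
import copy
import itertools

def pre_process_landmark(landmark_list):
    """landmark_list: 21 [x, y] pixel points, wrist first (index 0)."""
    temp = copy.deepcopy(landmark_list)

    # 1. Convert to coordinates relative to the wrist (the base point).
    base_x, base_y = temp[0][0], temp[0][1]
    for point in temp:
        point[0] -= base_x
        point[1] -= base_y

    # 2. Flatten to a one-dimensional vector of 42 numbers.
    flat = list(itertools.chain.from_iterable(temp))

    # 3. Normalize by the largest absolute value -> everything lands in [-1, 1].
    max_value = max(map(abs, flat))
    return [v / max_value for v in flat]

# With the walkthrough's numbers (max absolute value 250):
# -200 / 250 = -0.8, -50 / 250 = -0.2, and the wrist stays at 0.0.
```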
One quick thing before we jump into training, a really quick thing. As I was saying, this repo does two things: hand gesture recognition, but also point history recognition. For instance, when it identifies that you're showing the pointer gesture, it starts drawing those circles and recognizing things like rotating counterclockwise, or moving, or whatever it happens to be recognizing right now. That's awesome, and maybe I'll make something with it in the future, but it's beyond the scope of this video, so I just want to quickly show you how to disable it. The way it works is that when the hand gesture recognizer identifies the pointer sign, it invokes all of the point history functionality. We can go into the labels file, which we'll talk about in more detail very soon, and see that the label "pointer" has index 2, which means it also has index 2 in the neural net's output. We can go into the code and check: yes, whenever our keypoint classifier gives us a hand sign id equal to 2, it invokes the point history stuff. So a little, not super fancy but reliable, way to disable it is to change that condition from comparing the hand sign id to 2 to comparing it to something like the string "not applicable"; obviously a hand sign id, which is a number, will never equal a string, so the point history branch simply never runs. If you want it back in the future, you can restore it, or maybe even train a custom gesture and put its index there. As you can see, it's disabled now: nothing in terms of point history. And if you also don't want the "finger gesture" text to be displayed, you can go into the draw_info_text function and comment that little part out (I did it by pressing Ctrl+/ in PyCharm). It's not super annoying, but if you're building a specific application you might not want to include it. There it is, gone.
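As an illustration of that tweak, here is roughly what the edit looks like. This is a fragment from inside the repo's main loop as described above, not a standalone script; the exact surrounding code (and the detail that the index fingertip, landmark 8, is what gets appended) is my assumption about the repo's layout.

```python
# Before: the pointer gesture (class index 2) feeds the point-history logic.
# if hand_sign_id == 2:                        # "Pointer" in the labels file
#     point_history.append(landmark_list[8])   # assumed: index fingertip
# else:
#     point_history.append([0, 0])

# After: compare against something a numeric class id can never equal,
# so the point-history branch is effectively disabled but easy to restore.
if hand_sign_id == "not applicable":
    point_history.append(landmark_list[8])
else:
    point_history.append([0, 0])
```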
Now, on to training, for real this time. Let's finally talk about training and how we can train the model, within this pipeline of systems, to recognize our own custom hand gestures. In the repo there are instructions that we'll be following to perform the training successfully, but I suggest we talk about training in the context of three main things: preparing a dataset, training a model, and then testing.

The first interesting thing to understand is how the author of the repo handles dataset collection and, in general, how the dataset is stored. We're interested here in keypoint classification in particular, which is the same thing as hand landmark classification: we train on top of the hand landmarks. The training data is stored as a .csv file; I'm opening it inside its folder, so that if you're not using PyCharm and don't have the same layout, you still know where to find those files, but I can also open and view it in PyCharm like this. Here you can see the keypoint.csv file, which contains the training data that the author has collected so far (that's what the repo ships with), along with the labels. The first column is the class index: for instance, these rows are class 3, these are 2, these are 1, these are 0, and so on; it comes with just four classes, but we'll start creating our own in a few moments. The classes are matched against a labels file: all the rows with 0 in the first column are the landmarks for the class named "open" (the open palm, in this case), and all the rows with index 3 are the landmarks for the class "ok", and so on. As the author says, the first column of pressed numbers is the class id, and the subsequent columns are the keypoint coordinates.

Now, how do we actually collect new training data? Great question, let's dive into that too. As an example, say I want to keep these four classes and just add one new class, the peace sign. I'll add "Peace sign" to the labels file. Right now, if I launch the app, we obviously haven't done any training or collected any data, so there isn't going to be a peace sign; but it also won't give us an error, because the neural network will simply never predict class index 4 (which should be the peace sign), since it was never trained on that data. But we can start collecting a few samples of what the peace sign looks like. The repo says we press "k" to enter the mode to save key points, meaning we press k inside the OpenCV window. So I click the window, make sure my keyboard is in the English layout, press k, and it says "MODE: Logging Key Point". While we're in that logging mode, we can press 0 through 9, and the keypoints will be added to the dataset under the class index of the number we pressed. I'll show you: I've added the peace sign as the class with index 4, and now I literally start pressing the 4 key on my keyboard, four, four, four, four, and each press takes a snapshot of the current landmarks and saves those keypoints under label 4. I've clicked it a few times; if I go into keypoint.csv, you can see a few new rows appeared with index 4. I want to add more than that; you need way fewer samples than you would with the pixels-first approach, but you still want maybe 20 or 30 or so. So I'll keep pressing 4, maybe at a few different scales (even though, thanks to the normalization, it's not that sensitive to scale, it's still useful) and with some rotations, which can be useful too; then I'll do some on the other hand, four, four, four, four.
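For reference, a hedged sketch of what saving one of those samples amounts to. The repo's own logging helper does something along these lines, but the function name and file path here are assumptions about how the repo is laid out, not confirmed details.

```python
import csv

def log_keypoint(class_id, processed_landmarks,
                 path='model/keypoint_classifier/keypoint.csv'):
    """Append one training sample: class index + 21 * 2 normalized coordinates."""
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow([class_id, *processed_landmarks])

# Pressing "4" in logging mode is then roughly equivalent to:
# log_keypoint(4, pre_process_landmark(landmark_list))
```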
Again, the cool thing here is that the background doesn't matter, and neither do the angle or the exact specifics of what your fingers look like: if you train it to recognize a peace sign on your hand, chances are that, with these landmarks, it will also be recognized on somebody else's hand. Let's say we've collected enough; let me refresh the CSV, yeah, it's not a lot, but it might do the job. Let me add a few more, maybe with my fingers squished together a bit, since scale isn't the biggest deal here, and maybe with the fingers oriented a bit differently, like this; it depends on how far you want to go with it. Okay, let's say we've collected enough data; I'll press Escape to exit the window.

That was the first step. The second step, the training, is going to be really easy, again thanks to the author of the repo; we'll be able to do it almost without thinking about it, but we'll still look at the code. For training there's a Jupyter notebook for keypoint classification that we can run, and if we run all of its cells, it pretty much does the training for us. What I'll do is open a console in this folder and run jupyter notebook (Jupyter is a Python module I've installed; you can pip install jupyter and open it too), and there I'll open keypoint_classification.ipynb, the training notebook. A few things to note: essentially, what this notebook does is load the data from that CSV file and define a model. As you can see, it's a very simple Keras neural network architecture: 21 times 2 input neurons, which is the vector of values for the preprocessed landmarks, followed by some dropout, 20 dense neurons, some more dropout, some more dense neurons; a very simple architecture, with the output layer using the softmax activation function, because a hand gesture can only be one of the classes at a time (it can't be, say, a peace sign and an okay sign simultaneously), which is what softmax is useful for. The number of classes is determined by how many classes we have in our dataset; in our case we'll have five by this point: open, close, pointer, okay, and the peace sign. Then it trains the model and saves it, and the whole training happens really quickly. So pretty much all you do is go to Kernel, Restart and Run All, and it runs all the cells. Ah, I see what it's saying: I forgot to modify the number-of-classes variable; as I said, we now have five classes, so I'll set it to 5 and run all the cells again. As you can see it's training, and training really quickly; it says it's going to train for 1000 epochs, but in reality early stopping will kick in pretty soon, and it won't get much longer anyway, because the whole neural network architecture is tiny.
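For reference, the kind of Keras model described above would look roughly like this. The layer sizes follow the description in the walkthrough; the dropout rates and compile settings are my guesses for a working sketch, not necessarily the notebook's exact values.

```python
import tensorflow as tf

NUM_CLASSES = 5  # open, close, pointer, ok, peace sign in this example

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(21 * 2,)),          # 42 preprocessed landmark values
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # labels are class indices
              metrics=['accuracy'])
model.summary()
```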
What it's doing right now is learning to map the input vector, the pre-processed hand landmarks for all sorts of different examples, to its label class. So it's learning to map these relative landmarks to a number, an index, and that index corresponds to a line in the labels file, you know what I mean. Now that it's finished training, it saves the model automatically to the model save path. So pretty much all you do is go Cell, Run All, it runs all the cells and finishes, then you go and launch the app, and if your labels file is right and everything went smoothly there, you should be able to start recognizing the new hand gesture. So, peace sign... and there you have it, peace sign. It thinks this one is a pointer, which, when I see it, is just a signal to me to add more data, and then it won't think that anymore; potentially this angle also just looks a bit less reliable, I don't know. But as you can see, we still have the OK sign, the open palm sign, the close sign, the pointer, and now the peace sign.

So if it's not working correctly in certain orientations after you train it, just add more data: do this whole thing again, press k to go into the logging mode, and say, okay, it's struggling on my left hand, this should be a peace sign but it's saying it's a pointer, so let me press 4. I'm pressing my keyboard again, just bashing it, 4, 4, 4, and then I retrain with that data, and that should help, because you're feeding it more data.

The last thing I want to show you: we've now seen what adding one class looks like, modifying the labels file, adding more data, and running the whole training again. What if we want to start completely from scratch, get rid of all of these classes, and do something completely new? Let's say we want a thumbs up, a thumbs down, and a rock and roll sign, the horns sign or whatever it's called; I associate it with rock and roll, so we'll just roll with that. As you can see, training this model is really quick, and the main bottleneck in this case is the data. By the way, when we added the new class we didn't keep the old model; we completely retrained the model from scratch on a new dataset that included the new examples. So to add these new classes we basically do the same thing. The first thing I'll do is go into that CSV file and delete everything, so it's completely empty. Then I'll launch the app again and collect some training data. So let's say we'll do a thumbs down... hopefully this video gets a thumbs up, but this is just for playing around with it... actually, I don't like the sound of thumbs down.
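Since both the "add more data" fix and the from-scratch restart come down to the same press-k-then-bash-a-digit logging flow, here's a rough sketch of how a loop like that can be wired up. The function names, key choices, and CSV path are illustrative assumptions on my part, not the repo's exact code.

# Rough sketch of the logging-mode flow: press 'k' to enter key point logging
# mode, then a digit key 0-9 to append the current frame's pre-processed
# landmark vector to the CSV under that label. Names and paths are assumptions.
import csv

def select_mode(key, mode):
    """Map an OpenCV waitKey() code to (class number, mode)."""
    number = -1
    if 48 <= key <= 57:        # '0' .. '9'
        number = key - 48
    if key == ord('k'):        # switch to key point logging mode
        mode = 1
    return number, mode

def log_keypoint(number, mode, landmark_vector, csv_path='keypoint.csv'):
    """Append one labeled row: label index first, then the 42 landmark values."""
    if mode == 1 and 0 <= number <= 9:
        with open(csv_path, 'a', newline='') as f:
            csv.writer(f).writerow([number, *landmark_vector])

# Inside the main camera loop it would look roughly like:
#   key = cv.waitKey(10)
#   number, mode = select_mode(key, mode)
#   log_keypoint(number, mode, landmark_vector)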
Let's just do a peace sign, a thumbs up, and the rock and roll sign. Can we do that? Cool. So I'll say, okay, let this be the peace sign, and never mind what it's already predicting; I'm collecting a new dataset, so in the data collection mode, after pressing the k key, I keep pressing 0, 0, 0, 0, literally bashing my keyboard right now, I don't know if you can hear it. Each time I press the key, the data gets saved. That's the peace sign; now let's do the thumbs up, so I'll start pressing 1; now the rock and roll, the horns sign, so I start pressing 2. And for more accuracy you may want to add left-handed examples as well. Oops, I've just added a couple of wrong labels, as you can see: I did this thumbs up with index 0, but that index is for the peace sign, you know what I mean. That's not a problem, though, because I can just go and erase those rows. So I add some more left-handed peace signs, 0, 0, 0, then the left-handed thumbs up, and the left-handed rock and roll sign. By the way, if any of these signs mean something different in your part of the world, let me know; I'd be curious what they mean to you, because I did a bit of research beforehand, and there's a lot of hand gesture trivia out there, and this video does not need to be longer than it already is, I have a feeling.

Okay, so I've collected the data; go and check out our training file. As you can see it didn't take me a lot of time at all, it just takes a bit of attention. I'll edit the labels file now: peace sign, thumbs up, rock and roll sign. Then we'll go into the Jupyter notebook again and just press Cell, Run All Cells again... oh no, actually, one thing we need to edit here is the number of classes, so I'll go Kernel, Restart and Run All. So yeah, just don't forget to edit the number of classes, that's a good one for sure. Okay, you can see it's starting to train. We haven't collected a lot of data, so I'm assuming it's going to go by pretty quickly again. I think all the training is also happening on the CPU; I might be wrong, but I'm pretty sure it is. Okay, there's the early stopping, and it saved the model. After that's done, we can just go and launch the app again, since the model file, which is this file here, has been updated, so now it should be able to do all the new detections. So that's the peace sign, the thumbs up sign, the rock and roll sign; as you know we don't have a thumbs down, so even that comes out as a rock and roll sign. Same thing on the left hand: peace sign, rock and roll sign, thumbs up, peace sign, rock and roll sign. And whenever you get these edge cases and problems with accuracy, you can just go and do this whole thing again: say, look, this is also kind of a peace sign, add more data that looks like this under the peace sign's index, and run the training again.
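For context, here's a minimal sketch of how the app side can turn the retrained classifier's output back into the names from the labels file we just edited. The file names and the classify helper are assumptions for illustration; the repo's own code wraps this differently.

# Minimal sketch: load the labels file and the saved model, then map a single
# pre-processed landmark vector to a gesture name. File names are assumptions.
import csv
import numpy as np
import tensorflow as tf

labels_path = 'keypoint_classifier_label.csv'    # assumed: one gesture name per line
model_path = 'keypoint_classifier.hdf5'          # assumed: model saved by the notebook

with open(labels_path, encoding='utf-8-sig') as f:
    labels = [row[0] for row in csv.reader(f)]   # e.g. ['Peace', 'Thumbs Up', 'Rock and Roll']

model = tf.keras.models.load_model(model_path)

def classify(landmark_vector):
    """landmark_vector: the 42-value pre-processed landmark list for one hand."""
    probs = model.predict(np.array([landmark_vector], dtype='float32'), verbose=0)
    return labels[int(np.argmax(probs))]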
Pretty much the only other thing worth mentioning here is that if you want more hand gesture classes than the digit keys allow, you can just write some Python conditions in the code, something like the number of the key pressed plus 10, to open up a whole new range of label numbers (there's a small sketch of that idea just below). I don't think there's really a limit to how many gestures you can have; as I was saying, the person I was helping did this for the Russian sign language, and the Russian alphabet has 33 letters or so, so you can really do a lot with it. And yeah, as you can see, it's pretty cool: we're rock and rolling and peace signing and thumbs-upping in here. I'm kind of hinting at something here, guys, you know... I don't even know what I'm hinting at, but I'm hinting at something; can you guess? I'm not exactly subtle. Okay, okay, fine.

A few more things that are valuable to mention if you want to incorporate this into your own projects. What the former student did, for instance, is focus the gesture recognition on one hand, the right hand, and for the left hand she hard-coded some rules, like if you move your palm or a finger in a certain way, that triggers typing the sign that's currently being shown, things like that. So do experiment with hard-coding some rules on top of these landmarks. I also hope the whole part where I explained the OpenCV code and the training process is actually useful, because, as I was saying, with the knowledge of OpenCV that I have, it wouldn't take me much effort to build this into some of my own stuff, or to build on top of it and make something completely new. Again, massive shout-out to the author of the repo, this person, and to the person who translated the repo, this guy. I hope you enjoyed this video; smash the like button if you did, drop any comments or questions in the comment section down below, and consider subscribing to the channel if you enjoyed it. It was really fun for me to make, and I hope you got some value out of it. Hope you're healthy, hope you're well, and I'll see you in the next one, I guess. Peace.
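One last sketch, on the "more than ten classes" point mentioned above: a small condition on the pressed key is enough to unlock more label numbers. The toggle key and offset logic here are purely my own assumption, not code from the repo.

# Illustrative way to log more than ten classes: toggle a +10 offset with an
# extra key so the digit keys also cover labels 10-19. Assumption, not repo code.
def select_mode_extended(key, mode, offset):
    number = -1
    if key == ord('n'):            # assumed toggle key for the +10 offset
        offset = 10 if offset == 0 else 0
    if 48 <= key <= 57:            # '0' .. '9'
        number = key - 48 + offset
    if key == ord('k'):            # logging mode, as before
        mode = 1
    return number, mode, offset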
Info
Channel: Ivan Goncharov
Views: 100,391
Keywords: AI, Machine Learning, Deep Learning, YOLO, Darknet, YOLOv3, CNN, Neural Networks, Convolutional Neural Networks, Object Detection, Python, OpenCV, Computer Vision, Apps, AI App, CV, GANs, Face Detection, Dataset, GAN, Generator, TensorFlow Lite, Web Scraping, Hand Landmarks, Palm Detection, Hand keypoints, Mediapipe, Keras, Jupyter Notebook, Sign language recognition, Palm detection, Hand Gesture Recognition, Custom, Custom training, Custom Neural Network, Hand Gesture, Hand Gesture AI
Id: a99p_fAr6e4
Length: 71min 39sec (4299 seconds)
Published: Mon Mar 14 2022