Build a Deep Learning Model that can LIP READ using Python and Tensorflow | Full Tutorial

Captions
The field of machine learning is taking monumental leaps. Each and every day there's a new machine learning model pushing what we thought was possible — we've seen the likes of Whisper, ChatGPT, DALL-E 2 and large language models — so I wanted to take this opportunity to help you build your very own game-changer model: we're going to be teaching machine learning how to lip read. [Music]

What's happening guys, my name is Nicholas Renotte, and in this tutorial we are going to be building our very own machine learning model that is able to read lips. All of this code is going to be available, as well as the data, so you'll be able to build it up from scratch and get it up and running. Now, Nick, why do we need to build a machine learning model that's able to read lips? Well, this is almost like an extension of the sign language model that we built previously — it improves accessibility and gives society additional power to use machine learning for good. So how are we going to build it? We're going to be using a range of technologies: OpenCV to read in our videos, TensorFlow to build up our deep learning model, and then we're going to bring it all together and test it out so that we're able to decode what a person might be saying. And again, this is going to use a client conversation format, so you'll get an understanding of what we're doing at each point in time. Ready to do it? Let's go and have a chat to our client.

"Yo Nick, what's up?" "Johnny, yeah, not much." "Oh no, this isn't one of your ML startup ideas again, is it?" "Actually, uh..." "All right, what is it?" "I was hoping you could help me use machine learning to do lip reading." "Lip reading?" "Yup." "What are you, the FBI? Why lip reading?" "Well, it kind of goes hand in hand with the stuff that you did around sign language, but flipped around." "Okay, fair. Interesting." "I'll even give you ten percent of the company. I'm gonna need you to code it all, though." "All right, fine, let's do it — but not because of the ten percent, because of you guys. Hopefully you enjoy this tutorial." "First thing we need to do is install and import our dependencies." "Let's go get 'em, co-founder." "Let's do this."

We interrupt your regular programming to tell you that Courses from Nick is officially live. If you'd like to get up and running in machine learning, deep learning and data science, head on over to www.coursesfromnick.com to find the latest and greatest. I'm also going to be releasing a free Python for data science course in the upcoming weeks, so be sure to stay in the know. But if you're ready to hit the ground running, I highly recommend you check out the Full Stack Machine Learning course — seven different projects, 20 hours of content, all the way through to full stack, production-ready machine learning projects. Head on over to www.coursesfromnick.com, forward slash bundles, forward slash full stack ML, and use the discount code YOUTUBE50 to get 50% off. Back to our regular programming.

Alrighty, so the first thing we told our client we'd be doing was installing and importing some dependencies, so let's go ahead and do that. The first thing we need to do is install a bunch of dependencies, so our first line of code is `!pip install opencv-python matplotlib imageio gdown`. We also need to add TensorFlow to that list if you don't have it installed already.
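As a rough sketch, that install cell might look like the following — the version pins are optional and just match the versions I list with `pip list` a little later:

```python
# Jupyter cell: install the dependencies used throughout the tutorial.
# Version pins are optional; they match the versions reported later in the video.
!pip install opencv-python==4.6.0.66 matplotlib==3.6.2 imageio==2.23.0 gdown==4.6.0 tensorflow==2.10.1
```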
We're going to be using OpenCV to pre-process our data — and I'll show you where to get the data in a second; I've actually built a little script that downloads it for you from my Google Drive. matplotlib is going to be used to render the results, so we'll be able to see the outputs of our pre-processed videos. We're going to use imageio to create a quick little GIF, so you'll be able to see a couple of the frames stacked together — it looks pretty cool. gdown is going to be used to download our dataset; it works really seamlessly with Google Drive, and I think that's what I'm going to start doing for datasets going forward. And TensorFlow is going to allow us to build a deep neural network. So if we go and run that cell, it should go ahead and install all of our dependencies — that's looking pretty promising.

If we want to take a look at the versions of the libraries we're going to be using (we really should move this cell up), we can run `pip list`. So we installed opencv-python — let's enable scrolling and take a look at the version we're using: 4.6.0.66, pretty good. matplotlib, the version we're using is 3.6.2. imageio — where's that bad boy — over here, 2.23.0. For gdown we're using 4.6.0, and for TensorFlow we're using 2.10.1.

Also, this code is going to be available on GitHub. If you jump on over to GitHub, go to nicknochnack and go to repositories, this file is under LipNet — it's private right now, but I'm going to make it public. If you take a look, I've included pre-trained model checkpoints, so you don't need to train this yourself; you can kick things off with a pre-trained model, and I've also included the Jupyter notebook. So just a quick git clone of that repository and you'll be able to get started with all the code that you see here. Okay, that's the packages we need installed — that is now done.

The next thing we need to do is import a bunch of dependencies. The first line is `import os` — this just makes it a lot easier to navigate and traverse through different file systems, and it works a lot more seamlessly whether you're on a Windows machine or a Linux machine (there are a few nuances I had to handle, particularly for the data splitting, but I'll explain that a little later). The second line is `import cv2` — this gives us OpenCV, which is needed to pre-process and load up our videos. Then we've got `import tensorflow as tf`, which is going to be our primary deep learning framework, and we're going to use TensorFlow data pipelines as well. If you take a look at tf.data: now, this can be a little bit tricky, so I always try to evaluate whether something is better done in TensorFlow versus PyTorch — who knows, maybe at some stage I'll transition — but I've got this working. tf.data is a great data pipeline framework; it allows you to transform data. It can be a little bit fiddly at times, so sometimes you do need to do slightly nuanced stuff — which is going to be true regardless of which deep learning framework you're using — but we're going to be using a proper TensorFlow data pipeline, which is probably a lot more closely aligned with proper machine learning operations.
So we are going to be building up our data pipeline using the tf.data API. The next thing we're importing is numpy — `import numpy as np` — it's always good to have numpy along for the ride if you need to pre-process any arrays. I've then imported typing; this is something I've personally taken on as a bit of a challenge or stretch goal this year, to start using type annotations a little bit better. I'm not great at it, but I am improving — hence why we're going to be using the List type annotation. We've then imported matplotlib, so we've written `from matplotlib import pyplot as plt` — this is going to allow us to render the pre-processed or post-processed output of our data loading function. And then I've also brought in imageio — there's a one-liner that lets you convert a numpy array to a GIF, which looks really cool and lets you see what you've actually pre-processed, which is particularly useful when you're dealing with videos. So those are our imports: import os, import cv2, import tensorflow as tf, import numpy as np, typing, matplotlib and imageio. Our dependencies are now imported.

The next thing we need to do is prevent exponential memory growth. If you're running this on a GPU — which I highly recommend, whether that be Colab, some other cloud service, or a CUDA-enabled GPU on your own machine — I highly recommend you run this line, because it prevents your machine from sucking up all the memory and hitting out-of-memory errors. The full line — and you've probably seen me use this a bunch in other deep learning videos — is: first we get all of our physical devices, so `physical_devices = tf.config.list_physical_devices('GPU')`. If I grab this line here — actually, we need to run our imports first, so let's let that run. Five minutes later... Now, if I run this, we can see which physical devices we have on the machine, and you can see I've got my one GPU showing up there. Then we say we want to prevent any exponential memory growth, so `tf.config.experimental.set_memory_growth`, which we apply to the one GPU that we've got here and set to True. If we do have a GPU, we'll be able to successfully set that; if we don't, we just pass. We can then go and run this — and you need to do it pretty much straight away, before you do any modelling, otherwise it won't take.

Okay, that's our set of dependencies installed and imported. Just to recap: we've installed opencv-python, matplotlib, imageio, gdown and TensorFlow, then imported all of our dependencies, and we've also set memory growth to True for TensorFlow, which is particularly applicable if you're training on a GPU. Those are our dependencies now installed and imported.
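As a minimal sketch, the imports and the memory-growth guard just described might look like this (assuming a single CUDA GPU at index 0):

```python
import os
import cv2
import tensorflow as tf
import numpy as np
from typing import List
from matplotlib import pyplot as plt
import imageio

# Keep TensorFlow from reserving all GPU memory up front.
# Run this before building or training any model, otherwise it won't take effect.
physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except (IndexError, RuntimeError):
    pass  # no GPU found, or memory growth was already configured
```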
Back to our client. "So we're going to be working with the GRID dataset for this." "Nice. Is this something we'd eventually be able to use with a custom dataset — say, ourselves?" "Sure, I've actually got this planned. We just need to capture frames of a person speaking, then use a speech-to-text model to transcribe what they're saying. That dataset could then be subbed into the model training pipeline. Let me know if you want that tutorial." "Ah, got it — so the GRID dataset for now." "Yep. We need to build two data loading functions: one to load the videos and one to load the aligned transcriptions." "Got it, let's roll."

Alrighty, so now that we've installed and imported our dependencies, the next thing we want to do is build our data loading functions. There are two key data loading functions to build here: the first loads up our videos, and the second pre-processes our annotations — and our annotations in this case are the sentences that the person in the videos is actually speaking. The dataset we're going to be using is an extract of the original GRID dataset, which was built specifically for training lip reading models. I've made your life a little bit easier by loading this data into my Google Drive, so you'll be able to download just the specific parts that I use and build this yourself.

First things first, we need to import gdown, so that full line is `import gdown` — gdown is a library that makes it super straightforward to grab data out of Google Drive. Once you've got that, the next thing we do is download the data itself: we output it into a file called data.zip and then extract it all into its own separate folder. So the full line is `url = ` and then this specific URL here — you could grab that, paste it into your browser and download the dataset manually, but we're going to use Python to do it because it makes your life a ton easier. We output the file to data.zip, we use gdown.download — to which we pass the URL and the output file name, and we set quiet equal to False — and then we extract it, because we're downloading a zip file and we need it unpacked, using gdown.extractall with data.zip. If I go and run this, you'll see it should start downloading our data — and there you go, we're now downloading. It's around about 423 megs. This is only one speaker; the original GRID dataset has something like 34 different speakers, so if you wanted to extend this way further using the GRID dataset, you definitely could — but I'm going to take this in a different direction later on, and we're actually going to grab data of ourselves and train on that. So let's let that download and then we'll be able to kick things off.

Come here, want to know a secret? Are you looking for your next dream job in data science, machine learning, deep learning, or just data in general? Well, you need to join Jobs from Nick. Each and every week I send you a curated list of the best jobs in data — jobs with great perks, great people and great roles — plus you'll get access to exclusive content like AMAs, interviews and resume reviews. So why not join? Plus it's completely free — link is in the description below. What are you waiting for? All right, back to the video. A few moments later...
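A sketch of that download cell — the Google Drive URL below is just a placeholder, since the actual link isn't spoken in the video; substitute the one from the notebook or repo:

```python
import gdown

# Placeholder URL -- swap in the Google Drive link from the notebook/repo.
url = 'https://drive.google.com/uc?id=YOUR_FILE_ID'
output = 'data.zip'

gdown.download(url, output, quiet=False)   # downloads roughly 423 MB (one speaker)
gdown.extractall('data.zip')               # unpacks into ./data
```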
Alrighty, that is our data now downloaded — you can see that we've successfully downloaded it there. If we open this up, you'll see we've got our data.zip file, and we've also got this new folder called data, which is what the extract function will have created. (An app is something I've got planned for the future — if you want to see this as part of a Code That episode inside of a full stack app, let me know in the comments.) But we're most interested in this data folder over here. If we open it up — and this is what this code cell creates — we've got a folder called alignments, and inside of that a folder called s1, and this holds all of our annotations. They're in this .align file format, which is interesting to say the least. If you open them up, this is what an annotation looks like: these specific videos are really about moving certain things to certain places, so 'sil' means silence, and then we've got these different commands, like 'bin blue at f2'. If we take a look at another one — let's scroll down — there's 'lay white by f5 again'. So it's not necessarily the kind of sentence you'd encounter out in the real world, but we're definitely going to be able to train a model to decode it purely from video, with no audio. Again, 'sil' means silence, so we're going to strip those out when pre-processing the annotations — what we really want to extract is 'lay white by f5 again', and in the other case 'bin blue at f2 now'.

Now if we take a look at the videos, we've got those in here as well. If I jump into the s1 folder — so root folder, then data, then s1 — you can see I've got videos there. They won't play inside a Jupyter notebook, so let's open them from the file explorer instead: go into the folder I'm currently working in, into data, into s1, and you can see all of these MPG files. These are our videos. If I play one — let me chuck my headphones on — 'bin blue at f4 please'. So we've got different videos of a particular person saying something, and if I play another — 'place blue by c7 again' — you've got matching annotations as well. If we open up a specific annotation — go to alignments, s1, and open this one, bbaf2n — it should be the annotation for the matching video: 'bin blue at f two now'. If we find that particular video, which should be the first one — let me zoom in — bbaf2n — and play it: 'bin blue at f2 now'. Right, so it just said bin blue at f2 now. This is the data we're going to be working with.

Now that our data is downloaded, the next thing we want to do is get it into a data loading function. I've written this function called load_video, which takes a path and outputs a list of floats representing our video. First we create a cv2.VideoCapture instance, which takes in our path, then we loop through each of the frames and store them inside an array called frames. We then calculate the mean and the standard deviation, and standardize — or scale — our image features by subtracting the mean and dividing by the standard deviation.
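Pulling those steps together, a sketch of what load_video might look like — note it also includes the static mouth crop and the grayscale conversion, which I explain in just a second:

```python
import cv2
import tensorflow as tf

def load_video(path: str) -> tf.Tensor:
    # Read every frame of the clip with OpenCV.
    cap = cv2.VideoCapture(path)
    frames = []
    for _ in range(int(cap.get(cv2.CAP_PROP_FRAME_COUNT))):
        ret, frame = cap.read()
        # Convert to grayscale and crop a static box around the mouth region.
        frame = tf.image.rgb_to_grayscale(frame)
        frames.append(frame[190:236, 80:220, :])
    cap.release()

    # Standardize: subtract the mean and divide by the standard deviation.
    frames = tf.cast(tf.convert_to_tensor(frames), tf.float32)
    mean = tf.math.reduce_mean(frames)
    std = tf.math.reduce_std(frames)
    return (frames - mean) / std
```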
I'm also doing something in that function which effectively isolates the mouth region. I'm doing it with static slicing — basically saying go from position 190 to 236 and position 80 to 220 to isolate the mouth. There is a slightly more advanced way to do this, using a face detector to extract the lips, which is what the original LipNet paper actually does — if I show you the LipNet paper, they use, I think, dlib to extract the mouth. If I search for dlib within the paper — yeah, they use dlib to isolate the mouth. I've just done it statically to keep this relatively straightforward, but that's effectively what's happening there. So: we loop through every frame of the video, store the frames in our own array, convert them from RGB to grayscale — which means we have less data to pre-process — and isolate the mouth using that static slicing. Then we standardize: we calculate the mean and the standard deviation — it's just good practice to scale your data — cast to float32, and divide by the standard deviation. So if we run that, our load_video function is done.

Next, we define our vocab. The vocab is really just every single character we might expect to encounter within our annotations — 'bin blue at f2 now' — plus a couple of numbers, just in case we need them. If I run that, we can take a look at our vocab, and you can see it's just a list containing each potential character. Now, the cool thing is that we can use the Keras StringLookup layer to convert our characters to numbers and our numbers to characters. Over here you can see I've got char_to_num — this is originally from the Keras CTC ASR tutorial (it uses the same loss function to do automatic speech recognition); I thought it was a really neat way to do it and it keeps everything nice and clean. So we've got two lookups here, char_to_num and num_to_char: the first takes a character and converts it to a number, and the second takes a number and converts it to a character. It makes your life a ton easier when converting text to numbers and back again. If I type char_to_num and pass through, say, 'a', 'b', 'c', you can see it converts them to 1, 2, 3. If I type in 'n', 'i', 'c', 'k' — they should be comma separated — you can see it converts each character to an integer. So this is effectively tokenizing our dataset (not one-hot encoding it) and returning a token value, or effectively an index, for each character. We'll be able to pass this data to our loss function to calculate our overall loss, because our model is going to be returning a one-hot encoded version of it. Likewise, we can decode: if I use num_to_char and pass through the array 14, 9, 3, 11, we should get the reverse, which gives back 'nick' — boom, and you can see it there. It's a byte-encoded value, but you can see we've got n, i, c, k.
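A sketch of the vocabulary and the two lookup layers, following the Keras CTC/ASR pattern mentioned above — the exact vocab string is an assumption based on the characters discussed in the video:

```python
import tensorflow as tf

# Every character we expect to see in the alignments, plus digits and a few symbols.
vocab = [x for x in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]

char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
num_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)

print(char_to_num(['n', 'i', 'c', 'k']))   # -> [14  9  3 11]
print(num_to_char([14, 9, 3, 11]))         # -> [b'n' b'i' b'c' b'k']
print(char_to_num.vocabulary_size())       # -> 40 (39 characters + the OOV token)
```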
So these are going to be our lookup functions that allow us to convert and re-convert our text to encodings. That's our vocabulary defined. The full line is `char_to_num = tf.keras.layers.StringLookup` — we pass through our vocabulary and we set an out-of-vocabulary token, so if it encounters a character it hasn't seen before it just becomes a blank value. Then we do the opposite: we create num_to_char, which is also `tf.keras.layers.StringLookup`, and we pass through the vocabulary from the first lookup — if I type `char_to_num.get_vocabulary()`, boom, you can see it returns all of our unique characters, beautiful. Again we set an out-of-vocabulary token, and we use `invert=True` to say that we want to convert numbers to characters, not the other way around. Then we print out our vocabulary and its size.

Next, we use a function to load up our alignments — our alignments being those .align files. It takes in a specific path, which will eventually map through to alignments/s1. We open up that path and split out each of the lines; if the token is 'sil' — silence — we ignore it, because we don't need it. We then append the words into an array called tokens, split them out into individual characters, and convert them from characters to numbers.
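A sketch of that load_alignments function — it assumes the char_to_num lookup defined above, and that each .align line has the form "start end word":

```python
import tensorflow as tf

def load_alignments(path: str) -> tf.Tensor:
    with open(path, 'r') as f:
        lines = f.readlines()
    tokens = []
    for line in lines:
        line = line.split()
        # Each line looks like "<start> <end> <word>"; skip the 'sil' (silence) markers.
        if line[2] != 'sil':
            tokens = [*tokens, ' ', line[2]]
    # Join into one string, split into characters, map characters to token ids;
    # [1:] drops the leading space added by the loop above.
    chars = tf.strings.unicode_split(tf.strings.reduce_join(tokens), input_encoding='UTF-8')
    return char_to_num(chars)[1:]
```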
Now there's one last thing we need before we can test this out: we need to load the alignments and the videos simultaneously, so we extract both of those paths and return the pre-processed videos and the pre-processed alignments together. That's what the load_data function does: it takes in the path to our video, splits it out so that we have a video path and an alignment path, then uses both of our functions — load_video and load_alignments — and returns the frames and the alignments from each of them.

If I run that, we can grab a test data path — let's just grab this particular video, our first one, bbal6n — and pass it to load_data. I'm going to wrap that path inside tf.convert_to_tensor, which just converts a raw string into a TensorFlow tensor — so `tf.convert_to_tensor(test_path)` gives us a tensor back. To grab the tensor's value we call .numpy(), and then .decode('utf-8') to get the raw string. Inside load_data, we then split that path. If you're running this on Windows, you're fine to run it as is; if you're on a Linux or Mac machine, comment this line out and uncomment this one — I had to play around with it when running on Colab versus my Windows machine, and that's the only change you need to make on a different type of machine. I'm going to comment that out and leave the Windows bit in.

What we're effectively doing is grabbing the string and splitting it — if I call .split on the double backslash, we can unpack the entire path. What we actually want is the file name, because we're going to grab the matching alignment, which will be called bbal6n.align and sits in a slightly different folder. That's exactly what these three lines do: we grab the last value, which is index -1 (you can see we've now got the file name), then split again on the dot, so we've got the file name and the file extension, and then we grab the file name by taking the first index. We then build the paths using os.path.join — the video path and the alignment path. Remember, if we take a look at our data folder (it's freezing up a bit): inside data we've got a folder called alignments and a folder called s1; s1 contains all of our videos, and alignments contains an s1 folder with all of the alignments for speaker one — we've only got one speaker because I've cut down the dataset.

So if we run this load_data function, it should return our pre-processed video as well as our pre-processed alignment, which we'll then be able to use inside our deep learning pipeline. Okay, cool, take a look: we've got a tensor back which is 75 frames in length, 46 pixels high by 140 pixels wide by one channel, because we converted it to grayscale. What do we have in the next cell? That's the mappable function — but let me quickly run you through this bit first. What we get back from load_data is frames and then alignments. If we take a look at frames, that's our frames data — you can see the shape there. If we want to plot an example, I can run plt.imshow on a single frame, and you can see the person's mouth right there — pretty cool, right? This lets you look at all the frames we're going to process, and as we step through them you'll see the mouth move — if I jump ahead to frame 40, you can see the lips moving. This is also the impact of subtracting the mean and dividing by the standard deviation: we're really isolating the regions you can see highlighted in yellow here. Now, if we take a look at the alignments, that's the encoded representation of what's being said, so let's run it through num_to_char — remember, those are our lookup functions that convert numbers back to characters — and see if we can decode it.
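Putting that together, a sketch of load_data with the Windows-style path splitting described above — it assumes the load_video and load_alignments functions from earlier, and the example file name bbal6n from the video:

```python
import os
import tensorflow as tf

def load_data(path: tf.Tensor):
    # The path arrives as a TensorFlow tensor, so pull out the raw string first.
    path = bytes.decode(path.numpy())
    # Windows-style split; on Linux/Colab use path.split('/')[-1] instead.
    file_name = path.split('\\')[-1].split('.')[0]
    video_path = os.path.join('data', 's1', f'{file_name}.mpg')
    alignment_path = os.path.join('data', 'alignments', 's1', f'{file_name}.align')
    frames = load_video(video_path)
    alignments = load_alignments(alignment_path)
    return frames, alignments

# Quick test on one clip:
test_path = '.\\data\\s1\\bbal6n.mpg'
frames, alignments = load_data(tf.convert_to_tensor(test_path))
print(frames.shape)   # (75, 46, 140, 1) for this clip
```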
So if we grab the value out of the alignment and loop through it — for x in there, return x.decode('utf-8') — hmm, 'has no attribute decode'. Oh, I think we need to grab the numpy value first — there we go — and then we should be able to decode as utf-8. There we go, much better. So this is the result of our transformation: you can see it says 'bin blue at l6 now'. That's the result of transforming our alignment — I've gone the long-winded way around, not exactly the most efficient, but it shows the result. There's a way to condense this down as well: I think it's tf.strings.reduce_join — ah, unmatched bracket, what have we done there — boom, there you go. Same result: 'bin blue at l6 now'. That's us undoing all of our transformations; the raw representation is just the alignments tensor — each of these individual characters — but remember, they represent this specific sentence, 'bin blue at l6 now'.

Okay, that's our set of alignments done. The last thing we do here is wrap this inside a mappable function, which returns float32s and int64s and lets us use raw string processing. This is one of the nuances of dealing with a TensorFlow data pipeline: if you want to use pure string processing, you've got to wrap it inside a tf.py_function. If we run that, the next thing we'll be able to do is create our data pipeline.

Let's quickly recap: we've downloaded our data using gdown, created a pre-processing function to load our video, defined our vocabulary, defined a char_to_num lookup, a num_to_char lookup, a load_alignments function and a load_data function, then tested it out using our test path. You can see we're now returning a bunch of frames showing the person's mouth — which should show the mouth moving when we stack the frames together — and we've converted our alignments: this was our raw alignment, and we've now converted it into an encoded sequence we can pass through to our machine learning model. We've also created a mappable function, which we're going to need for our data pipeline in a second.

Alrighty, let's jump back over to our client. "So that's our data loaded, right?" "Right-ish. We need to build a data pipeline — this will be used to train the deep learning model. TensorFlow will draw random samples from our dataset in order to complete one training step." "Oh okay, anything else?" "Yeah, we also need to look at the data to make sure our transformations have worked successfully." "Nice, off to the pipeline we go then."

So we're now on to creating our data pipeline. The first thing in the notebook is importing matplotlib, which I think we already had imported — yep, we did; that's me importing things multiple times, ignore that. Most importantly, we're going to create our data pipeline, and this is probably one of the most important bits of this entire thing: creating the neural network is great, but actually having a data pipeline is just as important.
First things first, we create our dataset, and to do that we run tf.data.Dataset.list_files. This looks inside our data folder, inside our s1 folder, for anything ending in .mpg — the file format our videos are stored in; it's quite old, but it definitely still works. We then shuffle with data.shuffle, specifying a buffer size of 500, so it'll grab the first 500, shuffle those up and return values out of that. Then we map it. At its base, we just have raw file paths — if I comment the map out and run data.as_numpy_iterator().next(), it just returns a file path, which then needs to be passed through to our load_data function to do the splitting and run our two sub-functions, load_video and load_alignments. That's exactly what the map step does: even after shuffle, you're still only getting back file paths — it isn't returning data yet — which is where the map function comes in. So `data = data.map(mappable_function)`, where the mappable function just wraps our load_data function inside a tf.py_function so we can work with our file path strings. Now, if I run this cell, we actually get our data back: frames and alignments. If I take a look at frames — boom, that's our set of frames; we can run plt.imshow on one frame — we need to close that figure — and you can see we're now getting data back out of the pipeline. If I take a look at the alignments — boom, we're getting those back too.

Then we want to pad this out, because right now the alignments have variable lengths: if I run it again, you can see they're all different lengths, because there's a different number of characters in each set of alignments. So we convert these to a padded batch: we override our pipeline with `data = data.padded_batch(...)`, batching into groups of two, so each batch has two videos and two sets of alignments. We then pad out our shapes: we're not really padding the videos — we just ensure we have 75 frames per video, without padding the image itself — and we ensure we have 40 tokens per alignment; if there are fewer than 40, it gets padded out with zeros. Then we prefetch, to optimize the pipeline so that data is pre-loading while the machine learning model is still training. If we run this full pipeline — brilliant — we can then run this line here, which now loads two videos and two sets of alignments. You can see we've got two sets of alignments, with trailing zeros at the end because it's padding them out.
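A sketch of the full pipeline, including the tf.py_function wrapper — it assumes the load_data function from earlier (the reshuffle_each_iteration flag is the one I mention adding a bit later, so the train/test split stays stable):

```python
import tensorflow as tf

def mappable_function(path):
    # Wrap load_data in tf.py_function so we can do raw string handling
    # (path splitting, file reads) inside a tf.data pipeline.
    return tf.py_function(load_data, [path], (tf.float32, tf.int64))

data = tf.data.Dataset.list_files('./data/s1/*.mpg')
data = data.shuffle(500, reshuffle_each_iteration=False)
data = data.map(mappable_function)
# Pad to fixed shapes: 75 frames per video, 40 tokens per alignment, batches of 2.
data = data.padded_batch(2, padded_shapes=([75, None, None, None], [40]))
data = data.prefetch(tf.data.AUTOTUNE)

frames, alignments = data.as_numpy_iterator().next()
print(frames.shape, alignments.shape)   # e.g. (2, 75, 46, 140, 1) (2, 40)
```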
Likewise, if we take a look at the frames — we should have two sets now; if I type len(frames), boom, we've got two videos inside each batch. Okay, that's looking brilliant.

Now we can actually run through this. I'm going to run data.as_numpy_iterator() — this lets you iterate through, exactly the same as what we're doing up there — and by calling .next() we get a sample back, where index 0 returns our frames. Then, this is my favourite function: imageio.mimsave actually converts a numpy array to a GIF. If I run this line, it takes our sample — which we defined over here — grabs the second video out of the batch (I could set this index to zero or one, because we've got two videos per batch) and converts it into a GIF. If I run it, inside your folder you should now have a file called animation.gif, and you can see that this is what our LipNet model is going to learn to decode: purely based on the GIF you're seeing, it's going to learn to decode what the person is saying and convert it to text. That's the amazing thing about this model — we'll be able to take nothing but this type of data and convert it into a sentence. And this gets even better once we move on to our own dataset, which will probably come in another tutorial, but I wanted to get this base one out first. So that's what imageio.mimsave does.

We can then plot out our image, which you've sort of seen already, with plt.imshow. What we're doing here is grabbing our sample — which we just created over here — and let me explain the indexing, or subscripting: the first zero says we want the videos, the next index says give me the first video out of the batch, and the third index says give me a particular frame — my head's blocking that frame in the video. If we wanted the last frame I could pass through 74, because remember we've got 75 frames per video, so that's the last one; we could grab right in the middle, index 35, where you can see the mouth moving; and I could even grab the second video by changing the batch index to one — you can see that's a completely different video. Then we've also got our alignments, which we already looked at — and oh wow, I didn't need to do all of that decoding earlier, I knew I had a more efficient way to write it: tf.strings.reduce_join, looping through every token inside our alignment, and the end result is 'bin white by n two now', which is the annotation for the first video over here. If we take a look — right now this is grabbing the second video, so if we create the GIF for the first video instead and reopen our animation — this animation that you can see here is the representation we're going to be passing through to our neural network, and it actually represents 'bin white by n two now'.
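A sketch of the GIF export, the frame indexing, and the reduce_join decode described above — it assumes the pipeline, plt and num_to_char from the earlier cells; note the pixel rescaling for the GIF is my own rough normalisation, since the standardized frames aren't in 0-255 any more:

```python
import imageio
import tensorflow as tf
from matplotlib import pyplot as plt

sample = data.as_numpy_iterator()
val = sample.next()                      # val[0] = videos, val[1] = alignments

# Save one video of the batch as a GIF so we can eyeball the mouth crop.
video = val[0][1].squeeze(-1)            # (75, 46, 140), drop the channel dim
video = ((video - video.min()) / (video.max() - video.min()) * 255).astype('uint8')
imageio.mimsave('./animation.gif', list(video))

# Indexing: [videos][which video in the batch][which frame]
plt.imshow(val[0][0][35])                # middle frame of the first video

# Decode the first alignment back to text (trailing zero padding maps to '').
print(tf.strings.reduce_join([num_to_char(word) for word in val[1][0]]))
```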
So this is almost like moving chess pieces — it's not chess pieces, but that's the sort of feeling you get; that's the kind of command the person is actually communicating in this animation. We're going to produce a deep learning model which takes this input and is able to output 'bin white by n two now'. Pretty cool, right? Anyway, that is our data pipeline now built, and it was relatively straightforward: we created a TensorFlow dataset, tested it out using the .as_numpy_iterator() method and the .next() method to grab the next batch, used imageio.mimsave to convert a numpy array into a GIF so you can see what it looks like, and we've also taken a look at the pre-processed images and annotations.

Alrighty, that's our data pipeline ready for training. Remember, we've got a data pipeline over here — we're not going to split it into training and validation, although you definitely could, and we'll just run on this dataset. Best practice is to split into training and validation partitions, and if you do that as part of the tutorial, let me know and I'll add it to the GitHub repository. Okay, that is our data pipeline ready.

Alrighty, on to modelling. "I'm destined for the catwalk, man." "Bruh." "Seriously though, check out my pout face — I'm the male embodiment of Bella Hadid." "Yeah, well, we've got to build this model now. We're going to use 3D convolutions to parse the videos and eventually condense it down to a classification dense layer which predicts characters." "So single letters at a time?" "Yep. We'll use a special loss function called CTC, AKA connectionist temporal classification, to handle this output." "Interesting — why use that loss function?" "Well, it works great when you have word transcriptions that aren't specifically aligned to frames. Given the structure of this model, it's likely to repeat the same letter or word multiple times; if we used a standard cross-entropy loss function, this would look like our model is way off. CTC is built for this and reduces the duplicates using a special token." "But our dataset was aligned?" "Yeah, bang on — but when it comes to eventually subbing out the data with data that we create, it's going to be way more cost effective to simply use non-aligned data. Our model is going to be ready for it." "Ah, got it. After the catwalk, then."

So the next thing we need to do, as we told our client, is design our deep neural network. We're not going to be on the catwalk, but we are going to be working in TensorFlow. First things first, we import our dependencies — there's quite a fair few here (man, I need to clean up some of those imports). First up we're importing the Sequential model class, so `from tensorflow.keras.models import Sequential`. Then we're importing a bunch of layers, so `from tensorflow.keras.layers import Conv3D` — this is a 3D convolution; the Conv3D layer in TensorFlow is absolutely brilliant when working with videos: it performs a 3D convolution, or spatial convolution over volumes, which is used quite a fair bit for video processing and video classification. We're then going to be using an LSTM — this is going to give us our recurrent neural network.
Eventually I want to convert this over to a Transformer neural network, so that we're moving towards state of the art. We're using a Dense layer, a Dropout layer, and a Bidirectional layer to pass the temporal component through in both directions when we use our LSTM. We've also got — and I think we need to clean this up — MaxPool3D, Activation, Reshape, SpatialDropout3D, BatchNormalization, TimeDistributed and Flatten; I don't actually use all of those, there might be leftovers from prototyping, but we'll take a look. Then we've got our optimizer, `from tensorflow.keras.optimizers import Adam`, and our callbacks, `from tensorflow.keras.callbacks import ModelCheckpoint` — which lets us save down our model every so many epochs (I think we're doing it every single epoch) — and LearningRateScheduler, because ideally we want to start out fast and then slow down as we approach our optimization point, the minimum loss we can get to.

Then we've got our neural network. It's a couple of sets of convolutions, then we flatten it out using a TimeDistributed layer, we've got two sets of LSTMs, and then we use a dense layer to output our characters. Let me walk you through it. First we instantiate the model with `model = Sequential()`. We then pass through a convolution with a ReLU activation and a max pooling layer — I could condense this down by just passing the activation argument to the conv layer; this was prototyping in process. So I've written `model.add(Conv3D(...))`: we have 128 Conv3D kernels, three by three by three in size, and our input shape is 75 by 46 by 140 — the representation we've got from our data. Remember, if we run data.as_numpy_iterator().next() and grab the first video's shape, you can see it's 75 by 46 by 140 by 1, and we're passing that exact shape into our neural network. We specify padding='same' so that we preserve the shape of our inputs, then use a ReLU activation to give the network some non-linearity, and condense it down with a 3D max pooling layer — this takes the max value inside a two-by-two window within each frame, halving the spatial shape of our inputs. Then we do pretty much the same thing twice more, except the next block has 256 Conv3D kernels and the one after that has 75. Then we've got this TimeDistributed layer over here, which effectively keeps 75 time steps going into our LSTM layers, so that we eventually output 75 steps representing our text-based characters. We've then got two LSTM layers of 128 units with a specific form of kernel initialization — I found a great repo showing the pure LipNet model, and they were using orthogonal kernel initialization — and we're also returning sequences, so that the LSTM layer doesn't just return a single output, it returns all 75.
We're also specifying Bidirectional, so we pass our state from left to right and right to left, because it's likely to affect how we translate this out — I believe it's best practice, and it's what was originally done in the paper. Actually, let me show you in the paper: if we scroll up — there you go — they're using a GRU as opposed to an LSTM, so they've got a spatio-temporal convolutional neural network and a bidirectional GRU; we've got a bidirectional LSTM, and they're using CTC loss as well. Then we've got Dropout after each of our LSTM layers, so there's a little bit of regularization, and we're dropping out 50% of the units. And then we've got a dense layer which takes our vocabulary size plus one, to handle our special token — the vocabulary size is 40, so it'll be 41 outputs. This means our output is going to be 75 by 41: one output per frame that we pass through, where the 41 values are a one-hot-encoded representation of our final output. We're using a softmax activation, so we'll be able to take an argmax to get the most likely value — we'll also do a little bit of sampling, and I think we use a greedy algorithm later on so that we get the most likely character returned.

Okay, that's our deep neural network — let's go and create it. Ah, we haven't imported these yet, so let's run the imports, then create the neural network — that's instantiating right now, beautiful. Then we can run the summary, which shows in a bit more detail what we're actually building. Let me zoom out: we've got our convolutional layer, activation and max pooling layer, then again Conv3D, activation, max pooling, Conv3D, activation, max pooling, then our TimeDistributed layer. If you look at the last output from the conv layers, it has a shape of 75 — think of this as 75 time steps — by 5 by 17 by 75. We want to preserve that temporal component, keeping it at 75, and flatten the rest down — so we've got 6375 here, which is just those values flattened: 5 times 17 times 75 is 6375, which is what you've got there. Those values are then passed through to our LSTM layers, which show 256 units — actually they've got 128 units each, but it's doubled because they're bidirectional, so we've effectively got two sets of LSTM layers there. We've got our dropout here and here, and then we pass through to our dense layer, which outputs 75 frames by 41 outputs — one-hot-encoded representations of our characters. Total parameters are about 8.4 million — so it's sizable, but it's definitely no ChatGPT in that respect. That's our deep neural network created. Again, you can step through this and tweak it if you want; if you come up with a better architecture, by all means let me know.
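Putting the whole architecture together, a sketch of the model as described above — it assumes the char_to_num lookup from earlier, and the he_normal initializer on the output layer is an assumption on my part:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv3D, LSTM, Dense, Dropout, Bidirectional,
                                     MaxPool3D, Activation, TimeDistributed, Flatten)

model = Sequential()

# Three blocks of 3D convolution -> ReLU -> max pooling over the spatial dims only,
# so the 75 time steps are preserved all the way through.
model.add(Conv3D(128, 3, input_shape=(75, 46, 140, 1), padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1, 2, 2)))

model.add(Conv3D(256, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1, 2, 2)))

model.add(Conv3D(75, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1, 2, 2)))

# Flatten each time step (5 x 17 x 75 = 6375 features) before the recurrent layers.
model.add(TimeDistributed(Flatten()))

model.add(Bidirectional(LSTM(128, kernel_initializer='Orthogonal', return_sequences=True)))
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(128, kernel_initializer='Orthogonal', return_sequences=True)))
model.add(Dropout(0.5))

# One output per frame: vocab size + 1 = 41 classes, softmax over characters.
model.add(Dense(char_to_num.vocabulary_size() + 1,
                kernel_initializer='he_normal', activation='softmax'))

model.summary()
```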
I actually saw on Papers with Code, under the GRID corpus, that somebody might have already built this using attention — I haven't looked at it, but you can see they've got a CTC-attention model, which would be really interesting to dig into if you want to. There's also this model here on GitHub, which was the official one — a brilliant example of how to build this up. It's a bit more hardcore to step through and the architecture is a little different to mine, but that repository is a brilliant implementation of LipNet; mine just works through it with less data and is a bit more straightforward to walk through. I'll include that link in the description below, by the way.

All right, cool, let's test this out. It's going to suck at the moment, but I always like to do this when I'm prototyping a neural network: pass through some inputs just to see what we're outputting. So if I grab our model and use the .predict method, we can pass through our original sample, which we stored as val — so if we pass through val[0], we should get some predictions back. It might take a little bit of time because we're initializing and loading into GPU memory — give it a sec — perfect, we now have a prediction. If we take a look at the result, you can see it's just returning random gibberish right now: a three, a bunch of exclamation marks, a bunch of k's — nothing crazy. We're using exactly the same decoding as before — tf.strings.reduce_join — and a greedy approach, just grabbing the maximum prediction returned. If I show you the raw prediction: we're getting back 75 outputs, each one an array with 41 values, which is just a one-hot-encoded representation of our vocabulary. If I run tf.argmax with axis=1, that returns what our model is actually predicting — a whole bunch of characters. If we run those through our num_to_char pipeline — for x in that, num_to_char(x) — you can see all of our characters there, and if we run tf.strings.reduce_join — come on, buddy — boom, those are our predictions. This is almost identical to the cell above, with the slight difference of where I'm applying the argmax, but it's effectively showing what our model is currently predicting. It sucks right now; we're going to make it way better. We can also take a look at the model's input shape and output shape, which we already extracted when we ran model.summary. That is our deep neural network now defined.
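A sketch of that sanity check on the untrained model — it assumes the model, the val batch and the num_to_char lookup from the earlier cells:

```python
import tensorflow as tf

# Predictions from the untrained model will be gibberish at this point.
yhat = model.predict(val[0])                 # shape (batch, 75, 41)

# Greedy decode: take the argmax character at every frame and join them up.
decoded = tf.argmax(yhat[0], axis=1)
print(tf.strings.reduce_join([num_to_char(x) for x in decoded]))
```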
So if I scroll back up to recap: we've imported all of the core dependencies for our deep neural network, we've defined the network itself — which gives us a model with around 8.5 million parameters — and we're able to pass through our frames and get a set of predictions back out. Right now those predictions don't look so great, but keep in mind we haven't actually trained it yet; once we train it, we're going to get much better predictions. Alrighty, that's our deep neural network defined — let's jump back over to our client. "We're on the home stretch — we just need to define our loss function and a callback so we can see how the model is progressing." "Nice. Well, chop chop then, get coding." "Let's do it."

So we're now pretty close to training our model. The first thing we need to do is define a learning rate scheduler. This basically keeps the learning rate at whatever we pass through while we're below 30 epochs; after that, we drop it down using an exponential function. Alrighty, cool, that's defined. The next thing we're doing is defining our CTC loss. For this particular block of code I'm going to give original credit to that automatic speech recognition example, which I believe is referenced a little further down — where's their CTC loss — yeah, it's over here. It lets us use a similar method: they're passing in audio waves, we're going to be passing through videos. What we do is take the batch length, calculate our input length and our label length, and pass it all through to tf.keras.backend.ctc_batch_cost. There isn't a ton of documentation on this, funnily enough — a lot of you ask me, "Nick, should I learn TensorFlow or PyTorch?", and sometimes where I feel TensorFlow falls short is in the documentation around this kind of nuanced stuff. This is one example where I'd be like, I wish I'd learned PyTorch — but it definitely works very well regardless. So we've got CTC loss defined: it takes in our y_true value, our y_pred value, our input length — which is the length of our prediction, so 75 — and our label length, which is 40. So y_true is going to be our alignments, y_pred is our one-hot-encoded predictions, the input length is 75 because that matches the output shape of our model, and the label length over here is going to be 40.
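A sketch of the scheduler and the CTC loss, following the Keras ASR example that's being credited here:

```python
import tensorflow as tf

def scheduler(epoch, lr):
    # Hold the learning rate for the first 30 epochs, then decay exponentially.
    if epoch < 30:
        return lr
    return lr * tf.math.exp(-0.1)

def CTCLoss(y_true, y_pred):
    batch_len = tf.cast(tf.shape(y_true)[0], dtype='int64')
    input_length = tf.cast(tf.shape(y_pred)[1], dtype='int64')   # 75 time steps
    label_length = tf.cast(tf.shape(y_true)[1], dtype='int64')   # 40 tokens

    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype='int64')
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype='int64')

    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```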
So that's our loss function defined. Then there's a fair bit of code here, but really what we're doing is outputting a set of predictions: we output the original annotation and then the prediction itself. To do that, we use a special function called tf.keras.backend.ctc_decode, which is specifically designed to decode the outputs of a CTC-trained model — we'll also use it when making predictions later. This is an example of a callback: I've written `class ProduceExample`, which subclasses the Keras Callback class so that it gets called at the end of every epoch. If we run that, we can then compile our model: we grab the model and run .compile, which is a standard Keras call. Basically we set our optimizer to Adam with an initial learning rate of 0.0001, and we set our loss to the CTC loss function we defined above. If we compile this — no errors — we're looking pretty good.

The next three things are just defining our callbacks. We've got one checkpoint callback which saves our model checkpoints — we imported ModelCheckpoint and LearningRateScheduler right up at the top, and now we're defining instances of them. ModelCheckpoint defines where we save the model, so we should probably create a folder for it — I'm going to create a new folder called models — and as the model trains it will save checkpoints into that folder, named 'checkpoint'. We're also monitoring our loss, and we're only saving the weights, which means we'll have to redefine our model architecture in order to load those weights back up. We then create a scheduler callback, which effectively drops our learning rate each epoch once we pass epoch 30. Let's run the checkpoint callback and the scheduler callback, and then we also define our example callback, which makes some predictions after each epoch so we can see how well the model is actually training. If we run that, all that's left is to fit the model. I'm going to bump the number of epochs up to 100, because the final model I'm going to give you the weights for was from epoch 96.
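Pulling the callback, compile and callback definitions together, a sketch might look like the following — it assumes the model, CTCLoss, scheduler, num_to_char and the data pipeline from earlier, a batch size of 2 (hence the [75, 75] sequence lengths), and the example callback gets repointed at the test partition once that exists:

```python
import os
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler

class ProduceExample(tf.keras.callbacks.Callback):
    """After every epoch, decode one batch so we can see how training is going."""
    def __init__(self, dataset) -> None:
        self.dataset = dataset.as_numpy_iterator()

    def on_epoch_end(self, epoch, logs=None) -> None:
        data = self.dataset.next()
        yhat = self.model.predict(data[0])
        # Greedy CTC decode; [75, 75] = sequence length for each of the 2 clips in the batch.
        decoded = tf.keras.backend.ctc_decode(yhat, [75, 75], greedy=True)[0][0].numpy()
        for x in range(len(yhat)):
            print('Original:  ',
                  tf.strings.reduce_join(num_to_char(data[1][x])).numpy().decode('utf-8'))
            print('Prediction:',
                  tf.strings.reduce_join(num_to_char(decoded[x])).numpy().decode('utf-8'))
            print('~' * 50)

model.compile(optimizer=Adam(learning_rate=0.0001), loss=CTCLoss)

os.makedirs('models', exist_ok=True)       # or create the folder by hand, as in the video
checkpoint_callback = ModelCheckpoint(os.path.join('models', 'checkpoint'),
                                      monitor='loss', save_weights_only=True)
schedule_callback = LearningRateScheduler(scheduler)
example_callback = ProduceExample(data)    # later swapped for the test partition
```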
So the last line is model.fit, and to that we pass through our data. If you wanted to have some validation data you could pass it through by specifying validation_data here. Initially I wasn't going to use validation data, which is not necessarily best practice, but I wanted to keep this relatively simple for the tutorial; you definitely could go and pass validation data through there if you wanted to. So: model.fit, we're passing through our data, we're specifying epochs as 100, and then we're passing through all of our callbacks. We'll specify callbacks so that we're passing through our checkpoint callback, which basically says we're going to save our model every epoch, our schedule callback, which is going to drop our learning rate after we get to epoch 30, and our example callback, which is going to output our predictions after each epoch so you can actually see how well, or how terribly, our machine learning model is making predictions. We're not actually going to run it for the full hundred epochs here; I'm going to run it for a couple so you can see that it's training, and then we'll be able to load up some checkpoints. So let's kick this off, and all things holding equal we should see our model training. Also, this is being trained on an RTX 2070 Super, so the speed you're seeing reflects that: it's taking around about four, four and a half, maybe five minutes per epoch. Let's give this a couple of epochs and then you'll be able to see what the predictions look like. A few minutes later... alrighty, that is two epochs now done, so you can see that this is epoch one and this is epoch two. Now I wasn't happy about the fact that I didn't have a training and a testing dataset, so I went back to the data pipeline and added those steps; let me show you what I did. If we scroll back up, I added three lines of code. First up, I said that we don't want to reshuffle after each iteration, so I added that to the data.shuffle line, and then I added these two lines here: we create a training partition by taking the first 450 samples, and our testing partition is going to be anything after that, so we run data.skip to grab everything else and assign that to our testing partition. Then inside of the model.fit method I've just gone and passed through our train data and set validation_data to test. I couldn't live with myself if I didn't actually go and split these out, so I went and did it to show you how to do it, so that you effectively have best practice, because that's what we're all about here: getting that little bit better each and every time. All right, cool, let's actually take a look at what's happened. This is epoch one here, so what you're seeing first up is the loss for our training dataset, and over here you've also got the validation loss. Our training loss is 69.0659 and our validation loss is 64.34, so not too bad and not too far off. If we scroll on down to epoch 2, our training loss is 65.58 and our validation loss is 61.24, our learning rate is still at 0.0001, and you can actually see some predictions here. Now when I was first developing this I was thinking, hold on, is this just
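A rough sketch of the split and the training call as described above; the shuffle buffer size is an assumption, while the 450-sample take/skip split and the three callbacks match what's discussed in the walkthrough:

```python
# in the data pipeline: keep the shuffle order fixed so the split doesn't change between epochs
data = data.shuffle(500, reshuffle_each_iteration=False)  # buffer size of 500 is an assumption
train = data.take(450)   # first 450 samples become the training partition
test = data.skip(450)    # everything after that becomes the testing partition

# train for 100 epochs with checkpointing, LR scheduling and example predictions each epoch
model.fit(train,
          validation_data=test,
          epochs=100,
          callbacks=[checkpoint_callback, schedule_callback, example_callback])
```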
performing like absolute crap? But what you'll actually see is that the closer you get to around about 50, 60, 70 epochs, the better this begins to perform. So in this line you can see the original transcription, place blue in B7 soon, and then what it predicted, which is kind of crap. Then we've got the one down here, again kind of crap compared to the original annotation, but give it a chance: once you get to around about 50 epochs the performance increases significantly and it actually starts performing very, very well. This brings me to my next bit: I've made the model checkpoints available to you so that you'll be able to go and leverage them yourself. That being said, you'll also now have some checkpoints stored inside of the models folder thanks to the checkpoint callback that we've already created, so we'll be able to use those too. But for now let's jump back on over to our client and then we'll be wrapping this up. It's the final countdown, some might say we're in the end game now. Yep, yes we are, time to use this model to make some predictions. Let's roll. Alrighty, we're in the final stages of this, so we are now going to make some predictions with a model that is not so crap. First things first, we're going to download the checkpoints. The checkpoints that I've made available on Google Drive are from after 96 epochs, so again we're going to be using gdown to go ahead and download them, and that's going to download a file called checkpoints.zip into our root repository. If I go and run this it should start downloading; it's around about 90-odd megs in size. As soon as you see it downloaded there, you can see it's going to throw it into our models folder. Boom, that's looking promising. It's now saved into our models folder and it's gone and overridden our existing checkpoints; those kind of sucked anyway, so that's perfectly okay. And then what we're going to do is load the checkpoints that we just downloaded into our model. Using model.load_weights we're going to load up the checkpoint that's inside of this models folder. If we go and run this, that's loaded, and we can then go and grab a section of our data. So let's grab test.as_numpy_iterator and grab another sample. This is a little bit slow, which is something that I noticed inside of the TensorFlow data pipelines; I think it's because we're using the skip method and the take method, and the skip method does take a little bit of time, but it's perfectly okay, just give it a sec and it'll give you a sample back, and then we'll be able to go ahead and make a prediction. Let's give this a sec. A few moments later... alrighty, we've now got some data back. I just spent a couple of minutes scrolling through Twitter while I waited for that, but we now have some data, and this is looking pretty promising. So if we now go and grab a sample out of that... oh, we've actually already got one on standby, I don't know why these two cells are there, they're redundant, we've already got this data. All right, so this is our sample; we can then pass our sample to our model.predict method. Boom, that has made our prediction, and if we go and decode it now, take a look, these are our predictions: you can see it's actually gone and written lay red with L7 again, lay green with G4 please. All right, drum roll please, let's actually see what the actual text was. Take a look: lay
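Downloading and restoring the trained weights, then predicting on a test batch, might look roughly like this; the Google Drive file id is a placeholder you'd swap for the link shared with the tutorial, and gdown.extractall is used here to unzip into the models folder:

```python
import os
import gdown

# placeholder id – substitute the checkpoint link shared with the tutorial
url = 'https://drive.google.com/uc?id=<checkpoint-file-id>'
gdown.download(url, 'checkpoints.zip', quiet=False)
gdown.extractall('checkpoints.zip', 'models')   # unzip straight into the models folder

# the weights were saved with save_weights_only=True, so load them back into the same architecture
model.load_weights(os.path.join('models', 'checkpoint'))

# grab one batch from the test partition (skip() makes this first call a little slow)
test_data = test.as_numpy_iterator()
sample = test_data.next()           # sample[0] = frames, sample[1] = alignments
yhat = model.predict(sample[0])     # raw per-frame character probabilities, decoded further below
```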
red with L7 again, lay green with G4 please. It actually performed pretty well! Take a look, this is the result of passing through those images, so it's actually decoding relatively well. Now if we wanted to do this with another video, we'd be able to load it up using a Streamlit app, and that's what I'm going to be doing inside of a Code That episode; if you guys want it, comment below, and make sure you give this video a like so it gets a little bit of airtime. Let's go and make another prediction. So if we grab another sample and run this again, those are our predictions, and this is... actually we should really swap this around, because this is the real text and these are the predictions. So the original text was set green by J1 soon, and you can see that we're grabbing the real annotation out of our sample and over here we're grabbing the decoded set, so this cell should really sit down there to make a bit more sense. All right, let's go and run another sample. So if I go and run this, the original text was lay red at E3 soon, and the second sample was place green by R3 again. Now if we go and run our decoder, let's minimize that and see what our model predicts... oh, we actually need to run it through the decoder first. Boom, take a look: lay red at T3 soon. So it said T3 soon and our actual text was E3 soon, so not too bad. What about over here? Place green by R3 soon versus place green by R3 again, so it's actually performing relatively well. My personal thinking, though, is that this doesn't fully tie it together; I still think we need the app. So again, if you reckon we should build the app, or build another tutorial for the app, let me know in the comments below and we'll build it up. But this is actually making valid predictions, so if we wanted to we could go and load up a new video and test this out. So if we say load_data and grab a video, for example, and this is completely unplanned so let's see if this works, I can go into s1, grab the path to a particular file, and pass through something like .\data\s1\ plus the file name. We also need to wrap it in tf.convert_to_tensor. All right, let's see how that goes. Okay, so that is our data, and this should be bbaf3s, so let's go and find that video. If we go into videos... actually we're not going into videos, we're going into the YouTube folder and then into the LipNet data, and what do we get, bbaf3s, inside s1. So this is the video; let me put my headphones on so I can hear it: bin blue at F3 soon. So bin blue at F3 soon is our data, and we're going to have our sample and we're going to have our text. Let's go and add another section, test on a video, and this is going to be outside of our existing data pipeline, so we'll test this out. We've now got a sample, and if we call it sample we should be able to reuse the prediction code without much change. So if I copy this and pass the sample through here, grabbing index zero... boom, okay, what's happened there? That's probably because we need to add a batch dimension, so if we run tf.expand_dims and pass through axis equals zero, boom, that should make a prediction. Cool, and then I'm going to copy this, paste
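Testing on a single video outside the tf.data pipeline, as described above, comes down to something like this; load_data is the loading function built earlier in the tutorial, and the .mpg extension and exact path are assumptions about how the data folder is laid out:

```python
import tensorflow as tf

# load one video (frames + alignments) directly, bypassing the train/test pipeline
sample = load_data(tf.convert_to_tensor('.\\data\\s1\\bbaf3s.mpg'))  # path and extension assumed

# the model expects a batch, so add a leading batch dimension before predicting
yhat = model.predict(tf.expand_dims(sample[0], axis=0))
```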
that there, and I'm going to copy this, paste that there, and copy this and paste it below again. Okay, let's minimize this, what have we got? We've got our sample, and this is us custom loading a specific video, the one we just played, which sounds like bin blue at F3 soon. Cool, so we should expect this to print out bin blue at F3 soon. If we now go and run this, it should make a set of predictions, and if we take a look at y_hat, that's our set of predictions; .shape tells us we've got 75 by 41. Now sample[1] is going to give us our real text, so for sentence in sample... uh, what have we done there, cannot iterate over a scalar tensor. So if we just go and wrap sample[1] it should be fine; it doesn't like this because it's not wrapped, so if I just wrap it inside of another set of arrays, okay cool, that is our real text: bin blue at F3 soon. Let's go and validate that, bin blue at F3 soon, beautiful. Okay, then if we go and make our predictions we need to run it through the decoder first, and I think we need to wrap this with tf.expand_dims... nope, no bueno. What's happening? Transpose expects a vector of size four but the input is a vector of size three; uh, this is only going to be 75 over here, what have we done? What's y_hat returning, a shape of 1 by 75 by 41? That should be okay. Okay, that worked: bin blue at J3 soon, so it wasn't too far off, not too bad. And this is actually on our own loaded video, so running it through this pipeline now, this is what was actually said, bin blue at F3 soon, and it predicted bin blue at J3 soon, so not so far off. Let's go and try another one... it's freezing up, come on... what about this one, p-r-a-c, let's copy this path. So we're doing the same dot-backslash style path again, which means you can do this on a separate video without having to use the data pipeline. If we go and run this now, take a look: the video said place red at C6 now and our model predicted place red at C6 now. Let's go check this one out, so prac6n, oh my gosh guys, how cool is this, p-r-a-c, C6: place red at C6 now. Hold on dude, show me it, show me it. Place red at C6 now, guys, that is bang on. As if that isn't absolutely awesome: it's able to load up a video, decode it, and use lip reading to actually transcribe it. Let's try another one, what about this one, set blue in A2 please. Let's go and copy this name and paste that in there. All right, let's just quickly play it again, and keep in mind our model doesn't get any audio, right, it's just using that little GIF that I showed you to be able to decode this. If I play it: set blue at A2 please, that's what we're expecting. All right, so this is our annotation, set blue at A2 please, and the prediction, set blue in A2 please. How absolutely amazing is that? Let's go find another one, let's get one from up here, what about lbid4p, and again you could try out a whole bunch of these videos. Let's go and play that video: lay blue in D4 please. So again let's go and take a look, this is the actual text, lay blue in D4 please, and to run it through our model, we've already made our predictions here, this cell should really be down there, it's running through our model: lay blue in D4 please. Guys, how absolutely amazing is that? It's actually making valid
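Decoding the raw model output back into text for this standalone video might look roughly like this; it assumes the num_to_char StringLookup layer from earlier in the tutorial and the yhat and sample variables from the previous sketch:

```python
import tensorflow as tf

# greedily decode the 75 output steps of the single-video prediction
decoded = tf.keras.backend.ctc_decode(yhat, input_length=[75], greedy=True)[0][0].numpy()

# the real annotation that came back with the video
print('REAL TEXT: ', tf.strings.reduce_join(num_to_char(sample[1])).numpy().decode('utf-8'))

# the model's prediction converted from token ids back to characters
print('PREDICTION:', tf.strings.reduce_join(num_to_char(decoded[0])).numpy().decode('utf-8'))
```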
predictions. Oh my gosh, I get ecstatic every time I see this. All right, let's do another one, bras9a, bin red at S9 again. Let's play that again: bin red at S9 again, so that's bras9a. I really want to build this app now. All right, so bin red at S9 again, that's it, and keep in mind it's not using any audio; I run it through our model and it comes back with bin red at S9 again. How absolutely amazing is that? So that is the LipNet model now built. Hopefully you've enjoyed this tutorial; we've been through an absolute ton of stuff. Just to recap, we started out by installing and importing our dependencies and building our data loading function, which we eventually did a little bit of tweaking to so that we would have a training and testing partition; we created our data pipeline; we then went and built and designed our neural network, which kind of mimics what was originally in the paper with a few tweaks; and then we went and trained it using a custom loss function, a custom callback, as well as a learning rate scheduler. Last but not least we went and made some predictions, and I tested it out on a standalone video, so you've got the script to test this out on really whatever video you want, but ideally it's going to perform well on videos similar to what we've actually trained on. We could definitely go and fine-tune it, so let me know if you'd like that video. For now, thanks so much for tuning in, I'll catch you in the next one. Thanks so much for tuning in guys, hopefully you've enjoyed this video. If you have, be sure to give it a big thumbs up, hit subscribe and hit that bell; it means the absolute world to me. But it doesn't stop here, we're going to be taking this to the next step if you want it: should we learn how to do this on videos of ourselves, to lip read videos in general? Maybe we should convert this into a Code That episode and build up a standalone app so we can use this out there in the real world. Let me know in the comments below. Thanks so much for tuning in guys.
Info
Channel: Nicholas Renotte
Views: 79,311
Keywords: machine learning, python, ai, coding challenge, MiDaS, depth estimation
Id: uKyojQjbx4c
Length: 74min 22sec (4462 seconds)
Published: Fri Feb 03 2023