AI as an API - Part 1 - Train an ML Model and turn it into a REST API using Keras, FastAPI & NoSQL

Captions
in this one we are going to be creating a custom machine learning model from scratch we will train it and then we'll turn it into a rest api [Music] we're going to be doing this by using keras which is a python machine learning library we'll also be using fast api which is a really popular api framework for python and then we're going to go ahead and integrate all of these things store it all into a cassandra database using the managed astra db service let's go ahead and jump right in i wanted to thank datastax for sponsoring this series so if you don't already have an account on there check out this link right here as a way to show your support as well thanks so much let's go and take a look at the demo now if you look here this is a local fast api project working and this is a streaming response in fast api from our astra db this is actually really really cool so it's literally all of our new data set it's not our old data set it's just everything that's coming in new next thing we actually can take a look at this using a service called ngrok this is how we expose it to the outside world from our local machine it's really easy to do it's just a simple setup and then we just run it the final thing is actually a deployed version of this so this is it right here this ip address is probably not going to be available when you're looking at it but the general idea is all three of these things have the same data right which is really really cool and this is just showing all of the inferences that have been done right so like this right here this is awesome that was what was being predicted on and it gave an 86% chance that this was ham as in not spam and so we can go through this whole thing and check it out or we could actually take a look at a notebook on google colab to see how we actually run this in real time so we import requests we have one of those endpoints here like i'm using the ngrok one to start out and i can put in my query here so this is not spam probably right and so i run that and it'll give me the query back of course it gives me the label that it thinks it is and it gives me the confidence for that another one would be actually spam so let's say for instance get a new phone at you know a huge discount so i purposely did some weird errors there or weird numbers there but now we've got this weird text coming through looks like spam just from looking at it but we can now see that it's an 83% confidence that it is spam that's pretty cool if you ask me next of course is the data set itself so we can actually go through here and go line by line and get every single piece from this new data set which would include this weird new thing that we just did right so this line right here should also be in there if i do a quick search for it there it is right there in that data set so now using the production version one it's the same thing it's not gonna be any different you can actually use that in production and it's gonna be just as fast if not faster than going through ngrok probably faster than ngrok because of the server that we end up using and then of course we can go line by line on that production grade one as well so that's the demo of it working now the entire code all of it is in two places on github one is the actual production version that i'm using which is this repo right here on our github account and these will be linked in the description too so you don't have to search for them the other one is the api course itself so the actual code that we do in this
entire series that's this reference right here they're not going to be that much different but the biggest difference will be the actual ai as an api that one will continue to improve off of the video with a number of things that i'm probably not going to cover in this series including the terraforming of all of this this is a whole new sort of topic that we'd want to cover in another time so anyways so that is the demo let me know if you have questions otherwise let's take a look at the requirements to do well in this series now before we jump in i'm going to assume that you have experience with python already so if you don't have experience with python definitely check out my 30 days of python series the idea here is you need to be really comfortable with variables functions classes and really just sort of like cleaning up data maybe have some experience with python requests a lot of those things are done in 30 days of python so check that one out it's on my website at cfv.sh projects it's also on my youtube channel as a playlist so if you look for cfe.sh and then slash youtube you will actually find my youtube channel and then you'll be able to find this playlist as well so let me know if you have any questions on what the requirements are they're pretty straightforward if you have a really solid foundation in python great and even if you don't have that solid of a foundation i'm going to show you how to do a lot of these things step by step so if you have to re-watch any videos because maybe i'm going too fast for your experience level that's okay that is something that a lot of people do all the time i do it as well sometimes i'm watching something and i'm learning something new i gotta rewind it several times before i get it done and then of course as i mentioned with the demo all of our code is on github so if you go to github.com slash coding of entrepreneurs or more simply cfv.sh github this will take you to our primary repositories and if you look for ai as an api that will be a really solid reference for you for this project going forward now if you need the actual course reference itself then you're going to want to look for the one called ai as an api course reference it's just slightly different but it's going to match exactly what's happening in the course where the ai as an api one is going to be updated over time so let's go ahead and jump in to our project and our series now all right so in this one we're going to go ahead and configure our project using python and vs code so i'm going to assume that you have both of these installed i'm going to be using python 3.9 but you want to have at least python 3.6 and above and then also vs code open source definitely not visual studio but visual studio code is what you want to download and open so once you open it you'll see something like this we just need to create our project and so we open up the explorer or you can go to file and add folder to workspace either one basically opening up a folder i'm going to navigate to where i want to save this project i put it in my dev folder which i created next to my user itself and inside of a folder i'm going to call ai-api so go ahead and add and in my case i actually already have a file in here which includes a requirements.txt so for those of you who are incredibly familiar with this process these are the requirements so feel free to install these and skip the rest of this video if you know what to do as far as creating a virtual environment okay so now that i have this workspace 
folder i'm going to go ahead and save this workspace as whatever you want to call it i'm going to call it ai dash api okay simple enough now i'm going to go ahead and go to the terminal and i'm going to open up a new terminal here that's probably a good idea to learn this actual shortcut to opening this just like that i use that shortcut all of the time okay so again i'm using python 3 and it's 3.9.7 now whenever you open up this terminal it's going to bring you right into the root of your project folder as in where this code workspace ends up being which is really really nice i think okay cool so now that we've got this i'm going to go ahead and do python 3 dash m v e and v period and just create a virtual environment now if you're on windows you probably won't have to type out three you'll probably type out python-m but of course i'm on a mac and that gives me the wrong version of python and of course if you are on a mac or linux you can type out different versions of python as well and see all of that stuff i'm really hoping that's sort of repetitive information that you've already heard before but now that we have this virtual environment i have all of these new files here i'm going to go ahead and activate it with source bin activate of course if you're on windows it's dot slash scripts slash activate just like that but the way you know that it is activated is the name of the folder will be in parentheses like this and then also if you do python v no matter what system you're on it's going to be the python version that you initialized this virtual environment with so now what i want to do is just do pip install r requirements tx now it is very common for this to not work given your system so if that doesn't work you could always open up a new terminal window and again it's in the same route then we want to source bin activate what you can do is you can run python dash m pip install whatever right that's just another quick way to call any of the python modules in fact this is what you'll want to use if any of these modules don't end up working if any of the actual command line commands don't work so the other part of this is tensorflow itself right this is a big package so if you are not ready to have tensorflow installed on your local system just go ahead and remove that that's okay no big deal at this point when we actually bring it into production it's going to be a lot bigger of a deal at some point you'll probably want to have tensorflow working on your system i do highly highly recommend it now while this is finishing up what i will say is we are going to be using jupiter notebooks a good amount to write a few things with the code this is mainly to really see how to experiment with creating a data set or cleaning a data set but also experimenting with training an ai now most of you i'm going to assume do not have a gpu on your local machine so we're actually not going to be using the local machine for our ai training which we'll talk about when we get there but jupyter notebooks are a way to write interactive code essentially so i can just write some code test some things out really quickly which is why we have this in here this is not a production dependency the production dependencies look a little bit more like that right so when we actually bring this live when the world can see it we won't need jupiter and pandas in there those are just really development tools to get to the point where we can actually have a production ai using fast api's api service or creating fast api 
api service so i'll let this finish installing and then we'll come back right so all of my dependencies ended up finishing so i can actually run pip freeze or simply python m pip freeze and see everything that installed the key one for us is tensorflow 2.6 if a newer version of tensorflow comes out then there's a possibility we might need to use this older version of tensorflow 2.6 but let me know in the comments if that's the case so now that we have these dependencies let's go ahead and take a look at a jupiter notebook by just typing out jupiter notebook and again if you're not able to do that just run this command and that should actually open up the jupiter notebook server and so it will open up your browser and you'll be able to write a bunch of notebooks in here so we'll go ahead and create a folder and just call this folder in here let's actually zoom in a little bit let's call this folder notebooks and i'm just going to give you a very basic notebook here and i'm just going to say print let's go ahead and print out hello world and hit shift enter to run it okay so it's just a really easy way to run code inline and as you see this number incrementing that's the number of times that i actually used that cell now i will say vs code does have the ability to run notebooks as well but this is not something i'm going to be using here right so this is the vs code version of a jupyter notebook by all means if you want to go ahead i'm going to go ahead and close out jupyter notebooks and delete everything i have including the folder itself okay so this is the baseline configuration for our project be sure to make sure that all of these requirements are in requirements.txt that is standard the other thing is you also might want to use git so let's go ahead and close out a jupyter notebook with control c that's what i'm pressing many many times so git is a version control system right so if i did get in it this will actually track my code now if you've actually never used git before then i don't recommend trying to force this in here when we need to use the get code itself the code that we will need for production i'll show you what you need to do for that for now though we're just gonna use get in it um and so i'm gonna go ahead and also do a python git ignore file and add this in as well uh usually on github there's a really good one so i'm gonna go ahead and grab that and we're gonna go ahead and make it in the root of our project with dot get ignore and paste this in here save that and then this number should no longer be 5 000 but maybe a few in my case i don't want bin include or lib any of those things either which you know some projects do want them which is why they get ignored it doesn't have them yet at least okay so we save that and now we'll go ahead and also get include and share here great so really our project doesn't have a whole lot and if we do get status we can see that these are the things that will be tracked from our code or could be tracked and if i add them they will be so initial commit cool all right so with all of the get related things if you actually go to cfe dot sh github this url will have all of our code that we ever produce and if you look for ai as an api this will be the final working production ready code once it is polished this is not the code that we're going to be working off of in our videos now in our videos i will have another repo for that which i'll definitely have linked somewhere in the descriptions but the general idea is the code itself this 
is code that i've been experimenting with to ensure that a real production ai as an api does work with a bunch of other configuration we're just not going to cover yet but it is a good reference for the things that we might need in general for this project because i'm not going to be diverting too much from what you see on github at this point in order to create our ai algorithm we have to start with a data set now in our case we're going to be using a data set from the uci the university of california at irvine their machine learning repository they have all kinds of open source data sets that we can use in our projects now there's a good chance that it's going to look different by the time you get here but the idea is we want to download one of their data sets and then we're going to clean it up and get it ready for machine learning so that's what we'll do for the next few parts now i will say a lot of this is going to be based off of the guides that we have on the ai as an api repository so if you go to that on github you can actually see a number of these guides that we'll be going through now the reason i recommend that you stick with me if you aren't that familiar with this process is so that you can learn the thought process behind what makes those guides versus just running those guides but by all means do it however you want to do it and i will explain how i think about this right now so going into the uci machine learning repository we're going to do a quick search for spam right so the algorithm that i want to build is a spam detector basically like i'm going to send in a string of text and it's going to tell me if it's spam or not that's it it's fairly simple of a problem but still a pretty useful one itself okay so if we come in here and do the search we see that there's a number of spam-like data sets okay so the ones i'm going to be using is the spam sms data set and the youtube spam collection data set so i'm going to go ahead and open these things up and this is what they look like right now so if i go to download and hit the data folder on both of these what i'll see is a zip file listed here now if you want to think about this not programmatically you would actually just download the zip files into a folder and then you could manually unzip them right by all means go ahead and do that if that's the method you want to go about doing this my method though is to automate this as much as possible i don't want to have to think about this and also when i start experimenting with building the actual ai model i want a really easy way to grab this data set anywhere without extra configuration in other words i want to automate it okay so for this i'm going to go ahead and create a jupyter notebook and we're just going to be focusing on the downloading portion of this inside of a jupyter notebook so we'll go ahead and create a new folder and i'm going to rename it to mbs and inside of here we'll go ahead and create a new file here and i'm just going to call it one download and data sets cool so what is it that i'm trying to download this file right here okay so i'm gonna go ahead and copy the link to it and i'll paste it in here as a string okay that's simple enough right so this is the spam or sms spam zip all right and so there's a command that we can use on our machine called curl right and so curl if we go to like let's say our terminal window or powershell depending on what you're on and just type out curl you'll see something like this not you know command not found basically 
or some other error so curl is going to allow us to open up this page and do something with it so i can actually run curl inside of jupiter using that url we don't need it in quotes at all and then we can just put an out file as in like spam dataset.zip i can run that and that will actually download that zip file or whatever is on that url and it will output it to spamdataset.zip and so with that i can actually go into my project here open up that notebooks folder again and i see spam data set in here this of course is not where i want to keep this zip file personally i want to keep things organized so i'm going to go ahead and delete that and i want to just restructure how i'm going to go about downloading this now of course this is going to take some python to do it i'm going to use import path lib the actual path loop library that's built in to python itself and we're going to declare our base directory from pathflip it's just pathlib.path and we'll just go ahead and do resolve okay and let's go ahead and see that results there we go so in my case it's going into my notebooks folder i actually want the root of my project so that would just be dot parent that gives me the root of my project so to actually use this variable inside of jupiter on a curl call which again that's a terminal call especially using the exclamation mark before it makes it a terminal call i can actually use dollar sign and that variable with a trailing slash notice there's not a trailing slash here so when i do that and run it let's go see where it goes now so now if i scroll to the bottom i see it in the root of my project it's no longer in that notebooks folder now it's here okay so this may or may not be a review for some of you i'm going to go ahead and delete that file but the idea here is we want to be systematic about where we download things and so with that in mind i'm going to go ahead and add my data set der and this is going to be my base directory and i'll just call it data sets and then we'll go ahead and do our imports or let's call it our zip file actually let's go zips der and this is going to be my data set stirrer zips and zips what i want to do is i want to make sure that that exists by just doing make dir and exist okay being true and parents being true as well so that will create all of the actual folders i want so data sets and zips all of those folders are in there now and so what i want to do is then get it to that path okay so zipster and right here let's go ahead and do that and there it is cool a little bit better now my case in my git ignore file i'm actually going to ignore that so data sets and zips i don't actually want to commit that data because again i'm going to be extracting it from here and the goal of well these actual notebooks would be to download them combine them and make them usable so the final result will probably commit into our repo but the actual current state of it we don't need to commit at all but we can actually make this a little bit better and that is this right here i actually do want to have my spam sms zip path and that's sort of a arbitrary name there but it's going to be based off of this right here and there we go so this is now where i want to save it and we'll just put it next to that dollar sign there and again it's not necessarily going to oh we need to run that and run this again it didn't really change anything right it did override that file but the idea here is it's still the same exact thing now the reason i actually declare a path for it is so i 
can unzip it then in a moment but before i actually unzip it let's get the other youtube spam zip link and so that one is you know the other link here so i'm going to go ahead and copy this paste that in there okay so much like this one i'll just go ahead and create another one and call this spam youtube zip path and this time is the youtube data set actually the spam data should probably be sms spam and simply youtube spam we want to be well very clear as to what's going on here okay so this curl command i'm just going to add one more to it just right below it and of course it's going to have to be the correct url which i do have these urls here i just haven't been using them so let's go ahead and use them now with the dollar sign there and dollar sign here and let's get rid of the spaces if you have them okay and of course the spam path itself great and so now it should download both of those in the correct data sets folder and once it does then we will have the two data sets we're going to end up using or at least the base for it okay so now that we've got them downloaded we need to actually unzip them right so we need to actually take them out of this current state which we'll do in just a moment now we're going to go ahead and unzip these files right here now i'm going to be using the command unzip if you don't have this command i mean you can download it or you can just unzip these manually so if you did it manually that's completely fine but i want to make it as automated as possible so i'm going to go ahead and do it in the notebook itself i'm actually going to change the name of this notebook because i made a copy you don't have to be working off a copy i just have it for historical reasons but now we've got this download and unzip and at the very bottom here i want to actually create a few new folders here okay so the first one is going to be the spam classifier directory no big deal and then the other two directories are going to be the outputs that we have from these curl commands and after they're unzipped now of course i always want to make sure that these are created so i'm going to use those make dirt commands that we did before and now i'm going to go ahead and use unzip so we call unzip and then we can put in our input which would be our output path from curl so dollar sign that and then our destination so dash d is going to be one of these directories right so dollar sign that now if you run this once no big deal it will unzip it if you do it again well then it asks you this do you want to you know replace the default in other words do you want to override it so what you can do is you can put in the flag of dash o now the only reason i didn't put this in at the beginning is because curl dash o means output where unzip dash o means override so i'm going to go ahead and interrupt this because it was still loading and now if i run it again it should actually do it every single time okay so i can copy this same thing for our youtube one as well the directory first and then the path and there we go okay and so with that um downloaded we can actually take a look at the data set itself so inside of my data sets i now have this new folder here right and i've got a few things in here that we definitely want to look at but if we see the youtube one notice it's only a few well these are musicians and it's comments related to these musicians so it's not necessarily the best data set just off the bat i'm pretty sure it's not the best data set however it's still spam and not spam 
related which is why i'm going to keep with it and this will become abundantly clear if during our training we see some radical like departures from the accuracy so that's what we'll stick with and i'll explain some more later now let's go ahead and take a look at these actual data sets in this one we're going to review and combine our data sets using python pandas now the first thing that i'm going to do is take one of these directories here and i'm going to just look what's in there so for path in one of those directories.glob and start out and just print out path right so this gives me the path to the files that are in there and what i'm going to do is i'm going to hit try and print out path dot read text and then i'll hit accept and pass i run that and it only reads well it doesn't read the read me text it reads the spam sms collection text right here so that probably means that that's going to be the file i'm going to want to use right and so i'm going to go ahead and say that the sms spam input path is equal to this slash that okay and so again i can actually use that to read that text again run it again and what i see now very first is this slash t so this tells me that it's a tsv tab separated values file so i can use that inside of pandas we'll talk about that in a second but before i go any further with that let's take a look at my youtube directory right and so i'll go ahead and do four path in youtube glob and then all and again printing out that path now when i print it out i see csv files but a lot of them it's not one single file like my other one i have that readme file which probably has some good data in it but i know that these are all csv files and i have a pretty good feeling that those are going to be useful for our data set so we certainly need both of these okay so speaking of our spam sms collection we'll do this one first because it's going to be by far the easiest one and so i'm going to import pandas as pd and then we're going to call this our sms df df4 data frame there's a very common distinction or variable for pandas data frames and so we're going to do pd.read csv and i'll just pass in that path first oops just the path i don't need the whole thing there okay and so i'll go ahead and do sms df.head and so of course i get an error because it actually is not a csv file which is what read csv means it's a tab separated value which we can just do sep with that slash t and if i do that now i actually have tabular data in here right tabular data as in like a excel file right so it's in columns and rows that's what pandas is really good for is dealing with data that has columns and rows not too dissimilar from an actual database like a sql database or a nosql database okay so we've got our read csv in here now one of the things about it is this very first line this of course is the column header all of the columns and in this case it actually doesn't realize that there is none so we have to declare it as header nut right and there we go so we now have a spam data frame that has head and we can do tail and we can actually say you know the number that we want like 200 values and scroll through there it's not going to necessarily show all of them but we can actually see our data right and it's almost out of format that's ready to go for training there's still a few things that we need but what we see on this data is one side actually gives ham or spam spam is probably very obvious to you ham is just the opposite of spam basically ham is the good value spam is the bad 
value so what we can do is we can actually treat these as the labels and this over here is just like our text this is what we'll be training on both of those things so we can say spam sms the data frame columns is equal to those values so our label is going to be the first column which will be this right and then our text and there we go so for all intents and purposes this is ready to go now we need to youtube one this one is well not as straightforward right because we have a bunch of csv files in here now one of the ways to only get csv files is in this glob just by doing csv now i only have the csv files which makes it a little bit easier of course this is a path lib distinction here not anything to do with pandas so within pandas we can actually load up these data frames so we can say df equals to pd.read csv and pass in that path and then i can print out df.head okay and the way that jupiter works is i should be able to actually use df later so let me just comment this out and say df.head here and so it actually does give me a data frame but it's not the entire data frame which we can actually tell by saying something like df and source this will create a new column for us we can actually pass in the string of the path and we can actually see what the source would end up be or let's use path.name to make it a little easier to get the name of the file itself so we see that it's only from psi that's what all of these are going to end up being because of how we loop through it we are literally creating new data frames each time hopefully that's not that surprising so what we can do then is we can actually say my dfs as in a list of data frames that we can append this to so my dfs.append that data frame and then after all that is said and done then i can say my yt data frame or my youtube data frame is going to be pd.con cat which will just combine all of these they will concatenate all of those and then that way i can actually see that they're literally all in there so the tail is still psy but the head should be you know whatever the first file is in there cool show us some good stuff now one of the things it's also not doing though is we have this content which i would imagine is the same as this text here and then we also have a class of zero or one so i want to change those two things okay so first and foremost i'm going to go ahead and say the df dot rename and we're going to change our column names and it's going to be class all caps it has to be all caps it has to match what's here i'm going to change that to being our raw label and then content the content being our text okay all of the other things i may or may not need but for now i'm going to leave it like that and then we need to declare in place being true okay so now we've got a raw label and text so in our case the raw label is called raw label because it's either a zero or a one right and so we actually want to convert it to being spam or ham right now there is no technical reason to do this in fact later we will turn it back into zero or one but since i'm prepping the actual data set i will keep the original label in this case the original label i'm calling spam or ham now you can think of this in terms of the integer itself does declare what the label is but if you had a bunch of different labels then you would absolutely not want to keep them as a number you'd want to map them to something with this data set prior to mapping them later which we will map later for sure so what i want to do now is say df and label again 
remember how we created this column here i'm now creating a new column this new column is going to be based off of that raw label but with a function applied to it so dot apply allows me to run a function i'm going to use lambda here just to write it in line lambda of x is going to be the value of any given row in that column so zero so lambda x x would be zero in this case so if i do the string of zero right basically i wanna see if the string of zero is let's say equal to the string of one then i'm going to go ahead and say well it's spam otherwise i'll say it's ham okay so now we've run that run it again and what we see now is those values corresponding correctly so if it's a zero we're gonna say that it's ham if it's a one we'll say it's bam now i might actually have these labels wrong in which case you could definitely go into the readme file or even into the repository where we found this right there is no readme file in this case right so i don't see one this is just redundant data from whoever created this data set um but i'm just going to assume that this is correct right is it spam or not 0b not one being it is right because it's not usually distinguished as if it's ham or not it's usually distinguished if it's spam or not and as you may know one is often true or always true in the true false values of you know something a bit like that cool so now with that out of the way we have two data sets both of them have text and label they both have a source now in this case i'm actually going to change the source right so i can call this raw source if i wanted it for later or i could say df and source being equal to this is going to be my youtube spam data set and then up here in this one i'm going to go ahead and also add a source and this being equal to my sms spam data set okay so the only reason to have that is to distinguish it later so like if at any time i need to look at my newly cleaned data set i will have that so i'm going to define df again and it's going to be equal to pd.concat again combining two data sets the first one being our spam sms data set and then our youtube spam data set or data frame in this case there we go and then df.head hit enter and there is our new data set now um we have a bunch of columns we simply just do not need right so what i can do is i can declare the columns that i want so what we can attempt here is inside of the concat we can call columns equals to the label column and the text column and in our case the source column okay so unfortunately cancat just won't do that right so that is a way that you can do when you are reading a csv file you can declare the columns from reading it but you can't do it when you concatenate it so instead of actually declaring the columns after we concatenate it what i can do is declare it down here and use this instead so the df i'm going to go ahead and say raw data frame being in here right and so the raw data frame is going to rename everything and in here raw label raw source and just like that let's just make sure that's still working it is okay and then my df um it shouldn't be working but it is and so my data frame then is going to be a copy of that raw data frame so rada copy and then inside of this copy this is where we'll declare the fields that we want so label and text and source so basically i'm getting rid of all of the other fields in the iteration itself not later right so if this ends up getting really really big data like really really big then it would be better to get the columns we want in 
here in fact probably earlier but i actually had a few columns that i needed to use otherwise like that raw label column i actually did end up using so that's kind of the idea there but now we have the full data set without that extra information that we just simply don't need for this data set now i will say in general for big data like if i was going to be storing this in a database which i'm not going to be doing but if i was storing this in a database then i would keep all of the data in fact i probably wouldn't combine them here i would store each one of them individually the only things that i would combine are something like this where it is the same kind of data but then i would actually keep that you know that actual path itself the the raw source itself i would actually keep reference that as well but that's it that's how we're going to go ahead and combine these data sets the last part of course is exporting this so to do this we're going to go ahead and scroll very up up to the top and i'll go ahead and create export and the directory is going to be our data sets and export and we'll go ahead and make that one make dur exist okay and parents true okay and then we'll go ahead and say data set or actually spam sms or spam data set rather path is equal to that export dir and spam data set dot csv we're going to bring this down all the way let's make sure we run that cell actually and now at the end of the file here we'll go ahead and do pd.2 csv or rather df to csv this final data frame and we'll use that export path right the actual spam data set path and we'll also pass index being false we do not need the index value index value being the actual roll value that it's in go ahead and run that and that should actually give us our exported data set of course not inside of spam classifier but now inside of exports and if we open this inside of you know vs code we should see all of this data which should be a lot right it's not a crazy amount it's only about 7000 but it's definitely a lot more than it was which i think is pretty cool and so we have a way to just check these things out and we now have the data set so we just need to prepare this data set for training purposes we only have gotten it to a point where it's ready to be turned into actual vectors for training which we'll talk about in a moment but that i think is pretty easy to do now i will say there's a lot of things that we could have done here as well which is like getting rid of duplicates like there's a pretty good chance that there's duplicate text throughout all of these like repeated things said over and over but i'm not going to worry about those right now that would be something i would worry about in the long run in fact in the long run i would hope to get a lot more data than this like hundreds of thousands of lines not just 7 000 right so really we just need something to start with so that's what we're going to do and very very soon so much of machine learning has its basis in linear algebra which means that we do a lot of matrix multiplication and working with vectors so what we need to do is we need to combine our data set and turn it into actual vectors now you don't have to do the vector part yourself we'll use something that keras has but we actually still need to convert our data set a little bit more in order for it to even be created in two vectors so what we're going to do is we're going to load in our data set so let's load it in with pandas so import pandas as pd and just using everything that 
we've done before and then we'll go ahead and say df equals to pd.read cfv of that data set and then we'll do df.head okay so the first thing i want to do is actually turn our labels and text into corresponding lists in other words label position whatever position it is in is going to match to the other position so we'll go and say labels equals to df and label and then dot two list and we'll copy that and do the same thing for texts okay and so the idea here is the labels at position you know 120 is like actually corresponds to the texts at position 120 as well right so those should actually match up here okay it's actually a good one because it shows us spam okay great so the idea now is once we have these items we actually need to map our labels back into zeros and ones now the reason i actually mapped them into ham and spam in the first place has to do with having multiple labels and you would do a similar process with those as well but i'm not going to worry about multiple labels right now instead what i'm going to do is actually create my labels as int okay so it's going to be based off of these labels here like 4 or let's say x4x in there but i need to do something with x and that is of course mapping it to something so let's go ahead and create a legend so i'll just say the legend or rather the label legend is equal to well ham being zero and spam being one now the reason for this is just to use it as a way to turn these into values that i can later unpack i'll talk about the unpacking in a moment so we see here oh we need to run that label legend now labels as int is zero and one right and so again we can use labels as int 120 and what we should get is well the label legend value which would be one so this is actually what's going to end up being predicted for us it's going to predict us a 0 or 1 value later once we actually have it all set up so we also need a label legend inverted like as in the opposite value here now i could actually do this manually by actually writing it out but again if i had multiple labels i wouldn't want to do it manually i would actually want to do something more like the string value pair of let's say let me just do it and then the four key our k v key value pairs in those items here and so this will actually give me the inverted values here okay so now it's actually really important that we did change all of our labels to being ham or spam so we can have a actual proper label for it so then when we do a prediction later we can actually use that prediction to our advantage of course i've given a key error string of one so what i want to do then is just turn this into this right here key error of one because it's not a string value let's go ahead and do that and there we go okay so now it's giving me the proper values as we see fit okay so of course this is what it's going to eventually be so we don't need that yet but now we have our labels are pretty much ready to be turned into categorical vectors which we'll talk about when we get there um but for now i want to just verify that these indices are correct as in labels and text right i sort of did it here but i definitely want to run a random sample for it so i'm gonna go ahead and import random and we're gonna grab a random number so this is gonna be a random like uh index number if you will and it's gonna be random.randint it's gonna take in zero and then essentially the length of labels or the data frame whichever it doesn't really matter but the length of that that way we get a random number that 
is spans the entire data set right so we can run that over and over again and all i want to do here is i want to assert that the let's say the text of this random you know index number is going to be equal to the data frame and going and grabbing the location of that index number and then dot text right just want to make sure that those are the same so if i did plus one and they were incorrect i would see something like this right that way you can run this over and over again and of course not just with text but also the labels and in this case it's not going to be just the labels itself right so that's going to be a text label or while it will give us the text label correctly so we do want to try that one of course yeah but what i meant was the label says int we want to make sure those are in the right direction as well and we want to do it using the label inverted of the string of those labels right and just like that okay cool so again just a really really simple test to verify that we didn't mess up the indices because they are so important because these are the truth values of what this actually represents okay so now that we have that we're going to go ahead and tokenize our text so to do this we're going to import some things from keras and tensorflow so from tensorflow.keras.preprocessing.txt we're going to import tokenizer okay so this might take a moment or so to actually import next we want to have the max number of words that we're going to use here for our texts i'm going to use 280 and the reason is oops that should be pre-processing not poor there we go 280 is the length of what a tweet would be right so it's certainly possible that the text that comes from youtube is longer than 280 we want it to stop at 280 essentially and so that's what we'll keep with and now i'm going to go ahead and do our tokenizer and we're gonna have it just like this number of words being equal to that value here and then we're gonna go ahead and fit we'll do fit on texts for our texts okay that should take just a second from there we're going to go ahead and do our sequences and tokenizer dot texts to sequences of those texts and we'll take a look at these sequences okay so this is now better vectorized data okay and we can actually see the unique the number of unique tokens that are in here by doing our word index equals to the tokenizer dot word underscore index and this will give us our unique tokens here okay so each one of these word index actually gives us the value that corresponds up here right so 7 for example is my 9 is is and so on right so pretty cool we have a way to actually tokenize all this data now and it's done in a sequence so it's not like it's just zero and one um but it's actually now ready to be or closer to being ready to actually training this data okay i think that's fairly straightforward but the challenge of this is when we try to do matrix multiplication we actually want our data to be well the same length all across the board right so this right here and this right here are not the same as if we scroll up we see that the first item has a bunch of text the second one doesn't now when we're doing matrix multiplication we actually have to make sure that they are the same length so the next part of this is actually padding the sequences and this would actually finalize this for us so we'll do from tensorflow.keras dot pre-process dot sequence import pad sequence okay and so now this is actually going to give us our x data so i want to give a max sequence length and i'll 
give it at sometimes you'll see it as a thousand but it depends on how long we want this to be this will remove some of the words that we don't want right or don't need so if we look at this word index again it's not going to have necessarily every single word or punctuation in there but it could right it just really depends on how you end up doing that so our max sequence length i'll put at just 300. and so now we'll go ahead and do x equals to the pad sequences of all of our sequences right so the sequences we created up here and then doing the max length to this max sequence length okay and so there's our training data okay it's completely ready to go as far as our training date is concerned and it's all the exact same shape so every single like actual vector in here has the same shape as every single every other one unlike our actual data set itself okay so now what we need to do is turn this the actual labels themselves the labels as int we now need to turn those into a vector as well and to do this we're going to go ahead and do from tensorflow.keras.utils and import to categorical and then we'll also import numpy so import numpy as np and then we'll go ahead and say y equals to well let's do labels as int array and we'll do mpe dot as array and then labels as int so this is just turning it into a numpy array and let's make sure numpy isn't there there we go it's it's still the same data right it's still the same length and everything but it's just what the two categorical expects so now we have our y two categorical of these labels as an array and now we've got our y values okay and notice that it's actually now in positions not it does have one and zero but now it's in what's called a one hot array where position zero is going to be related to well one of those values right so if we come up scroll up a bit it's not related to this but rather the position that they occur so 1 is going to be ham 0 is going to be spam or position 0 is going to be ham position one is going to be spam right so if it says one there then it's on we can obviously go through this a lot but the idea being that each one of these arrays actually has all of the labels in it and then if there's a one on it actually signifies what that label is it is not going corresponding directly to this label here but we can actually see this right so if we look in the array itself we've got zero zero one and if you think back to above zero was ham zero was ham and then one down here being that it's a one hot this is now saying that there's only two possible values for all of these the very first possible value is going to correspond to ham which is this 0 here the very second value is not on it's off so it doesn't correspond to spam at all so that's where these three things sort of line into right and so again if we look at 0 0 and 1 as being the ham ham and spam to turn that into a one hot array we have two potential values for the label so the two values are going to be ham or spam in that order according to this and then to turn this into a one hot array it's going to be a new list here so that first iteration is this right here and so it's going to be equal to ham so 1 and 0. 
right so again this is now correspond to that and so the next value is again ham and so 1 0 and then the next value is 0 1 because it's spam and it's this value right here so the idea of one-hot arrays is actually very very common for the machine learning algorithms themselves but to fully understand this we would have to spend a bit more time on it but hopefully the idea here that you understand is that we took our labels themselves turned them into number representation of that label from that number representation it was turned into a vector with other kinds of rep representation according to it so what we'll get back in our predictions are an array like this where it actually has both numbers in there and a percentage for each one like the percentage that it's a ham like an actual ham score and a spam score it's going to do it for both of them for every single time we run our predictions which we will definitely come back to and we will definitely see so that's pretty cool so now they are fully into vectors right so what we need to do now is actually split up our training data and export it which we'll do in the next one now we're going to go ahead and split up our training data and then export all of this stuff so i really am working off of the previous notebook i just duplicated it and ran everything but before we do that let's actually think about why we split up training data anyway i want you to imagine for a moment you're playing a game with one of your friends and that friend has a really really good strategy to win but that friend never ever changes their strategy they keep it the same every single time so the first 10 games they just absolutely destroy you they do so well they beat you in like record time every single time but then you start noticing what that strategy is you really start picking it up to the point where by the time you play your 20th game they can never beat you because you know exactly what their strategy is and they never change it of course games rarely work this way but if they did you would never lose to that friend anymore because you've learned all of their moves you've learned every piece of data you possibly could about how they play now in real life you could potentially do this with a friend for a game that they actually do change their strategy if you played enough games you could probably eventually figure out all of the moves that they try and beat them every single time still even if they do change that strategy but that's rarely the case when you then face another person or a thousand other people so this is about varying our training data we need to have variants in our training data otherwise we're going to be really too focused on one way of being quote unquote right or being correct and so we want to split up this training data as much as we can while we have time to or at least in the early days of building out this algorithm so that's the purpose for it i i do recommend that you do play around with this training data and split it up in different ways because you will yield different results but as we grow our model or grow our data set we will actually have new opportunities to split this training data more and also just add less and less variance to the potential outcomes of any given item right so in other words if you could get every spoken piece of text that's ever happened along with labels of spam or not well then you might actually have the algorithm that will never get spam or not spam wrong but of course that's not going to 
happen yet we don't have enough data for that you know maybe some like google or apple or facebook maybe they have that data but even then they're still working on trying to stamp out you know spam altogether so let's go ahead and split up this data it's really simple to do we're going to be using the package called sklearn or scikit-learned called model selection and we're going to import the train test and split now this method itself is a very very easy method to use it looks like i didn't install it yet so let's go ahead and install sklr so pip install psi kit learn and so this method is incredibly common for splitting up data but you can split up data however you like this train test split is just a well an easy way a lot of data scientists end up using it and it's based off of our x and y values right so the x and y value is convention you don't have to call it this you could have called this like features and this like outputs right but x and y again is convention so that's what we're going to stick with and what's also a convention is to take that x and y and do x train x test why train and why test and yes lower case like that i know it's probably a little weird for some of you but then we run this train test split off of our x data and our y data and then our test size being some arbitrary percentage or between 1 and 0 and it's going to be 0.33 in my case so 33 and we want to randomize this so the nice thing about this what it will do is it will keep our indices for us right so that's the reason way up here when we did separate these things out into different lists and we were validating the random indices that is why we want to make sure that that is happening down in our training testing as well and this package does it incredibly well and that's really it that's all we needed to do to actually split up this data so now we're going to go ahead and export all of this data and we're going to use pickle for this so import pickle now the pickle package itself can have malicious data so in the case of our training here you are not going to be using any of my x or y data instead when we actually go into training we will recreate all of this stuff all over again in the actual training environment which is not that big of a deal now that we've done it pickle itself is fine if you are the one manipulating this data it's not great if i send you a pickle like the actual output of this file and then you run it that's not a good idea so if anyone does that to you just just don't take it so we do need our training data though and the question is what is it that we need well of course we need these values that we just split up and so i'm going to put it into a dictionary itself so x train to x train and so on right and so of course the reason for this is well this is for sure values that we want and you know a lot of times you'll see this training split done in the same place as where you actually do the training itself but i want to separate these notebooks out into guides themselves which i do have i've mentioned before but the guides themselves are going to do a lot of these same things but they're all separated out so you can really just focus on whatever that is so this is the core of the training data but the next thing that we also want to do is we also want to export our tokenizer so we can actually use tokenizer.2 json so our tokenizer oops tokenizer to json this is a way to export our tokenizer data so let's go ahead and also write this one tokenizer json okay and then we also 
probably want to have our max words in here so max words now with the tokenizer we probably we might not necessarily need it but i want to bring it in just in case and the max words of course was up here declared up here with one of these variables there we go and the next one would be our max sequence length and again scroll up a bit grab that one okay and i also want to get my legend and the legend inverted and that's probably it so the legend of course i think we call it label legend actually so we want those two things right here let's leave it as label legend and label legend inverted and there we go okay so i've got two pieces of data that i want to export so the first one is we want our actual you know where are we going to export this right so what is the path that we want to leave this in as so i'm going to come up here and i'll call this my metadata export path and we'll put it in as our experts exports directory and i'll just call this spam metadata dot pkl okay and so we'll go ahead and do with open and then we want to write bytes pickle needs bytes as f and then pickle dot dump of that training data and to that file okay cool and then we also want the same sort of string here for our tokenizer and tokenizer export path spam tokenizer and json okay and bring that all the way back down and this time i can just do right text of that json simple enough and if we go back into our project we should be able to see our exports in there looks like i have my tokenizer but i did not run this one so now i have both things okay so my pickle and my json now my case this will not be in my repository the spam data set will be but these two will not be and of course it's because you should not be running third party pickle data but that's actually all we need to do for splitting and exporting our data i totally could have done this in an earlier step but i wanted to give some time talking about splitting up the data and why it's important for our training now i would also say another method that you could consider is taking this one step even further so we've got our train test right here we could actually do it one more or do it into 50 percent right so let's do point 0.5 and then from there we would do the actual uh test again so let's go ahead and call this test and i'll leave this in as t i'm not gonna i'm not gonna do this but we will just keep it for a moment t 2 and so you would actually have x test and x valid and so you would basically be taking out the testing data originally right so this should still be trained actually um so you would take out the the original test data and split it up one more time so that you have a another set of data that you could test being the y test and the validation so this would be a way to do it three times right so to have three sets of data which is also incredibly common but again i'm not going to do that right now it is something maybe we do in the future our data set is really not big enough to have the validation set the another validation set or another test set in here but as it stands right now i think it's pretty solid so let's go ahead and leave it here and get the training started now we're gonna go ahead and train our classifier so jump over to this blog post right here this will have the most up-to-date information on the actual classifier training itself but the main thing is actually just going down and clicking another link for launching this notebook on google colab since it offers free gpus and then copy this to your drive right so 
the reason you want to copy it is so you can make changes and run it yourself. i already have one running, so i'll scroll to the top of it; i'm on the free tier of google colab, which means i can only run one session, and especially one training, at a time. the idea is that we first install boto3 so we have a place to upload the results; i'll cover the uploading in another portion, for now just make sure it's installed. next we do the imports we've already seen, plus the model imports themselves, which i'll talk about soon too, and then we set up our exports directory and a few other things. the prepared-dataset step is everything we just did: taking the exported dataset and converting it into vectors, along with the pickle data, and we can see the csv file that comes out of it, the same one we just produced, only now it's all done inside this one notebook. it works because we already have that dataset on github, but also because this notebook actually runs the part-two guide notebook, the one that creates and converts our dataset into vectors, inline on colab. that's important so i don't have to move data around, and since it's on github you don't have to move it around either. so that runs, and notice it creates a random index and checks the data again, and then it produces the tokens. from there we unpickle all of that data; it lives directly on colab, which you can see in the files panel, going up a level and into datasets, where our exports and our pickle are. it was not downloaded from github, it was produced in this training run. once that's done we load the data back into a dictionary, just like when we exported it, and we can see what it holds: our inverted legend, our x test and train data, our y test and train data, the tokenizer (the actual instance, not the json, we'll come back to that), our max words and our max sequence; all the data we might need. do make sure that whatever your max sequence is, it matches what you train with; in my case it doesn't match what we talked about earlier in the video, and i'll point that out when it matters. a little further down we grab all of that data and can see the legend: it's exactly the same as the original, and the inverted one matches as well.
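a small sketch of reloading that exported bundle inside the training notebook; the key names mirror the export dictionary above and are assumptions about that layout.

```python
import pickle

with open("exports/spam-metadata.pkl", "rb") as f:
    data = pickle.load(f)

X_train, X_test = data["X_train"], data["X_test"]
y_train, y_test = data["y_train"], data["y_test"]
tokenizer = data["tokenizer"]                    # the Tokenizer instance, not its json
label_legend_inverted = data["legend_inverted"]
MAX_NUM_WORDS = data["max_words"]
MAX_SEQUENCE_LENGTH = data["max_sequence"]
```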
next comes the actual model, the one we train. i didn't design this model myself; it comes from keras, where a lot of very good machine learning practitioners have spent time finding strong models for a given problem area. this is one of the standard setups for multi-class text classification with categorical cross entropy, which works for two labels or more; in our case we only have two, spam or ham. the key things to get right for our dataset are the max number of words and the input length. remember how i said all of the training data has to be the same length? that's what pad_sequences did, and it matters here; if your inputs aren't all the same length you'll run into serious issues, so that's all this part is doing. shape just gives me a number back, how wide and long a matrix is. then we add layers: lstm is a type of layer that's really common for text-related data, which is what we're working with, and the dense layer at the end is our final layer, with an output of two potential values, ham or spam. if we had multiple labels, say news articles with science, business, health and sports, it would be four, so that number corresponds to the number of keys in our label legend, the number of y categories; that's important too. after that we compile the model and print a summary. you'll find a lot of examples like this, but the pieces i just covered are the ones that make it work; the things you can experiment with are the dropout, the recurrent dropout and the spatial dropout. there's a lot of detail and depth we could go into here, and if you want to understand these pieces better let me know in the comments, i think it's well worth learning, but for our purposes we want something practical, and going deep here would delay getting this model into production, where it's already incredibly useful as is. it's a bit like asking whether you really need to know how to compute an exponent by hand or whether you just need its value. yes, it's a good idea to build better models by really understanding everything going on, but the key to me is whether this sparks your intuition to learn more; if it does, let me know in the comments. next we actually train it. epochs is the number of times the whole dataset is passed through training; if you want more thorough training you can increase the number of epochs, but one epoch probably won't give great results and a thousand will take far too long and probably won't improve anything further. what we pass to fit is our training data, x_train and y_train, the split data from before, and then validation data, which means it validates while it trains. it would often be a good approach to keep yet another set of data to test on after all of this is done; that's just not something i implemented this time, but it's worth thinking about. once training finishes, which should be soon since i started it before recording, the notebook exports the model to colab's local storage, not to google drive, which means that once your colab session ends those files are gone; we'll come back to that in a second. then we'll predict on some new data.
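here's a minimal sketch of the kind of model and training run described above; the layer sizes, dropout rates, batch size and epoch count are placeholders, not necessarily the values used in the course notebook, and the arrays come from the reload sketch earlier.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, SpatialDropout1D, Dense

model = Sequential([
    Embedding(MAX_NUM_WORDS, 128, input_length=MAX_SEQUENCE_LENGTH),
    SpatialDropout1D(0.4),
    LSTM(128, dropout=0.4, recurrent_dropout=0.4),   # lstm layer for text data
    Dense(2, activation="softmax"),                  # two outputs: ham vs spam
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

# validation happens while training, as described above
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          batch_size=32,
          epochs=5)
```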
this prediction is arbitrary, but the function itself is something we'll end up using in production for a lot of reasons. the tokenizer here is the tokenizer instance that came out of that pickle data; building this helper isn't something we did in a previous video, and it isn't something we'll do in a future one either, but the idea is that we have the tokenizer working: we convert the text string, whatever we pass in as an argument, into a sequence, pad that sequence just like all of our other data, and then run a prediction. those two steps are essential for making sure our text string is the right shape for the prediction. what we get back is an array of two values; it's two values because the label is either ham or spam, and because we used softmax as our activation, we get percentages back rather than the 1s and 0s our labels have. so it might say roughly 93 percent for the first position and 7 percent for the second, which maps to ham at 93 percent and spam at 7 percent, but only because, if you look back at our y_train one-hot arrays, the first position represents ham and the second represents spam; it isn't about the number itself, which can definitely be confusing, but that's the output we get. next, it's very common to grab the top value, as in the top index; in this example the 93 percent entry gives an index of 0, and i can use that index with the inverted legend to get whatever the top prediction label ends up being. so we get our predictions back and then their labels, and that's what this function will do. even if this part isn't perfectly polished yet, that's okay, it will definitely work when we go into production; the key here is really to illustrate, generally speaking, what we need to do when we turn this into a fastapi service. the final pieces are exporting the tokenizer again, exporting just the metadata we need to run inference (the metadata that feeds this prediction function), and exporting the actual model itself, the one that's been trained here; scrolling down a bit we see the tokenizer as json again. the last step after all that is uploading everything to object storage, which i'll do in a later part of the series, but for now just know that object storage is your best option. a big reason is that these cloud services can store an almost unlimited amount of data, and model files can get pretty big; we don't necessarily want to keep downloading them locally (although i will on my machine), because if you download hundreds of these your whole system ends up being machine learning models. they can get really, really big; ours won't be, so it should be pretty manageable.
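a sketch of the prediction helper described above, assuming the model, tokenizer, label_legend_inverted and MAX_SEQUENCE_LENGTH loaded earlier in the notebook; the string keys on the inverted legend are an assumption about how it was exported.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict(text_str):
    sequences = tokenizer.texts_to_sequences([text_str])      # note: a list of strings
    x_input = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # same length as training data
    preds = model.predict(x_input)[0]                          # e.g. array([0.93, 0.07])
    top_idx = int(np.argmax(preds))                            # index of the highest score
    return {
        "label": label_legend_inverted[str(top_idx)],          # e.g. "ham"
        "confidence": float(preds[top_idx]),                   # python float, not numpy float32
    }

# hypothetical usage: predict("hello world") -> {"label": "ham", "confidence": 0.93}
```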
so that's it; i'm going to let this finish training and then we'll take a look at the prediction function, but hopefully the idea of what we need to do for training is fairly clear. the model is almost done, and look at the accuracy already: the first epoch is at 89 percent, and even the validation accuracy is high on that first epoch; after the fourth epoch it's at 96 percent across the board. perhaps that's the best we'll get right now, and 96 percent is still really good. now for the predictions: we've got an input value here we can test, so let me bring this back to showing the prediction at index zero. it gives us spam or ham for 'hello world', and it says not spam. if i try something silly like 'buy my new phone for a huge discount', which looks kind of spammy to me, it comes back as mostly spam; maybe not perfect, but if i add something like 'discount and call' plus some phone number, now it's very confidently spam. so that's pretty cool; it isn't perfect, but it definitely works, and we could put this into production and actually use it somewhere. the main thing now is exporting, or really uploading, this data somewhere, and that's what we need to do very soon: upload this model somewhere, because as it stands there's no good place to keep this file. the key is putting it somewhere i can download it from later, so i won't have to run this training again any time soon and can keep using the results of this run for a while. by all means play around with the prediction function; i think it's pretty cool as it stands. in a moment we'll upload the model and all the related data to object storage, but before i do i'm going to download it as well: going into the file system (you might need to go up a folder), finding the datasets folder, then exports, the things we need are the classifier metadata, the tokenizer json, and the model itself, the h5 file; not the pickle file and not the csv file. i certainly don't need the pickle going forward, since this training notebook will recreate it. i downloaded those as a backup, because as soon as i leave this session i would otherwise have to retrain; the notebook is ephemeral, meaning it won't keep any of these files permanently. yes, i could attach google drive, and you could do that while testing things, but you would not use google drive in production on a web application. so instead i'm going to use object storage, and i'm going to set up two different object storage providers, linode and digitalocean. by all means just pick one of them; they're roughly the same. i'll briefly show how to do it for each, and then we'll implement the upload against just one, since they both work through boto3,
which is the library written by aws for s3; the only reason i'm not using s3 itself is that it's a lot more cumbersome to set up than either of these other two object storage services. inside of linode we create a bucket, give it whatever label you want and a region that's close to you; i'll call this one cfe-inline-colab (i'm probably not keeping it long term) and create it. there's my bucket, really simple. going back into object storage i copy the whole url, go back into my notebook, scroll to the linode portion, and that's the endpoint i want to use; the region, us-southwest-1, is right there in the url. the bucket name in linode's case is arbitrary as long as you have the entire endpoint, so you can call it whatever you'd like. next are the access keys: back in our buckets we go to access keys and create one, cfe-inline-colab, and it's a good idea to restrict the key to your single bucket, because you're pasting these keys into colab, and if you accidentally expose them (you can see mine right now, which is why i'll delete them afterwards) they would only grant access to this one bucket of training data. once that's set i run the cell, put the values into my os environment variables, and then boto3 can use them; it's really that simple. back in the bucket itself we can see it worked: there's the datasets folder, the exports, the spam sms data, all of it. digitalocean isn't much different: go into manage, then spaces, then create a space; it will probably ask you for a project; pick a datacenter near you (i'm closest to new york); you do not need a content delivery network here; restrict file listing; name it cfe-inline-colab or whatever you want; note the origin url; you will need to put the space into a project, then hit create space and the object storage is created. yet again i copy the url and paste it into the digitalocean section of my notebook; i could have put it right where the linode one was, it doesn't matter, and the region is right there in the url as well. i'll comment out the linode one, which isn't strictly necessary since i could just not run that cell. then i go back into spaces, into manage keys, and generate a new spaces access key, colab-inline-key or whatever you want to call it; the first value is the access key itself and the second is the secret key, and if you ever lose them or need to reset them, by all means do that. again, the bucket name doesn't really matter here as long as you have the endpoint, and i'll explain for both providers where that bucket name ends up going; the only place the bucket name truly matters is on aws s3. so i run the cell, set the environment variables with those access keys, build the upload paths, and perform the upload, and going back into digitalocean, into spaces, into that recently created bucket, there it is.
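a sketch of that upload step; the endpoint, region, env var names, bucket name and key layout below are examples inferred from the walkthrough, not the exact values from the video.

```python
import os
import boto3

session = boto3.session.Session()
client = session.client(
    "s3",
    region_name=os.environ.get("REGION_NAME"),       # region from the endpoint url
    endpoint_url=os.environ.get("ENDPOINT_URL"),     # linode / digitalocean endpoint
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)

# upload each exported artifact under an "exports/spam-sms/..." key
for local_path in ["exports/spam-model.h5", "exports/spam-tokenizer.json", "exports/spam-metadata.json"]:
    key = f"exports/spam-sms/{os.path.basename(local_path)}"
    client.upload_file(local_path, "datasets", key)   # (Filename, Bucket, Key)
```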
so here it is: datasets, exports, spam sms, and all of our data. it's not that big either; 3.4 megabytes is definitely a workable model for where it's at, and this model can even be retrained from this saved model, which we will at least show how to load back in. the thing i didn't really show is that the bucket name ends up becoming a folder in both providers; there's that folder here, and there's the same folder on linode. again, on aws that name would be the actual bucket itself, not part of the path. so what we have here are the key names, the path where each file ends up, which is why the upload works the way it does, and it's also how you download: calling the download method pulls the file down locally, somewhere in the colab project, which i believe is the root of the main content folder that colab works out of. so now what we want to do is create an ai model download pipeline, but before we go anywhere, just to be safe, let's get rid of those access keys; i don't want to accidentally leave them around, so i'll revoke the one from linode and then go into digitalocean's manage keys and revoke the colab-inline key as well. when in doubt, do that to make sure your keys aren't exposed to anyone nefarious; one of the biggest downsides of google colab, at least right now, is that there's no way to securely add environment variable secrets, though maybe that will change in the future. alright, that's uploading our files; now that we have them, it's time to build a pipeline to download them, which gets us a lot closer to using this model in a web application. if you pick nearly any file on github, say a random models.py file, it might be three kilobytes; my trained machine learning model is three megabytes, nearly a thousand times bigger, which makes git not particularly well suited to managing it, and model exports can get a lot bigger than that, so object storage is a really good choice for these models. i don't have versioning or a history of changes here, which is something i'd probably want eventually, but what i do have is a simple way to upload, and that's what we've already seen with boto3: it's really just a few lines, we authenticate using the access keys in our environment plus a few configuration values, and then we can upload a file; we can use the same client to download it again too. for the project we're working on, we need to download three files pretty much every time we deploy this code; we don't have to worry about deployment just yet, but we will at some point, and that's why i'm going to use pypyr for automation pipelines. if you've ever used github actions or other ci tools, you've used yaml and some form of automation pipeline, and that's essentially what we're
doing here. i've also got a blog post on this workflow using pypyr, and i'm working with the maintainers of pypyr to make it a little better, but for now i'll go off of what i have, and we'll write a pipeline (and effectively a script) to download all of these files in a repeatable way. it's pretty simple to start: first make sure pypyr is in requirements.txt, then add boto3, save, and run pip install -r requirements.txt, and now everything is installed. next are the environment variables, so a .env file. .env files are very common for setting environment variables when working locally; you might also use them in production, you just won't move the file around, and you definitely won't ever put it in git. in my case it isn't there; it's ignored by default in the .gitignore. the things i need to put in here are really just the aws access key id, the aws secret access key, and a few other parameters; those two are the primary ones that belong in environment variables, because we want those values hidden, which is another reason why in colab i deleted the keys as soon as i finished. it's very similar to what we did in colab, and i'll reiterate it now: jump into your object storage (i'm using linode, but the exact same thing works for digitalocean, and for aws, since that's what boto3 was originally designed for anyway). inside object storage i grab my access keys and create a new one; this time i'll call it something like 'local ai as api' for my local ai model download pipeline. i could give it access to everything, but i like limiting the scope of my keys, so i'll leave it restricted to the cfe-inline-colab bucket, hit submit, and copy the keys; these are for local use, and the nice thing is i can roll them at any time. you might be tempted to put the real bucket name in here as well, but you don't need to; instead we can use 'datasets', and i'll explain that in a moment. the endpoint url goes in next, with the scheme on the front, and the region name is usually right there in the endpoint url; also note that values in a .env file don't need to be wrapped in quotes. the next package to check is python-dotenv, and sure enough it's installed. the main reason we use it shows up when we go into python: if i run python, import os, and try os.environ.get on any of these variables, i get nothing back. so instead we do from dotenv import load_dotenv (that's the python-dotenv package) and then simply call load_dotenv().
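a quick sketch of that .env workflow; the variable names mirror the ones added to the .env file above and are otherwise assumptions.

```python
import os
from dotenv import load_dotenv   # provided by the python-dotenv package

print(os.environ.get("AWS_ACCESS_KEY_ID"))   # None before loading the .env file

load_dotenv()                                # finds and loads the nearest .env file, if any

print(os.environ.get("AWS_ACCESS_KEY_ID"))   # now populated locally
print(os.environ.get("ENDPOINT_URL"))        # extra settings like endpoint / region / bucket name
```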
load_dotenv is really handy: it will find where our .env file lives, and now i can actually get that access key. the other nice thing is that if it doesn't find a .env file it won't raise an error, which is useful because some production environments will already have the environment variables pre-loaded, making the file unnecessary. this matters here because the pipeline itself is going to be using boto3. so let's create that pipeline: i'll make a folder called pipelines and a new file in it called ai-model-download.yaml. there are a couple of things to declare. first is our context parser, pypyr.parser.keyvaluepairs; this is arguably optional, but it's nice to have a parser so we can pass additional context into the pypyr call. next come the steps. the first step's name is pypyr.steps.contextsetf, and under its inputs we add contextSetf with our context variables, which is what gets used for string substitution. so: what is our local destination directory, where do we want everything stored? i'm putting it in models/spam-sms. and what are the file keys? these are the keys inside my object storage; looking in there, the part we don't want to include is the datasets prefix, but we do want exports/spam-sms/ plus each item, so i'll copy that path three times and fill in each of the three file names. all i'm doing is letting the pipeline know that these are context values, variables i'm setting myself, that i'll use in the next step; you can set them however you like. the next step also gets a name, and do keep in mind that i'm indenting as i go; if you're not familiar with yaml, it's very similar to json, which is very similar to a python dictionary, it's just that yaml uses whitespace, a lot like python. here the step is pypyr.steps.py, which lets me execute python inline, so under its inputs i add py with a pipe character and then the python itself: the same thing i did in the shell, import os, print os.environ.get of the aws secret access key, then from dotenv import load_dotenv, call it, and print the value again. a very simple pipeline, so let's give it a shot: i exit the terminal and run python -m pypyr pipelines/ai-model-download (no .yaml extension). we get an error; i missed a capitalization where i needed one, in contextSetf, as it says right there. try again, and there we go: it starts by printing none, which was the value before loading the .env, and then it prints the key. what this shows me is that i can now run all of the boto3-related pieces from inside this pipeline.
okay, so first and foremost let's grab the other context value, the local destination directory; that's another thing i can read in this step, and running the pipeline again prints it out. now i'll say dest_path equals pathlib.Path of that destination directory, and import pathlib in this step as well. i'll pause for a second and say: yes, it's probably a good idea to put this logic in its own python module, but the reason i keep it in the pipeline is that in the future, if i ever need to tweak this pipeline slightly for another project, i can, and then i have a new pipeline to handle that; it's also why i'm working with the pypyr maintainers on making the best possible version of this pipeline. next i want to make sure the destination path exists. let's see what the path actually is: i'll call .resolve() on it and print it, run the pipeline again, and there it is. notice the path is based off of where i'm running pypyr from, not where the yaml file lives, which is what we want. now i make the directory with dest_path.mkdir(exist_ok=True, parents=True), and now not only can i print the path, i can check .exists(), and sure enough it does; it didn't a moment ago, but it does now. this models directory is another thing i'm going to gitignore; i don't want to check it into github. part of that is that maybe you don't want to open source your trained model at all; you might open source all of your other code but not the trained models, since those might be where a lot of your service's value lies. anyway, back in the pipeline: we have the destination, so the next thing is boto3. i import boto3 after load_dotenv and start my session, which is fairly straightforward: session equals boto3.session.Session(), no big deal. next is our bucket name, os.environ.get of the bucket name variable; this is required, and it's still 'datasets', which is exactly why i didn't put datasets anywhere in the file keys. with linode and digitalocean, because we supply the endpoint url, this doesn't have to be the provider's actual bucket name, but i still need to pass it to boto3.
next up is our region, os.environ.get of the region name; again, we could hard code these things, but it makes more sense to keep this reusable, and the only values i did hard code are the contents specific to this particular project, which is part of why this would probably make a good python module eventually; i'm not doing that yet just so we can reuse and improve it later. so we've got our region and our bucket, and now we create our client; oh, one more thing, the endpoint url, which also comes from the environment, and i'll allow it to be none, because if you're using actual s3 you won't need to set an endpoint url at all. so: client equals session.client('s3') with that region name and that endpoint url. now we've got our client, so i'll loop: for key in file_keys, remembering that's a list from our context, so i can iterate through it. the download path is the destination path joined with just the file name portion of the key, not the whole key, because the destination path already covers the rest; the simple way to get it is fname equals pathlib.Path(key).name; the key doesn't have to be a valid local path, it just needs to be a path-like string so .name works. then it's simply client.download_file with the bucket name from our environment variables, the key itself, and the string of the local download path. and that's our pipeline, so let's give it a shot; i press up and run it again, it takes a second, but it downloads everything, and once it's done the models folder has all three files. the key thing is that when i go into production, this is the step i'll have to run, and i can certainly use pypyr to manage all of those deployment steps too; pypyr can even call other pipelines from within a step, which is also really cool, but i just wanted a simple way to always download these files. again, i know a lot of you are thinking 'why not just make this a python module and hard code everything', and i certainly could, but now, for a future or different model, all i really need to change is the destination directory and the file keys, the keys i actually want to download; i don't need to change anything else, other than maybe the environment variable data, and even that probably won't change for this project. furthermore, if i wanted to download more file keys, say for another classifier like a news classifier, i could just add those keys and change the destination to something like news, and it would download all of that as well, which is exactly what i wanted. realistically this isn't really an ai model downloader at all; it's just a simple way to download files with boto3.
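here's the same download logic pulled out as a standalone python sketch, roughly what the pypyr py step above runs; the env var names and the three file names are assumptions, not the exact keys from the repo.

```python
import os
import pathlib
import boto3
from dotenv import load_dotenv

load_dotenv()   # harmless no-op if the environment is already populated

dest_dir = pathlib.Path("models/spam-sms")
dest_dir.mkdir(parents=True, exist_ok=True)

file_keys = [
    "exports/spam-sms/spam-model.h5",
    "exports/spam-sms/spam-metadata.json",
    "exports/spam-sms/spam-tokenizer.json",
]

session = boto3.session.Session()
client = session.client(
    "s3",
    region_name=os.environ.get("REGION_NAME"),
    endpoint_url=os.environ.get("ENDPOINT_URL") or None,   # None falls back to real s3
)

bucket = os.environ.get("BUCKET_NAME")   # "datasets" in the walkthrough above
for key in file_keys:
    fname = pathlib.Path(key).name                     # keep only the file name locally
    client.download_file(bucket, key, str(dest_dir / fname))
```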
we're just calling it an ai model downloader because that's a good name for what the pipeline does. now we're going to start the api portion of this project. inside the root folder i'll create a new folder called app, and inside of that an __init__.py and then a main.py, which will be our fastapi application. i'll close the explorer for a moment and declare the basics: from fastapi import FastAPI, then app = FastAPI(), and if you've done this for any amount of time you'll add the @app.get decorator on a read_index function that just returns hello world. simple enough, but what i really want to do here is verify that my model actually downloaded. to get at that i'll import pathlib and declare a few things, starting with a base directory equal to pathlib.Path(__file__).resolve().parent. we can verify what that points at by running the server: uvicorn is installed along with fastapi, so it's uvicorn app.main:app, where app is the folder, main is the python module, and the final app is the fastapi instance, plus --reload, and leave that running. opening it in the browser, it gives us what looks like the root of our project, but not quite; it's the app folder, and the directory with models in it is one more .parent up, at the root of the entire project. i'm going to keep the base dir pointing at the app folder, since everything in fastapi should reference that as its base directory, while the model directory, which lives outside the fastapi application, gets the extra parent: base_dir.parent joined with models. that's a good one to sanity check with .exists(), save, refresh, simple enough. then come the other paths: a spam-sms subdirectory of the model dir, since we'll probably reuse it, and then the model path itself, spam-sms/spam-model.h5, and the same pattern for the tokenizer and the metadata. rather than guess the file names, i'll go into the models folder, copy each path, and paste them in: the metadata path there, the tokenizer path there. i probably don't need to check that every one of these exists; the most important is the model itself, so i'll check that one, and there we go, simple enough.
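a minimal sketch of that fastapi setup; the spam-sms file names are assumptions based on what the download pipeline pulls down, and you'd run it with uvicorn app.main:app --reload as described above.

```python
# app/main.py (sketch)
import pathlib
from fastapi import FastAPI

app = FastAPI()

BASE_DIR = pathlib.Path(__file__).resolve().parent       # the app/ folder
MODEL_DIR = BASE_DIR.parent / "models"                    # project-root/models
SMS_SPAM_MODEL_DIR = MODEL_DIR / "spam-sms"
MODEL_PATH = SMS_SPAM_MODEL_DIR / "spam-model.h5"
TOKENIZER_PATH = SMS_SPAM_MODEL_DIR / "spam-tokenizer.json"
METADATA_PATH = SMS_SPAM_MODEL_DIR / "spam-metadata.json"

@app.get("/")
def read_index():
    # quick sanity check that the download pipeline ran before this app started
    return {"hello": "world", "model_exists": MODEL_PATH.exists()}
```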
the next part is getting some sort of input, so i'll add a query parameter. the query wants a default and to be optional; for the default i'll just say 'hello world', and this time i'll get rid of everything else in the response and just return the query. refreshing, there it is, no big deal. now i want q to be a value i pass in as a url parameter, something like ?q=this is awesome, but if i refresh with no query at all i get a validation error, so i want to make it optional. to do that we import Optional from typing, which gives fastapi an optional type for this argument, so the parameter becomes Optional[str] set to None, and the response becomes q or 'hello world'. now refreshing with no query works, and i can also pass ?q= and it echoes back whatever i sent. so we've laid the foundation pretty well to implement our predict method, ideally one that's well optimized for fastapi. it doesn't have to live on the index route; it could be a different path like /sms-model/predict, and it could use a different http method like post, but i'm going to keep it really simple to start and change it later if we need to. now let's load the keras model into fastapi. the way we do this is very similar to how we'll load in virtually anything we want fastapi to depend on, whether that's a database or this ai model: the @app.on_event('startup') decorator. it can wrap any function; i'll call mine on_startup, and that's where i want to load my model. the question is where to load it to, so i'll put a global variable up top, ai_model, starting out as None, and load into it with tensorflow: from tensorflow.keras.models import load_model. i already have the model path, so in the startup event i say if the model path exists, ai_model equals load_model of that path, and i declare global ai_model inside the function so i'm actually reassigning the module-level variable and it's available elsewhere. let's try it out by printing that ai model. when i run the project i get 'no module named tensorflow', but if i close the server and run pip freeze, tensorflow is right there; this happens to me from time to time with a virtual environment, and the fix is to deactivate it, reactivate it, and run uvicorn app.main:app --reload again. now it runs with no complaints; virtual environments just need to be refreshed every once in a while. notice it does take a moment to load the model in, but we're all set up and it really is that simple. it's definitely working, i can refresh the page and we're good. the other thing is that i don't just want my ai model; i also want my ai tokenizer,
the actual instance of the tokenizer class. to do that we import tokenizer_from_json from tensorflow.keras.preprocessing.text; remember we saved the tokenizer as json when we uploaded everything, and it's there alongside the metadata. so, yet again: if the tokenizer path exists, the global ai tokenizer is set to tokenizer_from_json, and since that function takes a json string i need to read the file's text first, which with pathlib is just read_text; i'll call the result t_json and pass that in. printing it out, i first hit a typo (keras should be spelled correctly), and then another: posixpath has no attribute 'exits', which of course should be exists. with the typos out of the way it loads, and the tokenizer object prints out, so i feel pretty confident it's there. the final things are the model metadata and, from it, the label legend inverted. same pattern again: if the metadata path exists, set the global model metadata, and since read_text just gives me a json string i import json and call json.loads on it. now i can print the metadata out, and not only print it, i can potentially use it on my endpoint, so let's try unpacking the model metadata into the response; it might be risky, but it restarts fine, and refreshing shows the metadata in the response. it's a tiny file, so returning it isn't a big deal, and in there is our labels legend inverted; i'm going to rename my variable from legend_inverted to label_legend_inverted so it matches the name the metadata uses. now that we have this i could use some of the other metadata right away, or set the legend as its own variable too; i probably could just pass the model metadata around everywhere, but i might want that legend on its own elsewhere, so i'll set it. using the legend in the response and letting the server reboot again, notice how it takes a moment to load everything in, including the actual ai model; imagine doing that on every single prediction request. so now we have the foundation: the model is loaded, along with the metadata and the tokenizer, and all that's really left is to write that predict function.
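a sketch of that startup loading, continuing the app and path variables from the earlier fastapi sketch; the metadata key name for the inverted legend is an assumption.

```python
import json
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import tokenizer_from_json

AI_MODEL = None
AI_TOKENIZER = None
MODEL_METADATA = {}
LABEL_LEGEND_INVERTED = {}

@app.on_event("startup")
def on_startup():
    global AI_MODEL, AI_TOKENIZER, MODEL_METADATA, LABEL_LEGEND_INVERTED
    if MODEL_PATH.exists():
        AI_MODEL = load_model(MODEL_PATH)                  # the trained keras model
    if TOKENIZER_PATH.exists():
        t_json = TOKENIZER_PATH.read_text()                # tokenizer_from_json wants a json string
        AI_TOKENIZER = tokenizer_from_json(t_json)
    if METADATA_PATH.exists():
        MODEL_METADATA = json.loads(METADATA_PATH.read_text())
        LABEL_LEGEND_INVERTED = MODEL_METADATA.get("labels_legend_inverted", {})
```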
now let's implement that predict method. we define predict, it takes in the query, and it returns a dictionary of some kind, which is roughly what we'll return from the endpoint as well. if you think about it you might say: well, the ai model, the loaded keras model, has a predict method, so can't i just pass the query straight into it? that might be what's intuitive, but we have to remember how we created this model and what we trained it on. the query here is a string; it is not a vector, and vectors are what we trained on in the first place, so we need to convert every prediction input into its own vector. thinking it through, the steps are the ones we already did: sequences, pad sequences, then the model prediction, and then converting to labels. the first three are exactly how we prepared the training data (the training step didn't run a prediction in that method, but it's essentially the same pipeline): convert the query into sequences, which is a vector; pad those sequences so they're the correct shape and size for the prediction; run the prediction; then convert the result into labels. first the sequences: we already have the tokenizer, so sequences equals the tokenizer's texts_to_sequences of the query. but before we even pass the query in, remember it's a single string, and when we trained we had texts, a list of strings, and that's what we padded; this is no different, so we need to pass the query in as a list. a side effect of that is we could run predictions on many strings at once, not just one, which also means the result will be a list of predictions, an array of predictions, so keep that in mind. now that we have the sequences, we pad them, and that becomes our input: x_input equals pad_sequences of the sequences with a maxlen, which i'll put in as 280 for now; make sure pad_sequences is imported from the keras preprocessing sequence module. with the input ready i run the prediction, and that gives me my preds list; it should only have one entry in it, an array containing one single array (or, in python terms, a list with one list in it); we'll look at that in a moment. before we go any further, though, that max length of 280 is going to trip you up a little, especially if you've been following exactly what i've been doing: in our local notebook where we converted the dataset into vectors we used a max sequence length of 300, while the model training notebook uses a max sequence of 280. the reason it's 280 here is that the production version of the conversion notebook, the one in the github repository, has it at 280; when we did it locally we had it at 300.
so to save confusion you can absolutely change that local notebook to 280 so everything is consistent, but i wanted to call it out so you know we definitely want the same sequence length across the board; if you use too small or too large a length, the input may not be what the model expects, and then the prediction may not be accurate either. i'm going to use the value from my exported model metadata, since this production run was based on the notebook in the github repository, not the one i did locally; they really are different notebooks, as you'd see if you compared them. so i'll grab that value as my sequence length: max_len equals the model metadata's max sequence, using .get with an 'or 280' fallback, and pass that in instead of the hard-coded 280. this gives us our prediction, the predictions array; i'll print that out, along with x_input and x_input.shape, save it, and run the prediction again (after removing the earlier print statement further down). the server is still running with no import errors, so let's look at the output. here's x_input: notice it has brackets on the outside and on the inside, so it's an array containing one single array; it looks a lot like a list, but there are no commas, which is the giveaway that it's a numpy array, and numpy arrays give us a shape: a single row of 280 columns, basically. the response is the same kind of thing, a list holding one single array, which is basically saying this one input row produces this one output row. in other words, our actual prediction for this input is the first element of the preds array; refreshing, once i actually print that element out, that's all it is. this matters because it's very easy to get that organization wrong: i could do all sorts of things with these predictions, and i could also pass the query in with the wrong shape, and the answer that comes back wouldn't necessarily look wrong either. to see what i mean, let's take the query back out of the list, save, let it reboot, and refresh: i get quite a different shape, but i still get results, which is interesting; and if i look at all of the predictions rather than just the first one, it's actually giving me far too many results.
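a sketch of that shape pitfall, using the globals loaded at startup in the earlier sketches; the shapes in the comments are illustrative.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

query = "this is awesome"
max_len = MODEL_METADATA.get("max_sequence") or 280

# correct: wrap the single string in a list -> one padded row -> one prediction
x_good = pad_sequences(AI_TOKENIZER.texts_to_sequences([query]), maxlen=max_len)
print(x_good.shape)               # (1, max_len): one row, max_len columns
print(AI_MODEL.predict(x_good))   # e.g. [[0.93 0.07]] -> preds[0] is our prediction

# wrong: pass the bare string and keras iterates it as if it were a list of texts,
# so you silently get many rows (and many "predictions") back instead of one
x_bad = pad_sequences(AI_TOKENIZER.texts_to_sequences(query), maxlen=max_len)
print(x_bad.shape)                # many rows -> the "too many results" case above
```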
what's happening is that it's treating the single string as if it were a list of text sequences, essentially giving us back the structure of the data we sent in, so that subtle change has a huge impact on the prediction. the next thing is the length itself: if i use 100 here, definitely lower than what the model was trained on, and run it again (it may take a moment to reboot), i get a warning about the wrong shape and a strange prediction back. it might even be close to accurate for this input, but look at it: the input is all zeros, which is practically like saying nothing, and saying practically nothing will tend to lean toward not being spam. so that's yet another reason to make sure the sequence length is right. what's nice, though, is that once it is correct, we have a proper predictions array with roughly a percentage for each label. if i convert it into a list we can see the preds as a plain python list; it looks like i printed the wrong variable the first time, so save again, refresh, and now we see the values as a list, pretty cool. what i want to do now is turn those values into a dictionary of some kind, so i'll build labeled predictions. let's start with a single dictionary: what's the label, and what's the confidence score? for the first entry, iteration zero, the label comes from the labels legend inverted using the string of that iteration index, and the confidence is that score; if i wrap the score in float it becomes a python floating point number instead of a numpy number. save that, refresh, and we get label ham plus the confidence score, which shows we can convert the original prediction into a readable result. that was the simple version; to turn it into a list comprehension we wrap it in brackets and do for i, x in enumerate(preds): i is the iteration number we were emulating by hand, and instead of indexing preds by i we can just use x directly. with that, our labeled predictions become the percentage for each label. we're not quite done, though: that's the overall set of predictions, but what about the top prediction, how do we get that? numpy has a function for this, so import numpy as np (and move that import up top), and np.argmax of the predictions array gives us the index of the
top value, which means from that top index i can build the same kind of response. so top_pred is a dictionary built from the same pieces: the label comes from the inverted legend at that index, the confidence comes from preds at that index, and i wrap it in float; we don't strictly have to convert to float, but doing so means that when this gets dumped to json it comes out correctly. so now the return value is 'top' set to that top_pred and 'predictions' set to the labeled predictions list, and we have a full prediction function with the top prediction included. we don't actually need the top prediction; the only reason i keep it is that if i ever switch this to many different labels, the function still works the same way; it really only needs the predictions, but we'll keep top in there. back in the endpoint i can say preds_dict equals predict(query) and unpack the whole thing into the response; actually, let's nest it under a 'results' key in the dictionary that comes back. save, refresh, it runs the prediction, and we get the top value and all of the values. now let's remove the float() calls from these elements and see what happens: we get an internal server error, and the message is that a numpy float32 object is not iterable. what's happening is that the prediction gives us back numpy numbers, not python floats, and fastapi is trying to dump the response into json. if i put the floats back for a moment and instead try json.dumps on the raw values (json is already imported), i get the same kind of error, object of type float32 is not json serializable, and that's exactly why we convert these into plain python floats. another way to handle it, one you'll see fairly often when working with numpy data, is to implement a numpy-aware json encoder: you define a NumpyEncoder class, call json.dumps on the data with cls set to that encoder, and then json.loads the result back into a python dictionary; essentially it encodes everything into valid json and then loads it back. with that, the final response looks the same, except i never had to convert anything to float manually, it just does it for me, and i can reuse that encoder over and over for all sorts of things, which is why i think it's pretty cool, and it's fairly straightforward in how it works; i already have numpy installed.
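here's a common numpy-aware encoder of the kind described above; this is a general pattern, not necessarily the exact class used in the course repo.

```python
import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)          # handles float32 values from model.predict
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

results = {"top": {"label": "ham", "confidence": np.float32(0.93)}}
encoded = json.dumps(results, cls=NumpyEncoder)   # now serializable
cleaned = json.loads(encoded)                     # back to plain python types
print(cleaned)
```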
So that's how we perform those predictions. There's a lot to it, and honestly it isn't formatted that well: all of the loading code and all of the prediction code are mixed together. What I want is to turn this into a proper class. At this point we could absolutely swap out the model; as long as it's a text model, this would still run a prediction correctly, though an image model obviously wouldn't work, and the encoding portion and the prediction labels would be extra steps we'd want to handle too. So let's clean this up by turning it into its own class; it ends up only slightly different from what we have now.

In a moment I'm going to refactor the project a little, and the process is a bit tedious because we're really just re-implementing things we already did. My philosophy is that the sooner you get something done and released, the better, so doing this refactor (or watching it) isn't strictly necessary; you could just grab the finished code: ml.py, encoders.py, and the updated main.py. The result is that main.py becomes much more focused on FastAPI itself rather than everything else it could potentially be doing. We could refactor even further and make this a nicer data class or model class, but I don't want to spend too much time on that tedium, because this series is really about deploying an AI model into production so we can collect more data. This will stay a microservice; there isn't going to be a whole lot beyond making predictions and storing that data in production. If you want to skip this part, by all means do; if not, let's jump in.

Right now our FastAPI application is really just our machine learning model, so we want to move that into its own class, which will clean up the FastAPI code and make the machine learning code more reusable. Inside the app, create ml.py. First, from dataclasses import the dataclass decorator (if you're on an older version of Python, such as Python 3.6, you'll have to pip install dataclasses). Now define a data class: class AIModel, which will hold our model path, tokenizer path, and metadata path. Each of these is a path we'll use to initialize things, so we'll also define a __post_init__ method, and based on those paths we'll set self.model, self.tokenizer, and self.metadata, each starting out as None. To make it a data class, we simply decorate the class with @dataclass.
When we initialize this class, it will take those three paths as arguments, something like AIModel(model_path=..., tokenizer_path=..., metadata_path=...), and the resulting instance will be responsible for everything else, including our predict method, which takes in a query and returns our results. First let's declare the types: from pathlib import Path, and model_path is of type Path. The tokenizer path and metadata path I'll treat as optional, because what if I reuse this AIModel class for an image model or something like that? So from typing import Optional, make both of them Optional and default them to None.

In __post_init__, if self.model_path exists, then self.model = load_model(self.model_path); that load_model import is the same one we used in main.py. Simple enough. For the predict method we'll add a get_model helper that returns self.model, and if it isn't set we'll raise an exception, "model not implemented". We could also define our own exception classes here, something like a NotImplemented class that subclasses Exception, and do all sorts of useful things with that; I won't spend the time on it now, but that's how you'd start building your own exceptions. So in predict we have model = self.get_model(), and then model.predict of... well, of what? We have a query, and this method is really going to change from predict to predict_text, because we need to convert that text into a model input. What I need is my x_input from the query, so I'll add a get_input_from_text method that takes the query (again passing self and the query).

For that we need the tokenizer. So: if self.tokenizer_path and self.tokenizer_path.exists(), that is, if the value was actually set on the instance and the file is there, then load it. Think back to how we loaded the tokenizer before: tokenizer_from_json, so bring that import in. I'm assuming the tokenizer will always be JSON; we could add another condition for other formats, and assuming JSON is probably not a great idea, but we'll go with it. So self.tokenizer = tokenizer_from_json(tokenizer_text), where tokenizer_text is self.tokenizer_path.read_text(). In fact, let's also check that self.tokenizer_path.name ends with "json" and only load the tokenizer when it does; I could raise an exception otherwise, but I'll leave it at that. So this checks the path name ends with json and then loads it in.
Great, so now we have our tokenizer. In get_input_from_text I want tokenizer = self.get_tokenizer(), and yet again I'd like an exception if it isn't there, so add a get_tokenizer method that raises "tokenizer not implemented" when missing. We already did the hard part, which was building the x input, so bring over the tokenizer-to-sequences code: these really are our sequences, so define get_sequences_from_text (we could call it from_query, but from_text is fine), pass the text in, and move that code up into the class. Actually it makes more sense to me to split this: instead of x_input coming straight from the text, get_input_from_text becomes get_input_from_sequences, and a separate get_sequences_from_text uses the tokenizer and returns the sequences. So sequences = self.get_sequences_from_text(query). We could have called it texts_to_sequences, but I like get_sequences_from_text because it doesn't conflict with the Keras name.

Another thing to consider is accepting a list here, so the texts argument gets a type of List[str] from typing. That just means that, as we've seen before, we pass the query in wrapped in a list. We'll assume the query itself is a string; the list just gives me flexibility if I ever want to pass several texts instead of a single one. And we'll rename the other method to get_input_from_sequences.

Back in main.py, here's the sequence-related code that builds the actual x input. It needs the model metadata, which we'll treat just like the tokenizer and the model: add a get_metadata method, implemented very similarly, raising "metadata not implemented" if it's missing. Looking back at main.py for how we loaded it before: I'll leave out the inverted-legend handling for now, but I definitely want the same ends-with-json check. Then self.metadata is simply read_text on that path followed by json.loads, so we need to import json. Now we have our metadata, hopefully. Scrolling back down, the call becomes self.get_metadata().
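Pulled together, the loading half of the class looks roughly like this. It's a sketch, not the repo's exact code: the method names follow the walkthrough, and the "max_sequence" metadata key plus the 280 fallback are assumptions.

```python
# ml.py (loading half) : a sketch of the AIModel class built in this section
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json


@dataclass
class AIModel:
    model_path: Path
    tokenizer_path: Optional[Path] = None
    metadata_path: Optional[Path] = None

    def __post_init__(self):
        self.model = None
        self.tokenizer = None
        self.metadata = None
        if self.model_path.exists():
            self.model = load_model(str(self.model_path))
        if self.tokenizer_path and self.tokenizer_path.exists():
            if self.tokenizer_path.name.endswith("json"):
                self.tokenizer = tokenizer_from_json(self.tokenizer_path.read_text())
        if self.metadata_path and self.metadata_path.exists():
            if self.metadata_path.name.endswith("json"):
                self.metadata = json.loads(self.metadata_path.read_text())

    def get_model(self):
        if not self.model:
            raise Exception("Model not implemented")
        return self.model

    def get_tokenizer(self):
        if not self.tokenizer:
            raise Exception("Tokenizer not implemented")
        return self.tokenizer

    def get_metadata(self):
        if not self.metadata:
            raise Exception("Metadata not implemented")
        return self.metadata

    def get_sequences_from_text(self, texts: List[str]):
        tokenizer = self.get_tokenizer()
        return tokenizer.texts_to_sequences(texts)

    def get_input_from_sequences(self, sequences):
        # "max_sequence" key and 280 fallback are assumptions about the metadata file
        maxlen = self.get_metadata().get("max_sequence") or 280
        return pad_sequences(sequences, maxlen=maxlen)
```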
From get_metadata we grab our max sequence length, the sequences become the actual input, and we pad them with pad_sequences. Did I import that? Nope, so let's import it. Now we have get_sequences_from_text, our x input, and we can actually run the prediction. That gives us the raw preds, and then we'll return them, or rather format them first; I'd say yes to formatting, so add an include_top flag that defaults to True.

Let's talk about the formatting. Back in main.py the formatting is roughly this: the label and the confidence. The first piece is labeling a single prediction, and for that we need the labels legend. So define get_label_legend, which takes self, grabs the metadata, and returns its inverted label legend entry (using .get, falling back to an empty dictionary if it isn't there). The data coming through is the index value, idx, so the label is simply legend[str(idx)], and for the confidence I'll leave the raw value alone. Since I ended up writing the whole labeled prediction inline, let's cut that into its own method, get_label_pred, which looks up legend = self.get_label_legend() and builds the label and confidence for a given idx and value. This also gives me the chance to raise an exception if there's no legend at all, because things would break without it: if len(legend.keys()) != 2, raise an exception saying your legend is incorrect. Strictly speaking this check should be driven by the actual set of possible labels stored in the metadata, but I'll leave it like this to keep things simple; we can improve it later.

Now, from the raw predictions I want labeled results. We've already seen this in main.py: grab the labeled-preds code, paste it in, and this time, instead of building each one by hand, call self.get_label_pred(i, x) while enumerating the predictions, where i is the index and x is the value. Then results is a dictionary containing those labeled predictions, and we return results.
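Expressed as standalone helpers (easier to show than a class fragment), the labeling logic just described works out to roughly this; the legend shape {"0": "ham", "1": "spam"} is an assumption based on this model's two labels:

```python
# a standalone sketch of the labeling helpers; names are assumptions and may
# differ from the AIModel methods in the repo
import numpy as np

legend_inverted = {"0": "ham", "1": "spam"}  # assumed shape of the inverted label legend


def get_label_pred(idx, value, legend=legend_inverted):
    # map a class index plus a raw probability to a labeled prediction
    return {"label": legend[str(idx)], "confidence": float(value)}


def label_predictions(preds, legend=legend_inverted, include_top=True):
    if len(legend.keys()) != 2:
        raise Exception("Your legend is incorrect")
    labeled = [get_label_pred(i, x, legend) for i, x in enumerate(list(preds))]
    results = {"predictions": labeled}
    if include_top:
        top_idx = int(np.argmax(preds))
        results["top"] = get_label_pred(top_idx, preds[top_idx], legend)
    return results


# example: a raw model output for one text
print(label_predictions(np.array([0.86, 0.14], dtype=np.float32)))
```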
Finally, the top prediction: if include_top, then results["top"] = self.get_top_pred_labeled(preds). So up above, define get_top_pred_labeled(self, preds), and it does essentially what we did before: find the index of the largest value, take preds at that index as the value, run it through get_label_pred, and return it. Now we can get the top labeled prediction, the full results, everything; we finally have a real class to handle our loading, our labeling, and our prediction.

Back in the FastAPI app, let's strip out a bunch of things. One piece I left out was the encoding; we'll bring that back in a moment. On startup it's now just a global AI model instance, AI_MODEL. Import the module with from . import ml, and inside the startup handler set the global: AI_MODEL = ml.AIModel(...), passing in all of the paths we set up above: model_path, tokenizer_path, and metadata_path. With the AI model in place I no longer need the old loading code; running a prediction is just AI_MODEL.predict_text(text). The next step is removing all of the TensorFlow-related imports from main.py.

Then the encoder. Make a new file called encoders.py, import json there, and back in ml.py import it with from . import encoders. In the prediction function add a flag, encode_to_json=True. In encoders.py, alongside the NumpyEncoder class, add a function called encode_to_json that takes some data and returns json.dumps(data, cls=NumpyEncoder). We can also give it an as_py flag: encoded = json.dumps(...), and if as_py, return json.loads(encoded); otherwise return the encoded string. Back in ml.py, if encode_to_json is set, results becomes encoders.encode_to_json(results, as_py=True) and we return that; we could of course set the flag to False and just return the raw results. Back in main.py we can remove any imports we no longer use, like numpy (and probably json; we might need it again, but not right now), and main.py is a little cleaner for it.
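After the refactor, the slimmed-down main.py looks roughly like this. It's a sketch under a few assumptions: the model, tokenizer, and metadata filenames are placeholders, and predict_text is the AIModel method assembled over the previous steps (tokenize, pad, predict, label, encode to JSON-safe types).

```python
# main.py after the refactor : a sketch, not the repo's exact code
import pathlib
from typing import Optional

from fastapi import FastAPI

from . import ml  # the AIModel class sketched above

app = FastAPI()

BASE_DIR = pathlib.Path(__file__).resolve().parent
MODEL_DIR = BASE_DIR / "models" / "spam-sms"        # hypothetical layout
MODEL_PATH = MODEL_DIR / "spam-model.h5"            # hypothetical filenames
TOKENIZER_PATH = MODEL_DIR / "spam-tokenizer.json"
METADATA_PATH = MODEL_DIR / "spam-metadata.json"

AI_MODEL = None  # loaded once at startup so every request reuses it


@app.on_event("startup")
def on_startup():
    global AI_MODEL
    AI_MODEL = ml.AIModel(
        model_path=MODEL_PATH,
        tokenizer_path=TOKENIZER_PATH,
        metadata_path=METADATA_PATH,
    )


@app.get("/")
def read_index(q: Optional[str] = None):
    query = q or "hello world"
    # predict_text handles tokenizing, padding, predicting, labeling, and
    # (via encoders.encode_to_json) converting NumPy types to plain Python
    preds_dict = AI_MODEL.predict_text(query)
    return {"query": query, "results": preds_dict}
```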
Now let's make sure everything runs. First a syntax error (I'm missing a comma in a couple of places), and then, not exactly quickly, a legend error: "your legend is incorrect". Let's print out the legend keys as well as the metadata; they come back empty, which must mean we aren't pulling the right data out, maybe I didn't save it correctly. Trying again, the legend is still incorrect, but scrolling up a bit I can see the inverted legend does have two keys; I was just reading the wrong key for that label legend. With the right key I should get a lot more output. Next, NumPy isn't imported in ml.py, so import numpy as np, refresh, and there we go: everything prints out and the predictions come back labeled and ready to go.

I do have one problem: it's only labeling one thing. Looking at the prediction function, it comes down to the preds coming back; remember the input is a list of texts, so we only want the first entry of the results, which means indexing at zero. That solves it. We've already seen this pattern, so if it's confusing at all, let me know, but now we've got it.

That was not a short path to a cleaner FastAPI app. You could argue we should have written this class first, and I tend to agree, but there's a lot of tedium in it that doesn't produce results as quickly as we want, and you could also have skipped it altogether and just copied and pasted the earlier version, importing that single function. The old version still worked and was still valid; I just wanted this to be a bit more reusable for however I might use this class in the future, for example if I switch to a new model.

Going back into main.py, what I really have here is my spam SMS model, or really just my spam model, so it would make sense to have an endpoint for spam. I'm going to leave the variable named AI_MODEL, though, because if I were actually going to serve multiple models I would have to change a number of things, not just this. I don't want one application serving a bunch of different models; this API should really serve one model, maybe two or three at the most, because each model takes up a good amount of memory. It's better to duplicate the whole service for a different model, although the model class itself would also need some changes to be completely reusable.

This REST API has three primary purposes: number one, provide really good predictions on whether or not a string of text is spam; number two, improve the conditions under which it can make better predictions; and number three, be open and in production for other applications and users around the world. What we're focusing on now is number two, improving the conditions so we can make the model better, and a big part of that is simply storing the data: what's being predicted on, plus some or all of the results. To do that I'm going to use a NoSQL database you probably already know, Cassandra, managed by DataStax through the Astra DB service.
They did sponsor this, but the idea is that you want to sign up because you get 80 gigabytes of free data monthly, which is a lot, especially considering what we're going to store: really just the query, up to 280 characters at most, plus some of the prediction results. The key thing is storing the query itself, because in the long run we want some way of labeling all of these queries; that's for the future and not something we'll cover in this one. So sign up for a DataStax account at the link (if you already have one, that's fine) and log into the console.

The first step is getting my API keys: go into your organization settings, then Token Management, and generate a new token with the admin user role. Grab the client ID and drop it into my environment variables as ASTRA_DB_CLIENT_ID, and the client secret as ASTRA_DB_CLIENT_SECRET. Those are the only environment variables I need at this point; I'll do more with Astra DB later, but for now I just want my application to be able to read that configuration.

The next part is allowing some additional configuration. Inside our app, create config.py; what I want is for environment variables to work in my project, in other words os.environ. When we built the training pipeline, I used python-dotenv: from dotenv import load_dotenv. Now we'll do something very similar, but with pydantic: from pydantic import BaseSettings. pydantic is installed by default with FastAPI because it's how you configure settings for FastAPI, which covers environment variables but also other project-wide configuration, like whether the project is in "debug" mode. So create the settings: class Settings(BaseSettings), with db_client_id typed as a string and db_client_secret as well; these are the names I want those environment variables to be known by inside the project. The idea is that we'll also write a function called get_settings that returns an instance of this Settings class, and I want it configured automatically; unlike our machine learning data class, where we wired everything up ourselves, the BaseSettings class from pydantic does the configuration for us based on our environment variables. The first piece of that is a nested class Config with env_file = ".env", so values load from that .env file. Then, to map db_client_id to the right environment variable, we use Field:
set the attribute equal to Field(...), where the literal ... means the value is required, and pass env="ASTRA_DB_CLIENT_ID" as the environment variable name for that field. Another way to do this is to simply name the settings attribute astra_db_client_id and skip Field entirely; pydantic will map the lowercase attribute name to the uppercase environment variable on its own. I'm sticking with Field because it's what I did in a different series, but either way we're just pulling those two values from the environment.

Now import one more thing: from functools import lru_cache, and wrap get_settings with it. What this decorator does is cache the result of the function, so after it's called once in a running application, calling it again won't create a new Settings instance; it just makes things a little more efficient. With that in place, go into main.py, import config alongside ml, and do settings = config.get_settings(). Down in the response I'll add db_client_id: settings.db_client_id just to verify it's wired up; the app is still running, I hit the home page, and there it is.

The question is whether I can still use os.environ directly. Let's see: import os, grab one of those environment variables with os.environ.get, save, and refresh... it looks like something's wrong because it isn't refreshing; reloading manually and giving it a moment, it still doesn't load. So the pydantic settings model is how we're going to grab environment variables inside FastAPI, and that's going to matter for the upcoming steps of integrating Astra DB into this project.

Now that the settings are set up, we can also add an AWS access key ID and a secret access key; I'll default these to None for now and keep the field names lowercase so you can see the mapping in action: settings.aws_secret_access_key, refresh, make sure everything's saved, and there we go, it maps just fine. Our project is now ready to start integrating the managed Cassandra database, Astra DB; let's do that next.
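Here is a compact sketch of config.py as described, using pydantic v1's BaseSettings (the version bundled with FastAPI at the time of this series); the environment-variable names match the ones added to .env above, and the placeholder values in the comment are just examples:

```python
# config.py : a sketch of the settings module described above
# .env is expected to contain (values are placeholders):
#   ASTRA_DB_CLIENT_ID=your-client-id
#   ASTRA_DB_CLIENT_SECRET=your-client-secret
from functools import lru_cache
from typing import Optional

from pydantic import BaseSettings, Field


class Settings(BaseSettings):
    # required values, mapped explicitly to the uppercase environment variables
    db_client_id: str = Field(..., env="ASTRA_DB_CLIENT_ID")
    db_client_secret: str = Field(..., env="ASTRA_DB_CLIENT_SECRET")
    # optional values that map automatically by upper-cased field name
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None

    class Config:
        env_file = ".env"


@lru_cache
def get_settings():
    # cached so repeated calls reuse the same Settings instance
    return Settings()
```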
Alright, jump into the Astra DB console on DataStax and create a database. I'll name this one ai-as-an-api, and for the keyspace name I'll use spam_inferences. A keyspace is essentially a collection of tables; the database is a collection of keyspaces, which are in turn collections of tables. Next, select a provider and region; I'll use AWS, North America, the Oregon region. Which region and provider you pick doesn't really matter; the managed database service from DataStax is great regardless, and we don't get direct access to those underlying services anyway, so pick whichever you like and create the database. There may be a more serious reason to choose a specific provider once you reach much higher levels of production, but that's outside the scope of what we're doing here.

Now that the database is created and ready, click Connect and navigate to the Python driver instructions. We'll be downloading the secure connect bundle and using the cassandra-driver package, as the documentation describes. The main thing about this driver is that it needs that bundle, which presents a problem for us when we go into production, a problem we will certainly solve, but a problem nonetheless: we don't want the bundle exposed everywhere, yet we still need to move it around. It's the same sort of challenge we have with our model files; perhaps we'll use object storage for it. We'll see what we do when we get to production, but for now create a folder in the project called "ignored". The reason for the name is that I'll add "ignored" to my .gitignore, so when I drop the downloaded bundle in there it's ignored by git from the get-go. So open up the downloads folder, grab the bundle, and rename it to something more consistent; in my case I'll make it a bit more specific and call it astradb_connect.zip, so it's clearly recognizable as my Astra DB connection bundle.

Inside the app, also create db.py. The goal is a way to connect to a session and cluster on Astra DB, but before that, make sure cassandra-driver is in requirements.txt and actually installed in the virtual environment, because it's certainly a requirement for storing this inference data. Sure enough, it is. Back in db.py, the first thing is the settings: import config, then settings = config.get_settings(). The reason is the environment variable values, settings.db_client_id and settings.db_client_secret. We don't have to structure it this way, but I like it because it makes it very clear in this file that we're using Astra DB. Next we need the path to that connection bundle inside the ignored folder, which is critical for the driver itself: import pathlib, and BASE_DIR = pathlib.Path(__file__).resolve().parent, which should give us the parent app directory (we'll verify shortly). Then declare where the cluster bundle lives: CLUSTER_BUNDLE is the string of
BASE_DIR joined with "ignored" and then the bundle name we chose, astradb_connect.zip. We do not need to unzip this file; the driver handles it when we connect. Next, back in the console, it shows a couple of the imports we'll need, so bring those into db.py: the Cluster and the auth provider. How I connect is slightly different from the snippet, which shows the connection with the session right there, so define get_cluster and paste those pieces in, leaving the session out for a moment. We have the cloud config that points at a bundle (ours is CLUSTER_BUNDLE, and I made it a string on purpose, because we want an actual string here rather than a pathlib Path), then the auth provider, which takes the client ID and client secret, and then the Cluster itself, which get_cluster returns.

That isn't much different from the docs, but it doesn't give me the session, and we need to get the session in a different way than the snippet shows because of how our application runs on FastAPI. So define get_session: first cluster = get_cluster(), then session = cluster.connect(). Next, import one more thing from the driver: from cassandra.cqlengine import connection. Inside get_session we set a few things: connection.register_connection, because we're going to use this session throughout the FastAPI application (or really wherever we need it), registering it under str(session) with session=session, straight from the documentation; then connection.set_default_connection with that same name; and finally return the session. We'll see it in action in a moment, but those two calls are the key difference from the documentation, which doesn't account for FastAPI; that's our baseline configuration for the session and the cluster.

The session is what we always need when we want to execute things on the cluster. The cluster itself, via get_cluster, can be reused elsewhere; for example, if I wanted to use this from Jupyter notebooks I'd call get_cluster and then connect my own session, since a notebook doesn't need its connection registered the same way the FastAPI app does. We probably won't touch Jupyter notebooks for these clusters in this one, though, because how we're going to store everything is actually pretty simple.
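Putting that together, db.py looks roughly like this; the bundle filename matches what we renamed above, and the rest follows the driver documentation plus the two cqlengine registration calls:

```python
# db.py : a sketch of the Astra DB connection helpers
import pathlib

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster
from cassandra.cqlengine import connection

from . import config

settings = config.get_settings()

BASE_DIR = pathlib.Path(__file__).resolve().parent
CLUSTER_BUNDLE = str(BASE_DIR / "ignored" / "astradb_connect.zip")  # the renamed bundle


def get_cluster():
    cloud_config = {"secure_connect_bundle": CLUSTER_BUNDLE}
    auth_provider = PlainTextAuthProvider(settings.db_client_id, settings.db_client_secret)
    return Cluster(cloud=cloud_config, auth_provider=auth_provider)


def get_session():
    cluster = get_cluster()
    session = cluster.connect()
    # register this session so cqlengine models (sync_table, .objects) can use it
    connection.register_connection(str(session), session=session)
    connection.set_default_connection(str(session))
    return session
```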
Now let's talk about the data we want to store and then create the Cassandra model to store it. Looking at these results, the query is definitely something we want to store; in fact it might be the only value we truly need, since it's what this inference model runs on. Next, I'll also store the inference, that is, the prediction itself. This particular AI model only has two labels, which really means I only need to store one of them, so I'll store just the top one. A case could be made for storing all of the prediction data, but I don't need to; I'll store the label (ham or spam, whichever came out on top) and the confidence associated with it. That's really just three fields.

So open up the project, go into the app, and create a file called models.py. In here we'll create the inference model: class SMSInference, which inherits from a class called Model. Again, what we're storing is the query, the label, and the confidence. One other piece: I'm going to store a UUID field as well, using uuid.uuid1, which embeds a timestamp, which is really nice, so import uuid. Now build the model itself: from cassandra.cqlengine.models import Model, which is the base class here, and we also need the columns: from cassandra.cqlengine import columns. The documentation for the columns is pretty straightforward about what's available; the label, for example, is a text column, columns.Text, and it accepts some parameters, which is certainly something we want. So it's columns.Text for both the label and the query. The query itself arguably should have a max length; remembering back to our data, we enforced a max length of 280 even for the prediction itself.
Now, I say it should have a max length, but it might not, so I'm going to leave that off altogether. The reason is that I'm letting anyone submit anything here, so I'm not enforcing a string size on ingestion even though the prediction side might enforce one; in other words, when I'm building the next dataset, the incoming query might end up being a lot longer, and I want to stay flexible enough to allow that. The next thing to think about is indexing. I could index this query column, and there are positives and negatives: if the queries end up really big, indexing them is probably not a great idea; if they're small, it's easier to justify. I'm not going to index it at all. Instead I'm treating SMSInference as one really big dataset that I'll deal with later; I'm not extracting anything here, just capturing as much raw data as possible.

Next is the UUID field. We could use the TimeUUID column or simply the UUID column; either works for us, so I'll use columns.UUID, and this will be our primary key (you have to set one). If you're coming from SQL, the primary key is often an auto-incrementing integer: one, two, three, four, five, and so on. We don't want that in a Cassandra database because of how it's distributed: the data lives across many machines, or potentially many machines, in a bigger cluster. A UUID at least identifies each row uniquely, and uuid1 is both unique and tied to a timestamp, so that's what we'll stick with, giving it a default of uuid.uuid1. Really simple, not a lot going on in this model, but it will happily store a ton of data for us. The final field is the confidence, which is columns.Float. That's our inference model.

So how do we bring this into the project? That part is surprisingly simple now that the database, the session, and the model are set up. Go into main.py, import models, and above that import db. In the startup method I need my database session, so at module level set db_session = None, and inside startup declare it with global and set db_session = db.get_session(), db being the module we just wrote. (If that registration logic weren't already inside get_session, you would want to do it here, but I want to keep things simple for myself.) Next, bring in the model: SMSInference = models.SMSInference, and do that at module level too, because I want to reference it in other places, not just in the startup method.
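models.py then comes out to roughly the sketch below; the keyspace name is whatever you created in Astra DB (I'm assuming spam_inferences here), and the table name defaults to one derived from the class name:

```python
# models.py : a sketch of the inference table
import uuid

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model


class SMSInference(Model):
    __keyspace__ = "spam_inferences"  # assumption: the keyspace created in the console
    uuid = columns.UUID(primary_key=True, default=uuid.uuid1)  # uuid1 embeds a timestamp
    query = columns.Text()            # the raw text we ran inference on
    label = columns.Text()            # top predicted label ("ham" or "spam")
    confidence = columns.Float()      # confidence for that top label
```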
What I want to make sure of is that the Cassandra database actually has this model as a table, and to do that we'll use a command called sync_table and run it in the startup event. Before going further, recall the keyspace: back in the console we have that spam_inferences keyspace, and we haven't used it yet. We never declared the database or keyspace in code; the database connection comes from the astradb_connect bundle, but the keyspace does not, so the keyspace has to live on the model itself. That's where we declare it, with the double-underscore attribute __keyspace__ set to the keyspace we just created. We could absolutely change the table name as well; I'll leave it as the default, which is derived from the model class name, but there are other configuration options, all covered in the documentation: whether a column is indexed, its default, whether it's required (I'm leaving all of that off), and on the model side you can change the table name and more. A quick way to find all of this is searching for the cassandra-driver Python models documentation and reading through the model configuration there.

The other part of this is that we're only storing one flat layer of data; there isn't much data this application needs anyway, and I'm not associating it with any particular user. There's a chance you're thinking, "well, I actually want my REST API to have users", in which case you'd set things up differently. A quick example: a RestAPIUser model could still use a UUID as its unique identifier and have an email column; or, better than a full user model, you could simply add an email field to the inference model itself, columns.Text with index=True, and because of that index a query would give you all of the rows related to a given email. You could make email the primary key instead, but a primary key value can only appear once in the entire table, so just as this UUID belongs to exactly one inference row, an email primary key would mean one email across all inference rows. If you want more clarification on that, check out my series on scraping Amazon, where we implement the same sort of setup.

Anyway, now that the keyspace name is on the model, I'm finally going to sync the table, and then we're mostly done with this portion. sync_table comes from cassandra.cqlengine.management, so import it and sync the table in startup. Make sure everything is running; it looks like it is, and if I refresh my app...
Yeah, it doesn't seem like I have any major errors. I do get a cqlengine warning about schema management, and the way to solve that is to go into our configuration (config.py), import os, and set the CQLENG_ALLOW_SCHEMA_MANAGEMENT environment variable to "1"; yes, we want to allow schema management. Refresh, and that warning goes away. There's also the TensorFlow log noise, another thing we could probably control through environment variables, but I won't worry about it right now. What I want to do instead is save some data with this model and then retrieve it.

Alright, let's actually store some inference data. Jumping back into main.py, down in read_index, we'll use the inference model. The simple version: top = preds_dict.get("top"), which should itself be a dictionary with the label and the confidence. That means the new data I want to store is the query plus the unpacked top data, and then my object is SMSInference.objects.create with that data unpacked into it. I should be able to return that object as the response, so save it, run the project again, and hopefully we see actual responses for "hello world" along with rows landing in the table. There we go, and it gives us the UUID back too, which is nice, because if I hit it again the UUID should change; refreshing shows new UUIDs every time. I can also try some new queries, like "this is awesome", or "another one bites the dust", or some deliberate spam: "this is spam spam spam" (not actually classified as spam, as it turns out), whereas something like "discount on our phone, buy now, call now" with a few numbers thrown in is flagged as spam almost for sure. So it's definitely storing this data.

Maybe this is the response we want users to see, maybe not; in my case I do want it to give me a label and a confidence for the particular item, and if it comes back with the opposite label, the response simply reflects that. If you always wanted the spam confidence specifically, you'd just add a condition that flips it around, or update the dictionary accordingly; that's not what I want to do here. What I do want to see is how to turn this into list data, and then how to paginate that list data; that's coming in a moment, but first let's look at individual responses. The simplest way to fetch one is to add a path argument in brackets, my_uuid, on a route like /inferences/{my_uuid}, then get the object with SMSInference.objects.get(uuid=my_uuid) and return it; call the view read_inference.
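The main.py additions described here come out to roughly this sketch; FastAPI allows multiple startup handlers, so this can sit alongside the model-loading one, and the route and field names follow the transcript rather than the repo:

```python
# main.py additions : a sketch of wiring the Astra DB session and storing inferences
from typing import Optional

from cassandra.cqlengine.management import sync_table

from . import db, models

SMSInference = models.SMSInference
DB_SESSION = None


@app.on_event("startup")
def on_db_startup():
    global DB_SESSION
    DB_SESSION = db.get_session()   # also registers the cqlengine connection
    sync_table(SMSInference)        # create/update the table in the keyspace


@app.get("/")  # this grows out of read_index; later it becomes POST /create-inference
def create_inference(q: Optional[str] = None):
    query = q or "hello world"
    preds_dict = AI_MODEL.predict_text(query)
    top = preds_dict.get("top")               # {"label": ..., "confidence": ...}
    data = {"query": query, **top}
    obj = SMSInference.objects.create(**data)  # write a row to Astra DB
    return obj


@app.get("/inferences/{my_uuid}")
def read_inference(my_uuid):
    return SMSInference.objects.get(uuid=my_uuid)
```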
Granted, I can grab that recent inference's UUID, go into /inferences/ with it, and take a look at the individual record, and what do you know, there it is. A really simple, easy way to save data and look at it. Like I said, though, I want to turn this into a list, to see all (or at least many) of the objects in there, and that means dealing with pagination in a moment. I'll show two ways to list the data.

For the first way, copy the original endpoint, drop the UUID parts, and call it /inferences. It's really simple: SMSInference.objects.all() gives us a query set, and we return a list of that query set. We can also print out the query set to see what it is: let it run, go to the /inferences URL, and in the console we see the CQL command that will be executed in the session. Notice that it has a LIMIT of 10000; we'll stick with that limit and show another way to list things out. It does list everything, it lists it reasonably well, and it's all JSON data.

The second approach isn't JSON but streaming data. Copy that endpoint and call this one the export, or /dataset for now. What I want is my CQL query, set to the statement we just saw printed; it's simply something I'll execute against my session. To execute it, take the database session and run it: rows = db_session.execute(cql_query), and return the rows. This may or may not work exactly as anticipated, but hitting /dataset shows it's a simple way to execute and get rows back, and the data comes back in a fairly straightforward way. If we drop the limit down to, say, 12, it responds a bit faster, though of course the returned dataset is much smaller once it finishes loading. I'll note that the reason it takes a while to load at all is TensorFlow, not Cassandra; the Cassandra side is really fast, it's TensorFlow loading into memory that takes time. Either way, refreshing gives us whatever that query returns. For streaming this dataset, I want every possible row up to a certain limit, which I'll leave at 10000, and I want to drive it off that same query.
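Before moving on to the streaming version, here is a sketch of the two list approaches just described; the table name in the raw CQL is an assumption, so copy whatever statement the query set prints out for you:

```python
# listing the stored inferences : a sketch of both approaches described above
@app.get("/inferences")
def list_inferences():
    q = SMSInference.objects.all()   # cqlengine builds a SELECT ... LIMIT 10000
    print(q)                         # shows the CQL that will be executed
    return list(q)


# raw CQL version; the keyspace/table names are assumptions, reuse the printed statement
CQL_QUERY = "SELECT * FROM spam_inferences.sms_inference LIMIT 10000"


@app.get("/dataset")
def export_inferences():
    rows = DB_SESSION.execute(CQL_QUERY)  # plain driver execution on the session
    return list(rows)
```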
Before running the paged query itself, let's implement a streaming response in FastAPI: from fastapi.responses import StreamingResponse. What this does is let me pass in a generator and stream back whatever that generator yields. For example, define a generator called fetch_rows; leave its arguments empty for now, and for i in range(30) just yield out a string that's basically "row i abc". In the StreamingResponse I can then pass fetch_rows(); it calls the generator, produces all of that data, and streams it back onto the page. Another thing to consider is adding a newline ("\n") to each yielded string, which changes how the output comes out so it reads closer to CSV, a comma-separated-values table, as we see here.

What I want now is to turn that CQL query into something that generates this same kind of output, and that takes a few steps. First, import the statement class from the driver: from cassandra.query import SimpleStatement. We'll use it in fetch_rows, so give the function a stmt parameter typed as that statement, and also add a session parameter defaulting to None. The statement we start with wraps the query: stmt = SimpleStatement(cql_query), and then we fetch rows from it. I also want a fetch_size: how many rows any given page should pull, even if the overall limit is 10000; think of it in terms of pagination. It's an integer defaulting to 25, and down where we call it I'll leave it at 25 and pass in the session, the db_session global (be sure to declare it with global so we can grab it).

Inside fetch_rows we can now build this out. The idea is to take the statement and set its fetch_size, then get the result set: rs = session.execute(stmt). That's only slightly different from what we just did; it takes the statement but also respects the fetch size, so it fetches only 25 entries at a time while still being limited to 10000 rows overall. That hard limit stays, because exporting much more than that in one request could take far too long; if you need to export the entire dataset, you'd probably want a cron job or a Celery task that writes everything out to a CSV file and uploads it to object storage, which isn't what we're doing here. Now that we have the result set, check whether it has more pages with rs.has_more_pages. If it does, that's where our yield comes in: while it has more pages, loop through the rows with for row in
rs.current_rows, that is, the result set's current rows, and yield out the row data. The row data is literally what's stored in our columns: uuid, query, label, confidence. Each row here behaves like a dictionary, and I'll keep the comma-separated style, so use f-string substitution: row['uuid'], a comma, row['label'], then the confidence, and finally the query (copy and paste the pattern a couple of times). Initially this only loops through the first 25 rows, and eventually we want the rest, so after that inner loop, check whether there are still more pages and run the execution again: reset the result set by executing the same statement, with one addition, the paging_state, set to the previous result set's paging_state. That's how the pagination works: it fetches these rows page by page. Unfortunately the paging state is not a number; if it were, it would be easier to wire pagination into a normal list view, but Cassandra databases aren't as easily paginated, especially with the data we currently have and how it's stored. Still, this is a quick and easy way to do it for streaming data.

I don't need that trailing comma, and now I have all of this data, so let's take a look. Refresh, and what I'm hoping to see is the actual rows, and I do: there's the uuid, the label, the confidence, and the query. I can also start with a yield of a header string, "uuid,label,confidence,query" plus a newline, refresh again, and now we get what's basically the table header followed by all of the data. As you might guess, if the result is too big the request will time out and won't necessarily return everything, which is why I said this probably isn't the most efficient way to grab your entire dataset in the long run; but it's quite good for a large chunk of it, and it shows how to list out a lot of data and do some pagination with a Cassandra database, in our case Astra DB. Hopefully you got something out of that. I would want to test the limits here; I'd be very curious, if some of you put in something like a million entries, how fast it actually responds. Part of the apparent slowness is just TensorFlow spinning up, but once TensorFlow is loaded and the FastAPI application is set up, this responds pretty quickly, so we'll have to test it quite a bit.
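Assembled, the streaming export looks roughly like this. It's a sketch, and it replaces the simpler /dataset view above: the CQL statement and the dict-style column access are assumptions (the cqlengine-registered session in this series returns dict-style rows; use row.uuid and friends if yours returns named tuples).

```python
# streaming export : a sketch of the paged fetch described above
from cassandra.query import SimpleStatement
from fastapi.responses import StreamingResponse

CQL_QUERY = "SELECT * FROM spam_inferences.sms_inference LIMIT 10000"  # table name assumed


def fetch_rows(stmt: SimpleStatement, fetch_size: int = 25, session=None):
    stmt.fetch_size = fetch_size
    rs = session.execute(stmt)             # first page only
    has_pages = rs.has_more_pages
    yield "uuid,label,confidence,query\n"  # header row
    while True:
        for row in rs.current_rows:
            # assumes dict-style rows from the registered cqlengine session
            yield f"{row['uuid']},{row['label']},{row['confidence']},{row['query']}\n"
        if not has_pages:
            break
        # fetch the next page by passing the previous paging state back in
        rs = session.execute(stmt, paging_state=rs.paging_state)
        has_pages = rs.has_more_pages


@app.get("/dataset")
def export_dataset():
    stmt = SimpleStatement(CQL_QUERY)
    return StreamingResponse(fetch_rows(stmt, 25, DB_SESSION))
```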
So if we come back into our data set and put in a few new items, we have all of this data, but it's not going to be in order. That's what the uuid helps us with: knowing when each piece of data happened. Cassandra doesn't store data in insertion order; it's stored by our primary keys, and it has eventual consistency, which means the entire cluster will eventually have the correct data in it. Just a little tidbit about Cassandra.

Anyway, we now have a pretty cool way to extract a good chunk of the data we store in our database, and yes, this is what we'd end up using to build our next round of data, like if we were actually starting to assemble a new data set and label it. It does have a label, but it's an automated label, not necessarily something a human looked at, which is both a good and a bad thing. If it ends up being a million entries, are we really going to have people hand-pick through them? Probably not. There are machine learning techniques to separate out this data and surface only the anomalies for review. A big part of that will also be getting rid of repeated data: we don't need the same entry over and over, especially when both the query and the prediction confidence are identical. An anomaly in this case would be when the query is the same but for some reason the prediction is different, which could happen across different versions of our model.

That brings me to my final point. For these SMS inferences, perhaps we also want another field: a text column called model_version, with its default set to "v1". It's just another way to make sure we have the correct data in here. This is what I love about NoSQL databases, and Cassandra in particular: I can add that field real fast and new data will include it. It won't apply to old data, though. So going back into our iteration, if I try to grab the version number, it only works for some rows. We call it model_version, add it to the yielded line, and refresh the streaming data: the model version only shows up a handful of times. We get None in a lot of places, if not all of them, except for "v1" on the most recent inference we just created. Pretty cool.
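For reference, the cqlengine model with that extra column might look something like this; the class name, keyspace, and exact column types are assumptions based on what's described here, not the course's literal code.

```python
# Sketch of the inference model with the new model_version column.
# Class name, keyspace, and column types are assumptions; adjust to your project.
import uuid
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class SMSInference(Model):
    __keyspace__ = "spam_inferences"                  # assumed keyspace name
    uuid = columns.TimeUUID(primary_key=True, default=uuid.uuid1)  # time-based, so we know when it happened
    query = columns.Text()
    label = columns.Text()
    confidence = columns.Float()
    model_version = columns.Text(default="v1")        # new field; pre-existing rows return None
```

After syncing the table (sync_table from cassandra.cqlengine.management), new writes pick up the "v1" default while older rows simply come back with model_version as None, which is exactly what we see in the stream.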
Now, before we go into production, I want to test my AI as an API through ngrok. ngrok lets me emulate a production environment by exposing a local project to the world. Go to ngrok.com and download it; the steps are really simple. You will have to sign up for an ngrok account, which is free. They're not sponsoring this or anything; it's just a really easy way to expose a project so we can test it in something like Google Colab, or really anything that isn't on our local system. Once it's installed, the command is simply ngrok http followed by the port we have exposed. In my case my FastAPI port is port 8000; I can close the server out and confirm that's the port, so if for some reason you used a different port, update it accordingly. Make sure the FastAPI server is running, then give ngrok a shot. It gives me a few different forwarding URLs, and if I go to one of them, ngrok handles the routing in a secure manner and I can try out my API. This is effectively emulating a production environment: I hit the index and see the hello world response, and if I look at my recently created data set endpoint, it streams that data out for me too. The reason I'm not zoomed in on it is just that I'm testing it, but it's working in a very similar fashion.

So yet again I can use this inside Google Colab. I'll create a new notebook and call it "Spam SMS requests". I import the Python requests module, set my endpoint to that ngrok URL, add a q parameter equal to "this is my query", call requests.get on that endpoint, and print out r.json(). And there we go: it gives me back the score. We now have an almost-production version of our AI as an API. Getting this into production is not trivial, but it's something we'll look into.

The other part of this is the data set endpoint. If I request it and print r, I don't get the value I want, but r.text gives us back the text data. A better example is to change it slightly: instead of a plain requests.get, create a session with s = requests.Session(), then r = s.get(endpoint, stream=True), and for line in r.iter_lines(), with a condition of if line, print that line. That gets the streaming data coming right back, and it's actually really fast. That's not surprising; as I mentioned before, the only reason it seems slow on my system is that I keep saving my Python files, which keeps reloading the app.
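Roughly, the notebook cells look like this; the ngrok base URL and the /dataset path are placeholders for whatever your tunnel and route actually are.

```python
# Colab-style requests against the ngrok tunnel.
# The base URL and /dataset path are placeholders; use the forwarding URL ngrok prints.
import requests

base_url = "https://your-tunnel.ngrok.io"

# single inference (the endpoint still accepts GET with a query parameter at this point)
r = requests.get(base_url, params={"q": "this is my query"})
print(r.json())

# stream the dataset export line by line instead of buffering the whole response
s = requests.Session()
r = s.get(f"{base_url}/dataset", stream=True)
for line in r.iter_lines():
    if line:                        # skip keep-alive blank lines
        print(line.decode("utf-8"))
```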
With that in mind, I want to change one thing about my inference endpoint, and that's the HTTP method we use to get a proper inference. I don't want this to be a GET; I want it to be a POST, so let's call it create_inference. The reason I'm renaming it to "create" is that I actually touch the database here, and in general, if an endpoint touches the database, make it a POST method, because that's what we want to use when creating data. The index itself we'll leave as a simple hello world. Of course, we also have to update the way we request it: back in the notebook, this is no longer a GET but a POST, and the data includes q with my new query.

Notice that it still worked, but only because of the query string. If I move the query into the request body as form data, it says hello world; as JSON, it still says hello world. So it's not actually reading the POST data at all; it's using the query parameters from the URL. The way to solve this is to come into our project and add a schema, schema.py. We'll use pydantic: from pydantic import BaseModel, and a class we could call inference query, or simply Query, because that's really all it's ever going to be. It only needs one field: q, a string. That's literally all I need from this model. Back in main.py, we import the schema, and in create_inference, instead of an optional query parameter, the argument becomes query: schema.Query, and the text we run inference on is simply query.q.

Let's try it. It takes a moment to boot up, because I'm making changes on my local project and ngrok has to wait for them, and it's certainly possible we have an error somewhere. Yep: "Query is not defined", which should be schema.Query, and then the reverse, query.q instead of q.query. Now it's reloaded... "q is not defined", which should also be query.q. Run it again, the app builds fine, and there we go: the POST request comes through with "this is my new query". If I leave the body out and post empty JSON, it tells me that q is required. So q is whatever text I want to run the query on. Cool, it's a bit more robust now.

One of the other things we'll probably add when we go into production is proper headers, so that when we POST to this endpoint we have at least somewhat uniquely identifiable headers for our requests, and we'll flesh that out using ngrok in the future as well. I just wanted to quickly change how that endpoint works; the other way wasn't wrong per se, but it's not good practice when it comes to building web applications securely, and even where we're at now, we definitely want some sort of authentication for our users.
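Putting those pieces together, a minimal version of the schema and the updated endpoint looks something like this. The predict() helper is a stand-in for the project's Keras/TensorFlow inference code, and the Query model is shown inline for brevity even though in the project it lives in schema.py.

```python
# Sketch of the POST endpoint with a pydantic request body.
# predict() is a placeholder for the real Keras/TensorFlow inference helper.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    q: str               # the only field we need from the client

def predict(text: str) -> dict:
    # placeholder: run the real model here and return label + confidence
    return {"label": "ham", "confidence": 0.86}

@app.get("/")
def read_index():
    return {"hello": "world"}

@app.post("/")
def create_inference(query: Query):
    preds = predict(query.q)
    # ...store the inference (and its model_version) in AstraDB here...
    return {"query": query.q, **preds}
```

From the notebook, the call then becomes requests.post(base_url, json={"q": "this is my new query"}), and posting an empty JSON body comes back with a validation error saying the q field is required.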
Anyway, that's it for this one. Thanks so much for watching; hopefully you got a lot out of it. In part two, we need to actually deploy this application into production. That's a little different from deploying a standard Python web application because of our machine learning model; there are a number of things we have to consider to make sure the model runs in production as well. We're going to cover all of that and deploy from scratch on a virtual machine so that anyone in the world can use this application. Sure, we used ngrok as a test to expose it to the world, but actually getting it on a production server is what it's all about. So be sure to subscribe and hit that bell icon to be notified when I release that one. Again, thanks so much for watching. My name is Justin Mitchell, and I look forward to seeing you next time. [Music]
Info
Channel: CodingEntrepreneurs
Views: 76,334
Keywords: install django with pip, virtualenv, web application development, installing django on mac, pip, django, beginners tutorial, install python, python3.8, python django, web frameworks, windows python, mac python, virtual environments, beginner python, python tutorial, djangocfe2021, python, django3.2, web apps, modern software development, web scraping, cassandra, nosql, astradb, selenium, celery, jupyter, keras, tensorflow, machine learning, deep learning, spam classification, datastax
Id: 56qQNcHJxyQ
Length: 228min 19sec (13699 seconds)
Published: Wed Oct 06 2021