Boot Camp on Cricket Score Prediction Via Machine Learning

Captions
Just let me know when I should start... okay, I think we're live. Hello everyone, my name is Harshit Yagi and I welcome you all to this session on cricket score prediction. As you already know, the IPL season is on and everyone is talking about it, so in this session we are going to look at how to analyze the data we already have about past matches, how to merge and explore data sets, and how to leverage machine learning algorithms to predict different kinds of variables. Sports analytics is a vast field: you get a lot of statistics and can do a great many things with them.

Before we begin: two TAs will be helping you with your questions in the chat. To make this a success for everyone, keep posting your questions there; if anything is ambiguous or you do not understand something, write in your queries and we will try to answer all of them.

Let me share my screen. Here is the IPL match analysis and score prediction we are going to do today, and these are the topics we will cover. I will show you where to get the data, and you will have the repository after the session if you want to clone it; I will share the link as well. The data is taken from Kaggle and covers all matches played from 2008 to 2019, roughly the last ten seasons.

The concepts for this session: first, exploratory data analysis, where we do an initial exploration of the data set, clean it, manipulate it, and find missing values, among other things. Second, defining the problem statement: there are so many variables that you can get lost at any point, so always make sure you have a clear understanding of what you want to get out of the data. Third, visualizing important features, such as toss wins and match wins. The second segment covers data preparation, meaning label encoding and one-hot encoding of the categorical variables; I will cover label encoding and give you an intuition of how one-hot encoding would transform the data, which you can then try as a home task. Then we look at modeling: how to split the data and which algorithms we can choose for the classification problem we end up with. After that comes cross-validation, where we look at the best ways to evaluate the model and the best practices
for splitting the data into training and testing sets, and why we should not touch the test set until we are satisfied with the model's performance and ready to go into production. The last part we will cover is checking the model on the test data, and I have also gathered a few extras so you can build on this work, add more complex variables and data, and analyze and predict many other things.

I have already done the analysis and it is all available on GitHub, but this is going to be a hands-on session, so we will not be diving into the mathematics of the models; we will build everything ourselves.

Let me create a new notebook. I am using Jupyter Notebook here. This is the directory page; to create a new notebook, click New and then Python 3. It takes a few seconds and creates a new notebook where you can do everything you want with your analysis: document your work as well as write the code. Let's quickly change the title, say to "IPL".

A brief intro to the Jupyter Notebook interface: there are a few different cell types, which you can see in this dropdown. We have Code, Markdown, Raw NBConvert, and Heading; Heading and Markdown are just about the same thing, and most of the time we will simply be using Code and Markdown. A code cell accepts any Python code (this is a Python notebook), and running it executes the code. A markdown cell is what you use to document your work: steps, segment headings, whatever you are doing, for example "IPL prediction notebook". I can move a cell up and down using these arrows; the heading should always go at the top, so I have put it there. How do you run a cell? Click Run up here, or just hit Shift+Enter on your keyboard. To add a new cell, click the plus button; to cut a cell, click the scissors icon and it deletes the cell; and if you delete a cell by accident, Esc followed by Z is the shortcut to get it back.

That is enough for us to get started, so let's dive right in. The first step is to import the Python libraries. I am assuming you have some familiarity with the main data science libraries: pandas, NumPy, and scikit-learn are what we will mostly use in this session, plus matplotlib and seaborn for visualization. The main data analysis and manipulation package is pandas, which we import as pd. The second most important package is
NumPy, the core scientific computing package for handling multi-dimensional arrays. Pandas gives you a special data structure called a DataFrame, which you can use to manipulate all of your features, targets, and rows, and we will see how it does that. I have written import pandas as pd just to abbreviate pandas so I do not have to type the full name every time I use it. Another important library is matplotlib, and inside matplotlib we will mainly use the pyplot module, imported as plt. These are the usual conventions for these imports: pandas as pd and numpy as np. We also do not want to see any warnings, so we switch them off using warnings.filterwarnings. You can hit Tab for autocompletion, although sometimes it is a little slow, as in this case. So, with all the libraries I need for this notebook written out, hit the Run button or Shift+Enter; the notebook shows while the cell is still running, and if there is an error it will show you the error. We have successfully imported all our libraries.

The next step is to load the data set. This data set is from Kaggle and has two files: matches.csv, which is match-by-match data, and deliveries.csv, which is ball-by-ball data. We will mainly use matches.csv in the first segment, and then I will show you how to incorporate the other file. I have already downloaded the data, so matches.csv and deliveries.csv are sitting in my data directory; all I need is the read_csv function that pandas provides. So match_df = pd.read_csv('data/matches.csv'): it takes the path to the file, reads it, and returns a DataFrame, which is stored in the variable match_df. If you want to see what is inside, just put match_df on the last line of the cell; the last line is what the cell output shows (you can also use print if you want to print something in between). Our file is read correctly and has a number of variables: id, season, city, date, team1, team2, toss_winner, and a bunch of others. It is a long data set, 756 rows and 18 columns. For a quick look at the first few rows, match_df.head() gives you the first five; you can change the number, so head(10) shows the first ten. And if you do not want just the first rows, you can also use sample, which gives you a random sample of rows.
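Here is a compact sketch of the setup and loading cells described so far; the file paths and variable names follow the session, so adjust them to wherever you keep the data.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')   # hide warning messages in the notebook

# Match-by-match data downloaded from Kaggle
match_df = pd.read_csv('data/matches.csv')

match_df.head(10)    # first ten rows; head() alone shows five
match_df.sample(10)  # ten rows picked at random
```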
These are random rows picked from our data frame. The next thing is to read the other file, deliveries.csv, and store it in deliveries_df: the same procedure, pd.read_csv with the path 'data/deliveries.csv'. Then deliveries_df.head() shows the first five rows (five is the default). We have quite a few variables here, 21 columns: this is ball-by-ball data with match_id, inning, batting team, bowling team, over, ball, batsman, non-striker, bowler, and a number of other variables.

Now that we have peeked at the data, we can also look at what the data set contains with match_df.info(). This is another method the DataFrame object provides; invoking it gives you all the information about the data frame: 756 entries, and the columns numbered 0 to 17, so 18 columns (id, season, city, and so on) along with their non-null counts. Here you already get a hint of whether the data set contains missing values: with 756 entries, every column should ideally have 756 non-null values, but the city column has 749, the winner column 752, and umpire1 754, so there are some missing values we may have to address. Similarly, you can run deliveries_df.info() to get the same information for the deliveries data frame: match_id, batting team, bowling team, over, ball, batsman, non-striker, and so on. You can see there are so many variables that the analysis can go in any direction: do you want to predict the score, predict who is going to win, or, say, if a team wins the toss, what are the chances they win the match as well? You can look in many directions, which is why sports analytics is such a huge field in itself. That is how you look at the information in the data set.
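A sketch of the loading and inspection cells just described; deliveries_df is the name used in the session.

```python
# Ball-by-ball data
deliveries_df = pd.read_csv('data/deliveries.csv')
deliveries_df.head()     # default: first five rows, 21 columns

# Non-null counts reveal the columns with missing values
match_df.info()          # city: 749, winner: 752, umpire1: 754 of 756
deliveries_df.info()
```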
The next thing we are going to do is exploratory data analysis, so let's quickly add a heading for it. Here we will look at a number of things, starting with the teams in the team and winner columns. In match_df we have team1 and team2 for each match, then toss_winner, toss_decision, whether Duckworth-Lewis was applied, the winner, win_by_runs, win_by_wickets, and a few other columns. To look at any single column, you index the data frame with square brackets, just as you would an array or a list.

I am interested in the winner column: who won, and how many matches. Let's look at the unique observations in this column and their frequency with match_df['winner'].value_counts(). This gives a count of every value present: Mumbai Indians appears 109 times in the winner column, so Mumbai Indians has 109 wins, Chennai Super Kings 100, and so on down to Rising Pune Supergiant. Keep a keen eye while analyzing these data sets, because there can be something suspicious, a typo, anything that needs your attention in what these methods show you.

Then let's look at which teams played: we have a team1 column, and we want the frequency of each team in it, so match_df['team1'].value_counts() (note it is value_counts, plural). Mumbai Indians appears 101 times as team1 (their other matches would have them as team2), Kings XI Punjab 91 times, Chennai Super Kings 89 times, and so on. One thing to notice here: there is a "Rising Pune Supergiant" that appears as team1 eight times, and there is another entry "Rising Pune Supergiants". The team name is the same; it is just the trailing "s" that differs, and that can cause errors in our code, so we have to keep in mind that this is actually the same team and address it.

Next, missing values: do we have any? We have an isnull method we can call on any column we want to check. From the info output we saw the city column has 749 non-null values, and the winner column, the most important column for us, also has missing values. In this particular analysis we are interested in predicting who will win the match, so the winner column will be our target variable and it must be in good shape. So first let's check match_df['winner'].isnull(): it returns True and False, marking True wherever it encounters a NaN. I want all the rows where the winner is null, so what I have applied here is essentially a filter: inside the data frame's square brackets I pass a boolean condition, and only the rows where it is True are returned.
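The counting and filtering cells just described might look like this sketch:

```python
# How often each team appears as the winner and as team1
match_df['winner'].value_counts()
match_df['team1'].value_counts()

# Boolean filter: keep only the rows where the winner is missing
match_df[match_df['winner'].isnull()]
```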
Applying that filter, we get four rows, and you can see the result column says "no result" while the winner is NaN; there could have been any reason, rain or the match simply not happening. We need to address this: we can drop these rows, or replace the NaNs with some other value that seems reasonable. I will show you how to replace them. There are, of course, many different ways to impute missing values; that is beyond the scope of this session, but whenever a data set contains a lot of missing values you have to decide, based on your problem statement, the business context, and other factors, whether to impute them or delete them, and if you impute them, what strategy to use. That is a separate conversation.

What we are going to do here is replace the null values with, let's say, "Draw": match_df['winner'].fillna('Draw', inplace=True). You call fillna, pass what you want to fill with, and set inplace=True so everything happens in the data frame itself. Run the cell with Shift+Enter, and if you re-run the null check now, you can see there are no null values left; you can also confirm it with info() or value_counts(), which show the winner column now contains 756 non-null values. That part is addressed.

Now, one thing you will notice in this data: team1, team2, and toss_winner are all textual information, strings of team names, and machine learning models (and computers in general) understand numbers rather than text. So we have to come up with a scheme to encode these team names as numbers. There are different methods; let's start by creating a dictionary, team_encodings. What will it contain? The keys are the team names and the values are the codes: the first key is 'Mumbai Indians' and the value I have assigned to it is 1. We do this for every team, and then we will use this dictionary to replace all the names with the encodings we have defined. Let me paste in the full dictionary: I have encoded all the teams, and while doing so I have also addressed the Rising Pune Supergiant issue we encountered earlier.
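A sketch of these two cells. Only the codes actually mentioned in the session are shown; the remaining teams, and the Rising Pune code, are placeholders you would fill in yourself.

```python
# Replace the missing winners with an explicit 'Draw' label, in place
match_df['winner'].fillna('Draw', inplace=True)
match_df['winner'].value_counts()   # 756 non-null values now

# Team name -> integer code (codes mentioned in the session; the rest elided)
team_encodings = {
    'Mumbai Indians': 1,
    'Kolkata Knight Riders': 2,
    'Royal Challengers Bangalore': 3,
    # ... the other teams ...
    'Sunrisers Hyderabad': 10,
    'Rising Pune Supergiant': 11,    # placeholder code: both spellings must
    'Rising Pune Supergiants': 11,   # share the same value
    'Draw': 15,
}
```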
If you look at the dictionary, the team1 value counts had both "Rising Pune Supergiant" and "Rising Pune Supergiants"; it was a typo in the data, so all I have done is assign the same code to both spellings. That way, if we group or aggregate them, they are treated as the same team.

This is the general encoding dictionary; next we need a dictionary specific to each column so that we can apply it directly. So we create team_encode_dict and map each column that contains team names to team_encodings: team1, team2, toss_winner, and the winner column, since the winner is also a team name, so we encode that as well. Our dictionary is ready; now we apply these rules to the data frame using the replace method on match_df, passing team_encode_dict and again doing everything in place.

If we look at the data frame now, the data has been transformed: in the first match team1 is 10 and team2 is 3, the toss winner is 3, and the winner is 10. Let's check whether we have done it correctly: 10 is Sunrisers Hyderabad and 3 is Royal Challengers Bangalore, and the first match in the original data was indeed Sunrisers Hyderabad against Royal Challengers Bangalore (we had already changed the frame, so I re-ran the earlier cells to compare against a fresh sample). So the team encoding part is done.
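The per-column mapping and its application, as a sketch:

```python
# Map each column that holds team names onto the same encoding dictionary,
# so DataFrame.replace can convert them all in one call.
team_encode_dict = {
    'team1': team_encodings,
    'team2': team_encodings,
    'toss_winner': team_encodings,
    'winner': team_encodings,
}

match_df.replace(team_encode_dict, inplace=True)
match_df.head()   # team columns are now integer codes
```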
Next we explore the city column, the other column that has missing values. What we are doing overall is first addressing all the missing values; based on that we will come up with our problem statement, choose a particular set of features, and decide what we want to predict from them. So, missing values in the city column: first look at match_df['city'].value_counts(). It takes a few seconds, and while it runs let's also write the next query, which filters the rows with missing values. The value counts come back: Mumbai, Kolkata, and many others; these are the cities and the number of games played in each. We saw in the info output that city had 749 non-null values, so we should have seven missing ones; let's check what that is about by filtering the rows where isnull() on the city column is True.

Running that, we get the rows where city is NaN. team1 and team2 are fine, and if we look for something else to go on, the venue is there: Dubai International Cricket Stadium for every one of these rows. So all of the matches with a missing city were played in Dubai, and we know how to address it: match_df['city'].fillna('Dubai', inplace=True). Now that is handled too; you can check match_df.info() and see that the city column contains 756 non-null values, so the missing values have been imputed correctly. The info output also shows, via the data types, which columns are still strings and need to be addressed with some sort of encoding (or dropped); team1, team2, and toss_winner, the ones we added encodings for, have been converted to an integer data type.

The next step will be to drop columns, but before that let's quickly look at one more very important method: the summary statistics of the data frame. The describe method gives you, for each column, the count, mean, standard deviation, minimum, percentiles, and maximum. The count is 756 for almost all of them, which is great. Only the numerical columns appear; the string (object) columns are dropped from this summary, though you can compute statistics for them as well, where you would find frequencies and other figures of interest. The mean does not make sense for every column here (win_by_runs is around 13, and the "mean winner" is just an artifact of our encoding), but describe is a very useful method when you have a lot of numerical variables and want quick insight into what the data set contains.
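A sketch of the city fix and the summary-statistics cell:

```python
# Every row with a missing city was played at the Dubai International
# Cricket Stadium, so we can fill those gaps directly.
match_df[match_df['city'].isnull()]            # inspect the seven rows
match_df['city'].fillna('Dubai', inplace=True)

match_df.info()       # city now shows 756 non-null values
match_df.describe()   # count, mean, std, min, percentiles, max per numeric column
```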
Now let's see what effect winning the toss has on winning the match: toss wins and match wins by each team. Let's make that a markdown heading, run it, and write the next part. To find out which teams won the toss it is again very simple: match_df['toss_winner'].value_counts(sort=True) to get the sorted counts. These are the toss wins for each team: code 1 is Mumbai Indians, as we know; code 2 we do not know yet, and we will find that out as well. Let's do the match wins too: match_wins uses the winner column, again with value_counts(sort=True). So we have the counts of match wins and toss wins for each team.

If you want to display them with team names, you can do that as well, but since we encoded the teams we have to back-reference the dictionary: from the encoded values we want the original keys. One way is to loop over the series, so for each id and value in toss_wins.items() we look up which team that encoded value belongs to. I am using an f-string here: I convert team_encode_dict['winner'].keys() to a list of team names, and then get the name of the team by index; since the list starts from zero I subtract one from the encoded id, and in front of it I print "toss wins" with the count. (I had an invalid-syntax error at first; a curly brace was not closed, and after fixing that it runs.) So these are all the toss wins, and this is how you can go back from your encodings to the team names: create a list of team names and reference it by index. items() essentially gives you pairs of the id and the value it contains, and we iterate over them to go from index to encoding to the actual team name.

Similarly we can do this for match_wins. We saw that Mumbai Indians has won the most tosses, and when we look at match_wins we get an error for the code 15, which is because of the 'Draw' label we added at the end; that is nothing to worry about. Mumbai Indians again leads the match wins, with Chennai Super Kings next. So there does seem to be some sort of relation between winning the toss and winning the match, and we will figure that out; we will check it again in the modeling process, but first let's quickly plot the match wins. Using match_df and some very simple visualization, we can plot all the winners as a histogram with bins=50.
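A sketch of these cells. The session looks names up by list index (which breaks for the extra 'Draw' code); the inverted dictionary below is a slightly more robust way to do the same back-referencing.

```python
toss_wins = match_df['toss_winner'].value_counts(sort=True)
match_wins = match_df['winner'].value_counts(sort=True)

# Invert the encoding so we can print team names instead of codes.
code_to_team = {code: name for name, code in team_encodings.items()}

for code, count in toss_wins.items():
    print(f'Toss wins - {code_to_team.get(code, code)}: {count}')

for code, count in match_wins.items():
    print(f'Match wins - {code_to_team.get(code, code)}: {count}')

# Distribution of winners
match_df['winner'].hist(bins=50)
```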
This is running now, and it gives me all the winners: along the bottom are the team encodings rather than names. The first bar is 109, which is Mumbai Indians; the second is Kolkata Knight Riders (you can check that back in team_encodings), and so on; you can read them all off from here.

The next thing is to plot toss wins and match wins together and see what we get (it may give us an error because of the Draw issue). I have initialized a figure with figsize=(8, 4) using matplotlib, then created subplots so the two charts sit side by side, labelled the x-axis and the y-axis, set the title, and asked for a bar chart: toss_wins.plot(kind='bar') gives you that, and we do the same for the match wins. Running it, we have the match winners on one side and the toss winners on the other, and they look pretty close to each other. There is some error with the match wins here, related to the back-referencing dictionary and the 'Draw' value we added at the end, but basically this is how you can plot them together: both of these are pandas objects, so the plot method, which leverages matplotlib, plus the subplot method and some labels let you put them side by side and ask what the relationship between toss winners and match winners is, and whether it is a direct one. Another way to check that is correlation; you can compute the correlation on match_df as well, and let's also quickly confirm there are no null values left.

The next step is dropping the redundant columns. We are doing this much analysis and cleaning because modeling is the easiest part of the machine learning process; most of the time is spent on analysis and on getting the data set into a format we can actually model on and get something valuable out of.
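The side-by-side bar charts described above, sketched with plain matplotlib:

```python
plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)              # left panel: toss wins per team
toss_wins.plot(kind='bar')
plt.xlabel('team')
plt.ylabel('toss wins')
plt.title('Toss wins by team')

plt.subplot(1, 2, 2)              # right panel: match wins per team
match_wins.plot(kind='bar')
plt.xlabel('team')
plt.ylabel('match wins')
plt.title('Match wins by team')

plt.tight_layout()
plt.show()
```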
So, dropping the redundant columns: we only want to keep a specific set of columns, so we assign a list of column names to the data frame and keep nothing else. In this list I put team1 (which team is playing first), team2, the city (where the match is played looks like a factor, based on our analysis), the toss decision that was made, the toss winner, the venue (another important feature), and finally the target variable, winner: who won the match when it was played in a particular city and venue, with a particular toss winner and toss decision. After this, match_df has 756 rows and 7 columns; these are the columns we will focus on, and from them we will prepare the data set for the model we use in the machine learning segment.

You can see that the city, toss_decision, and venue columns are still of object data type, which means they are still text, so we have to address them too. One way is to label encode them, just as we did with the team names; the other method is one-hot encoding, where we create a variable out of each unique value in each of these columns. Let me quickly explain what label encoding is. Say you have data with a country column (India, US, Japan, US, Japan), an age column, and a salary column. Age and salary are numbers, but country is text, and you have to convert it into a form the model can understand. With label encoding, each country gets a code: India becomes 0, US becomes 2, Japan becomes 1. The thing to notice is that every number carries an implicit value: 2 is greater than 0 or 1, so US gets a higher weight, and that is why label encoding is not a good fit when the data is not ordinal.

The other method is one-hot encoding, again for categorical variables (whenever we have textual data made up of categories of values, we call them categorical variables). With the same country data, we create a separate column for each category and mark it 1 only for the instances where that category appears: the Japan column is 1 in the third and fifth rows, the US column (whose label was 2) is 1 in the second and fourth rows, and so on. That is how you one-hot encode categorical variables. One thing to notice is that three categories created three new variables; if a column has, say, hundreds of categories, one-hot encoding adds hundreds of new features, which increases the volume of your data set considerably. So there is a trade-off to weigh, and based on that you decide which encoding method to go for.
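A sketch of the column selection, plus the one-hot home task using pandas' get_dummies (one way to do it; the session points you to scikit-learn's documentation for its own one-hot encoder):

```python
# Keep only the columns we will model on
columns = ['team1', 'team2', 'city', 'toss_decision', 'toss_winner', 'venue', 'winner']
match_df = match_df[columns]
print(match_df.shape)   # (756, 7)

# Home task: one-hot encode the remaining text columns and note how the
# column count grows: one new 0/1 column per unique category.
one_hot_df = pd.get_dummies(match_df, columns=['city', 'toss_decision', 'venue'])
print(one_hot_df.shape)
```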
In our particular case I am demonstrating how to do this with a label encoder; the home task for you is to do it with one-hot encoding using a similar approach (check the scikit-learn documentation for how), because all of the major methods are genuinely simple to apply. So let's label encode our text columns: city, toss_decision, and venue. In sklearn we have the preprocessing module, where the LabelEncoder class lives, so we do not have to write the encoder ourselves. Import LabelEncoder, then create a list of the columns we want to transform, feature_list = ['city', 'toss_decision', 'venue']. Since LabelEncoder is a class, we first instantiate it to get our encoder object, and then we use that encoder to transform each column. How? We run a loop: for feature in feature_list, we assign match_df[feature] = encoder.fit_transform(match_df[feature]), so all the values of that column are transformed by the encoder and stored back into the same column. At the same time I am also printing encoder.classes_, which lists all the categories it has encoded, that is, all the unique values in that column.

Looking at the output: for city we get all the classes, every unique city that appeared, Abu Dhabi, Ahmedabad, and so on; toss_decision was fairly easy, just bat or field; and the last column was venue, with all the venues these matches were played at. So now team1, team2, city, toss_decision, toss_winner, venue, and winner are all integers, and the machine learning algorithm we choose should not throw any errors, because it can understand all of it. As I said, I am using the label encoder; you should try it with one-hot encoding as well and see how the results vary.
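The label-encoding loop, as a sketch:

```python
from sklearn.preprocessing import LabelEncoder

feature_list = ['city', 'toss_decision', 'venue']
encoder = LabelEncoder()

for feature in feature_list:
    # Overwrite the text column with its integer codes
    match_df[feature] = encoder.fit_transform(match_df[feature])
    print(feature, encoder.classes_)   # the categories behind the codes

match_df.info()   # every column should now be numeric
```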
Now we move on to the machine learning process, since the data preparation is all done; let's create a markdown cell for this section and run it. We have 756 rows and 7 columns, and whenever you think about applying machine learning to predict something, you start by splitting the data: you set aside some testing data at the very beginning, which you only touch once you are convinced this is the model you are going to go with. Once you have finalized the model, then you test it, the final model you are ready to take into production.

So from sklearn.model_selection we import train_test_split, the common method for this, and create train_df and test_df. We are splitting the data set and reserving 20% of it, which is the general rule of thumb (80/20); depending on the amount of data you have you can keep aside more test data, but 80/20 is the commonly used split. Call train_test_split, pass match_df (the original data frame we need to split), specify test_size=0.2 to reserve 20% for the test set, and set random_state to any value you like, say 5. We set a random state so that every time we run this cell it creates the same split and we can compare values across runs. Now let's look at the sizes: the shape attribute gives you the number of rows and columns, so print train_df.shape and test_df.shape. The training set contains 604 rows and 7 columns and the test set 152 rows and 7 columns, and now we can go about implementing our machine learning model.

We have team1, team2, city, the toss decision, and so on, and the winner is one of the two teams that played, so our problem statement is to predict the winner out of the two teams playing under a particular set of conditions.
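The split, sketched:

```python
from sklearn.model_selection import train_test_split

# Reserve 20% of the matches as a held-out test set; a fixed random_state
# makes the split reproducible across runs.
train_df, test_df = train_test_split(match_df, test_size=0.2, random_state=5)

print(train_df.shape)   # (604, 7) in the session
print(test_df.shape)    # (152, 7)
```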
Since out of the two teams that play only one can win, this becomes a classification problem, so from sklearn.linear_model I import the logistic regression algorithm, a commonly used one. I will not go into the details of these algorithms, how they learn, what the cost function is, or what the optimization criterion is; there are dedicated courses for that, and the mathematics of the algorithms is beyond the scope of this session. I am using logistic regression, but you can try other classification algorithms as well. We are going to use one more, the random forest classifier, an ensemble method that consists of many decision trees and gives you an output based on the average of the predictions made by those individual trees; it lives under the ensemble module, so we import RandomForestClassifier from there. Another thing we need is an accuracy score for the model results, which will be our metric; it resides in the metrics module, so from sklearn.metrics we import accuracy_score as well. That takes care of the metric.

The next thing we need is a way to evaluate the model. We said we will only touch the test set once we are satisfied with the model's performance, so how will we evaluate it before that? One way is to set aside part of the training set for validation, and that works fairly well, but there is a better technique, cross-validation, which is already implemented for us and easy to use: from sklearn.model_selection import cross_val_score. I will explain how it works in a moment.

Let's get down to writing. We will write a function so we do not have to repeat ourselves for every model we try; it is always beneficial to write a generic function. I am calling it print_model_scores, and to it I pass the model, the data, my predictor variables (the independent or feature variables), and my target variable. First it trains the model on the data it is given: data[predictors] is essentially what you usually see as X_train, and the target is y_train, so model.fit(data[predictors], data[target]). Remember this is supervised learning; we already have the outcomes as labels in the data, which is why we can do all of this. It does not matter whether we use this function for training or evaluation purposes; it works either way. Once the model is fitted, the next step is to get predictions with model.predict over the same predictor data. We can then print the accuracy on the training data itself; that does not really tell us much on its own, because we trained on the same data, but we will address that with cross-validation. We can print the accuracy of this model using accuracy_score.
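The imports for this segment, collected in one sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
```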
To accuracy_score we pass the predictions and the target variable: it compares what the model predicted with the actual values and gives us the accuracy score, that is, how many predictions were correct. I format it to two decimal places and print it.

Then we also print the cross-validation scores. In scikit-learn, cross-validation creates a number of folds; let's write the code first and then I will explain. We call cross_val_score with the model, data[predictors], and the target variable. This method also takes a scoring argument, which I set to 'neg_mean_squared_error'; that is the evaluation function it uses to score the model's performance. It also takes cv, the number of folds to create. What does a fold mean? With cv=5, cross-validation creates five folds: the model trains on four parts and reserves the last part for testing, and this process runs five times, so each fold is used for testing exactly once and for training the other four times. That way we use the entire training set in all possible combinations, and it is a very easy, automatic way of checking how the model performs and what the validation scores look like. This call therefore gives us five scores, one per run, as a list.

Let's print those cross-validation scores. These are mean squared errors, and because it is an error metric scikit-learn returns them as negative values, so when you want the square root you have to negate them yourself first, converting them to positive values before taking the square root, otherwise you get an error. So I print the root mean squared error per fold, and then the average RMSE using np.sqrt on the negated scores followed by mean(). Our function is now ready; let's run the cell. We will use this function for all of our model testing, starting with logistic regression: our target variable is winner, and then we need to pass the predictor variables.
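A sketch of the print_model_scores helper as described (the 'neg_mean_squared_error' scoring with five folds is what the session uses):

```python
def print_model_scores(model, data, predictors, target):
    """Fit the model, print training accuracy and 5-fold cross-validation RMSE."""
    model.fit(data[predictors], data[target])
    predictions = model.predict(data[predictors])

    accuracy = accuracy_score(data[target], predictions)
    print('Accuracy: {0:.2f}'.format(accuracy))

    scores = cross_val_score(model, data[predictors], data[target],
                             scoring='neg_mean_squared_error', cv=5)
    rmse = np.sqrt(-scores)            # negate the errors, then take the root
    print('Cross-validation RMSE scores: {}'.format(rmse))
    print('Average RMSE: {0:.3f}'.format(rmse.mean()))
```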
You could have separated the features and the target at the very beginning, during data preparation, setting the winner column aside and keeping the feature variables together; that works too, but here I am simply listing the predictor variable names: team1, team2, venue, toss_winner, city, and toss_decision. Those are our six feature columns, and winner is the target. We instantiate the logistic regression model we are using for this prediction and then call the function we wrote above, print_model_scores, passing the model, the data (train_df, the training set), the predictor variables, and the target. Running it (after fixing a NameError because the predictors list had not been defined yet), the accuracy is 0.32, that is 32%, and the cross-validation output shows five values, one root mean squared error per fold, with an average RMSE of 3.55.

This has made the task of evaluating a model really easy, so we can now test any other model; we will check how the predictions are actually made in just a bit. Let's check random forest as well. Same thing: model = RandomForestClassifier, and we can set n_estimators, the number of trees we want in the forest. A random forest classifier is an algorithm that comprises many individual decision trees and averages their individual results, which generally gives you a better prediction. I set n_estimators=100 for 100 trees; you can change it as you like. Then simply call print_model_scores again with the model, train_df, and the predictors. Let's see how the random forest classifier does: the accuracy is pretty good, about 89%; the average root mean squared error is a little high, but the accuracy is good.
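The two model runs, sketched with the helper above:

```python
target = 'winner'
predictors = ['team1', 'team2', 'venue', 'toss_winner', 'city', 'toss_decision']

# Logistic regression baseline (~32% accuracy in the session)
model = LogisticRegression()
print_model_scores(model, train_df, predictors, target)

# Random forest with 100 trees (~89% training accuracy in the session)
model = RandomForestClassifier(n_estimators=100)
print_model_scores(model, train_df, predictors, target)
```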
Now let's try to run a sample prediction. Say we want a match; let's create a sample instance first. Team one is, say, Mumbai Indians, and I have this sample instance created already, so let me just use it: we have Mumbai Indians versus Sunrisers Hyderabad as team two, and the toss winner was Sunrisers Hyderabad. The input is a list I've created: team one is Mumbai Indians, so we have to encode it first using the team encoding from earlier, then we have the encoded team two, then 14, which is the code for the city, then the encoding for the toss winner, then similarly the venue, and the last variable, which is toss decision. Since this is a plain list, we first have to convert it, because the model accepts a 2-D array, and that's what the reshape method does here: it creates a 2-D array out of the 1-D list we have. Let's print the input before and after the reshape so we can see what the reshape method actually does: I create a NumPy array and then reshape it into a 2-D array with a single row, since the model expects one sample per row. Then we pass it to model.predict, and to get the winner we use the same reverse lookup as before, because model.predict gives us the encoded value; let's print the output as well. Running this quickly: the first print is the original input, the second is the 2-D NumPy array, and the output is simply the value 1, which is the encoding for Mumbai Indians. So if a match is being played between Mumbai Indians and Sunrisers Hyderabad in the city and venue encoded here (I don't know offhand which venue 14 corresponds to), with that particular toss winner and toss decision, then Mumbai Indians is the team the model says will win. That's how you can use these models to predict. Another thing you can do, since we have a number of features, is check which variables were important based on their scores, the feature importances. model.feature_importances_ gives you that, and I've printed each predictor variable along with its respective importance score from the model we ran, which is currently the random forest because that's the one we defined last. Team one's score is 0.22 and team two's is 0.24, so these are the two most important variables, which makes sense because these are the actual teams, one of which will come out as the winner. After that, venue plays an important role in predicting who the winner is going to be, while toss winner, city and toss decision don't really have much effect on who wins; at least that's what the model is suggesting.
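A sketch of the sample prediction and the feature-importance printout follows. The encoded values in the input list are hypothetical placeholders (the real codes come from the encoding dictionaries built earlier, here assumed to be a dict called team_encode_dict), and model is the random forest fitted above.

```python
import numpy as np

# Hypothetical encoded values, in the same order as predictor_vars:
# team1, team2, venue, toss_winner, city, toss_decision
sample = [1, 10, 14, 10, 6, 0]

# model.predict expects a 2-D array: one row per sample
sample = np.array(sample).reshape(1, -1)
encoded_winner = model.predict(sample)[0]

# Reverse lookup in the (assumed) team-encoding dictionary to decode the prediction
winner = [team for team, code in team_encode_dict.items() if code == encoded_winner]
print("Predicted winner:", winner)

# Feature importances of the fitted random forest, one score per predictor column
for name, score in zip(predictor_vars, model.feature_importances_):
    print(name, round(score, 3))
```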
Now we can look at other ways to work with this data set; we're running out of time, so I'll just show you quickly. This was one way, one angle of looking at the data; as I said, there are so many variables you can use or predict. So I've done an analysis that adds some more complexity to this code: right now we're simply using the match data frame, but you can combine it with the deliveries data and do a much more sophisticated analysis that includes scores as well. For example, you can plot, for Mumbai Indians versus Chennai Super Kings, the toss winners as well as the match winners, i.e., when these two teams play each other, how many times each of them wins. For your home task I've done this analysis and added a bit more complexity in the next step, which is merging in the delivery-by-delivery data set. Let me quickly walk you through the steps. First, I group by match_id and batting_team so that I can get the score of each team: the deliveries data frame is ball-by-ball data, so it doesn't directly contain what the teams scored in a match, but it has a batsman_runs column on which we can invoke the sum method, and that gives us the total score of the batting team for that match. If you look at the data after grouping by match_id and batting_team, there are two rows per match id, one per team; for example, Royal Challengers Bangalore and Sunrisers Hyderabad for the first match. Now we need this in a format we can merge with the data we already have, so we want these totals to go in as two variables, team1_score and team2_score. How do we do that? Starting again from the match data frame, we go through team1 and team2 for each match, look up what team one scored and what team two scored, and add those scores to the merged data frame so we can leverage them as well, whether to predict the score or for anything else; you can also use these columns for other purposes. Then I build a dictionary and add the columns one by one, which is another way of creating a data frame: in this new frame I want match_id, team1 (which I get from the list of teams I've created), team2, and the winner, which comes from the original match data, and similarly team1_score and team2_score. The two score columns come from the deliveries data set; for that I've written a for loop which queries the match id and the batting team, sums batsman_runs for that pair, and appends the result to the team1_score and team2_score lists. All of this is just to get a frame with team1, team2, team1_score, team2_score and winner, built from the match data frame and the deliveries data frame. Now we merge it with the original data frame after encoding the data. Once the encoding is done, we have match_id, team1, team2, team1_score, team2_score and winner; there are some common columns, team1 and team2, so I simply drop them by computing the difference in columns, and then I merge this frame with the match_df we've been using. This is what we get after merging: this is my merged data frame.
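A rough sketch of the grouping-and-merge idea is below, assuming the raw Kaggle column names (id, team1, team2 in the match data; match_id, batting_team, batsman_runs in the deliveries data) and that the team names in match_df are still the raw strings used in the deliveries data, i.e., the lookup happens before label encoding. The notebook builds the new frame from a dictionary of lists, but the effect is the same.

```python
import pandas as pd

# Total runs scored by each batting team in each match
innings_totals = deliveries_df.groupby(["match_id", "batting_team"])["batsman_runs"].sum()

team1_scores, team2_scores = [], []
for _, row in match_df.iterrows():
    # Look up each side's total for this match (0 if that side never batted)
    team1_scores.append(innings_totals.get((row["id"], row["team1"]), 0))
    team2_scores.append(innings_totals.get((row["id"], row["team2"]), 0))

scores_df = pd.DataFrame({
    "id": match_df["id"],
    "team1_score": team1_scores,
    "team2_score": team2_scores,
})

# Merge the per-match totals back onto the match-level frame
merged_df = match_df.merge(scores_df, on="id")
```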
I can use this data to do more analysis, or to predict the score, i.e., team1_score or team2_score, and again use those to find the winner, adding new features to the feature-variable list. This adds the scores; another way to add more complexity is to include which batsmen played and one-hot encode them as well, which again adds more value. So how will you predict the score? That would naturally be a regression problem, but you can convert it into a classification problem: you set criteria so that if the score is between 0 and 50 you encode it as class 1, between 50 and 100 it's class 2, between 100 and 150 it's class 3, then class 4, and so on. You can categorize those columns, which is what I've done with the scores here, and then, if these two teams are playing under a certain set of conditions, you can predict what the score class for team one would be. Similarly, you can check the correlation of the variables we've created. We have a heat map: on a data frame you can run the correlation method and it tells you how one variable is correlated with another, i.e., if one variable increases, does the other increase too, does it go down, or is there no effect? Correlation ranges from -1 to 1. The diagonal is +1 because there each variable is being compared with itself; the match_df variables appear on both axes. A value of 1 is perfect positive correlation, 0 means no correlation, and -1 is perfect negative correlation, where if one quantity goes up, the other goes down in equivalent steps. So that was the correlation. Now, the home tasks for you: first, predict the winner using one-hot encoding on the match data frame; second, analyze the data for batting and bowling performances, that is, check out the notebook in the GitHub repository (I'll share the link), where the deliveries data frame tells you who the top bowlers and top batsmen are, and perform a detailed analysis of all those performances on the delivery-by-delivery data; and third, try to predict the category of score for a team, which we covered here on this data set. That's about it. Now I'm going to take a few questions; if you have any, please add them in the chat and I'll go over them one by one. The link has been put in the chat, so you can check it out from there. If you want the basics, like how logistic regression or random forest works, we have specific courses you can opt for, or we might consider covering them in some other webinar. Please keep putting your questions in the chat and I'll try and answer a few of them.
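Here is a sketch of the two ideas just mentioned: binning the continuous score into classes and plotting a correlation heat map. merged_df and the score column name are assumptions carried over from the merge step above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Bin the continuous score into classes: (0, 50] -> 1, (50, 100] -> 2, (100, 150] -> 3, ...
bins = list(range(0, 301, 50))          # covers typical IPL innings totals
labels = list(range(1, len(bins)))
merged_df["team1_score_class"] = pd.cut(merged_df["team1_score"], bins=bins, labels=labels)

# Correlation heat map of the numeric columns (-1 = perfect negative, +1 = perfect positive)
corr = merged_df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```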
Which Python libraries would you suggest? First of all, as you saw, pandas, NumPy and a visualization library are the most important when you're preparing a data set, or even if you're just getting into practicing machine learning. These are fundamentals you should definitely know and have a good grip on. I'd say pandas is something you should master; you should know how to use pandas and NumPy. Basic programming should also be solid in the first place: not perfect, but you should know how to write object-oriented code, functions and so on. So pandas, NumPy, scikit-learn, and a visualization library like matplotlib; seaborn is something you can pick up once you're comfortable with matplotlib and basic plotting. These are the libraries that help you accomplish ninety percent of your analysis tasks; everything revolves around pandas, NumPy and matplotlib, and if you go into machine learning you can use sklearn, which is very comprehensive and covers a lot of algorithms. Okay, next question: can you put some light on where to use which visualization? Well, that's very contextual and subjective, but as I said, if what you're interested in is, say, a count, then a bar plot is something you can use. Another case is outlier analysis: when your data set contains a lot of outliers, you use box plots for interquartile analysis to find out where the outliers are and how many there are; that's basic plotting fundamentals. A pie chart should be used when you have just two or three categories and want to show how much each category contributes, how a demographic or something similar is distributed. What else do we use a lot? Bar charts and histograms, and then line plots show you trends: say you're picking up financial trading data, where you have daily end-of-day stock prices, you'd want to see how the price moves day to day, whether it's going up or down, so for time-series analysis a line plot gives you a much better view, and so on. Keep working on different problems and you'll figure out what sort of visualization helps you interpret the data or derive more insights from it.
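As a small illustration of those rules of thumb, here is a sketch with made-up numbers: a bar chart for counts, a box plot for outlier/interquartile analysis, and a line plot for a time series.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
wins = pd.Series({"MI": 109, "CSK": 100, "KKR": 92})        # counts per category (toy values)
scores = rng.normal(160, 25, 200)                           # per-innings totals (toy values)
prices = pd.Series(np.cumsum(rng.normal(0, 1, 100)))        # daily end-of-day prices (toy values)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
wins.plot.bar(ax=axes[0], title="Counts -> bar chart")
axes[1].boxplot(scores)
axes[1].set_title("Outliers -> box plot")
prices.plot(ax=axes[2], title="Trend over time -> line plot")
plt.tight_layout()
plt.show()
```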
Okay, let me take a last question. Snehalatha Singh has asked how to determine whether a problem is regression, classification or some other type, and which approach is best in a particular case. Again, that depends on what you want to predict and how you have formulated your problem. Most of the time you have an idea of what the objective is, and once you're clear on the objective you'll figure out whether it's a regression problem, where you predict a continuous value, say in a house price prediction case, or predicting where a stock price is going, which is also regression; or something like classification, where you have spam versus not spam, or fraud detection, which is used very widely across the fintech industry and is mostly framed as classification. So it's very contextual; it depends on your business problem. If you're clear on your objective, you'll work out what sort of problem it is, supervised or unsupervised, what sort of data set you have in the first place, and then you decide whether it's regression, classification, anomaly detection, customer segmentation, or one of the other unsupervised problems. Okay: please recommend some basic problems for beginners to start with. Start with a simpler data set, like house price prediction, if you want to start with machine learning. If you want to start with data analysis, keep looking at basic data sets from different repositories like Kaggle, pick a very simple one, explore the data and try to find out how each variable, each feature, is contributing and what insights you can derive. Once you have a good understanding of how it works and what the value of each variable is, you can move on from analysis to predicting a particular feature, or whatever the objective is. So pick a very basic data set and start doing things, and meanwhile, if you get stuck somewhere, go online and read a few articles; that's how you learn. Okay, one more question: any tips on handling seasonal data for prediction? You might want to learn about time-series analysis, which covers a lot of that. Off the top of my head I haven't done too much time series, just a little in the fintech sector, but you can connect with me after this and I'll be able to help with your specific query. There are a bunch of repository links; you can look at one of my blog posts where I've covered and reviewed the data repositories you can use. Nagarajan has asked for the best dataset repository other than Kaggle: in that post I've reviewed repositories that are really good for different sectors, like scientific or astronomical data, as well as vast general repositories like UCI or something of that sort. Okay, I think that brings us to the end of this session. Thank you so much everyone for attending, and thank you to the TAs who have been working away answering all the questions. I hope to see you again in some other webinar. Catch you guys later, bye.
Info
Channel: Coding Ninjas
Views: 8,461
Id: 7rCW8fLdJGc
Length: 109min 40sec (6580 seconds)
Published: Sat Oct 31 2020