PUBG Data Science Tutorial - Part 2 | PUBG Data Analysis | Data Science Training | Edureka

Video Statistics and Information

Captions
Welcome back, all of you, to the concluding session of our data science tutorial using the PUBG data set. If any of you have not seen the first session of this tutorial, you can check out its link in the description box below. So now let's get straight into our model, where we shall finish our analysis.

The next part is about cleaning your data so that you can lay your analytics on top of it. Here we are going to deal with something known as outliers: we are going to try to detect them in our model and remove them, which basically means we're going to try to find the fraudsters, or imposters, in the game. An outlier is a value that lies outside the general pattern, or curve, of your data, as I showed you in the box-and-whisker plot earlier. Removing them helps homogenize and improve your model. But the pertinent question is: how do we do this? How do we catch the imposters and fraudsters in the game?

For this section, we are going to look at some weird, very absurd statistics that we find in our game. These outliers have to have a reason for why they are so good, why they are performing so well. I understand that at a certain skill level you have a certain amount of accuracy, but there are certain statistics which are just not possible. I'm going to run through this section fairly quickly: basically, anything that seems like an anomaly, we are going to try to detect and, in the process, omit, and in this way homogenize our data set. So first of all, I am dropping the illegal matches, which basically means the not-applicable matches, and then we're going to determine the shape of the data, which gives you the dimensions of your data set.

Now, we need to understand that in every popular game, people come up with cheats, people come up with bots; that is part and parcel of the video-game world. Whether you game on Steam or on any other platform, and whatever the game, be it PUBG, Dota, Counter-Strike or a bunch of others, the use of automated aiming software is without a doubt one of the most powerful cheats you can use, and it is definitely used in PUBG at this time. Basically, it binds a key on your mouse to an aim or lock-on function, which automatically makes targeting an enemy in line of sight much, much easier. If you're playing in, say, a squad or a duo, your other group members may not even be using it, and you can have a bot, a sort of trigger bot, which fires automatically at your targets.

Now, how are we going to find it? There is obviously a normal range for everything that is possible during a game. What we will do is pick out the weird statistics from the entire data set, and we are going to attempt to drop them. First of all: the total distance travelled by a player, which simply sums up your riding, walking and swimming distances. This is where feature engineering comes into the picture again; we are creating a new feature called total distance. Then we are going to create another new feature, called headshot rate. For most people, if you shoot ten people, at most one or two of them are going to be perfect headshots; however, there are certain anomalies when you look at the data set that we have. Here we are also defining certain functions for plotting graphs, as we will need them a lot later in this section; we'll be using a lot of count plots and distribution plots, so we're defining them beforehand.

Here come the anomalies. First: inhumane kills. As I mentioned, there are certain weird statistics in our data set; these players apparently have impossibly high accuracies, and some people have an inhumane number of kills. These people might just be having fun, or they could be fraudsters, or just living their ultimate shooter-range dream through this game.
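In pandas, the two engineered features just described might look like this. The frame below is a made-up stand-in, but the column names (`walkDistance`, `headshotKills`, etc.) follow the Kaggle PUBG data set:

```python
import pandas as pd

# Toy stand-in for the PUBG data (the real set has ~4.4M rows).
df = pd.DataFrame({
    "walkDistance":  [1200.0, 50.0, 3000.0],
    "rideDistance":  [0.0, 0.0, 1500.0],
    "swimDistance":  [0.0, 0.0, 20.0],
    "kills":         [2, 10, 5],
    "headshotKills": [1, 10, 2],
})

# totalDistance: every form of movement summed together.
df["totalDistance"] = df["walkDistance"] + df["rideDistance"] + df["swimDistance"]

# headshot_rate: headshot kills as a fraction of total kills
# (0/0 gives NaN, which we treat as a rate of zero).
df["headshot_rate"] = (df["headshotKills"] / df["kills"]).fillna(0)

print(df[["totalDistance", "headshot_rate"]])
```

A player with `headshot_rate` near 1.0 over many kills is exactly the kind of row the next steps flag as suspicious.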
Removing these outliers might improve our training set just a little bit. Can you see? This is what the kill count plot generally looks like, but we picked out certain outliers and listed them all here: their IDs, group IDs, match IDs, and their stats. One thing that I see pretty commonly here: do you see how little the walk distance is relative to the number of weapons acquired by each of these players? They have walked about 20 to 23 metres here, 85 and 46 metres there, yet acquired more than 60 weapons. Look at the longest kills and the number of headshots each of them has put up. If you stop and think about it: is it even possible to kill that many people while acquiring more than 60 weapons and covering a total distance of less than a hundred metres? Obviously not. Hence, either these people are extremely talented, or they are just plain imposters.

Then again, if you look at the next table, you see people who have had 40 headshot kills. Oh my god: there are 40 headshots out of 42 kills; the second person has 34 headshots out of 43 kills, and they used zero boosts in total despite having been knocked down 40 and 30 times. Is such a thing even possible? All of their headshot rates are nearly perfect, even if their weapons-acquired counts are fairly normal. And just look at the heals: all of them used no heals and no boosts at all, despite having been knocked down 40 times. Obviously, again, that is not possible. Then we have a couple of other fraudsters. So what we're going to do is drop these fraudsters and clean out our data set a little bit.

Next, we have more people with a hundred percent headshot kills. This is the headshot rate distribution: there are people with a hundred percent headshot kills, but mostly only up to, say, five headshots. But do you see the outlier? The outlier goes up to, I think, 15.
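The row-dropping described above, for players with an inhumane kill count, can be sketched like this. The 30-kill threshold and the sample rows are purely illustrative, not numbers taken from the video:

```python
import pandas as pd

# Hypothetical sample rows; real anomalies had 40+ kills on <100 m walked.
df = pd.DataFrame({
    "kills":        [3, 45, 7, 60],
    "walkDistance": [1800.0, 23.0, 900.0, 46.0],
})

# Collect the index of the suspiciously prolific players, then drop them.
suspects = df[df["kills"] > 30].index
df = df.drop(suspects)

print(df.shape)
```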
Now, here is a list of all the hitmen who made more than ten kills, all of them perfect headshots. There can be two reasons behind this list: either these people are literally virtual sniper gods and goddesses, or they are cheating, plain and simple. But I've decided to keep these players and not delete them for now, because we can't know for sure: the numbers aren't that extreme, they go up to barely 12 or 13 kills, which I can still believe somebody with high accuracy might achieve, and they have a fair number of boosts and shields, which makes the data slightly more believable. So we're going to keep them for now.

Next, we have killing without moving. Again, it's the same thing all over again: there are imposters who rack up kill after kill without moving at all. First, we identify the total distance travelled by a player, and then we set a Boolean value to true for those who make kills without moving even a little bit. And we do have a couple of people so "talented" that they just stay in one place and kill everybody around them without having to move at all. We're going to drop these, because that's quite an anomaly.

Then we have the longest kills, and here we have our longest-kill bar plot, where we see that as the distance increases, the number of kills decreases. Here you can see longest kills of well over one kilometre. Now, is it even possible that somebody is shooting from that far away? In my opinion it's not, and thankfully that resonates with the opinion of the creators of the game; hence we have now dropped these as well.

Next, we are going to drop people who have more than 10 road kills. Oh, I just have to show you one really funny thing. Notice the first ID over here; look at this person's number of road kills: 14 road kills, with a ride distance of literally 5 metres. How is that possible? Did you stack up, did you pile up all your victims on top of each other and run them over with your vehicle? That's obviously not possible, and hence we are going to drop this fraudster.
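The kills-without-moving flag and the road-kill filter might be implemented as below, on made-up rows (the 10-road-kill cut-off is the one mentioned in the video; `totalDistance` is the engineered feature from earlier):

```python
import pandas as pd

df = pd.DataFrame({
    "kills":         [5, 8, 0, 3],
    "totalDistance": [1500.0, 0.0, 0.0, 5.0],
    "roadKills":     [0, 0, 0, 14],
})

# Boolean feature: kills registered while covering zero distance.
df["killsWithoutMoving"] = (df["kills"] > 0) & (df["totalDistance"] == 0)

# Drop both kinds of anomaly.
df = df[~df["killsWithoutMoving"]]
df = df[df["roadKills"] <= 10]

print(df.shape)
```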
Now, another anomaly that is very consistently found is an anomaly in travelling. This entire PUBG map, or island, is a 64-square-kilometre area, and if you look at it properly you'll see that each of the yellow grid boxes represents one square kilometre. First we have taken out the walk distance, ride distance and swim distance statistics: the mean, standard deviation, minimum, maximum, and so on. We are again going to make a distribution plot of all the walking that happens: how far does somebody go? Obviously nobody is going to walk 25 kilometres, but still, there are people here, which surprises me. I didn't come into this expecting to find somebody who has walked 25 kilometres over the course of a game, but I guess it might be digital wanderlust; who am I to judge? There are these wanderers and travellers who just play to roam around and explore places without killing anyone, but then again, don't you think: how were they travelling 13, even 25, kilometres in one game? So what do we do with them? We ditch them, of course.

Now, we're going to do the same thing for riding and swimming as well; take a better look at it when you go through this model on your own, but I'm basically doing for riding and swimming exactly what I did for walking. This is also funny: how are these people swimming for two kilometres without even breathing? Video games are so extremely realistic these days that if something cannot happen in real life, you can be mostly sure it cannot happen in your video game either.

Apart from travelling, we also see other anomalies, such as anomalies in the weapons acquired. As I mentioned before, there are people who acquired more than eighty weapons over the course of a game; the maximum somebody acquired is a hundred and twenty-eight weapons in one game.
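The travel-distance filtering described above can be sketched as below. The cut-offs (10 km walking, 20 km riding, 2 km swimming) are illustrative guesses at "impossible" values, not necessarily the video's exact thresholds:

```python
import pandas as pd

df = pd.DataFrame({
    "walkDistance": [2500.0, 13000.0, 800.0],
    "rideDistance": [0.0, 0.0, 25000.0],
    "swimDistance": [0.0, 100.0, 0.0],
})

# All distances are in metres; each filter drops the wanderers.
df = df[df["walkDistance"] < 10000]
df = df[df["rideDistance"] < 20000]
df = df[df["swimDistance"] < 2000]

print(len(df))
```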
I don't understand: what would a person do with 128 weapons over the course of one game? Is opening a shop an option in this game? I don't know; you guys tell me. Then we have anomalies in heals. Most regular players would know that a typical player uses around five or six healing items or fewer, but what if somebody goes ahead and uses 30 or 40 healing items? Are you stockpiling them in this time of coronavirus? Because, my friend, you cannot bring digital medication into real life; somebody should have told them. So again, we remove these outliers.

Now, here is a very important question that you should ask yourself. As you saw, I removed most of the outliers, except one or two. This is a question you as a data analyst should be asking yourself: should you remove every outlier you ever find in your data set, or not? For that you need to understand that in analytics and data science there is no fixed strategy for approaching a model. There is no single right way; there are only more and more efficient ways to do the very same thing. Five data scientists will have five very different approaches to this very problem and this very PUBG data set that I'm using. In my opinion, imagine an entire sea of water: if you picked out, say, ten buckets of muddy water from the shore, it would not particularly cause a deficit in the sea, but it keeps your shore clean and makes it good from your closest viewing point. It's much the same with data. In our case we have millions of rows, so if you lose a couple of thousand of them and it helps you create a better model, go ahead and do that; but make sure you assess each outlier properly before you choose to ditch it from your model.

As you might have noticed, I took out every single outlier that I found except one: the one with the perfect headshots. That is because, even if it was an anomaly, I couldn't say for sure that everybody on that list cheated. If I had dropped that particular anomaly, it could have cost me accuracy, as some of those players might have been genuine. I know people who have five or six thousand hours of gaming, who are actually pretty good and do have perfect headshot scores in certain games they play. Then again, I know a lot of them are fraudsters and imposters; but I cannot afford to lose the good statistics in the process of ditching the bad or biased ones.

So what I'm going to do here is create a checkpoint, because this is a fairly long model. At this moment our data is clean, so I'm going to save it to a separate CSV file and put it in the same drive that had my other files. Then I'm going to import pandas as pd and again reduce the memory usage of this particular data set, for the sake of the efficiency of the model. This might take some time, because you are exporting data into a CSV file and putting it into your drive. Once that's done, you import your pandas library and then reduce the memory usage of the data you've just saved in your folder.

Moving ahead, we have more features to add, so let's start experimenting with our data a little. We are going to add and remove some features and find their direct correlations with the winning probability. There are a couple of things this data set is missing. All of us know that in a single round there's a cap, a limit of a hundred players per game, but as we have seen previously while looking at our data, most of the time the game is not full; all hundred players are not playing, and there is no variable in our current data set that gives us the exact number of players that have joined. So let's go ahead and create one.
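The checkpoint-and-downcast step described above is commonly implemented by shrinking each numeric column to the smallest dtype that holds its values. This `reduce_mem_usage` helper is a simplified sketch of that idea, not the exact function from the video:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest dtype that fits their values."""
    for col in df.select_dtypes(include=["int64", "int32"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Toy frame: int64 values that fit in int16, float64 values exact in float32.
df = pd.DataFrame({
    "kills": np.arange(1000, dtype="int64"),
    "damage": np.arange(0.0, 1000.0),
})

before = df.memory_usage(deep=True).sum()
df = reduce_mem_usage(df)
after = df.memory_usage(deep=True).sum()
# df.to_csv("train_clean.csv", index=False)  # checkpoint to your drive here

print(before, after)
```

On the real 4-million-row frame this kind of downcasting is where the roughly 60% memory saving mentioned later comes from.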
This is what the data we saved looks like, the cleaned data that we just exported; this is the head of the data. And here we have a variable, and a graphical representation, of the number of players joined: the number of players joined, against the count of how many times that particular number of entrants appears in a game. As we can see, the most common count is 91 players, and if you remember, we had actually determined this in one of the cells above, where we had ninety-one point something, or ninety point something, players as our typical player count. There are very few matches, only a couple, with fewer than 75 players; as you can see, below 75 there are not many matches. Most matches are nearly completely packed, and why not? This game blew up so hard, and it expanded its horizons from desktops and PCs to mobile gaming, so you could be sitting anywhere, making twenty perfect headshots and swimming two kilometres without breathing. It's completely your choice; who am I to say? This game made a lot of noise when it came out and caused a lot of controversies, which actually made a lot more people aware of the game and got them to join this fraternity, so I'm not shocked that there are very few games which are not completely jam-packed with players.

Now that we have a feature with the number of players joined, we can normalize the other features based on how many players there are per game. These can be a number of features, from the kills to the damage dealt, the max place and the match duration. So we're going to start by creating normalized features, and then compare the standard features against the normalized ones; you get a great side-by-side comparison of each feature on the left and its normalized variant on the right. You have kills and the normalized number of kills per player.
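The `playersJoined` feature and the normalisation can be sketched like this. The `((100 - playersJoined) / 100 + 1)` scaling factor is the formula popularised by public PUBG Kaggle kernels, which I assume is what is used here:

```python
import pandas as pd

df = pd.DataFrame({
    "matchId":     ["m1", "m1", "m1", "m2", "m2"],
    "kills":       [2, 0, 5, 3, 1],
    "damageDealt": [210.0, 0.0, 480.0, 300.0, 90.0],
})

# Count how many players actually joined each match.
df["playersJoined"] = df.groupby("matchId")["matchId"].transform("count")

# Scale per-player stats up in under-filled matches so they stay comparable:
# a kill in a 90-player lobby is "worth more" than in a 100-player lobby.
scale = (100 - df["playersJoined"]) / 100 + 1
df["killsNorm"] = df["kills"] * scale
df["damageDealtNorm"] = df["damageDealt"] * scale

print(df[["playersJoined", "killsNorm", "damageDealtNorm"]])
```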
Then you have damage dealt and the normalized damage dealt on the right, and so on and so forth for max place and match duration. After that, we take all of this new data and try to establish a correlation between all of these attributes; again, stronger correlation is shown in green and weaker in brown. As you can see, with the clean data we have a much smaller correlation matrix. We can see that walking a certain distance still contributes a lot, in fact contributes the most, to the probability of winning a game, and the damage-dealt percentage also contributes greatly to your final placement. Then we have a couple more derived variables: heal items, headshot kill rate, kills over max place, and walk distance over total distance and over duration. We're also going to create a heat map off of those, basically mapping them on a heat map again. Now you have a better idea, with these new attributes, of how your chances of placing in a game are affected by different attributes. I understand the distance you walked has a good correlation with your winning percentage, but what matters slightly more is the distance you walk per unit time; and another great contributor towards winning is your heal items. It's things like that which make analyzing data all the more fun.

Next, we ask the cell to return the shape of our data again, the dimensions, and we drop a couple more attributes. Our clean data head now looks something like this, and then we have this heat map correlating all the attributes we have created so far, or that were provided to us by the data set. All the positive correlations are fine, and so are the negative ones; what we do not want are the zeros, and the values close to zero, which basically mean that those attributes do not affect your probability of winning at all. So why would we even need them? Our ultimate goal is to find the value of the target variable: each and every player's chances of winning in a given game.
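Computing each feature's correlation with the target is a one-liner in pandas; the synthetic data below just mimics the strong walk-distance relationship described above. A seaborn heatmap of `df.corr()` would then give the green/brown matrix shown in the video:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
walk = rng.uniform(0, 4000, 500)

# Synthetic frame: winPlacePerc tracks walkDistance, teamKills is noise.
df = pd.DataFrame({
    "walkDistance": walk,
    "teamKills": rng.integers(0, 3, 500),
    "winPlacePerc": np.clip(walk / 4000 + rng.normal(0, 0.05, 500), 0, 1),
})

# Correlation of every numeric feature with the target, strongest first.
corr = df.corr()["winPlacePerc"].sort_values(ascending=False)
print(corr)
```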
So what we're going to do is remove all those features which have absolutely no impact on your winning probability. As you can see with our clean data, we started with 4.4 million rows and 28 columns; now we have 4.1 million rows, because we just kept cleaning and cleaning our data, and 35 columns, which means we have more attributes, but we have cleaned the data within those attributes.

Now, here comes your next assignment. We have already removed all the zeros from the heat map, but do you know what else is useless data? All of these ones. The correlation of any attribute with itself is always exactly one: here you can see "players in team" on this axis, and on that axis basically the same thing, and hence a perfect correlation, because "players in team" is 100% going to determine "players in team"; it's the very same variable. So go ahead: you can download this model and do this as part of your assignment. Again, do not forget to leave your approach in our comment section; I would greatly appreciate it if you did. We also love to watch our learner community growing and learning with us, so I highly encourage you to attempt these assignments, or at least give them a try.

Now we are basically modifying our test data, and we have nothing but highly correlated data left, which we are again going to save into a CSV file and put in the very same folder we created earlier. While that happens, let's move on to our final prediction; as I mentioned earlier, writing a CSV file of a couple of million rows takes some time. With that, we've moved on to our final prediction. Let's look at our problem statement one more time and determine our target variable: basically, we are trying to create a model that forecasts, or predicts, a random player's finishing placement in any given match, at any given point in time, based on the final statistics of all the matches they have played so far, on a scale of first place to last place.
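The zero-correlation pruning described above might be sketched like this; the `noiseFeature` column and the 0.1 threshold are hypothetical illustrations, not values from the video:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "walkDistance": rng.uniform(0, 1, 2000),
    "noiseFeature": rng.uniform(0, 1, 2000),  # independent junk column
})
df["winPlacePerc"] = df["walkDistance"]  # target tracks walkDistance exactly here

# Keep only features whose absolute correlation with the target
# clears a small threshold; everything near zero is dropped.
corr = df.corr()["winPlacePerc"].abs()
keep = corr[corr > 0.1].index
df = df[keep]

print(list(df.columns))
```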
Your target variable here is winPlacePerc, the percentile of winning placement, where 1 corresponds to first place and 0 to last place in the match. Our objective, right from the beginning, has been to determine this percentile. Now, the kind of goal or objective you are looking for from a model tells you a lot about the machine learning technique you would want to use to get there. Apart from that, the target definition tells us a lot about how it is distributed.

So what is the metric? How do you decide on a unit of comparison for choosing between all the different algorithms available to you? Ideally, a professional would use a ranking algorithm, something like LambdaRank. Learning to rank, or LTR, refers to a class of techniques in supervised machine learning which help you solve problems like this: essentially you treat it like a regression problem, and what the procedure does is compare different pairs of items and sort them in ascending or descending order, iterating through various random pairs to extrapolate the final ranking of all the items in that particular column. Here, the business metric is the rank and/or the mean absolute error; as such, we are going to optimize our algorithm using the mean absolute error objective. There are a couple of limitations, for sure, such as the outliers and, to some extent, the non-uniqueness of predictions, but we are going to deal with those, and this is also the most straightforward approach.

As has pretty much been established throughout this model, our target variable is the percentile of winning placement. In PUBG, all the teams get assigned a percentile value, so there will be approximately the same count of each value between 0 and 1. There will be some irregularities in the distribution due to the imbalance in team sizes and the number of teams; what you should expect to see is a roughly uniform distribution, with a Gaussian distribution with a mean of 0.5 to go with it for the average percentile score per match. So we start again by taking our CSV file of the highly correlated data and reducing the memory usage of that data set. Now that the memory usage has been decreased by almost 60% and we know the shape of our data, we are also going to drop matchId and groupId from our training and testing data sets, because in raw form they are of no use to us, and finally we print the dimensions of our training and testing data sets.

That leads me to our next section, which deals with categorical variables. For this, we are going to one-hot encode the matchType feature and use it in a random forest model. So what is a random forest? It's a supervised machine learning algorithm used for regression models like ours; it's also used for classification models, but that is a case for another day. Random forests are simple to build and interpret, but they have their own pitfalls, being less flexible for some classification tasks; that won't throw us off much, though, as we have a regression model in mind, not a classification model.

To explain a random forest in very simple terms: imagine that you are planning a vacation, a year-long vacation, so you have to decide what places you want to travel to, and you take suggestions from a bunch of your friends. Obviously your friends want to know your preferences and ask you different questions before they recommend places. After talking to all of them, you hold a vote and go to the place with the most votes. This is a typical illustration of a random forest: the algorithm uses an ensemble learning method, in which various individual results are combined to form the result of the random forest. There are a couple of different types of ensemble learning, such as bagging and boosting. In the terms of our example, this is what the typical algorithm looks like: you select random subsets of data from the entire data set, you "talk" to all your friends separately, training your decision trees; each individual makes a prediction, putting in their vote; then you collect the votes from all your different decision trees, your friends, and make your final prediction. The final prediction is the one with the most votes (or, for regression, the average).

Let's see how we can use it in our model. First of all, we determine the number of match types, which we know are of three kinds: solo, duo and squad. One-hot encoding, for those of you who do not know, comes from digital circuits: "one-hot" is a group of bits among which the only legal combinations of values have a single high bit (a one) and all the other bits low (zeros); the inverse is called "one-cold". In statistics, dummy variables represent a very similar technique for representing categorical data. This is basically the process by which categorical variables are converted into a form that can be fed to your machine learning algorithms, so that your model does a better job. So here we have one-hot encoded all the match types. However, there are a huge number of group IDs and match IDs, so one-hot encoding them would basically be computational suicide.
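One-hot encoding the match types is a one-liner with pandas' `get_dummies`; the tiny frame below is a hypothetical slice of the `matchType` column:

```python
import pandas as pd

df = pd.DataFrame({
    "matchType": ["solo", "duo", "squad", "duo"],
    "kills":     [1, 2, 3, 4],
})

# One-hot encode matchType: one indicator column per category,
# replacing the original string column.
df = pd.get_dummies(df, columns=["matchType"])

print(sorted(df.columns))
```

Three categories become three indicator columns, which is fine; doing the same to millions of unique group and match IDs would explode the column count, hence the category-codes trick that follows.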
So instead, what we're going to do is turn them into category codes; that way we still benefit from the correlations between groups and matches in our random forest algorithm. First we turn groupId and matchId into categorical types; the next two lines get the category codes for groupId and matchId; and finally we get rid of the old columns. Then we drop the Id column, because it probably won't be useful for our algorithm; your test set contains different IDs, so I think it is wiser to drop them, and that's what we're going to do.

Now we prepare for our machine learning model. Let's start with sampling: we'll take a sample of five hundred thousand rows from our training set, because we obviously can't take all 4 million rows together; it would be extremely hard for debugging and EDA. Then, as you saw previously, your decision trees will be split using random subsets of features. Our entire reduced data set, in tabular form, is a DataFrame, and we're going to separate the training data from the target variable: y is going to be our target variable, and your DataFrame, df, will have all the columns except the target. Then we split the training and validation data, and print the dimensions of our sample training, target and validation sets.

Now we have a pipeline, but we have to see how accurate it is, so we are going to define a function for calculating the mean absolute error, which, as I mentioned, measures the reliability of the model you've created. We are going to use the scikit-learn library, its metrics and ensemble modules, and import the mean_absolute_error function as well as the RandomForestRegressor; we're also going to define a function to print the mean absolute error score of our model.
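The ID-to-category-codes trick, the train/validation split and the MAE metric can be sketched together on a toy frame. `train_test_split` and `mean_absolute_error` come from scikit-learn, matching the imports mentioned above; everything else here is illustrative:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "groupId":      ["g1", "g2", "g1", "g3", "g2", "g1"],
    "matchId":      ["m1", "m1", "m1", "m2", "m2", "m2"],
    "walkDistance": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "winPlacePerc": [0.1, 0.5, 0.2, 0.9, 0.6, 0.3],
})

# Compact integer codes instead of one-hot columns for the huge ID fields.
for col in ["groupId", "matchId"]:
    df[col] = df[col].astype("category").cat.codes

# Separate the target from the features, then split train / validation.
y = df["winPlacePerc"]
X = df.drop(columns=["winPlacePerc"])
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Score a trivial baseline (predict the training mean) with MAE.
mae = mean_absolute_error(y_val, [y_train.mean()] * len(y_val))
print(round(mae, 3))
```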
Then we train the basic model; here you have your estimators, your minimum samples, your max features and your n_jobs. Then we find out the most predictive features according to our basic random forest model: in order, you have your walk distance percentile, walk distance per second, kill place percentage, walk distance, all the way down to ride distance. We then plot an importance graph of the 20 most important features, so you can see in graph form what you previously saw in tabular form, and you can see that walking really does affect a lot of your winning percentile. After some more training and indexing, I keep all the significant features, and finally we build a DataFrame with only the significant features. Now we take those features and build a random forest model on them: we train the model on our top features, get their feature importances respectively, and here we have a very similar plot again.

Then we draw a dendrogram to view the very highly correlated features that contribute to our winning placement percentile; for that we use a library called SciPy. So here we have a dendrogram; for those of you who do not know, this is a plot used for hierarchical clustering. You can basically see the hierarchy of correlation of the different attributes, and also little clusters of which attribute directly or indirectly depends on which other attribute; for example, here you can see that boosts and heal items basically form one cluster, but they in turn depend on the whole walking-and-distance-covered cluster, basically how far a player went. Then we define a function to get a random sample of rows from our DataFrame.
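Training a small `RandomForestRegressor` and reading off its feature importances might look like this. The data is synthetic and deliberately rigged so that `walkDistance` dominates, mirroring the result described above; the hyperparameters are illustrative, not the video's:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "walkDistance": rng.uniform(0, 4000, 300),
    "boosts":       rng.integers(0, 10, 300),
    "teamKills":    rng.integers(0, 2, 300),
})
# Target driven mostly by walkDistance, so it should top the importances.
y = X["walkDistance"] / 4000 + 0.01 * X["boosts"]

model = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                              n_jobs=-1, random_state=0)
model.fit(X, y)

# Importances sum to 1; sort to get the "most predictive features" table.
importances = pd.Series(model.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```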
Next, we install a couple of packages, a couple of libraries, and then we can see the head of our DataFrame, the first five rows: you can see the IDs and all the attributes. Then we use the PDPbox package that we have just installed, and with it we plot the predictive quality of kills; using ggplot, we see how the kill percentile affects your placement. We do the same for the quality of walk distance, and here we see a fairly smooth, logarithmic-looking curve. Then we prepare our data and train our final model; you should try increasing your n_estimators and playing around with different parameters to get better results. We get the head of our test data set as well as our training data set. This might also take some time, so kindly be patient.

Then we reach a final checkpoint, where we save our modified test data as a CSV file, again in the same folder on the drive. In the end, we add to our test data the same features, the same attributes, that we have in our final version of the training data: the walk distance percentile, kill place over max place, players joined, the normalized kills, the normalized damage dealt, all of those extra attributes we generated through feature engineering. Then we turn groupId and matchId into categorical types, get the category codes for both, and finally remove the irrelevant features from our test data set, filling the non-applicable fields with 0, which basically means they are placeholders. And with that, we finally predict the final percentile of winning placements. We put the results in a DataFrame, create a submission file, and export it to CSV. I've already done this once before, so let me navigate to our submission file; this is what the submission file is going to look like after all of this hard work in code-running.
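Building the submission file is then just a two-column DataFrame written to CSV; the IDs and predictions below are hypothetical stand-ins for the model's output:

```python
import pandas as pd

# Hypothetical predictions for a handful of test players.
test_ids = ["a1", "b2", "c3"]
preds = [0.83, 0.12, 0.57]

submission = pd.DataFrame({"Id": test_ids, "winPlacePerc": preds})
# submission.to_csv("submission.csv", index=False)  # write next to the other files

print(submission)
```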
This is what your result should look like: you have two columns, your player Id and their placement percentile. Obviously this will not give every player a unique percentile, but you need to understand that we're talking about four million rows of data here; if 4.4 million people play a certain game, the top one percentile alone is going to be on the order of 40,000 people.

With that, I come to the end of my model, as well as the end of this session. Don't forget to check out the other content in our data science playlist, as well as our data science certification training, the link to which will be in the description box below. For now, I'd like to conclude by saying that data science continues to evolve as one of the most promising and in-demand career paths for professionals. To be successful in this field, you must advance past the traditional skills of analyzing large amounts of data, data mining, and typical programming; you must master the full spectrum of the data science lifecycle, and possess the flexibility and understanding to maximize returns at each phase of the process. Data is everywhere, and it's ever-expanding, and a variety of terms related to it are often used interchangeably; but don't let that get into your head, it's really not that complex if you put your mind to it. So why not turn this quarantine into a learning experience by upping your skills and finding something new to learn? I shall leave you with that thought. Thank you, and have a great day.
Info
Channel: edureka!
Views: 15,535
Keywords: yt:cc=on, pubg data analysis, pubg analysis, pubg game analysis, pubg tournament analysis, analysis gameplay pubg, data science, data science tutorial, machine learning pubg, pubg eda, exploratory data analysis in python, exploratory data analysis in data science, exploratory data analysis in machine learning, python pubg, data science python, data science python edureka, exploratory data analysis python edureka, machine learning python edureka, edureka
Id: l2AbVQ4GPQ4
Length: 46min 47sec (2807 seconds)
Published: Sun Apr 19 2020