Machine Learning for Everybody – Full Course

Captions
kylie ying has worked at many interesting places such as mit cern and free code camp she's a physicist engineer and basically a genius and now she's gonna teach you about machine learning in a way that is accessible to absolute beginners what's up you guys so welcome to machine learning for everyone if you are someone who is interested in machine learning and you think you are considered as everyone then this video is for you in this video we'll talk about supervised and unsupervised learning models we'll go through maybe a little bit of the logic or math behind them and then we'll also see how we can program it on google collab if there are certain things that i have done and you know you're somebody with more experience than me please feel free to correct me in the comments and we can all as a community learn from this together so with that let's just dive right in without wasting any time let's just dive straight into the code and i will be teaching you guys concepts as we go so this here is the uci machine learning repository and basically they just have a ton of data sets that we can access and i found this really cool one called the magic gamma telescope data set so in this data set if you don't want to read all this information to summarize what i what i think is going on is there's this gamma telescope and we have all these high energy particles hitting the telescope now there's a camera there's a detector that actually records certain patterns of you know how this light hits the camera and we can use properties of those patterns in order to predict what type of particle caused that radiation so whether it was a gamma particle or some other head like hadron down here these are all of the attributes of those patterns that we collect in the camera so you can see that there's you know some length width size asymmetry etc now we're going to use all these properties to help us discriminate the patterns and whether or not they came from a gamma particle or a hadron so in order to do this we're going to come up here go to the data folder and you're going to click this magic04 data and we're going to download that now over here i have a colab notebook open so you go to collab.research.google.com you start a new notebook and i'm just going to call this um the magic data set so actually i'm going to call this for code camp magic example okay so with that i'm going to first start with some imports so i will import you know i always import numpy i always import pandas and i always import matplotlib and then we'll import other things as we go so yeah we run that in order to run the cell you can either click this play button here or you can on my computer it's just shift enter and that that will run the cell and here i'm just going to or i'm just going to you know let you guys know okay this is where i found the data set um so i've copied and pasted this actually but this is just where i found the data set and in order to import that downloaded file that we we got from the computer we're going to go over here to this folder thing and i am literally just going to drag and drop that file into here okay so in order to take a look at you know what does this file consist of do we have the labels do we not i mean we could open it on our computer but we can also just do pandas read csv and we can pass in the name of this file and let's see what it returns so it doesn't seem like we have the label so let's go back to here i'm just going to make the columns uh the column labels all of these attribute 
names over here so i'm just going to take these values and make that the column names all right how do i do that so basically i will come back here and i will create a list called calls and i will type in all of those things with f size f conch and we also have an f conc 1 we have f symmetry fm3 long fm3 trans f alpha what else do we have f dist and class and class okay great now in order to label those as these columns down here in our data frame so basically this command here just reads some csv file that you pass in csv has come about comma separated values um and turned that into a pandas data frame object so now if i pass in a names here then it basically assigns these labels to the columns of this data set so i'm going to set this data frame equal to df and then if we call so head is just like give me the first five things now you'll see that we have labels for all of these okay all right great so one thing that you might notice is that over here the class labels we have g and h so if i actually go down here and i do data frame class dot unique you'll see that i have either g's or h's and these stand for gammas or hadrons um and our computer is not so good at understanding letters right our computer's really good at understanding numbers so what we're going to do is we're going to convert this to 0 for g and 1 for h so here i'm going to set this equal to this whether or not that equals g and then i'm just going to say as type int so what this should do is convert this entire column if it equals g then this is true so i guess that would be one and then if it's h it would be false so that would be zero but i'm just converting g and h to one and zeros it doesn't really matter like if g is one and h is zero or vice versa let me just take a step back right now and talk about this data set so here i have some data frame and i have all of these different values for each entry now this is a you know each of these is one sample it's one example it's one item in our data set it's one data point all these things are kind of the same thing when i mention oh this is one example or this is one sample or whatever now each of these samples they have you know one quality for each or one value for each of these labels up here and then it has the class now what we're going to do in this specific example is try to predict for future you know samples whether the class is g for gamma or h for hadron and that is something known as classification now all of these up here these are known as our features and features are just things that we're going to pass into our model in order to help us predict the label which in this case is the class column so for you know sample 0 i have 10 different features so i have 10 different values that i can pass into some model and i can spit out you know the class the label and i know the true label here is g so this is this is actually supervised learning all right so before i move on let me just give you a quick little crash course on what i just said this is machine learning for everyone well the first question is what is machine learning well machine learning is a sub-domain of computer science that focuses on certain algorithms which might help a computer learn from data without a programmer being there telling the computer exactly what to do that's what we call explicit programming so you might have heard of ai and ml and data science what is the difference between all of these so ai is artificial intelligence and that's an area of computer science where the goal is to 
enable computers and machines to perform human-like tasks and simulate human behavior now machine learning is a subset of ai that tries to solve one specific problem and make predictions using certain data and data science is a field that attempts to find patterns and draw insights from data and that might mean we're using machine learning so all of these fields kind of overlap and all of them might use machine learning so there are a few types of machine learning the first one is supervised learning and in supervised learning we're using labeled inputs so this means whatever input we get we have a corresponding output label in order to train models and to learn outputs of different new inputs that we might feed our model so for example i might have these pictures okay to a computer all these pictures are are pixels they're pixels with a certain color now in supervised learning all of these inputs have a label associated with them this is the output that we might want the computer to be able to predict so for example over here this picture is a cat this picture is a dog and this picture is a lizard now there's also unsupervised learning and in unsupervised learning we use unlabeled data to learn about patterns in the data so here are here are my input data points again they're just images they're just pixels well okay let's say i have a bunch of these different pictures and what i can do is i can feed all these to my computer and i might not you know my computer's not gonna be able to say oh this is a cat dog and lizard in terms of you know the output but it might be able to cluster all these pictures it might say hey all of these have something in common all of these have something in common and then these down here have something in common that's finding some sort of structure in our unlabeled data and finally we have reinforcement learning and reinforcement learning well they usually there's an agent that is learning in some sort of interactive environment based on rewards and penalties so let's think of a dog we can train our dog but there's not necessarily you know any wrong or right output at any given moment right well let's pretend that dog is a computer essentially what we're doing is we're giving rewards to our computer and telling our computer hey this is probably something good that you want to keep doing well computer agent yeah terminology but in this class today we'll be focusing on supervised learning and unsupervised learning and learning different models for each of those all right so let's talk about supervised learning first so this is kind of what a machine learning model looks like you have a bunch of inputs that are going into some model and then the model is spitting out an output which is our prediction so all these inputs this is what we call the feature vector now there are different types of features that we can have we might have qualitative features and qualitative means categorical data there's either a finite number of categories or groups so one example of a qualitative feature might be gender and in this case there's only two here it's for the sake of the example i know this might be a little bit outdated here we have a girl and a boy there are two genders there are two different categories that's a piece of qualitative data another example might be okay we have you know a bunch of different nationalities maybe a nationality or a nation or a location that might also be an example of categorical data now in both of these there's no inherent order it's not 
like you know we can rate us one and uh france two japan three etcetera right there's not really any inherent order built into either of these categorical data sets that's why we call this nominal data now for nominal data the way that we want to feed it into our computer is using something called one hot encoding so let's say that you know i have a data set some of the items in our data some of the inputs might be from the us some might be from india then canada than france now how do we get our computer to recognize that we have to do something called one hot encoding and basically one hot encoding is saying okay well if it matches some category make that a one and if it doesn't just make that a zero so for example if your input were from the us you would you might have one zero zero zero india you know zero one zero zero canada okay well the item representing canada is one and then france the item representing france is one and then you can see that the rest are zeros that's one hot encoding now there are also a different type of qualitative feature so here on the left there are different age groups there's babies toddlers teenagers young adults adults and so on right and on the right hand side we might have different ratings so maybe bad not so good mediocre good and then like great now these are known as ordinal pieces of data because they have some sort of inherent order right like being a toddler is a lot closer to being a baby than being an elderly person right or good is closer to great than it is to really bad so these have some sort of inherent ordering system and so for these types of data sets we can actually just mark them from you know one to five or we can just say hey for each of these let's give it a number and this makes sense because like for example the thing that i just said how good is closer to great then good is close to not good at all well four is closer to five then four is close to one so this actually kind of makes sense and it'll make sense for the computer as well all right there are also quantitative pieces of data and quantitative pieces of data are numerical valued pieces of data so this could be discrete which means you know they might be integers or it could be continuous which means all real numbers so for example the length of something is a quantitative piece of data it's a quantitative feature the temperature of something is a quantitative feature and then maybe how many easter eggs i collected in my basket this easter egg hunt that is an example of discrete quantitative feature okay so these are continuous and this over here is discrete so those are the things that go into our feature vector those are our features that we're feeding this model because our computers are really really good at understanding math right at understanding numbers they're not so good at understanding things that humans might be able to understand well what are the types of predictions that our model can output so in supervised learning there are some different tasks there's one classification and basically classification just saying okay predict discrete classes and that might mean you know this is a hot dog this is a pizza and this is ice cream okay so there are three distinct classes and any other pictures of hot dogs pizza or ice cream i can put under these labels hot dog pizza ice cream this is something known as multi-class classification but there's also binary classification and binary classification you might have hot dog or not hot dog so there's only two 
categories that you're working with something that is something and something that isn't binary classification okay so yeah other examples so if something has positive or negative sentiment that's binary classification maybe you're predicting your pictures if they're cats or dogs that's binary classification maybe you know you are writing an email filter and you're trying to figure out if an email is spam or not spam so that's also binary classification now for multi-class classification you might have you know cat dog lizard dolphin shark rabbit etc um we might have different types of fruits so like orange apple pear etc and then maybe different plant species but multi-class classification just means more than two okay and binary means we're predicting between two things there's also something called regression when we talk about supervised learning and this just means we're trying to predict continuous values so instead of just trying to predict different categories we're trying to come up with a number that you know is on some sort of scale so some examples so some examples might be the price of ethereum tomorrow or it might be okay what is going to be the temperature or it might be what is the price of this house right so these things don't really fit into discrete classes we're trying to predict a number that's as close to the true value as possible using different features of our data set so that's exactly what our model looks like in supervised learning now let's talk about the model itself how do we make this model learn or how can we tell whether or not it's even learning so before we talk about the models let's talk about how can we actually like evaluate these models or how can we tell whether something is a good model or a bad model so let's take a look at this data set so this data set has this is from a diabetes a pima indian diabetes data set and here we have different number of pregnancies different glucose levels blood pressure skin thickness insulin bmi age and then the outcome whether or not they have diabetes one for they do zero for they don't so here um all of these are quantitative features right because they're all on some scale so each row is a different sample in the data so it's a different example it's one person's data and each row represents one person in this data set now this column each column represents a different feature so this one here is some measure of blood pressure levels and this one over here as we mentioned is the output label so this one is whether or not they have diabetes and as i mentioned this is what we would call a feature vector because these are all of our features in one sample and this is what's known as the target or the output for that feature vector that's what we're trying to predict and all of these together is our features matrix x and over here this is our labels or targets vector y so i've condensed this to a chocolate bar to kind of talk about some of the other concepts in machine learning so over here we have our x our features matrix and over here this is our label y so each row of this will be fed into our model right and our model will make some sort of prediction and what we do is we compare that prediction to the actual value of y that we have in our labeled data set because that's the whole point of supervised learning is we can compare what our model is outputting to oh what is the truth actually and then we can go back and we can adjust some things so the next iteration we get closer to what the true value is so that 
whole process here the tinkering the okay what's the difference where did we go wrong that's what's known as training the model all right so take this whole you know chunk right here do we want to really put our entire chocolate bar into the model to train our model not really right because if we did that then how do we know that our model can do well on new data that we haven't seen like if i were to create a model to predict whether or not someone has diabetes let's say that i just train all my data and i see that on my training data it does well i go to some hospital i'm like here's my model i think you can use this to predict if somebody has diabetes do we think that would be effective or not probably not right because we haven't assessed how well our model can generalize okay it might do well after you know our model has seen this data over and over and over again but what about new data can our model handle new data well how do we how do we get our model to assess that so we actually break up our whole data set that we have into three different types of data sets we call it the training data set the validation data set and the testing data set and you know you might have sixty percent here twenty percent and twenty percent or eighty ten and ten um it really depends on how many statistics you have i think either of those would be acceptable so what we do is then we feed the training data set into our model we come up with you know this might be a vector of predictions corresponding with each sample that we put into our model we figure out okay what's the difference between our prediction and the true values this is something known as loss loss is you know what's the difference here in some numerical quantity of course and then we make adjustments and that's what we call training okay so then once you know we've made a bunch of adjustments we can put our validation set through this model and the validation set is kind of used as a reality check during or after training to ensure that the model can handle unseen data still so every single time after we train one iteration we might stick the validation set in and see hey what's the loss there and then after our training is over we can assess the validation set and ask hey what's the loss there but one key difference here is that we don't have that training step this loss never gets fed back into the model right that feedback loop is not closed all right so let's talk about loss really quickly so here i have four different types of models i have some sort of data that's being fed into the model and then some output okay so this output here is pretty far from you know this truth that we want and so this loss is going to be high in model b again this is pretty far from what we want so this loss is also going to be high let's give it 1.5 now this one here it's pretty close i mean maybe not almost but pretty close to this one so that might have a loss of 0.5 and then this one here is maybe further than this but still better than these two so that loss might be 0.9 okay so which of these model performs the best well model c has the smallest loss so it's probably model c okay now let's take model c after you know we've come up with these all these models and we've seen okay model c is probably the best model we take model c and we run our test set through this model and this test set is used as a final check to see how generalizable that chosen model is so if i you know finished training my diabetes data set then i could run it through some 
chunk of the data and i can say oh like this is how it perform on data that it's never seen before at any point during the training process okay and that loss that's the final reported performance of my test set or this would be the final reported performance of my model okay so let's talk about this thing called loss because i think i kind of just glossed over it right so loss is the difference between your prediction and the actual like label so this would give a slightly higher loss than this and this would even give a higher loss because it's even more off in computer science we like formulas right we like formulaic ways of describing things so here are some examples of loss functions and how we can actually come up with numbers this here is known as l1 loss and basically l1 loss just takes the absolute value of whatever your you know real value is whatever the real output label is subtracts the predicted value and takes the absolute value of that okay so the absolute value is a function that looks something like this so the further off you are the greater your losses right in either direction so if your real value is off from your predicted value by 10 then your loss for that point would be 10 and then this sum here just means hey we're taking all the points in our data set and we're trying to figure out the sum of how far everything is now we also have something called l2 loss so this loss function is quadratic which means that if it's close the penalty is very minimal and if it's off by a lot then the penalty is much much higher okay and this instead of the absolute value we just square the um the difference between the two now there's also something called binary cross entropy loss it looks something like this and this is for uh binary classification this this might be the loss that we use so this loss you know i'm not going to really go through it too much but you just need to know that loss decreases as the performance gets better so there are some other measures of accuracy as well so for example accuracy what is accuracy so let's say that these are pictures that i'm feeding my model okay and these predictions might be apple orange orange apple okay but the actual is apple orange apple apple so three of them were correct and one of them was incorrect so the accuracy of this model is three quarters or 75 percent all right coming back to our collab notebook i'm going to close this a little bit again we've imported stuff up here um and we've already created our data frame right here and this is this is all of our data this is what we're going to use to train our models so down here again if we now take a look at our data set you'll see that our classes are now zeros and ones so now this is all numerical which is good because our computer can now understand that okay and you know it would probably be a good idea to maybe kind of plot hey do these things have anything to do with the class so here i'm going to go through all the labels so for label in the columns of this data frame so this just gets me the list actually we have the list right it's called so let's just use that it might be less confusing of everything up till the last thing which is the class so i'm going to take all these 10 different features and i'm going to plot them as a histogram um so and now i'm gonna plot them as a histogram so basically if i take that data frame and i say okay for everything where the class is equal to one so these are all of our gammas remember now for that portion of the data frame if i 
look at this label. so what this part here is saying is, inside the data frame, get me everything where the class is equal to one, so all of these would fit into that category, right, and now let's just look at the label column. so the first label would be f_length, which would be this column. so this command here is getting me all the different values that belong to class 1 for this specific label, and that's exactly what i'm going to put into the histogram. and now i'm just going to tell matplotlib: make the color blue, label this as gamma, set alpha equal to 0.7, so that's just the transparency, and then i'm going to set density equal to true so that when we compare it to the hadrons here we'll have a baseline for comparing them. so density being true basically normalizes these distributions, because if you have 200 of one type and then 50 of another type, well, if you drew the histograms it would be hard to compare because one of them would be a lot bigger than the other, right, but by normalizing them we're distributing them over how many samples there are. and then i'm just going to put a title on here and make that the label, the y label is probability because it's a density, and the x label is just going to be the label, and i'm going to include a legend, and plt.show just means okay, display the plot. so if i run that, oh, this should be up to the last item, so we want a list, right, not just the last item, and now we can see that we're plotting all of these. so here we have the length, oh, and i made this one gamma, so this should be hadron. okay, so the gammas are in blue, the hadrons are in red. so here we can already see that maybe if the length is smaller it's probably more likely to be gamma, right, and these all look somewhat similar, but here, okay, clearly if there's more asymmetry, or if this asymmetry measure is larger, then it's probably a hadron. oh, this one's a good one: f_alpha. it seems like hadrons are pretty evenly distributed, whereas if this is smaller it looks like there's more gammas in that area. okay, so this is kind of the data that we're working with, we can kind of see what's going on. okay, so the next thing that we're going to do here is create our train, our validation and our test data sets. i'm going to set train, valid and test equal to this numpy.split, so i'm just splitting up the data frame, and if i do this sample where i'm sampling everything, this will basically shuffle my data. now i want to pass in where exactly i'm splitting my data set, so the first split is going to be maybe at 60 percent, so i'm just going to say 0.6 times the length of this data frame, and then cast that to an integer, and that's going to be the first place where i cut it off, and that will be my training data. now if i then go to 0.8, this basically means everything between 60 and 80 percent of the length of the data set will go towards validation, and then everything from 80 to 100 percent is going to be my test data. so i can run that, and now if we go up here and we inspect this data, we'll see that these columns seem to have values in the 100s whereas this one is 0.03, right, so the scale of all these numbers is way off, and sometimes that will affect our results. so one thing that we would want to do is scale these so that they're now relative to the mean and the standard deviation of that specific column.
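as a reference, the plotting and splitting cells just described might look something like this sketch (it assumes the df data frame and the cols list created earlier; class 1 is gamma and class 0 is hadron after the earlier conversion):

```python
import numpy as np
import matplotlib.pyplot as plt

# overlaid, normalized histograms of each feature for the two classes
for label in cols[:-1]:
    plt.hist(df[df["class"] == 1][label], color="blue", label="gamma", alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color="red", label="hadron", alpha=0.7, density=True)
    plt.title(label)
    plt.ylabel("Probability")
    plt.xlabel(label)
    plt.legend()
    plt.show()

# shuffle the rows, then cut at 60% and 80% to get train / validation / test
train, valid, test = np.split(df.sample(frac=1), [int(0.6 * len(df)), int(0.8 * len(df))])
```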
to do that scaling, i'm going to create a function called scale_dataset, and i'm going to pass in the data frame, and that's what i'll do for now. okay, so the x values are going to be: i take the data frame, and let's assume that the label will always be the last column in the data frame, so what i can do is say dataframe.columns all the way up to the last item and get those values. now for my y, well, it's the last column, so i can just index into that last column and then get those values. now i'm actually going to import something known as the StandardScaler from sklearn, so if i come up here i can go to sklearn.preprocessing and import StandardScaler, i have to run that cell, and then i come back down here. now i'm going to create a scaler using that StandardScaler, and with the scaler what i can do is just fit and transform x, so here i can say x is equal to scaler.fit_transform(x). what that's doing is saying, okay, take x, fit the standard scaler to x, and then transform all those values, and that's going to be our new x. and then i'm also going to create the whole data as one huge 2d numpy array, and in order to do that i'm going to call hstack. so hstack is saying, okay, take an array and another array and horizontally stack them together, that's what the h stands for, so by horizontally stacking them we just put them side by side, not on top of each other. so what am i stacking? well, i have to pass in something so that it can stack x and y, and numpy is very particular about dimensions, right, so in this specific case our x is a two-dimensional object but y is only a one-dimensional thing, it's only a vector of values, so in order to reshape it into a 2d item we have to call numpy.reshape, and we can pass in the dimensions of that reshape. so if i pass in negative 1 comma 1, that just means, okay, make this a 2d array, where the negative 1 means infer what this dimension value would be, which ends up being the length of y; this would be the same as literally passing in the length of y, but the negative one is easier because we're making the computer do the hard work. so i stack that, and then i'm going to return the data, x and y. okay, so one more thing: if we go into our training data set and we get the length of the training data set where the training data set's class is one, so remember that this is the gammas, and then we print that and do the same thing but with zero, we'll see that there's around seven thousand of the gammas but only around four thousand of the hadrons. so that might actually become an issue, and instead what we want to do is oversample our training data set, so that means we want to increase the number of the smaller class so that these kind of match better. and surprise surprise, there is something that we can import that will help us do that: i'm going to go to imblearn.over_sampling and import this RandomOverSampler, run that cell and come back down here. so i will actually add in this parameter called oversample and set that to false by default, and if i do want to oversample, then i'm going to create this ros and set it equal to this RandomOverSampler, and then for x and y i'm just going to say, okay, fit and resample x and y.
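gathered into one cell, the scale_dataset function and the three calls described here might look something like the following sketch (the oversample flag and RandomOverSampler usage follow the description above; variable names are the ones used in the video):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

def scale_dataset(dataframe, oversample=False):
    # features are every column except the last; the label is the last column
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    # rescale each feature to zero mean and unit standard deviation
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # optionally resample the smaller class until both classes are the same size
    if oversample:
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)

    # stack X and the (reshaped) y side by side into one 2D array
    data = np.hstack((X, np.reshape(y, (-1, 1))))

    return data, X, y

train, X_train, y_train = scale_dataset(train, oversample=True)
valid, X_valid, y_valid = scale_dataset(valid, oversample=False)
test, X_test, y_test = scale_dataset(test, oversample=False)
```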
what that resampling is doing is taking the smaller class and sampling from it again and again to increase the size of that smaller class in our data set so that the two classes now match. so if i do this and i call scale_dataset and pass in the training data set with oversample set to true, so let's say this is train and then x_train, y_train, oops, what's going on, oh, these should be columns. so basically what i'm doing now is just asking, okay, what is the length of y_train, okay, now it's 14,800 or whatever, and now let's take a look at how many of these are type one, so we can just sum that up, and then we'll also see that if we instead switch the label and ask how many of them are the other type, it's the same value. so now these have been evenly rebalanced. okay, so here i'm just going to make this the validation data set, and then the next one i'm going to make the test data set, and we're actually going to switch oversample here to false. now the reason why i'm switching that to false is because my validation and my test sets are for the purpose of, you know, if i have data that i haven't seen yet, how does my model perform on those, and i don't want to oversample for that, like i don't care about balancing those, i want to know, if i have a random set of data that's unlabeled, can i trust my model, right, so that's why i'm not oversampling. i run that, and again, what is going on, oh, it's because we already have this train, so i have to come up here and split that data frame again, and now let's run these. okay, so now we have our data properly formatted, and we're going to move on to different models now, and i'm going to tell you guys a little bit about each of these models and then show you how we can do that in our code. so the first model that we're going to learn about is knn, or k-nearest neighbors. okay, so here i've already drawn a plot: on the y-axis i have the number of kids that a family might have, and then on the x-axis i have their income in terms of thousands per year, so if someone's making 40,000 a year, that's where this would be, and if somebody's making 320,000, that's where that would be; somebody with zero kids would be somewhere along this axis, somebody with five would be somewhere over here. and now i have these plus signs and these minus signs on here, so what i'm going to represent here is that the plus sign means they own a car and the minus sign is going to represent no car. so your initial thought should be, okay, i think this is binary classification, because all of our points, all of our samples, have labels, so this is a sample with the plus label and this here is another sample with the minus label (this "w/" is an abbreviation for "with" that i'll use). all right, so we have this entire data set, and maybe around half the people own a car and maybe around half the people don't own a car. okay, well, what if i had some new point, let me choose a different color, i'll use this nice green, well, what if i have a new point over here, so let's say that somebody makes 40,000 a year and has two kids, what do we think that would be? well, just logically looking at this plot, you might think, okay, it seems like they wouldn't have a car, right, because that kind of matches the pattern of everybody else around them. so that's the whole concept of nearest neighbors: you look at, okay, what's around you, and then you're basically like, okay, i'm going to take the label of the majority that's around me. so the first thing we
have to do is we have to define a distance function and a lot of times in you know 2d plots like this our distance function is something known as euclidean distance and euclidean distance is basically just this straight line distance like this okay so this would be the euclidean distance it seems like there's this point there's this point there's that point etcetera so the length of this line this green line that i just drew that is what's known as euclidean distance if we want to get technical with that this exact formula is the distance here let me zoom in the distance is equal to the square root of one point x minus the other points x squared plus extend that square root the same thing for y so y one of one minus y two of the other squared okay so we're basically trying to find the length the distance is the difference between x and y and then square each of those sum it up and take the square root okay so i'm going to erase this so it doesn't clutter my drawing but anyways now going back to this plot so here in the nearest neighbor algorithm we see that there is a k right and this k is basically telling us okay how many neighbors do we use in order to judge what the label is so usually we use a k of maybe you know three or five depends on how big our data set is but here i would say maybe a logical number would be three or five so let's say that we take k to be equal to three okay well of this data point that i drew over here let me use green to highlight this okay so of this data point that i drew over here looks like the three closest points are definitely this one this one and then this one has a length of four and this one seems like it'd be a little bit further than four so actually this would be our these would be our three points well all those points are blue so chances are my prediction for this point is going to be blue it's going to be probably don't have a car all right now what if my point is somewhere what if my point is somewhere over here let's say that a couple has four kids and they make 240 000 a year all right well now my closest points are this one probably a little bit over that one and then this one right okay still all pluses well this one is more than likely to be a plus right now let me get rid of some of these just so that it looks a little bit more clear all right let's go through one more what about a point that might be right here okay let's see well definitely this is the closest right this one's also closest and then [Music] it's really close between the two of these but if we actually do the mathematics it seems like if we zoom in this one is right here and this one is in between these two so this one here is actually shorter than this one and that means that that top one is the one that we're going to take now what is the majority of the points that are close by well we have one plus here we have one plus here and we have one minus here which means that the pluses are the majority and that means that this label is probably somebody with a car okay so this is how k nearest neighbors would work it's that simple and this can be extrapolated to further dimensions to higher dimensions you know if you have here we have two different features we have the income and then we have the number of kids but let's say we have 10 different features we can expand our distance function so that it includes all 10 of those dimensions we take the square root of everything and then we figure out which one is the closest to the point that we desire to classify okay so that's 
k-nearest neighbors. so now we've learned about k-nearest neighbors, let's see how we would be able to do that within our code. so here i'm going to label the section k-nearest neighbors, and we're actually going to use a package from sklearn. the reason why we use these packages is so that we don't have to manually code all these things ourselves, because it would be really difficult, and chances are the way that we would code it would either have bugs or be really slow, or, i don't know, a whole bunch of issues, so what we're going to do is hand it off to the pros. from here i can say, okay, from sklearn, which is this package, dot neighbors, i'm going to import KNeighborsClassifier, because we're classifying. so i run that, and our knn model is going to be this KNeighborsClassifier, and we can pass in a parameter of how many neighbors we want to use, so first let's see what happens if we just use one. so now if i do knn model dot fit, i can pass in my x training set and my y train data, and that effectively fits this model. and let's get all the predictions, so let's do y predictions, and my y predictions are going to be knn model dot predict, so let's use the test set, x test. all right, so if i call y predict you'll see that we have those, but if i get my truth values for that test set you'll see that this is what they actually are, so just looking at this we got five out of six of them. okay, great, so let's actually take a look at something called the classification report that's offered by sklearn. so i go to from sklearn.metrics import classification_report, and what i can actually do is say, hey, print out this classification report for me, and i'm giving you the y test and the y prediction. we run this and we see we get this whole entire chart, so i'm going to tell you guys a few things about this chart. all right, this accuracy is 82 percent, which is actually pretty good, that's just saying, hey, if we just look at what each of these new points is closest to, then we actually get an 82 percent accuracy, which means how many we got right versus how many total there are. now precision, you might see that we have it for class one or class zero, and what precision is saying, well, let's go to this wikipedia diagram over here, because i actually kind of like this diagram. so here, this is our entire data set, and on the left over here we have everything that we know is positive, so everything that is actually truly positive, that we've labeled positive in our original data set, and over here this is everything that's truly negative. now, in the circle we have things that were labeled positive by our model; on the left here we have things that are truly positive, because this side is the positive side and that side is the negative side, so these are true positives, whereas all these ones out here, well, they should have been positive but they are labeled as negative, and in here, these are the ones that we've labeled positive but they're actually negative, and out here these are true negatives. so precision is saying, okay, out of all the ones we've labeled as positive, how many of them are true positives, and recall is saying, okay, out of all the ones that we know are truly positive, how many did we actually get right. so going back to this over here, our precision score — again, precision: out of all the ones that we've labeled as the specific class, how many of them are actually that class — it's 77 and 84 percent.
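for reference, the knn cells described above, pulled together, might look something like this sketch (knn_model is an illustrative name; X_train, y_train, X_test, y_test come from the scale_dataset calls earlier):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# KNeighborsClassifier measures Euclidean (Minkowski, p=2) distance by default
knn_model = KNeighborsClassifier(n_neighbors=1)   # try 1, 3, 5 neighbors as in the video
knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)
print(classification_report(y_test, y_pred))
```

the GaussianNB and LogisticRegression models used later in the video reuse this same fit / predict / classification_report pattern.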
and recall is, out of all the ones that are actually this class, how many of those did we get: that's 68 and 89 percent. all right, so not too shabby. we can clearly see that the recall and precision for class zero are worse than for class one, right, so that means it works worse for the hadrons than for our gammas. this f1 score over here is kind of a combination of the precision and recall scores, so we're actually going to mostly look at this one because we have an unbalanced test data set, and here we have a measure of 0.72 and 0.87, which is not too shabby. all right, well, what if we made this three? so what was it originally with one: we see that our f1 scores were 0.72 and 0.87, and our accuracy was 82 percent. so if i change that to three, all right, we've kind of increased zero at the cost of one, and then our overall average accuracy is 81. so let's actually just make this five. all right, again very similar numbers, we have 82 percent accuracy, which is pretty decent for a model that's relatively simple. the next type of model that we're going to talk about is something known as naive bayes. now in order to understand the concepts behind naive bayes we have to be able to understand conditional probability and bayes' rule. so let's say i have some sort of data set that's shown in this table right here: people who have COVID are over here in this red row and people who do not have COVID are down here in this green row. now what about the COVID test? well, people who have tested positive are over here in this column and people who have tested negative are over here in this column. so basically our categories are: people who have COVID and test positive, people who don't have COVID but test positive (so a false positive), people who have COVID and test negative (which is a false negative), and people who don't have COVID and test negative (which, good, means you don't have COVID). okay, so let's make this slightly more legible, and here in the margins i've written down the sums of whatever it's referring to, so this here is the sum of this entire row and this here might be the sum of this column over here. okay, so the first question that i have is: what is the probability of having COVID given that you have a positive test? and in probability we write that out like this, the probability of COVID given, so this vertical line means given, you know, some condition, so given a positive test. so what this is asking is, okay, let's go into this condition, the condition of having a positive test, that is this slice of the data, right, that means if you're in this slice of data you have a positive test. so given that we have a positive test, in this circumstance, what's the probability that we have COVID? well, if we're just using this data, the number of people that have COVID and a positive test is 531.
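written out as formulas, the conditional probability being read off this table, and bayes' rule which comes up next, look like this:

$$P(\text{COVID} \mid +) = \frac{P(\text{COVID} \cap +)}{P(+)} = \frac{531}{551} \approx 96.4\%$$

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$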
so there are 531 people that have COVID and a positive test, and now we divide that by the total number of people that have a positive test, which is 551. so that's the probability, and doing a quick division we get that this is equal to around 96.4 percent. so according to this data set, which is data that i made up off the top of my head, so it's not actually real COVID data, but according to this data, the probability of COVID given that you tested positive is 96.4 percent. all right, now with that, let's talk about bayes' rule, which is this section here, and let's ignore this bottom part for now. so bayes' rule is asking, okay, what is the probability of some event a happening given that b happened, so b we already know has happened, this is our condition, right. well, what if we don't have data for that, like what if we don't know what the probability of a given b is? well, bayes' rule is saying, okay, you can actually go and calculate it as long as you have the probability of b given a, the probability of a, and the probability of b, and this is just the mathematical formula for that. all right, so here we have bayes' rule, and let's actually see bayes' rule in action, let's use it on an example. so here let's say that we have some disease statistics, so not COVID, a different disease, and we know that the probability of obtaining a false positive is 0.05, the probability of obtaining a false negative is 0.01, and the probability of the disease is 0.1. okay, what is the probability of the disease given that we got a positive test? how do we even go about solving this? so what do i mean by false positive, what's a different way to rewrite that? a false positive is when you test positive but you don't actually have the disease, so this here is the probability that you'd have a positive test given no disease, right, and similarly for the false negative, it's the probability that you test negative given that you actually have the disease. so if i put that into a chart, this might be my positive and negative tests, and this might be disease and no disease. well, the probability that i test positive but actually have no disease, okay, that's 0.05 over here, and then the false negative is up here at 0.01, so i'm testing negative but i actually do have the disease. now, the probability that you test positive given that you don't have the disease plus the probability that you test negative given that you don't have the disease, that should sum up to one, because if you don't have the disease then you have some probability of testing positive and some probability of testing negative, but that probability in total should be one. so that means that the probability of a negative test given no disease should be the complement, the opposite, so it should be 0.95, because it's 1 minus whatever this probability is. and then similarly, up here this should be 0.99, because the probability that we test negative given that we have the disease plus the probability that we test positive given that we have the disease should equal one. so this is our probability chart, and now this probability of disease being 0.1 just means i have a ten percent probability of actually having the disease, right, like in the general population the probability that i have the disease is 0.1. okay, so what is the probability that i have the disease given that i got a positive test? well, remember that we can write this out in terms of bayes' rule, right, so if i use this rule up here, this is
the probability of a positive test given that i have the disease times the probability of the disease divided by the probability of the evidence which is my positive test all right now let's plug in some numbers for that the probability of having a positive test given that i have disease is 0.99 and then the probability that i have the disease is this value over here 0.1 okay and then the probability that i have a positive test at all should be okay what is the probability that i have a positive test given that i actually have the disease and then having having the disease and then the other case where the probability of me having a negative test given or sorry positive test giving no disease times the probability of not actually having a disease okay so i can expand that probability of having positive tests out into these two different cases i have a disease and then i don't and then what's the probability of having positive tests in either one of those cases so that expression would become 0.99 times 0.1 plus 0.05 so that's the probability that i'm testing positive but don't have the disease and the times the probability that i don't actually have the disease so that's 1 minus 0.1 the probability that the population doesn't have the disease is 90 so 0.9 and let's do that multiplication and i get an answer of 0.6875 or 68.75 percent okay all right so we can actually expand that we can expand bayes rule and apply it to classification and this is what we call naive bayes so first a little terminology so the posterior is this over here because it's asking hey what is the probability of some class ck so by ck i just mean you know the different categories so c for category or class or whatever so category one might be cats category two dogs category three lizards all the way we have k categories k is just some number okay so what is the probability of having of this specific sample x so this is our feature factor of this one sample what is the probability of x fitting into category one two three four whatever right so that that's what this is asking what is the probability that you know it's actually from this class given all this evidence that we see the x's so the likelihood is this quantity over here it's saying okay well given that you know assume assume we are assumed that this class is class ck okay assume that this is a category well what is the likelihood of actually seeing x all these different features from that category and then this here is the prior so like in the entire population of things what what is what are the probabilities what is the probability of this class in general like if i have you know in my entire data set what is the percentage what is the chance that this image is a cat how many cats do i have right and then this down here is called the evidence because what we're trying to do is we're changing our prior we're creating this new posterior probability built upon the prior by using some sort of evidence right and that evidence is the probability of x so that's some vocab and this here is a rule for naive bayes whoa okay let's digest that a little bit okay so what is let me use a different color what is this side of the equation asking it's asking what is the probability that we are in sum class k ck given that you know this is my first input this is my second input this is you know my third fourth this is my nth input so let's say that our classification is do we play soccer today or not okay and let's say our x's are okay is it how much wind is there how much uh 
rain is there and what day of the week is it so let's say that it's raining it's not windy but it's wednesday do we play soccer do we not so let's use bayes rule on this so this here is equal to the probability of x1 x2 all these joint probabilities given class k times the probability of that class all over the probability of this evidence okay so what is this fancy symbol over here this means proportional to so how our equal sign means it's equal to this like little squiggly sign means that this is proportional to okay and this denominator over here you might notice that it has no impact on the class like this that number doesn't depend on the class right so this is going to be constant for all of our different classes so what i'm going to do is make things simpler so i'm just going to say that this probability x1 x2 all the way to xn this is going to be proportional to the numerator i don't care about the denominator because it's the same for every single class so this is proportional to x1 x2 xn given class k times the probability of that class okay all right so in naive bayes the point of it being naive is that we're actually this joint probability we're just assuming that all of these different things are all independent so in my soccer example you know the probability that we're playing soccer um or the probability that you know it's windy and it's rainy and and it's wednesday all these things are independent we're assuming that they're independent so that means that i can actually write this part of the equation here as this so each term in here i can just multiply all them together so the probability of the first feature given that it's class k times the probability of the second feature given that's probably like class k all the way up until you know the nth feature of given that it's class k so this expands to all of this all right which means that this here is now proportional to the thing that we just expanded times this so i'm going to write that out so the probability of that class and i'm actually going to use this symbol so what this means is it's a huge multiplication it means multiply everything to the right of this so this probability x given some class k but do it for all the i so i what is i okay we're going to go from the first x i all the way to the end so that means for every single i we're just multiplying these probabilities together and that's where this up here comes from so to wrap this up oops this should be a line to wrap this up in plain english basically what this is saying is the probability that you know we're in some category given that we have all these different features is proportional to the probability of that class in general times the probability of each of those features given that we're in this one class that we're testing so the probability of it you know of us playing soccer today given that it's rainy not windy and um and it's wednesday is proportional to okay well what is what is the probability that we play soccer anyways and then times the probability that it's rainy given that we're playing soccer times the probability that it's not windy given that we're playing soccer so how many times are we playing soccer when it's windy you know and then how many times or what's the probability that's wednesday given that we're playing soccer okay so how do we use this in order to make a classification so that's where this comes in our y hat our predicted y is going to be equal to something called the arg max and then this expression over here 
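in symbols, the expression being pointed to here is the naive bayes / MAP decision rule:

$$\hat{y} = \underset{k \in \{1,\dots,K\}}{\arg\max} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$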
because we want to take the arg max. okay, if i write out this again, this means the probability of being in some class c_k given all of our evidence, and we're going to take the k that maximizes this expression on the right, that's what arg max means. so k is in one through capital K, which is how many categories there are, and we're going to go through each k, solve this expression over here, and find the k that makes it the largest. and remember that instead of writing this, we now have a formula, thanks to bayes' rule, for helping us approximate that, and we have the evidence for that, we have the answers for that, based on our training set. so this principle of going through each of these and finding whatever class, whatever category, maximizes this expression on the right, this is something known as MAP for short, or maximum a posteriori: pick the hypothesis, pick the k, that is the most probable, so that we minimize the probability of misclassification. all right, so that is MAP, that is naive bayes. back to the notebook: just like how i imported the KNeighborsClassifier up here, for naive bayes i can go to sklearn.naive_bayes and import the gaussian naive bayes classifier, GaussianNB. and here i'm going to say my naive bayes model is equal to this, it's very similar to what we had above, and i'm just going to say, with this model, we are going to fit x_train and y_train, just like above, so i'm going to set that, and exactly like above i'm going to make my prediction, so here i'm going to instead use my naive bayes model, and of course i'm going to run the classification report again, so i'm actually just going to put these in the same cell, but here we have the new y prediction and then y_test is still our original test data set. so if i run this you'll see, okay, what's going on here, we get worse scores, right, our precision, our recall, our f1 score, they all look slightly worse for the different categories, and our total accuracy, i mean, it's still 72 percent, which is not too shabby, but it's still 72 percent, which is not that great. okay, so let's move on to logistic regression. here i've drawn a plot: i have y, so this is my label, on one axis, and then this is maybe one of my features, so let's just say i only have one feature, in this case x0. well, we see that i have a few of one class type down here, and we know it's one class type because it's zero, and then we have our other class type, one, up here. okay, so many of you guys are familiar with regression, so let's start there: if i were to draw a regression line through this, it might look something like this, right? well, this doesn't seem to be a very good model, like why would we use this specific line to predict y, right, it's iffy. for example, we might say, okay, well, it seems like everything from here downwards would be one class type and here upwards would be another class type, but when you look at this you can visually tell, okay, that line doesn't make sense, those dots are not along that line, and the reason is because we are doing classification, not regression. okay, well, first of all, let's start here: we know that if we just use this line, this model equals mx plus b, where b is the y-intercept, right, and m is the slope. but when
we use a linear regression is it actually y hat no it's not right so when we're working with linear regression what we're actually estimating in our model is a probability what's the probability between 0 and 1 that it is class 0 or class 1. so here let's rewrite this as p equals mx plus b okay well mx plus b that can range you know from negative infinity to infinity right for any for any value of x it goes from negative infinity to infinity but probability we know one of the rules of probability is that probability has to stay between zero and one so how do we fix this well maybe instead of just setting the probability equal to that we can set the odds equal to this so by that i mean okay let's do probability divided by 1 minus the probability okay so now it becomes this ratio now this ratio is allowed to take on infinite values but there's still one issue here let me move this over a bit the one issue here is that mx plus b that can still be negative right like if you know i have a negative slope if i have a negative b if i have some negative x's in there i don't know but that can be that's allowed to be negative so how do we fix that we do that by actually taking the log of the odds okay so now i have the log of you know some probability divided by 1 minus the probability and now that is on a range of negative infinity to infinity which is good because the range of log should be negative infinity to infinity now how do i solve for p the probability well the first thing i can do is take you know i can remove the log by taking the not the um e to the whatever is on both sides so that gives me the probability over the one minus the probability is now equal to e to the m x plus b okay so let's multiply that out so the probability is equal to one minus the probability times e to the m x plus b so p is equal to e to the m x plus b minus p times e to the m x plus b and now we have we can move like terms to one side so if i do p uh so basically i'm moving this over so i'm adding p so now p one plus e to the m x plus b is equal to e to the m x plus b and let me change this parenthesis make it a little bigger so now my probability can be e to the mx plus b divided by 1 plus e to the mx plus b okay well let me just rewrite this really quickly i want a numerator of one on top okay so what i'm going to do is i'm going to multiply the top by e to the negative mx plus b and then also the bottom by e to the negative mx plus b and i'm allowed to do that because this over this is 1.
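written out, the algebra here is roughly:

$$\log\frac{p}{1-p} = mx + b \quad\Rightarrow\quad \frac{p}{1-p} = e^{mx+b} \quad\Rightarrow\quad p = \frac{e^{mx+b}}{1+e^{mx+b}} = \frac{1}{1+e^{-(mx+b)}}$$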
so now my probability is equal to 1 over 1 plus e to the negative mx plus b and now why do i rewrite it like that it's because this is actually a form of a special function which is called the sigmoid function and for the sigmoid function it looks something like this so s of x sigmoid you know at some x is equal to 1 over 1 plus e to the negative x so essentially what i just did up here is rewrite this in some sigmoid function where the x value is actually mx plus b so maybe i'll change this to y just to make that a bit more clear it doesn't matter what the variable name is but this is our sigmoid function and visually what our sigmoid function looks like is it goes from zero so this here is zero to one and it looks something like this curved s which i didn't draw too well let me try that again this is hard to draw something if i can draw this right like that okay so it goes in between zero and one and you might notice that this form fits our shape up here oops let's draw it sharper but it fits our shape up there a lot better right all right so that is what we call logistic regression we're basically trying to fit our data to the sigmoid function okay and when we only have you know one um data point so if we only have one feature x then that's what we call simple logistic regression but then if we have you know so that's only x0 but then if we have x0 x1 all the way to xn we call this multiple logistic regression because there are multiple features that we're considering when we're building our model logistic regression so i'm going to put that here and again from sklearn dot linear model we can import logistic regression right and just like how we did above we can repeat all of this so here instead of nb i'm going to call this the log model or lg logistic regression i'm going to change this to logistic regression so i'm just going to use the default logistic regression but actually if you look here you see that you can use different penalties so right now we're using um an l2 penalty but l2 is our quadratic formula okay so that means that for you know outliers it would really penalize that for all these other things you know you can toggle these different parameters and you might get slightly different results if i were building a production level logistic regression model then i would want to go and i would want to figure out you know what are the best parameters to pass into here based on my validation data but for now we'll just we'll just use this out of the box so again i'm going to fit the x train and the y train and i'm just going to predict again so i can just call this again um and instead of l g uh nb i'm going to use lg so here this is decent precision 65 recall 71 f1 68 or 82 uh total accuracy of 77 okay so it performs slightly better than naive bayes but it's still not as good as knn all right so the last model for classification that i wanted to talk about is something called support vector machines or svms for short so what exactly is an svm model i have two different features x0 and x1 on the axis and then i've told you if it's you know class 0 or class 1 based on the blue and the red labels my goal is to find some sort of line between these two labels that best divides the data all right so this line is our svm model so i call it a line here because in 2d it's a line but in 3d it would be a plane and then you can also have more and more dimensions so the proper term is actually i want to find the hyperplane that best differentiates these two classes let's see a
few examples okay so first between these three lines let's say a b and c which one is the best divider of the data which one has you know all the data on one side or the other or at least if it doesn't which one divides it the most right like which one is has the most defined boundary between the two different groups so this this question should be pretty straightforward it should be a right because a has that clear distinct line between where you know everything on this side of a is one label it's negative and everything on this side of a is the other label it's positive so what if i have a but then what if i had drawn my b like this and my c maybe like this sorry they're kind of the labels are kind of close together but now which one is the best so i would argue that it's still a right and why is it still a because in these other two look at how close this is to that to these points right so if i had some new point that i wanted to estimate okay say i didn't have a or b so let's say we're just working with c let's say i have some new point that's right here or maybe a new point that's right there well it seems like just logically looking at this i mean without the boundary that would probably go under the positives right i mean it's pretty close to that other positive so one thing that we care about in svms is something known as the margin okay so not only do we want to separate the two classes really well we also care about the boundary in between where the points in those classes in our data set are and the line that we're drawing so in a line like this the closest values to this line might be like here i'm trying to draw these perpendicular right and so this effectively if i switch over to these dotted lines if i can draw this right so these effectively are what's known as the margins okay so these both here these are our margins in our svms and our goal is to maximize those margins so not only do we want the line that best separates the two different classes we want the line that has the largest margin and the data points that lie on the margin lines the data so basically these are the data points that's helping us define our divider these are what we call support vectors hence the name support vector machines okay so the issue with svm sometimes is that they're not so robust to outliers right so for example if i had one outlier like this up here that would totally change where i want my support vector to be even though that might be my only outlier okay so that's just something to keep in mind as you know you're working with svms is it might not be the best model if there are outliers in your data set okay so another example of svms might be let's say that we have data like this i'm just going to use a one dimensional data set for this example let's say we have a data set that looks like this well our you know separator should be perpendicular to this line but it should be somewhere along this line so it could be anywhere like this you might argue okay well there's one here and then you could also just draw another one over here right and then maybe you can have two svms but that's not really how svms work but one thing that we can do is we can create some sort of projection so i realized here that one thing i forgot to do was to label where zero was so let's just say zero is here now what i'm going to do is i'm going to say okay i'm going to have x and then i'm going to have x sorry x0 and x1 so x0 is just going to be my original x but i'm going to make x1 equal to let's say x 
squared so whatever is this squared right so now my negatives would be you know maybe somewhere here here just pretend that it's somewhere up here right and now my pluses might be something like that and i'm going to run out of space over here so i'm just going to draw these together use your imagination but once i draw it like this well it's a lot easier to apply a boundary right now our svm could be maybe something like this this and now you see that we've divided our data set now it's separable where one class is this way and the other class is that way okay so that's known as svms um i do highly suggest that you know any of these models that we just mentioned if you're interested in them do go more in depth mathematically into them like how do we how do we find this hyperplane right i'm not going to go over that in this specific course because you're just learning what an svm is but it's a good idea to know oh okay this is the technique behind finding you know what exactly are the are the um how do you define the hyperplane that we're going to use so anyways this transformation that we did down here this is known as the kernel trick so when we go from x to some coordinate x and then x squared what we're doing is we are applying a kernel so that's why it's called the kernel trick so svms are actually really powerful and you'll see that here so from sklearn.svm we are going to import svc and svc is our support vector classifier so with this so with our svm model we are going to you know create an svc model and we are going to uh again fit this to x train i could have just copy and pasted this i should have probably done that okay taking a bit longer all right let's predict using our svm model and here let's see if i can hover over this all right so again you see a lot of these different parameters here that you can go back and change if you were creating a production level model okay but in this specific case we'll just use it out of the box again so if i make predictions you'll note that wow the accuracy actually jumps to 87 with the svm and even with class 0 there's nothing less than you know 0.8 which is great and for class one i mean everything's at 0.9 which is higher than anything that we had seen to this point so so far we've gone over four different classification models we've done svms logistic regression naive bayes and knn and these are just simple ways on how to implement them each of these they have different you know they have different hyper parameters that you can go and you can toggle and you can try to see if that helps later on or not but for the most part they perform they give us around 70 to 80 percent accuracy okay with svm being the best
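as a rough sketch of the fit / predict / report pattern being used for each of these models — this isn't copied from the notebook, and the synthetic data here just stands in for the magic dataset's train and test splits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# stand-in data; in the notebook these splits come from the magic gamma dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naive bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

# every model follows the same pattern: fit on train, predict on test, report
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print(classification_report(y_test, y_pred))
```

the nice thing is that every sklearn classifier shares this same estimator interface, which is why swapping one model for another is basically a one line change in the notebook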
now let's see if we can actually beat that using a neural net now the final type of model that i wanted to talk about is known as a neural net or neural network and neural nets look something like this so you have an input layer this is where all your features would go and they have all these arrows pointing to some sort of hidden layer and then all these arrows point to some sort of output layer so what is what does all this mean each of these layers in here this is something known as a neuron okay so that's a neuron in a neural net these are all of our features that we're inputting into the neural net so that might be x0 x1 all the way through xn right and these are the features that we talked about there they might be you know the pregnancy the bmi the age etc and now all these get weighted by some value so they get multiplied by some w number that applies to that one specific category that one specific feature so these two get multiplied and the sum of all of these goes into that neuron okay so basically i'm taking w0 times x0 and then i'm adding x1 times w1 and then i'm adding you know x2 times w2 etc all the way to xn times wn and that's getting input into the neuron now i'm also adding this bias term which just means okay i might want to shift this by a little bit so i might add 5 or i might add 0.1 or i might subtract 100 i don't know but we're going to add this bias term and the output of all these things so the sum of this this this and this go into something known as an activation function okay and then after applying this activation function we get an output and this is what a neuron would look like now a whole network of them would look something like this so i kind of gloss over this activation function what exactly is that this is how a neural net looks like if we have all our inputs here and let's say all of these arrows represent some sort of addition right then what's going on is we're just adding a bunch of times right we're adding the some sort of weight times these input layer a bunch of times and then if we were to go back and factor that all out then this entire neural net is just a linear combination of these input layers which i don't know about you but that just seems kind of useless right because we could literally just write that out in a formula why would we need to set up this entire neural network we wouldn't so the activation function is introduced right so without an activation function this just becomes a linear model an activation function might look something like this and as you can tell these are not linear and the reason why we introduce these is so that our entire model doesn't collapse on itself and become a linear model so over here this is something known as a sigmoid function it runs between zero and one tanh runs between negative one all the way to one and this is relu which anything less than zero is zero and that anything greater than zero is linear so with these activation functions every single output of a neuron is no longer just the linear combination of these it's some sort of altered linear state which means that the input into the next neuron is you know it doesn't it doesn't collapse on itself it doesn't become linear because we've introduced all these non-linearities so this is the training set the model the loss right and then we do this thing called training where we have to feed the loss back into the model and make certain adjustments to the model to improve this predicted output let's talk a little bit about the training what exactly goes on during that step let's go back and take a look at our l2 loss function this is what our l2 loss function looks like it's a quadratic formula right well up here the error is really really really really large and our goal is to get somewhere down here where the loss is decreased right because that means that our predicted value is closer to our true value so that means that we want to go this way okay and thanks to a lot of properties of math something that we can do is called gradient descent in order to follow this slope down this way this quadratic is it has different um slopes with respect to some value okay so the loss with respect to some weight w0 versus w1 versus wn they might all be different right so some way that i kind of think about it is to what extent is this value contributing to our loss and we can actually
figure that out through some calculus which we're not going to touch up on in this specific course but if you want to learn more about neural nets you should probably also learn some calculus and figure out what exactly backpropagation is doing in order to actually calculate you know how much do we have to backstep by so the thing is here you might notice that this follows this curve at all these different points and the closer we get to the bottom the smaller this step becomes now stick with me here so my new value this is what we call a weight update i'm going to take w0 and i'm going to set some new value for w0 and what i'm going to set for that is the old value of w0 plus some factor which i'll just call alpha for now times whatever this arrow is so that's basically saying okay take our old w0 our old weight and just decrease it this way so i guess increase it in this direction right like take a step in this direction but this alpha here is telling us okay don't don't take a huge step right just in case we're wrong take a small step take a small step in that direction see if we get any closer and for those of you who you know do want to look more into the mathematics of things the reason why i use a plus here is because this here is the negative gradient right if this were just the if you were to use the actual gradient this should be a minus now this alpha is something that we call the learning rate okay and that adjusts how quickly we're taking steps and that you know will ultimately control how long it takes for our neural net to converge or sometimes if you set it too high it might even diverge but with all of these weights so here i have w0 w1 and then wn we make the same update to all of them after we calculate the loss and the gradient of the loss with respect to that weight so that's how backpropagation works and that is everything that's going on here after we calculate the loss we're calculating gradients making adjustments in the model so we're setting all the all the weights to something adjusted slightly and then we're saying okay let's take the training set and run it through the model again and go through this loop all over again
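written as a formula, the weight update being described is roughly this, where $\alpha$ is the learning rate and the partial derivative is what backpropagation computes:

$$w_{i,\text{new}} = w_{i,\text{old}} - \alpha\,\frac{\partial L}{\partial w_i}$$

(the plus sign on the whiteboard is because the arrow drawn there is the negative gradient; written with the gradient itself it's a minus, and the same update is applied to every weight $w_0, w_1, \dots, w_n$)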
so for machine learning we already have seen some libraries that we use right we've already seen sk learn but when we start going into neural networks this is kind of what we're trying to program and it's not very fun to try to program this from scratch because not only will we probably have a lot of bugs but also it's probably not going to be fast enough right wouldn't it be great if there are some you know full-time professionals that are dedicated to solving this problem and they could literally just give us their code that's already running really fast well the answer is yes that exists and that's why we use tensorflow so tensorflow makes it really easy to define these models but we also have enough control over what exactly we're feeding into this model so for example this line here is basically saying okay let's create a sequential neural net so sequential is just you know what we've seen here it just goes one layer to the next and a dense layer means that all them are interconnected so here this is interconnected with all of these nodes and this one's all these and then this one gets connected to all of the next ones and so on so we're going to create 16 dense nodes with relu activation functions and then we're going to create another layer of 16 dense nodes with relu activation and then our output layer is going to be just one node okay and that's how easy it is to define something in tensorflow so tensorflow is an open source library that helps you develop and train your ml models let's implement this for a neural net so we're using a neural net for classification now so our neural net model we are going to use tensorflow and i don't think i imported that up here so we are going to import that down here so i'm going to import tensorflow as tf and enter cool so my neural net model is going to be i'm going to use this um so essentially this is saying layer all these things that i'm about to pass in so yeah layer them linear stack of layers layer them as a model and what that means nope not that so what that means is i can pass in um some sort of layer and i'm just going to use a dense layer uh oops dot dense and let's say we have 32 units okay i will also um set the activation as relu and at first we have to specify the input shape so here we have 10 comma all right all right so that's our first layer now our next layer i'm just going to have another a dense layer of 32 units all using relu and that's it so for the final layer this is just going to be my output layer it's going to just be one node and the activation is going to be sigmoid so if you recall from our logistic regression what happened there was when we had a sigmoid it looks something like this right so by creating a sigmoid activation on our last layer we're essentially projecting our predictions to be zero or one just like in logistic regression and that's going to help us you know we can just round to zero or one and classify that way so this is my neural net model and i'm going to compile this so in tensorflow we have to compile it it's really cool because i can just literally pass in what type of optimizer i want and it'll do it um so here if i go to optimizers i'm actually going to use adam and you'll see that you know the learning rate is 0.001 so i'm just going to use that default so 0.001 and my loss is going to be binary cross entropy and the metrics that i'm also going to include on here so it already will consider loss but i'm i'm also going to tack on accuracy so we can actually see that in a plot later on all right so i'm going to run this um and one thing that i'm going to also do is i'm going to define these plot definitions so i'm actually copying pasting this i got these from tensorflow so if you go on to some tensorflow tutorial they actually have these this like defined uh and that's exactly what i'm doing here so i'm actually going to move this cell up run that so we're basically plotting the loss over all the different epochs epochs means like training cycles and we're going to plot the accuracy over all the epochs all right so we have our model and now all that's left is let's train it okay so i'm going to say history so tensorflow is great because it keeps track of the history of the training which is why we can go and plot it later on now i'm going to set that equal to this neural net model and fit that with x train y train uh i'm going to make the number of epochs equal to let's say just let's just use 100 for now and the batch size i'm going to set equal to let's say 32.
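a minimal sketch of the model being put together here, assuming 10 input features like the magic dataset; the random arrays are only there so the snippet runs on its own, and in the notebook the real train split is used instead:

```python
import numpy as np
import tensorflow as tf

# stand-in arrays; in the notebook x train and y train come from the earlier split
X_train = np.random.random((1000, 10)).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

nn_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sigmoid output, like logistic regression
])

nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# validation_split holds out 20% of the training data each epoch (described just below)
history = nn_model.fit(X_train, y_train, epochs=100, batch_size=32,
                       validation_split=0.2, verbose=0)
```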
all right um and the validation split so what the validation split does if it's down here somewhere okay so yeah this validation split is just the fraction the training data to be used as validation data so essentially every single epoch what's going on is uh tensorflow saying leave certain if if this is 0.2 then leave 20 out and we're going to test how the model performs on that 20 that we've left out okay so it's basically like our validation data set but um tensorflow does it on our training data set during the training so we have now a measure outside of just our validation data set to see you know what's going on so validation split i'm going to make that 0.2 and we can run this so if i run that all right and i'm actually going to set verbose equal to zero which means okay don't print anything because printing something for 100 epochs might get kind of annoying so i'm just gonna let it run let it train and then we'll see what happens cool so it finished training and now what i can do is because you know i've already defined these two functions i can go ahead and i can plot the loss oops plot loss of that history and i can also plot the accuracy throughout the training so this is a little bit ish what we're looking for we definitely are looking for a steadily decreasing loss and an increasing accuracy so here we do see that you know our validation accuracy improves from uh around point seven seven or something all the way up to somewhere around point maybe eight one and our loss is decreasing so this is good it is expected that the validation loss and accuracy is performing worse than um the training loss or accuracy and that's because our model is training on that data so it's adapting to that data whereas the validation stuff is you know stuff that it hasn't seen yet so so that's why so in machine learning as we saw above we could change a bunch of the parameters right like i could change this to 64. 
so now it'd be a row of 64 nodes and then 32 and then one so i can change some of these parameters and a lot of machine learning is trying to find hey what do we set these hyper parameters to so what i'm actually going to do is i'm going to rewrite this so that we can do something what's known as a grid search so we can search through an entire space of hey what happens if you know we have 64 nodes and 64 nodes or 16 nodes and 16 nodes and so on um and then on top of all that we can you know we can change this uh learning rate we can change how many epochs we can change you know the batch size all these things might affect our training and just for kicks i'm also going to add what's known as a dropout layer in here and what dropout is doing is saying hey randomly choose with at this rate certain nodes and don't train them in you know a certain iteration so this helps prevent overfitting okay so i'm actually going to define this as a function called train model we're going to pass an x train y train um the number of nodes uh the drop out you know the probability that we just talked about um learning rate so i'm actually going to say lr batch size and we can also pass the number of epochs right i mentioned that as a parameter so indent this so it goes under here and with these two i'm going to set this equal to number of nodes and now with the two dropout layers i'm going to set dropout prob so now you know the probability of turning off a node during the training is equal to dropout prop um and i'm going to keep the output layer the same now i'm compiling it but this here is now going to be my learning rate and i still want binary cross entropy and accuracy we are actually going to train our model inside of this function but here we can do the epochs equals epochs and this is equal to whatever you know we're passing in uh x train y train belong right here okay so those are getting passed in as well and finally at the end i'm going to return this model and the history of that model okay so now what i'll do is let's just go through all of these so let's say let's keep the epochs at 100. and now what i can do is i can say hey for a number of nodes in let's say let's do 16 32 and 64 to see what happens for the different dropout probabilities in i mean zero would be nothing let's use 0.2 also to see what happens um you know for the learning rate in uh 0.005 0.001 and you know maybe we want to throw on 0.1 in there as well and then for the batch size uh let's do 16 32 64 as well actually and let's also throw in 128. actually let's get rid of 16. 
sorry let's throw 128 in there that should be zero one i'm going to record the model in history using this train model here so we're going to do x train y train the number of nodes is going to be you know the number of nodes that we've defined here dropout prob lr batch size and epochs okay and then now we have both the model and the history and what i'm going to do is again i want to plot the loss for the history i'm also going to plot the accuracy probably should have done them side by side that probably would have been easier okay so what i'm going to do is split up split this up and that will be subplots so now this is just saying okay i want one row and two columns in that row for my plots okay so i'm going to plot on my axis one the loss i don't actually know this is gonna work okay we don't care about the grid uh yeah let's let's keep the grid and then now on my other so now on here i'm going to plot all the accuracies on the second plot i might have to debug this a bit but we should be able to get rid of that if we run this we already have history saved as a variable in here so if i just run it on this okay it has no attribute x label oh i think it's because it's like set x label or something okay yeah so it's it's set instead of just x label y label so let's see if that works all right cool um and let's actually make this a bit larger okay so we can actually change the figure size and i'm gonna set let's see what happens if i set that to oh that's not the way i wanted it um okay so that looks reasonable and that's just going to be my plot history function so now i can plot them side by side here i'm going to plot the history and what i'm actually going to do is i so here first i'm going to print out all these parameters so i'm going to print out use the f string to print out uh all of this stuff so here i'm printing out how many nodes um the dropout probability uh the learning rate and we already know how many epochs so i'm not even gonna bother with that so once we plot this uh let's actually also figure out what the um what the validation loss is on our validation set that we have that we created all the way back up here all right so remember we created three data sets let's call our model and evaluate what the validation data what the validation data sets loss would be and i actually want to record let's say i want to record whatever model has the least validation loss so first i'm going to initialize that to infinity so that you know any model will beat that score so if i do float infinity that will set that to infinity and um maybe i'll keep track of the parameters actually it doesn't really matter i'm just going to keep track of the model and i'm going to set that to none so now down here if the validation loss is ever less than the least validation loss then i am going to simply come down here and say hey this validation or this least validation loss is now equal to the validation loss and the least loss model is whatever this model is that just earned that validation loss okay so we are actually just going to let this run um for a while and then we're going to get our least loss model after that so let's just run all right and now we wait
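a sketch of the search that's being set up here — the helper and the loop follow the description above, the plotting helpers are left out, the stand-in arrays are only there so it runs on its own, and the exact hyperparameter values are just the ones mentioned:

```python
import numpy as np
import tensorflow as tf

# stand-in splits; in the notebook these are the train and validation sets from earlier
X_train = np.random.random((1000, 10)).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))
X_valid = np.random.random((300, 10)).astype("float32")
y_valid = np.random.randint(0, 2, size=(300,))

def train_model(X_train, y_train, num_nodes, dropout_prob, lr, batch_size, epochs):
    # two hidden layers with dropout after each, sigmoid output for binary classification
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(num_nodes, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dropout(dropout_prob),
        tf.keras.layers.Dense(num_nodes, activation="relu"),
        tf.keras.layers.Dropout(dropout_prob),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                        validation_split=0.2, verbose=0)
    return model, history

least_val_loss = float("inf")
least_loss_model = None
epochs = 100

# grid search over the values mentioned above, keeping whichever model
# has the lowest loss on the validation set
for num_nodes in [16, 32, 64]:
    for dropout_prob in [0, 0.2]:
        for lr in [0.1, 0.005, 0.001]:
            for batch_size in [32, 64, 128]:
                model, history = train_model(X_train, y_train, num_nodes,
                                             dropout_prob, lr, batch_size, epochs)
                val_loss = model.evaluate(X_valid, y_valid, verbose=0)[0]  # [loss, accuracy]
                if val_loss < least_val_loss:
                    least_val_loss = val_loss
                    least_loss_model = model
```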
all right so we have finally finished training and you'll notice that okay down here the loss actually gets to like 0.29 the accuracy is around 88 which is pretty good so you might be wondering okay why is this accuracy in this like these are both the validation so this accuracy here is on the validation data set that we've defined at the beginning right and this one here this is actually taking 20 percent of our training set every time during the training and saying okay how much of it do i get right now you know after this one step where i didn't train with any of that so they're slightly different and actually i realized later on that i probably you know probably what i should have done is over here when we were defining the model fit instead of the validation split you can define the validation data and you can pass in the validation data i don't know if this is the proper syntax but that's probably what i should have done but instead you know we'll just stick with what we have here so you'll see at the end you know with the 64 nodes it seems like this is our best performance 64 nodes with a dropout of 0.2 a learning rate of 0.001 and a batch size of 64. and it does seem like yes the validation you know the fake validation but the validation um loss is decreasing and then the accuracy is increasing which is a good sign okay so finally what i'm going to do is i'm actually just going to predict so i'm going to take this model which we've called our least loss model i'm going to take this model and i'm going to predict x test on that and you'll see that it gives me some values that are really close to zero and some that are really close to one and that's because we have a sigmoid output so if i do this what i can do is i can cast them so i'm going to say anything that's greater than 0.5 set that to 1. so if i actually i think what happens if i do this oh okay so i have to cast that as type and so now you'll see that it's ones and zeros and i'm actually going to transform this into a column as well so here i'm going to oh oops uh i didn't mean to do that okay no i wanted to just reshape it to that so now it's one dimensional okay and using that we can actually just rerun the classification report based on these this neural net output and you'll see that okay the the f1 or the accuracy gives us 87 so it seems like what happened here is the precision on uh class 0 so the hadrons has increased a bit but the recall decreased but the f1 score is still at a good 0.81 and um for the other class it looked like the precision decreased a bit the recall increased for an overall f1 score that's also been increased i think i interpreted that properly i mean we went through all this work and we got a model that performs actually very very similarly to the svm model that we had earlier and the whole point of this exercise was to demonstrate okay these are how you can define your models but it's also to say hey maybe you know neural nets are very very powerful as you can tell but sometimes you know an svm or some other model might actually be more appropriate but in this case i guess it didn't really matter which one we used at the end um an 87 percent accuracy score is still pretty good so yeah let's now move on to regression we just saw a bunch of different classification models now let's shift gears into regression the other type of supervised learning if we look at this plot over here we see a bunch of scattered data points and here we have our x value for those data points and then we have the corresponding y value which is now our label and when we look at this plot well our goal in regression is to find the line of best fit that best models this data essentially we're trying to let's say we're given some new value of x that we don't have in our sample we're trying to say okay what would my prediction for y be for
that given x value so that you know might be somewhere around there i don't know but remember in regression that you know given certain features we're trying to predict some continuous numerical value for y in linear regression we want to take our data and fit a linear model to this data so in this case our linear model might look something along the lines of here right so this here would be considered as maybe our line of best fit and this line is modeled by the equation i'm going to write it down here y equals b 0 plus b 1 x now b0 just means it's this y-intercept so if we extend this y down here this value here is b0 and then b1 defines the slope of this line okay all right so that's the that's the formula for linear regression and how exactly do we come up with that formula what are we trying to do with this linear regression you know we could just eyeball where should the line be but humans are not very good at eyeballing certain things like that i mean we can get close but a computer is better at giving us a precise value for b0 and b1 well let's introduce the concept of something known as a residual okay so residual you might also hear this being called the error and what that means is let's take some data point in our data set and we're going to evaluate how far off is our prediction from a data point that we already have so this here is our y let's say this is one two three four five six seven eight so this is y eight let's call it you'll see that i use this y i in order to represent hey just one of these points okay so this here is y and this here would be the prediction oops this here would be the prediction for y 8 which i've labeled with this hat okay if it has a hat on it that means hey this is what this is my guess this is my prediction for you know this specific value of x okay now the residual would be this distance here between y eight and y hat eight so y eight minus y hat eight all right because that would give us this here and i'm just going to take the absolute value of this because what if it's below the line right then you would get a negative value but distance can't be negative so we're just going to put a little hat or we're going to put a little absolute value around this quantity and that gives us the residual or the error so let me rewrite that and you know to generalize to all the points i'm going to say the residual can be calculated as y i minus y hat of i okay so this just means the distance between some given point and its prediction its corresponding prediction on the line so now with this residual this line of best fit is generally trying to decrease these residuals as much as possible so now that we have some value for the error our line of best fit is trying to decrease the error as much as possible for all of the different data points and that might mean you know minimizing the sum of all the residuals so this here this is the sum symbol and if i just stick the residual calculation in there it looks something like that right and i'm just going to say okay for all of the i's in our data set so for all the different points we're going to sum up all the residuals and i'm going to try to decrease that with my line of best fit so i'm going to find the b0 and b1 which gives me the lowest value of this okay now in other you know sometimes in different circumstances we might attach a squared to that so we're trying to decrease the sum of the squared residuals and what that does is it just you know it adds a higher penalty for how far off we are from you know 
points that are further off so that is linear regression we're trying to find this equation some line of best fit that will help us decrease this measure of error with respect to all the data points that we have in our data set and try to come up with the best prediction for all of them this is known as simple linear regression basically that means you know our equation looks something like this now there's also multiple linear regression which just means that hey if we have more than one value for x so like think of our feature vectors we have multiple values in our x vector then our predictor might look something more like this actually i'm just going to say etc plus b n x n so now i'm coming up with some coefficient for all of the different x values that i have in my vector now you guys might have noticed that i have some assumptions over here and you might be asking okay kylie what in the world do these assumptions mean so let's go over them the first one is linearity and what that means is let's say i have a data set okay linearity just means okay my does my data follow a linear pattern does y increase as x increases or does y decrease as x increases so if y increases or decreases at a constant rate as x increases then you're probably looking at something linear so what's an example of a non-linear data set let's say i had data that might look something like that okay so now just visually judging this you might say okay seems like the line of best fit might actually be some curve like this right and in this case we don't satisfy that linearity assumption anymore so with linearity we basically just want our data set to follow some sort of linear trajectory and independence our second assumption just means this point over here it should have no influence on this point over here or this point over here or this point over here so in other words all the points all the samples in our data set should be independent okay they should not rely on one another they should not affect one another okay now normality and homoscedasticity those are concepts which use this residual okay so if i have a plot that looks something like this and my line of best fit is somewhere here maybe it's something like that in order to look at these normality and homoscedasticity assumptions let's look at the residual plot okay and what that means is i'm going to keep my same x axis but instead of plotting now where they are relative to this y i'm going to plot these errors so now i'm going to plot y minus y hat like this okay and now you know this one is slightly positive so it might be here this one down here is negative it might be here so our residual plot it's literally just a plot of how you know the values are distributed around our line of best fit so it looks like it might you know look something like this okay so this might be our residual plot and what normality means so our assumptions are normality and homoscedasticity i might have butchered that spelling i don't really know but what normality is saying is saying okay these residuals should be normally distributed okay around this line of best fit it should follow a normal distribution and now what homoscedasticity says okay our variance of these points should remain constant throughout so this spread here should be approximately the same as this spread over here
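to collect the regression formulas from the last couple of minutes in one place before moving on:

$$\text{simple: } \hat{y} = b_0 + b_1 x \qquad \text{multiple: } \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

$$\text{residual: } \left|y_i - \hat{y}_i\right| \qquad \text{and the fit minimizes } \sum_{i} \left(y_i - \hat{y}_i\right)^2$$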
now what's an example of where you know homoscedasticity is not held well let's say that our original plot actually looks something like this okay so now if we looked at the residuals for that it might look something like that and now if we look at this spread of the points it decreases right so now the spread is not constant which means that homoscedasticity um this this assumption would not be fulfilled and it might not be appropriate to use linear regression so that's just linear regression basically we have a bunch of data points we want to predict some y value for those and we're trying to come up with this line of best fit that best describes hey given some value x what would be my best guess of what y is so let's move on to how do we evaluate a linear regression model so the first measure that i'm going to talk about is known as mean absolute error or m-a-e for short okay and mean absolute error is basically saying all right let's take all the errors so all these residuals that we talked about let's sum up the distance for all of them and then take the average and then that can describe you know how far off are we so the mathematical formula for that would be okay let's take all the residuals all right so this is the distance actually let me redraw a plot down here so suppose i have a data set looks like this and here are all of my data points right and now let's say my line looks something like that so my mean absolute error would be summing up all of these values this was a mistake so summing up all of these and then dividing by how many data points i have so what would be all the residuals it would be y i right so every single point minus y hat i so the prediction for that on here and then we're going to sum over all of the different i's in our data set right so i and then we divide by the number of points we have so actually i'm going to rewrite this to make it a little clearer so i is equal to whatever the first data point is all the way through the nth data point and then we divide it by n which is how many points there are okay so this is our measure of m a e and this is basically telling us okay on average this is the distance between our predicted value and the actual value in our training set okay and mae is good because it allows us to you know when we get this value here we can literally directly compare it to whatever units the y value is in so let's say y is we're talking you know the prediction of the price of a house right in dollars once we have once we calculate the mae we can literally say oh the average you know price the average um how much we're off by is literally this many dollars okay so that's the mean absolute error an evaluation technique that's also closely related to that is called the mean squared error and this is mse for short okay now if i take this plot again and i duplicate it and move it down here well the gist of mean squared error is kind of the same but instead of the absolute value we're going to square so now the mse is something along the lines of okay let's sum up something right so we're going to sum up all of our errors so now i'm going to do y i minus y hat i but instead of absolute valuing them i'm going to square them all and then i'm going to divide by n in order to find the mean so basically now i'm taking all of these different values and i'm squaring them first before i add them to one another and then i divide by n and the reason why we like using mean squared error is that it helps us punish large errors in the prediction and later on mse might be important because of differentiability right so a quadratic equation is differentiable you know if you're familiar with calculus a
quadratic equation is differentiable whereas the absolute value function is not totally differentiable everywhere but if you don't understand that don't worry about it you won't really need it right now and now one downside of mean squared error is that once i calculate the mean squared error over here and i go back over to y and i want to compare the values well it gets a little bit trickier to do that because now my mean squared error is in terms of y squared right it's this is now squared so instead of just dollars how you know how many dollars off am i i'm talking how many dollars squared off am i and that you know to humans it doesn't really make that much sense which is why we have created something known as the root mean square error and i'm just going to copy this diagram over here because it's very very similar to mean squared error except now we take a big square root okay so this is rmse and we take the square root of that mean squared error and so now the term in which you know we're defining our error is now in terms of that dollar sign symbol again so that's a pro of root mean squared error is that now we can say okay our error according to this metric is this many dollar signs off from our predictor okay so it's in the same unit which is one of the pros of root mean squared error and now finally there is the coefficient of determination or r squared and this is the formula for r squared so r squared is equal to 1 minus rss over tss okay so what does that mean basically rss stands for the sum of the squared residuals so maybe it should be ssr instead but rss sum of the squared residuals and this is equal to if i take the sum of all the values and i take y i minus y hat i and square that that is my rss right it's the sum of the squared residuals now tss let me actually use a different color for that so tss is the total sum of squares and what that means is that instead of being with respect to this prediction we are instead going to take each y value and just subtract the mean of all the y values and square that okay so if i drew this out and if this were my actually let's use a different color let's use green if this were my predictor so rss is giving me this measure here right it's giving me some estimate of how far off we are from our regressor that we predicted actually i'm going to use red for that well tss on the other hand is saying okay how far off are these values from the mean so if we literally didn't do any calculations for the line of best fit if we just took all the y values and averaged all of them and said hey this is the average value for every single x value i'm just going to predict that average value instead then it's asking okay how far off are all these points from that line okay and remember that this square means that we're punishing larger errors right so even if they look somewhat close in terms of distance the further a few data points are then the further the larger our total sum of squares is going to be sorry that was my dog so the total sum of squares is taking all of these values and saying okay what is the sum of squares if i didn't do any regressor and i literally just calculated the average of all the y values in my data set and for every single x value i'm just going to predict that average which means that okay like that means that maybe y and x aren't associated with each other at all like the best thing that i can do for any new x value just predict hey this is the average of my data set and this total sum of squares is
saying okay well with respect to that average what is our error right so up here the sum of the squared residuals this is telling us what is our what what is our error with respect to this line of best fit while our total sum of squares is saying what is the error with respect to you know just the average y value and if our line of best fit is a better fit than this total sum of squares that means that you know this numerator that means that this numerator is going to be smaller than this denominator right and if our errors in our line of best fit are much smaller then that means that this ratio of the rss over tss is going to be very small which means that r squared is going to go towards one and now when r squared is towards one that means that that's usually a sign that we have a good predictor it's one of the signs not the only one so over here i also have you know that there's this adjusted r squared and what that does it just adjusts for the number of terms so x1 x2 x3 etc it adjusts for how many extra terms we add because usually when we um you know add an extra term the r squared value will increase because that'll help us predict y some more but the value for the adjusted r squared increases if the new term actually improves this model fit more than expected you know by chance so that's what adjusted r squared is i'm not you know it's out of the scope of this one specific course and now that's linear regression basically i've covered the concept of residuals or errors and you know how do we use that in order to find the line of best fit and you know our computer can do all the calculations for us which is nice but behind the scenes it's trying to minimize that error right and then we've gone through all the different ways of actually evaluating a linear regression model and the pros and cons of each one
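and side by side, the evaluation formulas just covered:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \text{RMSE} = \sqrt{\text{MSE}}$$

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$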
so now let's look at an example so we're still on supervised learning but now we're just going to talk about regression so what happens when you don't just want to predict you know type one two three what happens if you actually wanna predict a certain value so again i'm on the uci machine learning repository and here i found this data set about bike sharing in seoul south korea so this data set is predicting rental bike count and here it's the count of bikes rented at each hour so what we're going to do again you're going to go into the data folder and you're going to download this csv file and we're going to move over to colab again and here i'm going to name this fcc bikes and regression i don't remember what i called the last one but yeah fcc bikes regression now i'm going to import a bunch of the same things that i did earlier um and you know i'm gonna also continue to import the oversampler and the standard scaler and then i'm actually also just going to let you guys know that i have a few more things i wanted to import so this is a library that lets us copy things uh seaborn is a wrapper over matplotlib so it also allows us to plot certain things and then just letting you know that we're also going to be using tensorflow okay so one more thing that we're also going to be using we're going to use the sklearn linear model library and actually let me make my screen a little bit bigger so yeah awesome run this and that'll import all the things that we need so again i'm just going to you know give some credit to where we got this data set so let me copy and paste um this uci thing and i will also give credit to this here okay cool all right cool so this is our data set and again it tells us all the different attributes that we have right here so i'm actually going to go ahead and paste this in here um feel free to copy and paste this if you want me to read it out loud so you can type it it's bike count hour temp humidity wind visibility dew point temp radiation rain snow and functional whatever that means okay so i'm going to come over here and import my data by dragging and dropping all right now one thing that you guys might actually need to do is you might actually have to open up the csv because there were at first a few um like forbidden characters in mine at least so you might have to get rid of like i think there was a degree here but my computer wasn't recognizing it so i got rid of that so you might have to go through and get rid of some of those labels that are incorrect i'm gonna do this okay but after we've done that we've imported in here i'm going to create a data a data frame from that so all right so now what i can do is i can read that csv file and i can get the data into here so SeoulBikeData.csv okay so now if i call data.head you'll see that i have all the various labels right and then i have the data in there so i'm going to from here um i'm actually going to get rid of some of these columns that you know i don't really care about so here i'm going to when i when i type this in i'm going to drop maybe the date whether or not it's a holiday and the various seasons so i'm just not going to care about these things axis equals one means drop it from the columns so now you'll see that okay we still have i mean i guess you don't really notice it but if i set the data frames columns equal to the dataset cols and i look at you know the first five things then you'll see that this is now our data set it's a lot easier to read so another thing is i'm actually going to df functional and we're going to create this so remember that our computers are not very good at language we want it to be in zeros and ones so here i will convert that well if this is equal to yes then that that gets mapped as one so then set type integer all right great cool so the thing is right now these bike counts are for whatever hour so to make this example simpler i'm just going to index on an hour and i'm going to say okay we're only going to use that specific hour so here let's say um so this data frame is only going to be data frame where the hour let's say it equals 12 okay so it's noon all right so now you'll see that all the hours are equal to 12 and i'm actually going to now drop that column uh axis equals one all right so if we run this cell okay so now we got rid of the hour in here and we just have the bike count the temperature humidity wind visibility and yada yada yada all right so what i want to do is i'm going to actually plot all of these so for i in all the columns so the range length of uh whatever this data frame is and all the columns because i don't have bike count as actually it's my first thing so what i'm going to do is say for a label in data frame columns everything after the first thing so that would give me the temperature and onwards so these are all my features right uh i'm going to just scatter so i want to see how that label how that specific data um how that affects the bike count so i'm going to plot the bike count on the y-axis and i'm going to plot you know whatever the specific label is on the x-axis and i'm going to title this uh whatever the label is and you know make my y label the bike count at noon and the x label as just the label okay
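a sketch of roughly what this loading and cleanup looks like — the file name and the original column names being dropped (Date, Holiday, Seasons, plus the Yes/No functional column) are assumptions based on the uci page, so adjust them if yours differ:

```python
import pandas as pd
import matplotlib.pyplot as plt

dataset_cols = ["bike_count", "hour", "temp", "humidity", "wind", "visibility",
                "dew_pt_temp", "radiation", "rain", "snow", "functional"]

df = pd.read_csv("SeoulBikeData.csv").drop(["Date", "Holiday", "Seasons"], axis=1)
df.columns = dataset_cols

# map the yes/no functional column to 1/0, keep only the noon rows, then drop hour
df["functional"] = (df["functional"] == "Yes").astype(int)
df = df[df["hour"] == 12]
df = df.drop(["hour"], axis=1)

# scatter every feature against the bike count to eyeball which ones look linear
for label in df.columns[1:]:
    plt.scatter(df[label], df["bike_count"])
    plt.title(label)
    plt.ylabel("Bike Count at Noon")
    plt.xlabel(label)
    plt.show()
```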
now i guess we don't even need the legend so just show that plot all right so it seems like functional is not really uh doesn't really give us any utility so then snow rain um seems like this radiation you know is fairly linear dew point temperature visibility uh wind doesn't really seem like it does much humidity kind of maybe like an inverse relationship but the temperature definitely looks like there's a relationship between that and the number of bikes right so what i'm actually going to do is i'm going to drop some of the ones that don't don't seem like they really matter so maybe wind you know visibility yeah so i'm going to get rid of wind visibility and functional so let me now take the data frame and i'm going to drop wind visibility and functional all right and the axis again is the column so that's one so if i look at my data set now i have just the temperature the humidity the dew point temperature radiation rain and snow so again what i want to do is i want to split this into my training my validation and my test data set just as we talked before here uh we can use the exact same thing that we just did and we can say numpy.split and sample you know that the whole sample um and then create our splits of the data frame and we're going to do that but now set this to 8. okay so i don't really care about you know the the full grid the full array so i'm just gonna use an underscore for that variable but i will get my training x and y's and actually i don't have a um function for getting the x and y's so here i'm going to write a function to define get x y and uh i'm going to pass in the data frame and i'm actually going to pass in what the name of the y label is and what the x what specific x labels i want to look at so here if that's none then i'm just not like i'm only going to i'm going to get everything from the data set that's not the y label so here i'm actually going to make first a deep copy of my data frame and that basically means i'm just copying everything over if uh if like x labels is none so if not x labels then all i'm going to do is say all right x is going to be whatever this data frame is and i'm just going to take all the columns so c for c in data frame dot columns if c does not equal the y label all right and i'm gonna get the values from that but if there is the x labels well okay so in order to index only one thing so like let's say i pass in only one thing in here um then my data frame is so let me make a case for that so if the length of x labels is equal to one then what i'm going to do is just say that this is going to be uh x labels and add that just that label um values and i actually need to reshape to make this 2d so i'm going to pass a negative 1 comma 1 there now otherwise if i have like a list of specific x labels that i want to use then i'm actually just going to say x is equal to data frame of those x labels dot values and that should suffice all right so now that's just me extracting x and in order to get my y i'm going to do y equals data frame and then pass in the y label and at the very end i'm going to say data equals np dot h stack so i'm stacking them horizontally one next to each other and i'll take x and y and return that oh but uh this needs to be values and i'm actually going to reshape this to make it 2d as well so that we can do this h stack and i will return data x y
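a sketch of this helper plus the split and the one-feature regression that comes next — the 60/20/20 split boundaries and the variable names are assumptions here, and df is the cleaned dataframe from the previous sketch:

```python
import copy
import numpy as np
from sklearn.linear_model import LinearRegression

def get_xy(dataframe, y_label, x_labels=None):
    dataframe = copy.deepcopy(dataframe)
    if x_labels is None:
        # use every column except the label as a feature
        X = dataframe[[c for c in dataframe.columns if c != y_label]].values
    elif len(x_labels) == 1:
        # a single feature still needs to be a 2d column vector
        X = dataframe[x_labels[0]].values.reshape(-1, 1)
    else:
        X = dataframe[x_labels].values
    y = dataframe[y_label].values.reshape(-1, 1)
    data = np.hstack((X, y))
    return data, X, y

# shuffle, then split into roughly 60% train, 20% validation, 20% test
train, val, test = np.split(df.sample(frac=1), [int(0.6 * len(df)), int(0.8 * len(df))])

_, X_train_temp, y_train_temp = get_xy(train, "bike_count", x_labels=["temp"])
_, X_test_temp, y_test_temp = get_xy(test, "bike_count", x_labels=["temp"])

# simple linear regression on just the temperature column
temp_reg = LinearRegression().fit(X_train_temp, y_train_temp)
print(temp_reg.coef_, temp_reg.intercept_)
print(temp_reg.score(X_test_temp, y_test_temp))  # r squared on the test set
```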
So now I can call get_xy on each split, with the y label "bike_count", and for the x labels let's just do one dimension right now. Earlier, before I got rid of the plots, we saw that the temperature dimension looked promising, so we might be able to use just that to predict y. I'll name these variables so it's clear they only use the temperature, and I'll do the same thing for the validation and the test sets, being careful that the validation call really uses the validation split and the test call uses the test split. So we run this, and now we have training, validation, and test data sets for just the temperature; if I look at X_train_temp, it's literally just the temperature. I'm doing this first to show you simple linear regression. I can create a regressor, temp_reg, as a LinearRegression model, and just like before I simply fit it on X_train_temp and y_train_temp to train the model. Then I can print the regressor's coefficient and its intercept, and I can score it on the test set to get the R squared, which comes out to around 0.38. That's better than zero, which would mean there's absolutely no association, but it's also not great; it depends on the context, but the higher that number, the more the two variables are correlated, so here there's maybe some association between the two. The reason I wanted to do this in one dimension is so I could plot it: I make a scatter plot of the training data in blue, then define an x range with np.linspace from -20 to 40 (roughly the range of this data) with 100 points, and plot the regressor's predictions over that range as a red line labeled "fit", setting the linewidth so I can change how thick the line is. At the end I add a legend, a title ("Bikes vs Temp"), a y label ("number of bikes"), and an x label ("temp"). This actually throws an error because predict expects a 2D array, so I have to turn the linspace output into an array and reshape it to 2D, and then it works. Now we can see the fit increases with temperature, but remember the assumptions we talked about for linear regression; I don't really know whether this data satisfies them. I just wanted to show you what a line of best fit through this data would look like.
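Here's a minimal sketch of that simple linear regression cell, continuing from the helper above (the 0.38 figure is simply what the video reports, not something to expect exactly):

```python
from sklearn.linear_model import LinearRegression

# temperature-only splits
_, X_train_temp, y_train_temp = get_xy(train, "bike_count", x_labels=["temp"])
_, X_val_temp,   y_val_temp   = get_xy(val,   "bike_count", x_labels=["temp"])
_, X_test_temp,  y_test_temp  = get_xy(test,  "bike_count", x_labels=["temp"])

temp_reg = LinearRegression()
temp_reg.fit(X_train_temp, y_train_temp)
print(temp_reg.coef_, temp_reg.intercept_)
print(temp_reg.score(X_test_temp, y_test_temp))  # R^2, roughly 0.38 in the video

# scatter the training data and overlay the fitted line
plt.scatter(X_train_temp, y_train_temp, label="Data", color="blue")
x = np.linspace(-20, 40, 100)
plt.plot(x, temp_reg.predict(x.reshape(-1, 1)), label="Fit", color="red", linewidth=3)
plt.legend()
plt.title("Bikes vs Temp")
plt.ylabel("Number of bikes")
plt.xlabel("Temp")
plt.show()
```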
Now we can do multiple linear regression. What does my current data set look like? Let's use all of the columns except for the bike count, so for the x labels I can take the data frame's columns and just remove bike count, or equivalently pass x_labels=None so that get_xy uses everything that isn't the y label. Because it's not just the temperature anymore, I'll quickly rerun the earlier cell so that we have both the temperature-only data set and the all-features data set. Then I make another regressor, all_reg, as a LinearRegression model and fit it on X_train_all and y_train_all. If I score this regressor on the test data set, the R squared improves from roughly 0.4 to roughly 0.5, which is a good sign. I can't really plot every single dimension, but the point is that this improved. Now, one cool thing you can do with TensorFlow is regression, but with a neural net. We already have our training data for just the temperature and for all the columns, so I'm not going to bother splitting the data again; I'll just start building the model. In a linear regression model it typically helps to normalize the inputs, and that's very easy to do with TensorFlow: I create a normalization layer with tf.keras.layers.Normalization, with an input shape of 1, because we're doing just the temperature again, and axis set to None, and then I adapt this temp_normalizer to X_train_temp reshaped to a single vector. For the model, temp_nn_model, I use tf.keras.Sequential and pass in the normalizer layer followed by a single Dense layer with one unit. One single node means it's linear, and if you don't add any activation function the output is also linear. Then I compile the model: for the optimizer let's use Adam again, tf.keras.optimizers.Adam, with a learning rate; let's give this one 0.1. For the loss I'll use mean squared error. We run that, and it's compiled. Just like before, I keep the history returned by fit: I fit the model on X_train_temp (reshaped) and y_train_temp, set verbose to 0 so it doesn't display everything, set epochs to 1000, and pass the validation data set in as a tuple for validation_data. Up here I've also copied and pasted the plot_loss helper from before, but changed the y label to MSE, because now we're dealing with mean squared error.
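A sketch of those two cells, continuing from the code above and assuming a recent TensorFlow 2.x where tf.keras.layers.Normalization is available (the variable names are my own shorthand for what the video builds):

```python
import tensorflow as tf

# multiple linear regression on all remaining features
_, X_train_all, y_train_all = get_xy(train, "bike_count", x_labels=None)
_, X_val_all,   y_val_all   = get_xy(val,   "bike_count", x_labels=None)
_, X_test_all,  y_test_all  = get_xy(test,  "bike_count", x_labels=None)

all_reg = LinearRegression()
all_reg.fit(X_train_all, y_train_all)
print(all_reg.score(X_test_all, y_test_all))  # R^2 improves to roughly 0.5

# the same 1-D regression, but as a single linear node trained by gradient descent
temp_normalizer = tf.keras.layers.Normalization(input_shape=(1,), axis=None)
temp_normalizer.adapt(X_train_temp.reshape(-1))

temp_nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(1),   # one unit, no activation -> linear output
])
temp_nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
                      loss="mean_squared_error")

history = temp_nn_model.fit(X_train_temp.reshape(-1), y_train_temp,
                            epochs=1000, verbose=0,
                            validation_data=(X_val_temp, y_val_temp))
```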
I'm going to plot the loss from this history once it's done training, so let's wait for it to finish. Okay, this actually looks pretty good; we can see the values converging. Now I'll go back up, take that earlier scatter-plus-fit plot, and run it again, but instead of the temperature regressor I'll use this neural net model. If I run that, you can see it also gives me a linear fit, although it's not entirely the same as the one from before; that's due to the training process of the neural net. They're just two different ways of trying to find the best linear regressor: here we're using backpropagation to train a single neural net node, whereas the other one is presumably just computing the line of best fit directly. We could repeat the exact same exercise with our multiple linear regression, but I'm going to skip that part and leave it as an exercise for the viewer. So now, what would happen if we used a real neural net, instead of just one single node, to predict this? We already have our normalizer, so I'll take the same setup, but instead of that one dense layer I'll use a Dense layer with 32 units and a relu activation, duplicate it, and for the final output I just want one answer, one cell, also with a relu activation, because I can't ever have fewer than zero bikes. I'll name this the nn_model. At the bottom I compile it with the same settings as before, except instead of a learning rate of 0.01 I'll use 0.001. Then I train it: history is nn_model.fit on X_train_temp and y_train_temp, with validation_data set to X_val_temp and y_val_temp, verbose set to 0, and epochs set to 100; let's not bother with a batch size right now and just see what happens. Again we can plot the loss of this history after it's done training. The first run isn't what we're supposed to get, and after poking at the Sequential model (and wondering whether the temperature normalizer needed to be redone), we do see the loss decline; it's an interesting curve, but it is decreasing, which is a good sign. What's more interesting is plotting the model's predictions again: now we get a curve rather than a line. I also tried getting rid of the relu activation on the output and training again, and even without that final relu it does a reasonable job; maybe it's not the best model, and maybe one more layer would help. These are just things you have to play around with when you're working with machine learning; you don't really know ahead of time what the best model is going to be.
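A sketch of the nonlinear model being described, reusing the normalizer from the previous block (layer sizes and learning rate are the ones stated in the video; everything else is my own naming):

```python
# a small nonlinear network on the temperature feature alone
nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),  # relu output: bike counts can't be negative
])
nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss="mean_squared_error")

history = nn_model.fit(X_train_temp, y_train_temp,
                       epochs=100, verbose=0,
                       validation_data=(X_val_temp, y_val_temp))
```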
For example, this model isn't brilliant either, but I guess it's okay. My point, though, is that with a neural net this fit isn't great, but there's also no data down in that low region, so it's hard for the model to predict there; we probably should have started the prediction range where the data actually starts. What matters is that with this neural net model the predictor is no longer linear, yet we still get an estimate of the value. And we can repeat the exact same exercise with multiple inputs: I create an all_normalizer for the full feature set, pass that into the same kind of model, compile it with the same parameters, and then when fitting, instead of the temperature data I use the larger data set with all the features, and of course plot the loss. It's an interesting-looking curve, but it's decreasing. Before, we saw that our R squared score was around 0.52; we don't really have an R squared with the neural net anymore, but one thing we can measure for both models is the mean squared error. So down at the bottom I'll calculate the mean squared error for both the multiple linear regressor and the neural net: I predict X_test_all with the linear regressor to get one set of predictions, and with the neural net model's predict on X_test_all I get the other. To compute the mean squared error given y_pred and y_real, I take np.square of (y_pred minus y_real), which squares everything element-wise, and then take the mean of that vector; y_real here is y_test. Running this at first throws an error, which I'll debug live: it turns out to come from the normalization layer, because my inputs are one-dimensional vectors of length six, so the input shape should have been (6,), a tuple containing a six, from the start. Once that's fixed it runs, and it's actually interesting that my neural net results have a larger mean squared error than my linear regressor. One more thing we can look at is plotting the actual results against the predictions: I grab an axes object with plt.axes and make the aspect equal, then scatter the true test values on the x-axis against each model's predictions on the y-axis, labeled as the linear regression predictions and the neural net predictions. I label the x-axis "true values" and the y-axis "predictions", set the x and y limits to roughly the maximum number of bikes (I used 2500, though something like 1800 probably would have been enough), and add a legend.
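A sketch of that comparison, continuing from the earlier blocks; nn_model_all and MSE are my own names for what the video builds, and the (6,) input shape assumes the six features left after the earlier column drops:

```python
# neural net on all features; each input is a length-6 feature vector
all_normalizer = tf.keras.layers.Normalization(input_shape=(6,), axis=-1)
all_normalizer.adapt(X_train_all)

nn_model_all = tf.keras.Sequential([
    all_normalizer,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),
])
nn_model_all.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                     loss="mean_squared_error")
history = nn_model_all.fit(X_train_all, y_train_all, epochs=100, verbose=0,
                           validation_data=(X_val_all, y_val_all))

# compare mean squared error of the two models on the test set
def MSE(y_pred, y_real):
    return np.square(y_pred - y_real).mean()

y_pred_lr = all_reg.predict(X_test_all)
y_pred_nn = nn_model_all.predict(X_test_all)
print(MSE(y_pred_lr, y_test_all), MSE(y_pred_nn, y_test_all))

# true values vs predictions for both models
ax = plt.axes(aspect="equal")
ax.scatter(y_test_all, y_pred_lr, label="Lin Reg Preds")
ax.scatter(y_test_all, y_pred_nn, label="NN Preds")
lims = [0, 2500]
ax.set_xlim(lims); ax.set_ylim(lims)
plt.xlabel("True Values"); plt.ylabel("Predictions")
plt.legend()
plt.show()
```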
Looking at the result, the linear regressor's predictions align with the true values reasonably well, at least to some extent. For the neural net, you can see that the larger values are a bit more spread out, it tends to underestimate a little in the lower range, and for some reason a few points are way off. So we've now used both a linear regressor and a neural net. Honestly, sometimes a neural net is more appropriate and sometimes a linear regressor is; I think figuring out which comes with time and with literally trying both and seeing what works better. Here a multiple linear regressor might actually work better than a neural net, but in the one-dimensional case, for example, a linear regressor would never be able to capture that curve. I'm not saying the neural net is a great model either, just that sometimes it's more appropriate to use something that isn't linear. So I'll leave regression at that. We've just talked about supervised learning: we have data, a bunch of features for a bunch of different samples, and each of those samples has some sort of label on it, whether that's a number, a category, a class, and so on, and we were able to use those labels to try to predict the labels of new points we haven't seen yet. Now let's move on to unsupervised learning, where we have a bunch of unlabeled data; what can we do with that, and can we learn anything from it? The first algorithm we're going to discuss is known as k-means clustering, which tries to compute k clusters from the data. In the example below I have a bunch of scattered points, with x0 and x1 on the two axes, so I'm actually plotting two different features of each point, but we don't know the y label for any of those points. Just looking at the scatter, we can kind of see that there are different clusters in the data set, and depending on what we pick for k we get different clusterings: with k equal to two we might group the points into two clusters, and with k equal to three we might see three. This k is predefined by the person running the model, which would be you. Now let's discuss how the computer actually goes through and computes the k clusters. The first step is that the computer chooses k random points on the plot to be the initial centroids.
By centroids I just mean the centers of the clusters. So with k equal to three we choose three random points to be the centroids of the three clusters; if k were two, we'd choose two. The second step is to calculate the distance from every point to each of those centroids, not the distances between the centroids themselves, but from each data point to every centroid, and that comes with assigning each point to its closest centroid. What do I mean by that? Take one point: I compute its distance to the red, blue, and green centroids, see that the red one is the closest, and put that point into the red centroid's cluster. Doing that for all of the points, several end up closest to red, in this example none happen to land closest to blue, and the rest are closest to green. So now we have our groups: the red group here, the green group there, and blue, which hasn't captured any points yet. Step three is to recalculate the centroids: for each cluster we compute a new centroid based on the points assigned to it, which just means taking the average of those points; the blue centroid has no points, so we don't touch it. Then, with the previously computed centroids erased, we go back and redo step two, recomputing the distances and reassigning every point to whichever centroid is now closest. After this pass some points that were red stay red, some points near the boundary are now closer to blue than to green, and the rest belong to green, so the three clusters shift. And we keep alternating: recompute the centroids from the new assignments, then reassign each point to
the closest centroid. We repeat this a few times: reassign the points, recompute the centroids, reassign again, and at some point, when we go to assign all the points to clusters, we get the exact same thing as the previous pass; nothing changes. That's when we know we can stop iterating between steps two and three: we've converged on a stable solution, and we can go back to the user and say, hey, these are our three clusters. This process is something known as expectation-maximization: the part where we assign each point to its closest centroid is the expectation step, and the part where we compute the new centroids is the maximization step. We use it to compute the centroids, assign all the points to clusters according to those centroids, and then recompute all of that again until we reach a stable point where nothing is changing anymore. So that's our first example of unsupervised learning, and what it's basically doing is trying to find some structure, some pattern, in the data. If a new point came along, I could say, oh, it looks like that's closest to cluster B, so I'd put it in cluster B. In other words, we can find structure in the data based purely on how the points are scattered relative to one another.
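The video describes this algorithm pictorially; purely as an illustration, here is a minimal from-scratch sketch of the same expectation-maximization loop (this is not the sklearn KMeans used later, and the function and variable names are my own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # expectation step: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # maximization step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: nothing is changing
            break
        centroids = new_centroids
    return centroids, labels

# usage: three blobs of 2-D points, clustered into k = 3 groups
X = np.vstack([np.random.randn(50, 2) + offset for offset in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3)
```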
Now, the second unsupervised learning technique I'm going to discuss with you is principal component analysis, and very often PCA is used as a dimensionality reduction technique. By dimensionality reduction I mean: if I have a bunch of features, x1, x2, x3, x4, and so on, can I reduce them down to, say, one dimension that gives me the most information about how all these points are spread relative to one another? That's what PCA is for. Say I have some points in an x0/x1 feature space; if this had something to do with housing, x0 might be years since the house was built and x1 might be the square footage. PCA asks: if we want to display or model something about our data but we don't have two axes to show it on, how do we still demonstrate that this point is further away from that point than from some other point? Take what you know about linear regression and forget about it for a second, otherwise you might get confused. PCA is a way of finding the direction in the space with the largest variance; that principal component, some direction in this space, is the single dimension that tells us the most about our data set. Say somebody tells us we only get one dimension to show our data in: we want to project our data onto a single dimension, which in this case might be a diagonal direction through the cloud of points. This is not linear regression: there is no y value, no label we're predicting. Instead, we take the right-angle projection of every point onto that line, and PCA maps all of the points onto that one-dimensional space. The projected values become our new one-dimensional data set; it's not a prediction, it's a new data set. If somebody came to us and said you only get one number to represent each of these 2D points, this is the number we'd give. The direction PCA chooses is the one along which the points are the most spread out. To see why that matters, imagine projecting the same data onto a different, worse direction: the projected points get squished closer together, so there's less variance and less discrimination between the points; the larger the variance, the further spread out the projected points will be, and so that's the dimension we should project onto. A different way to look at the largest-variance direction is that it also happens to be the direction that minimizes the residuals. In linear regression we were looking only at the vertical residual, the difference between y and y-hat; that's not what we do here. In principal component analysis we take the distance from each point in two-dimensional space to its projected point on the line, the projection residual, and we try to minimize that over all of
these points. Minimizing that actually equates to finding the largest-variance dimension: you can either look at the PCA dimension as the one that minimizes the projection residuals (the orange distances in the sketch) or as the one that maximizes the variance between the projected points. We're not really going to talk about the method you need to calculate the principal components, or what that projection is exactly, because you would need linear algebra for that, especially eigenvectors and eigenvalues, which I'm not going to cover in this class, but that's how you would find them. Now, we started from a 2D data set and boiled it down to one dimension, and we can go and do other things with that dimension. If there were a y label, we could plot the principal component against y rather than plotting x0 and x1 against y in separate plots, or, if there were a hundred different dimensions and we only wanted five of them, we could find the top five PCA dimensions, which might be a lot more useful than a hundred different feature values. So that's principal component analysis: again, we're taking unlabeled data and trying to make some estimate, some guess, about its structure. If we wanted to draw a 3D thing, like a sphere, but only had a 2D surface to draw it on, the best approximation we could make would be a circle; PCA is kind of the same idea. If we have something with all these different dimensions but we can't show all of them, how do we boil it down to just one dimension while extracting the most information from the many? The answer is exactly this: either you minimize the projection residuals or you maximize the variance, and that is PCA. We'll go through an example of that now.
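The course deliberately skips the eigenvector math, so purely as an optional illustration, here is a tiny numpy sketch of what that computation can look like, finding principal components as the top eigenvectors of the covariance matrix (sklearn's PCA, used below, does the same job more robustly via the SVD; the function name here is my own):

```python
import numpy as np

def pca_project(X, n_components=1):
    # center the data so variance is measured around the mean
    X_centered = X - X.mean(axis=0)
    # eigenvectors of the covariance matrix point along the largest-variance directions
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]           # sort directions by variance, largest first
    components = eigvecs[:, order[:n_components]]
    # right-angle projection of every point onto the top component(s)
    return X_centered @ components

# usage: reduce correlated 2-D points to their single largest-variance coordinate
X = np.random.randn(100, 2) @ np.array([[2.0, 0.5], [0.5, 1.0]])
X_1d = pca_project(X, n_components=1)
```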
Finally, let's move on to implementing the unsupervised learning part of this class. Again I'm on the UCI machine learning repository, and I have a seeds data set where I have a bunch of kernels that belong to three different types of wheat: Kama, Rosa, and Canadian. The features we have access to are geometric parameters of those wheat kernels: the area, perimeter, compactness, length, width, asymmetry, and the length of the kernel groove. These are all real values, which is easy to work with, and what we're going to do is try to cluster the different varieties of wheat. I have a Colab notebook open again; you'll have to go to the data folder and download the file, and then we can get started. The first thing is to import the seeds data set into the Colab notebook, which I've done here, and then import the classics again, pandas and so on, plus seaborn this time, because I'm going to want it for this part. The columns in our seeds data set are the area, perimeter, compactness, length, width, asymmetry, groove length (I'm just going to call it "groove"), and the class, the wheat kernel's class. Now we have to import the data, which I'll do using pandas read_csv on the file, which is actually called seeds_dataset.txt, turning it into a data frame with names set to the columns defined over here. If we look at the data frame at this point, you'll notice something funky: everything is crammed under "area", and the numbers have "\t" between them. That's because we haven't actually told pandas what the separator is; that "\t" is just a tab, and to make sure all whitespace gets recognized as a separator we can pass a whitespace separator. If I run that now, the data frame looks a lot better. So now let's visualize this data by plotting each feature against every other one. For this example, pretend we don't have access to the class; I'm only going to use it to show you that we can recover the classes with unsupervised learning. So for i in range over the columns (minus one, since the class is the last column), and for j in range from i+1 onwards to the end, this gives us a grid of all the different combinations: the x label is columns[i], the y label is columns[j], and I use seaborn's scatterplot with x as the x label, y as the y label, data as the data frame, and hue set to "class", which separates the three different classes into three different hues. Running that, perimeter versus area gives us three groups, area versus compactness gives us three groups, and so on; honestly a lot of them look somewhat similar. But look at compactness versus asymmetry: it just looks like blobs; sure, maybe class three is more over to one side, but classes one and two look like they're on top of each other. Some pairs might cluster slightly better than others.
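A minimal sketch of that loading and plotting step, assuming the file sits in the Colab working directory as seeds_dataset.txt:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = ["area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "class"]
# the file is whitespace-separated, so read it with a whitespace regex separator
df = pd.read_csv("seeds_dataset.txt", names=cols, sep="\s+")

# scatter every pair of features, colored by the (held-out) class label
for i in range(len(cols) - 1):
    for j in range(i + 1, len(cols) - 1):
        x_label, y_label = cols[i], cols[j]
        sns.scatterplot(x=x_label, y=y_label, data=df, hue="class")
        plt.show()
```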
So let's go through some of the clustering examples we talked about and try to implement them. The first thing we're going to do is straight-up clustering, and what we learned about was k-means, so from sklearn.cluster I'll import KMeans. Just for the sake of being able to run this on any x and any y, let's pick a pair: perimeter and asymmetry could be a good one, so x will be "perimeter", y will be "asymmetry", and I'll extract just those two columns' values as my X. Then I define the k-means model: in this specific case we know the number of clusters is three, so let's just use that, and I fit it on the X I've just defined. One cool thing is that I can look at kmeans.labels_ and it will give me the model's cluster prediction for every point, and if I pull the actual classes out of the data frame with its class values I can compare the two: in general most of the points it labeled zero are the true class one, the twos correspond to the twos, and the remaining cluster corresponds to class three. Remember, these are separate, unlabeled clusters, so what we call the labels doesn't really matter; we could map 0 to 1, 2 to 2, and 1 to 3, and our mapping would line up fairly well. We can also visualize this: I create a cluster data frame from a horizontally stacked array of X and the cluster labels reshaped to 2D, with the columns labeled "x", "y", and "class", and then do the same seaborn scatter plot again, where the hue is the class and the data is this cluster data frame. That gives me the k-means classes; if I come down here and plot my original data frame, with the original classes, for the same x and y, you'll see that honestly it doesn't do too poorly: the colors are different, but for the most part it gets the shape of the clusters right. And we can do this with higher dimensions: if I make X equal to all the columns except the last one, which is the class, I can fit k-means on all the features at once, plot the k-means classes next to the original, and it's pretty similar to what we just saw.
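A sketch of those cells, continuing from the seeds loading code above (kmeans_all is my own name for the all-features fit):

```python
import numpy as np
from sklearn.cluster import KMeans

# 2-D clustering on a single pair of features
x_label, y_label = "perimeter", "asymmetry"
X = df[[x_label, y_label]].values
kmeans = KMeans(n_clusters=3).fit(X)

# side-by-side comparison: k-means clusters vs. the true classes
cluster_df = pd.DataFrame(np.hstack((X, kmeans.labels_.reshape(-1, 1))),
                          columns=[x_label, y_label, "class"])
sns.scatterplot(x=x_label, y=y_label, hue="class", data=cluster_df)  # k-means clusters
plt.show()
sns.scatterplot(x=x_label, y=y_label, hue="class", data=df)          # original classes
plt.show()

# clustering on all seven features instead of just two
X_all = df[cols[:-1]].values
kmeans_all = KMeans(n_clusters=3).fit(X_all)
```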
What's actually really cool is to look at one of the pairs where the classes sat on top of each other, like compactness versus asymmetry, which was messy. If I run k-means on just those two dimensions and plot it in 2D, the clustering I get looks nothing like the original classes for that pair. But if I rerun the higher-dimensional k-means (remembering to grab its labels again) and then zoom out and look at just compactness and asymmetry, sure, the colors are mixed up, but in general the three groups are there; the higher-dimensional clustering does a much better job at assessing which group is which. For example, if you mentally map one of k-means' colors onto the corresponding true class's color, and do the same for the others, you can see that the groupings correspond: the point is not to compare the colors between the two plots, but to compare which points end up grouped together in each plot. So that's one cool application, and that's how k-means functions: it takes all the data and asks, where are my clusters? The next thing we talked about is PCA, where we reduce the dimension: we're mapping these seven dimensions down to a smaller number. From sklearn.decomposition I can import PCA, and that will be my PCA model. The n_components argument is how many dimensions you want to map down to, and for this exercise let's do two, so I'm taking the top two dimensions. My transformed x is then pca.fit_transform on the same X as before, with all the features: area, perimeter, compactness, length, width, asymmetry, groove. Let's run that; looking at the shape of X, it was (210, 7), so 210 samples each seven features long, and the transformed x is (210, 2), which means each sample is now a two-dimensional point in the new PCA space, and we can confirm that by looking at the first five rows. What's cool is that I can scatter the two PCA columns against each other: we've taken this seven-dimensional thing and made it into a two-dimensional representation, and that's the point of PCA. Now let's do the same clustering comparison as we did above: I construct a k-means PCA data frame by horizontally stacking the transformed x with kmeans.labels_ reshaped to 2D, with the columns set to "pca1", "pca2", and "class", and I build a truth PCA data frame the same way, except instead of the k-means labels I take the original classes from the data frame's class values. Now I can plot these similarly to how I plotted the ones above: for the k-means version, x and y are the two PCA dimensions and the hue is again the class.
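A sketch of that PCA comparison, continuing from the k-means code above:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
transformed_x = pca.fit_transform(X_all)   # shape goes from (210, 7) to (210, 2)

# k-means labels (from the all-features fit) vs. true classes, viewed in PCA space
kmeans_pca_df = pd.DataFrame(np.hstack((transformed_x,
                                        kmeans_all.labels_.reshape(-1, 1))),
                             columns=["pca1", "pca2", "class"])
truth_pca_df = pd.DataFrame(np.hstack((transformed_x,
                                       df["class"].values.reshape(-1, 1))),
                            columns=["pca1", "pca2", "class"])

sns.scatterplot(x="pca1", y="pca2", hue="class", data=kmeans_pca_df)  # clusters
plt.show()
sns.scatterplot(x="pca1", y="pca2", hue="class", data=truth_pca_df)   # truth
plt.show()
```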
You can see that the points are pretty spread out. Then I do the same plot for the truth classes, again with pca1 and pca2, but using the truth PCA data frame instead of the k-means one. In the truth data frame, along these two PCA dimensions, the classes actually separate fairly well; it seems slightly more separable than the raw feature pairs we were looking at earlier, which is a good sign. And comparing the two plots, many of the groupings correspond to one another: for the most part our unsupervised clustering algorithm is able to spit out the proper labels, in the sense that if you map each cluster to a kernel type, one cluster lines up with the Kama kernels, another with the Rosa kernels, and another with the Canadian kernels. It does struggle a little where the classes overlap, but for the most part the algorithm finds the three different categories and does a fairly good job of recovering them without any information from us; we never gave it any labels. So that's the gist of unsupervised learning. I hope you guys enjoyed this course and that a lot of these examples made sense, and if I've done anything that you, as somebody with more experience, would do differently, please feel free to correct me in the comments so we can all learn from it together as a community. Thank you all for watching.
Info
Channel: freeCodeCamp.org
Views: 168,186
Id: i_LwzRVP7bg
Length: 233min 53sec (14033 seconds)
Published: Mon Sep 26 2022