KNN Algorithm In Machine Learning | KNN Algorithm Using Python | K Nearest Neighbor | Simplilearn

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

r/datascience


Captions
Hello, and welcome to this K-Nearest Neighbors algorithm tutorial. My name is Richard Kirchner and I'm with the Simplilearn team. Today we're going to cover the K-Nearest Neighbors algorithm, often referred to as KNN. KNN is really a fundamental place to start in machine learning: it's the basis of a lot of other things, and the logic behind it is easy to understand and incorporate into other forms of machine learning.

So, what's in it for you today: why do we need KNN, what is KNN, how do we choose the factor K, when do we use KNN, how does the KNN algorithm work — and then we'll dive into my favorite part, the use case: predicting whether a person will have diabetes or not. That's a very common and popular data set for testing models and learning how to use different machine learning models.

By now we all know machine learning models make predictions by learning from the past data available. We have our input values, our machine learning model builds on those inputs of what we already know, and we use that to create a predicted output. "Is that a dog?" asks the little kid watching the black cat cross their path. "No, dear — you can differentiate between a cat and a dog based on their characteristics." Cats have sharp claws they use to climb, smaller ears, and they meow and purr; dogs have dull claws, bigger ears, they bark and love to run around — you usually don't see a cat running around with people the way dogs do. So we can evaluate the sharpness of the claws and the length of the ears, and usually sort out cats from dogs based on even those two characteristics.

Now, tell me if this one is a cat or a dog. An odd question — usually little kids know cats and dogs, unless they live in a place where there aren't many of either. If we look at the sharpness of the claws and the length of the ears, we can see this animal has smaller ears and sharper claws than the other animals; its features are more like a cat's, so it must be a cat. Sharp claws, short ears — into the cat group it goes. Because KNN is based on feature similarity, we can do classification using a KNN classifier: we have our input value, the picture of the black cat, it goes into our trained model, and the model predicts that this is a cat.

So what is KNN, the K-Nearest Neighbors algorithm? It's one of the simplest supervised machine learning algorithms, mostly used for classification — we want to know, is this a dog or not a dog, a cat or not a cat. It classifies a data point based on how its neighbors are classified. KNN stores all available cases and classifies new cases based on a similarity measure. And here we've gone from cats and dogs right into wine, another favorite of mine. You can see a measurement of sulfur dioxide versus chloride level, and the different wines they've tested fall on that graph according to how much sulfur dioxide and how much chloride they contain. K in KNN is a parameter that refers to the number of nearest neighbors to include in the majority-voting process. So if we add a new glass of wine — red or white — we want to know what its neighbors are. In this case we set k = 5, and we'll talk about K in just a minute. A data point is classified by the majority of votes from its five nearest neighbors; here the unknown point would be classified as red, since four out of five neighbors are red.

So how do we choose K? How do we know K should equal 5? The KNN algorithm is based on feature similarity, and choosing the right value of K — a process called parameter tuning — is important for accuracy. Say at k = 3 we have a question mark in the middle to classify: is it a square or, in this case, a triangle? If we set k = 3, we look at the three nearest neighbors and call it a square; if we set k = 7, we classify it as a triangle, depending on the other data around it. You can see that changing K can drastically change your answer.

So how do we choose the factor K? You'll find this throughout machine learning — choosing these factors is that "did I set it right?" moment in whatever tool you're using, so that you don't have a huge bias in one direction or the other. In terms of KNN: if you choose K too low, the result is too noisy — the classification leans on just a couple of nearby points, and you might get a skewed answer; if K is too big, it takes forever to process, and you run into processing and resource issues. The most common choice — there are other options — is to use the square root of n, where n is the total number of values you have: you take the square root of it. And if that comes out even — say you're classifying two classes, squares and triangles — you want to make your K value odd; that helps it select better, because you won't end up with a tied vote between two equal factions. So you usually take the square root of n, and if it's even you add one or subtract one. That's the most common way to get the K value, it's pretty solid, and it works very well.

When do we use KNN? We can use KNN when the data is labeled — you need labels, like a group of pictures marked dog, dog, cat, cat — and when the data is noise-free.
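The square-root-of-n rule of thumb for picking K described above can be sketched in a few lines (`choose_k` is a made-up helper name for illustration, not something from the video):

```python
import math

def choose_k(n):
    """Rule-of-thumb K: the square root of the sample count,
    nudged down to an odd number so a two-class vote can't tie."""
    k = int(math.sqrt(n))
    if k % 2 == 0:
        k -= 1          # even -> odd (adding 1 instead would also work)
    return max(k, 1)    # never fewer than one neighbor

print(choose_k(154))    # sqrt(154) ~ 12.4 -> 12 -> 11
```

This is the same arithmetic the video does by hand later on, when sqrt of the test-set size comes out to about 12.4 and gets rounded down to the odd number 11.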
You can see the problem with noise here: a class with values like "underweight", "140", "23", "Hello Kitty", "normal" is pretty confusing — a variety of data coming in, very noisy, and that would cause an issue. The data set should also be small: we're usually working with smaller data sets — maybe up to a gig of data if it's really clean and doesn't have a lot of noise — because KNN is a lazy learner, i.e. it doesn't learn a discriminative function from the training set. So if you have very complicated data and a large amount of it, you're not going to use KNN. But it's a great place to start: even with large data you can sort out a small sample and get an idea of what it looks like using KNN, and on smaller data sets it works really well.

How does the KNN algorithm work? Consider a data set with two variables — height in centimeters and weight in kilograms — where each point is classified as Normal or Underweight. On the basis of the given data we have to classify a new point as Normal or Underweight using KNN. So new data comes in: 57 kilograms and 170 centimeters — is that Normal or Underweight? To find the nearest neighbors we calculate the Euclidean distance. According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is d = sqrt((x − a)² + (y − b)²). You can remember that from the two edges of a triangle — we're computing the third edge, since we know the x side and the y side.

Let's calculate it to understand clearly. We have our unknown point, placed in red, and our other points with the data scattered around. Distance d1 = sqrt((170 − 167)² + (57 − 51)²), which is about 6.7; distance d2 is about 13; and distance d3 is about 13.4. Similarly we calculate the Euclidean distance of the unknown data point from all the points in the data set — and because we're dealing with a small amount of data, that's not hard to do; it's quick for a computer and it isn't complicated math. So we have the Euclidean distance from all the points to the unknown point (x1, y1) = (57, 170), whose class we have to classify. Now let's find the nearest neighbors at k = 3: the three closest neighbors are all Normal — pretty self-evident when you look at the graph. We're just voting: Normal, Normal, Normal, three votes for Normal. The majority of neighbors point to Normal, hence per the KNN algorithm the class of (57, 170) should be Normal.

A recap of KNN: a positive integer K is specified along with a new sample; we select the K entries in our database which are closest to the new sample; we find the most common classification of these entries; and that is the classification we give to the new sample. As you can see, it's pretty straightforward — we're just looking for the closest things that match what we've got.

So let's take a look at what that looks like in a use case in Python: predict diabetes. The objective: predict whether a person will be diagnosed with diabetes or not. We have a data set of 768 people who were or were not diagnosed with diabetes. Let's open that file and take a look at the data. It's in a simple spreadsheet format — the data itself is comma-separated, a very common format and a very common way to get data — and you can see columns A through I.
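Before the use case, the distance-and-vote procedure recapped above can be sketched from scratch. The labeled points below are made up for illustration, except the first, which mirrors the worked d1 example:

```python
import math
from collections import Counter

def euclidean(p, q):
    # d = sqrt((x - a)^2 + (y - b)^2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def knn_classify(unknown, labeled, k=3):
    """labeled: list of ((weight_kg, height_cm), class_label) pairs."""
    nearest = sorted(labeled, key=lambda item: euclidean(unknown, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]          # majority vote

unknown = (57, 170)
labeled = [((51, 167), 'Normal'),              # d1 = sqrt(6^2 + 3^2) ~ 6.7
           ((62, 182), 'Normal'),
           ((59, 173), 'Normal'),
           ((60, 171), 'Normal'),
           ((45, 154), 'Underweight'),
           ((48, 158), 'Underweight')]

print(round(euclidean(unknown, (51, 167)), 1))   # ~6.7
print(knn_classify(unknown, labeled, k=3))       # three Normal votes
```

The three nearest points are all labeled Normal, so the vote comes out Normal — the same conclusion the worked example reaches.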
Columns A through H hold eight attributes, and then the ninth column, Outcome, is whether they have diabetes. As a data scientist, the first thing you might look at is the insulin column — if someone is taking insulin, they presumably have diabetes already, which could cause an issue in some machine learning setups; but for a very basic demonstration of KNN this works fine. The next thing you notice is that it didn't take much to open the file: I can scroll down to the bottom and there are 768 rows — a small data set I can easily fit into the RAM of a regular desktop computer, look at, and manipulate without taxing the machine. You don't need an enterprise setup to run a lot of this.

Let's start with importing all the tools we need — but before that, a word on the IDE I'm using. You can certainly use any editor for Python, but for basic visual work I like Anaconda, which is great for demos, with the Jupyter Notebook. A quick view of the Anaconda Navigator (the new release, which is really nice): under Home I can choose my application — we'll be using Python 3.6; I have a couple of different versions on this machine. Under Environments I can create a unique environment for each project, which is nice, and there's even a little button to install different packages: if I click it and open the terminal, I can use a simple pip install for whatever packages I'm working with. Let's go back to Home and launch our Notebook. A bit like the old cooking shows, I've already prepared a lot of my stuff, so we don't have to wait for it to launch — it takes a few minutes to open a browser window; in this case it opens Chrome, my default.

Since the script is pre-done, you'll see a number of windows open at the top, and we're working in the first one. Since we're working on KNN — predict whether a person will have diabetes or not — let's put that title in. I'll insert a cell below, then go back up to the top cell and change its cell type to Markdown: that means it won't run as Python, it's markdown language, so when I run it, it comes up as nice big letters showing what we're working on.

By now you should be familiar with doing the imports: we import pandas as pd and numpy as np — pandas for the DataFrame and numpy for number arrays, two very powerful general Python tools. Then from sklearn we have train_test_split — by now you should be familiar with splitting the data: part of it for training our model, and the remaining data for testing how good it is. Next, the preprocessing StandardScaler, a preprocessor, so we don't get a bias from really large numbers: remember, in the data the number of pregnancies never gets very large, while the amount of insulin goes up to 256 or so — 256 versus 6 would skew results, so we rescale everything onto a uniform scale. Then the actual tool, the KNeighborsClassifier. And finally three tools that are all about testing our model — how good is it: the confusion matrix, the f1 score, and the accuracy score. So we have our two general Python modules and six specific imports from the sklearn setup; we run the cell so they actually get imported, and move on to the next step.

In this next step we're going to load the database — we'll use pandas (pd) and take a look at the data in Python; we looked at it in a spreadsheet, but I usually like to also pull it up in the notebook so we can see what we're doing. So: dataset = pd.read_csv('diabetes.csv') — that's a pandas command, and I put the diabetes file in the same folder as my Python script; if you put it in a different folder, you need the full path. We can also take a quick length of the data set — simple Python, len for length — and print it. (In a notebook, a value on its own on the last line prints automatically, but in most setups you want the explicit print.) Then we look at the actual data: since we're in pandas we can simply call dataset.head(), again wrapped in a print — if you put several of these in a row, only the last one displays on its own, so I always like to keep the print statement, though since most projects only use one pandas DataFrame, either way works fine. When we hit Run, we see the 768 lines we knew about, with an automatic index label on the left; head() only shows the first five rows, 0 through 4, and a quick look confirms it matches what we saw before — Pregnancies, Glucose, BloodPressure, all the way to Age, and then Outcome at the end.

In the next step we're going to do a couple of things, starting with a list of columns where we can't have zero — there's no such thing as zero skin thickness.
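The loading-and-inspection step just walked through can be sketched as follows. Since the real `diabetes.csv` isn't bundled here, a tiny stand-in DataFrame with the same nine columns takes its place; the commented `pd.read_csv` line is the call the video actually uses:

```python
import pandas as pd

# In the video: dataset = pd.read_csv('diabetes.csv')
# Stand-in frame with the same columns (a few sample rows):
dataset = pd.DataFrame({
    'Pregnancies': [6, 1, 8],
    'Glucose': [148, 85, 183],
    'BloodPressure': [72, 66, 64],
    'SkinThickness': [35, 29, 0],
    'Insulin': [0, 0, 0],
    'BMI': [33.6, 26.6, 23.3],
    'DiabetesPedigreeFunction': [0.627, 0.351, 0.672],
    'Age': [50, 31, 32],
    'Outcome': [1, 0, 1],
})

print(len(dataset))      # row count (768 for the real file)
print(dataset.head())    # first five rows (here, all three)
```

Note the zeros already visible in SkinThickness and Insulin — exactly the missing-data problem the next step deals with.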
Zero blood pressure or zero glucose would be just as impossible — with any of those you'd be dead — so a zero in those columns isn't a real measurement, it means the data wasn't recorded, and we're going to replace that missing information. First we create the list of column names — Glucose, BloodPressure, SkinThickness, and so on; listing the columns you need to run some kind of transformation on is a very common pattern. There are pandas tools that will do a lot of this for you, but here we do it explicitly: dataset[column] = dataset[column].replace(0, np.nan) — still pandas; np.nan, numpy's NaN, stands for a value that doesn't exist. So the first thing we do is replace each zero with numpy NaN: there's no data there. If it's a zero, the person is — well, hopefully not dead — they just didn't get the measurement. Next we compute the mean as an integer from the column with mean(skipna=True), a pandas call that skips the NaNs, and then we replace all the np.nan values in that column with that mean. Why do it this way? You could combine the steps and handle the zeros directly — there are ways to skip zeros when computing the mean and then replace them — but in this case we want to go ahead and do it in two steps.
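A minimal sketch of that two-step replacement, on a small stand-in frame (the column values here are made up for illustration):

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'Glucose':       [148, 0, 183, 89],
                        'BloodPressure': [72, 66, 0, 66],
                        'Insulin':       [0, 0, 230, 94]})

zero_not_accepted = ['Glucose', 'BloodPressure', 'Insulin']
for column in zero_not_accepted:
    # Step 1: a zero here means "no reading", so mark it as missing.
    dataset[column] = dataset[column].replace(0, np.nan)
    # Step 2: fill the gaps with the column mean, skipping the NaNs.
    mean = int(dataset[column].mean(skipna=True))
    dataset[column] = dataset[column].replace(np.nan, mean)

print(dataset['Glucose'].tolist())   # [148.0, 140.0, 183.0, 89.0]
```

The zero glucose reading becomes 140, the mean of the three real readings (148, 183, 89), so the row can still contribute its other values without dragging the model toward an impossible zero.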
So you can see we're switching the zeros to a non-existent value and then computing the mean — the average person. If we don't know a value because the data is missing, one of the tricks is to replace it with the average, the most common value for that column; that way you can still use the rest of the row's values in your computation, and it more or less takes those missing values out of the equation. Let's run it — it doesn't print anything; we're still preparing our data. If you want to see the result, the first few rows won't show anything, but we can look at a column: print(dataset['Glucose']) prints all the glucose levels going down, and thankfully nothing in there looks like missing data, at least in the rows that show — with too many lines, Jupyter Notebook skips a bunch in the middle and goes on to the end. Let me remove that and zero that cell out.

Before proceeding any further we need to split the data set into training and testing data, so we have something to train with and something to test on. And notice what we did here with the pandas DataFrame code: X = dataset.iloc[:, 0:8]. iloc says: within the data set, take all rows — that's what the ':' means — but only columns 0 to 8, and since the end of the slice isn't included, that's actually columns 0 through 7, the eight feature columns. Remember the ninth column, which we printed as Outcome? That's not part of the training data, that's the answer — it's the ninth column, but it's indexed as 8, since 0 to 8 is nine columns. So for y, our answer, we want just that last column: y = dataset.iloc[:, 8]. Then, remember, we imported train_test_split — that's from sklearn — and we simply pass in our X and our y with random_state = 0 (a seed number; you don't necessarily have to seed it) and test_size = 0.2, which simply means we take 20% of the data and put it aside so we can test on it later. That's all that is. We run it — not very exciting so far, no printout other than looking at the data — but a lot of this is prepping the data; once it's prepped, the actual model lines are quick and easy.

We're almost there, but before actually running KNN we need to scale the data. We fit the data to a StandardScaler, which means that instead of one column running from, say, five to three hundred and the next from one to six, all the columns are standardized onto the same scale, centered around zero — that's what the StandardScaler does, it keeps things standardized. We only fit the scaler with the training set, but we make sure the test set going in is also transformed, so it's processed the same way. So: sc_X = StandardScaler() — we import the standard scaler into that variable — then X_train = sc_X.fit_transform(X_train): we create and fit the scaler on the training data and transform it. And X_test = sc_X.transform(X_test): the test set isn't part of training the transformer; it just gets transformed. We run this, and if you look back, we've now gone through all three steps: we've replaced the zeros in key columns that shouldn't be zero with the means of those columns, so they fit right in with our data model; we've split the data, so we have training data and test data; and we've scaled the data going in. Note that we don't transform the y part — y_train and y_test never need that; it's only the input data that gets scaled.

Then we define the model using KNeighborsClassifier and fit the training data to it. After all that data prep, there are only a couple of lines of code where we actually build and train our model — that's one of the cool things about Python and how far we've come; it's an exciting time to be in machine learning because there are so many automated tools. Before we do that, a quick check: len(y) gives 768, and if we import math and take math.sqrt(len(y_test)), we get 12.409 — I want to show you where the next number comes from. Twelve is an even number, and remember the neighbors all vote — you don't want an even number of neighbors voting — so we want something odd, and we just take one away and make it 11. Let me delete that scratch work; this is one of the reasons I love Jupyter Notebook — you can flip around and do all kinds of things on the fly.
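Before the classifier goes in, the split-and-scale steps above can be sketched end to end on stand-in data (random numbers here take the place of the eight diabetes features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))         # stand-in for dataset.iloc[:, 0:8]
y = rng.integers(0, 2, size=20)      # stand-in for dataset.iloc[:, 8]

# 20% of the rows held out for testing, seeded for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.2)

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)   # fit the scaler on training data only
X_test = sc_X.transform(X_test)         # reuse that same fit on the test data

print(X_train.shape, X_test.shape)      # (16, 8) (4, 8)
```

Fitting the scaler on the training rows only, then reusing it on the test rows, is the point: the test set should be transformed exactly the way the training set was, without leaking its own statistics into the fit.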
So we'll go ahead and create our classifier: classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean'). Remember we did 12 minus 1 for 11, so we have an odd number of neighbors; and p=2 with the Minkowski-style metric gives you the ordinary Euclidean distance. There are other means of measuring distance — all kinds of them — but Euclidean is the most common, and it works quite well. Then we fit the training data to the model and predict on the test set.

It's important to evaluate the model, so let's use the confusion matrix to do that — a wonderful tool — and then we'll jump into the f1 score, and finally the accuracy score, which is probably the most commonly quoted number when you go into a meeting. We paste that in and set cm = confusion_matrix(y_test, y_pred) — those are the two values that go in — then run it and print it out. The way you interpret this: "predicted" runs across the top and "actual" down the side. The diagonal down the middle is the important part — it's where the prediction and the actual values agreed: 94 and 32. The other two numbers, the 13 and the 15, are what was wrong. (If you were looking at three different classes instead of two, you'd get a third row and column, with the correct counts still running down the middle.) So: 94 people who don't have diabetes were predicted correctly, while the prediction flagged another 13 of them as having diabetes and being at high risk; and of the people who do have diabetes, it got 32 correct, while another 15 it classified incorrectly.
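The model-building and evaluation lines can be sketched together on toy data — random stand-in features, with a label that simply follows the first feature so the classifier has something learnable:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

rng = np.random.default_rng(1)
X_train = rng.normal(size=(80, 8))
y_train = (X_train[:, 0] > 0).astype(int)   # toy rule: first feature decides
X_test = rng.normal(size=(20, 8))
y_test = (X_test[:, 0] > 0).astype(int)

# n_neighbors=11 as in the video (sqrt of the test count, made odd);
# p=2 with the Euclidean metric is the ordinary straight-line distance.
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)       # rows: actual, columns: predicted
print(cm)
print(f1_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```

The diagonal of `cm` holds the agreements between prediction and reality, the off-diagonal cells the mistakes — the same layout the 94/13/15/32 matrix from the diabetes run has.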
You can see where that classification comes in and how it works on the confusion matrix. Then we print the f1 score and get about 0.69 — the f1 score takes into account both sides of the balance, the false positives and false negatives — whereas the accuracy score, which we print next, is what most people think of: just how many we got right out of the total. When you're a data scientist talking to other data scientists, they'll ask you for the f1 score; when you're talking to the general public or the decision-makers in the business, they'll ask for the accuracy. The accuracy here is higher than the f1 score — it usually is — but the f1 score is more telling: it lets us know there are more false positives than we would like. Still, 82% is not too bad for a quick first look at people's statistics, running sklearn and the K-Nearest Neighbors on it. So we have created a model using KNN which can predict whether a person will have diabetes — or at the very least whether they should go get a checkup and have their glucose checked regularly. The printed accuracy score, 0.818, we can pretty much round off and call an accuracy of about 82%, which tells us the model is a pretty fair fit.

To pull it all together — it's always fun to make sure we covered everything we went over: we covered why we need KNN, looking at cats and dogs (great if you have a cat door and want to figure out whether it's the cat or the dog coming in); using the Euclidean distance, the simple distance calculated as the square root of the sum of the two squared sides of a triangle; choosing the value of K — we discussed at least the main rule of thumb people use; how KNN works; and finally a full KNN classifier for diabetes prediction. Thank you for joining us today. For more information, visit www.simplilearn.com.
Info
Channel: Simplilearn
Views: 270,234
Rating: 4.8798623 out of 5
Keywords: knn algorithm in machine learning, knn algorithm, knn algorithm using python, knn algorithm example, knn algorithm example in python, knn algorithm machine learning, how knn algorithm works, how to implement knn algorithm in python, what is knn algorithm, knn algorithm in data mining, k-nearest neighbor classification algorithm, k-nearest neighbor classification algorithm example, k-nearest neighbor classifier, k-nearest neighbors (knn), machine learning algorithms, simplilearn
Id: 4HKqjENq9OU
Channel Id: undefined
Length: 27min 42sec (1662 seconds)
Published: Wed Jun 06 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.