Unsupervised Learning and Recommendations (6/6) | Machine Learning with Python: Zero to GBMs

Captions
Hello and welcome to Machine Learning with Python: Zero to GBMs. This is a beginner-friendly online certification course offered by Jovian. Today we're on the final lesson of the course, lesson six, and the topic for today is unsupervised learning and recommendations. So let's get started. First, we'll go to the course page, zerotogbms.com. On the course page you'll find some information about the course. You can enroll in the course and get a certificate of accomplishment, you can check out the course Discord server and the course discussion forum using the links here, and you can find links to all the lessons and assignments that you need to complete to get a certification for this course. Let's scroll down to lesson six, Unsupervised Learning and Recommendations. On the lesson page you'll find a recording of the lesson, some information about the topics covered, and links to the discussion forum and the Discord server for this lesson.

Now let's scroll down to the lesson notebooks. Here you'll find the Jupyter notebooks that we are going to use today. We have a couple of Jupyter notebooks, one on unsupervised learning and another one on recommendations. So let's open up the unsupervised learning notebook. You can look at the notebook here, but we are going to run it by clicking "Run" and selecting "Run on Binder". This will take the notebook and set up a Jupyter server for us on the cloud so that we can run the code and experiment with the material for this lesson. Feel free to follow along, or you can also just watch this video and then complete the notebook on your own afterwards.

All right, we now have a Jupyter notebook running in front of us. The first thing I like to do on Jupyter is click "Restart and Clear Output" so that all the stale outputs from the previous execution are removed and we can see the outputs fresh. I'm also going to hide the header and toolbar and zoom in a little.

Okay. So the topic for today is unsupervised learning using scikit-learn, and specifically we're going to talk about a couple of unsupervised learning techniques: clustering, which is taking some data points and identifying clusters from those data points (there's a visual representation here), and dimensionality reduction, which is taking a bunch of data points that exist in 2, 3, 4, 5, or any number of dimensions and reducing them to fewer dimensions. For example, here we're taking all these points, then taking a line and projecting all the points onto the line, and simply looking at their positions on the line instead of looking at their two coordinates. So that's what we're going to talk about today. We'll start with an overview of unsupervised learning algorithms in scikit-learn, then talk about clustering, and then talk about dimensionality reduction. Now, this is going to be a high-level overview. We're not going to look at very detailed implementations, nor the detailed internal workings. We will try to grasp the intuitive idea of how these different algorithms work, what they're used for, and how they differ from one another. And I encourage you to explore more. As I've been saying for the last few weeks, we're at the point where you can learn things from online resources, from books, from tutorials, from courses, on a need-to-know basis.
So whenever you come across a term you need to know about, you look it up, you find a good resource, and then you spend some time on it, maybe a day or a couple of days working through some examples, and become familiar with it. From this point on, you have to start searching and doing some research on your own, and a great way to consolidate your learning is to put together a short tutorial of your own. Try creating your own tutorial on any topic of your choice, for example on principal component analysis, publish it, and share it with the community.

So let's install the required libraries. I'm installing NumPy, pandas, Matplotlib and Seaborn, which are the standard data analysis libraries, and I'm also installing Jovian and scikit-learn, because those are the libraries we'll need for this lesson.
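Here's roughly what that setup cell looks like (a minimal sketch; the exact install command and import list in the lesson notebook may differ slightly):

```python
# Run once in a Jupyter cell to install the libraries used in this lesson
!pip install numpy pandas matplotlib seaborn scikit-learn jovian --quiet

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```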
Unsupervised machine learning refers to the category of machine learning techniques where models are trained on a dataset without any labels, unlike supervised learning. You might wonder what exactly we are training for if there are no labels, if we just have a bunch of data. Unsupervised learning is generally used to discover patterns in data and to reduce high-dimensional data to fewer dimensions.

Here is how it fits into the overall machine learning landscape. Within computer science you have artificial intelligence, where sometimes you have rule-based systems and sometimes you have machine learning, where models learn patterns and relationships from data. Machine learning, in turn, comprises supervised learning and unsupervised learning: supervised learning is where you have labels for your data, and unsupervised learning is where you do not. There are also a couple of other, overlapping categories called semi-supervised learning and self-supervised learning; we won't get into those right now, but I encourage you to check them out. And then there is one branch of machine learning, deep learning, which we have not talked about in this course, but we have another course on it called Deep Learning with PyTorch: Zero to GANs, so do check that out sometime later. Deep learning cuts across all of these categories; it's a new paradigm, a new way of doing machine learning, a wide-reaching approach that applies to many different kinds of problems.

Here is what we are studying in this course: classical machine learning algorithms, as opposed to deep learning. The reason is that a lot of the data we work with today is tabular data; something like 95% of the data companies deal with is tabular: Excel sheets, database tables, CSV files, etc. And the best-known algorithms for tabular data at this point, especially algorithms that can be interpreted and controlled well, are classical machine learning algorithms. We've already looked at supervised learning algorithms: classification, where we've covered logistic regression, decision tree classification and gradient-boosting-based classification, and regression, where we try to predict a number, with linear regression and tree-based regression.

So classification is where we divide observations into different classes and predict those classes for new observations, and regression is where we try to predict a continuous value. Today we're looking at unsupervised learning, where there are no labels for the data. You either try to cluster the data, which means creating groups of similar data points, or you try to reduce the dimensionality of the data, or sometimes you try to find associations between different data points and use those for things like recommendations.

scikit-learn offers this cheat sheet to help you decide which model to pick for a given problem. Most of the time, if you've done a little bit of machine learning, you will already know which model to pick, so it's a fairly obvious kind of flowchart, but it's still good to see the four categories of algorithms available in scikit-learn. If you are predicting a category and you have labeled data, that is when you use classification. On the other hand, if you're trying to predict a category and you do not have labeled data, that is when you use clustering. That's the difference between classification and clustering, and it's a common source of confusion: in clustering there are no labels for us to look at, we just want to group whatever data points we have into different clusters. If you're predicting a quantity, you should be using regression, and if you just want to visualize data or reduce its size, that's when you look at principal component analysis, embeddings, and things like that.

Okay, so let's talk about clustering. As I've said a couple of times already, clustering is the process of grouping objects from a dataset such that the objects in the same group are more similar, in some sense, to each other than to those in other groups. That's the Wikipedia definition for you. scikit-learn offers several clustering algorithms, and in fact its documentation has an entire section on clustering that you should check out. It describes the different clustering methods (there are more than ten), their use cases, their scalability to different dataset sizes and numbers of clusters, and the parameters they take. They vary quite a bit in how they are implemented, but the goal is ultimately the same: to take some data and group it into clusters.

For example, this plot here shows incomes and debt: a scatter plot between the incomes of various people and the amount of debt they currently have, where each point represents one person. These people have high income but low debt, these people have low income and low debt, and these people have low income and very high debt. So clearly there are three clusters of people here, and typically these classes don't come predefined. You would simply have the income and debt information, and what you might want to do is identify which cluster a particular person belongs to, or even figure out what the clusters should look like in the first place. That's what clustering is, and that's what we'll try to do today.
Given all these points, we will try to figure out how many clusters there are in the data and which cluster each point belongs to. And potentially, if a new data point comes in, which cluster will that data point belong to?

So why are we interested in clustering in the first place? Here are some real-world applications of clustering. One is customer segmentation. Suppose you are an insurance company, or a bank, and you're looking at applications for loans. This kind of cluster analysis is something you may want to do: plot incomes and debt and see where a person lies, which cluster they belong to. You may have a different set of operating rules for high-income, low-debt people, low-income, low-debt people, and low-income, high-debt people, and you may want to simplify decision-making. Instead of looking at both variables each time, you can simply feed the variables into an algorithm, get back a category, and use that category to make decisions. In fact, the classes in several classification problems are often obtained in the first place through a clustering process, especially when you're looking at things like low risk, medium risk, high risk, etc.

Product recommendation is another important application of clustering: if you can identify clusters of people who like a particular product, then you can recommend that same product to people who have similar behavior. Feature engineering is yet another application. You could perform some clustering on your data and then take the cluster number and add it to your training data as a categorical column, and it's possible that adding that categorical column may improve the training of a decision tree or a gradient boosting model.

Then you have anomaly or fraud detection. If you take, let's say, a bunch of credit card transactions and cluster them, you will notice that fraudulent transactions stand out. Maybe there are certain credit cards that make a lot of transactions, so they don't fall within the regular cluster; they fall within an anomalous cluster, and that can be used to detect this kind of fraudulent behavior and decide how to deal with the problem.

Another use of clustering, hierarchical clustering specifically, is taxonomy creation. You know that there are several hierarchical divisions in biology: first you have, of course, animals and plants, then within plants you have a bunch of different families, and in each family you have a bunch of different species, and so on. That kind of hierarchy can be created using clustering, where you take a bunch of different attributes, like how a particular animal reproduces, what kind of feed it has, what kind of weight it has, where it lives, and things like that, and use them to create clusters and form families of related animals; then related families get grouped into related kingdoms and so on.

So those are some applications of clustering, and we will use the Iris flower dataset to study some of the clustering algorithms available in scikit-learn. Here is the Iris flower dataset; I'm just going to load it from the Seaborn library.
It comes included with Seaborn. So here is the Iris dataset. In this dataset you have observations about 150 flowers: you can see 150 rows of data, and each row represents the observations for one flower. The observations are the length of the sepal, the width of the sepal (the sepal and petal are two parts of the flower), the length of the petal and the width of the petal. So these are the four measurements we have, and we also know which species each flower belongs to. Now, for the purpose of today's tutorial, we are going to pretend that we don't know which species these flowers belong to; we just have the measurements. What we'll try to do is apply clustering techniques to group all of these flowers into different clusters, and then see whether the clusters our algorithms pick out, just by looking at these four measurements, match the species or not. So we're not going to use species at all; we are going to treat this data as unlabeled.

Here's what the data looks like. If I plot sepal length against petal length, this is what the points look like, and without the species information this is all we would see. Looking at the points, you might say that maybe there are two clusters: one cluster here and one cluster here. If you look even more closely, you might say there are three clusters: maybe this is a cluster, and this seems like a cluster. Of course, we're just looking at two dimensions here; we're not looking at all four dimensions, because we can't even visualize four dimensions. We could also look at sepal length against sepal width, or against petal width, and there again seem to be a couple of clusters, maybe three. We could look at different combinations, or take three of the measurements and visualize them in 3D and try to identify clusters, but we have no way of visualizing things in 4D, and that's where we have to take the help of a computer. Even within the two-dimensional representation here, you can see that clusters start to form, and it's an interesting question: how do you train a computer to figure out these clusters given just these four measurements, sepal length, sepal width, petal length and petal width?

As I've said, we are going to attempt to cluster the observations using the numeric columns in the data. So I'm going to pull out a list of numeric columns and put the data from those columns into the variable X. X is typically used to represent the input to a machine learning algorithm, so there's just this X and there's no y here.
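Here is roughly what the data loading and column selection described above look like in code (a sketch; the column names are the ones used by Seaborn's bundled iris dataset):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset bundled with Seaborn: 150 rows, four measurements plus species
iris_df = sns.load_dataset('iris')

# Treat the data as unlabeled: keep only the numeric measurement columns as the input X
numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris_df[numeric_cols]

# Plot two of the four dimensions; some clusters are already visible
sns.scatterplot(data=iris_df, x='sepal_length', y='petal_length')
plt.show()
```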
The first clustering algorithm we'll talk about is k-means clustering. The k-means algorithm attempts to classify objects into a predetermined number of clusters: you have to decide the K in k-means, which is the number of clusters you want the data to have. In this case, let's say we have some intuition, maybe from looking at a few scatter plots, that there are two or maybe three clusters, so we can give the number of clusters as an input to the k-means algorithm, and it will take this data and work with it.

Let's say we give it an input of K = 3. It is then going to figure out three central points, one for each cluster: here is the center point for cluster number one, here is the center point for cluster number two, and here is the center point for cluster number three. These central points are also called centroids, and each object, or each observation in the dataset, is classified as belonging to the cluster represented by the closest center point. So you can see that all of these points belong to this cluster, all of these are said to belong to this cluster, and all of these are said to belong to this cluster. And of course, if you now make a new observation, maybe you measure another flower and put its observations here, and it lands somewhere here, then it will belong to the cluster of the center it is closest to. That's the k-means algorithm.

Now, the question you may have is: how exactly are these centers determined? Let's take an example, maybe a one-dimensional example initially, and then we'll expand it to two dimensions and go from there. Let me draw a line here, and let's say this line represents petal length. I'm going to take petal lengths from, say, zero to five, since all the values seem to fall in that range, and let me take about 12 flowers rather than all 150. Let's say you have these four flowers with a fairly low petal length, very close to one, then another four flowers with a medium petal length, and then yet another four flowers with a high petal length. Just by looking at the petal length, you can visually say that this is one cluster, this is one cluster, and this is one cluster. The challenge is: how do you train a computer to figure out these clusters?

Here's what we do. First, in the k-means algorithm, we determine the value of K. Let's say I set the value of K to three; for some reason I have some intuition that the value of K should be three (we'll talk about how to pick the right value of K later). Then we pick three random points: I pick this one, that's a random point, I pick this one, and then I pick this one. We are going to treat these three random points as the centers of our clusters, so this becomes the center of cluster one, this becomes the center of cluster two, and this becomes the center of cluster three.

So here's what we've done so far: pick K random objects as the initial cluster centers. The next step is to classify each object into the cluster whose center is closest to it. Let's try to classify this point, or this flower: which cluster center is it closest to? You can clearly see that it is closest to center one, so this becomes cluster one. Then check the next one: it's also closest to center one, so it also gets assigned cluster one, and so does this one, and this one. I would say that even this one is closer to center one than to center two, so it also gets assigned cluster one.
And of course this point is already in cluster one. Then you have this one, cluster two, and this also gets assigned cluster two. This one, I would say, is closer to center three, so it gets assigned to cluster three, and this is cluster three, and this is cluster three as well.

So now we have a clustering, but this clustering is clearly not that great: you can see that these two should really belong to cluster two. Here is where we do the next interesting thing. Once we have determined the clusters (this is cluster one, this is cluster two, this is cluster three), we then, for each cluster of classified objects, compute the centroid, which is simply the mean. Right now we're looking at one dimension, so the centroid is simply the mean (we'll talk about two dimensions in a moment). We take all these values, so this is about 0.7, this is about 1, this is about 1.2, and so on, and we take their average. If you take the average of all these values (let me draw another line here), the average would be somewhere around here. Then we take the average of all the values in cluster two, and that would be somewhere around here, and the average of all these values, which would be somewhere around here. Now here's the interesting thing: once we've taken the averages, we make these the new centers of the clusters. This becomes the center for cluster one, this becomes the center for cluster two, and this becomes the center for cluster three. And you can see things are already starting to look better. Let's put back these points here.

So what we've done is this: we've taken each cluster that we created using randomly picked points, taken the averages of those clusters, and said these are the new cluster centers. Now, using these new cluster centers, we reclassify the points: we reclassify each object using the centroids as the cluster centers. So this point gets class one, this point goes to class one, this one goes to class one, and this one goes to class one too. This one goes to class two, class two, class two, and then class three, class three, class three, class three. And that's it. Just like that, we have ended up with cluster one, cluster two and cluster three. That is exactly how k-means works.

But there is one last issue here. Suppose the points we picked in our initial random selection were very bad points. Let's say they were somewhere here; let's say we picked this, this, and this as the points. What would happen then? Well, all of these would belong to the first cluster, this would be the second cluster, and this would be the third cluster, and the average would lie somewhere here. So your first cluster center would be here, and the second and third would be here. Even after computing the averages and reclassifying the points, you would still end up with this one huge cluster. Maybe these might get excluded, maybe these might go into cluster two, but you'd still get both of these sections lumped into one big cluster. So k-means does not always work perfectly; it depends on that random selection.
So that's why we repeat these steps a few more times and pick the cluster centers that give the lowest total variance; we'll talk about that last piece in a moment. Here's what we do, just to recap: pick K random objects as the initial cluster centers, like this one, this one and this one; classify each object into the cluster whose center is closest to it; then compute the centroid, the mean, for each cluster (so here the centroid lay here, here it lay here, and here it lay here); then use the centroids as the new cluster centers and reclassify all the points. You keep track of those centroids, and then you do the whole process again: pick random objects, classify, compute the centroids, reclassify, keep the result, and keep doing that over and over, multiple times. Let's say you do it 10 or 20 times; out of those runs, you simply pick the one with the lowest total variance.

So what do we mean by the total variance? Here's how you can compare two sets of clusters. I'm not going to draw the individual points anymore, but let's say you have one set of clusters that looks like this, and another set of clusters, again produced by k-means, that looks like this. What you do is compute the variance. You have a bunch of points in each cluster, and these points are just values on the line (remember, this is petal length). Whenever you have points, you can compute the variance, which is simply a measure of spread: how spread out the data is. You compute the variance of these points, the variance of these points, and the variance of these points, and add up those values. Here this is a relatively low-variance result: let's say these variances are 0.1, 0.2 and 0.3, so the total variance is about 0.6. On the other hand, consider this one, where the spread is larger: the variances are, say, 0.7, 0.2 and 0.5, so the total variance is about 1.4. And the variance is simply the square of the standard deviation, if you remember from basic statistics.

So what do we mean when we have a low total variance? We are essentially expressing that all the points in every cluster are very close together. And when we have a high total variance, we are expressing that there are points in certain clusters that are very far from each other. So as we try all of these random experiments, we simply pick the cluster centers that minimize the total variance, and with enough random experiments you can be fairly confident you will get very close to the optimal solution. You may not get the best possible division, and sometimes, even when you run k-means multiple times, you may get different cluster divisions depending on the data, but once you run it enough times you will get to a pretty good place. That's basically how the k-means algorithm works.
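To make the steps above concrete, here is a small from-scratch sketch on made-up one-dimensional petal lengths (the data values and the helper name are mine, purely for illustration; the lesson itself uses scikit-learn's implementation):

```python
import numpy as np

def kmeans_1d(points, k, n_runs=10, n_iters=10, seed=0):
    """Toy 1D k-means following the steps described above."""
    rng = np.random.default_rng(seed)
    best_centers, best_total_var = None, np.inf
    for _ in range(n_runs):                                    # repeat with different random starts
        centers = rng.choice(points, size=k, replace=False)    # 1. pick k random points as centers
        for _ in range(n_iters):
            # 2. assign each point to the closest center
            labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
            # 3. move each center to the mean (centroid) of its cluster
            centers = np.array([points[labels == j].mean() if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # 4. score the result by the total within-cluster variance
        total_var = sum(points[labels == j].var() for j in range(k) if np.any(labels == j))
        if total_var < best_total_var:
            best_total_var, best_centers = total_var, centers
    return best_centers, best_total_var

petal_lengths = np.array([0.7, 1.0, 1.1, 1.2, 2.4, 2.5, 2.6, 2.8, 4.5, 4.6, 4.8, 5.0])
print(kmeans_1d(petal_lengths, k=3))
```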
Now, how does this apply to two-dimensional data? Exactly the same way. Let's say you have petal length and petal width, so you're looking at both of these, and you have a couple of flowers here, some flowers here, and some flowers here. What do you do? Let's say you set K to three, so you pick three random points, maybe these. You set those three random points as the cluster centers, and you label all the other points: one, one, one, and these, I think, are also closest to one, and these are closest to two, and this one is closest to three, or something like that. Then what do you do? You take all these points and compute the centroid. The centroid of these ends up here, the centroid of these ends up here, and the centroid of these ends up here. Once you have the centroids, you do the classification again, now computing distances against the centroids, and some of the cluster assignments may change. This is probably not a great example, but let's say you got a centroid here and a centroid here; what would then happen is that all of these fall into a different cluster, all of these fall into a different cluster, and all of these fall into a different cluster. So it's the exact same process: pick K random objects, classify, compute the centroids. And what is the centroid? Well, you take all the x values of the points in the cluster and take their mean, and you take all the y values and take their mean, so the centroid is simply a dimension-wise mean. Then you do this multiple times with random initial cluster centers, and you pick the centers with the lowest total variance, using the total variance as the measure of goodness. That's how k-means works.

And here is how you actually perform k-means clustering: you don't have to write code for any of that. All you do is import KMeans from sklearn.cluster and initialize it, giving it n_clusters, the number of clusters you want to create, and a random state. You don't have to give the random state; it just ensures that the randomization is initialized the same way each time. Then you call model.fit and give it X, and remember, X simply contains the numeric data; there is no target here. Once the model is fitted, you can check the cluster centers. Remember, this is clustering not in one dimension or two dimensions but in four different dimensions, and the algorithm is exactly the same; it's just that the number of dimensions has increased. What the model has found is that the cluster centers are as follows: for cluster one, the center is at a sepal length of 5.9, a sepal width of 2.7, a petal length of 4.3 and a petal width of 1.4, and similarly for the second cluster and the third cluster.

Now, when we want to classify points using the model, this is what we do. We check the distance from 5.1 to 5.9, that's about 0.8; we check the distance from 3.5 to 2.7, that's about 0.8; we check the distance from 1.4 to 4.3, that's a lot, about 2.9; and we check the distance from 0.2 to 1.4. So we see how far each of these values is from the corresponding value of the cluster center.
We're basically going to subtract the cluster center from the actual values, and then add up the squares of those differences. What that means is: if you have a cluster center here, let's say this is the cluster center and this is a point, then in two dimensions the cluster center looks like (x1, y1) and the point looks like (x2, y2). What we compute is (x1 - x2)^2 + (y1 - y2)^2, and we take the square root of that. And what is that, exactly? It is nothing but the actual distance between the two points in two dimensions, because if you look at this right-angled triangle here, this edge of the triangle is x2 - x1, this edge over here is y2 - y1, and the distance here, by the Pythagorean theorem, is the square root of the sum of the squares of these two sides. So this is the distance d between the two points.

So for each point we check its distance from each cluster center: we compute the distance from this cluster center, from this cluster center, and from this cluster center, and we find that this cluster center is the closest, so this is the cluster the point gets assigned to. Don't worry too much about the math; the basic idea is that each point gets assigned to the cluster it is closest to, and that closeness is determined using this measure. Sometimes it's called the L2 norm, sometimes it's called the Euclidean distance; it's basically the Pythagorean theorem, calculating the distance in these four dimensions.

So the way we can now classify each point, or each flower, into these clusters is by calling model.predict. We call model.predict on X, and the model figures out, by calculating the distance of this flower to cluster center zero, cluster center one and cluster center two (the clusters are numbered 0, 1, 2), which center it is closest to. You can verify that yourself if you want; there's a distance function in scikit-learn too. So all these points belong to cluster number one, then a bunch of these points belong to cluster number zero, and some of these in between, you can see, belong to cluster number two; then a lot of these points belong to cluster number two, and some in between belong to cluster number zero. So each flower has now been classified into a cluster.
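Putting that together, the scikit-learn usage described above looks roughly like this (the random_state value is arbitrary; X is the numeric Iris data from earlier):

```python
from sklearn.cluster import KMeans

# Fit k-means with 3 clusters on the four numeric columns
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)

print(model.cluster_centers_)   # one center per cluster, across all four dimensions

# Assign each flower to the cluster whose center is closest (Euclidean distance)
preds = model.predict(X)
print(preds[:10])
```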
If I want to plot that, here is what it looks like. I've plotted the cluster centers here: this is cluster center number zero, this is cluster center number one, and this is cluster center number two. I'll let you verify that the classification is based on the closeness of the points. Here we are only plotting sepal length and petal length, but the closeness is actually measured across all four dimensions, which is why the picture may not look perfectly separated here. If you measure the closeness across all four dimensions, all of these flowers have been detected as belonging to this cluster, all of these belong to this cluster, and all of these belong here. And that looks pretty similar to the chart we had earlier, the scatter plot colored by species: there was one species of flower here, then a second species and a third. It's possible that there is some misclassification, but you can already see the power of clustering.

Imagine now that these were not flowers but attributes of customers visiting a website, and we took four attributes: how long they stayed on the site, how many things they clicked on, how far down the page they scrolled, and where they came from, or something like that. We take four such attributes, cluster our customers based on them, and we get these three clusters. We can then look into each cluster and figure out that maybe these customers spend very little time on the site, maybe these customers spend a decent amount of time and also scroll quite far, and maybe these customers are actually making purchases. Then maybe we can go and interview some of those customers and decide to focus more of our marketing efforts on them: understand their demographics, the products they look at, the celebrities they follow, maybe get one of those celebrities to endorse our products, and thereby attract more customers who fall into that cluster. In general, you want to grow the cluster of your paying users, and you probably want to spend less effort on the people who are not really interested in your product. That's how this extends to real-world analysis. You can think of clustering as a data analysis tool: sometimes you can just take the data, cluster it, present your observations, and use that as a starting point for further analysis. But technically speaking, because it is figuring out patterns from data, it is a machine learning algorithm.

I also mentioned the goodness of fit. For the total variance of these three clusters, we take the variance of all these points, of all these points, and of all these points, across all four dimensions, average the variance across the dimensions, and then add up the variances of the individual clusters. This total variance across all the individual clusters is called the inertia. Remember, variance tells you the amount of spread in the data, so the less spread there is within the clusters, the better the goodness of fit. Here it turns out that we have an overall inertia of about 78.

Now let's try creating six clusters instead of three. Here we have KMeans with the number of clusters set to six, and we call predict. You can see that it has made a bunch of predictions, and this is what the clusters look like: we have a couple of clusters here, so this one got broken down, then a couple of clusters here, and a couple of clusters here. So even with six clusters, it basically took those three clusters and broke each of them into two.
And maybe now, if you go back and look at the actual species, and maybe at the actual flowers, you may realize that these are fully grown setosa flowers and these are young setosa flowers, these are fully grown virginica flowers and these are young virginica flowers, or things like that. So when you do clustering, you may uncover interesting structure even if you already have some classes or labels for your data. So here's what we get with six clusters, and you can check the inertia here; the inertia should be lower. If we check model.inertia_, you can see that it is about 39 instead of 78, so these clusters are definitely tighter, because the total overall variance across all clusters is lower.

Now, in most real-world scenarios there is no predetermined number of clusters. In such a case, what you can do is take a sample of the data (if you have millions of data points, maybe just take a hundred or a thousand), try different numbers of clusters, and for each value of K, the number of clusters, fit the model and compute its inertia, the total overall variance, and then plot the inertia. What we're going to do here is take cluster counts from two to ten, try them all out, and plot the number of clusters on the x-axis and the inertia on the y-axis. That creates a graph like this, called the elbow curve. It may not always be this nice decreasing curve; in a lot of cases it will flatten out, because beyond a certain point creating more clusters won't really help. What you can then decide is that the point at which things start to flatten out is probably the right number of clusters. In this case I would say there's a big decrease going from two clusters to three, three to four, and maybe even four to five, but around six things start to flatten out, so for this data maybe I should just go with six clusters. This is the kind of analysis you can do, and you don't need to use the entire dataset to do it; do the analysis and pick the point where you get this elbow shape. In a lot of cases you will end up with a graph that looks like this: here you have the value of K, the number of clusters, here you have the inertia, and this is what the curve typically looks like, so you want to pick either this point or this point as the number of clusters. I can't tell you in advance whether you need three clusters or five clusters, but you draw this graph, and based on it you figure out where the elbow point is. So that's the k-means algorithm.
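A sketch of that elbow-curve loop (again using X from earlier; the exact plotting style in the lesson notebook may differ):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for k = 2..10 and record the inertia (total within-cluster variance) each time
options = range(2, 11)
inertias = []
for n_clusters in options:
    model = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    inertias.append(model.inertia_)

plt.plot(options, inertias, '-o')
plt.xlabel('No. of clusters (k)')
plt.ylabel('Inertia')
plt.title('Pick k near the point where the curve starts to flatten')
plt.show()
```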
If you remember the algorithm, one thing we said was that we repeat the whole thing with random starts many times: pick K points, use them as cluster centers, classify all the points, compute the centroids, reclassify, and repeat that process with many random restarts. That may not be a good idea if you are working with a really large dataset, because it can get really, really slow. That's where a variation of k-means called the mini-batch k-means algorithm comes in.

In the mini-batch k-means algorithm, instead of classifying all the points at once, you pick just a fixed number of points, called the mini-batch size. You pick, say, a hundred points, compute their assignments and centroids, and then you pick the next hundred points, and for those you start from the previous centroids rather than from random points. So that's the small change: each time you pick about a hundred or two hundred or three hundred points, whatever the batch size is, and use them to update the centroids, or the clusters, from the previous batch. This is a good point to go and read more about mini-batch k-means clustering and figure out how it differs from the traditional k-means clustering we've just looked at.

Here's a dataset for you to try: the Mall Customers dataset, which has information about a bunch of customers who visited a mall. You can try performing k-means clustering by downloading this dataset and using the KMeans class from scikit-learn, and then study the segments: once you have the cluster assignments, add them as a new column in the data and do some more analysis, for example look at what the spending and age groups look like for each cluster. Do check this out, and you can also try comparing k-means clustering with mini-batch k-means clustering on this dataset (a small sketch of the mini-batch variant follows below).

One other thing you can try is configuring how many iterations the k-means algorithm runs for. By default the maximum number of iterations is 300; it controls how many rounds of assigning points and recomputing centers are performed while searching for good cluster centers, and you can set it to any value you want. See what kind of impact that has on the quality of the clusters. All right, so that's the k-means clustering algorithm.
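A minimal sketch of the mini-batch variant mentioned above (the batch_size value is just an example):

```python
from sklearn.cluster import MiniBatchKMeans

# Mini-batch k-means updates the centers using small random batches instead of the full dataset,
# which is much faster on large datasets
mb_model = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=42)
mb_model.fit(X)

print(mb_model.cluster_centers_)
print(mb_model.inertia_)
```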
Now let's talk about a couple more clustering algorithms, fairly briefly. The next one, again a very common one, is called DBSCAN, which is short for density-based spatial clustering of applications with noise. That's a mouthful, but it's actually not that complicated a technique; it's fairly straightforward once you understand the basic steps involved. It uses the density of points in a region to form clusters. It has two main parameters: a parameter called epsilon (we'll see in a moment what that means) and another parameter called min_samples, and using these two parameters it classifies each point as a core point, a reachable point, or a noise point (an outlier).

So let's try to understand how DBSCAN works by looking at this example. For the moment, forget all the circles, arrows and colors; just imagine you have these points: a point here, here, here, and so on. Here's how DBSCAN works. First, you set an epsilon, and of course all of this is on a coordinate plane; let's say we're still talking about petal length and petal width in two dimensions. Let's say we set epsilon to 0.5, and let's say we set min_samples to four. Now look at this point; you can start at any point. Take this point and, around it, draw a circle with radius epsilon, about 0.5. Then we check whether that circle contains at least four points, including the point itself. It does: one, two, three and four. If a point has at least min_samples points within its circle, we say it is a core point. So this is a core point; let me mark all the core points dark. Now we've classified a core point, and everything else within its circle becomes part of the same cluster, so these three points are part of the same cluster.

Then let's go to this point and draw the epsilon circle around it too. Once again you have one, two, three and four points, so this point is also a core point, and since it is within epsilon of the previous core point, it belongs to the same cluster. Let's do this one: if you draw a circle around it, four points lie inside, so this is also a core point, and all of these are part of the same cluster. And this one also turns out to be a core point, connected to the previous ones, and this one too. Like this, you keep connecting core points.

There will be some points which are not core, like this point right here. It is not a core point, but it is connected to a core point: it lies within the circle of a core point, but if I draw the circle around this point itself, it does not contain min_samples points. This is called a reachable point, or an edge point: it's on the edge of the cluster, not a core point, but still part of an existing cluster. Similarly for this one: if you draw the circle around it, it would look something like this, so it's not a core point, but it is still a connected edge point. This way we have now identified one cluster of points.

Once we're done with those, let's say there is another group of points here, where you have these four core points all connected to each other, and these two connected to them, so you have two edge points in this cluster as well, and four core points; this becomes another cluster. This point here, however, is neither a core point nor an edge point: it does not have min_samples points close by, and it is not connected to a core point. So this is called an outlier, or sometimes a noise point. That's what the triangles, colors and lines in the diagram represent.
So: you have core points, which have at least min_samples observations within their epsilon radius; you have noise points, which are not connected to any cluster; and you have edge points, or reachable points, which are connected to core points but are not core points themselves. That's what DBSCAN does.

Again, the way to use DBSCAN is simply to import DBSCAN from sklearn.cluster. You can see in the signature that you can configure epsilon (eps), you can configure min_samples, and you can configure how the distance is calculated; by default it is Euclidean. I was showing you two dimensions, but because we have four dimensions, it's going to take the square root of the first dimension's difference squared, plus the second dimension's difference squared, plus the third dimension's difference squared, and so on: the extension of the Pythagorean theorem. There are some other arguments too, and I'll let you look those up. Then you instantiate the DBSCAN model with the epsilon and the min_samples, and you fit it to the data.

Now, here's one thing about DBSCAN: there is no prediction step, because the definition of a core point depends on which points are present in the dataset. In k-means you're trying to figure out the center of a cluster, but in DBSCAN there is no cluster center; the cluster is defined by the connections between points. So you cannot use DBSCAN to classify new observations; DBSCAN simply assigns labels to all the existing observations. You can check model.labels_, and it will tell you which labels were assigned when we performed DBSCAN clustering on the inputs that were given (all of them need to be considered to perform the clustering). All of these got assigned zero, and all of these got assigned one. You can also check which ones are core points, which are reachable points and which are noise points. A good way to check all the attributes and methods available on a particular model is to use dir(). So here you have core_sample_indices_, which tells you the indices of the core samples, and you can see that these are all the core points. Here it seems like most points are core points; maybe we can try changing the epsilon value, reducing or increasing it, and that might show that not all points are core points.

Let me do a scatter plot here, now using the sepal length and the petal length, and as the hue I'm using model.labels_. You can see that only two clusters were detected here, zero and one, and maybe if we changed the value of epsilon, or the value of min_samples, that number would change.
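Here's roughly what the DBSCAN usage described above looks like (the eps and min_samples values are illustrative, not necessarily the ones used in the lesson notebook):

```python
import seaborn as sns
from sklearn.cluster import DBSCAN

# eps is the radius of the circle drawn around each point;
# min_samples is how many neighbours (including the point itself) make a core point
model = DBSCAN(eps=1.1, min_samples=4)
model.fit(X)

print(model.labels_[:10])               # cluster label per flower; -1 marks noise points
print(model.core_sample_indices_[:10])  # indices of the core points

# Visualize the detected clusters on two of the four dimensions
sns.scatterplot(data=iris_df, x='sepal_length', y='petal_length', hue=model.labels_)
```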
So that's an exercise for you: try changing the value of epsilon, the radius of the circle drawn around each point, and try changing the value of min_samples, which decides when something is treated as a core point. Try wide ranges: for epsilon maybe try values close to zero and values close to a hundred (epsilon does not have to be between zero and one; it can be very large), and similarly for min_samples maybe try a value of one, maybe try a hundred. Experiment and figure out how each of these hyperparameters affects the clustering, and see if you can get to the desired clustering, which ideally groups these points according to the species the flowers belong to. So that's the DBSCAN algorithm.

The natural question you may have is: when should you use DBSCAN and when should you use k-means? Here's the main difference. K-means uses the concept of a cluster center, and it uses distance from a cluster center to define the cluster. DBSCAN, on the other hand, uses the nearness between the points themselves to create a cluster. Here is an example: if you have data like this, say with an x-axis and a y-axis, you can visually tell that this is the right clustering; the outer points are all connected and seem to be one cluster, and the inner points are all connected and seem to be another cluster. DBSCAN can do this, because it is concerned with the nearness between points, but k-means cannot identify these two clusters: if you set this as a cluster center, then any cluster that includes this outer point would also need to include all of these inner points, because the inner points are closer to the center than the outer point is. So there's no way to create a cluster out of the outer ring using k-means. This is what k-means clustering would look like instead: maybe you'll end up with one centroid here and one centroid here, so half the points go here and half the points go there. Here is another example: you have these two horseshoe shapes, and again DBSCAN is able to detect them but k-means is not. And there are a few more examples for you to check out.

One other difference between DBSCAN and k-means is that in k-means you can specify how many clusters you want, but DBSCAN figures that out on its own; you can only change the values of epsilon and min_samples to indirectly affect the number of clusters that get created. So just keep in mind: whenever you want to detect patterns which are more about the nearness of the points themselves, DBSCAN may make more sense, but if you want a distance-from-center-based clustering technique, then use k-means. One more thing: you can classify new points into clusters using k-means, but you cannot use DBSCAN to classify new points; you would have to run the entire scan again, because it's possible that with the introduction of a new point, two clusters may join together or change in some fashion.
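To see the difference concretely, here's a small sketch on synthetic ring-shaped data (the dataset and the parameter values are mine, for illustration only):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Two concentric rings: density-based DBSCAN separates them, center-based k-means cannot
points, _ = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit(points).labels_
km_labels = KMeans(n_clusters=2, random_state=42).fit_predict(points)

print(set(db_labels))   # DBSCAN finds the two rings (a -1 label, if present, marks noise)
print(set(km_labels))   # k-means just splits the plane into two halves
```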
One last clustering technique I want to talk about is hierarchical clustering. As the name suggests, it creates a hierarchy, or a tree, of clusters, not just a flat set of clusters. What does that mean? As you can see in this animation, we have a bunch of points. First we take the two closest points and combine them into a cluster. Then we combine the next two closest things, which in this case turns out to be the cluster we just formed and another point. In this way we build up a tree of clusters: at the bottom of the tree are the individual points, above them you have clusters of two points, above those you have clusters of three or four points, and above those you have clusters of clusters of points, and so on.

This is the kind of thing that can typically be used to generate a taxonomy. For example, if you have observations about many different types of animals and you start performing clustering, you may find a very close relationship between humans and chimpanzees, and then between humans, chimpanzees and bonobos there is another relationship, which is what is captured by this node. Then between this family and, let's say, the rest of the mammals there is a relationship captured here, and over here you have the relationships between plants, and finally at the top you have a single cluster.

Here is how hierarchical clustering works. First, you mark each point in the dataset as a cluster by itself, so all of these points, P0 to P5, are clusters. Then you pick the two closest cluster centers without a parent and combine them into a new cluster; the new cluster is the parent cluster of the two, and its center is the mean of all the points in it. Then you repeat that step: pick the two closest cluster centers in the dataset without a parent. This time you could be combining the center of an existing cluster with a leaf point, and that then becomes their parent cluster. You keep picking the two closest cluster centers that do not already have a parent, and that's how you get to the top level. The structure you end up with is called a dendrogram, and that's what this diagram shows, with all the cluster centers marked. For each type of clustering I've also included a video that you can watch for a detailed visual explanation, if you want to go deeper and follow along on your own. scikit-learn lets you implement hierarchical clustering, so you can try it on the Iris dataset; I'll let you figure that out (there's a small sketch below to get you started).

So we've looked at three types of clustering: k-means clustering, DBSCAN, and hierarchical clustering. There are several other clustering algorithms in scikit-learn, about ten or so, so you can check those out too. With that, we'll close our discussion of clustering. It was by no means exhaustive, but I hope you've gotten a sense of what clustering is, how a couple of common clustering algorithms work, and how to use them.

There's a question: can k-means be used for clustering geo-locations? Yes, absolutely. If you have a bunch of geo-locations and you want to cluster them, simply use the latitude and the longitude, the two columns of data, put them into k-means, give it the number of clusters, and you can find clusters of geo-locations. That's a great idea. One thing you could also do is plot those points on a map, color them according to the cluster, and also show the center of each cluster on the map.
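As a starting point for the hierarchical clustering exercise mentioned above, here's a sketch on the Iris measurements using scikit-learn's AgglomerativeClustering, plus SciPy for the dendrogram (the number of clusters and the linkage method are my assumptions):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Bottom-up (agglomerative) clustering: repeatedly merge the two closest clusters
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)
print(labels[:10])

# The full hierarchy can be drawn as a dendrogram, the tree structure described above
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()
```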
So let's talk about dimensionality reduction. In machine learning problems we often encounter datasets with a very large number of dimensions, and by dimensions we mean the number of features or number of columns. Sometimes they go into dozens, sometimes into hundreds, especially when you're dealing with sensor data. For example, let's say you're trying to train a machine learning algorithm based on the data of an actual flight, a flight that started from a certain point and ended at a certain point. Aircraft have hundreds or sometimes thousands of sensors, the same kinds of sensors at many different places, and if you just collect the information from all of these sensors, you will end up with thousands of columns. That may be a very inefficient thing to analyze and a very inefficient thing to train machine learning models on, because more columns means more data, which means more processing, which requires more resources and more time, and it may significantly slow down what you're trying to do.

One thing we've typically done in the past is to just pick a few useful columns and throw away the rest, so that we can quickly train a model. But what if you didn't have to throw away most of the information? What if you could reduce the number of dimensions from, let's say, a hundred to five without losing a lot of information? That is what dimensionality reduction and manifold learning are all about.

So what are the applications of dimensionality reduction? First, reducing the size of data without much loss of information. Let's say you have a dataset with 200 columns. What if you could reduce those 200 columns of data to just five columns without losing much information? What if those five columns could still retain 95% of the information? (We'll talk about what we mean by information retention shortly.) If you could do that, you can already see that anything you do on this dataset is going to be roughly 40 times faster. So are you willing to trade maybe 5% of the information for a 40x speedup? Probably yes, right? That's one reason to do dimensionality reduction: it allows you to train machine learning models efficiently. Another very important application of dimensionality reduction is visualizing high-dimensional data in two or three dimensions. As humans we can only visualize in three dimensions, and even three dimensions can get a little tricky, since 3D scatter plots are quite hard to read. So at least right now, looking at screens, we are most comfortable looking at data in two dimensions. Even with the Iris dataset we've seen the problem that we have four dimensions but can only really visualize two at a time, right? So visualizing high-dimensional data in two or three dimensions is also an application of dimensionality reduction.

So let's talk about a couple of techniques for dimensionality reduction. The first one is principal component analysis. Principal component analysis is a dimensionality reduction technique that uses linear projections.
We'll talk about what linear projections mean: PCA projects the data to reduce dimensions while still attempting to maximize the variance of the data in the projection. So that's what this is about, projecting data while still maintaining as much of the variance as possible, and this is what it looks like. Let's say you have this data, where x1 represents petal length and x2 represents petal width, and you have all these points; ignore the line for a moment. Now, instead of keeping x1 and x2, what if we could simply draw this line, set some point on the line as the zero point, and project all these points onto the line, so that we just look at how far along the line each projection falls? Let's say we set this as the zero point; then we simply give this point the value -1, this point would be -1.8, this one -2, this one -2.7, and on the other side this one would be maybe 0.1, this 0.3, this 1, this 2, this 2.3, this 2.4. Okay. So now we have gone from this data in two dimensions to this data in one dimension.

So how do we do that? Let's take a look. Say we have some data, and let's look at just petal length and petal width, two dimensions for now. So we have point 1, point 2, point 3, point 4 and so on, and for each one we have some petal length and petal width. We take that and we plot it, so we plot these points here, and maybe it looks a bit like this; there are more points here, so this is what it looks like. Now, what we would rather want is just one value per point. We don't know exactly what that new column represents, so let's just call it PC1. So for points 1, 2, 3, 4, 5 we just want to see one value each, instead of these two columns of data. And if we can go from two columns to one column, then we should be able to go from three columns to one or two columns, or from 200 columns to five columns; the idea remains the same.

So how do we do that? Well, the first thing we do is shift our axes a little bit to center the points. What do I mean by shifting the axes? We take the x-axis and the y-axis and move them here, somewhere into the middle of the points. What exactly is meant by moving the axes? In the new axes, you'll notice that the coordinates of the points have changed: we are calculating the mean x-coordinate of all the points and simply subtracting that mean from every point's x-coordinate, and similarly we subtract the mean of the y-coordinates. So what we do for each point is take the point P and subtract from it the average point: P has an x and a y, and the mean also has an x and a y, so the average x is subtracted from the point's x, and the average y from the point's y. That is going to center the points: some of the points will now have negative values and some will have positive values. Then we try out a candidate line, a line onto which to project, and we project all of these points onto the candidate line, one by one. Okay, so now we have projected all of the points onto the candidate line. Once we've projected things onto the candidate line, let me for a moment just get rid of the x and y coordinates.
Now that we know the points are centered, let me just move the axes away. So now that we have all these projections and we know where the zero point is, we can see that this point can be represented by its perpendicular projection onto the line, this point by its projection, and this one by its projection. So if this is zero, this is one, this is two and this is three on the new line, we can now represent each point using a single number, which is the distance of its projection from the zero point on the projected line. So we can start filling in these values: the distance for point 1, the distance for point 2, for point 3, for point 4. All right, so we've now reduced the data from two coordinates to one coordinate, but we want to retain as much of the information, or as much of the variance, from those two coordinates as possible.

So here's what we try to do: we try to maximize D1² + D2² + D3² + ..., where D1, D2, D3 are those projected distances. Keep in mind that we have already subtracted the mean, and now we are summing the squared distances, so this quantity ultimately turns out to be the variance of the projections. What we want to do, then, is try different options for this line: we take the same line and rotate it, and each time we rotate, all the Ds change because all the perpendicular projections change, and we pick the rotated line for which the sum of squares of all the Ds is the highest. Just to clean this up a little bit: once we have these centered points, you can see that if I pick this line, all of the projections are very close to zero, and if all of the projections are very close to zero, that means most of the information is lost, because D1, D2, D3, D4 are all close to zero. On the other hand, if we pick a line that goes like this, you can see that the projections are spread out quite far, so we are capturing the spread of the data very nicely for a well-fitting line, and not capturing it at all for an ill-fitting line. That's what PCA tries to figure out. (There's a small NumPy sketch of this rotate-and-pick-the-best-line idea just below.)

So that's going from two dimensions to one dimension. Here is an example of going from three dimensions to two dimensions. Here we have feature 1, feature 2 and feature 3 (ignore the word "gene" here), and then you have points. What we do is first find PC1, the best possible line to project onto, the one which preserves the highest variance. Then PC2 is a line which is perpendicular to the first line. Remember we are in three dimensions here, so there are an infinite number of lines perpendicular to PC1; again, we pick the one which maximizes the variance of the points when they are projected onto PC2. So for three dimensions you can have PC1 and PC2; when you have four dimensions you can have PC1, PC2 and PC3; when you go to five dimensions you can have one more, and so on.
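To make that "rotate the line and keep the direction with the highest variance" idea concrete, here's a tiny NumPy sketch (not from the notebook, and deliberately brute-force; real PCA solves this analytically):

    import numpy as np

    # A few 2D points, e.g. (petal length, petal width)
    X = np.array([[1.4, 0.2], [4.7, 1.4], [5.1, 1.9], [1.3, 0.2], [4.5, 1.5], [5.9, 2.1]])
    X_centered = X - X.mean(axis=0)            # step 1: subtract the mean of each column

    best_dir, best_var = None, -1.0
    for angle in np.linspace(0, np.pi, 180):   # try candidate lines at different angles
        direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the line
        distances = X_centered @ direction     # D1, D2, ...: projections onto the line
        variance = np.mean(distances ** 2)     # the quantity PCA tries to maximize
        if variance > best_var:
            best_dir, best_var = direction, variance

    print("Best direction:", best_dir)
    print("Variance captured along it:", best_var)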
And then, if you have 200 dimensions, you can just choose the five most relevant axes of variance. Remember, each of these is a line that preserves the highest possible variance. So let's say you have 200 dimensions: you can do a principal component analysis and reduce them down to the five dimensions along which the variance of the data is preserved as much as possible. So that's principal component analysis for you.

How do you do principal component analysis in scikit-learn? Let's take this dataset again, and we're just going to pick the numerical data, no species for now, so we just look at these numerical columns. We simply import the PCA model from sklearn.decomposition, we create the PCA model, and we provide the number of target dimensions, which in this case is two. Then we call fit: we say pca.fit and we give it the data. So we are going here from four dimensions to two dimensions. And what are those two dimensions? At this point we can't really interpret them as physical dimensions. It's not that we've picked two of the original columns; we've picked two possible linear combinations of these four features. Remember, because lines and projections are involved, everything is still linear. And those two linear combinations are independent, which means the lines along which we projected are perpendicular. So we have picked two independent linear combinations of these four variables, and it is the projections onto those lines that we are left with.

So what just happened when you called fit? The lines got calculated. PCA now internally knows what line number one and line number two are, and you can look at that information: if you look at pca.components_, you can see some information about the lines. So these are the components: these four numbers convey the direction of the first line, and these four numbers convey the direction of the second line, in four-dimensional space. Now that we have the components, we can project the points onto these lines by calling pca.transform. If we give pca.transform this data, iris_df[numeric_cols], which has four dimensions, it is going to give us transformed data in two dimensions, which you can see here. This column is the data projected onto line number one, which has this direction, and this column is the data projected onto line number two, which has this direction. If you know a little bit of linear algebra, these components are both unit vectors in the directions of the lines that were picked. So we've gone from this to this, and now let's plot it: when we do a scatter plot of this column against this column, we can finally visualize all four dimensions in two dimensions. Of course not perfectly, something is lost for sure, but we can still visualize information from all four dimensions. Right now I've colored the points by species, but we could just as well have colored them by the clusters that were detected. The rough shape of this PCA code is sketched below.
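Here's roughly what those steps look like in code (a sketch, assuming the seaborn copy of the Iris dataset and its column names; the notebook's variable names may differ):

    import seaborn as sns
    from sklearn.decomposition import PCA

    iris_df = sns.load_dataset('iris')
    numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

    pca = PCA(n_components=2)                   # reduce 4 numeric columns to 2
    pca.fit(iris_df[numeric_cols])
    print(pca.components_)                      # two unit vectors (directions) in 4D space

    transformed = pca.transform(iris_df[numeric_cols])   # shape (150, 2)

    # Visualize information from all four dimensions in a single 2D scatter plot
    sns.scatterplot(x=transformed[:, 0], y=transformed[:, 1], hue=iris_df['species'])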
So let's look at the DBSCAN clusters, for example — yeah, these are the DBSCAN clusters. So now we can better visualize the clusters that we generate from clustering. That's another benefit of dimensionality reduction: it lets you visualize, and maybe evaluate, the results of clustering.

Now, of course, principal component analysis has some limitations. The first limitation is that it uses linear projections, which may not always achieve a very good separation of the data. One thing we noted already was that if you pick a certain kind of line, information gets lost because most of the projections fall in the same place. Here's one problem I can point out with principal component analysis: if you have a bunch of points here and a bunch of points there, arranged so that they all project to exactly the same value on the chosen line, then some information definitely gets lost. Suppose those two groups of points belong to different classes: as soon as you do PCA and then try to train a classification model, you are going to lose some information, right? So there are some limitations with PCA, but for the most part it works out pretty well. I would use PCA whenever you have hundreds or maybe thousands of features and you need to reduce them to a few features.

As an exercise, I encourage you to apply principal component analysis to a large, high-dimensional dataset. Maybe take the House Prices Kaggle competition dataset and try to reduce the numerical columns from 50 or so to fewer than five, and then train a machine learning model using the low-dimensional results. What you want to observe is the change in the loss and in the training time for different numbers of target dimensions. And that's where we come back to this: if you could trade 200 columns for five columns, for maybe a 5% loss in variance — now we know that by "information" we mean variance — that would give you a 40x speedup, and maybe that speedup could be a make-or-break for your analysis, because now you can analyze 40 times more data in the same time. So when you have really large datasets, PCA is definitely a very useful thing to do. There is also this Jupyter notebook you can check out; it goes into a lot more depth about principal component analysis. The way PCA is actually computed is with a technique called singular value decomposition, or SVD. There's a bunch of linear algebra involved there, but the intuition, the process that is followed, is exactly the same: you find a line, you rotate the line to the point where you get the maximum variance, then you find the next line, which is perpendicular to this line and again preserves the maximum possible variance, and you keep going from there.

Okay, let's talk about another dimensionality reduction technique, called t-distributed stochastic neighbor embedding, or t-SNE for short. This belongs to a class of algorithms called manifold learning, and manifold learning is an approach to performing nonlinear dimensionality reduction. PCA is linear, and there are a couple more linear techniques, like something called ICA, LDA, etc. They're all linear in the sense that they come down to some sort of linear algebra, matrix multiplications, and there are sometimes limitations with linear methods.
So you can use some of these nonlinear methods for dimensionality reduction. They're typically based on the idea that the dimensionality of many datasets is only artificially high, and that most of these datasets, which have a hundred, 200 or 500 columns, can really be captured quite well with four or five columns. We just have to figure out how to come up with those four or five columns of data, whether through some formula applied to all the columns or through something like feature engineering, except that the computer is trying to figure out these features for you based on the rules built into these different kinds of models.

So you have a bunch of different manifold learning techniques in scikit-learn, and you can see here: this is the original dataset, plotted in 3D, and they've colored the points just to give you a sense of which point goes where. When you apply these different manifold learning techniques, this 3D dataset collapses to this kind of a graph in 2D. You can see here that this one is basically able to separate out the red from the yellow from the green from the blue, which would be very difficult for PCA to do; if you tried to draw two lines and project onto them, it would be very difficult to get a separation like this. But Isomap is able to do that, and you can see with t-SNE as well that it is able to separate out the red from the yellow from the green from the blue. These are all the different kinds of separations you get from the different techniques.

Now, t-SNE specifically is used to visualize very high-dimensional data in one, two or three dimensions, so it is meant for visualization. Roughly speaking, this is how it works; we're not going to get into the detailed internals because that's a little more involved and would take a lot more time to cover properly. Suppose you have, again, these clusters of points, say with petal length on one axis and petal width on the other. If you just directly projected them onto this line, all of these blue points would overlap with a bunch of orange points, and those would overlap with a bunch of red points. What t-SNE tries to do is first project the points onto a line and then move them around, and it moves the points around using a kind of nearness rule: every point that is projected down onto the line is moved closer to the points that are close to it in the original dataset, and moved away from the points that are far away from it in the original dataset. I know that's not a very concrete way of putting it, but here's what it means: as you project this blue point down, it ends up here, and as you project this orange point down, it ends up here. What t-SNE will then realize is that the blue point should be closer to the other blue points and farther away from the orange points, because that's how it is in the original dataset. So it's going to move the blue point closer to the blue points, and it's going to move the orange points closer to the orange points. Okay.
So the closeness in the actual data is reflected as closeness in the dimensionality-reduced data. Just keep that in mind, and if you have that intuition, you'll be able to figure out when you need t-SNE: when you need to maintain the closeness between points no matter how much you reduce the number of dimensions, t-SNE is useful.

And here is a visual representation of t-SNE applied to the MNIST dataset. The MNIST dataset contains 60,000 images of handwritten digits: 28 pixel by 28 pixel images of the handwritten digits zero to nine. Each pixel, remember, simply represents a color intensity, like red, green or blue, or in this case a shade of gray, so each pixel is a number representing how dark that particular pixel is. That means each image is represented by 784 numbers, 28 times 28. So we can take those 784 dimensions, use t-SNE to reduce the data to two dimensions, and then plot it, and when we plot it, this is what we get: t-SNE is able to very neatly separate out all the images of the digit zero from all the images of the digit one, from all the images of the digits 2, 3, 4, 5 and so on. I encourage you to check out this video, and there's also a tutorial linked below on how to actually create this graph. It will take you maybe an hour or so to download the dataset, look at some samples and create the graph. But what we're trying to get at here is that t-SNE is very powerful for visualizing data, so whenever you have high-dimensional data, use t-SNE to visualize it.

Now, t-SNE does not work very well if you have a lot of dimensions; even 784 dimensions is not ideal to reduce directly to two. So typically what ends up happening is that you first take, let's say, those 784 dimensions and perform PCA to reduce them to about 50 dimensions, and then you take those 50 dimensions and reduce them to two dimensions with t-SNE for visualization. Now, the two-dimensional data you get out of t-SNE is not that useful for doing machine learning or even for data analysis, but it is useful for visualization, because it shows you which points are closer together in the original data; that's what it is trying to tell you. So that's why you'll see t-SNE used as a visualization technique alongside a lot of different machine learning algorithms.

And how do you perform t-SNE? Exactly the same way as PCA: you just import the TSNE class, and then you set the number of components, the number of dimensions that you want, and we want to take this four-dimensional data and transform it to two dimensions. Now, with t-SNE there is no separate fit and transform step; both of them are combined into fit_transform, because the closeness of the points is what matters. In some sense, you don't use t-SNE on new data; you only use t-SNE on the data that you already have. So that's why we are doing a fit_transform here, and that gives us the transformed data: we've gone from four dimensions to two dimensions. A rough sketch of these steps follows.
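Here's a sketch of that t-SNE step (again assuming the seaborn Iris column names):

    import seaborn as sns
    from sklearn.manifold import TSNE

    iris_df = sns.load_dataset('iris')
    numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

    tsne = TSNE(n_components=2)                              # target: 2 dimensions
    transformed = tsne.fit_transform(iris_df[numeric_cols])  # no separate fit/transform step

    sns.scatterplot(x=transformed[:, 0], y=transformed[:, 1], hue=iris_df['species'])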
And when we plot it, you can see that t-SNE has really separated out the points: you have one class here, and then you have these two classes here. It may well be that these points were actually quite near each other in the original dataset, and that is why they are near each other here, while these points were far away from each other in the original dataset, which is why they're far apart here. So the takeaway is: PCA is good when you're doing machine learning, and t-SNE is good when you want to visualize the results. As an exercise, try using t-SNE to visualize the MNIST handwritten digits dataset; I've linked to the dataset here.

Okay, so with that we complete our discussion of unsupervised learning, or at least two aspects of it: clustering and dimensionality reduction. Again, there are many more ways to perform clustering in scikit-learn, and many more ways to perform dimensionality reduction, and all of them have different use cases, but in general you would mostly just start out by using K-means for clustering and PCA for dimensionality reduction. In a lot of cases a better clustering algorithm or a better dimensionality reduction algorithm may give you a slight boost, but you should be fine just using the most basic ones to begin with, and do check out some of these resources to learn more.

So that brings us to the end of this notebook on unsupervised learning. Now let's go back to the lesson page, scroll down once again to the lesson notebooks, and open up the second notebook. This is a notebook on a topic called collaborative filtering, which is a common technique used to build recommender systems. Once again, let's click the Run button and select "Run on Binder", and here we have the notebook running on Binder. Now that I've run this notebook, I'm just going to run Kernel > Restart & Clear Output to remove all the previous outputs, hide the header and toolbar, zoom in, and we are ready to get started.

So we are going to talk about this technique called collaborative filtering, using a library called fastai, though there are a bunch of other libraries for doing the same thing, and we'll try to build a state-of-the-art movie recommendation system with about ten lines of code. As with scikit-learn, and as with most machine learning algorithms, you don't really have to implement the internals; you simply have to use the library. But you do have to know what parameters to change, and it helps to understand how these algorithms work so that you can pick the right algorithm for the job. You combine that intuitive understanding with being able to manipulate the code well, and that's what makes you a good machine learning practitioner. As you go deeper, you can learn more about the math involved, but it's the practical aspect that is more useful.

So, recommender systems are at the core of pretty much every online system we interact with. Social networking sites like Facebook, Twitter and Instagram recommend posts you might like, or people you might know or should follow. Video streaming services like YouTube and Netflix recommend videos, movies or TV shows you might like. Online shopping sites like Amazon recommend products you might want to buy. In fact, there is a big debate right now about how much personalization there should be, because it sends people into echo chambers on Twitter, or filter bubbles on Facebook and other places.
And there is also a question about how much of this recommendation and targeting should be applied to advertising: at what point does it become manipulation? So it's tricky territory, but at the very least we should try to understand how some of these systems work, so that we can build intuition about how these algorithms behave and maybe counter them in the cases where we need to.

There are two types of recommendation methods. First, there is content-based recommendation. Let's say I watch a movie: Netflix has a lot of attributes about movies, like when the movie was released, who directed it, which language it's in, what its budget was, how long it is, what genre it belongs to, and so on. Netflix can then recommend to me similar movies in the same genre, or movies by the same director, or movies with the same actors. That is called content-based recommendation. The other kind of recommendation is called collaborative filtering, and collaborative filtering is a method of making predictions about the interests of a user by collecting preferences from many users. The underlying assumption in collaborative filtering is that if a person A has the same opinion as a person B on one issue, then A is more likely to have B's opinion on a different issue than the opinion of a randomly chosen person. That's the Wikipedia definition. To put it in Netflix terms: if I look at your watch history, and I look at ten other people who have a similar watch history, and I pick a movie that they have watched and you haven't, it is very likely that you will like that movie. That's a very different way of thinking about recommendation, because now we're no longer asking what the genre is, what the language is, who the director is, who the actors are. Now we are saying: this person likes a few movies, there are other people who also like these kinds of movies, and there is a certain movie which this person has not watched but a lot of those other people have. Just based on that fact, you can make a pretty good guess, right? And that's simply because of how human beings think: we think alike in several ways, and collaborative filtering tries to capitalize on the idea that if you have the same opinion as person B on one issue, you are likely to have the same opinion as person B on a different issue as well.

There are many different algorithms that implement collaborative filtering. The key idea of collaborative filtering is that instead of looking at the content, you look at the connections between users and items. These items could be movies in the case of Netflix, products in the case of Amazon, or other users (friend suggestions) in the case of Facebook or Twitter, and so on, but on one side you almost always have users; it's a very human-centric approach in some sense. There is a library called LibRec for Java which has over 70 different algorithms for collaborative filtering. In this tutorial we will look at a relatively new technique, which is also one of the most powerful ones, called neural collaborative filtering. Okay, so here's the dataset we are going to use.
This dataset is called the MovieLens 100K dataset. It is a collection of movie ratings by 943 users on 1,682 movies. So there is a movie site, like IMDB; this one is called MovieLens, and they have well over a thousand movies listed on the site, and several users who've posted ratings for these movies. Here's what the dataset looks like: you have a user ID, then you have a movie ID, then you also have the title of the movie, and then you have the rating that the user has given to the movie. Now, of course, not every user has seen or rated every movie. If every user had rated every movie, you would have 1,682 times 943, which is more than a million ratings, but in total you only have a hundred thousand ratings. That tells you that most users have rated maybe a hundred movies out of the 1,682, so most users have watched less than 6% of the movies in the database. And that actually gives us a big opportunity: can you recommend movies to these users? Can you look at user number 42, look at all the movies they've watched, look at other users who have also watched those movies, and recommend to user 42 a movie that users like him have watched but he hasn't?

Recommending is the final target, but what objective do we set? The objective we can set is: given any random movie, can we figure out what rating user 42 is going to give that movie once they watch it? If you can predict that user 42 is going to give Wag the Dog a rating of 2, then maybe you shouldn't recommend Wag the Dog to user 42. But if you can predict that user 42 will give a rating of 4 to The Graduate, then you should recommend The Graduate to user 42. So ultimately, recommendation is more of a ranking problem: you have a bunch of items which a particular user has not seen, or bought, or accessed, and if you can rank those items by some measure — in this case an estimated rating of what the user would give each item — then you can simply recommend the highest-scoring items to the user. Keep that in mind: when we say we want to recommend things, what we want to do is rank all the unseen items, take the top three or five or so, and show them to the user. So when Facebook recommends a friend suggestion to you, or Netflix recommends a movie to you, they're actually ranking hundreds of items behind the scenes, just for you, and out of all of those they pick maybe the top five, or a random five out of the top 20, and show you those on your dashboard. That's why each time you refresh you may see a different result: they also introduce some randomization, because they know it doesn't make sense to always show you the same thing, and people like variety. So there's a combination of randomization and a ranking system here.

Okay, so that's the setup. Every user is given a unique numeric ID ranging from 1 to 943, and every movie is given a unique numeric ID ranging from 1 to 1,682. Users' ratings for movies are integers ranging from 1 to 5, 5 being the highest and 1 the lowest, and on average each user has rated about a hundred movies.
So there are roughly 1,500 other movies to choose from for pretty much every user, and what we want to do is build a model that can predict how a user would rate a movie they have not already seen, by looking at the movie ratings of other users with similar tastes. That's our objective. If we can predict how a user would rate a movie they haven't seen, then we can simply make those predictions and recommend to them the movies with the highest predicted ratings.

So the first step is to download the MovieLens dataset. It is available on the GroupLens website, and once it is downloaded we also have to unzip it, because it's a zip file. All that code is here, so you can just download and unzip the dataset, and that puts the data in this ml-100k folder. There's a bunch of information in all these files, and we will look at them one by one, but the main data is in the file u.data. If I just look at the head of ml-100k/u.data, you can see that there is this user ID, then this movie ID, then this rating, and we can ignore the last column for now; we have all the useful information we need in u.data: user ID, movie ID and rating.

So let's install the library we are going to use, fastai, pinned to a specific version, and let's import it. Now, u.data is a tab-delimited file. So far we have seen space-separated files and comma-separated files (what's called a CSV); this is a TSV file, and in this case the separator is a single character, the tab character. Pandas can read tab-delimited files as well, using the same read_csv function: we just call pd.read_csv, give it the path to the file, and specify that instead of being separated by commas the values are separated by tabs ('\t' is used to indicate the tab character), and that there is no header. It's given in the README that the columns are the user ID, the movie ID, the rating, and some kind of a timestamp, and this is what it looks like: now we have a user ID, a movie ID and a rating.

From this point on we are going to use the fastai library, and with it we are going to train our collaborative filtering model. So we need to create a CollabDataBunch; this is library-specific, so I won't worry too much about it, but we are simply putting the ratings data frame into this CollabDataBunch class, and we are also telling the library that we want to use 10% of the data as a validation set. And this is what the data looks like internally: user ID, movie ID and target. A rough sketch of this data-loading code is below.
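Here's roughly what that loading code might look like (a sketch: the file path follows the standard MovieLens 100K layout, the column names are illustrative, and the CollabDataBunch call assumes the fastai v1 API used around the time of this course):

    import pandas as pd
    from fastai.collab import CollabDataBunch

    # u.data is tab-separated with no header row: user id, movie id, rating, timestamp
    ratings = pd.read_csv('ml-100k/u.data', sep='\t', header=None,
                          names=['userId', 'movieId', 'rating', 'timestamp'])
    print(ratings.head())

    # Hold out 10% of the ratings as a validation set
    data = CollabDataBunch.from_df(ratings, valid_pct=0.1,
                                   user_name='userId', item_name='movieId',
                                   rating_name='rating')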
Now, here is our model, and the model itself is actually quite simple. What we want to do is represent each user and each movie by a vector of a predefined length N. So I take each user and represent them using N numbers; let's say N is 5. Then user number 14 is represented using five numbers, U1, U2, U3, U4, U5, user number 29 is represented using five numbers, user number 72 is represented using five numbers, user number 211 is represented using five numbers, and so on: each user is represented by a vector, which is simply a list of numbers. And each movie is also represented by a vector of the same size: movie number 27 is represented using this vector, movie number 49 using this vector, movie number 57 using this vector.

Our model then predicts the rating a particular user would give a particular movie simply by taking the dot product of the two vectors, and a dot product is simply an element-wise multiplication followed by a sum of the products. So let's say we want the predicted rating from user number 14 for movie number 27. We multiply 0.21 by -1.69, we multiply 1.61 by 1.01, we multiply 2.89 by 0.82, and so on: U1·M1 + U2·M2 + U3·M3 + U4·M4 + U5·M5, and that gives us this value, 3.25. So in our model, assuming we have these vectors for all the users and all the movies, the model's predicted rating is simply the dot product of the user vector and the movie vector.

Now, the objective for us is to figure out good vectors. Remember that not every user has watched every movie, but we do have a hundred thousand movie ratings. If we can figure out a set of user vectors and a set of movie vectors that fit the training data well — so that if I take the vector for user number 224 and the vector for movie number 925 and multiply them, I get a value close to the actual rating, and if I take the vector for user ID 502 and movie ID 217 and multiply them, I get the actual rating of 2 — that means I have a good model. So the training part of the algorithm is to figure out a good set of vectors for all the users and a good set of vectors for all the movies, so that the data we already know can be predicted well. The idea is that if your model can accurately predict the ratings for the movies that users have already watched, it should also be able to predict ratings for the movies that they have not watched. Okay, that's the idea. Here's a tiny numeric sketch of that dot-product scoring.
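For example, something like this (the vectors are made up for illustration, loosely following the numbers above; in the real model they are learned from the ratings):

    import numpy as np

    user_14 = np.array([0.21, 1.61, 2.89, -0.52, 0.34])    # vector for one user
    movie_27 = np.array([-1.69, 1.01, 0.82, 0.13, -0.40])  # vector for one movie

    # Element-wise multiplication followed by a sum of the products
    predicted_rating = np.dot(user_14, movie_27)
    print(predicted_rating)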
Now, how does the collaborative part of collaborative filtering come into the picture here? How does the interplay between different users come together? Well, think of it this way: this number in the movie vector, -1.69, gets multiplied with the corresponding entry of every user vector to produce a rating. So if a lot of users have given a particular movie a high rating, it's possible that many of those users have a high value in a certain dimension, which multiplies with the same dimension of the movie and gives a high result. Then every other user who has a high value in that same dimension will also be predicted to give that movie a high rating.

To make it more concrete, suppose this value represented the amount of romance in the movie. Right now the value is high, which means this is a highly romantic movie, whereas this other one is not a very romantic movie. Then all the users who like romantic movies will have a high value in this dimension, a high value in this column, and all the users who do not like romantic movies will have a low value in this column. Similarly, all the movies that are romantic will have a high value in this column, and all the movies that are not romantic will have a low value, or maybe even a negative value, the opposite of a romantic movie. Now, it's not that a dimension actually has to represent "romance"; we don't know what these dimensions represent, the model will figure them out. But that's how you can think of each dimension: it rates the movie on some particular attribute, romantic to non-romantic, and it also rates a user's preference on that attribute, how much they like or dislike romantic movies.

All right, so now we want to figure out these vectors; that is the whole learning process, our objective. The vectors are initially chosen randomly, so it's unlikely that the ratings predicted by the model match the actual ratings. Our objective while training the model is to gradually adjust the elements inside each user and movie vector so that the predicted ratings get closer to the actual ratings. For this, fastai has a collab_learner class, so we won't bother too much with the details, but it needs just a couple of things. We need to give it the number of factors, which is the size of each vector; remember, the user vectors and movie vectors must have the same number of dimensions, otherwise the dot product won't work. We'll use 40 factors. You can think of this as the power or the capacity of the model: the fewer factors you put in, the less flexibility you have, and the more factors you put in, the richer the range of relationships you can capture. If you put in just two or three factors, your model is forced to optimize on just one or two attributes of the movies and the users, but if you allow the model to have 40 factors, it can optimize on a whole bunch of different attributes. For example, with 40 factors, one of them could end up representing, through the process of training, whether the movie stars Brad Pitt or not, and maybe that's a big factor in making sure your predictions are good. So that's the balance there: of course, if you have too many factors, your model will overfit, your training loss will be very low, your model will basically try to memorize every single rating, and it will not generalize well on the validation set — which is why we have a validation set as well. Then you have this y_range, which simply indicates to the model that the rating predictions we want should be in the range of 0 to 5.5.
Ratings are really in the range of 1 to 5, but you give a slightly bigger range so that the model can go a little above or below. And then we also have this weight decay term, which is simply a regularization term: it ensures that the values in the vectors don't get too large. With a lot of machine learning models we generally want to keep the weights small; the numerical methods just work better when the weights are smaller. So this weight decay term is just there for regularization.

The model that fastai, or this neural collaborative filtering technique, uses is just slightly more advanced. Each user is still represented by a vector, and each movie is still represented by a vector, but there are also two bias terms included. So the rating is now calculated as the dot product of the user vector and the movie vector, plus a user bias term u_b and a movie bias term m_b. You can think of it like the bias term in linear regression: suppose a user vector were all zeros, there should still be a certain rating, and similarly, suppose a movie vector were all zeros, there should still be a certain rating, right? A bias term is always nice because it gives slightly more power to the model. And finally, remember the y_range: the model internally takes this output, U1·M1 + U2·M2 + ... + u_b + m_b, applies a sigmoid function to it, which squishes it into the zero-to-one range, and then scales it to the given y_range. So the model internally ensures that no matter what the user and movie vectors are, the rating will always fall in the 0-to-5.5 range. That's just an internal implementation detail you don't really need to worry about, but that's roughly what the model is.

So you are simply telling the library: I want you to create vectors of size 40 for users and movies, in such a way that when you multiply them together, add the bias terms, and scale the result into the 0-to-5.5 range, the predictions you get are very close to the actual data I've given you. And this data, you can check, has a hundred thousand rows, so your model has a hundred thousand ratings to optimize on, and its job is to create these vectors so that those hundred thousand ratings are predicted accurately. Then, of course, there is a loss function involved internally, because you can see that we've now framed this as a regression problem. The model is trying to predict the rating, and initially its predictions will be way off; they will be bad for all the hundred thousand pairs of users and movies that we have in ratings_df. So the model will try to make a prediction for, say, user ID 196 and movie ID 242 by combining their vectors and calculating the rating, the rating will be way off, and that will be used to compute the mean squared error, or the root mean squared error. A rough sketch of creating this learner is below.
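Continuing the earlier sketch, creating the learner might look something like this (again assuming the fastai v1 API; the exact argument names can differ between fastai versions):

    from fastai.collab import collab_learner

    learn = collab_learner(data,              # the CollabDataBunch from the earlier sketch
                           n_factors=40,      # size of each user and movie vector
                           y_range=(0, 5.5),  # predictions get squished into this range
                           wd=1e-1)           # weight decay, for regularization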
So the input data is simply the user ID and the movie ID. That goes into the model, the model looks up the vector for the user ID and the vector for the movie ID, multiplies them together, adds the biases, and applies the scaling to come up with a predicted rating. There is, of course, a real rating, so we take the predicted rating and the real rating, do this for all the input data — all hundred thousand ratings in the training set — and compute the root mean squared error. We then apply an optimization technique called gradient descent and use it to adjust all the weights: we adjust all the user vectors and all the movie vectors slightly, so that the next time we put the inputs through the model, it gives slightly better predictions. Then we compare again with the target ratings, and once again we perform gradient descent. So in each iteration we are adjusting the vectors for all the movies and all the users slightly. And that's what all of this business is here: there's a certain way of training these models, and there's a learning rate involved, which controls how much you adjust the vectors in each iteration.

What we're going to do is run five iterations, so we are going to pass all the data through the model five times, and we are going to use a learning rate of 0.1, so each time, the adjustments made to the user and movie vectors are scaled by 0.1. There's also another detail: internally, when gradient descent is performed, it is done in batches. We don't take all the data and put it into the model at once; we take batches of maybe a hundred inputs, put them into the model, get the outputs, calculate the loss, perform the optimization to improve the weights, and then take the next batch of a hundred, which is just a little faster. So when you run learn.fit_one_cycle, all it's doing is taking a batch of ratings, putting it into the model, getting the outputs, computing the loss, changing the weights, and then putting in the next batch, a hundred at a time, until we run out of all hundred thousand — that's about a thousand iterations — and then we repeat that whole thing five times. So the entire dataset, in batches of a hundred, goes through this training process five times. The number of passes is called the number of epochs, and this other value is called the learning rate. After each epoch you can see the training loss and the validation loss. These are actually mean squared errors, so the square root gives the root mean squared error. The mean squared error after about five epochs comes down to about 0.80, and the square root of 0.8 is about 0.89. So our root mean squared error is roughly 0.9, which means our predictions are off by about 0.9. The training call and that calculation are sketched below.
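Something like this (continuing from the learner sketch above; the 0.80 is just the ballpark figure mentioned, not a guaranteed result):

    import math

    learn.fit_one_cycle(5, 0.1)   # 5 epochs, with a maximum learning rate of 0.1

    # The reported losses are mean squared errors; the square root gives the RMSE
    print(math.sqrt(0.80))        # ~0.89, i.e. predictions off by roughly 0.9 stars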
So if a user has given a movie a rating of 4, our prediction is somewhere between 3 and 5, and if the user has given a movie a rating of 2, our prediction is in the range of 1 to 3, so we're not that far off. And how are we deciding how good our predictions are? We've set aside 10% of the ratings as the validation set and are only using 90% of the data as the training set. Using that 90% we train the model, and then for the remaining 10%, the validation set, we don't show the model the actual ratings; we ask it to make predictions given just a user ID and a movie ID, and compare those predictions with the validation set, just as we've been doing for all our supervised learning techniques. So this is also a form of supervised learning, and the validation set tells us that for user-movie pairs the model has not seen, it is able to make predictions with a mean squared error of about 0.8, which means it is off by about plus or minus 0.9 on the whole. And that's not bad at all.

Here we are just making some predictions on the validation set; let me also print the users and items here. So here are some predictions, and this is on the validation set, so the model did not see this data; it was not used for training. For user number 861 and movie number 736, the real rating the user gave the movie was 4.0, and our model, having never seen this combination or the real rating, was able to predict 4.1, so it was only off by 0.1. For user number 118 and movie number 547, the real rating the user gave was 5.0 and our model predicted 4.2, so it was off by 0.8. Similarly, for user 458, the real rating was 4.0 and our model predicted 3.8, so it was only off by 0.2. So the model has gotten pretty good, and of course it has looked at the roughly 90,000 ratings in the training set, and using those it has come up with these vectors for all the users and all the movies.

Now here's what we can do: given a particular user, we can make a prediction for that user for all the possible movies, and the way to do that is to just call the model object with a bunch of users and a bunch of items. So if we create a list of users that is just 1, 1, 1, 1, ... — I just want to make predictions for user number one — and for the items we put in all the movies we want to rank for that user, say movies 1 through 15, then for user 1 and movie 1 the predicted rating is 4.4, for user 1 and movie 2 the predicted rating is 3.0, for user 1 and movie 3 the predicted rating is 3.4, and so on. So for every user and every movie you can get a predicted rating, and then you can find the highest-rated movies for that user and simply recommend those. A rough sketch of that is below. Okay.
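A rough sketch of that ranking step (fastai v1 and PyTorch assumed; depending on the library version, the raw user and movie IDs may first need to be mapped to the model's internal indices):

    import torch

    users = torch.tensor([1] * 15)             # user number 1, repeated
    items = torch.tensor(list(range(1, 16)))   # movie IDs 1 to 15

    with torch.no_grad():
        preds = learn.model(users, items)      # predicted rating for each (user, movie) pair

    top_movies = items[preds.argsort(descending=True)]
    print(top_movies[:5])                      # the five highest-scoring of these movies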
So we can now use this model to predict how users would rate movies they haven't seen, and recommend the movies that have a high predicted rating. This is what that interface might look like: the moment you open Netflix or something similar, it can show you the movies you're likely to be interested in. And of course, you could also show an average rating, or maybe a rating from users who have a very similar vector. Here's the interesting thing: you can use these vectors as a similarity measure. Users whose vectors are very close together, that is, have a very low Euclidean distance, are very similar. So you can use these vectors to cluster users, and you can use them to cluster movies. As you cluster the users, maybe you'll find action enthusiasts, maybe you'll find teenage girls, maybe you'll find older people; and if you cluster the movies, maybe you'll find action movies together, maybe you'll find movies starring a particular actor together, maybe you'll find Christopher Nolan movies close by. So there's a lot of analysis you can do on the vectors you've found for the users and the movies. And these vectors can be seen inside learn.model.parameters(), I believe — yep, so these are the user vectors, these are the movie vectors, and these are probably the biases.

Okay, so that's recommender systems. We've seen a high-level overview of how this works, but what I want you to take away is the simple idea that a lot of different problems, including these recommendation problems, can be expressed in these terms. It's always a question of figuring out whether you're looking at a classification problem, a regression problem, a clustering problem, a dimensionality reduction problem, or a recommendation problem, and then you have a bunch of patterns to apply. If it's a recommendation problem, you can either do content-based filtering, if you have a lot of content attributes (genre, director, etc. for a movie — we did not do that), or you can do collaborative filtering, if you have a lot of user data. Note that collaborative filtering suffers from what's called a cold start problem: initially, when you're starting your website and there are no users and no ratings, what do you do? So initially you may want to use content-based recommendations, and later move to collaborative recommendations, or keep using a combination of both. But yes, this is yet another category of machine learning problems, and recommender systems could be a course in themselves.

So that brings us to the end of this notebook, and of this lesson on unsupervised learning and recommendations. This is also the last topic in the course, so let's go back to the course homepage and review the topics we've covered so far. We started out talking about linear regression with scikit-learn: we saw how to download a real-world dataset and prepare data for machine learning, we talked about building a linear regression model with a single feature and then with multiple features, and we saw how to generate predictions using the trained model and how to evaluate machine learning models.
This is also the last topic in the course, so let's go back to the course homepage and review the topics that we've covered so far.

We started out talking about linear regression with scikit-learn. We saw how to download a real-world dataset and prepare the data for machine learning. We talked about building a linear regression model with a single feature and then with multiple features, and we also saw how to generate predictions using the trained model and how to evaluate machine learning models.

We then applied the same principles to logistic regression for classification. This was a different kind of problem where, instead of predicting a single number, we were attempting to classify inputs into one of two categories. We saw how to download and process datasets from Kaggle, how to train a logistic regression model, and how it works under the hood. We also looked at model evaluation and prediction, and saw how to save the trained weights of the model so that we don't have to train it from scratch each time we need to use it. Then you worked on the first assignment, where you trained your first machine learning model: you downloaded and prepared a dataset for training, trained a linear regression model using scikit-learn, made some predictions, and evaluated the model.

Next, we looked at decision trees and hyperparameters, where we once again downloaded a real-world dataset and prepared it for training, and then trained and interpreted decision trees. We also learned about the hyperparameters that can be applied to decision trees to improve their performance and reduce overfitting.

In the next lesson, we looked at random forests and regularization. We saw how to go from a single decision tree to a forest of decision trees, with several randomizations applied to each tree. We saw how to train and interpret random forests, we looked at ensemble methods in general, why they work and how random forests apply them, and we also tuned some hyperparameters for random forests specifically to reduce overfitting and regularize the model. The second assignment was on training decision trees and random forests, where you prepared a real-world dataset for training, trained a decision tree and a random forest, and finally did some hyperparameter tuning to regularize the models.

Next, we looked at gradient boosting with XGBoost. We trained and evaluated an XGBoost model, and we learned about gradient boosting, the technique where we train multiple models, each one correcting the errors made by the previous models; that is the technique called boosting. When it is done with trees it is called gradient boosted decision trees, but you can also have gradient boosted linear models. We looked at the XGBoost library, at techniques like data normalization and cross-validation, and at hyperparameter tuning and regularization in XGBoost.

The course project is for you to build a real-world machine learning model, where you will perform data cleaning and feature engineering on a dataset that you download from an online source such as Kaggle, then perform training, compare and tune multiple types of models, and finally document and publish your work online.

Today, we looked at unsupervised learning and recommendations. We looked at clustering and dimensionality reduction using scikit-learn, and we also learned about collaborative filtering and recommendations. There are several other unsupervised learning algorithms available in scikit-learn that you should also check out. You will also have an optional assignment on gradient boosting, where you will train a LightGBM model from scratch, make predictions, evaluate the results, and tune hyperparameters to regularize the model (a rough sketch of that workflow is shown below). This assignment will be live shortly.
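Purely as a hedged illustration of what that assignment workflow might look like, and not the assignment's actual starter code, here is a minimal LightGBM sketch; the file name train.csv and the target column are assumptions:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical tabular dataset with a binary "target" column
df = pd.read_csv('train.csv')
X = df.drop(columns=['target'])
y = df['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# A LightGBM classifier with a few commonly tuned hyperparameters
model = lgb.LGBMClassifier(
    n_estimators=500,    # number of boosting rounds
    learning_rate=0.05,  # contribution of each tree
    num_leaves=31,       # controls tree complexity
    reg_lambda=1.0,      # L2 regularization
)
model.fit(X_train, y_train)

val_preds = model.predict(X_val)
print('Validation accuracy:', accuracy_score(y_val, val_preds))
```

Tuning then amounts to varying hyperparameters like n_estimators, learning_rate, num_leaves, and the regularization terms while watching the validation score.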
You can earn a verified certificate of accomplishment for free by completing all the weekly assignments and the course project, and the certificate can be added to your LinkedIn profile, linked from your resume, or even downloaded as a PDF.

So, where do you go from here? Well, there are four good resources for learning more, and machine learning is all about building projects, building models, and experimenting with different kinds of techniques and hyperparameters. First, I recommend checking out Kaggle notebooks, datasets, competitions, and discussions. Just pick any popular dataset or competition on Kaggle, check out its Code tab, and read through some of the notebooks there. You will find data science experts from around the world sharing the best practices they use in their day-to-day work. Another course I recommend is Machine Learning by Andrew Ng on Coursera. It's a great course that helps build some of the theoretical and mathematical foundations of machine learning, so it's a great complement to this course, which is far more applied and practical. You should also check out the book Hands-On Machine Learning by Aurélien Géron; it's a great book on machine learning using scikit-learn and deep learning using the TensorFlow framework. And finally, we have a course called Deep Learning with PyTorch: Zero to GANs. This is an online course on Jovian, so you can check it out on jovian.ai. The most important thing is to keep training models and keep building great projects.

So what should you do next? Review the lecture videos, execute the Jupyter notebook, complete the lecture exercises, start working on the assignment, and discuss on the forum and on the Discord server.

With that, I'll see you in the forums. You can find us on zerotogbms.com or on Twitter at @JovianML. This was lesson six of the course Machine Learning with Python: Zero to GBMs, and the topic was unsupervised learning and recommendations. I hope to see you again in a new course; there are several other courses available on Jovian. Just go to jovian.ai and check out the courses we offer: a course on data analysis with Python (Zero to Pandas), a course on deep learning with PyTorch, a course on data structures and algorithms, and this course on machine learning. And if you're interested in making a career transition to data science, you should definitely consider checking out the Zero to Data Science Bootcamp by Jovian, a 20-week live certification program designed to help you learn industry-standard tools and techniques for data science, build real-world projects, and start your career as a data scientist. We have limited seats for this program.

With that, we've reached the end of this course. Thank you, and have a good day or good night. I'll see you again soon.
Info
Channel: Jovian
Views: 11,347
Keywords: jovian, certification, unsupervised learning, unsupervised learning algorithms, unsupervised learning tutorial, supervised and unsupervised learning, what is unsupervised learning, unsupervised learning example, sklearn machine learning, scikit learn, k means, k means clustering, k means algorithm, dbscan, dbscan clustering, db scan, dbscan parameters, dbscan vs k means, clustering algorithms, clustering, k means clustering algorithm, cluster analysis, k means clustering example
Id: aMpVzUg3Ep0
Length: 135min 58sec (8158 seconds)
Published: Sat Jul 24 2021