Hello and welcome to Machine
Learning with Python: Zero to GBMs. This is a beginner-friendly online
certification course. Today, we're on the final lesson of the
course, lesson six, and the topic for today is unsupervised learning and recommendations. So let's get started. First, we'll go to the course page,
zerotogbms.com. On the course page, you will be able to find some information about the course. And you can enroll in the course and get a
certificate of accomplishment, you can check out the course discord server and the course
discussion forum using the links here, and you can find links to all the lessons and assignments
that you need to complete to get a certification for this course, let's scroll down to lesson
six, unsupervised learning and recommendations. On the lesson page, you will be able to find a
recording of the lesson, some information about the topics covered and links to the discussion
forum and the discord server for this lesson. Now let's scroll down to the lesson notebooks here. You'll find the Jupyter notebooks
that we are going to use today. So we have a couple of Jupyter
notebooks, one on unsupervised learning and another one on recommender systems. So let's open up the
unsupervised learning notebook. You can look at the notebook here, but
we are going to run this notebook by clicking "Run" and selecting "Run on Binder". This will take the notebook and set up
a Jupyter server for us on the cloud so that we can run the code and experiment
with the material for this lesson. Now feel free to follow along, or you can also
just watch this video and then complete the notebook on your own after watching the video. All right. We now have a Jupyter
notebook running in front of us. So the first thing I like to do on Jupyter is
just click on "Restart & Clear Output" so that all the stale outputs from the previous execution
are removed and we can see the outputs fresh. And I'm also going to hide the header
and toolbar and zoom in here a little. Okay. So the topic for today is unsupervised
learning, using scikit-learn. And specifically, we're going to talk about
a couple of unsupervised learning techniques, clustering, which is taking some data points
and identifying clusters from those data points. And this is a visual representation here and
dimensionality reduction, which is taking, again, a bunch of data points that exist
in 2, 3, 4, 5, or any number of dimensions and reducing them to fewer dimensions. For example, here, we're taking all these
points and then we are taking a line and projecting all these points on the line and
simply looking at their positions on the line instead of looking at these two coordinates. So that's what we're going to talk about today. And we'll start with just an overview of
unsupervised learning algorithms in scikit-learn and then talk about clustering and then
talk about dimensionality reduction. Now, this is going to be a high-level overview. We're not going to look at a very
detailed implementation, nor the detailed internal workings. We will try and grasp the intuitive
idea of how these different algorithms work and what they're used for and how
they're different from one another. Okay. And I encourage you to explore more as I've
been saying for the last few weeks, we're at that point where you can now learn things
from online resources, from books, from tutorials, from courses on a need to know basis. So whenever you come across a certain
term that you need to know about you look it up and you find a good resource,
and then you spend some time on it. Maybe spend a day, a couple of days working through some examples
and become familiar with it. Okay? So from this point, you have to start searching. You have to start doing some research
on your own, and a great way to consolidate your learning is to put together a
short tutorial of your own. Try creating your own tutorial on any topic of your choice, for
example, on principal component analysis, and publish that and share it with the community. So let's install the required libraries. I'm just installing NumPy,
pandas, Matplotlib and Seaborn. These are the standard data analysis
libraries, and I'm also installing Jovian and scikit-learn because these
are the libraries we'll need to use. Now, unsupervised machine learning refers to
the category of machine learning techniques, where models are trained on a dataset
without any labels, unlike supervised learning. And you might wonder what
exactly are we training for if there are no labels, if
you just have a bunch of data. So, unsupervised learning is generally used
to discover patterns in data and to reduce high-dimensional data to fewer dimensions. And here is how it fits into the
overall machine learning landscape. Of course you have computer science,
which artificial intelligence, machine learning and everything that we're doing
comes under. But within computer science, you have artificial intelligence, where
sometimes you have rule-based systems. Sometimes you have machine learning models, where
models are learning patterns and learning relationships between data. Now, again, machine learning is comprised of
unsupervised learning and supervised learning. Supervised learning is where you have
labels for your data and unsupervised learning is where you do not have labels. Now, there are also a couple of other
categories or overlaps called semi-supervised learning and self-supervised learning. We won't get into that right now, but
I encourage you to check this out. And then there is one branch of machine learning,
deep learning, which we have not talked about in this course, but we have another course on it
called Deep Learning with PyTorch: Zero to GANs. So I encourage you to check that out sometime
later. Deep learning kind of cuts across all of these categories, and it's just a new paradigm
or a new way of doing machine learning. So I encourage checking it out as well. It's a sort of wide-reaching approach that
applies to various different kinds of problems. And here is what we are studying in this
course: we are looking at classical machine learning algorithms as opposed to deep learning. And the reason we're doing that is
because a lot of the data that we work with today is tabular data. Like 95% of the data that companies
deal with is tabular data: Excel sheets, database tables, CSV files, etc., and the
best-known algorithms for tabular data at this point, especially algorithms that
can be interpreted and controlled well, are classical machine learning algorithms. And we've looked at supervised
learning algorithms already: things like classification and regression. We've
looked at linear regression, we've looked at logistic regression, we've looked at decision tree classification,
and gradient boosting based classification and regression, where we try to predict a number. So classification is where we divide
observations into different classes and predict those classes for new observations. Regression is where we try to
predict a continuous value. And today we're looking at unsupervised
learning, where there is no label for any data. And you either try and cluster the data,
which is creating groups of similar data points, or you try and reduce the dimensionality,
or reduce the dimensions, of the data. Or sometimes you try and find associations
between different data points and use that for doing things like recommendations. And scikit-learn offers this cheat sheet for you
to decide which model to pick for a given problem. Now, most of the time you will, if you've
done a little bit of machine learning, you will automatically know what models to pick. So this is a very simple or very obvious
kind of tree, but it's still good to see that these are the four categories of
algorithms available in scikit-learn. So if you are predicting a category and
you have some labeled data, that is when you use classification, but on the other
hand, if you're trying to predict a category and you do not have labeled data,
that is when you use clustering, right? So that's one difference between
classification and clustering. That's a common confusion. In clustering, there are no labels for us to look at. We just want to group whatever data
points we have into different clusters. And of course, if you're predicting a
quantity, you should be using regression. And if you're just looking at data, if you
just want to visualize data or reduce its size, that's when you look at principal component
analysis and embeddings and things like that. Okay. So let's talk about clustering. As I've said a couple of times already, clustering
is the process of grouping objects from a dataset such that the objects in the same
group are more similar in some sense to each other than to those in other groups. So that's the Wikipedia
definition for you, and scikit-learn offers several clustering algorithms. And in fact, it has an entire section on
clustering algorithms that you should check out. So it talks a little bit about the different clustering methods; you can see
there are more than 10 clustering methods. And it tells you about the use cases
for different clustering methods. It tells you about the scalability, how well
these methods scale to different sizes of datasets and different numbers
of clusters, and the parameters that they take. And all of these are fairly different
in how they are implemented. But the goal is ultimately the same. The goal is to take some data. For example, here, let's say we just
have a bunch of points here. This example plots incomes and debt. So this is a scatter plot between
incomes of various people and the amount of debt that they have at the moment. So each point represents one person. So these people are people who
have high income, but low debt. And these people are people who
have low income and low debt. And these people are people who have
low income and very high debt, right? So clearly there are three clusters of people here
and typically these clusters don't come predefined. So you would simply have
income and debt information. And what you might want to do is identify which
cluster a particular person belongs to, or even figure out what the clusters should look like. That's what clustering is. And that's what we'll try and do today. Given all these points, we will try
and figure out how many clusters there are in the data and which
cluster each of the points belongs to. And potentially, if there is a new data
point that comes in, which cluster will that data point belong to? Okay. So that's what we'll talk about now. Why are we interested in
clustering in the first place? Here are some real-world applications of clustering. One is, of course, customer segmentation. Now suppose you are an insurance company
or you're a bank and you're looking at applications for loans. This kind of cluster analysis is
something that you may want to do: plot incomes and debt and see where the
person lies, which cluster they lie in. And you may have a different set of
operating rules for high-income, low-debt people, low-income, low-debt people,
and low-income, high-debt people. And then you may just want to
simplify decision-making: instead of each time having to look at both variables, you can
simply feed the variables into a computer algorithm, get back a category for them and
use that category to make decisions for them. Right? So often the classes in several classification
problems are obtained in the first place through clustering processes, especially
when you're looking at things like low risk, medium risk, high risk, etc. Then product recommendation is another important
application of clustering, where if you can identify clusters of people who like a particular
product, then you can recommend that same product to people who have similar behaviors.
Feature engineering is yet another application. What you could do is you could perform
some clustering on your data, and then you could actually take the cluster
number and add it to your training data as a categorical column. And it's possible that adding that categorical
column may improve the training of a decision tree or a gradient boosting tree. Then you have anomaly or fraud detection. Now, of course, if you
plot, let's say, a bunch of credit card transactions, and then
you cluster the credit card transactions, you will notice that fraudulent
transactions stand out. Maybe there are certain credit cards
that make a lot of transactions, so they don't fall within the regular cluster. They fall within the anomalous cluster,
and that can then be used to detect this kind of fraudulent behavior and decide how
to deal with the problem. Right. Another use of clustering, called hierarchical
clustering specifically, is taxonomy creation. Now, you know that there are several hierarchical
divisions in biology: first you have, of course, the animals and plants, and then among
plants, you have a bunch of different families. And then in each family you have a bunch
of different species and so on, right? So that kind of a hierarchy is created
using clustering, where you take a bunch of different attributes, like what kind of
reproduction a particular animal has, what kind of feed
they have, what kind of weight they have, where they live, and things like that, and use that to create clusters and
create families of related animals, and then related families get grouped
into related kingdoms and so on. So those are some applications of
clustering, and we will use the Iris flower dataset to study some of the clustering
algorithms available in scikit-learn. So here is the Iris flower dataset. I'm just going to load it
from the Seaborn library. It comes included with Seaborn. So here is the Iris dataset. In this
dataset, you have observations about 150 flowers. You can see 150 rows of data. Each row represents the observations
for one flower, and the observations are the length of the sepal, the width of the sepal (sepal and petal
are two parts of the flower), the length of the petal and the width of the petal. Okay. So these are four measurements we have, and
then we also know which species these flowers belong to. Now for the purpose of today's tutorial,
we are going to assume that we don't know what species these flowers belong to. Okay. Let's just pretend that we don't know
what species the flowers belong to. We just have these measurements. Now, what we'll try and do is we will try and
apply clustering techniques to group all of these flowers into different clusters and see if
those clusters, which our clustering algorithms have picked out just by looking at these four
observations, match these species or not. Okay.
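As a rough sketch of what "pretending we don't know the species" means in code: keep only the four measurement columns as the input and set the species aside. The lesson loads the full table from Seaborn; the three rows below are hand-typed stand-ins so that the snippet runs without downloading anything, and the names `rows`, `numeric_cols` and `species` are illustrative choices, not taken from the notebook.

```python
# A few hand-typed rows standing in for the Iris table (the real one has 150 rows).
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"},
    {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0, "petal_width": 2.5, "species": "virginica"},
]

# Only the four measurements are used as input; there is no target column.
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = [[row[col] for col in numeric_cols] for row in rows]

# The species are set aside and would only be used afterwards,
# to check whether the discovered clusters line up with them.
species = [row["species"] for row in rows]

for features in X:
    print(features)
```

Each inner list of `X` is one flower described by four numbers, which is all a clustering algorithm gets to see.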
So we're not going to use species at all. We are going to treat this data as unlabeled. Here's what the data looks like. And you can see that, if I just
plot sepal length versus petal length, this is what the points look like. And if I did not have this
information about species, then this is what the points would look like. So if you look at the points here,
you might say that maybe there are two clusters: there's one cluster here, and then there's one cluster here. Maybe if you look even more closely,
you might say that, okay, maybe there are three clusters here. Maybe this is a cluster, and maybe this seems like a cluster. And of course, we're just
looking at two dimensions here. We're not really
looking at all four dimensions, because we can't even visualize four dimensions,
but we could look at sepal length, or we could look at sepal width versus petal width. And there seem to be a couple of
clusters, maybe three clusters here. And again, we could look at
different combinations, or we could take three of them and visualize them in 3D and
try and identify some clusters, but we have no way of actually visualizing things in 4D. So that's where we will have
to take the help of a computer. Right. But even within the representations here, you
can see that clusters do start to form. And it's an interesting question to ask: how
do you train a computer to figure out these clusters given just these four measurements: sepal length, sepal width, petal length and petal width? Okay. So as I've said, we are going to
attempt to cluster observations using numeric columns in the data. So I'm going to pull out a list of numeric
columns and just take the data from the numeric columns into this variable X,
and X is typically used to represent the input into a machine learning algorithm. Okay. So there's just this X and there's no y here. The first clustering algorithm we'll talk
about is k-means clustering and the k-means clustering algorithm attempts to classify
objects into a predetermined number of clusters. So you have to decide K; the K in
k-means is the number of clusters that you want the data to have. So in this case, let's say we have some
intuition that maybe, by looking at some scatter plot, we feel like maybe there are two,
maybe there are three clusters, so we can give an input number of clusters to the k-means
algorithm, and then it will do something. It'll take this data. Let's say we give the input
K as three; it's going to then figure out three
central points, one for each cluster. So here is the center point for cluster
number one, here is the central point for cluster number two, and here is the
center point for cluster number three. And these central points are
also called centroids, and each object or each observation in the dataset is going
to be classified as belonging to the cluster represented by the closest center point. Okay. So that's where you can see that all
these belong to this cluster, all these are said to belong to this cluster and all
these are said to belong to this cluster. And of course, now if you go out and make a new
observation, maybe you get another flower, measure it, and put in the observations here. And if the flower lies somewhere here,
then it will belong to the cluster of the center that it is closest to. Okay. So that's the k-means algorithm. Now, the question you may have is how
exactly are these centers determined? So let's talk about that and
maybe let's take an example. Let's take maybe a one dimensional example
initially, and then we will expand that to two dimensions and go from there. So let me draw a line here and let's say
this line represents the petal length. Okay. And I'm just going to take petal
lengths from, let's say, zero to five. All the values seem to have
petal lengths between zero and five. And let me take about 12 flowers. Let's not consider all 150.
And let's say you have these four flowers. They have a fairly low petal length. Okay. Very close to one, maybe. And then you have another four flowers. They have a medium petal length, and
then you have yet another four flowers. They have a high petal length. Okay. Now, just
by looking at the petal length, you can kind of visually
say that, okay, this is one cluster and this is one cluster, and this is one cluster,
right? Now, the challenge is: how do you train a computer to figure out these clusters? So here's what we are going to do. First, in the k-means algorithm, we determine what the value of K is. So let's say I set the value of K to three. Okay. For some reason I have some intuition
that maybe the value of K is three. And we'll talk about how to
pick the right value of K. Then what we do is we pick three random points. So let's pick three random points. I pick this one. That's a random point, sure. I pick this one. Okay.
That's another point. And then I pick this. Alright, that's a random point. So now we are going to treat these three
random points as the centers of our clusters. So now this becomes the center of cluster one. This becomes the center of cluster two,
and this becomes a center of cluster three. Okay, great. So here's what we've done so far, pick K
random objects as the initial cluster centers. And then the next step is to classify
each object into the cluster whose center is closest to that object. So now let's try and classify
this point or this flower. Now, which cluster is this flower closest to? Well, you
can clearly see that it's closest to center one. So this becomes one, and then check the next one. And this is also closest to one. So this also gets assigned to cluster one,
and this also gets assigned to cluster one. This also gets assigned to cluster one. I would say that even this
is closer to one than two. So this also gets assigned
to cluster one. And of course this point
is already in cluster one. Then you have this one, cluster two. And this also gets assigned to cluster two. This one I would say is closer to three. So this gets assigned to cluster
three, and this is cluster three and this is closest to three as well. Okay? Okay. So now we have a clustering, but definitely
this clustering is not that great. You can clearly see that these two
should rather belong to cluster two. So here is where we do the next interesting thing. Once we have determined the clusters, this
is cluster one, and then this is cluster two, and then this is cluster three, we then, for each cluster of classified
objects, compute the centroid or simply the mean. The centroid is nothing but the mean. Right now we're looking at one dimension, so
the centroid is simply the mean, and we'll talk about two dimensions, but here's what we do. We basically take all these values. So this is about 0.7, this is about one, this is about 1.2, etc. And then we take their average. So if you take the average of all these
values and let me draw another line here. So if you take the average of all these
values, the average of all these values would be somewhere around here, right? And then we take the average of all
these values in the cluster two, and that would be somewhere around here. And then we take the average of all these
values and that would be somewhere around here. Now here's the interesting thing. Once we've taken the averages, we make
these the new centers of the clusters. So this becomes the center for cluster one. This becomes the center for cluster two. This becomes the center for cluster three. Okay. And now you can see that things are
already starting to look better. Let's just put back these points here. Okay. So now what we've done is we've taken each cluster
that we created using randomly picked points, and we took the averages of those clusters
and said, these are the new cluster centers. Now, using these new cluster
centers, we reclassify the points. Okay. So that's what we're doing. We are now going to reclassify each object
using the centroids as the cluster centers. So now this point is given class
one, this point goes to class one, and this one goes to class one. This one goes to class one too. Now this one goes to
class two, class two, class two and class two. And you have class three, class
three, class three and class three. Okay. And that's it. So just like that, now we have ended up
with cluster one, cluster two and cluster three. Okay. So that is exactly how k-means works. But one last issue here. Suppose the points that we had picked out in our
random selection initially were very bad points. Let's say they were somewhere here. Let's say we had picked this, this, and this
as the points. What would happen then? Well, what would happen is all of these would
belong to the first cluster and this would be the second cluster and this would be the third
cluster and the average would lie somewhere here. So your first cluster would be here
and the second and third would be here. So you would still end up even after performing
the entire average and reclassifying the point, you would still end up with this huge cluster. Maybe this, maybe these might get excluded. Like maybe these might go into two,
but you'd still get both of these sections into one big cluster. Right? So k-means does not always work perfectly, right? It depends on that random selection. So that's where, what we do is we repeat these
steps steps 1, 2, 6, a few more times to pick the cluster centers with the lowest or to Williams. And we'll talk about that last piece again. So here's what we do just to recap, big
gear and them objects as the initial cluster centers like this one, this one, and this
one classify each object into the cluster who sent it as the closest to the point. So now we've classified each object based on
the randomly picked clusters, then compute the  or the mean for each set of clusters. So here the centeroid lay here and here are the
Centeroid lay here, here are the centeroid lay. Then use the centroids as the new cluster
centers and reclassify all the points. Okay, then, so you do that and you keep track of
those centroids and then you do the whole process. Again, get random objects, classify,
compute the centroid reclassify. Keep that option. And keep doing that over and
over and over multiple times. And let's say you do it
about 10 times or 20 times. And out of the 20, you simply pick the one
where you got the best, lowest, total variance. So what do we even mean by the total variance? Well, here's how you can compare
two sets of clusters, right? So, and I'm not going to draw the point anymore,
but let's say you have one set of clusters that looks like this, and you have one set of clusters
again, through k-means that looks like this. Okay. So what you do is you compute the variance. Now you have a bunch of
points in this cluster, right? So these points are nothing
but values on the line. Remember, this is petal length. So whenever you have points, you can compute
the variance. Variance is simply the measure of spread: how much spread is there in this data. And
then you compute the variance of these points, the points that you have here, and you compute the
variance of these points, and add up these values. Okay? So here, this is like a
relatively low variance set. Let's say this variance is 0.7, this variance is 0.2, and this variance is 0.5. This is like a relatively low variance. On the other hand, consider
this one: the variance is pretty high. So the variance here is like 2.1, the variance here is about 0.4, and the variance here is about 0.1, let's say, right? So the total variance in this case is
about 2.6, and the total variance here, 0.7 plus 0.2 plus 0.5, is going to be about 1.4. Okay. And the variance is simply the
square of the standard deviation, if you remember from basic statistics, right? So what do we mean when we
have a low total variance? When we have a low total variance, what we are
essentially expressing is that all the points in every cluster are very close together. And when we have a high total variance, what we are expressing there is that
there are points in certain clusters that are very far away from each other. So as we try all of these random
experiments, we simply pick the cluster centers which minimize the total variance. And with enough random experiments, you can
almost be sure that you will get to a point, which is very close to the optimal solution. Okay.
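The one-dimensional walkthrough above can be sketched in plain Python. This is a minimal illustration under some assumptions, not the notebook's code or scikit-learn's implementation: the twelve petal lengths are made up to mimic the drawing, the function names (`assign`, `total_variance`, `run_from`, `kmeans_1d`) are hypothetical, and the centroid/reclassify step is repeated until the centers stop moving, which goes slightly beyond the single pass described in the talk. "Total variance" here is the plain sum of each cluster's variance, as described above.

```python
import random
from statistics import mean, pvariance


def assign(points, centers):
    """Put each point into the cluster whose center is closest to it."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters


def total_variance(clusters):
    """Add up the variance (spread) of every non-empty cluster."""
    return sum(pvariance(c) for c in clusters if c)


def run_from(points, centers, max_iter=100):
    """Classify, take centroids, reclassify, until the centers stop moving."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # In one dimension the centroid is just the mean;
        # an empty cluster keeps its old center.
        new_centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def kmeans_1d(points, k, n_restarts=20, seed=0):
    """Try several random starts and keep the one with the lowest total variance."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centers, clusters = run_from(points, rng.sample(points, k))
        score = total_variance(clusters)
        if best is None or score < best[0]:
            best = (score, centers, clusters)
    return best


# Made-up petal lengths: four low, four medium, four high.
petal_lengths = [0.7, 0.9, 1.0, 1.2, 2.4, 2.6, 2.8, 3.0, 4.2, 4.4, 4.6, 4.8]

_, good = run_from(petal_lengths, [1.0, 2.6, 4.4])   # one start in each group
_, bad = run_from(petal_lengths, [0.7, 0.9, 1.0])    # all starts in the low group
print("well-spread start:", good, round(total_variance(good), 4))
print("unlucky start:    ", bad, round(total_variance(bad), 4))

score, centers, clusters = kmeans_1d(petal_lengths, k=3)
print("best of 20 restarts:", clusters, round(score, 4))
```

With the unlucky start, the medium and high flowers end up in one big cluster and the total variance is much higher, which is the situation that the repeated random restarts are meant to guard against.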
You may not get the best possible division. And sometimes, even when you run k-means
multiple times, you may get different cluster divisions depending on the kind of data. But once you run it enough times,
you will get to a pretty good place. And that's
basically how the k-means algorithm works. Now, of course, how does this
apply to two-dimensional data? Exactly the same way. Now let's say you have petal length
and you have petal width. So you're looking at both of these, and you have
a couple of flowers here, and then you have some flowers here and then you have some flowers here. Okay. So what do you do? You pick three random points. Let's say you set K as three. So you pick three random points. Maybe let me pick these. Using those three random points,
you set them as the cluster centers. And once you set them as the cluster centers, you
label all the other points: 1, 1, 1, and then all these I think are also close to one. And then these are
close to two and this one is close to three or something like that. Then what do you do? You take all these points
and you take the centroid. So when you take the centroid, it ends
up here, and then when you take the centroid of these two, it ends up here, and you take the
centroid of these two, and the centroid ends up here. Now, once you get the centroids, you
once again do the classification. So now when you do the computations
against the centroids, some of the clusters may change. This is probably not a great example,
but maybe let's say you got a centroid here and a centroid here. So what will then happen is all of these will
now fall into a different cluster, and all of these will fall into a different cluster, and
all of these will fall into a different cluster. Okay. So it's the exact same process: pick K random
objects, classify, compute the centroids. And what is the centroid? Well, you take all the X values of the
points in this cluster and take their mean. You take all the Y values of the
points in the cluster and take their mean. So the centroid is simply a dimension-wise
mean that you take. Okay. And then you do this multiple times
using random initial cluster centers, and you simply pick the centers with the
lowest total variance, and use that as a measure of goodness. Okay. So that's how k-means works. And here is how you actually
perform k-means clustering. You don't have to write code for any of that. All you do is you import KMeans from
sklearn.cluster, and then here you simply initialize KMeans;
you give it n_clusters, the number of clusters you want to create. And then you give it a random state. You don't have to give this. This is just to ensure that each time
the randomization is initialized in the exact same way. And you call model.fit,
and you just give it X. And remember, X simply
contains the numeric data. There is no target given here. And once the model is fitted, you can
actually check the cluster centers. So now, remember this is clustering not
on one dimension, not on two dimensions, but on four different dimensions. And the algorithm is exactly the same. It's just that the number of
dimensions has increased. So what the model has found is that the
cluster centers are as follows. For cluster one, the center is at a sepal length of 5.9,
a sepal width of 2.7, a petal length of 4.3 and a petal width of 1.4. And this is the second cluster
similarly, and this is the third cluster. Now, when we want to classify
points using the model, this is what we do. We check the distance from 5.1 to 5.9. Okay.
That's about 0.8. We check the distance from 3.5 to 2.7. That's about 0.8. We check the difference from 1.4 to 4.3. Okay.
That's a lot.
That's about 2.9. And we check the distance from 0.2 to 1.4. Okay. So we see how far each of these values is from the cluster center's values. So we're basically going to subtract the
cluster center from the actual values, and then we add up the
squares of those differences. Okay. Basically what that means is if you have
a cluster center here, let's say this is a cluster center, and this is a point, okay,
now the cluster center is going to look something like this: X, Y in two dimensions. And the point is going to look
like this: X, Y in two dimensions. So let me call them X one, Y one and X two, Y two. What we do is we compute X one minus X two,
squared, plus Y one minus Y two, squared. And we take the square root of that. Okay. So what is that exactly? Well, that is nothing but the actual distance
between the two points in two dimensions. Because if you just see this triangle over
here, this right-angled triangle, this edge of the triangle is
X two minus X one, this length, and then this
length over here is simply Y two minus Y one. So this distance over here,
by the Pythagoras theorem, is the square root of the sum of
squares of these two sides. So this is actually the distance
between the two points, right? So we check, for each point, its distance
from each cluster center. So we compute for this point the distance
from this cluster center, the distance from this cluster center, and the distance
from this cluster center. And we find that this
cluster center is the closest. So this is the cluster that it gets assigned. Okay. So don't worry about the math. The basic idea is each point is going to get
assigned to the cluster that it is the closest to, and that closeness is determined using this distance. Sometimes it's called the L2 norm. Sometimes it's called the Euclidean distance. It's basically just the Pythagoras theorem. It's calculating the distance
in these four dimensions. Okay. So the way we can now classify each
point or each flower into these clusters is by calling model.predict. So we call model.predict on X, and the model
now figures out, by calculating
the distance of this point, or this flower, to cluster center one, cluster
center two and cluster center three, that it is closest to cluster center number one. Oh, sorry: zero, one, two. So it's closest to this cluster center. Okay. And you can verify that if
you want; there are distance functions in scikit-learn too. So all these points belong to cluster number one, and then a bunch of
these points belong to cluster number zero. And some of these in between, you can see, belong
to cluster number two, and then a lot of these points belong to cluster number two, and some of
these in between belong to cluster number zero. Okay. So each flower has now been
classified into a cluster. And if I want to plot that, here
is what that will look like. So I've plotted the cluster centers here. This is cluster center number one, sorry,
this is cluster center number zero. This is the center for cluster
number one, and this is cluster center number two. And I'll let you verify here that it is
classifying based on closeness of the points. Here we are just looking at sepal length and petal length,
but when it actually measures the closeness, it takes into consideration all four dimensions. So that's why you may not
get the perfect picture here. But if you measure the closeness across
all four dimensions, all of these flowers have been detected as belonging to this
cluster, all of these flowers as belonging to this cluster, and all of these belong here. Okay. And that looks pretty similar to
the chart that we had earlier, the scatterplot with species. It seemed
like there was this one species of flower, then there was a second
species and a third species. It's possible that there is some
misclassification, but already you can see the power of clustering. Right? And imagine now these were not flowers,
but these were attributes about customers coming to a website, and we took
these four attributes about customers: attributes like how long they stayed on
the site, how many things they clicked on, how far down the page they
scrolled, and where they came from. Right. Or something like that. So we take four such attributes,
and then we cluster our customers based on those four attributes, and we get these three clusters. Now we can look into each of these
clusters and figure out: maybe these customers are spending very little time on the site. Maybe these customers are spending a
decent amount of time on the site and are also scrolling to a large degree. And maybe these customers are
actually making a purchase. So then maybe we can go and interview
some of these customers and figure out whether we should focus more of
our marketing efforts on these customers. Maybe understand their demographics, maybe
understand the products they look at, the kind of celebrities they follow, maybe get
one of those celebrities to endorse our products. And then you can get a lot more
customers who are in this cluster, right? So in general, you want to grow the
cluster of your paying users, and you probably want to ignore the people who
are not really interested in your product. Okay. So that's how this extends
into a real-world analysis. So you can think of clustering
more as a data analysis tool. Sometimes you can just take the data, cluster
it, and then present your observations and use that as a kickoff point for further analysis. But technically speaking, because it is
figuring out these patterns from data, it is a machine learning algorithm. Okay. And I mentioned to you about
the goodness of the fit. So for these three clusters,
what we do is we take the variance of all these points, and we take the variance of
all these points, and the variance of all these points, across all four dimensions. And then we average out the variance across
all dimensions, and then we add up the variances for the individual clusters. So the total of the variances of all the
individual clusters is called the inertia. Strictly speaking, scikit-learn computes the inertia as the sum of squared distances of the points to their closest cluster center, but the idea is the same: it measures the spread within the clusters. Remember, variance tells you
the amount of spread in the data. So the less the spread within the clusters,
the better the goodness of the fit, right? So here it turns out that we have an overall
inertia of about 78. Now let's try creating six
clusters instead of three clusters. So here we now have KMeans, and we are
setting the number of clusters to six, and we're predicting here, and you can
see that it has made a bunch of predictions, and these are what the clusters look like. So we have a couple of clusters here, so this got
broken down, and then we have a couple of clusters here, and then we have a couple of clusters here. So you can see that even with six clusters,
it basically took those three clusters and broke each of them into two clusters. And maybe now, if you go back and look at the
actual species, and maybe actually look at the actual flowers, you may realize that, okay,
maybe these are fully grown setosa flowers, and maybe these are young setosa flowers. Maybe these are fully grown virginica flowers. Maybe these are young virginica
flowers, or things like that. Right? So when you do clustering, you may uncover more
interesting things, even if you already have some kind of classes or labels for your data, right? So here's what we get with six clusters, and you can check the inertia here.
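As a rough sketch of the comparison being made here, the snippet below fits k-means with three and then six clusters and prints the inertia of each. It assumes scikit-learn's bundled copy of the Iris data (load_iris) in place of the notebook's DataFrame; the helper name fit_inertia and the random_state and n_init values are illustrative choices, not taken from the lesson.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the notebook's data.
X = load_iris().data  # 150 flowers x 4 numeric measurements; no target is used


def fit_inertia(n_clusters, random_state=42):
    """Fit KMeans on X and return (fitted model, inertia). Hypothetical helper."""
    model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    model.fit(X)
    return model, model.inertia_


model3, inertia3 = fit_inertia(3)
model6, inertia6 = fit_inertia(6)

# One center per cluster, each with four coordinates (one per measurement).
print("centers for 3 clusters:\n", model3.cluster_centers_.round(2))
# predict assigns each flower to the nearest cluster center.
print("clusters of the first five flowers:", model3.predict(X[:5]))
print(f"inertia with 3 clusters: {inertia3:.1f}")
print(f"inertia with 6 clusters: {inertia6:.1f}")
```

The exact numbers can shift a little with the random state and the library version, so treat the values quoted in the lesson (about 78 and 39) as a guide rather than a guarantee.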
Hopefully, the inertia here should be lower. So if we just check model.inertia_ here, you
can see that the inertia is 39 instead of 78. So you can see that these
clusters fit the data more tightly, because the total overall variance
across all clusters is actually pretty low. Keep in mind, though, that inertia generally keeps going down as you add more clusters, so a lower number on its own does not tell you that you have found the right number of clusters. Now, in most real-world scenarios, there
is no predetermined number of clusters. In such a case, what you can do
is maybe just take a sample of the data. If you have millions of data points,
maybe just take a hundred or maybe a thousand data points, and then try different numbers of
clusters. For each value of K, which is the number of clusters, train the model
and compute the inertia of the model, the total overall variance, and then plot the inertia. So what we're going to do here is take cluster
counts from two to 10 and try them all out. And then we're going to plot the
number of clusters on the x-axis and the inertia on the y-axis. It's going to create a graph like
this, and this is called an elbow curve. It may not always be this nice,
exponential kind of decreasing curve. In a lot of cases, it will flatten out like
this: beyond a certain point, creating more clusters won't really help. So what you can then decide is, okay, the
point at which things start to flatten out is maybe the right number of clusters. So in this case, I would say that
there's a huge decrease when we go from two clusters to three clusters, and
three to four, and maybe even four to five, but definitely around six things start to flatten out. So maybe for this data, I should
just go with six clusters. Okay. So this is the kind of analysis that you can
do, and you don't have to use the entire data to do it. Do the analysis and pick the
point where you get this elbow kind of shape. So in a lot of cases, what you will end
up with is a graph that looks like this. Here, you have the value of K, the number of clusters; here,
you have the inertia; and this is what the graph will typically look like. So you want to pick either this point
or this point as the number of clusters. Okay. So I can't really say up front that you need
three clusters or five clusters, but you draw this graph, and then based on this
you figure out where the elbow point is. So that's the k-means algorithm. And if you remember the algorithm, one
thing we said was that we want to try this out with random starting points many times. Remember, we said that you pick K points,
then you use them as cluster centers, classify all the points, compute the centroids, then reclassify
all the points, and then repeat that process with new random points many times. That may not be a good idea
if you are working with a really large dataset, because it can get really, really slow. So that's where you have a variation of
k-means called the mini-batch k-means algorithm. In the mini-batch k-means algorithm,
instead of classifying all the points,
you pick just a fixed number of points, and that fixed number is
called the mini-batch size. So you just pick a fixed number of points, let's say a hundred points, and compute
the cluster centers, the centroids, for those.
And then you pick the next hundred points. And for those next hundred points, you start by
using the previous centroids: rather than using K random points, you
start by using the previous K centroids. Okay. So there's a small change that you apply here:
each time, you pick about a hundred or 200 or 300 points, whatever the batch size is, and use them
to update the centroids, the cluster centers,
from the previous batch. And again, this is the point where you can
now go through and read about mini-batch k-means clustering and figure out how it
is different from the traditional k-means clustering that we've just looked at. Okay. So here's a dataset. This is called the Mall Customers
dataset, where you have information about a bunch of customers who visited a mall, and
you can try performing k-means clustering by downloading this dataset and then
using the KMeans class from scikit-learn. And then maybe you can also study the segments. Once you have these cluster segments, use them as
a new column in the data, and then do some more analysis. Maybe, for each cluster, study
what the spending looks like, study what the age group looks like, and things like that. Okay. So do check this out, and you can also try
and compare k-means clustering with mini-batch k-means clustering using this dataset. One other thing you may try out is
configuring how many iterations you want the k-means algorithm to take. By default, it allows up to 300 iterations.
In scikit-learn, maximum iterations (max_iter) is the cap on how many times a single run repeats the steps of reassigning points and recomputing the cluster centers, and the number of random restarts is a separate setting (n_init).
You can set maximum iterations to any value you want. So see what kind of impact that
has on the quality of clusters. Okay. All right. So that's the k-means clustering algorithm. So
let's talk about a couple more clustering algorithms very quickly and briefly. The
next one, again a very common one, is called DBSCAN, which is short for density-based
spatial clustering of applications with noise. That's a mouthful, but it's actually
not that complicated a technique. It's again fairly straightforward
once you understand the basic
steps involved. It uses the density of points in a region to form clusters. It has two main parameters. It has a parameter called epsilon (we'll
see in a moment what that means) and another parameter called min samples. And
using these parameters, epsilon and min samples, it classifies each point as a core point, a
reachable point, or a noise point (an outlier). Okay. So let's try and understand how DBSCAN works
by looking at this example. For the moment,
forget all these circles and all these
arrows and everything, and even all the colors. Just imagine you have these points. Maybe
let's try and replicate this. You have a point here, here, here. Okay. So these are the points we have, and here's how DBSCAN works. First, you set an epsilon. And of course,
all of this is on a coordinate plane. Let's say we're still talking about
petal length and petal width in two dimensions. Let's say we set epsilon to 0.5,
and let's say we set min samples to four. Okay. Now look at this point. Let's just consider this
point (you can start at any point). Take this point, and around this point
draw a circle with radius epsilon. I think this would be about 0.5. Let me just draw the circle here with radius epsilon. Okay. So we draw a circle with
radius epsilon around the point. Then we check if in that circle
you have at least four points. If you have at least four points in the circle,
which we do (one, two, three and four, including the point itself, as you can see here),
then we say that this is a core point. So this is a core point. Let me just make it dark;
all the core points I'm going to make dark. So now we've identified a core point, and
everything else that lies within the circle of the core point is now part of the same cluster. Okay? So these three points are
part of the same cluster. Then let's go to this point, and around this
point let's also draw a circle of radius epsilon. This is what it looks like. And now once again, you have one, two, three and four. So this point is also a core point. Okay? And then this point is
within 0.5 of the core point, so this point now
belongs to the same cluster. Let's do this one. Again, if you draw a circle around it, you will notice that four points lie inside. So this is also a core point. So these are all part of the same cluster. And then this one also turns
out to be a core point. You can verify that it will have four points
around it, and it will be connected to this one. And this will also be a core point; it will have all of these inside. So these will all be core points. So like this, you continue
connecting core points. And there will be some points
now which will not be core. Like this point right here:
this is not a core point. This is a point which is
connected to a core point. It lies within the circle of a
core point, but by itself, if I
draw the circle around this point,
it does not contain min samples points. So this is called an edge point, or, sorry, it
is called a reachable point. It is kind of the edge of the cluster, right? It is not a core point. It does not have three other
points in its circle, but it is still part of an existing cluster. And similarly, this one as well:
if you draw the circle around this, the circle would look something like this. So this is not a core point, but
it is still a connected edge point. So this way we have now identified
this one cluster of points. Okay. And we're done with all these. Let's say
there is another cluster of points here, where you have these four core points. These are all connected to each other. And then these are connected to this. So you have two edge points
here in this cluster as well, and you have four core
points in this cluster. So this becomes another cluster right here. This point, however, is neither
a core point nor an edge point. It does not have
three other points close by, and it is not within the circle of a core point. So this is called an outlier. Okay.
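The three kinds of points described so far can be sketched in a few lines of NumPy. This is only an illustration of the definitions, not scikit-learn's DBSCAN implementation; the function name classify_points and the sample coordinates are made up for the example, and the point itself is counted toward min samples, as in the walkthrough.

```python
import numpy as np


def classify_points(X, eps, min_samples):
    """Label each point 'core', 'reachable' or 'noise'. Hypothetical helper."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # within[i, j] is True when point j lies inside the eps-circle of point i
    # (the diagonal is True, so every point counts itself).
    within = dists <= eps
    # Core point: at least min_samples points inside its own circle.
    is_core = within.sum(axis=1) >= min_samples
    # Reachable point: not core, but inside the circle of some core point.
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "reachable", "noise"))


# Made-up 2-D points: a tight group of four, one point on its edge, one far away.
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3), (0.75, 0.3), (3.0, 3.0)]
print(classify_points(points, eps=0.5, min_samples=4))
```

With these made-up points, the four grouped points come out as core, the fifth as reachable and the last as noise. Forming the actual clusters additionally requires linking core points whose circles overlap, which this sketch leaves out.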
Sometimes it's also called a noise point. So that's what these triangles,
these colors and these lines represent. You have core points,
which have at least min samples observations within their epsilon radius. You have noise points,
which are not connected to any core point. And then you have these edge points, or
reachable points, which are connected to core points but are not themselves core. So that's what DBSCAN does. And again, the way to use DBSCAN is
simply to import DBSCAN from sklearn.cluster. You can see here in the
signature that you can configure epsilon (eps), you can configure min samples (min_samples), and you can
configure how the distance is calculated. By default it is Euclidean. I
was showing you two dimensions, but we have four dimensions. So in four dimensions, it's going to take the
square root of first dimension difference squared, plus second dimension difference squared,
plus third dimension difference squared, and
so on: the extension of the Pythagorean theorem.
And you can also specify some other things. So I'll let you look up the remaining
arguments here, but that's the basic idea. And then you create the DBSCAN model. So you instantiate the DBSCAN
model with epsilon and
min samples, and then you fit it to the data. Now here's one thing: in DBSCAN, there
is no prediction step, because, remember, the definition of a core point depends on
certain points being there in the dataset. In k-means, you're trying to figure
out the center of a cluster, but in DBSCAN, there is no center of a cluster. The cluster is defined by the
connections between points. So you cannot use DBSCAN to classify new observations. DBSCAN simply assigns labels to
all the existing observations. So you can just check model.labels_, and it
will tell you, for the
inputs that were given, the labels that got assigned when we performed the
DBSCAN algorithm on all of them (all of them need to be considered to perform
DBSCAN clustering). So all of these got assigned zero, and all of these got assigned one. And then you can also check which ones are
core points, which ones are reachable points and which ones are noise points. A good way to check all the attributes and
methods that you have on a particular model is by using dir. So yeah, you have
core_sample_indices_. This will tell you the core
sample indices, and you can see here that these are all the core points. Here, it seems like most points are core points,
and maybe we can try changing the epsilon value. Maybe we can reduce
or increase epsilon, and that will maybe show us that
not all points are core points. So let me do a scatterplot here. I'm
using the sepal length and the petal length, and as the hue, instead of the species, I'm
using model.labels_. You can see that only two classes
were detected here, zero and one. And maybe if we change the value of
epsilon, or the value of
min samples, that number might change. So that's an exercise for you. Try changing the value of epsilon, the size of the circle that is drawn
around each point, and try changing the value of min samples, which decides when something
is treated as a core point. And try wide ranges. Maybe try close to zero, maybe try
close to a hundred. Epsilon does not have to be between zero and one; it can be very large. And similarly for min samples, maybe try
the value one, maybe try the value hundred. Experiment and try to figure out how each of these
hyperparameters affects the clustering. Okay. And see if you can get to the desired clustering,
which ideally clusters these points according to the species that the flowers belong to. Okay. So that's the DBSCAN algorithm. And the natural question you may
have is: when should you use DBSCAN and when should you use k-means? So here's the main difference between DBSCAN
and k-means. K-means uses the concept of a cluster center, and it uses distance from a
cluster center to define the cluster. DBSCAN,
on the other hand, uses the nearness between
the points themselves to create a cluster. So here is an example. If you
have data like this (again, let's say this is the
x-axis and this is the y-axis), visually you can
clearly tell what the right clustering is: the outer points are all connected and
seem to be one cluster, and the inner points,
also all connected, seem to be another cluster. DBSCAN can do this because
it is concerned with the nearness between
points, but k-means cannot identify
these two clusters, because if you set this as a cluster center, then any cluster that
includes this point would also need to include all of these points, because these points are
closer to the center than this point, right? So there's no way you can create a cluster
of these outer rings using k-means. Instead, this is what
k-means clustering would look like. Maybe you'll end up with one centroid here and one centroid here. So half the points will go here
and half the points will go here. Here is another example. You have these two horseshoe shapes, and
again, DBSCAN is able to detect them, but k-means is not. And then there are a few more
examples for you to check out. Now, one other thing about DBSCAN and k-means
is that in k-means you can specify how many clusters you want, but
DBSCAN will figure that out on its own. You can only change the values of epsilon
and min samples to try and indirectly affect the number of clusters that get created. So that's the DBSCAN algorithm. Just keep in mind: whenever you want to detect
patterns that are more concerned with the nearness of the points themselves, DBSCAN
may make more sense, but if you want a technique based on distance from a center, then you use k-means. One more thing: you can classify new
points into clusters using k-means, but you cannot use DBSCAN to classify new points. You would have to run the entire DBSCAN
again, because it's possible that with the
introduction of a new point, two clusters may
join together or change in some fashion. Okay. One last clustering technique I want to
talk about is hierarchical clustering. As the name suggests, it creates
a hierarchy or a tree of clusters, and not just a bunch of separate clusters. So what does that mean? As you can see here in this animation, we
have a bunch of points, and first we take
the two closest points and combine them into a cluster. Then we combine the
next two closest points, which in this case turn
out to be the cluster we just formed and
another point, and we combine them. And in this way, we
create this tree of clusters. At the bottom of the tree
are the individual points. Above them you have clusters of two
points, and sometimes above these you have clusters of three or four points. And above these, you have clusters of clusters
of clusters of points, and things like that. This is the kind of thing that can
typically be used to generate a taxonomy. For example, if you have observations
about many different types of animals
and you start performing clustering,
you may realize that there is a very close
relationship between humans and chimpanzees, and
then between humans, chimpanzees and
bonobos there is another relationship. That is what is captured by this point. And then of course, between this
family and, let's say, other mammals, there is a relationship that is captured by this. And then here on the other hand, you
have a relationship between plants. And finally, at the top, you have a single cluster. Okay? And this is how hierarchical clustering works. You first mark each point in the dataset as a
cluster by itself, like all of these points, P0 to P5, are clusters in the dataset. Then you pick the two closest cluster centers
without a parent (here, you can see we pick the two closest ones) and combine them into a new cluster.
Now the new cluster is the parent cluster
of the two clusters, and its center is the mean of all the points in the cluster. Then you repeat step two, which
is to pick the two closest cluster centers in the dataset without a parent. So this time you could be combining a cluster
center from a cluster and a leaf, and that would then become their parent cluster. And you
keep picking the two closest cluster centers each time that do not already have a parent. And that's how you get to the top level. Okay. And this structure that you end up
with is often called a dendrogram. So yeah, that's what this will look like. And these are all the cluster centers here. Now, for each type of clustering, I've
also included a video that you can watch for a detailed visual explanation,
if you want to go deeper and maybe
follow along on your own. And scikit-learn allows you to implement hierarchical clustering, so you
can try and implement it for the Iris dataset. I'll let you figure it out. Okay. So we've looked at three types of clustering. We have looked at k-means clustering. We have looked at DBSCAN, and then
we've looked at hierarchical clustering. There are several other clustering algorithms
in scikit-learn, so you can check them out too. Here you have about 10 or so clustering
algorithms, so you can check out all of them. Okay. So with that, we will close our discussion on
clustering. It is by no means exhaustive, but I hope you've gotten a sense of what clustering is,
how a couple of common
clustering algorithms work, and how to use them. There is a question: can k-means be
used for clustering geolocations? Yes, absolutely. If you have a bunch of geolocations
and you want to cluster them, simply use the latitude and the longitude,
the two columns of data, put them into
k-means and give it the number of clusters. And you can find a bunch of
clusters of geolocations. Yep. That's a great idea. And then one thing you could also do is
plot those points on a map and color them according to the cluster, and
also show the center of each cluster on the map. Okay. So let's talk about dimensionality
reduction. In machine learning problems, we often encounter datasets with a very large
number of dimensions, and by dimensions we mean the number of features, or number of columns. Sometimes they may go into dozens. Sometimes they may go into hundreds, especially when you're dealing with
sensor data. For example, let's say you're trying to train a machine learning
algorithm based on the data of an
actual flight, a flight that started from
a certain point and ended at a certain point. Now, flights have hundreds of sensors, or sometimes
thousands of sensors, like the same kinds of sensors at different places on the aircraft. And if you just collect the
information from all of these sensors, you will end up with thousands of columns. That may be a very inefficient thing to
analyze, and a very inefficient thing to even train machine learning models on, because more columns
means more data, which means more processing, which requires more resources and more time. And it may significantly
slow down what you're trying to do. So, one thing we have typically learned
in the past is to just pick a few useful columns and throw away the rest of the
columns so that we can quickly train a model. But what if you didn't have to
throw away most of the information? What if you could reduce the number of
dimensions from, let's say, a hundred to five without losing a lot of information? What if you could do that? That is what dimensionality reduction
and manifold learning are all about. So what are the applications of
dimensionality reduction? The first is reducing the size of data without much loss of information. So let's say you have a certain
dataset here, which has, let's say, 200 columns. Okay. What if you could reduce those 200 columns
of data to just five columns without losing
much information, right? What if this could still
retain 95% of the information? We'll talk about
what we mean by information retention, but if you could do that, then you
can clearly see already that anything
that you do on this
dataset is going to be roughly 40 times faster, since there is 40 times less data to process. So are you willing to trade
maybe 5% of the information for 40 times the speed? Probably, yes. Right. So that's one reason to do dimensionality
reduction, and that will then allow you to train machine learning models efficiently.
But yet another very important application of dimensionality reduction is visualizing high-dimensional
data in two or three dimensions. Now, of course, as humans we can only
visualize in three dimensions, and even three dimensions can get a little bit tricky; 3D scatterplots are very hard to read. So we are really most comfortable, at least
right now, looking at screens, with
data in two dimensions. Even right now with the Iris dataset,
we've seen the problem that we have four dimensions, but we can only really
visualize two dimensions at a time. Right? So visualizing high-dimensional data
in two or three dimensions is also an application of dimensionality reduction. So
let's talk about a couple of techniques for dimensionality reduction. The first one is principal component analysis. Principal component analysis is a dimensionality
reduction technique that uses linear projections of data
(we'll talk about what linear projections
mean) to reduce dimensions while
still attempting to maximize the
variance of the data in the projection. So that's what this is about. It's about projecting data
while still maintaining as much of the variance as possible, and this is what PCA looks like. So let's say you have this data. Let's say X
one represents petal length and X two represents petal width, and then you have all these points. Ignore the line for a moment. Now, instead of
having X one and X two, what if we could simply draw this line,
and then on this line we could set
this as the zero point, and then we could simply
project these points onto the line? So we just see how far away
they are from the zero point. Let's say we set this as the zero point,
and then we simply give this point the value minus one. This point would be minus 1.8. This point would be minus two,
this point would be minus 2.7. And this one would be maybe 0.1. This would be 0.3. This would be one. This would be two. This would be 2.3, 2.4. Okay. So now we have gone from taking this data in
two dimensions to this data in one dimension. Okay. So how do we do that? Let's take a look. So let's say we have some data. Let me look at petal length and petal
width, just two dimensions right now. So we have point one, point two, point three, point four, and so on. And for each one we have some
petal length, petal width, and so on. Now we take that and we plot it. So we plot these points here. Maybe it looks a bit like this. Okay. There are more points here. So this is what it looks like. Now, what we would rather want to see
is maybe just one value. Right? And we don't know what this column is. Let's call this PC one. Okay. We just want to see one value. So for points 1, 2, 3, 4, 5, we just want to see one
value here instead of these two columns of data. And if we can go from two columns to one
column, then we should be able to go from three columns to one column, or three columns
to two columns, or 200 columns to five columns. So the idea remains the same. So how do we do that? Well, the first thing we do is we shift our
axes a little bit to center these points. So what do I mean by shifting the axes? Well, we take this X axis and the Y axis,
and then we move them somewhere here. What exactly is meant by moving the axis? Well, in the new axes, you will notice that
the X values of these points have changed. So we are calculating the mean X
coordinate of all these points and simply subtracting the mean from all these points. Similarly, we take the x-axis and move
it around, and moving the x-axis means subtracting the mean of the Y coordinate. Okay. So what we do for each point is we take the
point P and we subtract from it the average. So let's say P has an X and a Y, and the mean also has an X and a Y. So we subtract the average X from the X,
and the average Y from the Y. So that is going to center the points. Okay.
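Here is a minimal sketch of that centering step in NumPy. The points array is made up for illustration (it is not the notebook's data), and the helper name center_points is hypothetical.

```python
import numpy as np

# Made-up (petal length, petal width) values, for illustration only
points = np.array([
    [1.4, 0.2],
    [4.7, 1.4],
    [5.1, 1.9],
    [1.5, 0.3],
    [6.0, 2.5],
])

def center_points(points):
    """Shift the axes: subtract the mean X from every X and the mean Y from every Y."""
    return points - points.mean(axis=0)

centered = center_points(points)
print(centered)               # some values are now negative, some positive
print(centered.mean(axis=0))  # both column means are (numerically) zero
```

After centering, the new origin can serve as the zero point of any candidate line that passes through it.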
Some of the points will now have negative values. Some of the points will now have positive values. Then we try out a candidate line. So we try a candidate line. Okay. This looks like a candidate
line on which to project. And then we project all of these
points on the candidate line. So we project project, project,
project, project, project. Okay. So now we have projected all
of these on the candidate line. Now, once we've projected things on the
candidate line, let me for a moment, just get rid of the X and Y coordinates. Now that we know that they're
centered, let me just move them away. Yeah. So now that we have all these projections and
we know that this is the zero point, we can see that, okay, this point can now be
represented by this point, the foot of its perpendicular. And this point can now be
represented by this point. And this point can now be
represented by this mark. So for each one, if this is zero, and this
is one, and this is two, and this is three on this new line, we can now represent
each point using a single number, which is the distance of its projection from
the zero point on this projected line. Okay. So we can now start filling in these values. So the distance of point one, the distance of point two,
this is point three, and this one's point four. So we start filling in these values. Okay. All right. So now we have already
reduced from two coordinates to one coordinate, but we want to retain as much of the
information, or as much of the variance, from these two coordinates as possible. So what we try and do is we
try and maximize D one squared plus D two squared plus D three squared and so on. Okay. So for D one squared plus D two squared
plus D three squared and so on, keep in mind that we have
subtracted the mean, and now we are squaring the distances and things like that. So it ultimately turns
out to be the variance, sigma squared, up to a constant factor. So what we want to do is we want to
try different options for this line. So let's say we want to take the same line, okay, it's going to be hard to select this
line, but we want to take the same line and we want to rotate this line. So we want to rotate this line. And each time as we rotate, all the d's will
change because all the perpendiculars will change. And we pick the line, the rotated line, for
which the sum of squares of the d's, the sum of squares of all these values, is the highest. Alright, so just to maybe clean this up a
little bit, what we're trying to see is once we have these centered points, you can see here. If I pick this line, all of the
projections are very close to zero. So if all of the projections are very close to
zero, then that means most of the information is lost because all the values, all the D one, D two, D
three, D four, all of them are very close to zero. On the other hand, if we pick a
line that goes like this, you can see that the projections are quite far away. So we are capturing the spread of the
data very nicely for a nice fitting line. And we are not capturing the spread of the
data very nicely for an ill-fitting line. So that's what PCA tries to figure out. Okay. So it figures out one line. And this is for going from two
dimensions to one dimension. Here is an example of going from
three dimensions to two dimensions. So now here we have feature one, feature
two, feature three (ignore the word gene here), and then you have points. So what we try and do is we first find PC one. We find a line along which we can project, the best possible line, which
maintains the highest variance. Then PC two is a line which is
perpendicular to the first line. Now, remember we are in three dimensions here. So there are an infinite number of lines that
are perpendicular to the first line, PC one. So again, we pick the line which
maximizes the variance of the points when the points are projected on PC two. So for three dimensions, you can have PC
one, PC two, and then when you have four dimensions, you can have PC one, PC two, PC three,
and so on as you go to five dimensions. And then what you can do is, if you have 200 dimensions, you can just choose the
five most relevant axes of variance, right? But remember, each of these is the highest
variance preserving possible line. So let's say you have 200 dimensions,
you can do a principal component analysis and you can reduce them down to the five
dimensions along which the variance of the data is preserved as much as possible. So that's principal component
analysis for you, once again. How do you do principal component analysis
in scikit-learn? Let's say we take this dataset again, and we're just going to pick
the numeric data, no species for now. So we just look at these numerical columns. We simply import from sklearn.decomposition. We import the PCA model, and
then we create the PCA model. We provide the number of target dimensions we want. The number of target
dimensions in this case is two. And then we call fit. So we say pca.fit and we give it the data. And so we are going here from four dimensions to two dimensions. And what are those two dimensions? At this point, we can't really
interpret them as physical dimensions. I mean, it's not like we've picked two of the original features. No, it's
like we've picked two possible linear combinations. And remember, because lines and projections
are involved, everything is still linear. So we've picked two possible linear
combinations of these four features, which are independent, which means that the lines
along which we projected are perpendicular. So we have picked two independent linear
combinations of these four variables. And it is the projections on those
lines that we are left with. Okay. So what just happened when you called fit? So those two things got calculated. So what got calculated is the lines. Now PCA internally knows what are the
lines, line number one and line number two. And you can look at the internals, like you
can look at what line number one and line number two are. If you do dir(pca), you
can look at some information about the lines. Yep. So these are the components, right? So these are the components. These four numbers convey the
direction of the first line, all four numbers together. And these four numbers convey the direction
of the second line in four dimensional space. Now that we have the components, we can
project these points on these lines by doing pca.transform. So now if we give pca.transform
this data, which has four dimensions, iris_df[numeric_cols], it
is going to give us transformed data in two dimensions, as you see here. Now you have this transformed
data in two dimensions. This is the data projected on line
number one, which has this direction. And this is the data which is projected onto
line number two, which has this direction. So if you know a little bit of linear algebra,
these are both unit vectors in the direction of the lines that have been picked, in any case. Now we've gone from this to
this, and let's check it out. Let's maybe now plot. So now when we do a scatter plot,
and I'm plotting this data against this data,
we can now finally visualize in two dimensions. Of course not perfectly,
something is lost for sure, but we can still visualize information from all four dimensions. So if we really want to study the
clusters (and right now I have plotted the species, but we could just as well have
plotted the clusters that were detected), let's maybe look at the DBSCAN clusters. Yeah. So these are the DBSCAN clusters. So now we can better visualize the
clusters that we generate from clustering. So that's one other thing
with dimensionality reduction. It can let you visualize the results of clustering
and maybe evaluate the results of clustering. Now of course, principal
component analysis has some limitations. The first limitation is that it uses these
linear projections, which may not always achieve a very good separation  of the data. Like obviously one thing that we noted
here was if you have this kind of a line, then information gets lost because most
of the projections fall in the same place. Here's one problem that I can tell
you about with principal component analysis. If you have a bunch of points here and you
have a bunch of points here, all of them will project to exactly the same value, right? So there is some information that does
get lost. And suppose these points belong to one class and
these points belong to a different class. So as soon as you do PCA, and then you
are trying to maybe train a classification machine learning algorithm, you are
going to lose some information, right? So there are some limitations with PCA, but
for the most part, it works out pretty well. So I would use PCA whenever you have hundreds
of features, or maybe thousands of features, and you need to reduce them to a few features. And as an exercise, I encourage you to
apply principal component analysis to a large, high dimensional dataset. Maybe take the house prices Kaggle competition
dataset, and try and reduce the numerical columns from like 50 or whatever to less
than five, and then train a machine learning model using the low dimensional results. And then what you want to observe is
the changes in the loss and training time for different numbers of target dimensions. Okay. And that's where we come back to this. If you could trade 200 columns for five, for maybe a 5% loss in variance. Now we know what information means: by information, we're saying variance. If you could trade 200 columns for
five columns for a 5% loss in variance, that would give you a 40x speedup. And maybe that speedup could be a make or break
for your analysis because now you can actually analyze 40 times more data in the same time. Right? So when you have really large datasets, PCA
is definitely a very useful thing to do. Now there is this Jupyter notebook you can check out; it goes into a lot more
depth about principal component analysis. The way this is done is using a technique
called singular value decomposition or SVD. Yeah, so there's a bunch of linear algebra
involved there, but the intuition or the process that is followed is exactly the same. You find a line, you rotate the
line to the point that you get the minimum, sorry, the maximum variance. And then you find the next line, which is
perpendicular to this line and still guarantees the maximum possible variance. And you keep going from there.
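To connect the rotate-the-line picture with SVD, here is a rough sketch on made-up 2-D data. It tries many candidate lines, keeps the one with the largest sum of squared projections, and compares it with the first right singular vector from np.linalg.svd. The data, the function names, and the number of candidate angles are all assumptions for illustration; it is not meant to reproduce scikit-learn's implementation, only to check that the two views agree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D data with two correlated columns (illustration only)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])
centered = data - data.mean(axis=0)  # step 1: center the points

def best_direction_by_rotation(centered, n_angles=1800):
    """Rotate a candidate line through the origin and keep the direction
    whose projections have the largest sum of squares."""
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    directions = np.column_stack([np.cos(angles), np.sin(angles)])  # unit vectors
    projections = centered @ directions.T      # shape: (n_points, n_angles)
    scores = (projections ** 2).sum(axis=0)    # d1^2 + d2^2 + ... for each line
    return directions[np.argmax(scores)]

def components_by_svd(centered):
    """Rows of vt are unit vectors; the first is the max-variance direction."""
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt

rotated = best_direction_by_rotation(centered)
components = components_by_svd(centered)

agreement = abs(rotated @ components[0])               # close to 1: same line, up to sign
perpendicularity = abs(components[0] @ components[1])  # close to 0: PC2 is perpendicular to PC1
print(agreement, perpendicularity)
```

The brute-force rotation only works comfortably in two dimensions; SVD gives the same directions without having to try every angle, which is why it is the practical choice.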
Okay. Let's talk about another
dimensionality reduction technique. And this is called the t-distributed
stochastic neighbor embedding technique, or t-SNE for short. And this belongs to a class of
algorithms called manifold learning. And manifold learning is an
approach to perform nonlinear dimensionality reduction. PCA is linear. And then there are a couple more like
PCA, something called ICA, LDA, etc. They're all linear in the sense that they
all come down to some sort of linear algebra matrix multiplications, but there are some
limitations sometimes with linear methods. So you can use some of these non-linear
methods for dimensionality reduction. And they're typically based on this idea
that the dimensionality of many datasets is only artificially high, and that most of these
datasets, which have a hundred, 200 or 500 columns, can really be captured
quite easily with four or five columns. And we just have to try and figure out how to come up with those
four or five columns of data, whether through some formula being applied to
all the columns or otherwise. It's kind of like feature engineering, except that the
computer is trying to figure out these features for you based on certain rules that have been
built into these different kinds of models. So you have a bunch of different manifold
learning techniques in scikit-learn, and you can see here, this is the original dataset. So it is plotted in 3D. And they've just colored the points
maybe to give you a sense of which point goes where, and this 3D dataset, when you
apply these different manifold learning techniques,
collapses to this kind of a graph in 2D. So you can see here that this basically
is able to separate out red from yellow, from green, from blue, which
would be very difficult for PCA to do. Like if you tried to draw two lines and drop
projections, it would be very difficult for you to get a separation like this. But Isomap is able to do that. You can see it with t-SNE as well. It is able to separate out the red from
the yellow, from the green, from the blue. And these are all different kinds
of separations that you get. Now t-SNE specifically is used to visualize
very high dimensional data in one, two or three dimensions. So it is for the purpose of visualization,
and very roughly speaking, this is how it works. We're not going to get into the detailed
working of t-SNE because it's a little more involved. It would take a little more time to talk about it
in a lot of detail, but here's how it works. Now, suppose you have, again, these clusters
of points, let's say this is petal length and this is petal width. What would happen normally, if you
just directly projected them onto this
line, is that all of these blues would then overlap with a bunch of oranges. And those would overlap with a bunch of reds. What t-SNE tries to do is first project
them on a line and then move the points around, and it moves the points
around using a kind of a nearness rule. So every point that is projected
down on the line is moved closer to the points that are closer to it in the real
dataset, in the original dataset. And it moves away from the points that it
is far away from in the original dataset. Okay. Now I know that's not a very
concrete way of putting it, but here's what it means. As you project this point
down, it will end up here, the blue point. As you project this point down, it will end up here, the orange point. Now what t-SNE will do is t-SNE will
realize that the blue point here should be closer to other blue points and should
be farther away from orange points because that's how it is in the real dataset. So it's going to move the point. It's going to move the point closer to
the blue points, and it's going to move the orange points closer to orange points. Okay. So the closeness in the actual data is
reflected as closeness in the reduced dimension. Okay. In the dimensionality
reduced data. Okay. So just keep that in mind. And if you have that intuition, you will be
able to figure out when you need t-SNE. When you need to maintain the closeness,
no matter how much you reduce the number of dimensions by, t-SNE is useful. And here is a visual representation
of t-SNE applied to the MNIST dataset. And the MNIST dataset contains
60,000 images of handwritten digits. So it contains 28 pixel by 28
pixel images of the handwritten digits, zero to nine. Each pixel, remember, simply represents
a color intensity like red, green, blue, or in this case they are gray. So this represents how
gray that particular pixel is. Each pixel is a number. So that means each image is represented
by 784 numbers, 28 times 28. So we can take those 784 dimensions and we
can use t-SNE to reduce that data to two dimensions and then plot it. And when we plot it, this is what we get:
t-SNE is able to very neatly separate out all the images of the number zero from all
the images of the number one, from all the images of the numbers 2, 3, 4, 5, and so on. Okay. So I encourage you to check
out this video, and then there's also a tutorial that I've linked below. Yeah, sorry, there's a tutorial here. You can check
this out on how to actually create this graph. It will take you maybe an hour or so to create
this graph, to download the dataset and maybe look at some samples and create this graph. But what we're trying to get at here is that
t-SNE is very powerful for visualizing data. So whenever you have high dimensional
data, use t-SNE to visualize the data. Now, t-SNE does not work very well if you have a lot of dimensions. Like
maybe even with 784, it's not ideal to directly reduce it to two dimensions. So typically what ends up happening is you
first have, let's say, 784 dimensions. You would take those 784 dimensions and
you would perform PCA, principal component analysis, and reduce them to about 50 dimensions. And then you would take those 50
dimensions and reduce them to two dimensions using t-SNE for visualization. Now these two dimensions, the data that
you get from t-SNE is not that useful for doing machine learning or even for doing data analysis. It is useful for visualization
because you can see which points are closer together in the original data. That's what it is trying to tell you. Okay. So that's where you'll see t-SNE used
as a visualization technique in a lot of different machine learning algorithms. And how do you perform t-SNE? Exactly the same way as PCA. So you just import the TSNE class, and
then you set the number of components, or the number of dimensions that you want. And we want to take this four dimensional data and we want to transform it to two dimensions. Now, again, with t-SNE, there's
no predict and there's no transform. There's no separate fit and transform step. Both of them are combined into a fit_transform
because closeness of the points is very important. So in some sense, you don't
use t-SNE on new data. You only use t-SNE on the
data that you already have. So that's why we are doing a fit_transform here. And that gives us the transformed data. So this is the transformed data here. So we've gone from four dimensions to
two dimensions, and then when we plot it, you can now see that t-SNE
has really separated out the points here. So you have one class here and then
you have these two classes here. So it may hold true that
these points were actually quite near to each other in the original dataset. And that is why they are near to each other here. And these points are far away from
each other in the original dataset. So that's why they're far
away from each other here. Yup. So the takeaway is: PCA is good when you're
doing machine learning, and t-SNE is good when you want to visualize the results. So try and use t-SNE to visualize
the MNIST handwritten digits dataset. I've linked to the dataset here. Okay. So with that, we complete our discussion of
unsupervised learning, at least two aspects of it, clustering and dimensionality reduction. Again, there are many more ways to
perform clustering in scikit-learn. There are many more ways to perform dimensionality
reduction and all of them have different use cases, but in general, you mostly would just
start out by using k-means clustering and by using PCA for doing dimensionality reduction. And then in a lot of cases, maybe a better
clustering algorithm or a better dimensionality reduction algorithm will give you a slight boost, but you
should be fine with just using the most basic ones to begin with in a lot of cases. And do check
out some of these resources to learn more. So that brings us to the end of this
notebook on unsupervised learning. Now let's go back to the lesson page and on
the lesson page, let's scroll down once again to lesson notebooks and open up the second one. This is a notebook on the topic called
collaborative filtering, which is a common technique used to build recommender systems. So once again, let's click the
run button and select run on Binder. And here we have the notebook running on Binder. Now that I've run this notebook, I'm
just going to go to Kernel, restart and clear output, to remove all the previous outputs. I am going to hide the header and toolbar and
let's zoom in, and we are ready to get started. So we are going to talk about this
technique called collaborative filtering, using this library called fastai. But there are a bunch of other libraries to do
the same thing, and we'll try and build a state of the art movie recommendation system with just
10 lines of code. Again, as with scikit-learn, and as with most machine learning algorithms,
you don't really have to implement the internals. You simply have to use the library, but
you have to know what parameters to change. And it helps to understand how these
algorithms work so that you can pick the right algorithm for the job. Okay. So you combine that intuitive understanding
with being able to manipulate the code well, and that's what makes you
a good machine learning practitioner. And as you go deeper, you can learn more
about the math involved, but it's the practical aspect that is more useful. So recommender systems are at the core
of pretty much every online system we interact with. Social networking sites like
Facebook, Twitter, Instagram recommend posts you might like, or people you might know
or should follow. Video streaming services like YouTube and Netflix
recommend videos, movies, or TV shows you might like. Online shopping sites like Amazon
recommend products that you might want to buy. In fact, there is a big debate right now about
how much personalization there should be, because that sends people into these filter
bubbles, or echo chambers on Twitter, or filter bubbles on Facebook and other places. And there is also a question about how
much of this recommendation and targeting you should be applying for advertising. At what point does it become mind manipulation? Right? So it's a tricky territory, but at the very
least we should try and understand how some of these systems work so that we can
build intuition about how these algorithms work. And maybe we can try and counter those algorithms
in the cases that we need to. And there are two types of recommendation methods. There is content based recommendation. So let's say I watch a movie, and Netflix
has a lot of attributes about movies. Netflix has attributes like when the movie
was released, or who was the director of the movie, or which language the movie
was in, what was the cost of the movie, what was the length of the movie, what was the genre of the movie, etc. Netflix can then maybe recommend
to me similar movies in the genre, and Netflix can recommend to me movies by the same director or
movies with the same actors. So that is called content based recommendation. But the other kind of recommendation
is called collaborative filtering, and collaborative filtering is a method of making
predictions about the interests of a user by collecting preferences from many users. So the underlying assumption in collaborative
filtering is that if a person A has the same opinion as a person B on an issue, then A is
more likely to have B's opinion on a different issue than that of a randomly chosen person. That's the Wikipedia definition.
To put it in Netflix terms: if I know that you've watched a movie, or
if I look at your watch history, and I look at 10 other people who have a similar watch
history, and if I pick a movie that they have watched and you haven't watched, it is
very likely that you will like that movie. So that's a very different way of thinking
about recommendation because now we're no longer asking, okay, what's the genre, what's the
language, who's the director, who is the actor, who's the actress, etc. Now we are saying, this person likes a
few movies, and there are other people who also like these kinds of movies. And there is a certain movie which
this person has not watched, but a lot of these other people have watched. So just based on that fact, you can
make a pretty good guess, right? And that's just simply because of how human
beings function or how human beings think. We think alike in several ways. And collaborative filtering
tries to capitalize on that: if you have the same opinion as
a person B on one issue, then you are likely to have the same opinion as
person B on a different issue as well. There are many different algorithms
that implement collaborative filtering. The key idea of collaborative filtering is
instead of looking at the content, you look at the connections between users and items. Now these items could be movies in the case of Netflix, these
items could be products in the case of Amazon, these items could
be other users or friend suggestions in the case of Facebook or Twitter, and so on. But on the one side, you almost always have users. It's a very human centric algorithm in some sense. And there is this library called LibRec
for Java, which has over 70 different algorithms for collaborative filtering. And in this tutorial, we will look
at a relatively new technique, which is also one of the most powerful ones,
called neural collaborative filtering. Okay. So here's the dataset we are going to use. This dataset is called the MovieLens
100K dataset. It is a collection of movie ratings by 943
users on 1,682 movies. So there is a movie site like IMDb. This one is called MovieLens, and
they have a bunch of movies listed. They have about 1,600 movies listed on the
site and they have several users who have put ratings on these movies. So here's what the dataset looks like. You have a user ID and then you have a movie ID,
and then you also have the title of the movie. And then you have a rating that
the user has given to the movie. Now, of course not every user
has seen or rated every movie. So if every user had rated every movie,
then you would have 1,682 times 943. You would have more than a million
ratings, but in total you only have a hundred thousand ratings, which tells you that most
users have watched maybe 10 movies, or sorry, a hundred movies out of the thousands. So most users have watched less than 6% of the movies in the database. And that actually gives us a big opportunity. Can you recommend movies to these users now? Can you look at user number 42, look at all the movies they've watched? Can you look at other users who
have also watched those movies? And can you recommend to user 42 a
movie that other users like him have watched, but he hasn't watched? Okay. And recommending is the final target that we want
to recommend, but what is the objective we set? So the objective we can set
is: given any random movie, can we figure out what rating user 42 is
going to give that movie once they watch it? So if you can predict that user 42 is going to
give Wag the Dog a rating of two, then maybe you shouldn't recommend Wag the Dog to user 42. But if you can predict that user 42 will
give a rating of four to The Graduate, then you should recommend The Graduate to user 42. So ultimately recommendation is more of a
ranking problem because you have a bunch of items which a particular user has not seen
or has not bought, or has not accessed. If you could rank those items by the
probability or by some measure, in this case the rating, an estimated
rating of what the user would give that item, then you can simply recommend the highest
rated or the highest scoring item to the user. Okay. So keep in mind that when we
say we want to recommend things, what we want to do is we want to rank
all the unseen items, and then we want to take the top two, three or five
unseen items and show them to the user. So when Facebook recommends a friend suggestion
to you, or when Netflix recommends a movie to you, they're actually ranking hundreds of
movies behind the scenes, just for you. And out of all of those, they pick
out maybe the top five, or maybe a random five out of the top 20, and show you those on your dashboard. And that's why each time you refresh, you may
actually see a different result because they also introduce some randomization because
they know that it's, it doesn't make sense to always keep showing you the same thing. People also like variety. So there's a good combination of randomization
as well as just like a ranking system here. Okay. So that's the setup here. Every user is given a unique
numeric ID ranging from one to 943. And every movie is given a unique numeric ID
ranging from one to 1,682. Users' ratings for movies are integers ranging from one to five,
five being the highest, one being the lowest. And on average,
people have seen about a hundred movies. So there are 900 other, or sorry, there are about 1,500 other movies to
choose from for pretty much every user. And what we want to do is build a model that can
predict how a user would rate a movie that they have not already seen by looking at the movie
ratings of other users with similar tastes. That's our objective. Now, if we can rate movies that they
haven't already seen or predict how they would rate a movie that they haven't
already seen, then we can simply make those predictions and recommend to them the
movies with the highest predicted ratings. So the first step is to
download the MovieLens dataset. It is available on the GroupLens website,
and once it is downloaded, we would also have to unzip it because it's a zip file. So all that code is here. So you can just download and unzip the dataset. And that puts the data in this ml-100k folder. Okay. And then there's a bunch of information here in
all these files, and we will look at them one by one slowly, but now we have the
data downloaded, and the main data is in this file, u.data. So if I just look at the file u.data, if I
just do head ml-100k/u.data, you can see that there
is this user ID, there is this movie ID, and then there is this rating, and
we can ignore this last one for now, but we have all the useful information that we need
in u.data: user ID, movie ID, and rating. So let's install the library. We are going to use fastai, and a specific
version of this library, and let us also import it, and let's first open up the file u.data. Now u.data is a tab delimited file. So far we have seen space separated files. We've seen comma separated files, what's called a CSV. This is a TSV file. So in this case you would actually
see, this is one character. So this is the tab character, and pandas
can read in tab delimited files as well using the same read_csv function. So we just call pd.read_csv. We give it the link to the file, oh, sorry, the path to the file. And we just specify that instead of being separated
by commas, it is being separated by tabs. And this \t is used to indicate the tab
character, and there is no header. But it's given in the README that this is
the user ID, movie ID, rating, and this is some kind of a timestamp, and this is what it looks like. So now we have a user ID, movie ID, rating. And now from this point on, we
are going to use this library. And with this library, we are going to
train our collaborative filtering model. Okay. So we need to create a collab data bunch. This is library-specific stuff, so I won't worry too much about it, but
we are simply putting the ratings data frame into this CollabDataBunch class. And we are also telling this library that we
want to use 10% of the data as a validation set. Okay. So this is what the data looks like internally: user ID, movie ID, and target. Now here is our model. The model itself is actually quite simple. What we want to do is represent each user and each
movie by a vector of a predefined length n. Okay. So I take each user, I take user number
one, and represent them using n numbers.
Let's say n is five.
So I want to represent each
user with five numbers: u1, u2,
u3, u4, u5. So user number
14 is represented using these numbers. User number 29 is represented using these numbers. User number 72 is represented using these numbers. User number 211 is
represented using these numbers. So each user will be
represented using a vector,
which is simply a list of numbers. And each movie will also be represented
using a vector, and both of them will have the same size. So movie number 27 is represented
using this vector. Movie number 49 is represented using
this vector. Movie number 57 is represented using this vector. And our model
predicts the rating for a particular movie by a particular user simply by taking
the dot product of the two vectors, and a dot product is simply an element-wise multiplication
followed by a sum of the products. So let's say we want to get the predicted
rating by user number 14 for movie number 27. So what is the rating that user number 14 is going
to give movie number 27? We multiply 0.21 by minus 1.69. We multiply 1.61 by 1.01. We multiply 2.89 by 0.82, and so on. So it's u1 m1 plus u2 m2
plus u3 m3 plus u4 m4 plus u5 m5. And that gives us this value, 3.25. So in our model, assuming we have these
vectors for all the users and have these vectors for all the movies, our model's
rating is simply the dot product of the user and the movie. So that's what our model is going to be. Now, the objective for us is going to be to
figure out good vectors, vectors which satisfy the given data. Remember that not every user has
watched every movie, but we do have a hundred thousand movie ratings. So suppose we can figure out a good set of user
vectors and a good set of movie vectors that fits the training data well,
so that if I take the vector for user number 224 and the vector for movie number 925
and multiply them, I get a value close to the actual rating. That means I have a good model. Similarly, if I take the vector for user
ID 502 and movie ID 217, multiply them together and get the rating two, that means I have a good model. Right? So the training part of the algorithm is to
figure out a good set of vectors for all the users and a good set of vectors for all the movies, so that the data that we know
already can be predicted well. And if we can predict the data that we
already know well, then that means we should be able to predict the data that we don't know as well, right? So the idea is that if your model
can accurately predict ratings for the movies that have already been watched by users, your
model will be able to predict ratings for the movies that have not been watched by users. Okay. That's the idea. Now, how does this component of
collaborative filtering come into the picture here?
How does the interplay of
different users come together? Well, think of it this way. This number, minus 1.69, will get multiplied
with every user vector to get the rating, right? So if a lot of users have given a particular
movie a high rating, then it's possible that a lot of those users have
a very high value in a certain dimension, which multiplies with the same dimension
of the movie and gives a high result. So every other
new user who has a high value on that
same dimension will also
give that movie a high rating. Okay. To make it more concrete, suppose
this value represented the amount of romance in the movie. Or maybe let's consider this
one instead: let's say this represents the amount of romance in movies. So right now the value is one, so that means it's probably high. So this is a highly romantic movie,
whereas this is not a very romantic movie. So all the users who like romantic movies
will have a high second dimension. They will have a high value in this column. And all the users who do not like romantic
movies will have a low value in this column. Similarly, all the movies that are romantic will
have a high value in this column and all the movies that are not romantic will have a low
value or maybe even a negative value in this column. Maybe this is the opposite
of a romantic movie, right? So it's not that this does actually
represent romance. What these represent, we don't know. The model will figure
them out, but that's how you can think of each of these dimensions. It rates the movie on that particular
dimension, romantic to non-romantic. And it also rates a user's preference
on that dimension: how much they like romantic movies or how much they hate romantic movies. Okay. All right.
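As a rough sketch of the idea just described, here is the dot product written out by hand. The vectors are made up for illustration (they are not the numbers from the notebook), and the assumption that the second dimension stands for "romance" is only the thought experiment from above; a trained model gives its dimensions no such labels.

```python
# Hypothetical 5-number vectors, invented for this illustration.
# By assumption, index 1 (the second dimension) plays the role of "romance".
user_vectors = {
    "likes_romance":    [1.0,  1.5, 0.5, 0.2, 1.0],
    "dislikes_romance": [1.0, -1.0, 0.5, 0.2, 1.0],
}
movie_vectors = {
    "romantic_movie": [1.0,  1.2, 1.0, 0.5, 0.8],
    "action_movie":   [1.0, -0.8, 1.0, 0.5, 0.8],
}


def dot(u, m):
    """Element-wise multiplication followed by a sum of the products."""
    if len(u) != len(m):
        raise ValueError("user and movie vectors must have the same size")
    return sum(ui * mi for ui, mi in zip(u, m))


def predicted_rating(user, movie):
    """The model's rating is just the dot product of the two vectors."""
    return dot(user_vectors[user], movie_vectors[movie])


for user in user_vectors:
    for movie in movie_vectors:
        print(user, movie, round(predicted_rating(user, movie), 2))
```

The two users differ only in the second number, and so do the two movies, so that one dimension alone decides who gets a high predicted rating for which movie: a high value meeting a high value (or a negative meeting a negative) pushes the rating up, and mismatched signs pull it down.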
So now we want to figure out these vectors. That is the whole learning process. The vectors are
initially chosen randomly,
so it's unlikely that the ratings predicted
by the model match the actual ratings. So our objective while training the model is
to gradually adjust the elements inside each user and movie vector so that the predicted
ratings get closer to the actual ratings. For this, fastai has a collab learner class. We'll not bother too much with
the details here, but there are just a couple of things we need to give it.
The first is the number of factors,
which is the size of each vector. So remember, user vectors and movie vectors
should each have the same number of dimensions. Otherwise the dot product will not match up. So we will set the number of factors to 40. And you can think about this as the
power of the model or the capacity of the model. The fewer factors you have, the less flexibility you will have. The more factors you put in, the richer
the range of relationships you can capture. If you put in
just two or three factors, then your model is forced to optimize on just maybe one or
two attributes of the movies and the users. But if you allow the model to have 40
factors, it can optimize on a bunch of different attributes. For example, if you have 40 factors,
one of the factors could end up representing, through the process of
training, whether the movie stars Brad Pitt or not, and maybe that's a big factor for
making sure that your predictions are good. So that's the balance there. Of course, if you have too many factors,
then your model will overfit. Your training loss will be very low, because your model will basically try
to memorize every single movie, and then it will not generalize
well on the validation set. And that's why we have a validation set as well. And then you have this y range. This y range is simply to indicate
to the model that the rating
predictions we want to get are in the range of zero to 5.5. They're really in the range of one to five,
but you give a slightly bigger range so that the model can go up or down a little bit. Okay. And then we also have this weight decay term. This is simply a regularization term. So this weight decay term ensures
that these values don't get too large. Now with a lot of machine learning models,
we generally want to keep the weights small. It's just a bit nicer when the weights
are smaller; all the numerical methods work well. So this weight decay term is just
there for regularization. And the model that fastai, or this neural
collaborative filtering technique, uses is just slightly more advanced. Each user is still represented by a vector. Each movie is also still represented
by a vector, but then there are also two bias terms that are included. So the rating is now calculated as the user vector
multiplied by the movie vector, the dot product, plus a user bias. This is a single extra number, ub, that is learned
for each user alongside their vector, and then
similarly, there's a single extra number, mb, for each movie. So you can think of it like the
bias term in linear regression. Suppose a user vector is all zeros. There should still be a certain rating. Then similarly, suppose a
movie vector is all zeros. There should still be a certain rating, right? So this is just a bias term. A bias term is always nice because it
gives slightly more power to the model. And finally, remember this y range. So the model internally also takes this output,
u1 m1 plus u2 m2
plus so on, plus ub plus mb, and it applies a sigmoid function to it. So it takes the output and squishes
it into the zero to one range, and then it scales it to the given y range. So the model internally ensures that no
matter what the user and movie vectors are, the rating will always fall in the zero to 5.5 range. Okay. That's just another internal implementation
detail that you don't really need to worry about, but that's roughly what the model is. So you are simply telling the library that I
want you to create vectors of size 40 for users and movies in such a way that when you multiply
them together and you add the bias terms and then
scale the results into the zero to 5.5 range, the predictions that you get are very close
to the
actual data that I've given you. And you can check the data: it has a hundred thousand
ratings, a hundred thousand rows, right? So your model has a hundred
thousand ratings to optimize on. And its job is to create these
vectors, which will predict those hundred thousand ratings accurately. And then of course there is a loss
function involved internally, because
you can see that now you've
framed it as a regression problem. The model is trying to predict the rating. So initially it will try to predict the rating,
and its ratings will be way off. Its
ratings will be bad for all the
hundred thousand pairs of users and movies that we have in ratings_df.
So the model will try and make a prediction
for user ID 16 and movie ID 242 by multiplying their vectors and calculating the rating. And the rating will be way off. That will be used to then compute the mean
squared error or the root mean squared error. So the input data, which
is simply the user ID
and the movie ID, goes into the model. The model
looks up the vector for the user ID, the model looks up the vector for the movie ID, multiplies
them together, adds the bias, and applies the scaling to come up with a predicted rating. There is of course a real rating. So we take the predicted rating, we take the real rating, and we
do this for all the input data,
all the ratings
that we have in the training set, and we compute the root mean squared error. We then apply an optimization technique
called gradient descent and use that to then adjust all the weights. We adjust all the user vectors. We adjust all the movie vectors
slightly, so that the next time that we put the input into the model, it gives slightly better predictions. And then we compare it again with
the target rating. And then once again, we
perform gradient descent. So in each iteration, we are
adjusting the vectors for all the movies and
all the users slightly. Okay. And that's what all of this business
here is. There's a certain way of training these models, and there is a certain learning rate
involved, which is how much you adjust the vectors in each iteration. All of that is involved. So what we're going to do is we
are going to run five iterations. So we are going to pass all the data
through the model five times, and we are going to use a learning rate of 0.1. So each time, the adjustments that are
made to the vectors for the
users and the movies are scaled by 0.1. Okay. So that's what we're going to do here. And there's also another detail
internally that when gradient descent is performed, it is done in batches. So we don't take all the data
and put it into the model at once. We take batches of maybe a hundred
inputs, put them into the model, get the outputs, calculate the loss, and perform
the optimization to improve the weights. We then take the next batch of a hundred. That's just a little bit faster. So when you run this learn.fit_one_cycle, all
it's doing is taking a batch of, let's say, a hundred user-movie pairs and putting it into the model. It is then getting some outputs. It's going to compute the loss. Then it's going to change the weights. And then it's going to
put in the next hundred. And we're going to do a hundred, a hundred,
a hundred till we run out of all hundred thousand. So that means a thousand iterations. And then we repeat that whole thing five times. So the entire data set, in batches of a hundred, goes through
this training process five times. So this five is called the number of epochs, and this is called the learning rate. And here, after each
epoch, you can see what the training loss is
and what the validation loss is. This is
actually the mean squared error,
so the square root of that would
be the root mean squared error. And the mean squared
error after five epochs comes down to about 0.80. So that's the mean squared error. So if I just do sqrt of 0.8,
the square root of 0.8 is 0.89. So that means our root mean
squared error is 0.89, or about 0.9. That means our predictions are off by about 0.9. So if a user has given a movie a rating of
four, our prediction is in the three to five range. If the user has given a movie a rating
of two, our prediction is in the range of one to three. So we're not that far away. And how are we deciding how
good our predictions are? Well, we've set aside 10% of the ratings
as the validation set, and we're only using 90% of the data as the training
set. Using that 90% of the data, we train the model, and then for
the last 10%, the validation set, we are not showing the model the actual ratings. We are asking the model to just make
predictions given a user ID and a movie ID, and we take that prediction and compare it
with the rating in the validation set,
just as we've been doing for all our supervised learning techniques, right? So this is also a form of supervised learning. And this validation set basically
tells us that for user-movie pairs it has not seen, the model is able to
make predictions with a mean squared error of 0.8, which means it is off by
about plus or minus 0.9 on the whole. And that's not bad at all. Plus or minus 0.9 is not bad at all. Here, we are just making some
predictions on the validation set. So the real rating for a
particular user-movie pair was... okay,
let me also just print
the users and items here. Just give me a second. Yup. So here are some predictions. This is on the validation set, so the model did not see this
data; it was not used for training. So for user number 861 and movie number
736, the real rating the user has given the movie
was 4.0, and our model, having never seen this combination
before, having never seen the real rating, was able to predict about 4.1. So it was only off by 0.14. For user
number 118 and movie number 547, the real rating the user gave the movie was 5. Our model predicted 4.2. So it was only off by 0.8. Similarly for user 458, the real rating
was 4.0 and our model predicted 3.8. So we're only off by minus 0.2. So our model has gotten pretty good. And of course it should: it
has looked at 90% of the data.
It has looked at 90,000 ratings, and using those
90,000 ratings, it has come up with these vectors for all these users and all these movies. Now here's what we can do. Given a particular user, we can now make predictions for that
user for all the possible movies. Okay. And the way to do that would still be just like
this: we just call this model
object with a bunch of users and a bunch of items. So we just create
a list like this. For users, we simply put in 1, 1, 1, 1, 1, 1, 1,
because I just want to make predictions for
user number one, and then for items,
I simply put in all the movies
that I want to rank for the user. So 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. How many is that? 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15. Yup.
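To make the whole procedure so far a bit more concrete, here is a small from-scratch sketch of it in plain Python. It is not the library's code: the class `TinyDotBias`, the helper names, and the toy ratings are all made up for illustration, it uses plain mini-batch gradient descent rather than whatever optimizer and schedule `fit_one_cycle` uses, and it leaves out weight decay. It does follow the steps described above: random vectors plus biases, a dot product squashed into the y range with a sigmoid, batches and epochs, a mean squared error, and finally one user repeated against every movie we want to rank.

```python
import math
import random

Y_LOW, Y_HIGH = 0.0, 5.5  # the y range from the lesson: slightly wider than 1 to 5


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyDotBias:
    """Hypothetical toy model: user and movie vectors plus one bias per user
    and per movie, with the result squashed into the y range."""

    def __init__(self, n_users, n_movies, n_factors, seed=0):
        rng = random.Random(seed)
        self.n = n_factors
        # Vectors start out as small random numbers, biases start at zero.
        self.u = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
        self.m = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_movies)]
        self.ub = [0.0] * n_users
        self.mb = [0.0] * n_movies

    def raw(self, user, movie):
        dot = sum(a * b for a, b in zip(self.u[user], self.m[movie]))
        return dot + self.ub[user] + self.mb[movie]

    def predict(self, user, movie):
        return Y_LOW + (Y_HIGH - Y_LOW) * sigmoid(self.raw(user, movie))

    def train_batch(self, batch, lr):
        """One gradient descent step on the mean squared error of one batch."""
        gu, gm, gub, gmb = {}, {}, {}, {}
        for user, movie, rating in batch:
            s = sigmoid(self.raw(user, movie))
            pred = Y_LOW + (Y_HIGH - Y_LOW) * s
            # Derivative of the squared error w.r.t. the raw score, averaged over the batch.
            dz = 2 * (pred - rating) * (Y_HIGH - Y_LOW) * s * (1 - s) / len(batch)
            acc_u = gu.setdefault(user, [0.0] * self.n)
            acc_m = gm.setdefault(movie, [0.0] * self.n)
            for k in range(self.n):
                acc_u[k] += dz * self.m[movie][k]
                acc_m[k] += dz * self.u[user][k]
            gub[user] = gub.get(user, 0.0) + dz
            gmb[movie] = gmb.get(movie, 0.0) + dz
        # Apply the accumulated gradients, scaled by the learning rate.
        for user, g in gu.items():
            self.u[user] = [w - lr * gk for w, gk in zip(self.u[user], g)]
            self.ub[user] -= lr * gub[user]
        for movie, g in gm.items():
            self.m[movie] = [w - lr * gk for w, gk in zip(self.m[movie], g)]
            self.mb[movie] -= lr * gmb[movie]


def mse(model, ratings):
    return sum((model.predict(u, m) - r) ** 2 for u, m, r in ratings) / len(ratings)


def fit(model, ratings, epochs, lr, batch_size, seed=0):
    rng = random.Random(seed)
    data = list(ratings)
    for _ in range(epochs):  # one epoch = one pass over all the ratings
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            model.train_batch(data[start:start + batch_size], lr)


def recommend(model, user, n_movies, seen, top_k=3):
    """Score every unseen movie for one user and return the highest predictions."""
    items = [movie for movie in range(n_movies) if movie not in seen]
    users = [user] * len(items)  # the same user repeated, once per movie to score
    scored = sorted(((model.predict(u, m), m) for u, m in zip(users, items)), reverse=True)
    return [(movie, round(pred, 2)) for pred, movie in scored[:top_k]]


# Made-up (user, movie, rating) triples: users 0 and 1 share tastes, users 2 and 3 the opposite.
ratings = [
    (0, 0, 5), (0, 2, 1), (0, 3, 1),
    (1, 0, 5), (1, 1, 5), (1, 2, 1), (1, 4, 4),
    (2, 0, 1), (2, 2, 5), (2, 3, 5), (2, 4, 2),
    (3, 1, 1), (3, 2, 5), (3, 3, 4),
]
model = TinyDotBias(n_users=4, n_movies=5, n_factors=3)
mse_before = mse(model, ratings)
fit(model, ratings, epochs=300, lr=0.05, batch_size=4)
mse_after = mse(model, ratings)
seen_by_user0 = {movie for user, movie, _ in ratings if user == 0}
print("MSE before:", round(mse_before, 3), "after:", round(mse_after, 3))
print("Recommendations for user 0:", recommend(model, 0, 5, seen_by_user0))
```

With fourteen ratings and a batch size of four, each epoch is four gradient steps, which is the same arithmetic as a hundred thousand ratings in batches of a hundred giving a thousand steps per epoch. The `recommend` helper mirrors the users and items lists from the lesson: one user ID repeated, paired with each movie ID to be scored.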
There you go, 15. And I'm just going to hide this here. Okay. So now for user one, movie one,
the predicted rating is 4.4. Let's forget the real rating for a moment. For user one, movie two, the predicted
rating is 3.0. For user one, movie three, the predicted rating is 3.4. So for every user and for every movie, you can get this rating, and then you can
find the highest rated movies for that user and simply recommend those. Okay. So we can now use this model to
predict how users would rate movies that they haven't seen and recommend
the movies that
have a high predicted rating. And this is what that interface might look like. The moment you open Netflix or something,
it could recommend the movies you might be interested in. And of course you could also just show an average
rating, or maybe you could just show ratings from users who have a very similar vector. So now here's the interesting thing, that you
could look at these vectors and use
them as a similarity measure. The
users for which the vectors are very close by,
which have a very low Euclidean distance, are very similar. So you can now use these vectors to cluster users,
and you can use these vectors to cluster movies. So as you cluster the users, you will maybe
find action enthusiasts, maybe you will find teenage girls, maybe you will find older people. And if you then cluster movies, maybe
you'll find action movies together. Maybe you'll find movies starring
a particular actor together. Maybe you'll find Christopher
Nolan movies together or close by. So now there's a lot of analysis that
you can do on these vectors that you have found for these users and for movies. Okay. And these vectors can be seen inside
model.parameters, I believe. Oh, sorry, learn.model.parameters
should have these vectors. Yup.
So these are the user vectors,
and then these are the movie vectors, I believe.
And then these are probably
the biases or something. Okay. So that's recommender systems. We've
seen a high-level overview of how this works.
But what I want you to take away is just
the simple idea that a lot of different problems, including these recommendation
problems, can be expressed in these terms. So it's always a question of first trying
to classify whether you're looking at a classification problem or a regression problem,
or a clustering problem, or a dimensionality reduction problem, or a recommendation problem. And then you have a bunch of patterns. If it's a recommendation problem, then maybe
you can do content-based filtering
if you have a lot of content attributes,
like genre, director, etc. for a movie. We did not do that. Or you can do collaborative filtering
if you have a lot of user data. Now
collaborative filtering suffers from what's
called the cold start problem, which is that
initially, when you're starting your website and there are no users and no movies, what do you do? So initially you want to use content-based
recommendations, but maybe later you want to make them collaborative recommendations, or maybe
you want to keep using a combination of both. But yeah, this is yet another category of
machine learning problems, and recommender systems could potentially be a course in itself. So that brings us to the end of this
notebook and this lesson on unsupervised learning and recommendations. This is also the last topic in the course. So let's go back to the course homepage and review
the topics that we've covered in this course so far. So we started out talking about
linear regression with scikit-learn. We saw how to download a real-world data
set and prepare data for machine learning. We talked about building a linear regression
model with a single feature, and then with multiple features. Then we also saw how
to generate predictions using the trained model and how to evaluate machine learning models. We then applied the same principles to
logistic regression for classification. So this was a different kind of problem
where instead of predicting a single number, we were attempting to classify
inputs into one of two categories. So we saw how to download and
process data sets from Kaggle. We saw how to train a logistic regression
model and how it works under the hood. We also looked at model evaluation and prediction. We saw how to save the trained weights of the
model so that we do not have to train a model from scratch each time we need to use it. Then you worked on the first assignment,
where you trained your first machine learning model, where you downloaded
and prepared a data set for training. Then you trained a linear regression model
using scikit-learn, and then you also made some predictions and evaluated the model. Next, we looked at decision trees
and hyperparameters, where we once again downloaded a real-world dataset and prepared that dataset for training. And then we trained and
interpreted decision trees. We also learned about hyperparameters that
can be applied to decision trees to improve their performance and to reduce overfitting. In the next lesson, we looked at
random forests and regularization. We saw how to go from a single decision
tree to a forest of decision trees, with several randomizations
applied to each tree. So we saw how to train and
interpret random forests. We looked at ensemble methods in general, why
they work and how random forests apply them. And we also tuned some hyperparameters
for random forests, specifically to reduce overfitting and regularize the model. The second assignment was on training decision trees and random forests,
where you prepared a real-world data set for training, and then you trained
a decision tree and a random forest,
and finally did some
hyperparameter tuning to regularize the model. Next, we looked at gradient
boosting with XGBoost. We trained and evaluated an XGBoost model. We learned about gradient boosting, the
technique where we train multiple models to correct the errors made by previous models. That is the technique called boosting.
Specifically, when it is done with trees, that
is called gradient boosted decision trees. You can
also have
gradient boosted linear models as well. And we looked at the XGBoost library. We also looked at techniques like data
normalization and cross-validation, and we also looked at hyperparameter
tuning and regularization in XGBoost. The course project is for you to build a real-world machine learning
model where you will perform data cleaning and feature engineering on a data set that you
download from an online source, such as Kaggle. Then you will perform training and you will
compare and tune multiple types of models. And finally, you will document and publish
your work online. Today, we looked at unsupervised learning and recommendations. We looked at clustering and dimensionality
reduction using scikit-learn. We also learned about collaborative filtering
and recommendations, and there are several other unsupervised learning algorithms available
in scikit-learn that you should also check out. You will also have an optional
assignment on gradient boosting, where you will train a LightGBM model from scratch,
make predictions, evaluate results, tune hyperparameters and regularize the model. This assignment will be live shortly. So you can earn a verified certificate of
accomplishment for free by completing all the weekly assignments and the course project. And the certificate can be added to
your LinkedIn profile, linked from your resume, or even downloaded as a PDF. So, where do you go from here? Well, there are four good sources for learning
more about machine learning, and machine learning is all about building projects, building models,
and experimenting with different kinds of machine learning techniques and hyperparameters. So I recommend checking out Kaggle notebooks,
datasets, competitions, and discussions. Just pick any popular data set on Kaggle
and check out the Code tab, or pick any competition,
check out the Code tab and read
through some of the notebooks. Here you will find data science experts from
around the world sharing the best practices that they use in their day-to-day work. Another course I recommend checking out
is Machine Learning by Andrew Ng on Coursera. This is a great course that helps build some of the theoretical and
mathematical foundations of machine learning. So it's a great complement to this course,
which is far more applied and practical. You should also check out the book Hands-On
Machine Learning by Aurélien Géron.
It's a great book on machine learning
using scikit-learn and deep learning using the TensorFlow framework. And finally, we have a course
called Deep Learning with PyTorch: Zero to GANs. This is an online course on Jovian,
so you can check it out on jovian.ai. And the most important thing is to keep
training models and keep building great projects. So what should you do next? Review the
lecture videos and execute the Jupyter notebook, complete the lecture exercises,
start working on the assignment, and discuss on the forum and on the Discord. So with that, I'll see you in
the forums, and you can find us on zerotogbms.com or on Twitter at @JovianML. This was lesson six of the course
Machine Learning with Python: Zero to GBMs. The topic was unsupervised learning and recommendations. I hope to see you again in a new course, and there
are several other courses available on Jovian. Just go to jovian.ai and
check out the courses that we have. So we have a course on data analysis with Python: Zero to Pandas.
We have a course on deep learning
with PyTorch, we have a course on data structures and algorithms, and
we have a course on machine learning. And if you're interested in making a career
transition to data science, then you should definitely consider checking out the Zero to
Data Science Bootcamp by Jovian, a 20-week live certification program designed to help you
learn industry-standard tools and techniques for data science, build real-world projects
and start your career as a data scientist. We have limited seats for this program. With that, we've reached the end of this course. So thank you and have a good day or good night. I'll see you again soon.