StatQuest: Hierarchical Clustering

Video Statistics and Information

Captions
[Music] Going on a quest, on a StatQuest. StatQuest! Hello, and welcome to StatQuest. Today we're going to be talking about hierarchical clustering.

Hierarchical clustering is often associated with heat maps. If you're not already familiar with what heat maps are, just know that the columns typically represent different samples and that the rows typically represent measurements from different genes. Red typically signifies high expression of a gene, and blue or purple means lower expression for a gene. Hierarchical clustering orders the rows and/or the columns based on similarity. This makes it easy to see correlations in the data. For example, these samples express the same genes, and these genes behave the same. On the left we have a heat map without hierarchical clustering, and on the right we have a heat map with hierarchical clustering, so you can see that the clustering makes a big difference in how the data is presented. Heat maps often come with dendrograms, so we'll talk about those too. Let's get started.

We'll start with a simple example. Here we've got a simple heat map that has three samples and four genes. For this example we are just going to cluster, or reorder, the rows, or the genes. Conceptually, the first step is to figure out which gene is most similar to gene number 1. Genes number 1 and 2 are different. We can tell because the colors are very different. Gene 1 is highly expressed in sample number 1, so it has a red color. Gene 2, however, is not highly expressed in sample number 1, so it has a blue color. In sample number 3, gene 1 is lowly expressed, so it's blue, and gene 2 is highly expressed, so it's red. Genes 1 and 3 are similar: in sample 1 both genes 1 and 3 are red, meaning they're highly expressed, and in sample 3 they're both blue, meaning they're lowly expressed. Genes 1 and 4 are also similar; however, gene number 1 is most similar to gene number 3. So the second step is to figure out what gene is most similar to gene number 2.
So we do all the comparisons, and we see that gene number 2 is most similar to gene number 4. And then we do the same thing for gene number 3 and then gene number 4. In step number 3, we look at the different combinations and figure out which two genes are the most similar. Once we've done that, we merge them into a cluster. In this case, genes number 1 and 3 are more similar than any other combination of genes, so genes 1 and 3 are now cluster number 1. Step 4: go back to step 1, but now treat the new cluster like it's a single gene.

So in step 1 we figure out which gene is most similar to cluster number 1. Cluster number 1 is most similar to gene number 4. And we figure out which gene is most similar to gene number 2. In this case, gene number 2 is most similar to gene number 4, but notice that we compared gene number 2 to cluster number 1. And then we do the same thing for gene number 4. Of the different combinations, figure out which two genes are the most similar. Now merge them into a cluster. In this case, genes 2 and 4 are the most similar combination, so we've merged them into a cluster. Now we go back to step 1; however, since all we have left are two clusters, we merge them. Bam! We're all done.

Hierarchical clustering is usually accompanied by a dendrogram. It indicates both the similarity and the order in which the clusters were formed. Cluster number 1 was formed first and is most similar; it has the shortest branch. Cluster number 2 was second and is the second most similar; it has the second shortest branch. Cluster number 3, which contains all of the genes, was formed last; it has the longest branch.

Now let's go over a few nitpicky details. Remember the first step, figure out which gene is most similar to gene number 1? Well, we have to define what "most similar" means. The method for determining similarity is arbitrarily chosen; however, the Euclidean distance between genes is used a lot. Let's look at an example. We'll use a very simple heat map that just has two samples and two genes. Now we're displaying the values that underlie the colors that we have in the heat map. The Euclidean distance between genes 1 and 2 is just the square root of the difference in sample number 1 squared plus the difference in sample number 2 squared. Here we'll just plug in the values for sample number 1: we have 1.6 minus -0.5. Now let's plug in the values to calculate the difference in sample number 2: we have 0.5 minus -1.9. Doing the subtraction gives us the square root of 2.1 squared plus 2.4 squared. We can think of these values within the parentheses as sides of a triangle. So on the x-axis we have the distance between gene 1 and gene 2 in sample number 1, and on the y-axis we have the distance between genes 1 and 2 in sample number 2. The hypotenuse is the distance between genes 1 and 2. The Pythagorean theorem says that the hypotenuse equals the square root of x squared plus y squared. In this case that means the square root of 2.1 squared plus 2.4 squared, and that gives us 3.2, the distance between gene number 1 and gene number 2. When we have more samples, we just extend the equation. It's no big deal.

The Euclidean distance is just one method. There are lots more, including the Manhattan distance. The Manhattan distance is just the sum of the absolute values of the differences. So instead of squaring the differences and then taking the square root, all we do is take the absolute value of the differences. We can think of the Manhattan distance in geometric terms by imagining that each difference is a line segment. If we take all those line segments and put them together head to tail, head to tail, and then add the total length of all those line segments together, that's the Manhattan distance. Yes, it makes a difference. Here's a heat map drawn using the Euclidean distance, and here's the same information drawn as a heat map, but now we're using the Manhattan distance. The heat maps are very similar, but there are also a few differences. The choice in distance metric is arbitrary.
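As a rough illustration of the two distances just described, here is a minimal Python sketch that redoes the two-gene, two-sample arithmetic. The gene 1 values (1.6 and 0.5) come from the captions; the gene 2 values (-0.5 and -1.9) are inferred from the quoted differences of 2.1 and 2.4, so treat them as an assumption. The function names are made up for this example.

```python
import math

# Expression values for the two-gene, two-sample example.
# Gene 2's values are inferred from the differences 2.1 and 2.4 (assumption).
gene_1 = [1.6, 0.5]     # sample 1, sample 2
gene_2 = [-0.5, -1.9]   # sample 1, sample 2


def euclidean_distance(a, b):
    """Square root of the sum of squared per-sample differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of the absolute per-sample differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


euclid = euclidean_distance(gene_1, gene_2)
manhattan = manhattan_distance(gene_1, gene_2)

print(f"Euclidean: {euclid:.1f}")     # Euclidean: 3.2
print(f"Manhattan: {manhattan:.1f}")  # Manhattan: 4.5
```

Both functions accept any number of samples, which is the "just extend the equation" case: each extra sample adds one more term to the sum.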
There is no biological or physical reason to choose one and not the other. Pick the one that gives you more insight into your data.

Now, do you remember how we merged genes 1 and 3 into cluster number 1 and compared it to other genes? Well, there are different ways to compare clusters too. One simple idea is to use the average of the measurements from each sample, but there are lots more, and these have an effect on clustering as well. So let's talk about the different ways to compare clusters. For the sake of visualizing how the different methods work, imagine our data was spread out on an XY plane. Now imagine that we have already formed these two clusters and we just want to figure out which cluster this last point belongs to. We can compare that point to the average of each cluster; this is called the centroid. We can compare it to the closest point in each cluster; this is called single linkage. Or we can compare it to the furthest point in each cluster; this is called complete linkage. And there are other methods as well. Here's a heat map that compares the furthest points in the clusters. By the way, if you use R, this is the default setting for the hclust function. This heat map compares the average points in the clusters, and this last heat map compares the closest points in the clusters. These heat maps are all very similar, but there are also differences in the way the data is presented.

In summary: clusters are formed based on some notion of similarity. You have to decide what that is; however, most programs have reasonable defaults. Once you have a subcluster, you have to decide how it should be compared to other rows, columns, or subclusters, etc., and most programs have good default settings for this as well. And the height of the branches in the dendrogram shows you what is most similar.

Hooray! We've made it to the end of another exciting StatQuest. If you liked this presentation, please subscribe to my channel and you'll get more like it. Also, if you'd like me to do something specific, feel free to mention it in the comments below.
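The merge procedure and the three ways of comparing clusters described in the captions (centroid, single linkage, complete linkage) can be sketched in plain Python. This is a toy illustration, not the method used to draw the heat maps in the video: the expression values are made up so that genes 1 and 3 merge first and genes 2 and 4 second, all function names are hypothetical, and the loop searches for the closest pair directly instead of first listing each gene's nearest neighbor, which gives the same merges. Real analyses would normally rely on an existing implementation such as the hclust function in R.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two rows of measurements."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(rows, cluster):
    """Per-sample average of all rows in a cluster."""
    members = [rows[name] for name in cluster]
    return [sum(values) / len(values) for values in zip(*members)]


def cluster_distance(rows, c1, c2, linkage):
    """Distance between two clusters (tuples of row names)."""
    if linkage == "centroid":    # compare the cluster averages
        return euclidean(centroid(rows, c1), centroid(rows, c2))
    pairwise = [euclidean(rows[i], rows[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of points
        return min(pairwise)
    if linkage == "complete":    # furthest pair of points
        return max(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def hierarchical_clustering(rows, linkage="complete"):
    """Return the merges, in order, as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in rows]
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair among all current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(rows, clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Treat the merged pair as a single cluster from now on.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up expression values: four genes (rows) by three samples (columns).
genes = {
    "gene1": [2.0, 0.0, -2.0],
    "gene2": [-2.0, 0.2, 2.0],
    "gene3": [1.9, 0.1, -1.8],
    "gene4": [-0.5, 0.0, 1.0],
}

for linkage in ("centroid", "single", "complete"):
    print(linkage)
    for a, b, d in hierarchical_clustering(genes, linkage):
        print(f"  merge {a} + {b} at distance {d:.2f}")
```

The distance recorded at each merge plays the role of the branch height in a dendrogram: the first merge has the smallest distance and the last merge the largest. On this toy data all three comparisons produce the same merge order and differ only in the distance of the final merge.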
Info
Channel: StatQuest with Josh Starmer
Views: 424,361
Keywords: Joshua Starmer, StatQuest, Hierarchical Clustering, Machine Learning, heatmap
Id: 7xHsRkOdVwo
Length: 11min 19sec (679 seconds)
Published: Tue Jun 20 2017