Dimensionality reduction with tidymodels for the Billboard Top 100

Captions
Hi, my name is Julia Silge, and today in this screencast we're going to use this week's Tidy Tuesday dataset on songs from the Billboard Top 100 to explore how to implement dimensionality reduction using tidymodels. In tidymodels we use recipes, our concept for feature engineering and data preprocessing, to approach dimensionality reduction, because it's often something you do as part of feature engineering to get ready to train a model. This screencast digs deep into recipes: how to use them, how to think about what a recipe is, how it uses training data, and how you then apply it to new data. Often when I've shown recipes, I've used them within the context of a workflow so we don't have to think about some of these details, but it can be good to know how this works, especially if you want to be able to troubleshoot or debug your recipe, or use it outside of workflows. So that's what we're going to do today. Let's get started!

All right, let's talk about dimensionality reduction. I just pasted in code to read this data. This week's Tidy Tuesday has two datasets. The first one has data from the Billboard charts: for every week, going from the 1960s all the way up to recent years, it has the song, the performer, and the song ID (which is those two squashed together), plus what position the song held on the chart, how high it got at its peak, and how long it was on the chart in total. The other dataset is a supporting dataset: it also has the song ID, which we can use to match the two up, along with information that Spotify makes available through its API, things like what genre Spotify puts the song in, audio features such as how danceable it is, how loud it is compared to other songs, how "speechy" it is versus musical, and how acoustic it is, and then a measure of popularity at the end.

What I want to do is combine these two datasets and then use dimensionality reduction in a couple of different ways to understand how these quantities are related. Let's start with the Billboard data: I'm going to group by that song ID and then find the maximum of weeks_on_chart. This tells us, for all of the songs that appeared on the Billboard charts, the longest time each was ever on the chart; sometimes songs go on the chart and then off again. Some songs are on the chart for a long time, like this Jay-Z/Beyoncé song, and some are on for a short time, so we've got songs with staying power and songs that just flit on and off. Now let's take the audio feature data and join it up. Unfortunately, we don't have audio features for all of the songs, and there might be some systematic differences between the songs we do and don't have, so that's something to keep in mind. Let's only keep songs that we have these audio features for, and then inner join with the information about how much staying power each song has. Let's call the result billboard_joined.
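Reconstructing from the narration, the data-loading code probably looks roughly like this sketch; the Tidy Tuesday file URLs follow the usual repository layout, and using spotify_track_popularity as the "songs we have data for" filter is an assumption:

```r
library(tidyverse)

# Tidy Tuesday 2021-09-14: Billboard Top 100 (URLs assumed from the repo layout)
billboard <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv")
audio_features <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv")

# For each song, the longest it was ever on the chart (its "staying power")
max_weeks <- billboard %>%
  group_by(song_id) %>%
  summarise(weeks_on_chart = max(weeks_on_chart), .groups = "drop")

# Keep songs that have audio features, then join in the staying-power measure
billboard_joined <- audio_features %>%
  filter(!is.na(spotify_track_popularity)) %>%  # assumed filter for "songs with data"
  inner_join(max_weeks, by = "song_id")
```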
That's what we're going to use. Let's do a little bit of exploratory data analysis, and then go ahead and talk about dimensionality reduction. I actually was really serious about music for a long time, so, for example, I'm kind of interested in that time_signature measurement: what its distribution is and how it's related to tempo. Let's make a histogram where we can put these next to each other. I think a time signature of one means "we don't know what it is," so let's filter to time_signature greater than one; three means 3/4, four means 4/4, and five means 5/4 (I looked this up in the documentation). This is a dataset of pop songs, and 4/4 is just way more popular than other time signatures. I have such a soft spot for 5/4 songs, but boy, they are not very popular, are they? The 3/4 songs stretch out along the tempo axis, but again, they are much less popular.

So that's the kind of information we have about each song to use in dimensionality reduction. Dimensionality reduction typically works because some of the variables are correlated with each other, so let's make a correlation plot and see what we have. Let's remind ourselves what's in here: start at the first numeric audio feature, go all the way through to weeks_on_chart (the column we joined in from the Billboard data), throw away rows with missing data, and then use correlate() from the corrr package. Now we've computed all these correlations, and a lot of them look pretty low. If we rearrange() the result, the highest one turns out to be energy and loudness: songs with high energy are louder. If we make a network plot (I don't love the default colors, so let's put in orange, white, and my favorite color, midnight blue), we can look at the structure. Energy and loudness are correlated, energy and acousticness are anti-correlated, and acousticness and loudness are anti-correlated. We also see that weeks_on_chart, that staying-power measure from the Billboard charts, is somewhat correlated with spotify_track_popularity. They're not measuring exactly the same thing, but they are related; they're both measures of popularity in different ways. Track popularity is not really an audio feature per se, but rather a different measure of popularity that we get out of Spotify. I'm going to leave it in so we can see how it behaves as we go through different kinds of dimensionality reduction approaches.

Notice that a lot of these variables (speechiness, mode, key, instrumentalness) are not very correlated at all with the whole cluster of variables on the other side; they're off by themselves. So we have a lot of items that are not very correlated, and dimensionality reduction usually tries to take advantage of correlated variables to build the new lower dimensions, so we'll have to keep that in mind as we move forward.
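The two exploratory plots, roughly as described in the narration (the exact histogram styling is an assumption):

```r
# Tempo histograms for each (known) time signature
billboard_joined %>%
  filter(time_signature > 1) %>%  # 1 appears to mean "unknown"
  ggplot(aes(tempo, fill = factor(time_signature))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs(fill = "time signature")

# Correlation network of the numeric features
library(corrr)

billboard_joined %>%
  select(danceability:weeks_on_chart) %>%  # first audio feature through the joined column
  na.omit() %>%
  correlate() %>%
  rearrange() %>%
  network_plot(colors = c("orange", "white", "midnightblue"))
```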
Speaking of which, there's one more thing I'd like to do before we go on, and that's about weeks_on_chart itself. I could make a visualization, but a summary is enough: the mean and the median are down around 10, but the maximum is way higher. There are a lot of songs that are on the chart for a short amount of time and only a few that stay on for a really long time, so let's take the log of that variable as we move on.

OK, let's work through some dimensionality reduction examples. Max Kuhn and I have a new chapter in our book, Tidy Modeling with R, on dimensionality reduction, and this screencast walks through a shorter, more limited version of some of the topics in that chapter, so if you find this interesting and want more detail, I recommend taking a look at that chapter. Let's do some of the same things here to get to a good dataset for building our new lower-dimensional representations. I'll load the tidymodels metapackage, and we're going to split the data into testing and training sets (it would help if I could spell initial_split) with stratified sampling, because it doesn't really ever hurt. So I'm building my dataset and piping it into initial_split(), which reports the total number of songs plus how many land in training and in testing.

The tidymodels approach to feature engineering and data preprocessing really forces us to talk about data splits, because when we make a recipe (the core concept for feature engineering and preprocessing in tidymodels), we start out by declaring what variables and what data we're using, and here I use the training data: here are the variables, here's the outcome, everything else is a predictor, and here's the training data. Let's make a little starter recipe. First, I want to make sure there are no variables that don't change at all; that's step_zv(), which removes any zero-variance variables that are the same the whole way through. Then let's normalize all the numeric predictors: step_normalize() centers and scales the data, meaning it finds the mean and the standard deviation, then subtracts the mean and divides by the standard deviation, so you end up with centered and scaled data. Let's call this our starter recipe.

At this point we've defined the recipe, but we have not yet estimated those values; for the normalize step in particular, we haven't found the means and standard deviations yet. The next step is to prep() the recipe. Once we prep it, you can think of the recipe as trained: it no longer says all_numeric_predictors(), because it has gone in, figured out exactly which variables are there, and computed a mean and a standard deviation for each of them.
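Here's a sketch of that setup; the seed, the column range, and taking the log before the split are assumptions reconstructed from the narration:

```r
library(tidymodels)

billboard_df <- billboard_joined %>%
  select(danceability:weeks_on_chart) %>%
  na.omit() %>%
  mutate(weeks_on_chart = log(weeks_on_chart))  # tame that long right tail

set.seed(123)
billboard_split <- initial_split(billboard_df, strata = weeks_on_chart)
billboard_train <- training(billboard_split)
billboard_test <- testing(billboard_split)

# Starter recipe: declare the outcome and predictors using the training data,
# drop any zero-variance variables, then center and scale numeric predictors
billboard_rec <- recipe(weeks_on_chart ~ ., data = billboard_train) %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations from the training data
rec_trained <- prep(billboard_rec)
```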
Think of prep() like fit() for a model: you're going in and working out everything you need in order to apply this set of preprocessing rules to new data. The way you apply a trained recipe to new data is the verb we call bake(), so think of bake() like predict() for a model. For example, I can now bake the trained recipe on new data, like our test set (oops, I just misspelled it; let me try that again with the trained recipe). What I'm doing here is applying the means and standard deviations from the training data to the testing data. The reason we want to do it this way is to avoid data leakage, or information leakage. In machine learning, if we want to train effective models, we must only use information from the training data during the modeling process, and that includes feature engineering and data preprocessing. We do not want to use any information from the testing data when processing the testing data: none of those values should go into computing means, standard deviations, or anything else. That is why recipes work the way they do.

OK, we have our base recipe; now let's make a little helper function, a lot like the one in that book chapter. Let's load the ggforce package, and I'll make a function called plot_test_results that takes a recipe as one argument and some data as another; I'm going to use the testing data. This function takes a recipe, preps it using the training data, bakes it on the testing data, and then makes a visualization. The ggforce package is cool: it has geom_autopoint() and geom_autodensity(), which let us build a big scatterplot matrix. We'll map color to weeks_on_chart (the example in the book chapter is a classification problem; here it's more like a regression problem, where weeks_on_chart is numeric), make the points transparent and small (even though the testing set is the smaller one, there are still a lot of points), use geom_autodensity() for what goes on the diagonal, and then facet with facet_matrix() by everything except weeks_on_chart, setting layer.diag so we get that nice matrix. And remember, our color variable is actually the log of weeks now, not the raw weeks. So that's our little helper function; it's going to help us move through several examples fairly quickly.
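A sketch of the bake step and the helper function; the "BuPu" palette (which actually gets added a bit later in the screencast) is an assumption:

```r
# bake() applies the training-set statistics to new data, here the test set
bake(rec_trained, new_data = billboard_test)

# Helper: prep a recipe (on training data), bake it on the test set, and draw
# a scatterplot matrix of the new dimensions with ggforce
library(ggforce)

plot_test_results <- function(recipe, dat = billboard_test) {
  recipe %>%
    prep() %>%
    bake(new_data = dat) %>%
    ggplot() +
    geom_autopoint(aes(color = weeks_on_chart), alpha = 0.4, size = 0.5) +
    geom_autodensity(alpha = 0.3) +                      # densities on the diagonal
    facet_matrix(vars(-weeks_on_chart), layer.diag = 2) +
    scale_color_distiller(palette = "BuPu", direction = 1)
}
```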
The first example is PCA. What is PCA? It's maybe the most common, most straightforward dimensionality reduction approach. PCA is linear, it is unsupervised, and it tries to account for variance: it builds new dimensions so that the first principal component accounts for the most variance, the second for the second most, and so on. So we can take our base recipe, add a new step, step_pca(), again on all numeric predictors with the number of components equal to four, and pipe it into plot_test_results() with a title of "PCA". Ah, I did something not quite right there... a classic blunder; let me fix that. All right, that looks pretty good. I think we'll be able to see the colors a little better if I change from the default to a Brewer palette; except scale_color_brewer() is for qualitative scales, so it's scale_color_distiller() we want, and we need to make the direction go the normal way, because it gets switched internally. This palette is one of my favorites; I think it's really pretty and will help us see the pattern better.

I'll make the plot a little bigger so it's more on the screen. Here we can see a relationship: it's not dramatic, but there are more of the long-lasting songs, the ones that were on the chart longer, up in certain regions of these panels, and fewer in the other direction. To see why this happens, we can get the information out by tidying the recipe. You can tidy a recipe before or after it is prepped, and you can tidy the whole recipe or a single step (whoops, I've got to prep it first). Now I get how much each term contributes to each component. Let's do just a little bit of munging: keep only components PC1 through PC4, group by component, slice_max() on the absolute value with n = 5, and ungroup, so we're taking the top five terms by absolute value for each component. Then let's pipe this into a plot: absolute value on the x-axis, terms on the y-axis, fill by whether the value went in the positive or negative direction, make a bar chart, and facet_wrap() by component with the y-axis free (or maybe everything free). Let's fix the labels too: the x-axis is the contribution to the principal component, and the fill tells us whether the contribution is positive or negative, because terms can pull in opposite directions. You can see that in both PC1 and PC2, track popularity is in the top five, and that's how we're getting that gentle gradient in weeks_on_chart.
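The PCA step and the contribution plot, roughly as typed (step_pca() being the recipe's third step is an assumption that holds for the recipe as sketched above):

```r
billboard_rec %>%
  step_pca(all_numeric_predictors(), num_comp = 4) %>%
  plot_test_results() +
  ggtitle("Principal Component Analysis")

# Which variables contribute to each component? tidy() the prepped PCA step
pca_rec <- billboard_rec %>%
  step_pca(all_numeric_predictors(), num_comp = 4) %>%
  prep()

tidy(pca_rec, number = 3) %>%                 # step_pca is this recipe's third step
  filter(component %in% paste0("PC", 1:4)) %>%
  group_by(component) %>%
  slice_max(abs(value), n = 5) %>%            # top five terms by absolute loading
  ungroup() %>%
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  facet_wrap(vars(component), scales = "free_y") +
  labs(x = "Contribution to principal component", y = NULL, fill = "Positive?")
```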
Remember, PCA is unsupervised: it did not know about weeks_on_chart, although it did know about popularity. PC1 is about danceable, high-energy, loud, popular songs that are not acoustic; PC2 is about low-energy, not-so-danceable, popular songs. That's pretty interesting. So we are able to see that weeks_on_chart is related to these unsupervised lower dimensions that we found, and we could use these as the new predictors to train a model.

But let's try something else: partial least squares. What is PLS? It is very similar to PCA, except it is supervised. Instead of giving the math the freedom to find whatever new components it wants, it finds components that are related to the outcome, so we have to actually tell it the outcome, in this case weeks_on_chart. Let's see what it looks like... yes, there it goes. This is interesting, because we get less blobby, more defined shapes, and the gradient is stronger (I hope that comes through in the video). The reason is that we forced it: by using PLS instead of PCA, we are using information from the outcome, so the components are supervised. Let's copy the tidying code down here and see what contributes to these components; the component names will now be PLS rather than PC. These components are different: PLS1 is very strongly about popularity, and less so about valence and acousticness, while PLS2 is about popularity, speechiness, loudness, energy, and so on. Popularity is much more important in these components because we used the supervised approach; we basically forced it.

So those are two linear approaches that are very similar to each other, one unsupervised and one supervised. Let's end by talking about something really different: UMAP. UMAP is not linear; it's very powerful, and it's based on nearest neighbors plus graph networks. You start in the high-dimensional space and find nearest neighbors there, then you build a graph network on the things that are close to each other, and then you create your new dimensions based on that network. We have UMAP for tidymodels in an add-on package. UMAP can actually be unsupervised or supervised; for this version, let's do the unsupervised one. This may take a minute to run, because it's a more complex algorithm.
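Sketches of the last two recipes; step_pls() lives in recipes (it needs the mixOmics package installed), step_umap() comes from the embed add-on package, and carrying over num_comp = 4 is an assumption:

```r
# Partial least squares: like PCA, but supervised; the step must know the outcome
billboard_rec %>%
  step_pls(all_numeric_predictors(), outcome = "weeks_on_chart", num_comp = 4) %>%
  plot_test_results() +
  ggtitle("Partial Least Squares")

# UMAP: nonlinear, built on nearest neighbors plus a graph network; unsupervised here
library(embed)

billboard_rec %>%
  step_umap(all_numeric_predictors(), num_comp = 4) %>%
  plot_test_results() +
  ggtitle("UMAP")
```

The same tidy()/slice_max() pattern from the PCA block should work for the PLS step's loadings, with components named PLS1 through PLS4 instead of PC1 through PC4.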
What UMAP is known for, and hopefully we'll see it here... yes, there it is: exactly this kind of result, all these little blobs, all these little shapes grouped together. And looking at it, I no longer believe I really see any relationship with the outcome. This happens with classification problems too: you'll get these little clusters, but the clusters will contain multiple classes, or classes will be spread across many clusters. You're basically guaranteed to get this kind of structure, but it's not always the case that the structure is connected to the problem you're working on, and the result is pretty sensitive to the hyperparameters of the algorithm. So UMAP is very powerful and can give you really cool results, but it also has some significant limitations. There have been some hot takes lately about whether UMAP is bad or good; it is certainly very powerful and interesting, and a good thing to have in your toolkit, but it's also great to understand how it works, at least at some level, so you can understand where it's appropriate and what some of its limitations are.

All right, so today we talked about recipes, and we talked about prep() and bake(). The reason recipes work this way is that part of the design of feature engineering in tidymodels is avoiding data leakage, avoiding information leakage, during the training of machine learning models. Remember: prep() is to a recipe as fit() is to a model, and bake() is to a recipe as predict() is to a model. If you're using workflows, they take care of all this under the hood and do the right thing, but remember that a recipe is something you estimate from training data and then apply to testing data. You can do this with something simple, like centering and scaling your data, or with something very powerful and complex, like dimensionality reduction, for example with UMAP. I hope this was helpful, and I'll see you next time!
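For reference, the workflow version she alludes to, where prep() and bake() happen automatically and correctly under the hood, might look like this hypothetical sketch pairing the PCA recipe with a plain linear model:

```r
# A workflow bundles an (untrained) recipe with a model specification
pca_wf <- workflow() %>%
  add_recipe(
    billboard_rec %>% step_pca(all_numeric_predictors(), num_comp = 4)
  ) %>%
  add_model(linear_reg())

pca_fit <- fit(pca_wf, data = billboard_train)  # preps the recipe, then fits the model
predict(pca_fit, new_data = billboard_test)     # bakes the recipe, then predicts
```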
Info
Channel: Julia Silge
Views: 2,461
Rating: 5 out of 5
Id: kE7H1oQ2rY4
Length: 34min 37sec (2077 seconds)
Published: Wed Sep 15 2021