Data Analysis 5: Data Reduction - Computerphile

Video Statistics and Information

Captions
Let's imagine that you work for a major streaming media provider, right? So you have, I don't know, some 100 million users, and you've got, I don't know, ten thousand videos on your site, or many more audio files. So for each user you're going to have collected information on what they've watched, when they've watched it, how long they watched it for, whether they went from this one to that one, did that work, was that good for them? And so maybe you've got 30,000 data points per user. We're now talking about trillions of data points, and your job is to try and predict what someone wants to watch or listen to next. Best of luck.

So we've cleaned the data, we've transformed our data, everything's on the same scale, we've joined data sets together. The problem is that because we've joined data sets together, perhaps our data set has got quite large, or maybe we just work for a company that has a lot and a lot of data. Certainly the general consensus these days is to collect as much data as you can, and this isn't always a good idea. What we want, remember, is the smallest, most compact and useful data set we can get; otherwise you're just going to be wasting CPU hours or GPU hours training on it, wasting time. We want to get to the knowledge as quickly as possible, and if you can do that with a small amount of data, that's going to be great.

So we've got quite an interesting data set to look at today, based on music. It's quite common these days, when you're building something like a streaming service, for example Spotify, to want a recommender system. This is the idea where you've maybe clustered people who are similar in their tastes: you know what kind of music they're listening to, and you know the attributes of that music, and if you know that, you can say, well, this person likes high-tempo music, so maybe they'd like this track as well. And this is how playlists are generated. One of the problems is that you're going to have to produce descriptions of the audio, on things like tempo and how upbeat it is, in order to machine learn on this kind of system, and that's what this data set is about.

So we've collected a data set here today that is lots and lots of metadata on music tracks. These are freely available tracks and freely available data, and we'll put a link in the description if you want to have a look at it yourself. I've cleaned it up a bit already, because obviously I've been through the process of cleaning and transforming my data. So we're going to load this now; it takes quite a long time to do, because there are quite a lot of attributes and quite a lot of instances. It's loaded, right? How much data is this? Well, we've got 13,500 observations, that's instances, and we've got 762 attributes. Another way of putting this, in machine learning parlance, is that we've got thirteen thousand instances and 760-odd features. These features are a combination of things, so let's have a quick look at the columns so we can see what this data set is about: names of music.all. Right, so we've got some 760 features or attributes, and you can see there's a lot of slightly meaningless text here, but if we look at the top you'll see some things that may be familiar to us. We've got the track ID, the album ID, the genre, right?
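In R, the load-and-inspect step looks roughly like this. The file name is a placeholder of mine, but music.all is the data frame name used in the video:

    # A minimal sketch of loading the cleaned metadata; the CSV name is an
    # assumption, music.all is the name used throughout the video.
    music.all <- read.csv("music_metadata.csv", stringsAsFactors = FALSE)

    dim(music.all)          # about 13,500 observations by 762 attributes
    head(names(music.all))  # track ID, album ID, genre, then the audio features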
So genre is an interesting one, because maybe we can start to use some of these audio descriptions to predict what genre a piece of music is, or something like that. Then there are things like the track number and the track duration, and then we get on to the actual audio description features. Now, these have been generated by two different libraries. The first is called librosa, which is a publicly available library for taking an MP3 and calculating musical attributes of it. What we're trying to do here is represent our data in terms of attributes; an MP3 file is not an attribute, it's a lot of data. So can we summarise it in some way? Can we calculate, by looking at the MP3, what the tempo is, what the amplitude is, how loud the track is, these kinds of things? That's the kind of thing we're measuring, and a lot of these go into a lot of detail, down at kind of a waveform level.

So we have the librosa features first, and then if we scroll down, after a while we get to some Echo Nest features. Echo Nest is a company that produces very interesting features on music, and actually these are the features that power Spotify's recommender system, and numerous others. We've got things like acousticness: how acoustic does it sound? We've got instrumentalness, and, I'm not convinced it's a word, speechiness: to what extent is it speech or not? And then things like tempo, how fast is it, and valence, how happy does it sound? A track at zero would be quite sad, I guess, and a track at one would be really happy and upbeat. And then of course we've got a load of features I've labelled temporal here, and these are going to be based on the actual music data themselves.

Often when we talk about data reduction, we're actually talking about dimensionality reduction. One way of thinking about it: so far we've been looking at things like attributes, and asking what the mean or the standard deviation of some attribute in our data is, but when we start to talk about clustering and machine learning we're going to talk a little bit more about dimensions. In many ways the number of attributes is the number of dimensions; it's just another term for the same thing, but certainly people from a machine learning background refer to these things as dimensions. So imagine you've got some data: your instances down here, and your attributes across here. In the case of our music data we've got each song, so this is song one, this is song two, song three, and then all the attributes: the Echo Nest attributes, the tempo, and things like this. These are all dimensions in which the data can vary, so two tracks can differ in the first dimension, which is the track ID, but they can also differ, down here, in this dimension, which is, say, tempo.
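As a small illustration of that view in R (the readable descriptor names below are my assumption; the real columns have less friendly labels):

    # Each track is one row, i.e. one point in a space with ncol(music.all)
    # dimensions. The descriptor names are assumptions about the column labels.
    ncol(music.all)   # the number of dimensions the data can vary in

    echonest.cols <- c("acousticness", "instrumentalness", "speechiness",
                       "tempo", "valence")
    music.all[1, intersect(echonest.cols, names(music.all))]   # one track's values on a few of them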
When we say some data is 700-dimensional, what that actually means is that it has seven hundred different ways, or different attributes, in which it can vary, and you can imagine that, first of all, this is going to get quite big quite quickly. Seven hundred attributes seems like a lot to me, and depending on what algorithm you're running it can get quite slow on this kind of size of data, and this is a relatively small data set compared to what Spotify might deal with on a daily basis. Another way to think about this data is as points in a space. We have some 700 different attributes that can vary, and when we take a specific track it sits somewhere in this space. If we were looking at it in just two dimensions, track one might be over here, track two over here and track three over here, and in three dimensions track four might be back at the back here. You can imagine that the more dimensions we add, the further spread out these things are going to get, but we can still do all the same things in 700 dimensions that we can in three; it just takes a little bit longer.

One of the problems is that some things, like machine learning, don't like to have too many dimensions. Things like linear regression can get quite slow if you have tens of thousands of attributes or dimensions. Remember that the default response of anyone collecting data is perhaps to just collect it all and worry about it later; this is the point where you have to worry about it. What we're trying to do is remove any redundant variables. If you've got two attributes of your music, like tempo and valence, that turn out to be exactly the same, why are we using both? We're just making our problem a little bit harder. Now, in actual fact, Echo Nest's features are pretty good; they don't tend to correlate that strongly. But you might find, where you've collected some data at a big scale, that a lot of the variables are actually very, very similar, and you can just remove some of them, or combine some of them together, and make your problem a little bit easier.

So let's look at this on the music data set and see what we can do. The first thing we can do is remove duplicates, right, which sounds like an obvious one, and perhaps one that we could also do during cleaning, but exactly when you do it doesn't really matter as long as you're paying attention. What we're going to say is music.all equals unique of music.all, and what that's going to do is look for any duplicate rows and remove them, so the number of rows we've got will drop by some amount. Let's see. Actually, this is quite a slow process: you've got to consider that we're going to look through every single row and try to find any other rows that match. Okay, so this has removed about 40 rows, which means we had some duplicate tracks. You can imagine that things might get accidentally added to the database twice, or maybe two tracks are actually identical because they were released multiple times, or something like this. What the unique function is doing is finding rows that are exactly the same for every single attribute, or every single dimension.
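That's the one-liner just described, with a row count before and after:

    # Remove exact duplicate rows: unique() keeps only rows that differ in at
    # least one attribute, so it compares every row against the others and can
    # be slow on a data frame this wide.
    nrow(music.all)                 # before
    music.all <- unique(music.all)
    nrow(music.all)                 # after: roughly 40 rows fewer in the video's run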
Of course, in practice you might find that you have two versions of the same track which differ by one second; they might have slightly different attributes, but hopefully they'll be very, very similar. So what we could also do is have a threshold where we say these are too similar, they're the same thing: the name is the same, the artist is the same and the audio descriptors are very, very similar, so maybe we should just remove one of them. That's another thing you could do.

Just for demonstration, what we're going to do is focus on just a few of the genres in this data set, to make things a little bit clearer for visualisations. We're going to select just the classical, jazz, pop and spoken-word genres, because these have a good distribution of different amounts in the data set. So we're going to run that: we create a list of genres, and we say music is music.all wherever the genre is in that list of genres we just produced, and that produces a much smaller data set of 1,600 observations with the same number of attributes or dimensions. Normally you would obviously keep most of your data in; this is just for a demonstration. But removing genres that aren't useful to you for your experiment is a perfectly reasonable way of reducing your data size if that's a problem, assuming they've been labelled right in the first place; that's someone else's job.

Let's imagine that 1,600 is still too many. Actually, computers are getting pretty quick, and maybe 1,600 observations is fine, but perhaps we want to remove some more. The first thing we could do is just chop the data off halfway and keep about half. So let's try that first: we're going to say music.first, that's the first few rows of our music, is rows 1 to 835 and all the columns. We run that, and that's even smaller, so we can start to whittle down our data. This is not necessarily a good idea, though. We're assuming here that our genres are, you know, randomly distributed around our data set, and that might not be true. You might have all the rock first and then all the pop, or something like that, so if you take the first few rows you're just going to get all the rock, which, depending on what you like, might not be for you. So let's plot the genres in the normal data set: you can see that we've got very little spoken word, but it is there, and we have some classical, international, jazz and pop in roughly the same amounts. If we plot after we've selected the first 50%, you can see we've lost two of the genres, we only have classical, international and jazz, and there's hardly any jazz. That's not a good idea, so don't do that unless you know that your data is randomised. This is not giving us a good representation of genres: if we wanted to predict genre, for example, based on the musical features, cutting out half the genres seems like an unwise decision.

So a better thing to do would be to sample randomly from the data set. What we're going to do is use the sample function to give us 835 random indices into this data, and then use those to index our music data frame instead; that's this line here. Hopefully this will give us a better distribution. If we plot the original again, it looks like this, and you can see we've got a broad distribution, and then if we plot the randomised version, you can see we've still got some spoken word (it's actually gone up slightly), but the distributions are broadly the same. So this has worked exactly how we want.
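Roughly what those steps look like in R (the genre column name and the exact genre labels are assumptions about the cleaned file; the logic follows the narration):

    # Keep only a few genres, then compare a naive first-half cut with a random sample.
    genres <- c("Classical", "Jazz", "Pop", "Spoken")
    music  <- music.all[music.all$genre %in% genres, ]   # roughly 1,600 rows

    # Naive cut: the first half of the rows (risky if the file is ordered by genre)
    music.first <- music[1:835, ]

    # Better: a random sample of the same size
    music.rand <- music[sample(nrow(music), 835), ]

    # Compare the genre distributions before and after sampling
    barplot(table(music$genre))
    barplot(table(music.rand$genre))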
So how you select your data, if you're trying to make it a little bit smaller, is very, very important and worth considering. Obviously we only had 1,600 rows here, and even the whole data set is only 13,500 rows, but you could imagine that you might have tens of millions of rows, and you've got to think about this before you start just getting rid of them. Randomised sampling is a perfectly good way of selecting your data. Obviously it has a risk: if the distributions of your genres are a little bit off, and maybe you haven't got very much of a certain genre, you can't guarantee that the distributions are going to be the same on the way out, and if you're trying to predict genre, that's going to be a problem. So perhaps the best approach is stratified sampling. This is where we try to maintain the distribution of our classes, in this case genre. So we could say we had 50% rock, 30% pop and 20% spoken word, and we want to maintain that kind of distribution on the way out, even if we only sample about 50% of the rows. This is a little bit more complicated in R, but it can be done (there's a sketch of one way just below), and it's a good approach if you want to make absolutely sure the distributions of your sampled data are the same as in your original data.
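A sketch of stratified sampling in base R. This isn't the video's code, just one way to do it under the same column-name assumption as above: draw the same fraction from every genre so the class proportions are preserved.

    # Stratified sampling: sample the same fraction from each genre separately.
    frac <- 0.5
    by.genre  <- split(seq_len(nrow(music)), music$genre)
    strat.idx <- unlist(lapply(by.genre, function(i) sample(i, round(length(i) * frac))))
    music.strat <- music[strat.idx, ]

    round(prop.table(table(music$genre)), 2)        # original class proportions
    round(prop.table(table(music.strat$genre)), 2)  # should be roughly the same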
We've just looked at some ways we can reduce the size of our data set in terms of the number of instances, or the number of rows. Can we make the number of dimensions, or the number of attributes, smaller? Because that's often one of the problems, and the answer is yes; there are lots of different ways we can do this, some more powerful and useful than others.

One of the ways is something called correlation analysis. A correlation between two attributes basically tells us that when one of them increases, the other one, in general, either increases or decreases in relation to it. So you might have some data like this: we've got attribute one, and we've got attribute two, and these are the data points for all of our records. Obviously we've got a lot of data points, and you can see that, roughly speaking, they kind of increase in this sort of direction here. Now it might be that this correlation is very, very strong, so basically attribute two is more or less a copy of attribute one. Maybe it doesn't make sense to have attribute two in our data set; maybe we can remove it without too much of a problem. So what we can do is correlation analysis, where we compare all of the attributes against all of the other attributes, we look for high correlations, and we decide ourselves whether to remove them. Sometimes it's useful just to keep everything in and not remove things too early, but on the other hand, if you've got a huge amount of data and your correlations are very high, this could be one way of doing it.

Another option is something called forward or backward attribute selection. This is the idea that maybe we have a machine learning model or clustering algorithm in mind; we can measure its performance, and then we can remove features and see if the performance remains the same, because if it does, maybe we didn't need those features. So we could train our model on, let's say, a 720-dimensional data set, get a certain level of accuracy and record that, then try again with one of the dimensions removed, on 719, and maybe the accuracy is exactly the same, in which case we can say, well, we didn't really need that dimension at all, and we can start to whittle down our data set this way. Forwards attribute selection is the other way round: we literally train our machine learning on just one of the attributes, see what our accuracy is, and keep adding attributes in and retraining until our performance plateaus and we can say, you know what, we're not gaining anything now by adding more attributes. Obviously there's the question of which order you add or remove attributes in; usually it's done randomly. So for backwards attribute selection you would train on all the data, take one attribute out at random; if your performance stays the same you leave it out, and if your performance gets much worse you put it back in, don't try that one again, and try a different one, and you slowly start to take dimensions away and hopefully whittle down your data.

Let's have a quick look at correlation analysis on this data set. You might imagine that if we're calculating features based on the MP3, from librosa or Echo Nest, maybe a lot of them are quite similar a lot of the time and we can remove some. So we're just going to focus on one set of librosa features, just for simplicity: we're going to select only the attributes that contain this chroma kurtosis field, which is one of the attributes you can calculate using librosa. I'm going to run that, we're going to rename them, just for simplicity, to kurt1, kurt2, kurt3 and so on, and then we're going to calculate a correlation matrix of each of these features versus each other, like this. Finally, we're going to plot it and see what it looks like; hopefully we can find some good correlations, and some candidates for removing a few of these dimensions if they're redundant. So you can see that we've got, for example, kurt7 here: index 7 is fairly similar to 8, that's a correlation of 0.65, so maybe that means we could remove one or two of those. This one here is 0.59, and we've got a 0.48 over here. These are fairly high correlations, and if you're really stretched for CPU time, or you're worried about the size of your data set, this is the kind of thing you could do to remove them. Of course, whether 0.65 is a strong enough correlation that you want to completely remove one of these dimensions is really up to you, and it's going to depend on your situation. One of the reasons the correlations aren't as high as you might expect is that these libraries have been designed with this in mind: if Echo Nest just produced 200 features that were all exactly the same, it wouldn't be very useful for picking playlists. So they've produced 200 features that are widely different and so don't necessarily correlate all the time. That's the whole point, and that's a really useful feature of this data.
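The correlation step, roughly as described; the chroma kurtosis column-name pattern is my guess at how the cleaned file labels those columns:

    # Correlation analysis on one family of librosa features.
    kurt <- music[, grepl("chroma", names(music)) & grepl("kurtosis", names(music))]
    names(kurt) <- paste0("kurt", seq_along(kurt))

    corr <- round(cor(kurt), 2)   # pairwise correlations between the kurtosis features
    corr

    # A quick visual check; image() stands in for whatever correlation plot the video uses.
    image(corr, zlim = c(-1, 1))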
We've looked at some ways we can try and make our data set a little bit smaller. Remember, our ultimate goal is the smallest, most useful data set we can get our hands on; then we can put that into machine learning or clustering and really extract some knowledge. The problem is that with correlation analysis, or forward and backward attribute selection, we might just be deleting data, and maybe the correlation wasn't exactly one, so it wasn't completely redundant. Do we actually want to completely remove this data? Is there another way we can transform our data, so we can make more informed and more effective decisions about what we remove? That's PCA, or principal component analysis. At the moment we're just fitting one line through our two-dimensional data; there are going to be more principal components later, right? But what we want to do is pick the direction through this data, however many attributes it has, that has the most spread. So how do we measure this? Well, quite simply...
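As a rough sketch of where PCA goes in R (not the video's code; it assumes the numeric feature columns have no missing values):

    # Principal component analysis on the numeric audio features. prcomp()
    # finds the orthogonal directions of greatest spread; scaling matters
    # because the features live on very different scales.
    features <- music[, sapply(music, is.numeric)]
    features <- features[, sapply(features, function(x) isTRUE(sd(x, na.rm = TRUE) > 0))]  # drop constant columns

    pca <- prcomp(features, center = TRUE, scale. = TRUE)
    summary(pca)         # proportion of variance explained by each component
    plot(pca$x[, 1:2])   # tracks projected onto the first two principal components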
Info
Channel: Computerphile
Views: 48,802
Keywords: computers, computerphile, computer, science, University of Nottingham, Computer Science, Data Analysis, Data, Dr Mike Pound, Dr Mercedes Torres Torres, Data Reduction
Id: 8k56bvhXw4s
Length: 17min 49sec (1069 seconds)
Published: Tue Jul 09 2019