Multivariate Time Series Data Preprocessing with Pandas in Python | Machine Learning Tutorial

Captions
Hey guys, in this video we are going to see how we can do some time series data preprocessing, and we are going to convert a pandas data frame into a set of sequences, which are going to be used in later videos for time series forecasting, classification, and probably anomaly detection. In our case we are going to convert some multivariate data into time series sequences, so the data is going to contain multiple features. I'm going to show you some of the tricks that I use to convert this kind of data into sequences, which we are later going to convert into PyTorch datasets, and I'm going to show how you can train time series models that are based on neural networks, of course using PyTorch Lightning. So stay tuned for the next couple of videos as well. Let's get started, and I'll show you how you can do some time series data preprocessing.

So here we have an empty notebook on Google Colab, and I'm going to start by doing some library installations. I'm going to connect the notebook; we don't really need a GPU, at least for now. I'm going to install a specific version of tqdm, which will allow us to do some progress visualization. Something cool that I've noticed about these Google Colab instances, at least recently, is that we have those execution statistics: when you're doing some long-running task, it tells you what is happening, and I'm going to show you this later on when we do those long-running tasks.

Next I'm going to continue with a lot of imports. Here I'm importing Matplotlib and Seaborn for the visualizations, which we are probably not going to make in this video. Next I'm importing pandas and NumPy, tqdm for the progress bars, and PyTorch Lightning, and we are going to do some min-max scaling. Then I'm importing all the stuff related to PyTorch. In this particular example we are not going to use many of those, but in later videos I'm going to use some
of those, and I'm not going to scroll up and do the imports again. So we do a bunch of imports right here, and this is executing. I'm also going to set up the Matplotlib library and the various settings for the color palettes and the styling right here. Then I'm going to call tqdm.pandas(), which will monkey patch the apply method for pandas, which we are going to use in a couple of minutes. And finally I'm going to call pl.seed_everything; this is from PyTorch Lightning, and it will go ahead and seed everything, including some multiprocessing steps that you might want to do.

Next I'm going to look at the data that we are going to use. The data itself is going to be from Binance, and it's provided by cryptodatadownload.com. Here we have historical data for different trading pairs, and I'm interested in this BTC to US dollars minute-by-minute data. This will be a lot of data, and it's going to contain all those values; I'm going to show you what those are. I went and used the site to download the zip file, then I extracted the data, and from here we have this Binance BTC USD minute data. It's about 32 megabytes, I believe; yeah, 35 or 36 megabytes.

So next we are going to use pandas to read the data, and this should be from the file (we get autocomplete here), and I'm going to parse the date column. If we look at the data, you can see that we have the date, the symbol, which is going to be exclusively BTC/USDT, and a lot of other values: the opening price, the high price, the low price, and the closing price, at least for this minute, because this is minute-by-minute data. You can see that the data is probably in reverse order, because this is the sixth minute of the hour, the fifth, the fourth, etc. So what I'm going to do after loading the data is sort the values by date, and then I'm going
to reset the index of the data frame. Right now you can see that the minute-by-minute data is sorted into the correct order, so this is all good. Let's check the number of rows that we have: we have roughly, yeah, about a quarter of a million rows, so this should be more than enough data to do basically anything you want.

Let's continue with the preprocessing. One thing that I very often do is calculate some statistics across the current row and the next row or the previous row. For example, in this case we might want to have a look at the change from the previous closing price to the current one. One really useful method of pandas is the shift method. If we shift this by one position, let me just show you the close price right here. So this is the close price, and this is the version shifted by one. You can clearly see that this maps to the zeroth index right here, this maps to the second index right here, etc., and we have just a not-a-number value at the zeroth index. So this is just shifting the rows by one, and I'm going to use this as a new column in our data frame. We are going to call this previous close, which should be descriptive enough. If I look at the data right now, you can see that we've added this new column at the end of the data frame, and the first value is not a number, which is going to be a bit of a problem, but we are going to handle this right now.

So we want to create a new column, which is going to be the closing price change. Here I'm going to apply the progress_apply method, which is given to us by the tqdm pandas integration. In here I'm going to check if the previous close for the current row is not a number or None; I'm going to use the NumPy method for that, and in that case we are going to return a zero. Else we are going to calculate the difference, or the delta, between the current close price
and the previous close price, and we are going to do this along the first axis, which is going to apply this over the rows. So this will create another column, and as you can see, progress_apply is giving us a nice progress bar, and here is the number of seconds that passed during this execution. Google Colab is nice enough to give us this execution summary, which is a good feature, I believe. Then we are going to have a look at the new data frame. Again, we have this close change price, or the delta, and as you can see, at the first position, the zeroth index, we have zero, because it was a NaN value, and after that we just have the amount that the closing price has changed. I believe this is probably in US dollars.

So next I'm going to convert all this into a features data frame that we are going to use for the next couple of models, which we are going to have a look at in the next couple of videos. To do this I'm going to create a rows list, and in this list I'm going to convert the data frame into the features that we're going to use. I'm going to iterate over the data frame using tqdm, and I'm going to pass in the number of rows that we have, so it knows how many iterations we are going to do. I'm going to create a dictionary that is going to contain the data for this row. I'm going to get the day of the week; to do that I'm going to access the date's dayofweek property. Then I'm going to do the same thing for the day of the month, which is just day. Next we have the week of the year. These can be any features that you want, but I've chosen those; this is a way to encode the date. Next we have the month, and the opening price, the high price, the low price, the close change, which is the feature newly created by us, and then finally the closing price. I'm going to append this to the rows, and I'm going to convert all of it into a data
frame. So here is tqdm again, and one interesting thing about Google Colab is that it's jumping all around, probably telling us where the current execution point is. I don't know, but it's doing something. All right, after this is complete I'm going to check the shape of it, and then I'm going to have a look at the data frame itself. Now that the feature creation is complete, we have this newly created data frame, and it has the same number of rows that we had before; again, we are just doing some feature engineering, if you will, and in here we have the new numbers extracted from the original data.

Next we are going to make the split between training data and test data. In this case I want to have 90% of the data for training, and I'm going to do that by calculating 90% of the number of rows that we have, and this comes out to this number. Next we are going to split the data frame based on this index and create two data frames. I'm going to use the features data frame and get all of the examples before the train size, and then I'm going to take all the examples after the train size, and I'm going to add one just to make sure that no data is leaking between the training and test data. So these are the numbers that we have for training and testing. As you can see, we have the same number of columns, or features, for each data frame, which is good.

Next we are going to do some transformation of the data, and the transformation that we are going to use is a min-max scaler. Scaling the data allows us to probably get faster convergence when training with stochastic gradient descent, and when using deep neural net models, for example LSTMs or GRUs or 1D convs, or maybe, if you are on the cutting edge, transformers for time series data. In all those cases we are training with stochastic gradient descent, so scaling the
features and the labels helps us to get somewhat better results and faster convergence, and it is probably going to fight against the exploding or vanishing gradient problems that you might have. One really easy way to do that is to use the sklearn MinMaxScaler, and in here I'm going to specify the range minus one to one, so this will be the range that all the features are going to be scaled to. Then I want the scaler to fit only the training data, and this is very important, because we don't want to leak any of the test data into the scaler. We want to scale the data using only the minimum and maximum values of the training data, so the test data is going to be just for testing. We did the same thing here with the split: we first took the first n examples, and then, as time goes on, we are reserving the later data for testing. This is really important, because you don't want to leak any information from the test data before evaluating on it.

Next I'm going to apply the transformation of the scaler. I'm going to call the scaler's transform method on the training data frame, and I'm going to pass the train_df index and the columns. Yeah, this is getting ugly, so I'm going to redo this. Why are we doing all this? If you just pass in scaler.transform(train_df), let me actually execute this and I'll show you, and look at the first row, you see that this is no longer a data frame. Just to preserve the data frame itself, we are going to do the transformation, but we are going to create a new data frame that is the same as the data frame we had before: in here we are going to pass in the index and the columns, and of course the transformed values, which are going to be this NumPy array. I'm going to do the same thing for the test data frame. Okay, so I run this, and next let's check what we have in the training data right now. So basically
the same columns, but as you can see, the numbers are very, very different, at least in the price range. Let's compare this with the previous prices: we had rather large values, around 10k in US dollars, and now those values are scaled very differently. So this is good; it means that it's doing something, at least.

For the final step of the preprocessing we are going to convert those training and test data frames into sequences. For that purpose I am going to write a new function, which we are going to call create_sequences. This function is going to take some input data, which is going to be a pandas data frame, the target column, which is the column that we want to predict, and the sequence length. What we are going to do here is cut the data into multiple sequences. I'm going to start by just writing out the function, and then we are going to have a look at an example of how it works, so please stay with me.

The sequences are going to be a list right here, and then we are going to check the number of rows in the input data. Next we are going to iterate over the data, again using tqdm, and I'm going to iterate over the range of data size minus the sequence length. What this will do is iterate over all of the data except for the last examples, as many as the sequence length. This is required because otherwise we are not going to have enough data at the end to fill the sequences. The first thing that I'm going to do is get the sequence, and I'm going to take the input data from index i to i plus sequence length, so this will give us a sequence with the length of the passed parameter. Next I'm going to take the next value, which is going to be the position of the label, and this is going to be i plus sequence length. The end of the slice is exclusive, so we are not getting this row in the sequence, we are getting up to this minus one; this is basically how indexing in Python
works. So the label position right here is going to be at this position, which is inclusive. I know this might be a bit hard to get at first, but I hope that when we look at the examples it all makes sense. Next I'm going to take the label value itself: I'm going to take the row at the label position, and from that I'm going to take the target column, so this will just give us the value that we have. Next I'm going to append a tuple of the sequence and the label. Finally I'm going to return the sequences, and this is pretty much the function that we need. Of course, you might go ahead and try something a bit more advanced, like, for example, sliding windows, or maybe another variation of that. This is a really simple implementation in which we are taking every next value and just creating a sequence out of it.

So let me run this. I have a little example right here: I'm going to create a dummy data frame, and here I want to create a feature which will have the values one, two, three, four, and five, and I'll have a column with the name label with the values 6, 7, 8, 9, 10. Let's look at the data right here. Okay, so we have this very basic data frame, and let's create sequences from it. We'll call the function create_sequences, we are going to pass in the label as the target variable, and the sequence length is going to be 3.
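The function and the dummy example described in the captions above can be sketched roughly as follows. The captions do not show the actual code, so this is a reconstruction under assumptions: the name create_sequences comes from the video, while the parameter names (input_data, target_column, sequence_length), the variable sample_data, and the use of .iloc are guesses, and the tqdm progress bar is left out to keep the sketch short.

```python
import pandas as pd


def create_sequences(input_data: pd.DataFrame, target_column: str, sequence_length: int):
    """Cut a data frame into (sequence, label) pairs, as described in the captions."""
    sequences = []
    data_size = len(input_data)

    # Stop sequence_length rows before the end so every sequence has a label after it.
    for i in range(data_size - sequence_length):
        # Slice end is exclusive: rows i .. i + sequence_length - 1.
        sequence = input_data.iloc[i:i + sequence_length]

        # The label is the target value in the row right after the sequence.
        label_position = i + sequence_length
        label = input_data.iloc[label_position][target_column]

        sequences.append((sequence, label))

    return sequences


# Dummy data frame from the video: a feature 1..5 and a label 6..10.
# (The column name "feature" is an assumption; "label" is named in the captions.)
sample_data = pd.DataFrame({"feature": [1, 2, 3, 4, 5], "label": [6, 7, 8, 9, 10]})

sequences = create_sequences(sample_data, target_column="label", sequence_length=3)

print(len(sequences))             # 2 sequences
print(sequences[0][0])            # first three rows of sample_data
print("label:", sequences[0][1])  # label taken from the fourth row
```

With five rows and a sequence length of 3, the loop runs twice, which is why only two sequences come out: the window 3, 4, 5 has no following row to use as a label.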
Let's see. So this runs, and it's already telling us that it did two iterations, so the number of sequences should be two. You might think that one two three, two three four, and three four five are the possible sequences that you can get, but that's not exactly correct, because for the last sequence you don't have a label. So let's check the sequences themselves. To do that I'm going to first take the sequence at the zeroth position and print it out, and then I'm going to print out the label at the zeroth position. So this is the first sequence and the label itself: this gives us the first three rows from the data frame, and the label is the label from the next row. So this works as expected. Then we have the next example, which is going to be the next sequence, and for this one we have 2, 3, 4 and 7, 8, 9, and the label is 10. Note that we don't have any more labels, so this is why the number of sequences is two.

All right, I hope that all this makes sense. Next I'm going to create a sequence length constant, and I'm going to pass in, for example, 60. I'm going to create the training sequences using the training data frame, and I'm going to pass in the sequence length. I'm going to do the same thing for the test sequences, and let it run. Now that the preprocessing is complete, it took roughly 40 seconds combined; here is the output of that. Let's check, for example, the first sequence: what does it have? Let's first check the label, so this is just a number, and the sequence is going to be, I believe, a data frame. Yeah, it's a data frame which contains all the examples that we want, and the last index should be 59, because it starts from zero. So we have a data frame that contains 60 rows, and this is a sequence that is created from the large training data frame. This is the shape of it. And to check how many sequences we have,
we have the length of the training sequences and the length of the test sequences right here. Let's compare this to the training data frame. As you can see, let me just copy this, this is the number of examples that we have in the training data frame, and if we do this, you can see that this is exactly the... so this should be True if we did everything correctly. The difference should be exactly the sequence length, the number of elements that we have in each sequence, so everything checks out. This is great.

And this is pretty much how you can preprocess your time series data. Again, we had multiple features; you can do the same thing for a single feature. There is no reason that this shouldn't work; you might need to do some fiddling, but it should be kind of all right. In the next video I'm going to show you how you can convert those sequences into PyTorch datasets, and then we are going to train an LSTM or some other model to predict the Bitcoin price into the future, so we can get rich, all of us. Thanks for watching, guys. I would really love for you to give me a like, maybe subscribe to the channel as well, and drop a comment below if you have any questions. Thanks for watching, bye.
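The earlier preprocessing steps from the captions (sorting, the shifted previous close, the close change, date features, the 90% split, and min-max scaling fitted on the training data only) can be sketched as below. This is not the code from the video: the data frame is a small synthetic stand-in with made-up prices, the column names are assumptions, and the row-wise progress_apply and tqdm loop from the video are replaced with vectorized pandas operations to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the Binance minute data (values are made up).
# Rows are deliberately in reverse time order, like the downloaded file.
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01 00:00", periods=20, freq="min")[::-1],
    "open": np.linspace(100.0, 119.0, 20),
    "high": np.linspace(101.0, 120.0, 20),
    "low": np.linspace(99.0, 118.0, 20),
    "close": np.linspace(100.5, 119.5, 20),
})

# Sort by date and reset the index so the rows run forward in time.
df = df.sort_values(by="date").reset_index(drop=True)

# Previous close via shift; the first row has no predecessor, so it is NaN.
df["prev_close"] = df["close"].shift(1)

# Change from the previous close, with 0 where the previous close is missing.
df["close_change"] = (df["close"] - df["prev_close"]).fillna(0)

# Features data frame: encoded date parts plus prices (column names are assumptions).
features_df = pd.DataFrame({
    "day_of_week": df["date"].dt.dayofweek,
    "day_of_month": df["date"].dt.day,
    "week_of_year": df["date"].dt.isocalendar().week.astype("int64"),
    "month": df["date"].dt.month,
    "open": df["open"],
    "high": df["high"],
    "low": df["low"],
    "close_change": df["close_change"],
    "close": df["close"],
})

# First 90% of the rows for training; skip one row before the test data starts.
train_size = int(len(features_df) * 0.9)
train_df = features_df[:train_size]
test_df = features_df[train_size + 1:]

# Fit the scaler on the training data only, then transform both splits.
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(train_df)

# transform() returns a NumPy array, so rebuild data frames with the old index and columns.
train_df = pd.DataFrame(scaler.transform(train_df), index=train_df.index, columns=train_df.columns)
test_df = pd.DataFrame(scaler.transform(test_df), index=test_df.index, columns=test_df.columns)

print(train_df.shape, test_df.shape)
```

Because the scaler only sees the training rows, values in the test split can fall outside the range minus one to one; that is expected and is the price of not leaking test data into the scaler.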
Info
Channel: Venelin Valkov
Views: 6,184
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning, Time Series, PyTorch, Pandas, Python
Id: jR0phoeXjrc
Length: 30min 24sec (1824 seconds)
Published: Sat Mar 27 2021