PyTorch Tutorial 09 - Dataset and DataLoader - Batch Training

Video Statistics and Information

Captions
Hi everybody, welcome back to a new PyTorch tutorial. Today I want to show you the PyTorch Dataset and DataLoader classes.

So far our code looked something like this: we had a dataset that we loaded somehow, for example from a CSV file, and then we had our training loop that looped over the number of epochs and optimized our model based on the whole dataset. This can be very time consuming if we do the gradient calculations on the whole training data, so a better way for large datasets is to divide the samples into smaller, so-called batches. Then the training loop looks something like this: we loop over the epochs again, then do an inner loop over all the batches, get the x and y batch samples, and do the optimization based only on those batches. If we use the built-in Dataset and DataLoader classes from PyTorch, then PyTorch can do the batch calculations and iterations for us, so it's very easy to use, and now I want to show you how to use these classes.

Before we jump to the code, let's quickly clarify some terms used when we talk about batch training. One epoch means one complete forward and backward pass over all the training samples. The batch size is the number of training samples in one forward and backward pass. The number of iterations is the number of passes, where each pass uses batch_size samples. As an example: if we have 100 samples and our batch size is 20, then we have 5 iterations for one epoch, because 100 divided by 20 is 5.

Now let's jump to the code. First I import the modules we need: torch of course, then torchvision, and from torch.utils.data we import Dataset and DataLoader, the classes I just talked about. Let's also import numpy and math. Now we can start implementing our own custom dataset. Let's call it WineDataset; it must inherit from Dataset.
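The epoch/batch/iteration arithmetic described above can be sketched in a few lines (a minimal standalone example, not taken from the video's code):

```python
import math

# The example from the text: 100 samples, batch size 20.
n_samples = 100
batch_size = 20
iterations_per_epoch = math.ceil(n_samples / batch_size)
print(iterations_per_epoch)  # → 5
```

math.ceil is used because a final, smaller batch still counts as one iteration when the sample count is not divisible by the batch size.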
A custom dataset has to implement three things: the __init__ method, where we do some data loading, for example; the __getitem__ method, which gets self and an index and allows indexing later, so we can call dataset[0], for example; and the __len__ method, which only has self and allows us to call len(dataset).

In our case we want to look at the wine dataset. I have the CSV file here, and I also put it in my GitHub repository so you can check it out there. The dataset looks like this: the first row is the header, and we want to predict the wine category. There are three different wine categories, 1, 2, and 3; the class label is in the very first column, and all the other columns are the features.

Let's load this and split the columns into x and y. We say xy = np.loadtxt(...) and specify the file name: it is in the data folder, inside a wine folder, and it's called wine.csv. Let's also pass delimiter=',' because this is a comma-separated file, dtype=np.float32, and skiprows=1 because we want to skip the first row, which is our header. Now let's split the whole dataset into x and y. We say self.x = xy[:, 1:], using slicing: we want all the samples but not the very first column, so we start at column 1 and go all the way to the end, which gives us the features. Then self.y = xy[:, [0]]: again all the samples, but only the very first column, and we put the 0 in an extra bracket so that the shape is (n_samples, 1). This extra dimension will make some calculations easier later.
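To illustrate the slicing just described, here is a toy array standing in for the CSV contents (the numbers are made up; the real file is read with np.loadtxt):

```python
import numpy as np

# Toy stand-in for wine.csv: first column is the class label,
# the remaining columns are features.
xy = np.array([[1.0, 14.2, 1.7, 2.4],
               [2.0, 13.2, 2.8, 2.2],
               [3.0, 12.4, 1.1, 2.9]], dtype=np.float32)

x = xy[:, 1:]   # all rows, columns 1..end -> features, shape (3, 3)
y = xy[:, [0]]  # all rows, first column kept 2-D -> labels, shape (3, 1)
print(x.shape, y.shape)  # → (3, 3) (3, 1)
```

Note that xy[:, [0]] keeps the column dimension, while xy[:, 0] would return a flat vector of shape (3,).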
Let's also convert these to tensors by wrapping the arrays in torch.from_numpy. We don't have to do this here, we could also convert later, but we can do it right away. Let's also store the number of samples: self.n_samples = xy.shape[0], since the first dimension is the number of samples. Then we can simply return this in our __len__ function: return self.n_samples. The __getitem__ method can be implemented in one line: return self.x[index], self.y[index], which returns a tuple.

Now our dataset is done, so let's create it: dataset = WineDataset(). Let's have a look at it: we can say first_data = dataset[0], using indexing to get the very first sample, unpack it with features, labels = first_data, and print the features and the labels to see if this is working. We get one feature row, so one row vector, and then the label, which is 1 in this case.

That's how we use the Dataset; now let's see how we use a DataLoader. We say dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=2). The batch size is 4 in this case; shuffle=True shuffles the data, which is very useful for training; and num_workers=2 is optional but might make loading faster because it uses multiple subprocesses. Now let's see how we can use this data
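The structure of the class just built follows the map-style dataset protocol: all a DataLoader needs is an object with __getitem__ and __len__. Here is a torch-free sketch of that protocol; ToyDataset and its data are invented for illustration:

```python
# Any object implementing __getitem__ and __len__ can serve as a
# map-style dataset; ToyDataset is a made-up minimal example.
class ToyDataset:
    def __init__(self):
        self.x = [[float(i), float(i + 1)] for i in range(6)]  # 6 samples, 2 features each
        self.y = [i % 3 for i in range(6)]                     # 3 fake class labels

    def __getitem__(self, index):
        # returns a (features, label) tuple, enabling ds[0]
        return self.x[index], self.y[index]

    def __len__(self):
        # enables len(ds)
        return len(self.x)

ds = ToyDataset()
features, label = ds[0]
print(len(ds), features, label)  # → 6 [0.0, 1.0] 0
```

The WineDataset in the video is exactly this shape, with the lists replaced by tensors loaded from the CSV file.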
loader object. We can convert it to an iterator, data_iter = iter(dataloader), and then call data = data_iter.next() (in newer PyTorch versions this is written next(data_iter)) and unpack again with features, labels = data. Now let's print the features and the labels to see if this is working, and here we have it: I specified the batch size as 4, which is why we see four different feature vectors here, and for each feature vector the corresponding class, so four class labels in our labels tensor.

We can also iterate over the whole DataLoader and not only get the next item, so let's do a dummy training loop. First let's specify some hyperparameters: num_epochs = 2. Then let's get the total number of samples, total_samples = len(dataset), and the number of iterations in one epoch, which is the total number of samples divided by the batch size of 4; we also have to apply math.ceil to this. Printing total_samples and n_iterations, we see that we have 178 samples and 45 iterations.

Now let's write the loop: for epoch in range(num_epochs), and then the second loop over the train loader, for i, (inputs, labels) in enumerate(dataloader), where we already unpack: the enumerate function gives us the index and then the inputs and labels. What we would typically do in training is the forward pass, then the backward pass, then the weight update. This is just a dummy example, so here I only want to print some information about the batch. Let's say if (i + 1) % 5 == 0, so every fifth step we print some information.
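The 178-samples / 45-iterations numbers above can be checked with plain arithmetic (independent of the video's code). One detail worth noting: since 178 is not divisible by 4, the last batch of each epoch holds only 2 samples; PyTorch's DataLoader keeps that partial batch unless drop_last=True is passed:

```python
import math

# Sanity check of the numbers from the run: 178 wine samples, batch size 4.
total_samples = 178
batch_size = 4
n_iterations = math.ceil(total_samples / batch_size)
# Size of the final, partial batch (kept unless drop_last=True is used).
last_batch_size = total_samples - (n_iterations - 1) * batch_size
print(n_iterations, last_batch_size)  # → 45 2
```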
We print the current epoch out of num_epochs, then the current step, which is i + 1, out of the total steps n_iterations, and finally some information about the input, namely inputs.shape. Now let's run this to see if it's working. In the print statements we see that we have two epochs, in every epoch we have 45 steps, and every fifth step we print some information. We also see that our tensor is 4 by 13: the batch size is 4, and there are 13 features in each sample.

That's how we use the Dataset and DataLoader classes to very easily get single batches. Of course, PyTorch also has some built-in datasets: for example, torchvision.datasets.MNIST gives us the famous MNIST dataset, and we can also get the Fashion-MNIST dataset, the CIFAR datasets, or the COCO dataset. The MNIST dataset is one that we will use in one of the next tutorials. For now, this is what I wanted to show you about the Dataset and DataLoader classes. I hope you liked it; please subscribe to the channel, and see you next time. Bye!
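The shuffle-and-batch behavior the DataLoader provided in this tutorial can be mimicked in plain Python, which makes the nested epoch/batch loop concrete (a simplified sketch with invented toy data; the real DataLoader additionally handles collation into tensors and worker processes):

```python
import math
import random

# Plain-Python mimic of DataLoader's shuffle + batching for a toy dataset.
samples = list(range(10))  # pretend these are 10 sample indices
batch_size = 4
num_epochs = 2
n_iterations = math.ceil(len(samples) / batch_size)  # ceil(10 / 4) = 3

random.seed(0)  # deterministic for the demo
for epoch in range(num_epochs):
    random.shuffle(samples)  # roughly what shuffle=True does each epoch
    for i in range(n_iterations):
        batch = samples[i * batch_size:(i + 1) * batch_size]
        # forward pass, backward pass, and weight update would go here
print(n_iterations)  # → 3
```

Reshuffling happens at the start of every epoch, matching DataLoader's behavior with shuffle=True.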
Info
Channel: Patrick Loeber
Views: 168,818
Keywords: Python, Machine Learning, ML, PyTorch, Deep Learning, DL, Python DL Tutorial, PyTorch Tutorial, PyTorch Course, Neural Net, Dataset, DataLoader
Id: PXOzkkB5eH0
Length: 15min 27sec (927 seconds)
Published: Sat Jan 04 2020