TensorFlow Tutorial 12 - TensorFlow Datasets

Captions
In this video we're going to look at TensorFlow Datasets: how we can load different datasets, do preprocessing, and feed the data to a model efficiently.

Here are the datasets that are available through keras.datasets, which is what we've been primarily working with so far, and as you can see there are relatively few; we've been working with CIFAR-10 and MNIST. If we compare that to the TensorFlow Datasets catalog, we have a bunch more: datasets for audio, images, image classification, object detection, question answering, structured data, summarization, text, translation, video. There are so many datasets here that it makes sense for us to learn it, and it also lets you load the data extremely efficiently.

I feel it's important to mention that what we're going to work with isn't the same as tf.data. TensorFlow Datasets makes it easy to load common datasets, and it uses tf.data under the hood, but if you have, say, a custom dataset that you want to load, you're going to want to use tf.data directly. TensorFlow Datasets is a high-level wrapper intended to make it very easy to load commonly used datasets, and in future videos we're going to look at how to use tf.data to build an input pipeline for datasets that you've perhaps gathered on your own or scraped from the internet. With that said, there will be similarities, since TensorFlow Datasets is a wrapper around tf.data, so what we learn in this video will still be useful when we load our own custom datasets.

All right, enough talking; let's take a look at the code. What we have here are the standard imports we've been using, plus one new one: import tensorflow_datasets as tfds. If you don't have it, the install is just pip install tensorflow-datasets; the links for Anaconda and the pip command are in the description.

Once that's installed, we're going to start by loading the MNIST dataset. We've been working a lot with MNIST and I think we're all getting tired of it, but it makes for a simple demonstration of how loading actually works. We do ds_train, ds_test: from tfds we get a dataset for training and a dataset for testing, and we also get some ds_info about the dataset. The call is tfds.load("mnist"), where the string is just the dataset name as you find it in the TensorFlow Datasets catalog. Then we pass split=["train", "test"]; ds_train is the first output in the returned tuple, which is why "train" comes first. Some datasets also have a validation split, in which case you would add a "validation" string here; MNIST doesn't have one, so you have to check what the splits are for your specific dataset. Finally we set shuffle_files=True.
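For reference, a minimal sketch of the load call described in this and the next paragraph:

    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Load MNIST: a (train, test) pair of tf.data Datasets plus a DatasetInfo object.
    (ds_train, ds_test), ds_info = tfds.load(
        "mnist",                  # dataset name as listed in the tfds catalog
        split=["train", "test"],
        shuffle_files=True,       # shuffle the underlying TFRecord files
        as_supervised=True,       # yield (image, label) tuples instead of dicts
        with_info=True,           # also return dataset metadata
    )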
Why shuffle the files? TensorFlow Datasets usually stores data in something called TFRecords, typically spread across multiple files with, say, 1,000 examples per file. The reason is that the data can then be streamed: if you're working with a server, it can be streamed over the internet and loaded while the model is training, which is quite useful if you're training on something like Google TPUs. We want to shuffle those files so the model doesn't see the exact same sequence of files every time; even though the batches drawn from within each file of 1,000 examples will be shuffled and randomized, we still want to shuffle the files themselves so the ordering isn't the same.

Then we set as_supervised=True, which means the dataset will return (image, label) tuples; otherwise it returns a dictionary. And we set with_info=True, which is why we get ds_info back; if we set it to False, we would just get ds_train and ds_test.

The first thing we usually do is print(ds_info) so we can see what the dataset looks like. In this case I think we already know it: the image shape is (28, 28, 1) with dtype tf.uint8; the label has no shape, it's just an integer of type tf.int64; the number of classes is 10; and there are 70,000 examples in total, 60k for training and 10k for testing. You can also see the citation and so on, in case you're using the dataset for a paper.

If you have a newer version of TensorFlow, I think 2.3 and up, you can also do something like fig = tfds.show_examples(ds_train, ds_info) and set rows and columns. One catch: this doesn't work with as_supervised=True, because it expects a dictionary, so rerun the load with as_supervised=False. You'll then see a grid of examples from your dataset, here four rows and four columns as we specified, with the name and the label beneath each image.

Moving on, we define a function normalize_image that takes an image and a label. Essentially we make sure the image is tf.float32 and divide by 255 to get it into the zero-to-one range: return tf.cast(image, tf.float32) / 255.0 together with the label. Then we do ds_train = ds_train.map(normalize_image), which runs every single example through that function. We can also specify num_parallel_calls: there's no inherent order in which examples need to pass through normalize_image, so this can be done in parallel, and this argument says how many parallel calls to use. You could pick a number yourself, say five or ten; it's essentially a hyperparameter. One cool thing is that TensorFlow lets you use AUTOTUNE, which picks whatever TensorFlow thinks is best; you get it from tf.data.experimental.AUTOTUNE.
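A sketch of the inspection and normalization steps just described; note that in the tfds version used here, show_examples expects the dictionary format (as_supervised=False), and its argument order has varied between releases:

    print(ds_info)  # shapes, dtypes, splits, citation, ...

    # Visualize a 4x4 grid of samples (requires as_supervised=False in this tfds version).
    fig = tfds.show_examples(ds_train, ds_info, rows=4, cols=4)

    AUTOTUNE = tf.data.experimental.AUTOTUNE

    def normalize_image(image, label):
        """Cast images to float32 and scale them into the [0, 1] range."""
        return tf.cast(image, tf.float32) / 255.0, label

    # Run every example through normalize_image, with parallelism chosen by TensorFlow.
    ds_train = ds_train.map(normalize_image, num_parallel_calls=AUTOTUNE)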
Next, ds_train = ds_train.cache(): after the data has been loaded the first time, some of it is kept in memory so it's faster the next time around. Then ds_train = ds_train.shuffle(), where we set the shuffle buffer size, say 1,000. With a buffer of 1,000 it won't shuffle across the entire dataset at once, and a good value depends somewhat on the size of the files tfds has stored things in; what you can do instead is pass ds_info.splits["train"].num_examples, which makes sure the examples are shuffled across the whole split. Then ds_train = ds_train.batch() with some batch size, which we define above, say 64, and finally ds_train.prefetch(AUTOTUNE): while the GPU is busy, it prefetches the next 64 examples so they're ready to run the instant the current GPU calls are done. Then we do pretty much the same thing for the test set: ds_test.map(normalize_image) with num_parallel_calls=AUTOTUNE, then batch (we don't shuffle the test set), and lastly prefetch with AUTOTUNE.

That's the actual data processing, and it's going to be similar when we load data ourselves using tf.data. Here the dataset has already been loaded very conveniently by tfds, but from that point on, this is exactly what we'll do with custom datasets as well.

Now let's create a simple model: keras.Sequential with keras.Input((28, 28, 1)), one Conv2D layer with 32 output channels, a kernel size of 3 and ReLU, then Flatten, then one Dense layer. All that's left is model.compile: the optimizer is keras.optimizers.Adam, where we can set the learning rate; the loss is keras.losses.SparseCategoricalCrossentropy; and the metric is just accuracy. Then model.fit on ds_train: normally you would have to pass in x and y separately, but since ds_train contains tuples of (x, y), we can just pass it in directly. We set epochs to 5 or so and verbose=2, and then model.evaluate on the test set. Two things we had to fix when running this: we set as_supervised=False earlier for show_examples, but it has to be True for training to work, and we also needed from_logits=True on the loss, otherwise this is not going to train properly; I think that's what we were missing.
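Putting the rest of the MNIST example together, continuing from the snippets above; the exact learning rate isn't stated in the narration, so 0.001 here is an assumption:

    from tensorflow import keras
    from tensorflow.keras import layers

    BATCH_SIZE = 64

    # Training pipeline: cache after the first read, shuffle across the full
    # split, batch, and prefetch the next batch while the GPU is busy.
    ds_train = ds_train.cache()
    ds_train = ds_train.shuffle(ds_info.splits["train"].num_examples)
    ds_train = ds_train.batch(BATCH_SIZE)
    ds_train = ds_train.prefetch(AUTOTUNE)

    # Test pipeline: same normalization, but no shuffling.
    ds_test = ds_test.map(normalize_image, num_parallel_calls=AUTOTUNE)
    ds_test = ds_test.batch(BATCH_SIZE)
    ds_test = ds_test.prefetch(AUTOTUNE)

    model = keras.Sequential([
        keras.Input((28, 28, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(10),  # raw logits for the 10 classes; no softmax here
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),  # assumed value
        # from_logits=True because the final Dense layer outputs logits
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # ds_train already yields (x, y) tuples, so it can be passed directly.
    model.fit(ds_train, epochs=5, verbose=2)
    model.evaluate(ds_test)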
All right, so we get about 98% on the test set. That was an example of MNIST with images; I was thinking we could also look at something a bit different: text classification. This is going to be a little more advanced, because it also involves processing the text, but we'll keep it very simple and focus on the dataset. We're going to look at the IMDB reviews dataset, which contains reviews of movies, and what we want to do is sentiment analysis: interpret whether a comment is positive or negative. For example, the comment "this movie was terrible" would get a 0 since it's negative, and "this movie was really good" would get a 1.

What we do is similar to before: ds_train, ds_test, and ds_info from tfds.load("imdb_reviews") with split=["train", "test"], shuffle_files=True, as_supervised=True, and with_info=True. First we can print(ds_info), and then do something like for text, label in ds_train: print the text and quit after a single example, just so we get a sense of what it looks like. We see 25k examples for training and 25k for test, plus an "unsupervised" split, which I think is just comments without labels. And then we get a review, something like "I went with eight friends to a sneak preview viewing of this film..." It's quite a long comment, but you get the point.

The first thing we need to do is tokenize the text, because we can't send an entire sentence into a model as one string. Say we have the string "hello I love this movie"; tokenization outputs a list of the individual words: "hello", "I", "love", "this", and so on. The next step is to numericalize, because we can't send in strings either: we convert each of these words to an index using some vocabulary.

So we need a tokenizer, and we can use TensorFlow Datasets for that: tfds.features.text.Tokenizer(). One thing to mention: TensorFlow has a bunch of different ways to process text, and it's a bit confusing to be honest. TensorFlow Datasets can preprocess it, Keras has its own tokenization, and there's also a library called TensorFlow Text that there isn't much information on yet. I'm not really sure which of the three is best; so far the TensorFlow Datasets one seems to work fine.
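A sketch of the IMDB loading, inspection, and tokenizer setup; note that in the tfds release used in the video the tokenizer lives under tfds.features.text, while newer releases moved it to tfds.deprecated.text:

    (ds_train, ds_test), ds_info = tfds.load(
        "imdb_reviews",
        split=["train", "test"],
        shuffle_files=True,
        as_supervised=True,
        with_info=True,
    )
    print(ds_info)

    # Peek at a single (text, label) example to see what the data looks like.
    for text, label in ds_train.take(1):
        print(text.numpy())

    tokenizer = tfds.features.text.Tokenizer()  # tfds.deprecated.text.Tokenizer in newer tfds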
After we have the tokenizer, we define build_vocabulary. The vocabulary is a set, and for each text in ds_train (we don't need the labels here) we do vocabulary.update with tokenizer.tokenize of that text, converting it to numpy first and lowercasing it so it doesn't matter whether characters are uppercase or lowercase. If you're very observant, you'll notice we're adding every word to our vocabulary, and that's not ideal. Normally you would set some frequency threshold, say a word has to occur five times in the dataset before it's added, because then it's an important word. So this isn't done for efficiency or accuracy; I just want to show a very simple example of how you would do it, and you can try to make it better by counting how many times each word occurs and only adding it past some threshold. Then we return the vocabulary and call vocabulary = build_vocabulary().

Next we need an encoder, because as I said we have to numericalize all of the tokenized words. That's done with tfds.features.text.TokenTextEncoder: we pass in our vocabulary first, then specify the out-of-vocabulary token (in this case we'll never actually hit it, since we added every single word), then lowercase=True, and our tokenizer.

Now we can define my_encoding, which takes a text tensor and a label and returns encoder.encode of the text tensor (converted to numpy first) together with the label. Essentially this encoder tokenizes the text and turns it into indices based on the vocabulary, so it does everything we need. Then we need one more function, encode_map, taking text and label. The thing is, the data loading is also part of the TensorFlow graph, and my_encoding is a plain Python function, so we have to declare its inputs and outputs for it to be part of the graph. We do encoded_text, label = tf.py_function, specifying the function my_encoding, the inputs text and label, and Tout as (tf.int64, tf.int64): once the text is numericalized, each word becomes an integer index into our vocabulary, and the label is an integer too. We also call encoded_text.set_shape with None, because we have a sequence that can be of arbitrary length, while the label is just a single integer, zero or one. At the end we return encoded_text and the label. This can feel a little clunky, and there might be better ways to do it, but this is how I managed to get it to work and it seems to be the standard way of doing it.
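The vocabulary and encoding functions described above, roughly; the "<UNK>" out-of-vocabulary token name is an arbitrary choice:

    def build_vocabulary():
        """Collect every lowercased token in the training set (no frequency cutoff)."""
        vocabulary = set()
        for text, _ in ds_train:
            vocabulary.update(tokenizer.tokenize(text.numpy().lower()))
        return vocabulary

    vocabulary = build_vocabulary()

    # Maps tokens to integer indices based on the vocabulary
    # (tfds.deprecated.text.TokenTextEncoder in newer tfds releases).
    encoder = tfds.features.text.TokenTextEncoder(
        list(vocabulary),
        oov_token="<UNK>",   # arbitrary name for out-of-vocabulary words
        lowercase=True,
        tokenizer=tokenizer,
    )

    def my_encoding(text_tensor, label):
        # Plain Python: tokenize and numericalize one example.
        return encoder.encode(text_tensor.numpy()), label

    def encode_map(text, label):
        # Wrap the Python function so it can run inside the tf.data graph.
        encoded_text, label = tf.py_function(
            my_encoding, inp=[text, label], Tout=(tf.int64, tf.int64)
        )
        # Sequences have arbitrary length; labels are scalars.
        encoded_text.set_shape([None])
        label.set_shape([])
        return encoded_text, label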
Then we set up the pipeline: AUTOTUNE from tf.data.experimental.AUTOTUNE, ds_train = ds_train.map(encode_map) with num_parallel_calls=AUTOTUNE, then .cache(), then ds_train.shuffle(10000). Now, instead of .batch we use ds_train.padded_batch, because the sequences in a batch will have different lengths, so we need to pad them to the longest example. We call padded_batch(32) and specify padded_shapes as ([None], ()). On newer versions of TensorFlow this last part isn't necessary, but what it does is say which of the shapes should be padded: the None is for the text sequences, which is the dimension we want padded, and the empty tuple is for the labels. Then ds_train.prefetch(AUTOTUNE) again. For the test set we just map encode_map and call padded_batch(32) with the same padded shapes; no shuffling needed. That was it for the preprocessing of the text and the dataset.
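The resulting text pipeline, continuing from the functions above (padded_shapes is only needed on older TF versions; newer ones can infer it):

    AUTOTUNE = tf.data.experimental.AUTOTUNE

    ds_train = ds_train.map(encode_map, num_parallel_calls=AUTOTUNE)
    ds_train = ds_train.cache()
    ds_train = ds_train.shuffle(10000)
    # Pad every sequence in a batch to the longest sequence in that batch.
    ds_train = ds_train.padded_batch(32, padded_shapes=([None], ()))
    ds_train = ds_train.prefetch(AUTOTUNE)

    ds_test = ds_test.map(encode_map)
    ds_test = ds_test.padded_batch(32, padded_shapes=([None], ()))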
What we need now is to create our model, and we'll make it very simple. First, layers.Masking(mask_value=0): here we're telling TensorFlow that the padded positions are filled with index 0 and that those values should be ignored in the computation. Say one sequence is 1,000 tokens long and another is 20; the 20-token one gets padded with 980 zeros, and performing the computation for all of those would be quite unnecessary, so with masking we let TensorFlow know to just skip them. Then layers.Embedding with input_dim equal to the length of the vocabulary plus two: plus two because padded_batch adds the padding index 0, and there's also one index for out-of-vocabulary words. We specify an output_dim of 32, which is very small for an embedding size (normally you'd use 300 or so), but again, this is just for illustration.

Normally you would then run an LSTM or some other sequence model on this; in this case all we do is GlobalAveragePooling1D. Say we had 1,000 words in a sequence: after the embedding, each word index is mapped to a 32-dimensional vector, so a batch of shape (batch_size, 1000) becomes (batch_size, 1000, 32), and after the average pooling we get (batch_size, 32), essentially the average over the sequence for each example. Then layers.Dense(64, activation="relu"), and at the end one more Dense layer that outputs a single float value: if it's less than zero the prediction is negative, and if it's greater than (or equal to) zero it's positive. We're going to use binary cross-entropy, which applies the sigmoid activation internally. I guess this part is a bit more advanced, so if you're not quite following, that's okay.

Now we do model.compile. One thing you could do is what we've done previously: output two nodes and use sparse categorical cross-entropy. But when we have only two classes we can instead use keras.losses.BinaryCrossentropy with from_logits=True. The optimizer is keras.optimizers.Adam(3e-4), where we can also specify a clipping value so we don't get exploding-gradient problems, and the metric is accuracy. In the end we just do model.fit on ds_train for 10 epochs and then evaluate on the test set. Again, this is not in any way tuned for accuracy; it's a demonstration of how to take a dataset from TensorFlow Datasets, build a vocabulary, preprocess the text, load it efficiently, and train a simple model just to show that it works.
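A sketch of the model and training setup, with two deliberate deviations from the narration: the more idiomatic mask_zero=True flag on the Embedding stands in for the separate Masking layer used in the video, and the gradient-clipping value of 1.0 is an assumption since none is stated:

    model = keras.Sequential([
        # mask_zero=True makes downstream layers ignore padded (index 0)
        # positions; the video uses layers.Masking(mask_value=0) for this.
        layers.Embedding(input_dim=len(vocabulary) + 2, output_dim=32,
                         mask_zero=True),
        # (batch, seq_len, 32) -> (batch, 32): average the word vectors.
        layers.GlobalAveragePooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # a single logit; the sigmoid is folded into the loss
    ])

    model.compile(
        loss=keras.losses.BinaryCrossentropy(from_logits=True),
        optimizer=keras.optimizers.Adam(3e-4, clipvalue=1.0),  # clip value assumed
        metrics=["accuracy"],
    )

    model.fit(ds_train, epochs=10, verbose=2)
    model.evaluate(ds_test)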
So let's run it and see what we get. All right, first of all, can we just talk about how that ran without any errors on the first try? I wrote all of this code and it ran on the first try; that has to be the first time that's actually happened. On the test set we get 89% at the end, and if we scroll up (this is why I set verbose=2, by the way) we get about 96% on the training set. So there's some room for regularization, and we could also try training a bit longer or making the model bigger, but for this demonstration that's fine.

Hopefully this video was useful. I know the last part might have been a little advanced, but I also wanted to show you how you would do this for text, and I want to mention that there are a bunch of different ways to do it; this is one way, and if you have an alternative way of processing the text that you think is better, please leave a comment. Anyway, thank you so much for watching, and I hope to see you in the next one.
Info
Channel: Aladdin Persson
Views: 52,837
Keywords: tensorflow datasets, loading data tensorflow, tensorflow dataset tutorial
Id: YrMy-BAqk8k
Length: 29min 34sec (1774 seconds)
Published: Sat Aug 22 2020