Any data science project starts with data collection. AtliQ Agriculture has three options for collecting data. First, we can use ready-made data: we can buy it from a third-party vendor or get it from Kaggle. Second, we can have a team of data annotators whose job is to collect these images from farmers and annotate each one as a healthy potato leaf or as having early or late blight disease. These annotators can work with farmers, maybe go to the farmers' fields, ask the farmers to take pictures or take the pictures themselves, and then classify them, either with the farmer's help or using their own domain knowledge, into diseased versus healthy potato plants. This option is expensive; it requires budget, so you have to work with your stakeholders to get that budget approved, and it can be time-consuming as well. The third option is that data scientists write web-scraping scripts that go through different websites containing potato images and collect them, and then use a tool such as doccano (there are many annotation tools available) to annotate the data, or collect images that are already annotated.

In this project we are going to use ready-made data from Kaggle. We will use this Kaggle dataset for our model training; you can click the download button. It is a download of a few hundred megabytes, and it contains images not only for potato disease classification but for tomato and pepper classification as well. We are going to ignore those and focus only on the three potato directories. I had already downloaded this zip file previously; when I right-click and choose "Extract All" I get this folder. It originally contained the tomato and pepper directories too, but I deleted those manually, and I ask you to do the same: go in, delete every directory except these three, and then copy this directory into your project directory.

For the project directory, I have a "code" folder on my C: drive, and in it I am going to create a new folder called "potato-disease". I want all of you to practice this code along with me. If you just watch my video it is a waste of your time; practice as you watch, only then is it useful. That is the best advice anyone can give you. Inside the project folder I create a new folder called "training". Then I launch Git Bash, which lets me run Unix-style commands (you can use the Windows command prompt as well), and run `python -m notebook`, which launches Jupyter Notebook. In Jupyter I navigate to the potato-disease folder, go into training, and create a new Python 3 notebook; give it some name, for example "training". The purpose of this video is to load the dataset into a tf.data input pipeline, do some data cleaning, and make our dataset ready for model training. So the first thing I am going to do is import some essential modules.
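The exact import cell isn't spelled out word for word here, but a minimal set of modules this notebook relies on would look roughly like this (a sketch; the aliases are just the usual conventions):

```python
# Core libraries used throughout this notebook
import tensorflow as tf
from tensorflow.keras import layers   # preprocessing layers used later on
import matplotlib.pyplot as plt       # for visualizing images
import numpy as np                     # for inspecting pixel arrays
```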
In my Downloads folder I have this PlantVillage directory, so I do Ctrl+C and Ctrl+V to copy it into the same folder where I am running my notebook (the folder containing the .ipynb file). Now I have this directory, and if you open the early blight folder there are a thousand images, and on each of them you can see these black spots, which show that the potato plant has some kind of disease. The healthy leaves are just that: there are no black spots and they look pretty good. The third class, late blight, is a little more deteriorated; look at those leaves, they look pretty bad. So we have all this data in our directory, and now I am going to load these images into a tf.data.Dataset.

If you don't know about tf.data.Dataset, pause this video right now, go to YouTube, search for "tensorflow data input pipeline", and you will find my video on it; watch it, it will clarify the concepts. The basic purpose of tf.data.Dataset is this: you have all these images on your hard disk, and because there could be a huge number of them, you read them in batches into this dataset structure; once they are there you can call .filter, .map, and so on, and do amazing things with them. I will now assume that your concepts around tf.data are clear, and we can load the data using the tf.keras.preprocessing.image_dataset_from_directory API. If you search for "tensorflow image_dataset_from_directory" you will find its documentation: you point it at a main directory whose subdirectories are your classes and contain the images, and this one call loads all of them into a dataset.

So the first argument is the directory, and our directory name is "PlantVillage"; that is our data directory. Then I pass shuffle=True so that it randomly shuffles the images as it loads them, and then the image size. If you open these directories and check any image, you see it is 256 by 256; all of them are (you can verify that). I will store 256 in a constant, because I need to refer to it later, so IMAGE_SIZE is 256, and BATCH_SIZE is 32, which is a fairly standard batch size; I initialize both as constants and store the result into a variable called dataset. When I run this it says it loaded 2,152 files belonging to 3 classes. Which three classes? You can call dataset.class_names; I will store that in a variable so I can refer to it later. Basically your folder names are your class names, and these are the three folder names.
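Roughly, the loading cell looks like this (a sketch; the constant names IMAGE_SIZE and BATCH_SIZE and the variable name dataset are simply my choices):

```python
IMAGE_SIZE = 256
BATCH_SIZE = 32

# Load every image under PlantVillage/ into a tf.data.Dataset of (image, label)
# batches; the sub-folder names become the class names.
dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "PlantVillage",
    shuffle=True,
    image_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE,
)
# Prints something like: Found 2152 files belonging to 3 classes.

class_names = dataset.class_names
print(class_names)  # the three folder names
```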
The first folder has a thousand images, the second has 152, and the third has a thousand, so 2,152 in total. But if I do len(dataset) it shows 68. Do you have any clue why? Pause the video and think about it. It is because every element in the dataset is actually a batch of 32 images, and 68 × 32 is a little more than 2,152, because the last batch is not full; but now you see why the length is 68.

Let's explore this dataset. I will write: for image_batch, label_batch in dataset.take(1). When you do take(1) it gives you one batch, and one batch is 32 images. I will print the shape of the image batch and label_batch.numpy(); every element you get back is a tensor, so you convert it to NumPy (if that concept is unclear, refer to the video I mentioned earlier). You will find there are 32 images, each 256 by 256 by... do you know what the 3 is? You're smart: it's RGB, the three color channels, and I am going to store that in a constant as well so I can refer to it a little later. The label batch, as you have already realized, contains 0, 1, and 2; so there are three classes, one number per class.

If you want to print an individual image, take the first image of the batch. You will see it is a tensor; convert it to NumPy and you get a 3D array where every number is between 0 and 255, because that is how color intensity is represented. If you take the shape of that single image you get 256 by 256 by 3. Now let's visualize it. I can use plt.imshow from matplotlib; imshow expects a 3D array, and my 3D array is that first image. There is a small problem because the values are floats, so I convert them to integers and now it works. I don't care about the axis numbers, so I turn the axis off. By the way, every time I re-run this I see a different image, because the dataset shuffles the data. Now I want to display the label, i.e. which class the image belongs to, so I use plt.title. The label batch only gives you the numbers 0, 1, 2, so how do you get the actual class name? We have class_names, so you use the label as an index into it; I hope you are getting the point. See, "Potato early blight". Now I want to display several of these images, so I will run a for loop over, say, the first 12 images of the batch. If you run it as is, it only shows one image, because you need to make subplots.
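A sketch of that exploration cell, assuming the constants and class_names defined above:

```python
CHANNELS = 3  # RGB channels

# Peek at one batch: 32 images of shape 256 x 256 x 3, plus 32 integer labels.
for image_batch, label_batch in dataset.take(1):
    print(image_batch.shape)      # (32, 256, 256, 3)
    print(label_batch.numpy())    # values are 0, 1 or 2 -- indexes into class_names

    # Show the first image of the batch with its class name as the title.
    plt.imshow(image_batch[0].numpy().astype("uint8"))
    plt.title(class_names[label_batch[0].numpy()])
    plt.axis("off")
```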
plt.subplot(3, 4, ...) gives you something like a 3-by-4 matrix of plots, and when I do that it shows all the images, but the dimensions are messed up, so I increase the figure size to 10 by 10 and now it shows all the images beautifully: this is a healthy leaf, this is early blight, this is late blight, and so on.

Now we are going to split our dataset into train, validation, and test sets. The dataset length is 68; the actual number of images is roughly 68 × 32, because each element is a batch of 32. We will keep eighty percent of the data as training data, and of the remaining twenty percent we will use ten percent for validation and ten percent for testing. The validation set is used during the training process: after each epoch we validate on that ten percent. Let me define the epochs: I am going to run 50 epochs. This is trial and error; it could be 20 or 30, but let's say 50, and at the end of every epoch we use the validation dataset to do the validation. Once we are done with all 50 epochs and have the final model, we use the remaining ten percent, the test dataset, to measure the accuracy of our model before we deploy it into the wild.

Now, how do you get this split? In scikit-learn we have the train_test_split method if you do classical machine learning, but we don't have that in TensorFlow, so we are going to use dataset.take. If you do dataset.take(10), it takes the first 10 elements. Our train split is 0.8, because it is 80%, and the length of our dataset is 68; 80% of 68 is 54, so I take the first 54 batches (each batch is 32 images, so it is simpler to think in batches) and call it the train dataset. If you check its length (I hope you are practicing along with me) you get 54. Then dataset.skip(54) skips the first 54 batches and gives you the remaining 14; if you know a little Python, skip(54) is like the slicing operator [54:] on a list and take(54) is like [:54], so this should be clear. That remaining part is not actually the test dataset yet: it is the remaining 20%, which we still need to split into validation and test. So temporarily I store it as a test dataset; it has 14 batches. My validation size is 10%, and 10% of 68 is 6, so I take 6 batches from it and that gives me the validation dataset, and then skip(6) on the same thing gives me the actual test dataset. So we have just split our data into train, validation, and test datasets.
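Putting the image grid and the manual split together, a sketch looks roughly like this (variable names such as train_ds, val_ds and test_ds are my choices):

```python
EPOCHS = 50  # trial and error; 20 or 30 would also be reasonable

# Show 12 images from the first batch in a 3 x 4 grid.
plt.figure(figsize=(10, 10))
for image_batch, label_batch in dataset.take(1):
    for i in range(12):
        plt.subplot(3, 4, i + 1)
        plt.imshow(image_batch[i].numpy().astype("uint8"))
        plt.title(class_names[label_batch[i].numpy()])
        plt.axis("off")

# Manual 80/10/10 split with take() and skip(); len(dataset) is 68 batches here.
train_size = int(len(dataset) * 0.8)   # 54 batches
train_ds = dataset.take(train_size)    # first 54 batches for training

rest = dataset.skip(train_size)        # remaining 14 batches
val_size = int(len(dataset) * 0.1)     # 6 batches
val_ds = rest.take(val_size)           # 6 batches for validation
test_ds = rest.skip(val_size)          # 8 batches for test
```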
Now, the code I just wrote uses hard-coded numbers; it is only a prototype, so let's wrap all of it into a nice-looking Python function. The goal of this function is to take the TensorFlow dataset and the split ratios; if you don't supply anything, by default it uses 80% train, 10% validation, 10% test. I am also adding a shuffle flag (I will explain why) and a shuffle size of 10,000; if you don't know what the shuffle buffer size means, again watch the other video I referred to, it is important. At the end the function returns the three datasets, so we are doing exactly what we did before, just packaged as a reusable Python function.

First the dataset size, which is len(ds). My train size is the train split times that, so 80% of it, converted to an integer because I don't want a float, and the validation size is computed the same way. The train dataset is then what we did previously, ds.take(train_size); ds.skip(train_size) gives the remaining 20% of the batches, from which you again take the validation size to get the validation dataset, and doing the same thing but with skip gives you the test dataset. I hope that is clear. We also have the shuffle argument: if shuffle is true, I shuffle the dataset before we do the split. The seed is just for reproducibility; if you use the same seed every time you get the same result, and the number itself can be anything, 5, 7, whatever. My function is ready, and I can now call it on my dataset, the one into which we read all the images, to do the train/test split. It runs very fast, and I can confirm the sizes of my train, validation, and test sets; they come out to be exactly what we expect.

Once again, if you have seen my video on the TensorFlow data input pipeline, you will have understood the concepts behind caching and prefetching, and that is what we are going to do here. On the training dataset we first call cache(): it reads an image from disk, and for the next iteration that needs the same image it keeps it in memory, which improves the performance of the pipeline. Then shuffle(1000) shuffles the images again (the buffer can be smaller than a thousand as well; watch that video to understand exactly how it works). And then prefetch: if you are training on a GPU, while the GPU is busy training on one batch, prefetch lets the CPU load the next batch from disk, which again improves performance. If you look at my deep learning playlist I have a dedicated prefetch-and-cache video, and I can quickly show you the idea.
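Here is roughly how that function and the pipeline optimizations come together (the function name, the seed value, and tf.data.AUTOTUNE, which is tf.data.experimental.AUTOTUNE on older TensorFlow versions, are details that may differ slightly from what is on screen):

```python
def get_dataset_partitions_tf(ds, train_split=0.8, val_split=0.1, test_split=0.1,
                              shuffle=True, shuffle_size=10000):
    """Split a batched tf.data.Dataset into train / validation / test partitions."""
    ds_size = len(ds)

    if shuffle:
        # Shuffle before splitting; a fixed seed keeps the split reproducible
        # (the seed value itself can be any number).
        ds = ds.shuffle(shuffle_size, seed=12)

    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)

    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size).skip(val_size)

    return train_ds, val_ds, test_ds


train_ds, val_ds, test_ds = get_dataset_partitions_tf(dataset)

# Optimize the pipelines: cache decoded images in memory, reshuffle each epoch,
# and prefetch upcoming batches while the GPU is busy training.
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
test_ds = test_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
```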
Usually, when you are loading batches of, say, 32 images at a time and training on a GPU (I have a Titan RTX), the CPU sits idle while the GPU is training, and then the GPU sits idle while the CPU reads the next batch; in that example it takes around 12 seconds. If you use prefetch and caching, while the GPU is training on batch one the CPU is already loading the next batch; that is prefetch. Cache is about re-reading: during the second epoch you would normally read the same images from disk again, but with cache that read block disappears, so you save the time spent reading those images. If you search for "codebasics deep learning tutorial" you will find the two videos I am referring to, and I will link them as well.

Back to the tutorial. That is exactly what I am doing here, and I am letting TensorFlow determine how many batches to load while the GPU is training. My validation and test datasets use the same pattern, and now these datasets are optimized for training performance, so training will run fast.

Next we need to do some preprocessing. If you have worked on any image processing, you know the first thing we do is scaling: the NumPy array we saw earlier contains values between 0 and 255 (the RGB scale), and we want to divide by 255 so every value lands between 0 and 1. The way you do that is tf.keras.Sequential, to which I supply my preprocessing layers, and rescaling is done with the Rescaling layer. Don't worry about the "experimental" in the module path, by the way; I actually had a conversation with the TensorFlow folks about this and it is stable. Rescaling(1.0/255) scales the image, and we will supply this layer when we actually build our model. We need one more thing, which is resizing: we resize every image to 256 by 256. You will immediately ask: our images are already 256 by 256, so why resize? Because this layer will eventually go into our final model, and once we have a trained model making predictions, if you supply an image that is not 256 by 256 but some other dimension, this layer will take care of resizing it. That is essentially the idea here.
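A minimal sketch of that resize-and-rescale layer (the variable name is my choice; on newer TensorFlow versions these layers also exist outside the experimental namespace):

```python
# Resize + rescale packaged as model layers, so that at prediction time any
# incoming image is resized to 256 x 256 and its pixels scaled to [0, 1].
resize_and_rescale = tf.keras.Sequential([
    layers.experimental.preprocessing.Resizing(IMAGE_SIZE, IMAGE_SIZE),
    layers.experimental.preprocessing.Rescaling(1.0 / 255),
])
```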
Once we have created this layer, one more thing we are going to do in terms of preprocessing is data augmentation, to make our model robust. Say you train a model on some images, and at prediction time someone supplies an image that is rotated or has a different contrast; your model will not perform well. For that we use the concept of data augmentation. If you search YouTube for "tensorflow data augmentation" you will find my video on it; you should watch it. The idea is that from one original image in your training dataset you create, say, four new training samples by applying different transformations: a horizontal flip, increased contrast, rotations, and so on. You take the same image, apply some filter or transformation, and generate new training samples; then you use all five images (the original plus the four new ones) for training, so that tomorrow, when the model is predicting in the wild and someone gives it a rotated image, it knows how to handle it. That is the idea behind data augmentation, and as you have seen in that video, TensorFlow provides nice APIs for it: again you create a couple of layers, and I am going to apply a random flip and some random rotation (watch that video, or the other one I mentioned, for a clear understanding). That is my data augmentation layer, which I store in a variable; a minimal sketch of it appears at the end of this section. The resize-and-rescale layer and these augmentation layers will all ultimately be used inside the actual model.

That is all I had for this video; in the next one we are going to build the model and train it. To summarize: we loaded our data into a TensorFlow dataset, did some visualization, did the train/validation/test split, and then did some preprocessing. We have not completed the preprocessing, by the way; we only created the layers for it, and we will use those layers in our actual model. I hope you are liking this series and are excited for the next video, where we will actually train the model; it is going to be a lot of fun. If you are liking this series, please share it with your friends and give it a thumbs up. When you give it a thumbs up it helps me with the YouTube ranking, and this project can reach more people who are trying to learn; and since the learning on YouTube is free, the least you can do is give it a thumbs up. Give it a thumbs down if you don't like it, I don't mind, but if you do, please leave a comment so that I can improve. Thank you for watching.
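For reference, here is the minimal sketch of the augmentation layer described above (the flip mode and the 0.2 rotation factor are example values):

```python
# Data augmentation as model layers: random flips and rotations create extra
# training variety so the trained model copes with rotated or mirrored leaves.
data_augmentation = tf.keras.Sequential([
    layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
    layers.experimental.preprocessing.RandomRotation(0.2),
])
```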