Reading the MNIST Dataset as a numpy array.

Captions
Hi friends, welcome to Ghosh 4 AI. I am very happy to bring you this video. I have decided to make a few videos on some easy-to-implement machine learning algorithms. I have seen that many people who have just started their research career dive straight into very complicated algorithms like convolutional neural networks, recurrent neural networks, deep learning and that kind of thing, but I believe it is essential to understand some of the core concepts of artificial intelligence and machine learning first. For that there are two approaches. One is the theoretical approach, in which you study books or papers that clarify the basic concepts. The other is to develop your own machine learning algorithms, sometimes from scratch, sometimes using some APIs; the point is to get your hands dirty with coding. The more you code, the more you understand the internal dynamics, what kind of computation is going on behind all these machine learning algorithms and why they actually work. I have also made another video in which I explain in more detail why one should not dive straight into things like deep learning and why these basic concepts are necessary; if you have not seen it, please check the link I have given in the description below. Beyond that, I believe the videos I am going to publish in the future will be helpful for newcomers.

This video is basically about data processing. When somebody is working with machine learning, the first and foremost thing to do is to process raw data. To show you how, I have made this little video in which I will explain how a very popular dataset called the MNIST dataset is organized, how you can extract the information, and how you can represent it in a very easy-to-use data structure. But before all that, please subscribe to my channel if you want to follow along and get updates for the new videos I will be posting, and don't forget to click the bell icon for the notifications.

Anyway, as I was saying, in this video I will be talking about one specific dataset called the MNIST dataset. The MNIST dataset is composed of around 70,000 images of handwritten digits, mainly taken from postal cards, and it is very useful for typical image classification algorithms. It was first released somewhere around 1998, when Yann LeCun published his paper on the first convolutional neural network, and since then it has become one of the benchmark datasets to build algorithms on. Whenever you are trying to build new algorithms, it is very common practice to start with a very simple dataset, and MNIST is one of those simple datasets on which you can build your own algorithms. For that, let me take you to the MNIST website directly. This is the website of the MNIST database; even if you do a basic Google search for the database, it normally comes up within the first couple of results. On this website you will find downloadable links, which will give you
the training images and training labels, and the test set: the test images and the test labels. I will explain a bit more about what these labels are, but before that I will just point you to the downloadable links. You can also see the performance of various algorithms on this dataset, that is, the error rate on the given test set they have provided. Right now I think the best error rate is around 0.23 percent, which is about 99.77 percent accuracy, and that is almost a perfect algorithm. That result is a committee of 35 convolutional neural networks with elastic distortions, but those are much more complicated things, beyond the scope for newcomers; at first one should understand the basics of machine learning and data processing.

As I said, there are sixty thousand examples in the training set and ten thousand examples in the test set, but they have provided only four files; all of those images are encoded in a specific format within these files, and I will show you code with which you can access all that information and also convert it to a favorable data structure. If you study the page more, you can learn a bit about the history of how the dataset was collected, and there is also information about how the files are formatted, which will help you in extracting the pixel intensities, that is, the images, and the labels of the images.

OK, I have used the term labels again and again. Basically the digits are from zero to nine, and each of those digits is a label for an image. So if we have a specific image with, say, a three written on it, the label of that image is three. You have ten labels, 0 to 9, and each image has a label corresponding to it.

Another thing you might want to check out is the paper in which this was first proposed. It is called "Gradient-Based Learning Applied to Document Recognition"; it is the first convolutional neural network ever proposed, and that makes it one of the most important papers in the history of machine learning. If you go to Google Scholar and search for "gradient based learning applied to document recognition", you will see that this paper has almost 13,000 citations, and that is quite big; you don't get that many citations if your paper is not good. So this was one of the first convolutional neural networks ever proposed, and if you scroll down you can also see the figure they have provided of the CNN; they show how they proposed this convolutional network, which was called LeNet-5. But these are much more complicated things which I will be discussing later; this is just for fancy viewing right now. If you scroll down further, there are a lot of statistical tests that have been done on this MNIST data. And here you can see the images: these are 28×28 images, each corresponding to one of the ten English digits, and they are handwritten. As you can understand,
handwritten samples obviously have much more variance than typed samples, and hence this is a much more difficult task, for which convolutional neural networks have proved to be quite efficient compared to the algorithms that came before 1998. That kind of started the deep learning era in computer vision. If you go through the paper you can learn a bit more about it, but right now our focus is on the image dataset.

As I said, these files are basically encoded in a binary format, and if the binary data is traversed one byte at a time, then by using specific offsets, that is, by taking a fixed number of bytes at a time, we can extract specific pieces of information. For example, out of these four files, one of them is the train-labels file; it holds the labels for the training images. If you check the offsets (each offset corresponds to a byte position in the encoded binary string), then starting from offset zero, the first 32-bit integer, that is, the first four bytes, gives the value 2049. That number is called the magic number, and it tells you whether the file is a label file or an image file. The next 32-bit integer, another four bytes, gives the number 60,000, which tells you how many samples there are in the file. After that, every unsigned byte gives one label, and the label values go from zero to nine.

For the image files it is a bit more complicated, because images are 2D matrices of size 28×28. So the first 32-bit integer gives the number 2051, which corresponds to image files; the next four bytes give the number of samples, 60,000; and the third and fourth sets of four bytes give the number of rows and the number of columns. After that, every unsigned byte gives the value of one pixel. The pixel values can be from 0 to 255, where 0 means background and 255 means foreground. The same format is followed for the test-set label file and the test-set image file.

Now, this is not a very programming-friendly data structure, we can say. In programming we are normally used to data structures like arrays, lists, or dictionaries, but this is just a string of binary numbers, so we have to extract these values and organize them in a more suitable data structure.
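To make those byte offsets concrete before we get to the notebook, here is a minimal sketch that reads just the two headers with Python's struct module. The filenames are the standard extracted names from the MNIST site; treat the paths as assumptions and adjust them to wherever you keep the files:

```python
import struct

# Read the label-file header: two big-endian 32-bit unsigned ints.
with open("train-labels-idx1-ubyte", "rb") as f:
    magic, n_items = struct.unpack(">II", f.read(8))
print(magic, n_items)                      # expect 2049 and 60000

# Read the image-file header: magic, sample count, rows, columns.
with open("train-images-idx3-ubyte", "rb") as f:
    magic, n_items, n_rows, n_cols = struct.unpack(">IIII", f.read(16))
print(magic, n_items, n_rows, n_cols)      # expect 2051, 60000, 28, 28
```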
For that I have written some code, which you can find in the repository corresponding to the name of this channel. In my future videos I will be posting many codes, and I will always keep copies in this repository on GitHub, so I would like you to follow it in the future as well to find new code for new implementations; I think many newcomers and new researchers will find a lot of help there. The name of the repository owner is the name of the channel, that is, Ghosh 4 AI, and if you go to the user page you can find this repository. I have only one repository right now, because I have just started the channel. We go into the data-processors folder, and here we have an MNIST loader .ipynb file. This is an IPython notebook file; IPython notebooks are basically an IDE for coding in Python, a nice IDE in which you can do Python programming in a very structured way. It is very good for tutorials, because you can have nicely formatted blocks of text in between the lines of code, and that makes it very easy to work with.

So what we have to do is clone this repository into a local directory, and then we can run it from there. If we click on Clone here, we get a downloadable link; we copy it, go to the folder where we keep our code, open a terminal in that location, and clone the repository with the command git clone followed by the link we copied. It takes some time, and yes, now it is copied, so we have the IPython notebook file in which I have written all the code for loading the MNIST dataset. We open a terminal in this location again and open our notebook with the command jupyter notebook; this is the command for opening the Jupyter notebook IDE, and it opens the IDE in the browser, which is a very cool thing to have.

When we open the notebook, as I told you, this is a very nice IDE in which you can have blocks of code along with nicely formatted text to help you understand the code, so it is very good specifically for tutorials. Here we have a data loader for the MNIST dataset, and I have also given those four links which I showed you on the website, so you can download the files directly from there; but I have also provided a little piece of code which can download the data for you. I can explain the code, but it is mostly commented; if you follow the comments it should be easy to understand. Here you give a directory into which you want to download the data; you have to keep this in mind, because this directory will be needed for the other parts of the code as well. I create the directory if it does not exist, we have the URLs to download the data, and for each URL I call a function called urlretrieve, which downloads the files for me. If I now run this cell (you can use Shift+Enter to run), it will download, not the images themselves, but the four archive files.

Now I can check whether the files have been downloaded properly. The location was ../../data/ followed by the MNIST dataset folder; .. means go to the parent folder, so we go up two folders, into a folder called data, and there we have these four archives. You can extract these archives manually using the graphical user interface, or you can use this little piece of code that I have written here, which extracts them very fast. So let me run this code; it extracts all the files from the archives, and there is also optional code for removing the archives afterwards, because we don't need them anymore. The next cell is basically for formatting the IPython notebook; it just aligns the tables to the left-hand side. I did that so it looks nicer, but it is not necessary.
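Putting the download-and-extract cells together, a minimal sketch of that step might look like this. The archive names and base URL are the standard ones from the MNIST site, but the target directory is just an example; adjust it to your own layout:

```python
import gzip
import os
import shutil
from urllib.request import urlretrieve

data_dir = "../../data/MNIST_dataset"      # example path; change as needed
os.makedirs(data_dir, exist_ok=True)       # create the directory if missing

base = "http://yann.lecun.com/exdb/mnist/"
files = ["train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz",
         "t10k-images-idx3-ubyte.gz", "t10k-labels-idx1-ubyte.gz"]

for name in files:
    archive = os.path.join(data_dir, name)
    urlretrieve(base + name, archive)      # download one .gz archive
    with gzip.open(archive, "rb") as src, \
         open(archive[:-3], "wb") as dst:  # strip the ".gz" suffix
        shutil.copyfileobj(src, dst)       # write the extracted ubyte file
    os.remove(archive)                     # optional: remove the archive
```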
I have provided the file descriptions here so that we can keep track of how to traverse the binary sequence. For labels: four bytes of magic number, then four bytes that give the number of samples, and after that every byte gives one label. For the images: four bytes of magic number, four bytes for the number of samples, then four bytes for the number of rows and four bytes for the number of columns, and after that every byte gives the intensity of one pixel.

Now you may be wondering: an image is 28×28, but we get one pixel at a time, so how does that work? Basically, a 28×28 image is 784 values organized in two dimensions. If we read all the pixels of the 28×28 image in row-major format, one row at a time, we get 784 pixels in a straight line, and that is how they are written: all 784 pixel values from first to last, for every sample.

Now what we will do is convert these ubyte files into something called numpy arrays. NumPy is an API for Python, and to anybody who is starting their machine learning journey, especially using Python, I would suggest learning a bit about NumPy. It is not a good thing to go and learn APIs from scratch for their own sake; these things should be learned on a necessity basis; but it is good to have some basic knowledge of APIs like NumPy, SciPy and Matplotlib, some basic APIs that you will need in your machine learning career, and you will learn them as you go. Let me show you the NumPy documentation, just in case. If you go to the NumPy basics section, there are topics like which data types are used in NumPy, or how to create an array in NumPy, and many more things you can study. NumPy basically gives you tools to do nice mathematical operations which are very essential in machine learning; but again, I am deviating from the topic.

So basically we will convert the ubyte files into numpy arrays for easy processing, and we will create a dictionary. In a dictionary you can index the values using a string. In this dictionary there are four keys, called train_images, train_labels, test_images and test_labels, and for each of those keys I will have a numpy array. train_images will give me a numpy array of shape 60,000 × 28 × 28, which means there are 60,000 images of size 28×28; train_labels will give me the corresponding labels of the images, a simple linear array of length 60,000; and the same for test_images and test_labels, where we have 10,000 images and 10,000 labels.

This next piece of code converts the ubyte files into numpy arrays. For every one of those four files, if the filename ends with "ubyte" (because, you see, the files end with this string), I read the file and then extract the information a few bytes at a time. The first four bytes give me the type, that is, the magic number if you remember, which can be 2049 or 2051. Pardon me, there is a mistake here in the notebook text: this should be 2051; 2049 is for labels and 2051 is for images. Just to confirm, if we go to the MNIST site, we can see that yes, 2049 is for label files and 2051 is for image files, and in the test set as well, 2049 is for the label file and 2051 is for the image file. So pardon my mistake in this documentation; I will fix it at the end of this video.
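As a quick aside, here is a toy illustration of the row-major layout described above. The values are synthetic, not real MNIST pixels, but the fold from 784 flat values back into a 28×28 grid is the same:

```python
import numpy as np

# 784 stand-in "pixel" values, written first-to-last as in the MNIST files.
flat = np.arange(784)
img = flat.reshape(28, 28)    # numpy reshapes in row-major (C) order by default
print(img[0, :5])             # first five pixels of row 0: [0 1 2 3 4]
print(img[1, 0])              # first pixel of row 1: 28
```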
So, in the code we store the type, that is, the magic number, in a variable called type, and then the next four bytes give us the length of the dataset, the number of samples. Now, if the type is 2051, that means it is an image file, so the category is images, and the next two sets of four bytes give me the number of rows and the number of columns; after that, every byte gives me a pixel value. If you look at the code, I treat whatever I have read from the file as a buffer, and from that buffer I read one uint8 at a time, starting from an offset of 16. If you count, the number of columns ended at index 15: bytes 12, 13, 14 and 15 gave me the number of columns, so starting from the 16th byte I take one byte at a time, each giving the intensity of one pixel at a specific location. So I extract uint8-format integers from the buffer starting at an offset of 16 and store them in a variable called parsed. parsed has now basically become a numpy array in which all the pixel values are organized as one long chain, and we reshape parsed to get the required shape, that is, 60,000 × 28 × 28 or 10,000 × 28 × 28 depending on whether it is train or test: we reshape it as (length, number of rows, number of columns).

If the type was 2049, then it is basically labels, and for that we simply extract one uint8 at a time starting from an offset of 8, each of them giving me one label; and this line just makes sure that the shape of the array is exactly a linear array of 60,000 or a linear array of 10,000. Now, if the length is 10,000 then it is the test set, and if the length is 60,000 then it is the train set, and I put the numpy array we have just created into the data dictionary, data_dict. If you look at how the keys are organized: the first part is either train or test, that is, the set, and the second part after the underscore is images or labels, which is the category. Given those, I generate the string set_category, and at that key I put the numpy array that was just generated. After this loop is complete, we have something called data_dict, in which we have all the information, the pixel intensities and the labels, in a very nicely usable data structure.
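Condensed into one place, the parsing logic just described might look roughly like this sketch. The magic-number branching and byte offsets follow the format above; the folder path is an assumption:

```python
import os
import numpy as np

data_dir = "../../data/MNIST_dataset"       # assumed location of the ubyte files
data_dict = {}

for fname in os.listdir(data_dir):
    if not fname.endswith("ubyte"):
        continue                             # skip anything that is not a data file
    with open(os.path.join(data_dir, fname), "rb") as f:
        buf = f.read()                       # treat the whole file as one buffer
    magic = int.from_bytes(buf[0:4], "big")  # 2049 = labels, 2051 = images
    length = int.from_bytes(buf[4:8], "big") # number of samples
    if magic == 2051:                        # image file
        rows = int.from_bytes(buf[8:12], "big")
        cols = int.from_bytes(buf[12:16], "big")
        parsed = np.frombuffer(buf, dtype=np.uint8, offset=16)
        parsed = parsed.reshape(length, rows, cols)
        category = "images"
    else:                                    # magic == 2049: label file
        parsed = np.frombuffer(buf, dtype=np.uint8, offset=8)
        parsed = parsed.reshape(length)      # simple linear array of labels
        category = "labels"
    set_name = "train" if length == 60000 else "test"
    data_dict[set_name + "_" + category] = parsed

print(sorted(data_dict.keys()))
# ['test_images', 'test_labels', 'train_images', 'train_labels']
print(data_dict["train_images"].shape)       # (60000, 28, 28)
```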
So let me run this. Hmm, an error. OK, I think I need to change this numpy to np; this has to be np, yes. The problem was that when I was making this code I would have written a line like import numpy as np, then probably made some changes here and removed the line, but forgot to update some other places where it was still np. This should work now. Yes, so we have read all four files and we now have the data structure called data_dict.

We can check that, of course; let me just insert a cell below. data_dict is a dictionary, so let me check its keys with data_dict.keys(): the keys are test_images, test_labels, train_images and train_labels, and each of these represents a numpy array. So let me show you, say, test_images; this should be a numpy array. Now, if I print the array itself, it is huge and will fill the whole screen, so let me just print the shape of the numpy array for you. You can see test_images gives me a numpy array of shape 10,000 × 28 × 28. If I want to show you the values corresponding to one image, I take the 0th sample and keep all the 28×28 values. You see, in this array most of the values are 0, which is the background, and in some places we have values ranging up to 255, which are the parts of the digit. Let me delete this cell for now.

OK, fine: we have this dictionary which can be used in many machine learning algorithms in the future, but right now maybe we want to see the images in a directory. Of course, these are just values, and a human cannot perceive them as images until they are shown as images. To convert these values to images, we have a little piece of code here in which, for every set, train or test, I go through all the samples, read each image, and save it in a folder specific to the class of the image: all the images that represent 0 go inside a folder called 0, all the images of 1 go inside a folder called 1, and so on. To do that, we have the two numpy arrays called images and labels; notice that for the training set, train_images is 60,000 × 28 × 28 and train_labels has length 60,000, so for every 28×28 image I have a label.

If you look at this part of the code in more detail, you will see I have the data path where I am going to save; after that comes the set, which is either train or test, so there will be a folder named train or test; after the slash I have the label, which can be 0 to 9, a directory that gives me the name of the class; and finally, following the same path structure, data path plus set plus label, I name each image with a five-digit number and save it with a .png extension using a function called imsave. Now imsave is a function from another API, skimage.io. scikit-image is an API which is useful for many image processing tasks, and you can look up its documentation on Google if you want: the input/output package skimage.io has functions like imread, imsave, imshow and many others, and the root package of skimage (scikit-image) has lots of functions for image processing. For example, the color module has many functions that convert between color formats; you can do a lot of image processing with this skimage API.

Anyway, let me run this program, and if I am not wrong this might take a bit of time. Yep, it has started to save all the images. Right now it is at about 6,000; it will go up to 60,000, and then it will do the test set, where there are 10,000 images as well. It will take a few seconds, so let me check the images in the meantime.
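For reference, here is a condensed sketch of that export loop. data_dict comes from the parsing step above; the root path and the exact filename pattern are assumptions:

```python
import os
from skimage.io import imsave

data_path = "../../data/MNIST_dataset"       # assumed root for the image folders

for set_name in ["train", "test"]:
    images = data_dict[set_name + "_images"]
    labels = data_dict[set_name + "_labels"]
    for i, (image, label) in enumerate(zip(images, labels)):
        out_dir = os.path.join(data_path, set_name, str(label))
        os.makedirs(out_dir, exist_ok=True)  # e.g. .../train/3/
        out_file = os.path.join(out_dir, "%05d.png" % i)
        imsave(out_file, image)              # save one 28x28 uint8 image as PNG
```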
You can see that under the MNIST dataset folder there should be a folder called train, and after some time there should be another called test, because in the program, if you see, it is still the train images that are being written. Inside the folder called train I have one folder for every class, from zero to nine; inside 0 I have all the images that correspond to zero, and the same for 1, and so on. We have gone through about half of it already, around thirty thousand.

There can be a question: why am I saving in this directory structure, that is, why do I have a folder for train and then folders corresponding to the classes, and not some other layout? The thing is, when I develop deep learning algorithms, there is a special function in the PyTorch API that I use which allows me to directly load images when they are stored in a folder structure like this, images sorted into folders according to their classes; that function lets me use those images directly and pass them into neural networks, and that is why I like this folder structure very much. You can obviously store them in different folder structures as you need, or as the algorithm requires, but this one works pretty well. In any case, when you are working with algorithms you will never really work with the image files themselves; you will always convert the image files into a numpy array, or some kind of array of pixel values, and work with those values. The train folder is almost complete now; yes, and the test folder is also complete. So we have all the images extracted into the folders train and test, inside which there are the class folders for every image, and now you have everything ready for working with machine learning algorithms.

One more thing. You may be thinking this is kind of a tiresome way to deal with arrays and images: if in the future we need to generate this array again, we either have to parse the ubyte files again or read back from the saved images. Another way is to use something called pickle. With pickle I can dump, that is, save, an entire Python object or data structure as a single file, and when I load that file back into the Python environment, it gives me the exact same data structure that I had when saving the pickle file. It works with very simple commands: we have the data path, and we know that data_dict is the dictionary we want to save, so we open a file called MNIST_data.pkl (.pkl is the default pickle file extension) in binary write mode, and using this file pointer we call the function pickle.dump; we pass our dictionary, or array, or list, or whatever object we have, along with the file pointer, and it creates a file with this name in the location we provided. We can load it back in a similar way, using a command called pickle.load, which returns the same data structure, and we can store it in another variable as we want. So let me run this for you.
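The save/load round trip amounts to this minimal sketch; the filename MNIST_data.pkl and the path are assumptions based on what is said above:

```python
import pickle

pkl_path = "../../data/MNIST_dataset/MNIST_data.pkl"   # assumed name and location

with open(pkl_path, "wb") as f:
    pickle.dump(data_dict, f)        # serialize the whole dictionary to one file

with open(pkl_path, "rb") as f:
    new_dict = pickle.load(f)        # get back the exact same data structure

print(new_dict.keys())
print(new_dict["test_images"].shape)  # (10000, 28, 28)
```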
You see, this is the single file called MNIST_data.pkl that we have just created, and we have loaded it back and stored it in something called new_dict. If I print new_dict.keys(), it has the same keys that were there before. We can also check, say, new_dict['test_images'].shape; it is still the same shape, 10,000 × 28 × 28, which I showed you before. So this is also a very healthy way of saving large data structures, because after all it is just a single file, and it is much less hassle if you have to copy it from place to place; you have a much smaller number of inodes in your directories.

Having said all that, I really hope this gives you some insight into how to work with raw data and how to convert raw data into a data structure that can be used for machine learning algorithms. I will use this data structure, and this MNIST dataset as well, in some of my future videos, in which I will be explaining some basic machine learning algorithms, some classification or clustering, things like those. If you want to see those, again, please subscribe to the channel so that you get the updates, and also don't forget to click the bell icon for the notifications. If you liked this video, drop a like, or just comment if you have something to say, or if you have any doubt or want to ask something, feel free to comment below. With all that, have a very good day. Thank you so much. Bye bye.
Info
Channel: Ghosh 4 AI
Views: 14,824
Rating: 4.6051502 out of 5
Keywords: numpy, pickle, MNIST, CNN, RNN, image processing, dataset, data, image, handwritten, digit, character, OCR, deep learning, machine learning, data mining
Id: 6xar6bxD80g
Length: 39min 55sec (2395 seconds)
Published: Wed Aug 22 2018