Lesson 3 - Deep Learning for Coders (2020)

Captions
So hello, and welcome to Lesson 3 of Practical Deep Learning for Coders. We were looking at getting our model into production last week, so we're going to finish that off today, and then we're going to start to look behind the scenes at what actually goes on when we train a neural network. We're going to look at the math of what's going on, and we're going to learn about SGD and other important things like that. The order is slightly different to the book: there's a part in the book which says "Hey, you can either go to lesson 4 or lesson 3 now and then go back to the other one afterwards", so we're doing lesson 4 and then lesson 3 — Chapter 4 and then Chapter 3, I should say. You can take them in whichever order you're interested in. Chapter 4 is the more technical chapter about the foundations of how deep learning really works, whereas Chapter 3 is all about ethics, so in the lessons we'll do that next week. So we're looking at the 02_production notebook, and we're going to look at the fastbook version (in fact everything I'm looking at today will be in the fastbook version). Remember last week we had a look at our bears, and we created this DataLoaders object by using the DataBlock API, which I hope everybody's had a chance to experiment with this week — if you haven't, now's a good time to do it! We kind of skipped over one of the lines a little, which is this item_tfms. So what this is doing here, when we said Resize: the images we downloaded from the internet were lots of different sizes and lots of different aspect ratios — some are tall, some are wide, some are square, some are big, some are small. When you say Resize as an item transform, it means each item (an item in this case is one image) is going to be resized to 128x128 by squishing or stretching it. And you can always say show_batch to see a few examples, and this is what they look like. Squishing and stretching isn't the only way that we can resize. Remember, we have to make everything into a square before we get it into our model — by the time it gets to our model, everything has to be the same size within each mini-batch, so that's why. Making it a square is not the only way to do that, but it's the easiest way and it's by far the most common way. Another way to do this is to create another DataBlock object — we can make a DataBlock object that's an identical copy of an existing DataBlock object, where we then change just some pieces, and we do that by calling the "new" method, which is super handy. So let's create another DataBlock object, this time with different item_tfms where we resize using the "Squish" method. We have a question: what are the advantages of having square images versus rectangular ones? That's a great question. Really, it's simplicity. If you know all of your images are rectangular with a particular aspect ratio to start with, you may as well just keep them that way. But if you've got some which are tall and some which are wide, making them all square is the easiest approach. Otherwise you would have to organize them such that all of the tall ones ended up in one mini-batch and all the wide ones ended up in another mini-batch, and then you'd have to figure out what the best aspect ratio for each mini-batch is — we actually have some research that does that in fastai2, but it's still a bit clunky. I should mention...
Okay, I just lied to you — the default is not actually to squish or stretch: the default when we say Resize (I should have said, sorry) is actually just to crop out the center. So all we're doing is grabbing the center of each image. If we want to squish or stretch, we can add the ResizeMethod.Squish argument to Resize, and you can now see that this black bear is looking much thinner, but we have kept the leaves that are around on each side, for instance. Another question: when you use the dls.new method, what can and cannot be changed — is it just the transforms? So it's not dls.new, it's bears.new — we're not creating a new DataLoaders object; we're creating a new DataBlock object. I don't remember off the top of my head exactly what can be changed, so check the documentation, and I'm sure somebody can pop the answer into the forum. So you can see when we use Squish that this grizzly bear has got pretty wide and weird-looking and this black bear has got pretty thin and weird-looking, and it's easiest to see what's going on if we use ResizeMethod.Pad. What Pad does, as you can see, is it just adds some black bars on each side. So you can see the grizzly bear was tall, so when we stretched it (squishing and stretching are opposites of each other) it ended up wide, and the black bear was originally a wide rectangle, so it ended up looking thin. You don't have to use zeros — zeros means pad it with black. You can also say "reflect", which reflects the edge pixels, and things generally look a bit better that way. All of these different methods have their own problems. The pad method is the cleanest: you end up with the correct size and you keep all of the pixels, but you also end up with wasted pixels, so you end up with wasted computation. The squish method is the most efficient because you get all of the information and nothing's wasted, but on the downside your neural net is going to have to learn to recognize when something's been squished or stretched — and in some cases it wouldn't even be able to know: if there are two objects you're trying to recognize, one of which tends to be thin and one of which tends to be thick but which otherwise look the same, they could actually be impossible to distinguish. And the default cropping approach actually removes information — in this case, this grizzly bear here, we actually lost a lot of its legs, so if figuring out what kind of bear it was required looking at its feet, well, we don't have its feet anymore. So they all have downsides. There's something else you can do, a different approach: instead of saying Resize, you can say RandomResizedCrop, and this is actually the most common approach. What RandomResizedCrop does is, each time, it grabs a different part of the image and zooms into it. So these are all the same image — we're just grabbing a batch of four different versions of it — and you can see they're squished in different ways and we've selected different subsets and so forth. Now this kind of seems worse than any of the previous approaches because we're losing information — in this one here I've actually lost a whole lot of its back — but the cool thing about this is that, remember, we want to avoid overfitting.
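Here is a rough sketch of the resize variations being compared, assuming the `bears` DataBlock and `path` from last lesson's notebook (treat the exact cells as illustrative rather than the precise ones on screen):

```python
from fastai.vision.all import *

# Default Resize: crops out the centre 128x128 square of each image.
bears = bears.new(item_tfms=Resize(128))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

# Squish/stretch every image into a 128x128 square instead of cropping.
bears = bears.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

# Pad to a square with black bars ('zeros') rather than distorting or cropping.
bears = bears.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

# Randomly crop and zoom into a different part of the image each time.
bears = bears.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=4, nrows=1, unique=True)  # unique=True repeats one image so you can see the variations
```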
When you see a different part of the animal each time, it's much less likely to overfit, because you're not seeing the same image on each epoch that you go around. Does that make sense? So this RandomResizedCrop approach is actually super popular, and min_scale=0.3 means we're going to pick at least 30% of the pixels of the original size each time, and then zoom into that square. This idea of doing something so that each time the model sees the image it looks a bit different from last time is called data augmentation. This is one type of data augmentation — probably the most common — but there are others, and one of the best ways to do data augmentation is to use the aug_transforms function. What aug_transforms does is return a list of different augmentations. There are augmentations which change contrast, which change brightness, which warp the perspective — you can see in this one here it looks like this bit is much closer to you and this bit is further away, because it's been perspective-warped — it rotates them (this one's actually been rotated), this one's been made really dark, and so on. These are batch transforms, not item transforms. The difference is that item transforms happen one image at a time, so the thing that resizes them all to the same size has to be an item transform; then we pop them all into a mini-batch, put it on the GPU, and a batch transform happens to a whole mini-batch at a time. By putting these in as batch transforms, the augmentation happens super fast, because it happens on the GPU. And I don't know if there are any other libraries, as we speak, which allow you to write your own GPU-accelerated transformations that run on the GPU in this way, so this is a super handy thing in fastai v2. You can check out the documentation for aug_transforms, and when you do you'll find the documentation for all of the underlying transforms that it basically wraps. You can see, if I hit Shift-Tab — I don't remember if I've shown you this trick before: if you go inside the parentheses of a function and hit Shift-Tab a few times, it'll pop open a list of all of the arguments — you can basically see things like: can I sometimes flip it left-right, can I sometimes flip it up-down, what's the maximum amount I can rotate, zoom, change the lighting, warp the perspective, and so forth. How can we add different augmentations for the train and validation sets? The cool thing is that fastai will automatically avoid doing data augmentation on the validation set — all of these aug_transforms will only be applied to the training set, with the exception of RandomResizedCrop. RandomResizedCrop has a different behavior for each: the behavior for the training set is what we just saw, which is to randomly pick a subset and zoom into it, and the behavior for the validation set is just to grab the largest center square that it can. You can write your own transformations — they're just Python, just standard PyTorch code — and by default they will only be applied to the training set. If you want to do something fancy like RandomResizedCrop, where you actually have different things applied to each set, you should come back to the next course to find out how to do that, or read the documentation. It's not rocket science, but it's not something most people need to do.
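As a sketch of how those augmentations get attached — again assuming the `bears` DataBlock and `path` from the notebook — it looks roughly like this:

```python
from fastai.vision.all import *

# Combine a per-item RandomResizedCrop with GPU batch augmentations.
# aug_transforms() returns a list of flip/rotate/zoom/warp/lighting transforms
# that are applied to whole mini-batches on the GPU, and only to the training set.
bears = bears.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms(mult=2),  # mult=2 exaggerates the effects so they're easier to see
)
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)
```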
Okay, so last time we did bears.new with a RandomResizedCrop with min_scale of 0.5, we added some transforms, and we went ahead and trained. Since last week I've rerun this notebook — it's on a different computer and I've got different images, so it's not all exactly the same — but I still got a good confusion matrix: of the black bears, 37 were classified correctly, 2 were classified as grizzlies and 1 as a teddy. Now plot_top_losses is interesting. You can see in this case there are some clearly odd things going on — this is not a bear at all, and this one looks like it's a drawing of a bear, which it's decided is a teddy but it's labelled as a black bear; I can certainly see the confusion. You can see how some parts have been cut off — we'll talk about how to deal with that later. Now one of the interesting things is that we didn't really do much data cleaning at all before we built this model. The only data cleaning we did was to validate that each image can be opened — there was that verify_images call. And the reason for that is that it's actually much easier, normally, to clean your data after you create a model, and I'll show you how. We've got this thing called ImageClassifierCleaner, where you pick a category and the training set or validation set, and what it will do is list all of the images in that set, picking the ones it is least confident about — the ones most likely to be wrong, or where the loss is the worst, to be more precise. So this is a great way to look through your data and find problems. In this case the first one is not a teddy or a brown bear or a black bear — it's a puppy dog. So this is a great cleaner, because what I can do is click delete here; this one here looks a bit like an Ewok rather than a teddy — I'm not sure, what do you think, Rachel, is it an Ewok? I'm going to call it an Ewok. And so you can go through — okay, that's definitely not a teddy — and you can either say "that's wrong, it's actually a grizzly bear", or "it's wrong, it's a black bear", or "I should delete it", or the default, which is to keep it. You can keep going through until you think, okay, they all seem to be fine — maybe that one's not — and once you get to the point where they all seem to be fine, you can say that probably all the rest are fine too, because they all have lower losses, so they all fit the mode of a teddy. And then I can run this code here, where I go through cleaner.delete — that's all the things for which I've selected delete — and unlink them (unlink is just another way of saying delete a file; that's the Python name), and then go through all the ones that we said to change and actually move them to the correct directory. If you haven't seen this before, you might be surprised that we've created our own little GUI inside a Jupyter notebook. Yes, you can do this, and we built this with less than a screen of code — you can check out the source code in the fastai notebooks. This is a great time to remind you that fastai is built with notebooks, so if you go to the fastai repo, clone it, and then go to nbs, you'll find all of the code of fastai written as notebooks, with a lot of prose and examples and tests and so forth. So the best place to learn about how this is implemented is to look at the notebooks rather than looking at the module code.
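For reference, the cleanup step just described looks roughly like this, assuming a trained `learn` object and the `path` to the image folders (this mirrors the notebook's cells):

```python
from fastai.vision.all import *
from fastai.vision.widgets import ImageClassifierCleaner
import shutil

cleaner = ImageClassifierCleaner(learn)  # pick a category and train/valid in the widget
cleaner                                  # displays the GUI in the notebook

# After marking items in the widget, apply the decisions:
for idx in cleaner.delete():
    cleaner.fns[idx].unlink()                         # delete files marked 'Delete'
for idx, cat in cleaner.change():
    shutil.move(str(cleaner.fns[idx]), path/cat)      # move relabelled files to the right folder
```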
Okay, by the way, sometimes you'll see weird little comments like this. These weird little comments are part of a development environment for Jupyter notebooks we use called nbdev, which Sylvain and I built to make it much easier for us to create books and websites and libraries in Jupyter notebooks. This particular one, "hide", means: when this is turned into a book or into documentation, don't show this cell. The reason for that is that I've actually got it in the text, but I thought that when you're running the notebook it would be nice to have it sitting here, waiting for you to run directly — so it's shown in the notebook but not in the book. You'll also see things like "s:" followed by a quote — in the book that ends up as "Sylvain says" followed by what he says. So there are little bits and pieces in the notebooks that look a little odd, and that's because they're designed that way in order to create the book and documentation from them. Right, so then last week we saw how you can export the model to a pickle file that contains all the information from the model, and then on the server where you're going to actually do your inference, you can load that saved file and you'll get back a Learner that you can call predict on. Perhaps the most interesting part of predict is the third thing that it returns, which is a tensor, in this case containing three numbers. There are three of them because we have three classes: teddy bear, grizzly bear and black bear. And this doesn't make any sense until you know what the order of the classes is in your DataLoaders, and you can ask the DataLoaders what that order is by asking for its vocab. A vocab in fastai is a really common concept: basically, any time you've got a mapping from numbers to strings or discrete levels, the mapping is always stored in the vocab. So here this shows us that the activation for black bear is about 1e-6, the activation for grizzly is essentially 1, and the activation for teddy is about 1e-6 — so it's very, very confident that this particular one is a grizzly, not surprisingly, since it was something called grizzly.jpg. So you need to know this mapping in order to display the correct thing, but of course the DataLoaders object already knows that mapping — it's the vocab, and it's stored with the Learner — and that's how it knows to say "grizzly" automatically. So the first thing predict gives you is the human-readable string that you'd want to display. This is what's nice about fastai v2: you save this one object which has everything you need for inference — all the information about normalization, about any kind of transformation steps, about what the vocab is — so it can display everything correctly. So now we want to deploy this as an app. If you've done some web programming before, then all you need to know is this line of code and this line of code: this is the line of code you would call once, when your application starts up, and this is the line of code you would call every time you want to do inference. There's also a batch version of it, which you can look up if you're interested — this is just the one-at-a-time version. So there's nothing special here if you're already a web programmer or have access to one.
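For reference, a sketch of the export-and-inference lines being described, assuming the exported file is called export.pkl as in the notebook:

```python
from fastai.vision.all import *

# Done once, at training time: serialise the Learner (model, transforms, vocab, ...).
learn.export()                               # writes 'export.pkl' next to the notebook by default

# Done once at application start-up on the inference server:
learn_inf = load_learner('export.pkl')

# Done for every request: predict on a single image.
pred_class, pred_idx, probs = learn_inf.predict('images/grizzly.jpg')
print(pred_class, probs[pred_idx])           # e.g. 'grizzly' and its probability
print(learn_inf.dls.vocab)                   # class order, e.g. ['black','grizzly','teddy']
```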
You just have to stick these two lines of code somewhere, and the three things you get back are: the human-readable string, if you're doing categorization; the index of that class, which in this case is 1, for grizzly; and the probability of each class. One of the things we really wanted to do in this course, though, is not assume that everybody is a web developer. Most data scientists aren't — but gee, wouldn't it be great if all data scientists could at least prototype an application to show off the thing they're working on. And so we've tried to curate an approach — none of it is stuff we've built; it's really just curated — which shows how you can create a GUI and a complete application in a Jupyter notebook. The key pieces of technology we use to do this are IPython widgets, which is usually called ipywidgets, and Voilà. ipywidgets, which we import by default as widgets (that's also what they use in their own documentation), gives us GUI widgets — for example, a file upload button. So if I create this file upload button and then display it — and we saw this in the last lesson as well, or maybe lesson one — I see an actual clickable button. I can go ahead and click it, and it now says, okay, you've selected one thing. So how do I use that? Well, these widgets have all kinds of methods and properties, and the upload button has a data property, which is an array containing all of the images you uploaded. So you can pass that to PILImage.create — .create is the standard factory method we use in fastai to create items, and PILImage.create is smart enough to be able to create an item from all kinds of different things, one of which is a binary blob, which is what a file upload contains. So then we can display it, and there's our teddy. So you can see how cells of a Jupyter notebook can refer to other cells that have GUI-created data in them. So let's hide that teddy away for a moment. The next thing to know about is that there's a kind of widget called Output, and an Output widget is basically something that you can fill in later. So if I delete this part here, I've now got an Output widget — actually, let's do it this way around. You can't see the Output widget, even though I said please display it, because nothing has been output to it. Then in the next cell I can say: with that Output placeholder, display a thumbnail of the image — and you'll see that the display does not appear here; it appears back here, because that's where the placeholder was. So let's run that again to clear out that placeholder. We can create another kind of placeholder, which is a Label. A Label is something you can put text in; you can give it a value like, I don't know, "Please choose an image". Okay, so we've now got a label containing "Please choose an image". Let's create another button to do the classification — now, this is not a file upload button, it's just a general Button, so this button doesn't do anything; it doesn't do anything until we attach an event handler to it.
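Up to this point, the widget cells look roughly like this — a sketch only, with names following the notebook; note that btn_upload.data is the older ipywidgets 7 API, and newer versions expose .value instead:

```python
from fastai.vision.all import *
from IPython.display import display
import ipywidgets as widgets

learn_inf = load_learner('export.pkl')                 # the exported bear classifier

btn_upload = widgets.FileUpload()                      # file-upload button
out_pl     = widgets.Output()                          # placeholder we can fill in later
lbl_pred   = widgets.Label(value='Please choose an image')
btn_run    = widgets.Button(description='Classify')    # does nothing until a handler is attached

# Turn the uploaded bytes into a fastai image and show a thumbnail in the placeholder.
img = PILImage.create(btn_upload.data[-1])             # .data works in ipywidgets 7
with out_pl:
    display(img.to_thumb(128, 128))
```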
An event handler is a callback — we'll be learning all about callbacks in this course. If you've ever done any GUI programming before, or even web programming, you'll be familiar with the idea: you write a function which is the thing you want to be called when the button is clicked, and then somehow you tell your framework that this is the on-click event. So here's my btn_run, and I say its on-click event is: call this code — and this code is going to do all the stuff we just saw. It creates an image from the upload, clears the output, displays the image, calls predict, and then replaces the label with the prediction. There it all is. Now, that hasn't done anything yet, but I can go back to this Classify button, which now has an event handler attached to it, so watch this: click, boom — and look, that's been filled in, and that's been filled in. In case you missed it, let's run this again and clear everything out. Okay, everything's gone, this says "Please choose an image", there's nothing here; I click Classify — bop, bop. So it's kind of amazing how our notebook has suddenly turned into this interactive prototyping playground for building applications. Once all this works we can put it all together, and the easiest way to put things together is to create a VBox. A VBox is a vertical box — it's just something that you put widgets in — and in this case we're going to put the following widgets in it: a label that says "Select your bear", then an upload button, a run button, an output placeholder and a label for predictions. Let's run these again just to clear everything out, so that we're not cheating, and let's create our VBox. As you can see, it's got all the pieces — oh, I accidentally ran the thing that displayed the bear; let's get rid of that. Okay, so there it is. Now I can click upload, choose my bear, and then click Classify — and notice that these are exactly the same buttons as those buttons; they're two places where we're viewing the same button, which is kind of a wild idea. So if I click Classify it's going to change this label and this label, because they're actually both references to the same label — look, there we are. So this is our app, and this is actually how I built that image cleaner GUI: just using these exact things, cell by cell in a notebook just like this. You get this interactive, experimental framework for building a GUI, so if you're a data scientist who's never done GUI stuff before, this is a great time to get started, because now you can make actual programs. Now, of course, an actual program running inside a notebook is kind of cool, but what we really want is for this program to run in a place anybody can run it. That's where Voilà comes in.
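Before moving on to Voilà, here is a sketch of the event handler and VBox just described, continuing the widget names from the sketch above:

```python
# Event handler: runs every time the Classify button is clicked.
def on_click_classify(change):
    img = PILImage.create(btn_upload.data[-1])
    out_pl.clear_output()
    with out_pl:
        display(img.to_thumb(128, 128))
    pred, pred_idx, probs = learn_inf.predict(img)
    lbl_pred.value = f'Prediction: {pred}; Probability: {probs[pred_idx]:.04f}'

btn_run.on_click(on_click_classify)

# Stack everything into a single vertical layout — this is the whole "app".
widgets.VBox([widgets.Label('Select your bear!'),
              btn_upload, btn_run, out_pl, lbl_pred])
```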
So Voilà needs to be installed — you can just run these lines to install it; they're listed in the prose. What Voilà does is take a notebook and display nothing except the markdown, the ipywidgets and their outputs: all the code cells disappear, and it doesn't give the person looking at that page the ability to run their own code — they can only interact with the widgets. So what I did was copy and paste that code from the notebook into a separate notebook which only has those lines of code — these are just the same lines of code that we saw before, and it's just a normal notebook. Then, having installed Voilà, if you navigate to this notebook but replace "notebooks" up here in the URL with "voila", it displays not the notebook but, as I said, just the markdown and the widgets. So here I've got my bear classifier and I can click upload — let's do a grizzly bear this time. This is a slightly different version: I actually made it so there's no Classify button — I thought it would be a bit more fancy to have it run everything as soon as you click upload — but as you can see, there it all is, it's all working. So this is the world's simplest prototype, but it's a proof of concept: you can add widgets with dropdowns and sliders and charts and everything you can have in, you know, an Angular app or a React app or whatever. In fact there's even something that lets you use, for example, the whole Vue.js framework — if you know that, it's a very popular JavaScript framework — inside widgets and Voilà. So now we want to get it so that this app can be run by someone out there in the world. The Voilà documentation shows a few ways to do that, but perhaps the easiest one is to use a system called Binder. Binder is at mybinder.org, and all you do is paste in your GitHub repository name here — this is all in the book — so paste in your GitHub repo name, change where it says "File" to "URL", and then put in the path we were just experimenting with. You pop that in here, you say launch, and what it does is give you a URL. You can pass that URL on to people, and it is actually your interactive, running application. Binder is free, so anybody can use this to take their Voilà app and make it a publicly available web application. So try it. As it mentions here, the first time you do this Binder takes about five minutes to build your site, because it actually uses something called Docker to deploy the whole fastai framework and Python and so forth, but once you've done that, the virtual machine will keep running for a while, as long as people are using it, and it's reasonably fast. A few things to note here: being a free service, you won't be surprised to hear that this is not using a GPU, it's using a CPU. That might be surprising — we're deploying to something which runs on a CPU.
When you think about it, though, it makes much more sense to deploy to a CPU than a GPU here. What's happening is that — let's go back to my app — in my app I'm passing along a single image at a time, so when I pass along that single image I don't have a huge amount of parallel work for a GPU to do. This is actually something that a CPU is going to do more efficiently, so we've found that for folks coming through this course, the vast majority of the time they want to deploy inference on a CPU, not a GPU, because they're normally doing one item at a time. It's way cheaper and easier to deploy to a CPU, and the reason for that is that you can just use any hosting service you like, because at this point this is just a program, and you can use all the usual horizontal and vertical scaling: you can use Heroku, you can use AWS, you can use inexpensive instances — super cheap and super easy. Having said that, there are times you might need to deploy to a GPU. For example, maybe you're processing videos, and a single video might take all day to process on a CPU; or you might be so successful that you have a thousand requests per second, in which case you could take, say, 128 requests at a time, batch them together, put the whole batch on the GPU, and pass the results back. You've got to be careful with that, though, because if your requests aren't coming in fast enough, your user has to wait for a whole batch to be ready to be processed. But conceptually, as long as your site is popular enough, that could work. The other thing to talk about is that you might want to deploy to a mobile phone, and our recommendation for deploying to a mobile phone is, wherever possible, to do that by actually deploying to a server and then having the mobile phone talk to the server over a network. Because if you do that, again, you can just use a normal PyTorch program on a normal server and normal network calls, which makes life super easy. When you try to run a PyTorch app on a phone, you're suddenly not in an environment where PyTorch will run natively, so you'll have to convert your program into some other form. There are other forms, and the main one you convert to is something called ONNX, which is specifically designed for a super high-speed, high-performance approach that can run on both servers and mobile phones and does not require the whole Python and PyTorch runtime to be in place. But it's much more complex than not using it: it's harder to debug, harder to set up, harder to maintain. So if possible, keep things simple, and if you're lucky enough to be so successful that you need to scale up to GPUs and such, then great — hopefully you've got the finances at that point to justify spending money on an ONNX expert, or a serving expert, or whatever. And there are various systems you can use — ONNX Runtime, and AWS SageMaker, where you can say "here's my ONNX bundle" and it will serve it for you. PyTorch also has a mobile framework — same idea.
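The lesson doesn't show the conversion step, but as a rough, hypothetical sketch of what exporting the underlying PyTorch model to ONNX can look like (using only standard torch.onnx; the input size is assumed, and note that fastai's preprocessing — resizing, normalization, the vocab — is not captured by this export and would have to be reimplemented on the serving side):

```python
import torch

learn.model.eval()                               # the plain PyTorch module inside the Learner
dummy_input = torch.randn(1, 3, 224, 224)        # one RGB image at the assumed training size
torch.onnx.export(learn.model.cpu(), dummy_input, 'bear_classifier.onnx',
                  input_names=['image'], output_names=['logits'])
```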
All right. It's kind of funny — we're talking about two different kinds of deployment here. One is deploying a hobby application: something you're prototyping, showing off to your friends, explaining to your colleagues how something might work, a little interactive analysis. But maybe you're actually prototyping something that you want to turn into a real product, or an actual part of your company's operations. When you're deploying something in real life, there are all kinds of things you've got to be careful of. Here's one example. Let's say you did exactly what we just did — which, by the way, is your homework: create your own application. I want you to create your own image search application. You can use my exact set of widgets if you want to, but better still, go to the ipywidgets website, see what other widgets they have, and try to come up with something cool; try to show off as best as you can and show us on the forum. Now let's say you decided you want to create an app that would help its users decide if they have healthy skin or unhealthy skin. If you did the exact thing we just did, rather than searching for grizzly bear and teddy bear and so forth on Bing, you would search for "healthy skin" and "unhealthy skin". And here's what happens — remember, in our version we never actually looked at Bing itself, we just used the Bing Image Search API, but behind the scenes it's just using the website — if I type "healthy skin" and hit search, I actually discover that the definition of healthy skin is young white women touching their faces lovingly. So that's what your healthy skin classifier would learn to detect. This is a great example from Deb Raji, and you should check out her paper, "Actionable Auditing", for lots of good insights about model bias. But here's a fascinating example of how, if you weren't looking at your data carefully, you'd end up with something that doesn't actually solve the problem you want to solve at all. This is tricky, because if you're building a new product that didn't exist before, then by definition you don't have examples of the kind of data that's going to be used in real life. You try to find some from somewhere, and if you do that through something like a Google search, pretty likely you're not going to end up with a set of data that actually reflects the mix you would see in real life. So the main thing here is: be careful. In particular, for your test set — that final set that you check on — really try hard to gather data that reflects the real world. For the healthy skin example, you might go and actually talk to a dermatologist and try to find, say, ten examples of healthy and unhealthy skin, and that would be your gold standard test set. There are all kinds of issues you have to think about in deployment.
I can't cover all of them. I can tell you that the O'Reilly book "Building Machine Learning Powered Applications" is a great resource, and this is one of the reasons we don't go into detail about A/B testing, when you should refresh your data, how you monitor things and so forth — that book has already been written, so we don't want to rewrite it. I do want to mention a particular area that I care a lot about, though. Let's take this example: say you're rolling out this bear detection system, and it's going to be attached to video cameras around a campsite to warn campers of incoming bears. If we used the model trained with the data we just looked at — those are all very nicely taken pictures of pretty perfect bears — there's really no relationship to the kinds of pictures you're actually going to be dealing with in your campsite bear detector, which is going to have video rather than images, it's going to be nighttime, there are probably going to be low-resolution security cameras, and you need to make sure that the system is fast enough to tell you about the bear before it reaches you. There will be bears that are partially obscured by bushes, or in lots of shadow, or whatever — none of which are the kinds of things you would normally see in internet pictures. This is what we call "out-of-domain data": a situation where the data that you are trying to do inference on is in some way different from the kind of data that you trained with. There's no perfect way to solve this, and when we look at ethics we'll talk about some really helpful ways to minimize how much it happens — for example, it turns out that having a diverse team is a great way to avoid being surprised by the kinds of data that people end up coming up with — but really it's just something you've got to be super thoughtful about. Very similar to that is something called "domain shift". Domain shift is where you start out with all of your data being in-domain, but over time the kinds of data that you're seeing change — so over time maybe raccoons start invading your campsite, and you weren't training on raccoons before, it was just a bear detector. That's called domain shift, and it's another thing you have to be very careful of. Rachel, is there a question? No, I was just going to add to that: all data is biased, so there isn't some form of de-biased data that's perfectly representative in all cases, and a lot of the proposals around addressing this have been converging on an idea — which you see in papers like Timnit Gebru's "Datasheets for Datasets" — of simply writing down a lot of the details about your dataset: how it was gathered, in which situations it's appropriate to use, how it was maintained. It's not that you've totally eliminated bias, but that you're very aware of the attributes of your dataset, so you won't be blindsided by them later. There have been several proposals in that school of thought, which I really like, around understanding how your data was gathered and what its limitations are. Thanks, Rachel. So a key problem here is that you can't know the entire behavior of your neural network.
With normal programming, you typed in the if statements and the loops and whatever, so in theory you know what it does — although it's still sometimes surprising. In this case you didn't tell it anything; you just gave it examples to learn from and hoped that it learned something useful. There are hundreds of millions of parameters in many of these neural networks, so there's no way you can understand how they all combine with each other to create complex behavior. So there's a natural compromise here: we're trying to get sophisticated behavior — like recognizing pictures — behavior sophisticated enough that we can't describe it, and the natural downside is that you can't expect the process the model uses to do that to be describable, for you to be able to understand it. So our recommendation for dealing with these issues is a very careful deployment strategy, which I've summarized in this little chart. The idea would be: first of all, whatever it is that you're going to use the model for, start out by doing it manually. Have a park ranger watching for bears, with the model running next to them, and each time the park ranger sees a bear they can check the model and see whether it seems to have picked it up. So the model is not doing anything; there's just a person running it and seeing whether it would have made sensible choices. Once you're confident that what it's doing seems reasonable, in as close to the real-life situation as possible, then deploy it in a time- and geography-limited way. Pick one campsite, not the entirety of California, and do it for one day, with somebody watching it super carefully. So now the basic bear detection is being done by the bear detector, but there's still somebody watching it pretty closely, and it's only happening in one campsite for one day. And then you say, "Okay, we haven't destroyed our company yet — let's do two campsites for a week, and then let's do the entirety of Marin for a month", and so forth. This is actually what we did when I was at a company called Optimal Decisions. Optimal Decisions was a company I founded to do insurance pricing, and if you change insurance prices by a percent or two in the wrong direction, in the wrong way, you can basically destroy the whole company — this has happened many times; insurers are companies that set prices, that's basically the product that they provide. So when we deployed new prices for Optimal Decisions, we always did it by saying something like, "Okay, we're going to do it for five minutes", or "for everybody whose name ends with a D". We'd try to find some group which would hopefully be representative but not too large, and we would gradually scale it up. And you've got to make sure that when you're doing this you have a lot of really good reporting systems in place, so you can recognize: are your customers yelling at you, are your computers burning up, are your costs spiraling out of control, and so forth — it really requires great reporting systems. "Does fastai have methods built in that provide for incremental learning, i.e. improving the model slowly over time with a single data point each time?" Yeah, that's a great question.
So this is a little bit different — this is really about dealing with domain shift and similar issues by continuing to train your model as you do inference — and the good news is you don't need anything special for that; it's basically just a transfer learning problem. You can do this in many different ways. Probably the easiest is just to say: okay, each night at midnight we're going to set off a task which grabs all of the previous day's transactions, as mini-batches, and trains another epoch. And that actually works fine. You can basically think of this as a fine-tuning approach where your pretrained model is yesterday's model and your fine-tuning data is today's data. So as you roll out your model, one thing to think about super carefully is that it might change the behavior of the system that it's a part of, and this can create something called a feedback loop. Feedback loops are one of the most challenging things for real-world model deployment, particularly of machine learning models, because they can take a very minor issue and explode it into a really big issue. For example, think about a predictive policing algorithm. It's an algorithm trained, basically, on data that says whereabouts arrests are being made, and then, having trained that algorithm on where arrests are being made, you put in place a system that sends police officers to the places the model says are likely to have crime — which in this case means where arrests were made. Well, then more police go to that place and find more crime, because the more police are there, the more they'll see. They arrest more people, and then if you do the incremental learning we were just talking about, the model is going to say, "Oh, there's actually even more crime here", and so tomorrow it sends even more police. In that situation the predictive policing algorithm ends up sending all of your police to one street block, because at that point all of the arrests are happening there, because that's the only place you have police officers. There's actually a paper about this issue called "To Predict and Serve?", and in it the authors write this really nice phrase: "Predictive policing is aptly named: it is predicting policing, not predicting crime." If the initial model was perfect — whatever that even means — and somehow sent police to exactly the best places to find crime, based on the probability of crimes actually taking place, I guess there's no problem. But as soon as there's any amount of bias — for example, in the US there are a lot more arrests of Black people than of white people, even for crimes that Black people and white people are known to commit at the same rate — then in the presence of that bias, or any kind of bias, you're setting off a domino chain of feedback loops, where the bias gets exploded over time. So one thing I like to think about is: what would happen if this model was really, really, really good? Who would be impacted? What would this extreme result look like?
How would you know what was really happening? This incredibly predictive algorithm that's changing the behavior of your police officers, or whatever — what would that look like? What would actually happen? And then think about: okay, what could go wrong? And then: what kind of rollout plan, what kind of monitoring systems, what kind of oversight could provide the circuit breaker? Because that's what we really need here. Nothing's going to be perfect — you can't be sure that there are no feedback loops — but what you can do is try to be sure that you notice when your system is behaving in a way that's not what you want. Did you have anything to add to that, Rachel? I would add that you're at risk of potentially having a feedback loop any time that your model is in some way controlling what your next round of data looks like. And I think that's true for pretty much all products, and that can be a hard jump for people coming from a science background, where you may be thinking of data as "I have just observed some sort of experiment" — whereas whenever you're building something that interacts with the real world, you are now also controlling what your future data looks like, based on the behavior of your algorithm on the current round of data. Right. So given that you probably can't avoid feedback loops, the thing you need to really invest in is the human in the loop. A lot of people like to focus on automating things completely, which I find weird — if you can decrease the amount of human involvement by 90 percent, you've got almost all of the economic upside of automating it completely, but you still have the room to put human circuit breakers in place. You need appeals processes, you need monitoring, you need humans involved to go, "Hey, that's weird — I don't think that's what we want." Okay, yes, Rachel? And just one more note about that: those humans do need to be integrated well with product and engineering. One issue that comes up is that in many companies this ends up sitting under trust and safety, which handles a lot of the issues around how things can go wrong or how your platform can be abused, and often trust and safety is pretty siloed away from product and engineering, which actually has control over the decisions that really end up influencing these outcomes. The engineers probably consider them pretty annoying a lot of the time — how they get in the way of getting software out the door. Yeah — but the more integration you can have between those groups, the more the people building the product can see what is going wrong and what can go wrong. Right — if the engineers are actually on top of that, actually seeing these things happening, it's not some kind of abstract problem anymore. So, at this point, now that we've got to the end of Chapter 2, you actually know a lot more than most people about deep learning, and about some pretty important foundations of machine learning more generally, and of data products more generally. So now is a great time to think about writing. Sometimes, by the way, we have formatted text that doesn't quite format correctly.
In the Jupyter notebook it only formats correctly in the actual book — that's what it means when you see this kind of pre-formatted text. The idea here is to think about starting to write at this point, before you go too much further. Rachel, there's a question? Oh, okay, let's hear it. The question is: "I assume there are fastai-type ways of keeping a nightly-updated transfer learning setup. Could one of the fastai course v4 notebooks have an example of the nightly transfer learning training, like the previous person asked? I'd be interested in knowing how to do that most effectively with fastai." Sure. My view is that there's nothing fastai-specific about that at all, so I'd actually suggest you read Emmanuel's book — the book I just showed you — to understand the ideas, and if people are interested I can also point you to some academic research about this as well; there's not as much as there should be, but there is some good work in this area. Okay. So, the reason we mention writing at this point in our journey is that things are going to start to get more and more heavy, more and more complicated, and a really good way to make sure that you're on top of it is to try to write down what you've learned. Sorry — I wasn't sharing the right part of the screen before, but this is what I was describing in terms of the pre-formatted text, which doesn't look correct. So Rachel actually has this great article that you should check out, which is "Why you should blog", and I'll read out what she says, since I have it in front of me and she doesn't, weird as that is. Rachel says that the top piece of advice she would give her younger self is to start blogging sooner. Rachel has a math PhD, and this idea of blogging was not exactly something they had a lot of in the PhD program, but it's a really great way of finding jobs — in fact, most of my students who have got the best jobs are students with good blog posts. The thing I really love is that it helps you learn: writing things down synthesizes your ideas, and there are lots of other reasons to blog too. So there's actually something really cool I want to show you. Yeah — I was also just going to note that I have a second post called "Advice for Better Blog Posts", which is a little more advanced and which I'll post a link to as well. It talks about some common pitfalls I've seen in many blog posts, the importance of putting the time in to do it well, and some other things to think about. So I'll share that post as well. Thanks, Rachel. So, one reason people sometimes don't blog is that it's kind of annoying to figure out how to, particularly because I think the thing a lot of you will want to blog about is the cool stuff you're building in Jupyter notebooks. So we've teamed up with a guy called Hamel Husain, and with GitHub, to create this free product — as usual with fast.ai, no ads, no anything — called fastpages, where you can actually blog with Jupyter notebooks. You can go to fastpages and see for yourself how to do it, but the basic idea is that you literally click one button, it sets up a blog for you, and then you dump your notebooks into a folder called _notebooks and they get turned into blog posts.
It's basically like magic — Hamel's done an amazing job of this — and it means you can create blog posts where you've got charts, tables and images that are all actually the output of a Jupyter notebook, along with all of the markdown-formatted text, headings, hyperlinks, the whole thing. So this is a great way to start writing about what you're learning here. Something that Rachel and I both feel strongly about when it comes to blogging is this: don't try to think of the absolute most advanced thing you know and write a blog post that would impress Geoff Hinton. Most people are not Geoff Hinton, so (a) you probably won't do a good job, because you're trying to blog for somebody who has more expertise than you, and (b) you've got a small audience now — there are actually far more people who are not very familiar with deep learning than people who are. So try to write for the you of six months ago. You really understand what it was like to be that person, because you were there six months ago. Try to write something which the six-months-ago version of you would have found super interesting, full of little tidbits that would have delighted them. Okay. So once again, don't move on until you've had a go at the questionnaire, to make sure you understand the key things we think you need to understand, and have a think about the further research questions as well, because they might help you engage more closely with the material. So let's have a break, and we'll come back in five minutes' time. Welcome back, everybody. This is an interesting moment in the course, because we're jumping from a part of the course which is very heavily about the structure of what we're trying to do with machine learning — what the pieces are, and what we need to know to make everything work together. There was a bit of code, but not masses, and basically no math, and we wanted to put that at the start for everybody who wants an understanding of these issues without necessarily wanting to dive deep into the code and the math themselves. Now we're getting into the diving-deeper part. If you're not interested in diving deep yourself, you might want to skip to the next lesson about ethics, which rounds out the slightly less technical material. So what we're going to look at here is what we think of as kind of a toy problem, but which just a few years ago was considered a pretty challenging problem: recognizing handwritten digits. We're going to try to do it from scratch, and we're going to look at a number of different ways to do it. We're going to look at a dataset called MNIST, and if you've done any machine learning before you may well have come across it. It contains handwritten digits, and it was collated into a machine learning dataset by a guy called Yann LeCun and some colleagues, who used it to demonstrate one of the first — probably the first — computer systems to provide really practically useful, scalable recognition of handwritten digits.
LeNet-5 was the system, and it was actually used to automatically process something like 10% of the checks in the US. So, one of the things that really helps, I think, when building a new model is to start with something simple and gradually scale it up. We've created an even simpler version of MNIST, which we call MNIST_SAMPLE, which only has threes and sevens. This is a good starting point to make sure that we can do something easy. I picked threes and sevens for MNIST_SAMPLE because they're very different, so I feel like if we can't do this, we're going to have trouble recognizing every digit. So step one is to call untar_data. untar_data is the fastai function which takes a URL, checks whether you've already downloaded it, downloads it if you haven't, checks whether you've already uncompressed it, uncompresses it if you haven't, and then finally returns the path of where it ended up. So you can see here URLs.MNIST_SAMPLE — you can just hit Tab to get autocomplete — is just some location; it doesn't really matter where it is. I've already downloaded and uncompressed it, because I've run this once before, so it happens straight away, and path shows me where it is. Now in this case path is ".", and the reason is that I've used this special BASE_PATH attribute on Path to tell it where my starting point is, and that's used for printing — so when I call ls here, which prints a list of files, these are all relative to where I actually untarred this to. It just makes it a lot easier not to have to see the whole set of parent path folders. Now, ls — so path is a... let's see what type it is — is a pathlib Path object. pathlib is part of the Python standard library; it's a really very nice library, but it doesn't actually have ls. Where there are libraries we find super helpful but they don't have exactly the things we want, we liberally add the things we want to them, so we add ls. If you want to find out what ls is, as we've mentioned there are a few ways you can do it: you can pop a question mark there, and that will show you where it comes from. There's actually a library called fastcore, which contains a lot of the foundational stuff in fastai that is not dependent on PyTorch, or pandas, or any of these big, heavy libraries; ls is part of fastcore, and if you want to see exactly what it does, remember you can put in a second question mark to get the source code — and as you can see, there's not much source code to it. And maybe most importantly, please don't forget about doc, because that gives you this "Show in docs" link, which you can click to get to the documentation and see examples, pictures if relevant, tutorials, tests and so forth. So when you're looking at a new dataset, I always start with just ls to see what's in it, and I can see here there's a train folder and there's a valid folder — that's pretty normal. Let's look at ls on the train folder: it's got a folder called 7 and a folder called 3, so this is looking quite a lot like our bear classifier dataset, where we downloaded each set of images into a folder based on what its label was. This is doing it at another level, though.
The first level of the folder hierarchy is whether it's training or valid, and the second level is what the label is. And this is the most common way for image datasets to be distributed. So let's have a look. Let's create something called threes that contains all of the contents of the '3' directory inside train, and let's sort them so that this is consistent. Do the same for sevens, and let's look at the threes, and you can see they're just numbered files. All right. So let's grab one of those, open it, and take a look. Okay, so there's the picture of a 3. And so what is im3 really? PIL is the Python Imaging Library. It's by far the most popular library for working with images in Python, and this is a PNG, not surprisingly. Jupyter notebook knows how to display many different types, and if you create a new type you can actually tell it how to display your type. And so PIL comes with something that will automatically display the image, like so. What I want to do here, though, is to look at how we're going to treat this as numbers. And one easy way to treat things as numbers is to turn it into an array. array is part of numpy, which is the most popular array programming library for Python. And so if we pass our PIL image object to array, it just converts the image into a bunch of numbers. And the truth is, it was a bunch of numbers the whole time -- it was actually stored as a bunch of numbers on disk. It's just that there's this magic thing in Jupyter that knows how to display those numbers on the screen. When I say array(), turning it into a numpy array, we're removing this ability for Jupyter notebook to know how to display it as a picture. So once I do this, we can then index into that array and grab all the rows from 4 up to but not including 10, and all the columns from 4 up to but not including 10. And here are some numbers, and they are 8-bit unsigned integers, so they are between 0 and 255. So an image, just like everything on a computer, is just a bunch of numbers, and therefore we can compute with it. We could do the same thing, but instead of saying array(), we could say tensor(). A tensor is basically the PyTorch version of a numpy array. And so you can see it's exactly the same code as above, but I've just replaced array() with tensor(), and the output looks almost exactly the same, except it says tensor instead of array. And so you'll see this: a PyTorch tensor and a numpy array behave nearly identically much, if not most, of the time. But the key thing is that a PyTorch tensor can also be computed on a GPU, not just a CPU. So in our work, in the book, in the notebooks, and in our code, we tend to use PyTorch tensors much more often than numpy arrays, because they have nearly all the benefits of numpy arrays plus all the benefits of GPU computation, and they've got a whole lot of extra functionality as well. A lot of people who have used Python for a long time always jump into numpy, because that's what they're used to. If that's you, you might want to start considering jumping into tensors instead: wherever you used to write array, just start writing tensor and see what happens, because you might be surprised at how many things you can speed up or do more easily. So let's grab that 3 image and turn it into a tensor, and so that's going to be a 3 image tensor -- that's why I've called it im3_t here.
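A hedged sketch of that array/tensor conversion, spelled out with plain numpy and torch calls (fastai's array() and tensor() helpers do much the same thing):

```python
from PIL import Image
import numpy as np
import torch

im3_path = threes[1]          # threes = (path/'train'/'3').ls().sorted(), from above
im3 = Image.open(im3_path)    # a PIL image; Jupyter shows it as a picture

arr = np.array(im3)           # 28x28 numpy array of uint8 values in 0..255
print(arr[4:10, 4:10])        # a small patch of raw pixel values

im3_t = torch.as_tensor(arr)  # the same data as a PyTorch tensor
print(im3_t[4:10, 4:10])      # looks the same, but can also be computed on a GPU
```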
And let's grab a bit of it and turn it into a pandas DataFrame. The only reason I'm turning it into a pandas DataFrame is that pandas has a very convenient thing called background_gradient() that turns the background into a colour gradient, as you can see (there's a small sketch of that styling cell just after this paragraph). So here is the top bit of the 3. You can see that the 0s are the whites and the numbers near 255 are the blacks, and there are some bits in the middle which are grey. So here we can see what's going on when our images, which are numbers, actually get displayed on the screen -- it's just doing this. And I'm only showing a subset here; the actual image in MNIST is 28 by 28 pixels square, so that's 784 pixels. So that's super tiny. My mobile phone -- I don't know how many megapixels it is, but it's millions of pixels. So it's nice to start with something simple and small. So here's our goal: create a model, and by model I just mean some kind of computer program, learnt from data, that can recognize 3s versus 7s. You can think of it as a 3 detector: is it a 3? Because if it's not a 3, it's a 7. So have a stop here, pause the video and have a think: how would you do it? You don't need to know anything about neural networks or anything else. How might you, just with common sense, build a 3 detector? So I hope you grabbed a piece of paper and a pen and jotted some notes down. I'll tell you the first idea that came into my head: what if we grab every single 3 in the dataset and take the average of the pixels? So what's the average of this pixel, the average of this pixel, the average of this pixel, and so on. There would be a 28 by 28 picture which is the average of all of the 3s, and that would be like the ideal 3. And then we'll do the same for 7s. So when we then grab something from the validation set to classify, we'll say, "Is this image closer to the ideal 3, the mean of the 3s, or the ideal 7?" That's my idea, and I'm going to call it the pixel similarity approach. I'm describing this as a baseline. A baseline is a super simple model that should be pretty easy to program from scratch with very little magic -- maybe it's just a bunch of simple averages, simple arithmetic -- which you're super confident is going to be better than a random model. And one of the biggest mistakes I see, even in experienced practitioners, is that they fail to create a baseline. And so then they build some fancy Bayesian model or some fancy neural network and they go, "Wow, Jeremy, look at my amazingly great model!" And I'll say, "How do you know it's amazingly great?" and they'll say, "Oh look, the accuracy is 80%." And then I'll say, "Okay, let's see what happens if we create a model where we always predict the mean. Oh look, that's 85%." And people get pretty disheartened when they discover this. So make sure you start with a reasonable baseline and then gradually build on top of it. So we need to get the average of the pixels, and we're going to learn some nice Python programming tricks to do this.
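Here's that small sketch of the DataFrame shading cell mentioned above -- a hedged approximation, where the slice bounds and styling options are just illustrative:

```python
import pandas as pd

# shade a slice of the image tensor so the pixel values read like a picture
df = pd.DataFrame(im3_t[4:15, 4:22].numpy())
df.style.set_properties(**{'font-size': '6pt'}).background_gradient('Greys')
```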
The first thing we need is a list of all of the 7s. Remember we've got sevens -- it's just a list of file names. And so for each of those file names in sevens, let's Image.open() that file, just like we did before, to get a PIL object, and let's convert that into a tensor. So this thing here is called a list comprehension. If you haven't seen this before, this is one of the most powerful and useful tools in Python. If you've done something with C#, it's a little bit like LINQ -- it's not as powerful as LINQ, but it's a similar idea. If you've done some functional programming in JavaScript, it's a bit like some of the things you can do with that, too. But basically, we're just going to go through this collection, each item will be called "o", and then it will be passed to this function, which opens it up and turns it into a tensor, and then they all get collated back into a list. And so this will be all of the 7s as tensors. Sylvain and I use list and dictionary comprehensions every day, so you should definitely spend some time checking them out if you haven't already. So now that we've got a list of all of the 3s as tensors, let's just grab one of them and display it. Remember, this is a tensor, not a PIL image object, so Jupyter doesn't know how to display it nicely. We have to use a command to display it, and show_image() is a fastai function that displays a tensor. And so here is a 3. So we need to get the average of all of those 3s. To get the average, the first thing we need to do is to change this so it's not a list but a tensor itself. Currently three_tensors[1] has a shape of 28 by 28 -- that's the rows by columns, the size of this thing. But three_tensors itself is just a list, and I can't easily do mathematical computations on that. So what we could do is stack all of these 28 by 28 images on top of each other to create a kind of 3D cube of images -- and that's still a tensor. A tensor can have as many of these axes, or dimensions, as you like. And to stack them up you use, funnily enough, stack(). So this is going to turn the list into a tensor, and as you can see the shape of it is now 6131 by 28 by 28 -- it's like a cube of height 6131 by 28 by 28. The other thing we want to do is, if we're going to take the mean, we want to turn them into floating-point values, because we don't want to have integers rounding off. The other thing to know is that it's a standard in computer vision that when you are working with floats, you expect them to be between 0 and 1, so we just divide by 255, because they were between 0 and 255 before. So this is a pretty standard way to represent a bunch of images in PyTorch. So these three things here are called the axes -- first axis, second axis, third axis -- and overall we would say that this is a rank 3 tensor, as it has three axes. This one here was a rank 2 tensor -- it just has two axes. You can get the rank from a tensor by just taking the length of its shape: one, two, three. I've been using the word axis; you can also use the word dimension -- I think numpy tends to call it axis and PyTorch tends to call it dimension. So the rank is also the number of dimensions: ndim. So you need to make sure that you remember this word: rank is the number of axes or dimensions in a tensor, and the shape is a list containing the size of each axis in a tensor. So we can now say stacked_threes.mean(). Now, if we just say stacked_threes.mean(), that returns a single number -- that's the average pixel across that whole cube, that whole rank 3 tensor.
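A hedged sketch of the list comprehensions and the stacking step (variable names follow the notebook; the image-opening helper is spelled out with plain PIL/numpy calls):

```python
import numpy as np
import torch
from PIL import Image

# one line each: open every image file and turn it into a tensor
seven_tensors = [torch.as_tensor(np.array(Image.open(o))) for o in sevens]
three_tensors = [torch.as_tensor(np.array(Image.open(o))) for o in threes]

# stack the list of 28x28 images into a single rank-3 tensor, as floats in 0..1
stacked_sevens = torch.stack(seven_tensors).float() / 255
stacked_threes = torch.stack(three_tensors).float() / 255

print(stacked_threes.shape)       # torch.Size([6131, 28, 28]) -- images x rows x columns
print(len(stacked_threes.shape))  # 3 -- the rank
print(stacked_threes.ndim)        # 3 -- same thing
```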
But if we say mean(0), that says: take the mean over this axis, so that's the mean across the images. And so that's now 28 by 28 again, because we reduced over this 6131 axis -- we took the mean across that axis -- and so we can show that image, and here is our ideal three. And here's the ideal seven, using the same approach. All right, so now let's just grab a three -- any old three -- there it is. And what I'm going to do is say, "Well, is this three more similar to the perfect three, or is it more similar to the perfect seven?" And whichever one it's more similar to, I'm going to assume that that's the answer. Now, we can't just look at each pixel and say what's the difference between this pixel at (0,0) here and (0,0) here, and at (0,1) here and (0,1) here, and take the average. The reason we can't just take the average is that there are positives and negatives, and they're going to average out to nothing, so I actually need them all to be positive numbers. There are two ways to make them all positive numbers. I could take the absolute value, which simply means remove the minus signs, and then take the average of those; that's called the mean absolute difference, or L1 norm. Or I could take the square of each difference, take the mean of that, and at the end take the square root, which kind of undoes the squaring; that's called the root mean squared error, or the L2 norm. So let's have a look. Let's take a three, subtract from it the mean of the threes, take the absolute value, take the mean, and call that the absolute-value distance from a_3 to the ideal three. And there's the number: about 0.1. So this is the mean absolute difference, or L1 norm. When you see a term like L1 norm, if you haven't seen it before it may sound pretty fancy, but all these math terms that we see, you can turn them into a tiny bit of code. Don't let the mathy bits fool you -- in code it's often just very obvious what they mean, whereas with math you just have to learn it, or learn how to google it. So here's the same version for squaring: take the difference, square it, take the mean, and then take the square root. Now we'll do the same thing for our three, but this time we'll compare it to the mean of the sevens. All right, so the distance from a_3 to the mean of the threes in terms of absolute difference was 0.1, and the distance from a_3 to the mean of the sevens was 0.15. So it's closer to the mean of the threes than it is to the mean of the sevens, so we guess that this is a three, based on the mean absolute difference. Same thing with RMSE (root mean squared error): compare this value with this value, and again by root mean squared error it's closer to the mean3 than to the mean7. So this is like a machine learning model, kind of; it's a data-driven model which attempts to recognize threes versus sevens, and so this is a reasonable baseline -- it's going to be better than random. We don't actually have to write out the subtract, abs, mean ourselves -- we can just use l1_loss, which does exactly that. And we don't have to write out the squared version either -- we can just use mse_loss, which doesn't do the square root by default, so we have to pop that in. And as you can see, they're exactly the same numbers. It's very important, before we go too much further, to make sure we're very comfortable working with arrays and tensors.
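A hedged sketch of those distance calculations; F.l1_loss and F.mse_loss are the standard PyTorch functions, and the numbers in the comments are the approximate values from the lesson:

```python
import torch.nn.functional as F

mean3 = stacked_threes.mean(0)  # 28x28 "ideal" 3: mean over the image axis
mean7 = stacked_sevens.mean(0)  # 28x28 "ideal" 7
a_3 = stacked_threes[1]         # any old 3

dist_3_abs = (a_3 - mean3).abs().mean()        # L1 / mean absolute difference, about 0.1
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()  # L2 / root mean squared error

dist_7_abs = (a_3 - mean7).abs().mean()        # about 0.15, so a_3 is closer to the ideal 3
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()

# the same two measures via the built-in loss functions
print(F.l1_loss(a_3, mean7), F.mse_loss(a_3, mean7).sqrt())
```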
And they're so similar. We could start with a list of lists, which is kind of a matrix. We can convert it into an array, or into a tensor. We can display it, and they look almost the same. You can index into a single row, you can index into a single column, and it's important to know -- this is very important -- that colon means every row when I put it in the first spot. If it were in the second spot it would mean every column, and so comma colon ( ,: ) is exactly the same as removing it. It turns out you can always remove colons that are at the end, because they're implied. You never have to include them, but I often put them in anyway, because it just makes it a bit more obvious how these things match up, or how they differ. You can combine them together: give me row 1 and the columns from 1 up to but not including 3, and we get back 5, 6. You can add stuff to them; you can check their type. Notice that this is different to Python's type function: type as a function tells you it's a tensor; if you want to know what kind of tensor, you have to use type as a method. So it's a long tensor. You can multiply them by a float, which turns them into a float tensor. So have a fiddle around. If you haven't done much with numpy or PyTorch before, this is a good opportunity to just go crazy -- try things out. Try things that you think might not work and see if you actually get an error message. So we now want to find out: how good is our model -- our model that involves just comparing something to the mean? We should not check how good our model is on the training set. As we've discussed, we should check it on a validation set, and we already have a validation set: it's everything inside the valid directory. So let's combine all those steps from before. Let's go through everything in the validation set's '3' directory ls(), open them, turn them into tensors, stack them all up, turn them into floats, and divide by 255. And let's do the same for the sevens. So we're just putting all the steps we did before into a couple of lines. I always try to print out shapes, all the time, because if a shape is not what you expected then you can get weird things going on. So the idea is we want some function is_3 that will return true if we think something is a three. To do that we have to decide whether the digit that we're testing is closer to the ideal three or the ideal seven. So let's create a little function that returns the difference between two things, takes the absolute value, and then takes the mean. So we're going to create this function mnist_distance that takes the difference between two tensors, takes their absolute value, and then takes the mean -- and look at this, we've got minuses this time: it takes the mean over the last and second-last dimensions. So this is going to take the mean across the x and y axes of the image. And here you can see it's returning a single number, which is the distance of a three from the mean3. So that's the same as the value that we got earlier: 0.1114. Now we need to do this for every image in the validation set, because we're trying to find the overall metric. Remember: the metric is the thing we look at to say how good our model is.
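A hedged sketch of the validation tensors and mnist_distance (the .ls() call is fastai's; everything else is plain PyTorch):

```python
import numpy as np
import torch
from PIL import Image

valid_3_tens = torch.stack([torch.as_tensor(np.array(Image.open(o)))
                            for o in (path/'valid'/'3').ls()]).float() / 255
valid_7_tens = torch.stack([torch.as_tensor(np.array(Image.open(o)))
                            for o in (path/'valid'/'7').ls()]).float() / 255
print(valid_3_tens.shape, valid_7_tens.shape)   # always print the shapes

def mnist_distance(a, b):
    # mean absolute difference, averaged over the last two axes (the pixel grid)
    return (a - b).abs().mean((-1, -2))

print(mnist_distance(a_3, mean3))               # a single number, about 0.1114
```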
So here's something crazy: we can call mnist_distance not just on a three, but on the entire validation set, against the mean three. That's wild -- there's no normal programming we would do where we could somehow pass in either a matrix or a rank 3 tensor and have it work both times. And what actually happened here is that instead of returning a single number it returned 1,010 numbers. And it did this because it used something called broadcasting. Broadcasting is the super special magic trick that lets you make Python into a very high-performance language, and in fact, if you do this broadcasting on GPU tensors in PyTorch, it actually does the operation on the GPU, even though you wrote it in Python. Here's what happens. Look at this a - b: we're doing a - b on two things. We've got, first of all, valid_3_tens -- the valid three tensor is a thousand or so images -- and remember that mean3 is just our single ideal three. So what is something of this shape minus something of this shape? Well, broadcasting means that if the shapes don't match -- if they did match it would just subtract every corresponding item -- it acts as if there were a thousand and ten versions of the smaller one. So it's actually going to subtract mean3 from every single one of these images. So broadcasting -- let's look at some examples. Broadcasting requires us first of all to understand the idea of element-wise operations. This is an element-wise operation: here is a rank 1 tensor of size 3 and another rank 1 tensor of size 3, so we would say these sizes match (they're the same), and so when I add 1, 2, 3 to 1, 1, 1 I get back 2, 3, 4. It just takes the corresponding items and adds them together. Those are called element-wise operations. So when the shapes are different, as we described before, it basically acts as if it had copied the smaller tensor a thousand and ten times -- as if we had said valid_3_tens minus 1,010 copies of mean3. As it says here, it doesn't actually copy mean3 1,010 times; it just acts as if it did, basically looping back around to the start again and again, and it does the whole thing in C, or in CUDA on the GPU. So then we see absolute value. Let's go back up here: after we do the minus, we take the absolute value. So what happens when we call absolute value on something of size 1010 by 28 by 28? It just takes the absolute value of each underlying element. And then finally we call mean((-1,-2)): -1 always means the last element in Python, and -2 the second-last. So this is taking the mean over the last two axes, and it's going to return just the first axis. We're going to end up with 1,010 means -- 1,010 distances -- which is exactly what we want: we want to know how far away each of our validation items is from the ideal three. So then we can create our is_3 function, which is: "Hey, is the distance between the number in question and the perfect three less than the distance between the number in question and the perfect seven?" If it is, it's a three. So our three -- that was an actual three we had -- is it a three? Yes. And then we can turn that into a float, and "yes" becomes 1.0. Thanks to broadcasting, we can do it for the entire validation set. So this is so cool -- we basically get rid of loops. In this kind of programming, you should have very, very few loops.
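Here's a hedged sketch of the broadcasting examples and the is_3 function, with approximate outputs in the comments:

```python
import torch

# element-wise: the shapes match, so corresponding items are added
print(torch.tensor([1, 2, 3]) + torch.tensor([1, 1, 1]))  # tensor([2, 3, 4])

# broadcasting: (1010, 28, 28) minus (28, 28) acts as if mean3 were copied 1,010 times
print(mnist_distance(valid_3_tens, mean3).shape)          # torch.Size([1010])

def is_3(x):
    # closer to the ideal 3 than to the ideal 7?
    return mnist_distance(x, mean3) < mnist_distance(x, mean7)

print(is_3(a_3), is_3(a_3).float())       # tensor(True) tensor(1.)
print(is_3(valid_3_tens).float().mean())  # accuracy on the validation 3s, about 0.91
```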
Loops make things much harder to read, and hundreds of thousands of times slower (on the GPU, potentially tens of millions of times slower). So we can just call is_3 on our whole valid_3_tens, turn that into floats, and take the mean; that's going to be the accuracy on the threes on average. And here's the accuracy on the sevens -- it's just one minus is_3 averaged over the sevens -- and so the accuracy on threes is about 91 and a bit percent, the accuracy on sevens is about 98%, and the average of those two is about 95%. So here we have a model that's 95 percent accurate at recognizing threes from sevens. It might surprise you that we can do that using nothing but arithmetic, but that's what I mean by getting a good baseline. Now the thing is, it's not obvious how we improve this. It doesn't match Arthur Samuel's description of machine learning. This is not something where there's a function which has some parameters which we're testing against some kind of measure of fitness, and then using that to improve the parameters iteratively. We just did one step and that's that. So we want to do it the Arthur Samuel way, where we arrange for some automatic means of testing the effectiveness of -- he called it a weight assignment; we'd call it a parameter assignment -- in terms of performance, and a mechanism for altering the weight assignment to maximize the performance. Because we know from Chapter 1, from Lesson 1, that if we do it that way, we have this magic box called machine learning that -- particularly combined with neural nets -- should be able to solve any problem, in theory, if you can at least find the right set of weights. So we need something that can get better and better -- that can learn. So let's think about a function which has parameters. Instead of finding an ideal image and testing how far away something is from that ideal image, what we could do instead is come up with a weight for each pixel. So we're trying to find out if something is the number three, and we know that in the places where you would expect to find '3' pixels, you could give those places high weights. So you can say, "Hey, if there's a dot in those places, we give it a high score, and if there are dots in other places we'll give it a low score." So we can come up with a function where the probability of something being -- in this case let's say an eight -- is equal to the pixels in the image multiplied by some set of weights, summed up. So then anywhere where the image we're looking at has pixels where there are high weights, it's going to end up with a high probability. Here x is the image that we're interested in, and we're just going to represent it as a vector: let's just have all the rows stacked up, end to end, into a single long line. So we're going to use an approach where we start with a vector W (a vector is a rank 1 tensor) that contains random weights -- random parameters, depending on whether you use the Arthur Samuel version of the terminology or not. We'll then predict whether a number appears to be a three or a seven by using this tiny little function, and then we will figure out how good the model is.
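That pixel-weighting idea, written out as a hedged sketch: here the example class is an eight, as in the book's formula, and the flattening with reshape and the random starting weights are just for illustration:

```python
import torch

def pr_eight(x, w):
    # x: one image flattened into a rank-1 tensor of 28*28 = 784 pixels
    # w: one weight per pixel; big weights where "ink" makes the class more likely
    return (x * w).sum()

w = torch.randn(28*28)                            # random starting weights
score = pr_eight(stacked_threes[0].reshape(28*28), w)
```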
We will calculate how accurate it is, or something like that -- this is the loss -- and then the key step is that we're going to calculate the gradient. Now the gradient is something that measures, for each weight: if I made it a little bit bigger, would the loss get better or worse? If I made it a little bit smaller, would the loss get better or worse? And so if we do that for every weight, we can decide for every weight whether we should make that weight a bit bigger or a bit smaller. That's called the gradient. So once we have the gradient, we then step -- step is the word we use. We change all the weights: up a little bit for the ones where the gradient said we should make them a bit higher, and down a little bit for the ones where the gradient said they should be a bit lower. So now it should be a tiny bit better, and then we go back to step two: calculate a new set of predictions using this formula, calculate the gradient again, step the weights, and keep doing that. So this is basically the flow chart, and at some point, when we're sick of waiting or when the loss gets good enough, we'll stop. These seven steps -- 1, 2, 3, 4, 5, 6, 7 -- are the key to training all deep learning models. This technique is called stochastic gradient descent. Well, it's called gradient descent; we'll see the stochastic bit very soon. And for each of these seven steps there are lots of choices around exactly how to do it. We've hand-waved a lot: what kind of random initialization, how do you calculate the gradient, exactly what step do you take based on the gradient, how do you decide when to stop, and so on. In this course we're going to be learning about these steps -- that's one big part. The other big part is: what's the actual function, the neural network? So, how do we train the thing, and what is the thing that we train? So: we initialize parameters with random values. We need some function that's going to be the loss function, which returns a number that's small if the performance of the model is good. We need some way to figure out whether each weight should be increased a bit or decreased a bit. And then we need to decide when to stop, which we'll just say is: do a certain number of epochs. So let's go even simpler. We're not even going to do MNIST; we're going to start with the function x squared. In fastai we've created a tiny little thing called plot_function that plots a function. All right, so there's our function f, and we're going to pretend this is our loss function, and we're going to try to find the bottom point -- we're going to try to figure out what is the x value which is at the bottom. So our seven-step procedure requires us to start out by initializing, so we need to pick some value. The value we pick is, say: 'oh, let's just randomly pick minus one and a half.' Great! So now we need to know: if I increase x a bit, does my loss (remember, this is my loss, and better is smaller) get a bit better or a bit worse? We can do that easily enough: we can just try a slightly higher x and a slightly lower x and see what happens. And you can see it's just the slope. The slope at this point tells you that if I increase x by a bit, then my loss will decrease, because that is the slope at this point.
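A hedged sketch of that toy loss function and starting point. plot_function is a small fastai helper; here it's approximated directly with matplotlib, and the plotting range is just a guess:

```python
import torch
import matplotlib.pyplot as plt

def f(x):
    return x**2

xs = torch.linspace(-2, 2, 100)
plt.plot(xs.numpy(), f(xs).numpy())      # the "loss" curve
plt.scatter(-1.5, (-1.5)**2, c='red')    # our randomly picked starting value
plt.xlabel('x'); plt.ylabel('loss')
plt.show()
```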
So if we change our weight, our parameter, just a little bit in the direction of the slope -- so here is the direction of the slope, and here's the new value at that point -- and then do it again, and then do it again, eventually we'll get to the bottom of this curve. This idea goes all the way back to Isaac Newton, at the very least, and this basic idea is called Newton's method. So a key thing we need to be able to do is to calculate this slope. And the bad news is, to do that we need calculus -- at least that's bad news for me, because I've never been a fan of calculus. We have to calculate the derivative. Here's the good news, though: maybe you spent ages in school learning how to calculate derivatives -- you don't have to anymore. The computer does it for you, and the computer does it fast. It uses all of those methods that you learned at school and a whole lot more, like clever tricks for speeding them up, and it just does it all automatically. So, for example, it knows (I don't know if you remember this from high school) that the derivative of x squared is 2x. It's just something it knows; it's part of its bag of tricks. PyTorch knows that. PyTorch has an engine built in that can take derivatives and find the gradients of functions. So to do that, we start with a tensor, let's say, and we modify this tensor with this special method called requires_grad_. What this does is it tells PyTorch that any time I do a calculation with this xt, it should remember what calculation it does, so that I can take the derivative later. You see the underscore at the end? An underscore at the end of a method in PyTorch means that it's an in-place operation: it actually modifies the tensor. So requires_grad_ modifies this tensor to tell PyTorch that we want to be calculating gradients on it. That means it's going to have to keep track of all of the computations we do, so that it can calculate the derivative later. Okay, so we've got the number 3, and let's say we then call f on it (remember f is just squaring it), so 3 squared is 9. But the value is not just 9; it's 9 accompanied by a grad function, which knows that a power operation has been taken. So we can now call a special method, backward(). backward() refers to backpropagation, which we'll learn about; it basically means "take the derivative". And once it's done that, we can look inside xt -- which we said requires grad -- and find out its gradient. And remember, the derivative of x squared is 2x; in this case x was 3, and 2 times 3 is 6. So we didn't have to figure out the derivative -- we just call backward() and then look at the grad attribute to get the derivative. That's how easy it is to do calculus in PyTorch. So what you need to know about calculus is not how to take a derivative, but what it means. And what it means is: it's the slope at some point. Now here's something interesting: let's not just take 3, let's take a rank 1 tensor -- also known as a vector -- [3., 4., 10.], and let's add sum() to our f function, so it's going to be x squared then .sum(). Now we can take f of this vector and get back 125. And then we can call backward() and look at grad, and look: 2x, 2x, 2x -- that is, 6, 8, 20. So this is vector calculus: we're getting the gradient for every element of a vector with the same two lines of code. So that's kind of all you need to know about calculus.
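A hedged sketch of that autograd example, showing both the scalar and the vector version:

```python
import torch

# scalar version
def f(x): return x**2

xt = torch.tensor(3.).requires_grad_()  # track any computation done with xt
yt = f(xt)                              # tensor(9., grad_fn=<PowBackward0>)
yt.backward()                           # backpropagation: compute the derivative
print(xt.grad)                          # tensor(6.) -- i.e. 2x at x = 3

# vector version: add .sum() so backward() has a single number to differentiate
def f(x): return (x**2).sum()

xt = torch.tensor([3., 4., 10.]).requires_grad_()
f(xt).backward()                        # f(xt) is tensor(125.)
print(xt.grad)                          # tensor([ 6.,  8., 20.]) -- 2x for each element
```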
And if this idea that a derivative or gradient is a slope is unfamiliar, check out Khan Academy -- they have some great introductory calculus material. And don't forget, you can skip all the bits where they teach you how to calculate the gradients yourself. So, now that we know how to calculate the gradient -- that is, the slope of the function -- it tells us: if we change our input a little bit, how will our output change correspondingly? That's what a slope is. And so that tells us, for every one of our parameters, if we know their gradients, then we know whether changing that parameter up a bit or down a bit will make our loss better or worse. So therefore we know how to change our parameters. So what we do is, let's say all of our weights are called w: we just subtract from them the gradients multiplied by some small number. That small number is often a number between about 0.001 and 0.1, and it's called the learning rate, and this here is the essence of gradient descent. If you pick a learning rate that's very small, then you take the slope and you take a really small step in that direction, and another small step, and another small step, and so on -- it's going to take forever to get to the end. If you pick a learning rate that's too big, you jump way too far each time. In fact, in this case (we're assuming we're starting here) it's so big that the loss gets worse and worse. Or here's one where it's not so big that it gets worse and worse, but it takes a long time because it just bounces in and out. So picking a good learning rate is really important, both to making sure that it's even possible to solve the problem and that it's possible to solve it in a reasonable amount of time. We'll be learning how to pick learning rates in this course. So let's try this -- let's try using gradient descent (I said SGD; that's not quite accurate, it's just going to be gradient descent) to solve an actual problem. The problem we're going to solve is: imagine you were watching a roller coaster go over the top of a hump. As it comes out of the previous hill it's going super fast, and it's going up the hill, and it's going slower and slower and slower until it gets to the top of the hump, and then it goes down the other side and gets faster and faster. So if you had a stopwatch or some kind of speedometer and you were measuring its speed by hand at roughly equal time points, you might end up with something that looks a bit like this. The way I did this was I just grabbed a range -- the numbers from naught up to, but not including, 20. These are the time points at which I'm taking my speed measurement. And then I've got some quadratic function here: I take my time, subtract 9.5, square it, multiply by 0.75, and add 1. And then I add a random number to every observation. So I end up with a quadratic function which is a bit bumpy, and that's kind of like what it might look like in real life, because my speedometer testing is not perfect. All right, so we want to create a function that estimates, at any time, what the speed of the roller coaster is. And we start by guessing what function it might be.
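A hedged sketch of that noisy quadratic speed data; the random noise means your plot will wiggle differently:

```python
import torch
import matplotlib.pyplot as plt

time = torch.arange(0, 20).float()                    # 0 up to but not including 20
speed = torch.randn(20)*3 + 0.75*(time - 9.5)**2 + 1  # noisy quadratic, like the lesson's

plt.scatter(time, speed)
plt.xlabel('time'); plt.ylabel('speed')
plt.show()
```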
So we guess that it's a function a times time squared, plus b times time, plus c -- which you might remember from school is called a quadratic. So let's create that function, and let's create it using the Arthur Samuel technique, the machine learning technique. This function is going to take two things: an input, which in this case is a time, and some parameters. And the parameters are a, b, and c. In Python you can split out a list or a collection into its components, like so, and then here's that function. So we're not trying to find any function in the world; we're just trying to find some function which is a quadratic, by finding an a, a b, and a c. The Arthur Samuel technique for doing this is to next come up with a loss function -- a measurement of how good we are. So if we've got some predictions that come out of our function, and the targets, which are these actual values, then we could just use the mean squared error. So here's that mean squared error we saw before: the difference, squared, then take the mean. So now we need to go through our seven-step process. We want to come up with a set of three parameters, a, b, and c, which are as good as possible. Step one is to initialize a, b, and c to random values. This is how you get random values, three of them, in PyTorch. And remember, we're going to be adjusting them, so we have to tell PyTorch that we want the gradients. I'm just going to save those away so I can check them later. Then I calculate the predictions using that function f, which was this. And then let's create a little function which plots how good our predictions are at this point. So here is a function that plots our predictions in red and our targets in blue -- and that looks pretty terrible. So let's calculate the loss, using the mse function we wrote. Okay, so now we want to improve this. So calculate the gradients using the two steps we saw: call backward and then get grad. And this says that each of our parameters has a gradient that's negative. Let's pick a learning rate of ten to the minus five, so we multiply the gradient by ten to the minus five and step the weights. And remember, stepping the weights means minus-equals the learning rate times the gradient. There's a wonderful trick here, which is that I've used .data. The reason is that .data is a special attribute in PyTorch which, if you use it, means the gradient is not calculated for that operation. And we certainly wouldn't want the gradient to be calculated for the step itself; we only want the gradient to be calculated for our function f. So when we step the weights, we use this special .data attribute. After we do that, we delete the gradients that we already had, and let's see if the loss improved. The loss before was 25,800; now it's 5,400. And the plot has gone from something that goes down to -300 to something that looks much better. So let's do that a few times. I just grabbed those previous lines of code and pasted them all into a single cell: preds, loss, loss.backward(), step the .data, set the gradients back to none, and from time to time print the loss out -- and repeat that ten times. And look: it's getting better and better. And we can actually watch it getting better and better. So this is pretty cool. We have a technique -- this is the Arthur Samuel technique -- for finding a set of parameters that continuously improves by getting feedback from the result of measuring some loss function.
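Putting those pieces together, here's a hedged sketch of the whole loop, following the notebook's structure; `time` and `speed` come from the sketch above, and the learning rate and number of iterations match the lesson:

```python
import torch

def f(t, params):
    a, b, c = params              # unpack the three parameters
    return a*(t**2) + b*t + c

def mse(preds, targets):
    return ((preds - targets)**2).mean()

params = torch.randn(3).requires_grad_()   # 1. initialize the parameters randomly
lr = 1e-5                                  # learning rate

def apply_step(params):
    preds = f(time, params)                # 2. predict
    loss = mse(preds, speed)               # 3. measure the loss
    loss.backward()                        # 4. calculate the gradients
    params.data -= lr * params.grad.data   # 5. step the weights (.data avoids tracking the step)
    params.grad = None                     # clear the gradients for the next iteration
    return loss.item()

for i in range(10):                        # 6. repeat...
    print(apply_step(params))              # ...and watch the loss go down
```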
So that was the key step: this is the gradient descent method. You should make sure that you go back and feel super comfortable with what's happened. And if you're not feeling comfortable, that's fine. If it's been a while, or if you've never done this kind of gradient descent before, this might feel super unfamiliar. So try to find the first cell in this notebook where you don't fully understand what it's doing, and then stop and figure it out. Look at everything that's going on, do some experiments, do some reading, until you understand that cell where you're stuck, before you move forwards. So let's now apply this to MNIST. For MNIST we want to use this exact technique, and there's basically nothing extra we have to do -- except one thing: we need a loss function. Now, the metric that we've been using is the error rate, or the accuracy: how often are we correct. And that's the thing that we're actually trying to make good -- our metric. But we've got a very serious problem, which is: remember, we need to calculate the gradient to figure out how we should change our parameters, and the gradient is the slope, or the steepness, which you might remember from school is defined as rise over run -- it's (y_new - y_old) divided by (x_new - x_old). And the gradient is actually defined when x_new is very, very close to x_old, meaning their difference is very small. Think about that for accuracy: if I change a parameter by a tiny, tiny amount, the accuracy might not change at all, because there might not be any 3 that we now predict as a 7, or any 7 that we now predict as a 3, because we changed the parameter by such a small amount. So it's possible -- in fact, it's certain -- that the gradient is zero in many places, and that means our parameters aren't going to change at all, because the learning rate times the gradient is still zero when the gradient is zero, for any learning rate. So this is why the loss function and the metric are not always the same thing: we can't use a metric as our loss if that metric has a gradient of zero. So we need something different. We want to find something that is pretty similar to the accuracy, in the sense that as the accuracy gets better, this function gets better as well, but it should not have a gradient of zero. So let's think about that function. Suppose we had three images... Actually, you know what? This is probably a good time to stop, because we've got to the point here where we understand gradient descent, and we know how to do it with a simple loss function. I actually think that before we start looking at the MNIST loss function, we shouldn't move on, because we've got so many assignments to do for this week already: build your web application, and also step through this notebook to make sure you fully understand it. So I think we should probably stop right here before we make things too crazy. Before I do -- Rachel, are there any questions? Okay, great. All right, well, thanks everybody. Sorry for the last-minute change of tack there, but I think this is going to make sense. I hope you have a lot of fun with your web applications. Try to think of something that's really fun, really interesting -- it doesn't have to be important; it could just be some cute thing.
We've had students before -- one student, I think he said he had 16 different cousins, and he created something that would classify a photo based on which of his cousins it was... it was for his fiancée meeting his family. [laughs] You can come up with anything you like, but show off your application, and maybe have a look around at what ipywidgets can do, and try to come up with something that you think is pretty cool. All right, thanks everybody. I will see you next week!
Info
Channel: Jeremy Howard
Views: 69,467
Rating: 4.9317827 out of 5
Keywords: deep learning, fastai
Id: 5L3Ao5KuCC4
Length: 126min 22sec (7582 seconds)
Published: Fri Aug 21 2020